Technical Report
Due: 6/24/2008
CPU Design Report
Project Name:
MIPS CPU Design
Author:
Lin Yin
TA:
Li Tingtao
School of Software, Shanghai Jiao Tong University
ABSTRACT
This report gives a brief description of the design and VHDL implementation of a
MIPS CPU with pipeline and cache functionalities.
KEYWORDS
MIPS, CPU, Pipeline, Cache, Logic, VHDL, Digital Circuit
School of Software, Shanghai Jiao Tong University
Address: Dongchuan Road No. 800
Post Code: 200240
Contents
1. Executive Overview
2. Processor and Cache Design
   2.1 Processor Design
       2.1.1 Architectural Overview
       2.1.2 Hazards Solving
       2.1.3 The Controller
       2.1.4 Detailed Design for Each Phase
   2.2 Cache Design
       2.2.1 Overview
       2.2.2 Design Choices
       2.2.3 State Machines
3. Processor Debugging
   3.1 Debug Overview
   3.2 Debugging Synchronous Problems
   3.3 Debugging the Cache
4. Results
   4.1 Testing Code
   4.2 Single Cycle MIPS
   4.3 Pipeline MIPS
   4.4 Pipeline MIPS with Cache
5. Conclusions
6. Developing Environments
   6.1 Hardware
   6.2 Software
7. Appendices
   7.1 ISA
   7.2 Testing Codes
8. References
1. Executive Overview
In this CPU design practice, I've implemented a 5-stage pipelined MIPS CPU with cache functionality.
The practice is intended to deepen our understanding of computer architecture through hands-on CPU
logic design. My design supports 15 instructions: 7 R-type, 7 I-type, and 1 J-type (see appendix 7.1).
It successfully realizes all the basic ideas of a simple MIPS CPU, including instruction execution,
pipelining, and caching. However, due to time limits, the bus arbitrator is not realized, so separate
memories are required for instructions and data. From the practice, I've learned how to make a
single-cycle CPU run, how to turn a single-cycle CPU into a pipelined one by solving hazards, and how
to design a cache using a finite state machine model. As a by-product, I also learned the VHDL
language and how to solve synchronization problems. Above all, the practice achieved its goal: it
gave me a better understanding of the architecture and organization of a computer.
The rest of this document first gives the details of the processor and cache design, followed by a
description of the problems and solutions encountered during the processor debugging phase. Then the
results of the practice are presented. Finally, conclusions are drawn and the development
environments are listed.
2. Processor and Cache Design
2.1 Processor Design
2.1.1 Architectural Overview
Figure 1 Top-level Circuit
As shown in figure 1, the MIPS CPU is divided into 5 phases according to the pipeline stages: IF
(Instruction Fetch), ID (Instruction Decoding), EXE (Execution), MEM (Memory Access), and WB (Write
Back). Whenever a phase has finished its task for one instruction, it moves on to the next
instruction. Therefore, a set of backup registers is required to save the results of each phase. The
long bars in figure 1 denote the registers used for backup.
2.1.2 Hazards Solving
Like any pipelined CPU, the design of the pipelined MIPS involves three types of hazards: structural
hazards, data hazards, and control hazards.
Structural hazards happen when instructions compete for the same hardware resource. For example, both
the IF phase and the MEM phase call for memory accesses, so they compete for the memory and for the
data and address buses. In my design, this conflict is solved by using a separate instruction cache
and data cache, and a separate instruction memory and data memory. However, the latter separation is
unnecessary if an arbitrator is available: whenever a miss happens in one of the caches, the
arbitrator freezes the memory so that the other cache cannot access it until the current memory
access finishes. The strategy of using separate caches to solve structural hazards can be generalized
as adding hardware resources. Another general solution to this kind of hazard is adding bubbles,
where the CPU stops instruction fetching for one cycle to allow the competing instructions to gain
resource access sequentially. This approach is not adopted in my design because it delays instruction
execution and is not efficient enough.
Data hazards happen when the execution of an instruction needs the result of an instruction one or
several instructions ahead of it. Normally, an instruction puts its result in a register or in memory
so that the following instructions can fetch it. However, if each instruction waited until the final
results of previous instructions were ready, unnecessary delays would result and the pipeline would
stall. Thus, whenever such a data dependence occurs, we should deliver the partial result of the
previous instruction to the waiting instruction as soon as possible. This technique is called
forwarding. In MIPS, partial results have the following three sources:
 ID Phase – the result of this phase may come from the register file or the instruction word.
 EXE Phase – the result of this phase may come from the ALU.
 MEM Phase – the result of this phase may come from memory (cache).
Since my design does not support multiplication instructions, it has fewer data hazard sources and
thus fewer cases need to be considered. The following list gives the ideas for solving the data
hazards in my design.
 When an instruction In needs to use the result of In-3: since In-3 has already reached its WB
phase and written its result to the register file at the negative clock edge of the current cycle, no
data hazard will happen.
 When an instruction In needs to use the result of In-2: since In-2 has already reached its MEM
phase and the result has already been calculated at the negative clock edge of the current cycle,
forwarding is possible. The only problem is that the calculated result has not been written into the
register file yet, so we forward the data from the backup register of the MEM phase.
 When an instruction In needs to use the result of In-1, two cases must be considered. If the
result of In-1 comes from the ID or EXE phase, the result has already been calculated and forwarding
is possible. However, if the result of In-1 comes from the MEM phase, it is not ready yet. The only
solution is to insert a bubble to delay instruction fetch for one cycle and allow In-1 to get its
result from the memory.
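The rules above can be condensed into one decision function. The following Python sketch is an
illustration only, not the VHDL source; the function and parameter names are hypothetical, and the
forwarding codes follow Table 2-5 (00 = no forwarding, 01 = forward from the ALU, 10 = forward from
memory).

```python
def resolve_hazard(src_reg, exe_dest, exe_uses_mem, mem_dest):
    """Decide forwarding/stall for one source register of the instruction in ID.

    src_reg      -- register number this instruction reads
    exe_dest     -- destination register of the instruction in EXE (In-1)
    exe_uses_mem -- True if In-1's result comes from memory (a load)
    mem_dest     -- destination register of the instruction in MEM (In-2)
    Returns (fwd, bubble): a 2-bit forwarding code and a stall flag.
    """
    if src_reg != 0 and src_reg == exe_dest:
        if exe_uses_mem:
            return 0b00, True        # load-use hazard: insert a bubble
        return 0b01, False           # forward the ALU result
    if src_reg != 0 and src_reg == mem_dest:
        return 0b10, False           # forward from the MEM backup register
    return 0b00, False               # In-3 already wrote back: no hazard

# Example: a load into $4 sits in EXE while the next instruction reads $4.
assert resolve_hazard(4, 4, True, 0) == (0b00, True)    # must stall
assert resolve_hazard(4, 4, False, 0) == (0b01, False)  # ALU result forwarded
```

Register $0 is excluded because it is hardwired to zero and never creates a dependence.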
Figure 2 Solution to Data Hazards by Forwarding
Figure 3 Solution to Data Hazards by Adding Bubbles
The last kind of hazard, the control hazard, happens when branches are encountered. We face a
dilemma: execute the following instruction, or take the branch? Such a decision cannot be made until
we know whether the branch condition is true. Generally, three solutions are proposed for this hazard
(see reference [1]). In my design, the third one (the non-delay solution) is adopted because of its
efficiency. The branch condition is judged in the ID phase. Also, a delay slot is appended to each
branch instruction to make fuller use of resources. Because of the delay slot, the instruction that
directly follows a branch instruction is guaranteed to execute, whether the branch is taken or not.
The execution of a series of instructions involving a branch is illustrated in figure 4.
Cycle          1    2    3    4    5    6    7    8
Branch         IF   ID   EXE  MEM  WB
Delay Slot          IF   ID   EXE  MEM  WB
Suc. Inst. 1             IF   ID   EXE  MEM  WB
Suc. Inst. 2                  IF   ID   EXE  MEM  WB

Figure 4 Solution to Control Hazards
The delay slot is always executed after the branch, but successor instructions 1 and 2 are executed
only when the branch is not taken.
2.1.3 The Controller
Figure 5 The PCU Module
(PCU ports include instr[31..0], rseqz, rsltz, writepc, writeir, jump, branch, fwda[1..0],
fwdb[1..0], rs[4..0], rt[4..0], rd[4..0], aludes[6..0], memdes[6..0], and controlw[31..0].)
The task of the controller is to decode MIPS instructions and generate control signals for every
design element in the CPU. Generating the control signals is the core issue in CPU design: it
coordinates every part of the processor to work in phase with the others. The solutions to the hazard
problems are also implemented in the controller. In my design, all controller functionality, along
with decoding, is encapsulated in a PCU (Processor Control Unit).
The type of an instruction is recognized by decoding its opcode; different kinds of instructions have
different opcodes. For example, R-type instructions have an all-zero opcode in my design. For R-type
instructions, the type of calculation is recognized by decoding the func-code. The opcodes and
func-codes of each instruction are given in appendix 7.1.
Once the concrete operation of an instruction is recognized, the PCU can generate the control word.
The control word contains every piece of information needed to define the behavior of the processor
in the current cycle, and provides enough information for generating the control word of the next
cycle. Typically, this includes: the earliest phase at which the final result of the current
instruction will be available, whether the result will be written to the register file or to memory,
the data sources for port A and port B of the ALU, the func-code of the ALU, etc. In practice, the
control word is stored in every backup register to enable each phase to work correctly. The layout of
the control word is given in table 2-1.
Table 2-1 Control Word

Bits    Name         Usage
0-4     ALUCONTROL   Func-code of the ALU
5       ALUSRCA      The data source of ALU port A
6       WRITEMEM     Controls the write enable port of the memory
7       MEMTOREG     Whether the register file should write the data from memory
8       WRITEREG     Whether the result will be written to the register file in the WB phase
9-13    ResltDes     The register to be written in the WB phase
14      ALURESOK     Whether the EXE phase will use the ALU
15      MEMRESOK     Whether the MEM phase will access memory
16-17   ALUSRCB      The data source of ALU port B
18-31   (Reserved)
Table 2-2 Other Control Signals

Name     # of Bits  Comments
fwda     2          Controls the forwarding to ALU port A
fwdb     2          Controls the forwarding to ALU port B
writepc  1          Controls the write enable port of the PC register
writeir  1          Controls the write enable port of the instruction register (IR)
branch   1          Whether the current instruction is a branch instruction
jump     1          Whether the current instruction is a jump instruction
rs       5          Number of the source register
rt       5          Number of the target register
rd       5          Number of the destination register
Table 2-3 Data Source for ALU Port A

ALUSRCA  Data Source
0        RS in the register file
1        INST[10..6] in the instruction code

Table 2-4 Data Source for ALU Port B

ALUSRCB  Data Source
0        RT in the register file
1        INST[15..0] in the instruction code

Table 2-5 Forwarding Source

FWDA/FWDB  Forwarding Source
00         No forwarding at all
01         Forward data from the ALU
10         Forward data from memory
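The bit layout of Table 2-1 can be made concrete by packing the fields into one 32-bit word. The
Python helper below is a hypothetical sketch (the field names mirror the table, but the function
itself is not part of the design).

```python
def pack_control_word(alucontrol, alusrca, writemem, memtoreg,
                      writereg, resltdes, aluresok, memresok, alusrcb):
    """Pack the control-word fields of Table 2-1 into a 32-bit integer."""
    w = (alucontrol & 0x1F)          # bits 0-4:  ALU func-code
    w |= (alusrca & 1) << 5          # bit 5:     ALU port A source
    w |= (writemem & 1) << 6         # bit 6:     memory write enable
    w |= (memtoreg & 1) << 7         # bit 7:     write memory data to reg file
    w |= (writereg & 1) << 8         # bit 8:     reg-file write enable (WB)
    w |= (resltdes & 0x1F) << 9      # bits 9-13: destination register
    w |= (aluresok & 1) << 14        # bit 14:    EXE phase uses the ALU
    w |= (memresok & 1) << 15        # bit 15:    MEM phase accesses memory
    w |= (alusrcb & 0x3) << 16       # bits 16-17: ALU port B source
    return w                         # bits 18-31 reserved

cw = pack_control_word(0b00010, 0, 0, 0, 1, 3, 1, 0, 0)
assert (cw >> 9) & 0x1F == 3         # ResltDes field recovered intact
assert (cw >> 8) & 1 == 1           # WRITEREG is set
```

Later phases would unpack the same fields with the matching shifts and masks, which is why the layout
must stay fixed across all backup registers.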
2.1.4 Detailed Design for Each Phase
2.1.4.1 IF Phase
Figure 6 IF Phase
The IF phase performs the following operations:
(1) Calculate the next instruction address for the program counter register (PC).
(2) Read the instruction from the memory.
The next PC is computed in the following way:
(1) In plain conditions, simply add 4 to the current PC.
(2) For a branch instruction, shift the offset INST[15..0] two bits left and add it to the current
instruction address. Note that "current instruction" means the instruction at IF execution, that is,
the instruction in the delay slot. The offset INST[15..0] was saved during the ID phase of the
branch; it is part of the branch instruction itself.
(3) For a jump instruction, shift the offset INST[25..0] two bits left and add it to the current
instruction address. Again, "current instruction" means the instruction in the delay slot. The offset
INST[25..0] was saved during the ID phase of the jump; it is part of the jump instruction itself.
Instruction fetch retrieves the instruction at the address saved in the PC register. Since no MMU is
provided in my simple MIPS CPU, the addresses are sent directly to the instruction memory (cache).
The fetched instruction is saved in the instruction register.
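The three next-PC rules above can be sketched as follows. This is an illustrative Python model of the
conventions described in this design (the offset is added to the address of the delay-slot
instruction); the function name is hypothetical and 32-bit wrap-around is ignored.

```python
def next_pc(pc, branch_taken=False, jump=False, offset=0):
    """Compute the next instruction address.

    pc     -- byte address of the instruction currently in IF (the delay slot)
    offset -- sign-extended INST[15..0] (branch) or INST[25..0] (jump), in words
    """
    if branch_taken or jump:
        return pc + (offset << 2)   # shift left 2: word offset -> byte offset
    return pc + 4                    # plain sequential execution

assert next_pc(0x20) == 0x24                            # fall through
assert next_pc(0x20, branch_taken=True, offset=3) == 0x2C   # forward branch
assert next_pc(0x20, branch_taken=True, offset=-1) == 0x1C  # backward branch
```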
2.1.4.2 ID Phase
Figure 7 ID Phase
The ID phase is responsible for the following tasks:
(1) Instruction decoding.
(2) Handling data hazards.
(3) Judging whether the condition of a branch instruction is met.
(4) Fetching the operands from the register file.
With the mechanism described in section 2.1.3, the decoding work is straightforward: simple AND/OR
operations are enough to realize the decoder. As mentioned in section 2.1.2, data hazards are solved
by establishing extra data paths and forwarding the data of previous instructions as soon as
possible. The forwarding strategy is given in table 2-6.
Table 2-6 Forwarding Strategy

Data Source              Condition                               Strategy
EXE Phase                Result from EXE Phase (ALUDES[5]):
(1 instruction ahead)      rs source AND (RSI==ALUDES)           fwda=01B
                           rt source AND (RTI==ALUDES)           fwdb=01B
                         Result from MEM Phase (ALUDES[6]):
                           rs source AND (RSI==ALUDES)           bubble
                           rt source AND (RTI==ALUDES)           bubble
MEM Phase                Result from EXE Phase (MEMDES[5]):
(2 instructions ahead)     rs source AND (RSI==MEMDES)           fwda=10B
                           rt source AND (RTI==MEMDES)           fwdb=10B
                         Result from MEM Phase (MEMDES[6]):
                           rs source AND (RSI==MEMDES)           fwda=10B
                           rt source AND (RTI==MEMDES)           fwdb=10B
2.1.4.3 EXE Phase
Figure 8 EXE Phase
The EXE Phase follows the ID Phase. The main tasks for EXE Phase are:
(1) Use ALU to realize arithmetic/logic operations
(2) Save the result of calculation to the backup register
(3) Give various control signals
Before starting the calculation, we must first select the operands. The input data of the ALU come
either from the immediate field of the instruction code or from RS/RT in the register file. This
choice is decided by the signals ALUSRCA/ALUSRCB given by the PCU.
2.1.4.4 MEM Phase
Figure 9 MEM Phase
The MEM Phase has the following main tasks:
(1) Give various memory access control signals and finish the memory access task.
(2) Save the data read from memory to backup register so they can be used by WB Phase.
The following signals are needed to access memory.
(1) DATAO[31..0]: Output data of CPU
(2) DADDR[31..0]: Address for accessing the memory.
(3) WRITEMEM: Control the write enable of the memory.
2.1.4.5 WB Phase
Figure 10 WB Phase
The only task for the WB Phase is to write the result of calculation to the register file. To achieve this,
the following signals are needed.
(1) RESULT[31..0]: The final result of the execution of an instruction
(2) CONTROLW_MEM[13..9] (ResltDes in PCU signals): The index of the register to be written.
(3) CONTROLW_MEM[8] (WRITEREG in PCU signals): Control the write enable of the register
file.
2.2 Cache Design
2.2.1 Overview
Figure 11 Cache Overview
(The CPU core accesses the i-cache and d-cache through the cache controller, which in turn accesses
the main memory.)
As shown in figure 11, the caches lie in the CPU core and are coordinated by a cache controller. All
addresses sent by the CPU go to the cache controller first instead of directly to the main memory.
The cache controller decides whether the needed data are in the cache. If they are, no memory access
is needed: the data are fetched directly from the cache and sent back to the CPU. If not, the
controller fetches several consecutive words from the main memory to fill the corresponding line in
the cache. Since cache access is much faster than memory access, performance may increase
significantly once the data hit the cache. By using a cache, we rely on the temporal and spatial
locality of data access; that is, we assume that a datum used now will soon be used again, and that
accessing a datum suggests accesses to adjacent data in the near future. Therefore, instead of
fetching only the data we currently need, we also fetch the adjacent blocks, since sequential access
to memory is much faster than random access.
Since my design does not include an MMU module, the cache receives physical addresses from the CPU
instead of virtual addresses. Lock functionality is not supported. Two state machines work in
coordination in the cache controller to schedule data reads and writes.
2.2.2 Design Choices
For simplicity, my design adopts a direct mapped cache. The instruction cache (i-cache) and data
cache (d-cache) are separated in order to avoid structural hazards. The size of each cache is 2KB:
128 lines, each line holding 4 words, and each word 4 bytes in length. Therefore, signals
DADDR[10..2] are used for addressing the cache: DADDR[10..4] index the line and DADDR[3..2] select
the word. Each line also has 21 tag bits, 1 valid bit, and 1 dirty bit. The tag bits record the high
21 bits of the physical address so that the cache controller can decide whether a read/write hits or
misses the cache. The valid bit distinguishes whether a line in the cache is valid; all valid bits
are cleared when the machine restarts. The dirty bit records whether the line is dirty, that is,
whether a block in the line has been overwritten without writing the data back to the memory.
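The address split described above can be expressed directly: DADDR[3..2] selects the word, DADDR[10..4]
the line, and the high 21 bits form the tag. The following Python helper is a hypothetical sketch
that makes this geometry concrete.

```python
def split_address(addr):
    """Decompose a 32-bit byte address for a direct mapped cache of
    128 lines x 4 words x 4 bytes (2KB total)."""
    word = (addr >> 2) & 0x3    # DADDR[3..2]:  word within the line
    index = (addr >> 4) & 0x7F  # DADDR[10..4]: line number (0..127)
    tag = addr >> 11            # high 21 bits, stored as the tag
    return tag, index, word

tag, index, word = split_address(0x1234)
assert (tag, index, word) == (2, 0x23, 1)
```

Two addresses that differ only in their tag map to the same line, which is exactly the situation that
forces a replacement in a direct mapped cache.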
Choices involved in any cache design are also made in this practice. The strategies I adopted are
listed below:
(1) Block (re)placement & block identification. Since the cache is direct mapped, there is no freedom
in these two strategies: the line to be (re)placed is determined by the low 11 bits of the address,
and a block is identified by using the low 11 bits for addressing and the high 21 bits plus the valid
bit for verification.
(2) Write strategy. For efficiency, I adopt the write-back strategy: information is written only to
the block in the cache, and the modified cache block is written to main memory only when it is
replaced.
(3) Allocation strategy. I assume that data written to memory will soon be read, and therefore adopt
the read-write-allocate strategy: upon a miss, the cache controller allocates a cache line for either
a read or a write. Any load or store that misses the cache allocates a cache line.
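The interaction of these three choices can be illustrated with a toy model. The Python class below is
a deliberately scaled-down, hypothetical simulation (4 lines of 4 words, a plain dict as main
memory), not the VHDL cache; it shows direct mapping, write back on eviction, and read-write
allocation together.

```python
class ToyCache:
    """Toy direct mapped, write-back, read-write-allocate cache."""
    LINES, WORDS = 4, 4                      # scaled-down geometry

    def __init__(self, memory):
        self.mem = memory                    # backing store (word-addressed dict)
        self.lines = [None] * self.LINES     # each entry: [tag, dirty, data]

    def _lookup(self, addr):
        word = addr % self.WORDS
        index = (addr // self.WORDS) % self.LINES
        tag = addr // (self.WORDS * self.LINES)
        line = self.lines[index]
        if line is None or line[0] != tag:   # miss: allocate on read AND write
            if line is not None and line[1]: # dirty victim: write it back first
                base = (line[0] * self.LINES + index) * self.WORDS
                for i, v in enumerate(line[2]):
                    self.mem[base + i] = v
            base = addr - word               # fill the whole line from memory
            line = [tag, False,
                    [self.mem.get(base + i, 0) for i in range(self.WORDS)]]
            self.lines[index] = line
        return line, word

    def read(self, addr):
        line, word = self._lookup(addr)
        return line[2][word]

    def write(self, addr, value):
        line, word = self._lookup(addr)      # write-allocate
        line[2][word] = value
        line[1] = True                       # dirty: memory updated on eviction

mem = {i: i * 10 for i in range(64)}
c = ToyCache(mem)
c.write(3, 99)                # allocates line 0 and marks it dirty
assert mem[3] == 30           # write back: memory not updated yet
assert c.read(3) == 99
c.read(3 + 16)                # same index, different tag: eviction
assert mem[3] == 99           # dirty line written back on replacement
```

The final assertion is the write-back behavior in miniature: main memory only sees the modified value
when the dirty line is replaced.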
Since memory clock is much slower than the CPU clock, we also have choices about the mechanism to
let the cache controller know when a memory read/write has been completed. We can do this by either sent
both memory clock and CPU clock to the cache controller to let it arbitrate or add a memready signal
between the cache controller and memory so that the memory can tell when the data are ready. In my design,
I choose the former.
2.2.3 State Machines
2.2.3.1 Cache State Switching
Figure 12 and figure 13 demonstrate state switching of instruction cache and data cache, respectively.
Figure 12 State Switching for Instruction Cache
(States: Invalid and Valid.)

Figure 13 State Switching for Data Cache
(States: Invalid, Valid, and Valid Clean, with transitions on cache reads and writes.)
2.2.3.2 Finite State Machine for I-cache and D-cache
The difference between the I-cache and the D-cache is that the I-cache does not allow write actions.
To keep this document concise, only the state machines for the D-cache are given here. The state
machines for the I-cache are simply submachines of those of the D-cache, obtained by removing the
edges involving write actions and any isolated states left after the removal.
To realize the D-cache, two state machines are needed; we call them cachefsm and fillfsm. Cachefsm
handles the data access requests, judges hit or miss, and schedules fillfsm if the access
unfortunately misses. After fillfsm has fulfilled its task, cachefsm fetches the data from the cache
and sends them to the output pins, if necessary. Fillfsm is responsible for fetching data from memory
to fill the corresponding cache line upon a miss; it does everything from any necessary write back to
the memory reads.
The state machine for cachefsm is shown in figure 14; it is a 3-state finite state machine (FSM). It
starts from the SRW state and judges whether the request is a read or a write. For a read it switches
to the SDRW state; for a write it turns to the SDWW state; otherwise it remains in the SRW state. It
may also schedule fillfsm, depending on whether a miss happens.
The state machine for fillfsm is a bit more complicated; it is a 10-state FSM. When it is not
working, it remains in the SIDLE state. It enters the SS state when it is scheduled by cachefsm, and
then goes through a series of SWW or SRW states, depending on the type of data access request. Each
SWW or SRW state writes/reads one word to/from the memory, so a group of 4 states writes back/fills
one line in the cache. The concrete state machine for fillfsm is also shown in figure 14.
In figure 14, the dashed lines denote messages between the two state machines. Cachefsm schedules
fillfsm by sending the writeback or fillcache signal; fillfsm tells cachefsm whether it is busy by
maintaining the cachebusy signal.
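The handshake between the two machines can be sketched in software. The following Python fragment is
a simplified, hypothetical model of the transitions described above; only the request dispatch in
cachefsm and the four-word read fill in fillfsm are shown.

```python
def cachefsm(state, request, hit, cachebusy):
    """One cachefsm step: SRW dispatches requests; SDRW/SDWW wait for fillfsm.
    Returns (next_state, schedule_fill)."""
    if state == "SRW":
        if request == "read":
            return "SDRW", (not hit)         # a miss schedules fillfsm
        if request == "write":
            return "SDWW", (not hit)
        return "SRW", False
    # SDRW/SDWW: wait until fillfsm releases cachebusy, then return to SRW
    return ("SRW" if not cachebusy else state), False

def fillfsm_read_sequence():
    """States walked for a clean-line fill: SS, then one SRW state per word."""
    return ["SS", "SRW0", "SRW1", "SRW2", "SRW3", "SIDLE"]

state, fill = cachefsm("SRW", "read", hit=False, cachebusy=False)
assert state == "SDRW" and fill          # a read miss schedules fillfsm
assert fillfsm_read_sequence()[1:5] == ["SRW0", "SRW1", "SRW2", "SRW3"]
```

A dirty miss would prepend the SWW0..SWW3 write-back states before the four reads, mirroring the
write-back path in figure 14.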
Figure 14 State Machine for D-Cache
(cachefsm states: SRW, SDRW, SDWW; fillfsm states: SIDLE, SS, SWW0..SWW3, SRW0..SRW3. The dashed
lines carry the fillcache and writeback signals from cachefsm to fillfsm, and the cachebusy signal
back.)
3. Processor Debugging
3.1 Debug Overview
To facilitate debugging, I use ModelSim for pre-synthesis analysis. It provides powerful facilities
for writing testbenches, adding breakpoints, and watching waveforms. For each module in the design, I
write a bundle of testbenches and test it in ModelSim; after it has passed all test cases, I
integrate it into the system. The breakpoint function of ModelSim let me catch nearly any bug in my
design. After the CPU design was finished, I compiled and synthesized it in Quartus II, loaded the
test code and data into the memories, edited the input waveform in the Quartus Waveform Editor, and
watched the result.
3.2 Debugging Synchronous Problems
The biggest headache in the design was that Quartus II does not support asynchronous memory reads and
writes: the output of the memory is only given at a clock edge, and data can be written to the memory
only at a clock edge, too. This is quite different from the default memory in MAX+PLUS II. When I
found that the final output waveform did not conform to what I expected, I traced it down to a
delayed memory read near the second cycle of my test program and thus detected this problem.
This problem can be solved by giving a negative clock edge to the memory and a positive one to the
CPU. The memory data access is then done in the middle of a CPU cycle, so the memory appears
asynchronous. However, when I added caches to my design, this approach no longer worked: the cachefsm
state machine requires a half-cycle delay so that it can schedule the fillfsm state machine, so it
also needs the negative clock edge, which conflicts with the memory access again. Solving the latter
problem requires using different clocks for the CPU and the memory rather than simply inverting the
CPU clock and sending it to the memory. The memory cycle should be an integral multiple of the CPU
cycle to ensure that every module in the design works in phase.
3.3 Debugging the Cache
When I added the caches to my design, the machine behaved strangely again. The first problem I
discovered was that the values of both PC and IR had short cycles regardless of whether the cache hit
or missed. I checked the modules that generate these signals and found that I had forgotten to write
a module that locks instruction fetch upon a cache miss. I modified the PC calculation logic and the
first problem was solved.
But other problems were still ahead: the final results of my test programs were all wrong. So I
output the states of the cache FSMs and watched their waveforms. By doing this, I detected several
flaws in my state-switching logic, as well as some conditions that had been neglected; for example, I
had forgotten to check the dirty bit before filling the cache with new data. Once the FSM diagram is
correct, debugging state switching becomes much easier: watch the states in the waveform and see
whether they conform to your expectation; if not, trace the wrong switch down to the program and
rectify it.
4. Results
4.1 Testing Code
To test the correctness of each design, three pieces of code were used. To keep this document
concise, only the code cited in reference [1] is shown (a small modification is made at line 0C to
demonstrate the correctness of cache write back). The test code is attached in the appendix of this
document.
4.2 Single Cycle MIPS
4.3 Pipeline MIPS
As can be seen from the results, the pipelined CPU has the same number of cycles as the single-cycle
CPU, but only 1/5 the cycles of a multi-cycle CPU. The time cost of the pipelined CPU is much lower
than that of both the single-cycle and the multi-cycle CPU. It is much faster than the single-cycle
CPU because the single-cycle CPU produces a lot of glitches at the beginning of each cycle and
requires plenty of time to stabilize, whereas the pipelined CPU does not. It is much faster than the
multi-cycle CPU simply because it uses fewer cycles and all modules in the CPU are now in full-time
use. In theory, a CPU with a k-stage pipeline will have 1/k the time cost of a multi-cycle CPU,
assuming the time costs of all stages are even.
4.4 Pipeline MIPS with Cache
As seen from the result, the cache significantly cuts down the time cost when the data hit the cache.
The cycles from 3.5ms to 4.3ms show the whole process of a cache miss together with a write back;
they demonstrate the penalty incurred upon a cache miss.
5. Conclusions
In this practice, I've successfully built a MIPS CPU with pipeline and cache functionality. Three
types of hazards are solved and the memory is correctly synchronized. Two separate direct mapped
caches are designed for instruction and data access. The storage hierarchy is fully demonstrated in
this practice.
From the practice, I've become more familiar with computer architecture and organization. I have
gained a deep appreciation of the hazards involved in pipelining and how to solve them. I also
learned different cache policies along with their advantages and disadvantages. As a by-product, I
learned how to use VHDL to build digital circuits and how to debug them by writing testbenches and
watching waveforms. Above all, the practice taught me the patience and carefulness to do things in an
orderly way; it is really a valuable lesson for my college study.
6. Developing Environments
6.1 Hardware
 Processor: Intel Core Duo T2300 (1.66GHz)
 Memory: 1536MB
6.2 Software
 OS: Windows XP + SP2
 Design Software: Quartus II 7.2
 Debug and Simulation: ModelSim SE 6.2b
7. Appendices
7.1 ISA
Table 7-1 Instruction Set
7.2 Testing Codes
WIDTH=32;
DEPTH=32;
ADDRESS_RADIX=HEX;
DATA_RADIX=HEX;
CONTENT BEGIN
00 : 00000820;    %         add  $1, $0, $0      %
01 : 20020004;    %         addi $2, $0, 4       %
02 : 00001820;    %         add  $3, $0, $0      %
03 : 8C240000;    % loop:   lw   $4, 0($1)       %
04 : 20210004;    %         addi $1, $1, 4       %
05 : 00641820;    %         add  $3, $3, $4      %
06 : 2042FFFF;    %         addi $2, $2, -1      %
07 : 10400003;    %         beq  $2, $0, finish  %
08 : 00000000;    %         nop                  %
09 : 08000003;    %         j    loop            %
0A : 00000000;    %         nop                  %
0B : AC230000;    % finish: sw   $3, 0($1)       %
0C : 8C220000;    %         lw   $4, 0($2)       %
0D : 0800000D;    % here:   j    here            %
[0E..1F] : 00000000;
END;
8. References
[1] Zhu Ziyu, Li Yamin. CPU Chip Logic Design. Tsinghua University Press, 2005.
[2] John L. Hennessy, David A. Patterson. Computer Architecture: A Quantitative Approach, 3rd ed.
Morgan Kaufmann Publishers, 2003.
[3] Purdue University ECE 437 MIPS Design Labs. http://cobweb.ecn.purdue.edu/~ece437l/materials