Final Report - Computer Science Division

CS 152 Computer Architecture
Final Project Report
Spring 2001
Prof. Kubiatowicz
Section: 2-4 pm Wednesday
TA: Ed Liao
Team members:
Lin Zhou (14255590)
Sonkai Lao (14429797)
Wen Hui Guan (12270914)
Wilma Yeung (14499996)
Table of Contents:
Abstract …………………………………………………………… 2
Division of Labor ………………………………………………… 2
Detailed Strategy ………………………………………………… 2
Results …………………………………………………………….. 12
Conclusion ………………………………………………………… 12
Appendix I (notebooks) …………………………………………… 13
Appendix II (schematics) ………………………………………….. 13
Appendix III (VHDL files) ………………………………………… 14
Appendix IV (testing) ……………………………………………… 14
Appendix V (delay time table) …………………………………….. 15
Abstract:
We implemented a deeply pipelined processor with seven stages, along with a branch predictor, separate instruction and data caches, and a stream buffer. Starting from the original 5-stage pipeline, we divided the instruction fetch stage into two stages, and did the same with the execution stage. In the first fetch stage, the instruction is read from the instruction cache at the address in the PC, while the branch decision is predicted in the second fetch stage. Since the ALU is the functional unit with the worst delay time (15 ns), to achieve optimal performance we broke it down into two 16-bit adders. Each adder is paired with a logic unit and placed in a different execution stage. We also added a branch predictor in the second stage. With this implementation, our pipelined processor can sustain a cycle time of 22.5 ns with the ideal memory system.

To optimize our memory system, we implemented two cache subsystems, one for data and the other for instructions. Furthermore, because memory accesses to the instruction cache are frequently sequential, we attached a stream buffer to it. Finally, the performance metrics we used for the caches are hit time and miss penalty. The hit time for both the data and instruction caches is 7 ns. The miss penalty for the data cache is estimated at 124.5 ns, while that for the instruction cache is 422 ns, based on the results from running the mystery program with the ideal memory.
Division of Labor:

Project Design Areas                      Team Member(s) Involved
Processor datapath and controllers        Sonkai Lao, Lin Zhou
Cache system, controllers, and arbiter    Wilma Yeung, Wen Hui Guan
Branch target predictor                   Sonkai Lao
Tracer monitor                            Wen Hui Guan
Forwarding unit                           Wilma Yeung
Report write-up                           Lin Zhou, Wen Hui Guan
Detailed Strategy:

a. General Design Strategy

Memory System

(1) Instruction Cache and Stream Buffer
A stream buffer is added to reduce compulsory and capacity misses, since instructions are likely to be executed in sequence in groups of four to five instructions between branches and jumps. We designed the instruction cache and stream buffer with a fully associative lookup; thus, we associate each cache and buffer block with a comparator.

For the instruction cache, we implemented a FIFO replacement policy, enforced by a counter that selects a block for replacement in the fashion of a clock. When a read-miss occurs on both the cache and the buffer, the buffer is flushed. The counter advances after the instruction cache reads in 2 sequential words for a block, and the buffer controller then fills the buffer with the next 4 sequential words. This sequence of reads corresponds to two memory accesses under our implementation of the 3-word burst request. Based on this implementation, the miss penalty is about 124.5 ns. When a hit is found in the cache, the hit time is 7 ns. If the requested word misses in the cache but hits in the buffer, it only costs one extra ns to bring the word to the NPC.
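The lookup-and-refill behavior just described can be sketched behaviorally in Python (this is a model, not our VHDL; the class name, the 8-block capacity, and the two-words-per-block layout are illustrative assumptions):

```python
# Behavioral sketch of the instruction-side lookup: a fully associative
# cache checked in parallel with a 4-word stream buffer. On a miss in
# both, the buffer is flushed, a block is FIFO-replaced with 2 words,
# and the next 4 sequential words refill the buffer (6 words total).
class ICacheWithStreamBuffer:
    def __init__(self, num_blocks=8):
        self.blocks = {}            # tag -> two-word block
        self.fifo = []              # FIFO order of tags for replacement
        self.num_blocks = num_blocks
        self.buffer = {}            # addr -> word (4 sequential words)

    def lookup(self, addr, memory):
        tag = addr // 2             # two words per block
        if tag in self.blocks:      # hit in the cache: the 7 ns path
            return self.blocks[tag][addr % 2], "cache_hit"
        if addr in self.buffer:     # miss on cache, hit on buffer: +1 ns
            return self.buffer[addr], "buffer_hit"
        # miss on both: flush the buffer, FIFO-replace a cache block
        self.buffer.clear()
        if len(self.fifo) == self.num_blocks:
            self.blocks.pop(self.fifo.pop(0))
        base = tag * 2
        self.blocks[tag] = [memory[base], memory[base + 1]]
        self.fifo.append(tag)
        for a in range(base + 2, base + 6):
            self.buffer[a] = memory.get(a, 0)
        return self.blocks[tag][addr % 2], "miss"
```

A sequential fetch pattern then hits first in the cache block and afterwards in the buffer, which is exactly the case the stream buffer is meant to capture.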
(2) Data Cache Re-design (fully associative, burst request, and write-back)
We implemented a fully associative access method for the data cache, along with a write-back write policy. Since we employ two 32-bit DRAM banks in parallel, we connect the data cache to the memory via a 64-bit bus. We also implemented the FIFO replacement policy, enforced, as in the instruction cache, by a counter that selects a block for replacement in the fashion of a clock. The counter only advances when there is a read-miss or write-miss.

To increase the performance of the cache, we designed it so that when a write to the DRAM is required, the cache writes only one word, instead of two, to the memory. However, when there is a read-miss and a block is selected for replacement, two words are read into the block to take advantage of spatial locality. This implementation also reduces the write overhead, since a word is written back to the DRAM only if its corresponding dirty bit is set. Therefore, we implement each block with two dirty bits.
Write-miss and write-hit involve some complication, since we need to update only one word (32 bits) via a 64-bit bus while keeping the implementation simple. For a write-miss, the block selected for replacement is handled in the same manner as for a read-miss: we request a read of two words from the DRAMs and set the valid bit after writing the dirty word(s), if any, to the memory. However, to put these scenarios together and fit them onto a 64-bit input bus, we need to consider several write cases. Our approach is to keep two valid bits and two dirty bits for each block. For example, on a write-hit, we simply update the appropriate word and set its dirty bit. For a compulsory write-miss, we update the tag, the cache word, and the corresponding valid bit. By keeping two valid bits, we can do non-allocate-on-write to reduce the overhead. Because the data source for a write can be either the CPU or the DRAM, the target block may be empty (no valid bits set), and the input data arrives on a 64-bit bus, we need to choose the 64-bit input data appropriately. In this regard, the data cache controller distinguishes the input source and generates a 3-bit source selection signal to choose the input data.
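The per-word valid/dirty bookkeeping above can be sketched as follows (a behavioral model with hypothetical names, not our VHDL; only the dirty-word write-back path is shown):

```python
# One data-cache block with two words, two valid bits, and two dirty
# bits. A write-hit updates and dirties only the touched word; a
# write-back writes only the dirty word(s) to memory.
class DCacheBlock:
    def __init__(self):
        self.tag = None
        self.words = [0, 0]
        self.valid = [False, False]
        self.dirty = [False, False]

    def write_hit(self, offset, data):
        self.words[offset] = data
        self.valid[offset] = True
        self.dirty[offset] = True

    def write_back(self, memory):
        """Write only the dirty word(s) back; return how many were written."""
        written = 0
        base = self.tag * 2            # two words per block
        for i in (0, 1):
            if self.valid[i] and self.dirty[i]:
                memory[base + i] = self.words[i]
                self.dirty[i] = False
                written += 1
        return written
```

With only one of the two words dirty, a replacement costs a single-word write instead of a full-block write, which is the overhead reduction described above.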
(3) Data Cache Controller design
As described in the data cache design section, write-hit and read-hit are handled similarly. A read-hit involves selecting the right word to output to the CPU; on a write-hit, the appropriate word needs to be updated. These two cases can be handled in the same state by recognizing the type of hit and generating the corresponding selection signals.

For both kinds of miss, the cache requests permission to access the DRAM through the arbiter (described below). It has to wait for the grant signal from the arbiter before performing the write or read. The state diagram (Appendix II) shows that a hit is handled in the START state. When a miss is triggered, the controller goes to the GRANT_W or GRANT_R state until the write or read permission is granted. Once permission is granted, the controller makes a transition to the write or read state and waits until the DRAM is idle. When the operation is completed, the controller goes back to the initial state.

It is possible that a read-miss occurs right after servicing a write-miss. As shown in the diagram, in this case the state machine can request the read immediately after finishing the write to the DRAM. This implementation actually saves one cycle on the fly.
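The transitions described above can be sketched as a Python transition table (state names are taken from the diagram; the hit/grant/idle/pending inputs are simplified stand-ins for the real control signals):

```python
# Next-state sketch of the data-cache controller FSM: hits are serviced
# in START; misses wait in GRANT_W/GRANT_R for the arbiter, then in
# WRITE/READ for the DRAM. A pending read-miss after a write-miss goes
# straight back to GRANT_R, saving one cycle.
def dcache_next_state(state, hit, is_write, granted, dram_idle,
                      pending_read=False):
    if state == "START":
        if hit:
            return "START"                 # read-hit / write-hit handled here
        return "GRANT_W" if is_write else "GRANT_R"
    if state == "GRANT_W":
        return "WRITE" if granted else "GRANT_W"
    if state == "GRANT_R":
        return "READ" if granted else "GRANT_R"
    if state == "WRITE":
        if not dram_idle:
            return "WRITE"
        return "GRANT_R" if pending_read else "START"
    if state == "READ":
        return "READ" if not dram_idle else "START"
```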
(4) Instruction Cache and Stream Buffer Controller design
The instruction cache and stream buffer controllers are two communicating state machines. On a hit in either the cache or the buffer, both machines stay in the START state; if the hit is in the buffer, the buffer controller, while in START, instructs the buffer to output the requested word instead.

On misses, the buffer controller takes the active role, communicating with the arbiter and memory controller to request a sequence of 6 words (three burst reads, two words each). Once the arbiter grants the memory access request to the buffer controller, the first two words are written to the selected cache block while the next 4 sequential words are put into the buffer. After each burst read from the memory, the buffer controller waits for one cycle to let the DRAM recover. In addition, it is quite possible that a write-miss is followed by a read-miss. We handle this case by requesting memory access again immediately after the write-miss is serviced, because the availability of instruction data is critical to processor performance. The instruction cache and stream buffer controller state diagrams are given in Appendix II.
(5) Memory Controller design
The memory controller (memctrl) is a synchronous process containing two asynchronous components: the memory read component (mem_read) and the memory write component (mem_write). The synchronous input signals of memctrl trigger mem_read or mem_write to perform the operation; separating them makes it easy for memctrl to distinguish read from write operations. To increase the throughput of the DRAMs, we implemented the 3-word burst read request: within the RAS and CAS timing constraints, for each RAS assertion we toggle the CAS signal three times to accomplish the burst read.
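A rough behavioral sketch of that burst read (the even/odd bank interleave and the function name are illustrative assumptions, not taken from our schematics):

```python
# One row access (RAS), then the CAS strobed three times. The two
# 32-bit DRAM banks are read in parallel over the 64-bit bus, so each
# CAS toggle yields two words and one request yields 6 sequential words.
def burst_read(bank_even, bank_odd, row, col, n_cas=3):
    words = []
    for i in range(n_cas):                  # three CAS toggles per RAS
        c = col + i
        words.append(bank_even[(row, c)])   # lower 32 bits of the bus
        words.append(bank_odd[(row, c)])    # upper 32 bits of the bus
    return words
```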
(6) Memory Arbiter
We designed the arbiter as a state machine consisting of 3 states: Start, Instruction, and Data. From the Start state, when a request comes from either the data or the instruction cache, the current state makes a transition to the respective state along with the appropriate memory request signal. When requests come from both the instruction and data sides at the same time, the data request takes priority, and the instruction fetch register is stalled until the memory access is completed. At that point, the instruction request is serviced.
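The priority behavior can be sketched as follows (signal names are illustrative; the real arbiter also drives the memory request signals):

```python
# Three-state arbiter sketch (Start, Instruction, Data). When both
# caches request in the same cycle, the data request wins, and the
# stalled instruction request is serviced right after the data access.
def arbiter_next(state, i_req, d_req, mem_done):
    if state == "Start":
        if d_req:                  # data takes priority over instruction
            return "Data"
        if i_req:
            return "Instruction"
        return "Start"
    if not mem_done:               # hold the state until the access ends
        return state
    if state == "Data" and i_req:  # stalled instruction fetch goes next
        return "Instruction"
    return "Start"
```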
Datapath

(7) Main structure of the 7-stage deep pipeline (overview)

[Pipeline overview figure: IF1 → IF2 → DEC → EXE1 (ALU1) → EXE2 (ALU2) → MEM → WB, with pipeline registers between stages.]

(8) Branch Target Predictor and Controller
To implement a branch target predictor, we put a branch prediction table in the second instruction fetch stage. The prediction table maintains 16 entries of PCs, predicted PCs, and next-state bits. The prediction is based on the input PC from the next-PC logic. The prediction table is associated with a 2-bit state machine, which always predicts "not taken" the first time. The predicted NPC is latched into the next-PC logic, but the actual decision is made in the first EXE stage. If the prediction is wrong, the instruction registers are flushed, which costs an extra two cycles of delay.
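The 2-bit decision machine can be sketched as a saturating counter (a standard scheme consistent with the description above; the state names are ours, not from the VHDL):

```python
# 2-bit branch predictor states. Starting in strong not-taken matches
# the "always predict not taken the first time" behavior; a wrong
# prediction moves one step toward the other decision.
STRONG_NT, WEAK_NT, WEAK_T, STRONG_T = 0, 1, 2, 3

def predict(state):
    """True means 'predict taken'."""
    return state >= WEAK_T

def update(state, taken):
    """Advance the saturating counter with the actual branch outcome."""
    if taken:
        return min(state + 1, STRONG_T)
    return max(state - 1, STRONG_NT)
```

Two consecutive taken branches are needed before the table starts predicting taken, so a single anomalous branch does not flip the prediction.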
[Figure: the 2-bit predictor state diagram (Predict Taken / Predict Not Taken states with T/NT transitions and a flush on a wrong decision), and the IF2 datapath showing the instruction cache, branch predict table, next-PC logic, and the predicted_PC / correct PC comparison.]
(9) ALU Modification
To reduce the cycle time, we divided the original 32-bit ALU into two 16-bit adders. Addition and subtraction are handled by the adders across the 2 stages: the first adder in EXE1 operates on the lower 16 bits of the operands, while the upper 16 bits are calculated by the other adder in EXE2. Each execution stage also has a 32-bit logic unit, which handles the logic operations and shifts. The reason for having two logic units is to avoid stalls between arithmetic and shift/logic operations. The logic units do their operations in different stages, and the output of the first logic unit is forwarded to the EXE2 stage. If there is a RAW hazard, the controller detects the hazard and selects the right logic unit output through the control of a MUX; forwarding can also proceed as usual.
For example, choose the first output of the Logic Unit:
    sll   $1 $2 $3
    addiu $4 $1 3
Choose the second output of the Logic Unit:
    addiu $1 $2 $3
    sll   $4 $1 2
In addition, gate-level logic blocks, such as the SLT logic unit, are built in order to reduce the cycle time of the critical path.
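The split addition across the two stages can be sketched in Python (a behavioral model of the idea, not the schematic; the function names are ours):

```python
# 32-bit add as two 16-bit adds: EXE1 adds the low halves and produces
# a carry-out; EXE2 adds the high halves plus that carry one stage
# later. Subtraction works the same way once B is inverted and the
# initial carry-in is 1 (two's complement).
def add16(a, b, cin):
    s = a + b + cin
    return s & 0xFFFF, s >> 16        # 16-bit sum and carry-out

def staged_add32(a, b, cin=0):
    lo, carry = add16(a & 0xFFFF, b & 0xFFFF, cin)                 # EXE1
    hi, _ = add16((a >> 16) & 0xFFFF, (b >> 16) & 0xFFFF, carry)   # EXE2
    return (hi << 16) | lo
```

For subtraction, `staged_add32(a, (~b) & 0xFFFFFFFF, 1)` computes a - b modulo 2^32 with the same two-stage structure.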
[Figure: the two-stage ALU datapath — a 16-bit adder, a 32-bit logic unit, and the SLT logic split across EXE1 and EXE2, with pipeline registers, output MUXes (Out_1[31:0], Out_2[31:0]), and the forwarding path to DEC (Mux32x8).]
(10) Forwarding Unit
The forwarding unit (hazard controller) is written in VHDL. It compares the registers among the decode, 1st execution, 2nd execution, and memory stages. When a data dependency is detected, it forwards the depended-on value to the earlier stage, thereby avoiding RAW hazards. To implement the forwarding unit, we coded the VHDL to generate the selection signals for the various MUXes needed to accomplish the correct data flow based on the result of the comparison. There are 2 muxes at the decode stage, 4 muxes at the 1st execution stage, and 5 muxes at the 2nd execution stage. The unit also controls which logic unit to use for each shift/logic operation.
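The comparison at the heart of the unit can be sketched as follows (a simplified single-source-register version; the real unit handles both source registers and drives all the muxes listed above):

```python
# RAW-hazard comparison sketch: compare one source register against the
# destination registers in the later stages and pick the youngest match
# as the forwarding mux select. Stage/signal names are illustrative.
def forward_select(src_reg, exe1_dst, exe2_dst, mem_dst):
    """Return which stage's result to forward, or 'regfile' for none."""
    if src_reg == 0:                # $0 is constant and never forwarded
        return "regfile"
    if src_reg == exe1_dst:         # the youngest producer wins
        return "exe1"
    if src_reg == exe2_dst:
        return "exe2"
    if src_reg == mem_dst:
        return "mem"
    return "regfile"
```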
b. Testing Strategy

Delay modeling of major components:

(1) Forwarding Unit
The forwarding unit generates the selection signals for the muxes by comparing registers from different stages. In hardware, it can be designed with muxes and simple logic gates. As a result, we model its time delay as 5 ns.
(2) Predict Entry
The predict entry keeps the PC, the NPC, and the state bits of the prediction state machine. It can be implemented as a register; thus a delay time of 3 ns is reasonable.
(3) Branch Table Controller
The state transitions of the controller follow edge-triggered flip-flops. Since all the output signals result from simple combinational logic, we consider the controller to have the same delay as a series of two registers, which is 6 ns.
(4) Comparator
The 9-bit comparator used to compare the tags can be implemented with nine 2-input NAND gates and nine inverters. However, ANDing the 9 single-bit results may need a 9-input AND gate, and this high fan-in does increase the delay. Another way is to break it down into more levels, which also increases the delay, to 3 levels of gates excluding inverters. Therefore, we give it a delay of 6 ns.
(5) 3-bit Counter
The 3-bit counter can be implemented with 3 toggle flip-flops in parallel and a fair amount of combinational logic to select the output bits. Thus the delay of the counter is tied to the delay of the flip-flops. It is reasonable to model the delay of a toggle flip-flop as 2 ns, the same as that of a register; therefore a delay time of 2 ns for the 3-bit counter seems justifiable.
(6) Cache Block Select Decoder
The decoder takes a 3-bit value as input and outputs 8 single-bit values in parallel, with at most one bit asserted, based on the 3-bit value. This can be realized as one level of 8 NAND gates in parallel, excluding inverters. As a result, we assign it a time delay of 2 ns.
(7) Instruction and Data Cache Controllers
The cache controllers can be implemented with pure combinational logic, with state transitions merely triggered by flip-flops. A reasonable delay is 3 ns.
(8) Memory Controller
The state transitions of the controller follow edge-triggered flip-flops. Since all the output signals result from simple combinational logic, we consider the controller to have the same delay as a series of two registers, which is 6 ns.
(9) Memory Arbiter
The memory arbiter is designed as a state machine with direct state-bit assignment; its structure is therefore two-level combinational logic. As for the memory controller, we use a 6 ns delay time to model the arbiter.
(10) Other common components are modeled with the same values as in lab 4 to lab 6.
Memory System Testing Methodology

Cache, stream buffer, and controllers:
(1)
We first test the individual cache and controller components; the individual testing scripts (.cmd files) can be found in Appendix IV. Then we do a comprehensive test on the memory cache system as a whole. Read-hits are tested in a straightforward manner by addressing cache contents and verifying their values.
(2)
In particular, we check for correct handling of the replacement policy by doing write-misses on all blocks first. An additional write-miss then chooses the first block for replacement because of the FIFO policy. As a result, the dirty word is written back to the DRAM before the new word is written into the cache block; we check the memory contents for the correct update. Read-misses are tested similarly.
(3)
For the instruction cache, we particularly check the case of a read-miss in the cache with a hit in the buffer. Reading the next 3 sequential words should then cause no unnecessary delay or stall, because the buffer already holds the 4 sequential words. After that, a miss in both the cache and the buffer requires flushing the buffer and bringing in another 6 sequential words.
Memory controller:
(1) We first test the basic write and read operations and verify the corresponding memory contents and outputs.
(2) To test the 3-word burst read from memory, we input a row address and then lower the RAS signal. During this period, we supply the controller with three consecutive column addresses, each followed by lowering the CAS signal. After one access of the DRAMs, we expect 6 sequential words to be output, since we have two DRAMs in parallel and access both at the same time.
Comprehensive testing:
(3) Our testing strategy is hierarchical. First of all, we break the pipeline down into its 7 stages and test it with the mystery program used in the previous labs to make sure the pipeline works. After testing all the basic components, such as the cache blocks, caches, controllers, forwarding unit, branch predictor, and arbiter, we proceed to add one functional unit at a time to the pipeline and run it with an ideal memory. After ensuring those functional units work properly, we attach the memory and cache system to the pipeline and run a series of intensive tests to further confirm the proper functionality of each unit and of the whole processor. Finally, a trace monitor is added to the datapath to monitor the committed results in the write-back stage.
Results:

a. Critical path

(4) The processor critical path:

Time Delay for Critical Path
Mux16x2      : 1.5 ns
Mux32x8      : 3.5 ns
Mux32x5      : 3.5 ns
16-bit Adder : 8 ns
SLT Logic    : 3 ns
32-bit Reg   : 3 ns
Total = 22.5 ns
[Figure: the critical path through the two-stage ALU datapath (the same datapath as in the ALU Modification section).]
Compared with the cycle time of 37 ns obtained from the 5-stage pipeline, the cycle time of the deep-pipelined processor is a 39.2% improvement.
(5) Cache performance metrics:
Data and Instruction Hit Time (on cache or buffer):
9-bit comparator (6 ns) + 32-bit tristate buffer (1 ns) → 7 ns

[Figure: hit path — the tag (Tag[8:0]) and index (I_addr[9:1]) feed the 9-bit comparator (6 ns); on a hit, the 32-bit tristate word-selection logic (1 ns) drives the selected word to the PC.]
Instruction Cache Miss Penalty = 422 ns:

[Figure: miss path — 9-bit comparator (6 ns) → arbiter (2 ns) → stream buffer controller (6 ns) → memory controller (6 ns) → DRAM access (400 ns) → tristate (1 ns) → to PC (1 ns), over the 64-bit bus.]

Miss Penalty = (6 + 2 + 6 + 6 + 400 + 1 + 1) ns = 422 ns
Data Cache Miss Penalty = 124.5 ns:

[Figure: miss path — 9-bit comparator (6 ns) → arbiter (2 ns) → cache controller (6 ns) → memory controller (6 ns) → DRAM access (100 ns) → mux (3.5 ns) → tristate word-selection logic (1 ns), over the 64-bit bus.]

Total Delay = (6 + 2 + 6 + 6 + 100 + 3.5 + 1) ns = 124.5 ns

The miss penalties for the data and instruction caches differ significantly because the instruction cache burst reads 3 times (page mode) on a miss, while the data cache reads only once (normal mode).
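The two miss-penalty figures above can be checked with a quick sum (the stage labels are taken from the delay chains shown in the figures):

```python
# Sanity check of the quoted miss penalties against the per-stage delays.
icache_chain = {"comparator": 6, "arbiter": 2, "buffer_ctrl": 6,
                "mem_ctrl": 6, "dram_page_mode": 400,
                "tristate": 1, "to_pc": 1}
icache_penalty = sum(icache_chain.values())     # instruction cache: 422 ns

dcache_chain = {"comparator": 6, "arbiter": 2, "cache_ctrl": 6,
                "mem_ctrl": 6, "dram_normal": 100,
                "mux": 3.5, "tristate": 1}
dcache_penalty = sum(dcache_chain.values())     # data cache: 124.5 ns
```

The gap between the two comes almost entirely from the DRAM term: 400 ns for the three page-mode bursts versus 100 ns for the single normal-mode access.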
b. Cycle time for the pipelined processor

The calculated delay for the critical path is 22.5 ns, so the clock rate for the new processor is 1/22.5 ns = 44.44 MHz.
Conclusion:
We have implemented a deep-pipelined processor running at 44.44 MHz with branch prediction capability. We also implemented a stream buffer to complement the instruction cache. Both the data and instruction caches use fully associative access with a write-back write policy and FIFO replacement. Furthermore, we implemented the 3-word burst memory read request for each DRAM; it works well with the memory controller, DRAMs, caches, and cache controllers as a whole via a 64-bit bus. An arbiter is constructed to prioritize simultaneous data and instruction requests; the data cache request always takes priority. If time permitted, we would implement a victim cache for the data cache to further enhance the cache system.
Appendix I (notebooks):
a. Lin Zhou
b. Sonkai Lao
c. Wilma Yeung
d. Wen Hui Guan
Appendix II (schematics):
The schematics shown in this report are listed below.

Component                       Schematic file
Data cache controller           Data_cache_fsm.ppt
Instruction cache controller    Instr_cache_fsm.ppt
Cache and memory system         Mem_cache_system.ppt
Arbiter controller              Fsm_for_arbiter.ppt
Stream buffer controller        Stream_buffer.fsm.ppt
Presentation schematics         Sum.ppt
Appendix III (VHDL files):
VHDL files for all components and blocks used:

Component                        VHDL file
Data cache controller            Data_cache_ctrl.vhd
Predict entry                    Predict_entry.vhd
Instruction cache controller     Instr_cache_ctrl.vhd
Stream buffer controller         Stream_buf_ctrl.vhd
Memory controller                Memctrl.vhd
Branch table controller          B_table_ctrl.vhd
Memory arbiter                   Final_mem_arbiter.vhd
Hazard controller (forwarding)   Hazardcontroller.vhd
16-bit ALU                       Half_alu.vhd
Debugging tracer (monitor)       Display.vhd
3-bit counter                    Counter_3bit.vhd
m1x8 mux                         Mux1x8.vhd
m10x2 mux                        M10x2.vhd
m32x2 mux                        M32x2.vhd
m8x2 mux                         M8x2.vhd
10-bit tristate                  Tristate_10bit.vhd
32-bit tristate                  Tristate.vhd
9-bit comparator                 Comparator9.vhd
9-bit register                   Register9int.vhd
1-bit register                   Reg1.vhd
Block select decoder             Block_sel_decoder.vhd
Memory read component            Mem_read.vhd
Memory write component           Mem_write.vhd

Note: All other components, such as the extender, shifter, and muxes, are the same as those used in lab 4 to lab 6, so they are not listed here again.
Appendix IV (testing):
Component testing:
Tests are performed in hierarchical order:

Component                  Command file
Overall system test        Final.cmd, Quicksort.cmd
Cache system               Cachesystem.cmd, Test_cache_sys.cmd, A_partial.cmd
Pipeline datapath          Test_datapathfinal.cmd
Forwarding unit            Forward.cmd
Forwarding on quickSort    Forward_no_sort.cmd, Forward_sort.cmd
Instruction cache          Instr_cache
Memory system              Test_mem_sys.cmd
Dram bank                  Test_memoryBank.cmd
Data cache                 Test_datacache.cmd
Replacement policy         Replace.cmd
Arbiter                    Test_arbiter.cmd

Other files: all testbenches are saved in the folder finalproj/behv/testbnch/.
Log files:
(1) for testing system → finalproj/fsimtemp.log

Final Datapath testing:
Datapath schematic file:  ../finalproj/sch/datapathhazard.1
Datapath command file:    ../finalproj/cmd/final.cmd
Monitor output file:      ../finalproj/disassembly.out
Appendix V (delay time for each component):

Component                         Delay time
Stream buffer controller          6 ns
Predict entry                     3 ns
Hazard controller (forwarding)    5 ns
Branch table controller           6 ns
Data cache controller             6 ns
Instruction cache controller      6 ns
Block select decoder              2 ns
3-bit counter                     2 ns
m1x8 mux                          3.5 ns
m32x8 mux                         3.5 ns
m10x2 mux                         1.5 ns
m32x2 mux                         1.5 ns
m8x2 mux                          1.5 ns
10-bit tristate                   1 ns
32-bit tristate                   2 ns
9-bit comparator                  6 ns
9-bit register                    3 ns
32-bit register                   3 ns
Logic unit (shifter/logic unit)   10 ns
16-bit adder                      8 ns
SLT logic                         3 ns
1-bit register                    3 ns
Data, instr. cache controller     6 ns
Memory controller                 6 ns
Memory arbiter                    6 ns
Memory read component             (part of the memory controller)
Memory write component            (part of the memory controller)