

Name ONLY: ____________________________
CS3375: Computer Architecture
Fall 2023
Final Exam Practice Test Solution
Show all your work.
Closed book and notes.
No calculators/smartphones.
1. The performance of a machine is determined by instruction count (IC), clock cycles per instruction (CPI), and clock cycle time. Here, IC is determined by the compiler and the [ ? ]. CPI and the clock cycle time are determined by the implementation of the hardware.
• Instruction Set Architecture
2. In a single-cycle datapath, every instruction begins on one clock edge and completes execution on the next clock edge. So the CPI is [ ? ].
• CPI = 1
3. The single-cycle datapath must have separate instruction and data memories, because [ c ]
a. the formats of data and instructions are different in MIPS, and hence different memories are needed.
b. having separate memories is less expensive.
c. the processor operates in one cycle and cannot use a single-ported memory for two different accesses within that cycle.
d. all three of the above are correct
4. Based on the following instruction, what are the values (in decimal) marked as “?” in the single cycle datapath?
Suppose $29 initially contains the number 129. If a value cannot be determined, mark it as “x” (don’t care).
addi $29, $29, 16
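The original datapath figure is not reproduced here, but the values that can be determined follow directly from the semantics of addi. A minimal Python sketch (the variable names are illustrative, not labels from the original figure):

```python
# Sketch of the key datapath values for: addi $29, $29, 16
# Assumes $29 initially holds 129, per the problem statement.
regs = {29: 129}

rs = 29                     # source register field (instruction bits)
rt = 29                     # destination register field (I-format)
imm = 16                    # 16-bit immediate, sign-extended to 32 bits

read_data_1 = regs[rs]                    # register file output: 129
sign_extended = imm                       # sign-extend output: 16
alu_result = read_data_1 + sign_extended  # ALU output: 129 + 16 = 145

regs[rt] = alu_result                     # written back on RegWrite
print(read_data_1, sign_extended, alu_result)  # 129 16 145
```

Values on paths not exercised by addi (e.g., the data memory read port) remain “x” (don’t care).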
5. Assume that there are no pipeline stalls and that the breakdown of executed instructions is as follows:

add    addi   not    beq    lw     sw
20%    20%    0%     25%    25%    10%
a. In what fraction of all cycles is the data memory used?
• The data memory is used by LW and SW instructions, so the answer is:
25% + 10% = 35%
b. In what fraction of all cycles is the input of the sign-extend circuit needed?
• The sign-extend circuit actually computes a result in every cycle, but its output is ignored for ADD and NOT instructions. The input of the sign-extend circuit is needed for ADDI (to provide the immediate ALU operand), BEQ (to provide the PC-relative offset), and LW and SW (to provide the offset used in addressing memory), so the answer is:
20% + 25% + 25% + 10% = 80%
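Both fractions can be checked directly from the instruction mix (a quick Python sketch, not part of the original solution):

```python
# Instruction mix from the problem, in whole percent.
mix = {"add": 20, "addi": 20, "not": 0, "beq": 25, "lw": 25, "sw": 10}

# (a) The data memory is used only by loads and stores.
data_mem = mix["lw"] + mix["sw"]                              # 35 (%)

# (b) The sign-extend input is needed by every I-format instruction here.
sign_ext = mix["addi"] + mix["beq"] + mix["lw"] + mix["sw"]   # 80 (%)

print(data_mem, sign_ext)  # 35 80
```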
6. The operation times for the major functional units are 200ps for memory access, 200ps for ALU operation, and 100ps for register file read or write. For example, in the single-cycle design, the time required for every instruction is 800ps due to the lw instruction (instruction fetch, register read, ALU operation, data access, and register write). Here, we only consider the lw instruction for the speedup comparison.
a. [True/False]: If the time for an ALU operation can be shortened by 25%, it will affect the speedup obtained from pipelining.
• False. Shortening the ALU operation to 150ps will not affect the speedup obtained from pipelining, because the 200ps memory access still determines the clock cycle.
b. If the ALU operation now takes 20% more time, the clock cycle needs to be [ ? ] ps.
• If the ALU operation takes 20% more time (200ps × 1.2 = 240ps), it becomes the bottleneck in the pipeline. The clock cycle needs to be 240ps.
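The reasoning for both parts can be sketched numerically (a Python check under the stated unit times; the stage breakdown follows the lw steps listed in the question):

```python
# Functional-unit times for the five lw steps:
# instruction fetch (mem), register read, ALU, data access (mem), register write.
stages = {"IF": 200, "RegRead": 100, "ALU": 200, "MEM": 200, "RegWrite": 100}

single_cycle = sum(stages.values())       # 800ps, as stated in the question
pipelined_clock = max(stages.values())    # 200ps: memory access dominates

# (a) ALU shortened by 25%: 200 -> 150ps; memory still sets the 200ps clock.
clock_after_shorter_alu = max(200, 150, 100)   # still 200ps

# (b) ALU takes 20% more time: 200 -> 240ps; it now sets the clock.
clock_after_longer_alu = max(200, 240, 100)    # 240ps

print(single_cycle, pipelined_clock, clock_after_shorter_alu, clock_after_longer_alu)
```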
7. Based on the following instruction, fill out the control signal table. If a control signal value cannot be determined, mark it as “x” (don’t care).

add $t1, $t2, $t3

Inst  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
add   1       0       0         1         0        0         0       1       0
8. Assume that individual stages of the datapath have the following latencies:

IF       ID       EX       MEM      WB
250ps    350ps    150ps    300ps    200ps

a. What is the clock cycle time in a pipelined and non-pipelined (single-cycle) processor, respectively? Show all your work.
• Pipelined: 350ps (the longest stage); Non-pipelined: 1250ps (= 250 + 350 + 150 + 300 + 200)
b. What is the total latency of a lw instruction in a pipelined and non-pipelined (single-cycle) processor, respectively? Show all your work.
• Pipelined: 1750ps (each stage takes 350ps, so 350ps × 5 stages); Non-pipelined: 1250ps
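The same arithmetic as a compact Python check (not part of the original solution):

```python
# Per-stage latencies from the problem.
stages = {"IF": 250, "ID": 350, "EX": 150, "MEM": 300, "WB": 200}

pipelined_clock = max(stages.values())     # slowest stage sets the clock: 350ps
single_cycle_clock = sum(stages.values())  # all stages in one cycle: 1250ps

# A lw traverses all 5 stages; in the pipeline each stage takes one full clock.
lw_latency_pipelined = pipelined_clock * len(stages)  # 350 x 5 = 1750ps
lw_latency_single = single_cycle_clock                # 1250ps

print(pipelined_clock, single_cycle_clock, lw_latency_pipelined)  # 350 1250 1750
```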
9. Consider executing the following code on the pipelined datapath:

lw  $4, 100($2)
sub $6, $4, $3
add $2, $3, $5

(i) How many clock cycles will it take to execute this code? (ii) Draw a multiple clock cycle pipeline diagram that illustrates the dependencies that need to be resolved.
• It will take 8 clock cycles to execute this code, including a bubble of 1 cycle due to the load-use dependency between the lw and sub instructions (assuming forwarding):

Cycle:           1   2   3   4   5   6   7   8
lw  $4, 100($2)  IF  ID  EX  MEM WB
sub $6, $4, $3       IF  ID  **  EX  MEM WB
add $2, $3, $5           IF  **  ID  EX  MEM WB
10. Consider three branch prediction schemes: predict not taken, predict taken, and dynamic prediction. Assume that they all have zero penalty when they predict correctly and a two-cycle penalty when they are wrong. Assume that the average prediction accuracy of the dynamic predictor is 90%. If a branch is taken with 5% frequency, which predictor is the best choice? [ c ]
a. Predict taken
b. Dynamic prediction
c. Predict not taken
d. All three predictors are the same
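The choice follows from the average misprediction penalty of each scheme (a Python sketch of the expected cycles lost per branch):

```python
# A branch is taken 5% of the time; mispredictions cost 2 cycles.
taken = 0.05
penalty = 2

predict_not_taken = taken * penalty        # wrong only when taken:  0.05 * 2 = 0.1
predict_taken = (1 - taken) * penalty      # wrong when not taken:   0.95 * 2 = 1.9
dynamic = (1 - 0.90) * penalty             # wrong 10% of the time:  0.10 * 2 = 0.2

schemes = {"not taken": predict_not_taken, "taken": predict_taken, "dynamic": dynamic}
best = min(schemes, key=schemes.get)
print(best)  # not taken
```

Predict not taken wins here only because the branch is so rarely taken; with a higher taken frequency the dynamic predictor would come out ahead.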
11. Refer to the following sequence of instructions, and assume that it is executed on a 5-stage pipelined datapath. [2 pts]

add r5, r2, r1
lw  r3, 4(r5)
lw  r2, 0(r2)
or  r3, r5, r3
sw  r3, 0(r5)

If there is no forwarding or hazard detection, insert nops to ensure correct execution.
• Assuming the register file is written in the first half of a cycle and read in the second half, each consumer must be at least two instructions after its producer:

add r5, r2, r1
nop
nop
lw  r3, 4(r5)
lw  r2, 0(r2)
nop
or  r3, r5, r3
nop
nop
sw  r3, 0(r5)
12. Which of the following statements are generally true? Multiple answers. [ ]
a. On a read, the value returned depends on which blocks are in the cache.
b. Memory hierarchies take advantage of temporal locality.
c. Most of the capacity of the memory hierarchy is at the lowest level.
d. Most of the cost of the memory hierarchy is at the highest level.
• b, c
13. The following code is written in C, where elements within the same row are stored contiguously. Assume each word is a 32-bit integer.

for (i = 0; i < 8; i++)
    for (j = 0; j < 8000; j++)
        A[i][j] = B[i][0] + A[j][i];

a. References to which variables exhibit spatial locality?
• A[i][j] (consecutive elements of a row are accessed in order)
b. References to which variables exhibit temporal locality?
• i, j, and B[i][0] (each is reused repeatedly across iterations)
14. For a direct-mapped cache design with a 32-bit address, the following bits of the address are used to access the cache.

Tag      Index    Offset
31-13    12-6     5-0

a. What is the cache block size in words? Show all your work.
• The 6 offset bits give a block size of 2^6 = 64 bytes; since 1 word = 4 bytes, the block size is 64 / 4 = 16 words.
b. How many entries does the cache have? Show all your work.
• With 7 index bits, the cache has 2^7 = 128 entries.
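The field widths can be turned into a small address-splitting sketch (Python; the helper name is illustrative):

```python
# Field widths from the table above: offset bits 5-0, index bits 12-6, tag bits 31-13.
OFFSET_BITS = 6
INDEX_BITS = 7

def split_address(addr):
    """Split a 32-bit byte address into (tag, index, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

block_bytes = 1 << OFFSET_BITS   # 64 bytes per block
block_words = block_bytes // 4   # 16 words per block
entries = 1 << INDEX_BITS        # 128 cache entries

print(block_words, entries)           # 16 128
print(split_address(1000))            # (0, 15, 40)
```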
15. Although a larger block size [increases? or decreases?] the miss rate, it can also [increase? or decrease?] the miss penalty. If the miss penalty increases linearly with the block size, larger blocks could easily lead to [higher? or lower?] performance.
• decreases; increase; lower
16. The speed of the memory system affects the designer’s decision on the size of the cache block. Which of the following cache designer guidelines are generally valid? Multiple answers. [ ]
a. The shorter the memory latency, the larger the cache block.
b. The shorter the memory latency, the smaller the cache block.
c. The higher the memory bandwidth, the larger the cache block.
d. The higher the memory bandwidth, the smaller the cache block.
• b, c
17. Consider a cache with 128 blocks and a block size of 8 bytes. What block number does byte address 1000 map
to? Show all your work.
• block number = (block address) mod (number of cache blocks)
  o ⌊1000 / 8⌋ = 125 (convert the byte address into a block address)
  o 125 mod 128 = 125, the block number
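The two-step mapping can be checked in a couple of lines of Python:

```python
block_size = 8     # bytes per block
num_blocks = 128

byte_address = 1000
block_address = byte_address // block_size   # floor(1000 / 8) = 125
block_number = block_address % num_blocks    # 125 mod 128 = 125

print(block_address, block_number)  # 125 125
```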
18. How many tag bits are required for a direct-mapped cache with 64 KByte of data and 32-word blocks? Assume a 32-bit address. Here, 1K = 1024. Show all your work.
• 64 KByte of data is 16K words.
• With a block size of 32 words, there are (64 × 1024 bytes) / (32 × 4 bytes) = 512 (= 2^9) blocks.
• A 32-word block is 32 × 4 bytes = 2^7 bytes (a 7-bit byte offset).
• Tag bits per block = 32 − (9 + 7) = 16.
• Thus, the total tag storage is 512 × 16 = 8192 bits (= 8K bits).
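The same derivation, as a Python sketch of the bit budgeting:

```python
ADDR_BITS = 32
cache_bytes = 64 * 1024          # 64 KByte of data
block_words = 32
block_bytes = block_words * 4    # 128 bytes = 2^7 -> 7 offset bits

num_blocks = cache_bytes // block_bytes        # 512 = 2^9 -> 9 index bits
offset_bits = block_bytes.bit_length() - 1     # 7 (block size is a power of two)
index_bits = num_blocks.bit_length() - 1       # 9

tag_bits = ADDR_BITS - index_bits - offset_bits   # 32 - (9 + 7) = 16
total_tag_bits = num_blocks * tag_bits            # 512 x 16 = 8192 bits

print(num_blocks, tag_bits, total_tag_bits)  # 512 16 8192
```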
19. Suppose a cache of 8K blocks, 2-word block size, and a 32-bit address. For a 2-way set-associative cache, find (i) the total number of sets and (ii) the total number of tag bits. Here, 1K = 1024. Show all your work.
• A 2-word cache block is 8 bytes = 2^3 bytes (3 bits for the byte offset).
• (8 × 1024 blocks) / 2 ways = 4096 sets = 4 × 1024 = 2^12 sets (12 bits for the set index).
  o Total number of sets = 4096 (or 4K)
• Tag bits per block = 32 − (12 + 3) = 17 bits.
  o Total number of tag bits = 17 × 2 × 4096 = 139,264 bits (or 136 Kbits)
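The set-associative variant of the same calculation, in Python:

```python
ADDR_BITS = 32
num_blocks = 8 * 1024    # 8K blocks
block_words = 2
ways = 2                 # 2-way set associative

block_bytes = block_words * 4                 # 8 bytes = 2^3 -> 3 offset bits
num_sets = num_blocks // ways                 # 4096 sets = 2^12 -> 12 set-index bits
offset_bits = block_bytes.bit_length() - 1    # 3
set_bits = num_sets.bit_length() - 1          # 12

tag_bits = ADDR_BITS - set_bits - offset_bits    # 32 - (12 + 3) = 17
total_tag_bits = tag_bits * ways * num_sets      # 17 x 2 x 4096 = 139264 bits

print(num_sets, tag_bits, total_tag_bits)  # 4096 17 139264
```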
20. There is a small 2-way set-associative cache, consisting of four one-word blocks. (i) How many hits occur and (ii) show the final cache contents for the block address sequence 0, 8, 0, 6, 8, 3, 1, 8. Here, least recently used (LRU) is the cache replacement policy. Show all your work – fill out the cache and the hit/miss column. (With four one-word blocks and two ways there are two sets, and a block address maps to set = address mod 2.)

Addr  Hit/Miss  Set 0                  Set 1
0     miss      Memory[0]
8     miss      Memory[0], Memory[8]
0     hit       Memory[0], Memory[8]
6     miss      Memory[0], Memory[6]
8     miss      Memory[8], Memory[6]
3     miss      Memory[8], Memory[6]   Memory[3]
1     miss      Memory[8], Memory[6]   Memory[3], Memory[1]
8     hit       Memory[8], Memory[6]   Memory[3], Memory[1]

• 2 hits; final contents: Memory[8] and Memory[6] in set 0, Memory[3] and Memory[1] in set 1.
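The hit/miss trace above can be reproduced with a short LRU simulation (a Python sketch; an OrderedDict's insertion order is used to track recency):

```python
from collections import OrderedDict

# 2-way set-associative cache with four one-word blocks -> 2 sets.
NUM_SETS, WAYS = 2, 2
sets = [OrderedDict() for _ in range(NUM_SETS)]  # first key = least recently used
hits = 0

for addr in [0, 8, 0, 6, 8, 3, 1, 8]:
    s = sets[addr % NUM_SETS]        # block address mod number of sets
    if addr in s:
        hits += 1
        s.move_to_end(addr)          # mark as most recently used
    else:
        if len(s) == WAYS:
            s.popitem(last=False)    # evict the least recently used block
        s[addr] = f"Memory[{addr}]"

print(hits)                          # 2
print([sorted(s) for s in sets])     # [[6, 8], [1, 3]]
```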