Suppose that you have analyzed a benchmark that runs on your

advertisement
Problem 1:
Suppose that you have analyzed a benchmark that runs on your company’s processor.
This processor runs at 300MHz and has the following characteristics:
Instruction Type
Frequency (%)
Cycles
Arithmetic and logical
35
1
Load and Store
25
2
Branches
25
3
Floating Point
15
5
What is the CPI and MIPS rating of this processor running this benchmark?
CPI = (0.351) + (0.252) + (0.253) + (0.155) = 2.35
MIPS = 300MHzCPI = 300MHz/2.35 = 127.66 MIPS
Your company is considering a cheaper, lower-performance version of the processor.
Their plan is to remove some of the floating-point hardware to reduce the die size.
The wafer on which the chip is produced has a diameter of 10cm, a cost of $2000, and a
defect rate of 1 / (cm2). The manufacturing process has an 80% wafer yield and a value of
2 for alpha.
The current procesor has a die size of 12mm 12mm. The new chip has a die size of
10mm x 10mm, and floating point instructions will take 13 cycles to execute.
What is the CPI and MIPS rating of the new processor?
CPI = (0.351) + (0.252) + (0.253) + (0.1513) = 3.55
MIPS = 300MhZ/3.55 = 84.51
What is the original cost per (working) processor?
What is the new cost per (working) processor?
Assume that we are considering the other direction of improving the original processor
by increasing the speed of floating point. What is the best possible speedup that we could
get, and what would the CPI and MIPS rating be of the new processor?
(i.e. speeding up floating-point really well). In this case, f is the fraction of time normally
devoted to floating point (in time!). So, f=CPIfloat/CPI=(0.155)2.35 = 0.319
Max speedup = (1-0.319)-1= 1.47
CPI computed with “zerocycle” floating-point instructions: 2.35-(0.155) = 1.6
MIPS = 300/1.6 = 187.5
Problem 2:
Here is some code for a vector machine:
LP:
LV
MULTSV
MULTSV
ADDV
MULTSV
ADDV
SV
SUBI
BNEZ
V1, R5
V2, F0, V1
V3, F1, V1
V2, V2, V3
V3, F2, V1
V2, V2, V3
V2, R5
R5, R5, 8
R5, LP
#load V1 with new value
#perform the calculation
Show the convoys in the above code:
LP:
LV
MULTSV
MULTSV
ADDV
MULTSV
ADDV
SV
SUBI
BNEZ
V1, R5
V2, F0, V1
V3, F1, V1
V2, V2, V3
V3, F2, V1
V2, V2, V3
V2, R5
R5, R5, 8
R5, LP
#load V1 with new value
#perform the calculation
.
Show the execution time in clock cycles of this loop with n elements (Tn); assume Tloop
= 15. Show the equation, and give the value of execution time for n = 64.
Tn = n/64 x (Tloop + Tstart) + n x Tchime
T64 = 64/64 x (Tloop + LVstart + MULTSVstart + MULTSVstart + ADDVstart +
MULTSVstart + ADDVstart + SVstart) + 64 x 3
T64 = 64/64 x (15 + 10 + 5 + 5+ 5 + 5 + 5 + 10) + 64 x 3
T64 = 252.
Problem 3:
Consider a branch-target buffer that has penalties of 0, 2, and 2 clock cycles for correct
conditional branch prediction, incorrect prediction, and a buffer miss, respectively.
Consider a branch-target buffer design that distinguishes conditional and unconditional
branches, storing the target address for a conditional branch and the target instruction for
an unconditional branch.
What is the penalty in clock cycles when an unconditional branch is found in the buffer?
Storing the target instruction of an unconditional branch effectively removes one
instruction. If there is a BTB hit in instruction fetch and the target instruction is
available, then that instruction is fed into decode in place of the branch instruction. The
penalty is –1 cycle. In other words, it is a performance gain of 1 cycle.
Determine the improvement from branch folding for unconditional branches. Assume a
90% hit rate, an unconditional branch frequency of 5%, and a 2-cycle penalty for a buffer
miss. How much improvement is gained by this enhancement? How high must the hit
rate be for this enhancement to provide a performance gain?
If the BTB stores only the target address of an unconditional branch, fetch has to retrieve
the new instruction. This gives us a CPI term of 5%*(90%*0 +10%*2) or 0.01. The term
represents the CPI for unconditional branches (weighted by they frequency of 5%). If the
BTB stores the target instruction instead, the CPI term becomes 5%*(90%*(–1) +
10%*2) or –0.035. The negative sign denotes that it reduces the overall CPI value. The
hit percentage to just break even is simply 20%.
Problem 4:
What are pipelining hazards? Give a few examples and describe how they can be
resolved.
Structural hazards – HW can’t support this combination of instructions
Data hazards – dependence on prior instruction(s)
Control hazards – branch/jump etc.
How do you optimize cache performance to achieve the following:
a) Reduce miss rate
Larger block size, larger cache size, higher associativity
b) Reduce miss penalty
Multilevel caching
c) Reduce hit time
Reads prioritized over writes
What is static branch prediction? What is the downside to it?
Always predict “taken” (or not taken). Misprediction rate is substantial.
Why is it important for Tomasulo to issue instructions in-order?
The Tomasulo algorithm must issue instructions in-order so that the dataflow is properly
evaluated.
What is the limiting factor of Tomasulo’s performance?
CDB…same bus has to go through multiple functional units
Only one functional unit can complete per cycle.
Download