1. In the following code fragment, memory reads and writes (LW and SW) take 2
clock cycles, MUL takes 4 clock cycles and ADD takes 1 clock cycle. Assume
that A holds the value of 3, B holds the value of 4. Assume r1 initially holds the
value of 0.
LW r3, A       ; Load value of variable A
LW r2, B       ; Load value of variable B
MUL r1, r2, r3 ; r1 = r2 * r3
ADD r1, r1, r2 ; r1 = (r2 * r3) + r2
SW r1, C       ; C = (r2 * r3) + r2
a) On a scalar processor, how many clock cycles does it take to execute this
program? What is the final answer stored in C?
11 clock cycles, C = 16
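The arithmetic behind this answer can be checked with a small simulation. The following is only a sketch: the tuple encoding of the instructions and the function name `run_scalar` are illustrative choices, not part of the original problem. It assumes a scalar processor that finishes each instruction before starting the next, so the total is simply the sum of the latencies.

```python
# Latencies from the problem statement, in clock cycles.
LATENCY = {"LW": 2, "SW": 2, "MUL": 4, "ADD": 1}


def run_scalar(program, memory, registers):
    """Execute instructions strictly one after another; return total cycles."""
    cycles = 0
    for op, reg, *rest in program:
        if op == "LW":        # reg <- memory[variable]
            registers[reg] = memory[rest[0]]
        elif op == "SW":      # memory[variable] <- reg
            memory[rest[0]] = registers[reg]
        elif op == "MUL":
            registers[reg] = registers[rest[0]] * registers[rest[1]]
        elif op == "ADD":
            registers[reg] = registers[rest[0]] + registers[rest[1]]
        cycles += LATENCY[op]
    return cycles


# Hypothetical encoding of the code fragment above.
program = [
    ("LW", "r3", "A"),
    ("LW", "r2", "B"),
    ("MUL", "r1", "r2", "r3"),
    ("ADD", "r1", "r1", "r2"),
    ("SW", "r1", "C"),
]
memory = {"A": 3, "B": 4}
registers = {"r1": 0}

cycles = run_scalar(program, memory, registers)
print(cycles, memory["C"])  # 11 16
```

The cycle count is 2 + 2 + 4 + 1 + 2 = 11, and C = (4 * 3) + 4 = 16.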
b) On a superscalar processor with two identical functional units and a
fetch-dispatch-retire policy that completely ignores data dependencies, what is
the final answer stored in C? How many clock cycles are required to execute the
program now?
CC #   1          2          3               4               5               6
FU1    LW r3,A    LW r3,A    MUL r1,r2,r3    MUL r1,r2,r3    MUL r1,r2,r3    MUL r1,r2,r3
FU2    LW r2,B    LW r2,B    ADD r1,r1,r2    SW r1,C         SW r1,C
R1     0          0          4               4               4               12
R2                4          4               4               4               4
R3                3          3               3               3               3
C                                                            4               4
Takes 6 clock cycles, C = 4
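The wrong result can be reproduced with a short simulation of the dependency-blind policy. This is a sketch under stated assumptions, none of which come from the original problem text: each idle unit takes the next instruction in program order, operands are read in the cycle the instruction is issued, and the result becomes visible at the end of the instruction's last cycle. The name `run_naive_superscalar` and the tuple encoding are illustrative.

```python
LATENCY = {"LW": 2, "SW": 2, "MUL": 4, "ADD": 1}


def run_naive_superscalar(program, memory, registers, units=2):
    """Issue in program order to any idle unit, never checking dependencies."""
    queue = list(program)
    free_at = [1] * units   # first cycle in which each unit can issue
    in_flight = []          # (finish_cycle, store, key, value)
    cycle = 0
    while queue or in_flight:
        cycle += 1
        for unit in range(units):
            if queue and free_at[unit] <= cycle:
                op, reg, *rest = queue.pop(0)
                # Assumption: operands are read at issue time, stale or not.
                if op == "LW":
                    result = (registers, reg, memory[rest[0]])
                elif op == "SW":
                    result = (memory, rest[0], registers[reg])
                elif op == "MUL":
                    result = (registers, reg,
                              registers[rest[0]] * registers[rest[1]])
                else:  # ADD
                    result = (registers, reg,
                              registers[rest[0]] + registers[rest[1]])
                finish = cycle + LATENCY[op] - 1
                free_at[unit] = finish + 1
                in_flight.append((finish, *result))
        # Assumption: results are written at the end of their final cycle.
        done = [item for item in in_flight if item[0] == cycle]
        in_flight = [item for item in in_flight if item[0] != cycle]
        for _, store, key, value in done:
            store[key] = value
    return cycle


program = [
    ("LW", "r3", "A"),
    ("LW", "r2", "B"),
    ("MUL", "r1", "r2", "r3"),
    ("ADD", "r1", "r1", "r2"),
    ("SW", "r1", "C"),
]
memory = {"A": 3, "B": 4}
registers = {"r1": 0}

cycles = run_naive_superscalar(program, memory, registers)
print(cycles, memory["C"], registers["r1"])  # 6 4 12
```

ADD is issued alongside MUL in cycle 3 and reads the old r1 = 0, so SW stores 0 + 4 = 4 before MUL finally writes 12 into r1 in cycle 6.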
c) Suggest some ways in which correct execution can generally be enforced on
this naïve processor.
i) Reorder instructions so that those not dependent on previous
computations “go first”.
ii) Pad with the required number of NOP instructions.
2. Suppose that there is a CPU with a 4-stage pipeline (fetch, decode, execute,
writeback) with no branch prediction, but with branching decisions becoming
available at the end of the decode stage.
a. Explain what delayed branches are, and why they are necessary in this
architecture.
- Instruction(s) following the branch is/are effectively executed
before the branch itself.
- Caused by the fetch stage loading up the instruction(s)
following the branch before a branching decision is made.
b. How many instructions are executed before the branch is taken?
- One
branch  F D E W
i1        F D E W
t1          F D E W
i1,i2 = instructions immediately after branch, t1 = instruction at target
c. Suppose that we now duplicate the pipeline (i.e. there are now two
identical pipelines). Discuss how this affects the number of instructions
executed before a branch is taken, assuming that there are no data
dependencies.
- Number of instructions executed before branch is taken
increases to 3.
branch  F D E W
i1      F D E W
i2        F D E W
i3        F D E W
t1          F D E W
            F D E W
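The counts for the single and duplicated pipelines follow the same rule, which can be written down as a small sketch. It assumes that the branch occupies the first slot of its fetch group and that every pipeline fetches one instruction per cycle until the decision is known. The function name `delay_slots` and its parameters are illustrative, not taken from the original.

```python
STAGES = ["fetch", "decode", "execute", "writeback"]


def delay_slots(width, resolve_stage="decode"):
    """Instructions fetched after the branch before its outcome is known.

    width: number of identical pipelines (instructions fetched per cycle).
    resolve_stage: stage at whose end the branch decision is available.
    """
    cycles_until_decision = STAGES.index(resolve_stage) + 1
    fetched = width * cycles_until_decision
    return fetched - 1  # exclude the branch itself


print(delay_slots(1))  # 1  (single pipeline: i1)
print(delay_slots(2))  # 3  (two pipelines: i1, i2, i3)
```

With the decision available at the end of decode, two fetch cycles elapse, so a single pipeline has fetched 2 instructions (the branch plus i1) and two pipelines have fetched 4 (the branch plus i1, i2 and i3).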
d. In a “normal” program with “normal” data dependencies, discuss why
delayed branches are bad for efficient execution in superscalar pipelines.
- May not have instructions to insert into delay slots. Forced
to insert NOPs.
- Decreases efficiency of pipeline. Three times as drastic in a
superscalar pipeline as in a scalar pipeline (three delay slots to fill
instead of one).
3. Consider a processor with 4 architectural registers A, B, C and D and 16
physical registers R0 to R15. A program can have 3 types of dependencies:
a. Write-write dependencies: This occurs when consecutive instructions
write to the same register:
A = A + 1;
A = A + B;
b. Read-write dependencies: This occurs when an instruction writes into a
register read by a previous instruction:
A = B + C;
B = D + 2;
c. True Data Dependency: This occurs when an instruction depends on the
results written by a previous instruction:
A = B + C;
D = A + 2;
Identify the dependencies in this program and describe how the dependencies
affect the dispatch of each instruction within a processor that can handle 2
instructions at a time (i.e. you only need to test for dependencies between
pairwise adjacent instructions I0 and I1, I1 and I2, I2 and I3 etc.). For
example, a true dependency between instructions I0 and I1 will prevent them
from being executed together because I1 will receive the wrong value in
register A.
I0: A = B + 1
I1: C = A + D
I2: A = B + 2
I3: D = A - 1
I4: C = D + 1
I5: C = 1
Dependency               Effect
TDD between I0 and I1    I1 must wait for I0 to complete.
RWD between I2 and I1    I2 cannot commit changes to A until I1 has read A.
TDD between I3 and I2    I3 must wait for I2 to commit results to register A
                         before reading it.
TDD between I4 and I3    I4 must wait for I3 to commit results to D.
WWD between I4 and I5    I5 must commit to C only after I4 has done so.
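The pairwise test described in the question can be sketched in a few lines. Each instruction is encoded here as a (destination, set of source registers) pair. This encoding and the names `classify` and `adjacent_dependencies` are illustrative assumptions, not part of the original problem.

```python
# (destination, source registers) for I0..I5; constants are not registers.
program = [
    ("A", {"B"}),       # I0: A = B + 1
    ("C", {"A", "D"}),  # I1: C = A + D
    ("A", {"B"}),       # I2: A = B + 2
    ("D", {"A"}),       # I3: D = A - 1
    ("C", {"D"}),       # I4: C = D + 1
    ("C", set()),       # I5: C = 1
]


def classify(earlier, later):
    """Return the dependency types between two instructions in program order."""
    e_dst, e_src = earlier
    l_dst, l_src = later
    found = []
    if e_dst in l_src:      # later reads what earlier writes
        found.append("TDD")
    if l_dst in e_src:      # later overwrites what earlier reads
        found.append("RWD")
    if l_dst == e_dst:      # both write the same register
        found.append("WWD")
    return found


def adjacent_dependencies(instructions):
    """Check only pairwise adjacent instructions, as the question asks."""
    return {
        (i, i + 1): classify(instructions[i], instructions[i + 1])
        for i in range(len(instructions) - 1)
    }


deps = adjacent_dependencies(program)
for (i, j), kinds in deps.items():
    print(f"I{i} -> I{j}: {', '.join(kinds) or 'none'}")
```

Running it lists TDD for I0/I1, RWD for I1/I2, TDD for I2/I3, TDD for I3/I4 and WWD for I4/I5, matching the table.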
d. Show how register renaming using R0 to R15 can remove all
dependencies except true dependencies.
Map A in I0 and I1 to R0
Map A in I2 and I3 to R1
Map B in I0, I2 to R2
Map C in I1, I4 to R3
Map C in I5 to R4
Map D in I1, I3, I4 to R5
I0: R0 = R2 + 1
I1: R3 = R0 + R5
I2: R1 = R2 + 2
I3: R5 = R1 - 1
I4: R3 = R5 + 1
I5: R4 = 1
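A more mechanical way to do the renaming is sketched below. It is a different scheme from the hand mapping above, not a reconstruction of it: every write is given a fresh physical register, and every read uses the most recent mapping of its architectural register. The hand mapping only has to separate adjacent instructions, whereas this scheme separates every pair. The initial mapping (A to R0, B to R1, C to R2, D to R3), the absence of register reclamation, and the name `rename` are all assumptions made for illustration.

```python
def rename(program, arch_regs=("A", "B", "C", "D"), num_phys=16):
    """Rename destinations to fresh physical registers (no reclamation)."""
    free = [f"R{i}" for i in range(num_phys)]
    # Assumed initial mapping: A->R0, B->R1, C->R2, D->R3.
    mapping = {reg: free.pop(0) for reg in arch_regs}
    renamed = []
    for dst, srcs in program:
        new_srcs = [mapping[s] for s in srcs]  # read mapping before updating
        mapping[dst] = free.pop(0)             # fresh register for each write
        renamed.append((mapping[dst], new_srcs))
    return renamed


# (destination, source registers) for I0..I5; constants are omitted.
program = [
    ("A", ["B"]),       # I0: A = B + 1
    ("C", ["A", "D"]),  # I1: C = A + D
    ("A", ["B"]),       # I2: A = B + 2
    ("D", ["A"]),       # I3: D = A - 1
    ("C", ["D"]),       # I4: C = D + 1
    ("C", []),          # I5: C = 1
]

renamed = rename(program)
for i, (dst, srcs) in enumerate(renamed):
    print(f"I{i}: {dst} <- {srcs}")
```

Because no destination register is ever reused, no two instructions write the same register and no instruction overwrites a register that an earlier one reads. Only the true dependencies I0 to I1, I2 to I3 and I3 to I4 remain.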