Due: Beginning of class on 03/23/15

CSCE 614 (Spring 2015)
Eun Jung Kim
Computer Architecture
Homework # 3
COVER SHEET
(Due: Beginning of class on 03/23/15)
Name :
ID Number :
Directions: Write your answers on the sheets provided and submit them with the COVER SHEET. If you
need additional sheets for any of the problems, add as many blank sheets as you require. Print your
name clearly. No late homework will be accepted. You are expected to write up your solutions
on your own, without referring to other students' work or to solutions you may find on the web.
The total score is 160 points. This homework is due at the beginning of class on Monday, Mar
23, 2015.
Dynamic Hardware Branch Prediction
1. Suppose the following branch instructions have been executed.
Label   Address    Branch   Taken/Not Taken
1       …101101    b1       T
2       …101101    b1       T
3       …101101    b1       NT
4       …110011    b2       NT
5       …110011    b2       T
Draw a (1, 2) predictor and indicate the state of the buffer (with 4 prediction entries) after executing
the above branch instructions. Also show the prediction for each branch instruction. Assume that
the predictor uses the 2-bit saturating counter implemented by default in SimpleScalar. (20)
Instruction   1    2    3    4    5
Prediction
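A hand-drawn (m, n) buffer can be cross-checked with a small simulator. This is a minimal sketch under assumptions the problem does not fix, so adjust them to your course's convention: each of the 2^m tables has `entries` rows selected by the low-order bits of the branch address (only the visible bits …101101 and …110011 are used), the m-bit global history starts as all not-taken, every counter starts at 0 (SimpleScalar's own initial values may differ), and an n-bit counter predicts taken when it is in the upper half of its range. The class `CorrelatingPredictor` and its method are hypothetical names.

```python
class CorrelatingPredictor:
    """Hypothetical (m, n) correlating predictor: 2**m tables of n-bit counters.

    Assumptions: rows are indexed by low-order address bits, the global
    history starts at 0 (all not taken), and every counter starts at `init`.
    """

    def __init__(self, m: int, n: int, entries: int, init: int = 0) -> None:
        self.m = m
        self.max_count = (1 << n) - 1          # e.g. 3 for a 2-bit counter
        self.entries = entries
        self.history = 0                        # last m outcomes, newest in bit 0
        # tables[h][row]: counter used when the global history equals h
        self.tables = [[init] * entries for _ in range(1 << m)]

    def predict_and_update(self, address: int, taken: bool) -> bool:
        row = address % self.entries            # low-order address bits
        table = self.tables[self.history]       # table chosen by global history
        counter = table[row]
        prediction = counter > self.max_count // 2   # upper half means taken
        # Saturating update toward the actual outcome.
        table[row] = min(counter + 1, self.max_count) if taken else max(counter - 1, 0)
        # Shift the actual outcome into the m-bit global history.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)
        return prediction


# Problem 1 trace: (visible low-order address bits, actual outcome)
trace = [(0b101101, True), (0b101101, True), (0b101101, False),
         (0b110011, False), (0b110011, True)]
p = CorrelatingPredictor(m=1, n=2, entries=4)
preds = [p.predict_and_update(addr, taken) for addr, taken in trace]
print("Predictions:", ["T" if x else "NT" for x in preds])
for h, table in enumerate(p.tables):
    print(f"history={h:0{p.m}b}: {table}")
```

Constructing the object with `m=2` and feeding it the Problem 3 trace gives the corresponding (2, 2) sketch under the same assumptions.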
2. Suppose we have a deeply pipelined processor for which we implement a branch-target buffer
for conditional branches only. Assume that the misprediction penalty is always 4 cycles and
the buffer-miss penalty is always 3 cycles. Assume a 90% hit rate, 95% prediction accuracy, and a 15%
conditional branch frequency. How much faster is the processor with the branch-target buffer
than a processor that has a fixed 2-cycle branch penalty? Assume a base CPI without branch
stalls of 1. (10)
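The CPI comparison reduces to a few lines of arithmetic. A sketch, assuming the usual accounting in which a BTB hit costs cycles only when the prediction is wrong and every BTB miss costs the full miss penalty; the function names are illustrative, not part of the problem.

```python
def btb_penalty_per_branch(hit_rate: float, accuracy: float,
                           mispredict_penalty: float, miss_penalty: float) -> float:
    """Average stall cycles per conditional branch with a BTB (assumed model)."""
    mispredicted_hits = hit_rate * (1.0 - accuracy) * mispredict_penalty
    misses = (1.0 - hit_rate) * miss_penalty
    return mispredicted_hits + misses


def cpi(base_cpi: float, branch_frequency: float, penalty_per_branch: float) -> float:
    """Base CPI plus the average branch stall cycles per instruction."""
    return base_cpi + branch_frequency * penalty_per_branch


penalty_btb = btb_penalty_per_branch(hit_rate=0.90, accuracy=0.95,
                                     mispredict_penalty=4, miss_penalty=3)
cpi_btb = cpi(1.0, 0.15, penalty_btb)
cpi_fixed = cpi(1.0, 0.15, 2)            # fixed 2-cycle branch penalty
speedup = cpi_fixed / cpi_btb
print(f"penalty/branch={penalty_btb:.3f}  CPI_BTB={cpi_btb:.3f}  "
      f"CPI_fixed={cpi_fixed:.3f}  speedup={speedup:.3f}")
```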
3. Suppose the following branch instructions have been executed.
Label   Address    Branch   Taken/Not Taken
1       …101101    b1       T
2       …101101    b1       NT
3       …101101    b1       T
4       …110011    b2       NT
5       …110011    b2       NT
a) Draw a (2, 2) predictor and indicate the state of the buffer (with 4 prediction entries per table)
after executing the above branch instructions. Also show the prediction for each branch instruction.
Assume that the predictor uses the 2-bit saturating counter implemented by default in SimpleScalar. (20)
Instruction   1    2    3    4    5
Prediction
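One way to read the (2, 2) layout, assuming the two most recent global outcomes h1h0 select one of four tables and the two low-order address bits a1a0 select the row (01 for b1 and 11 for b2, taken from the visible address bits). The storage count uses the standard formula 2^m × n × (prediction entries selected by the branch address):

$$
\text{prediction}(b) = \text{counter}_{h_1 h_0}\big[a_1 a_0\big], \qquad h_1 h_0 \in \{00, 01, 10, 11\}
$$

$$
\text{storage} = 2^{m} \times n \times \text{entries} = 2^{2} \times 2 \times 4 = 32 \text{ bits}
$$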
b) Show the prediction for each branch instruction using a tournament predictor with 2 entries.
Also show the final contents of the Predictor 1 and Predictor 2 buffers. Predictor 1 and Predictor
2 each consist of 2-bit saturating counters with 2 prediction entries. Note that Predictor 1 is a local predictor,
while Predictor 2 is a global predictor. Assume all table and buffer contents are initialized to ‘zero’. (20)
Instruction   1    2    3    4    5
Prediction
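A tournament predictor can likewise be traced in code. This is a minimal sketch; every indexing and update rule below is an assumption, since the problem leaves them open: Predictor 1 (local) is indexed by the low-order address bit, Predictor 2 (global) by the most recent global outcome, and a hypothetical 2-entry selector of 2-bit counters, indexed by address, picks Predictor 1 at values 0-1 and Predictor 2 at values 2-3, moving only when the two predictors disagree. Under this indexing both visible addresses end in 1, so b1 and b2 share an entry. The class name `TournamentPredictor` is illustrative.

```python
class TournamentPredictor:
    """Hypothetical 2-entry tournament predictor built from 2-bit counters."""

    def __init__(self, entries: int = 2, init: int = 0) -> None:
        self.entries = entries              # assumed to be a power of two
        self.local = [init] * entries       # Predictor 1: indexed by address
        self.glob = [init] * entries        # Predictor 2: indexed by global history
        self.chooser = [init] * entries     # 0-1 -> Predictor 1, 2-3 -> Predictor 2
        self.history = 0                    # recent global outcomes, kept < entries

    @staticmethod
    def _bump(counter: int, taken: bool) -> int:
        """2-bit saturating increment (taken) or decrement (not taken)."""
        return min(counter + 1, 3) if taken else max(counter - 1, 0)

    def predict_and_update(self, address: int, taken: bool) -> bool:
        a = address % self.entries
        g = self.history
        p1 = self.local[a] >= 2
        p2 = self.glob[g] >= 2
        prediction = p2 if self.chooser[a] >= 2 else p1
        if p1 != p2:
            # Move the selector toward whichever predictor was correct.
            self.chooser[a] = self._bump(self.chooser[a], p2 == taken)
        self.local[a] = self._bump(self.local[a], taken)
        self.glob[g] = self._bump(self.glob[g], taken)
        self.history = ((self.history << 1) | int(taken)) % self.entries
        return prediction


# Problem 3 trace: (visible low-order address bits, actual outcome)
trace = [(0b101101, True), (0b101101, False), (0b101101, True),
         (0b110011, False), (0b110011, False)]
t = TournamentPredictor()
preds = [t.predict_and_update(addr, taken) for addr, taken in trace]
print("Predictions:", ["T" if x else "NT" for x in preds])
print("Predictor 1:", t.local, "Predictor 2:", t.glob, "Selector:", t.chooser)
```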
c) Explain why a branch target buffer (BTB) reduces the CPI compared to a branch prediction
buffer. Explain why the BTB must include tags for the buffer entries while the branch prediction
buffer does not. (10)
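The role of the tag can be illustrated with a toy model. A minimal sketch, assuming word-aligned PCs, direct-mapped 4-entry structures indexed by (PC >> 2) mod 4, and hypothetical class names: the untagged buffer hands back a hint bit for any instruction that maps to an entry, even a non-branch, which is tolerable because the hint is only acted on once the instruction is known to be a branch. The BTB supplies a next-fetch target during IF, before decode, so it must confirm through the tag that the entry belongs to this exact PC.

```python
from typing import List, Optional, Tuple


class UntaggedPredictionBuffer:
    """Branch prediction buffer: one hint bit per entry, no tag."""

    def __init__(self, entries: int) -> None:
        self.bits: List[bool] = [False] * entries

    def _index(self, pc: int) -> int:
        return (pc >> 2) % len(self.bits)

    def predict(self, pc: int) -> bool:
        return self.bits[self._index(pc)]

    def update(self, pc: int, taken: bool) -> None:
        self.bits[self._index(pc)] = taken


class BranchTargetBuffer:
    """Direct-mapped BTB: each entry stores (tag, predicted target)."""

    def __init__(self, entries: int) -> None:
        self.entries = entries
        self.slots: List[Optional[Tuple[int, int]]] = [None] * entries

    def lookup(self, pc: int) -> Optional[int]:
        word = pc >> 2
        slot = self.slots[word % self.entries]
        if slot is not None and slot[0] == word // self.entries:
            return slot[1]            # tag match: fetch from the stored target
        return None                   # miss: fetch falls through to pc + 4

    def insert(self, pc: int, target: int) -> None:
        word = pc >> 2
        self.slots[word % self.entries] = (word // self.entries, target)


# A branch at 0x40 and a non-branch at 0x50 map to the same entry.
pbuf = UntaggedPredictionBuffer(4)
btb = BranchTargetBuffer(4)
pbuf.update(0x40, True)
btb.insert(0x40, 0x100)
print(pbuf.predict(0x50))                   # aliased hint from the branch at 0x40
print(btb.lookup(0x40), btb.lookup(0x50))   # tag check rejects 0x50
```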
4. With a MIPS pipeline architecture, we can choose among four branch-scheduling schemes, listed below.
Assume a 2 GHz machine for which the following measurements have been made, and assume
that the CPI of every non-branch instruction is 1. What is the MIPS rate for each scheme? Assume 5%
unconditional branches, 10% untaken conditional branches, and 7% taken conditional branches. Use the
following performance penalty table. (10)
Scheduling scheme    Penalty (clock cycles)
Stall pipeline       4
Predict taken        1
Predict not taken    1
Delayed branch       0.5
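The MIPS rate follows from CPI = 1 + (branch fraction × penalty) and MIPS = clock rate / (CPI × 10^6). The sketch below assumes, as one reading of the table, that every branch (22% of instructions) pays its scheme's listed penalty; if you instead charge a scheme only on some branch classes (for example, predict-not-taken only on taken branches), adjust the fraction passed for that scheme. The names are illustrative.

```python
CLOCK_HZ = 2e9
# Instruction mix from the problem statement (fractions of all instructions).
BRANCH_MIX = {"unconditional": 0.05, "conditional untaken": 0.10,
              "conditional taken": 0.07}
# Penalty in clock cycles per branch for each scheme.
PENALTY = {"Stall pipeline": 4, "Predict taken": 1,
           "Predict not taken": 1, "Delayed branch": 0.5}


def mips_rate(clock_hz: float, branch_fraction: float, penalty: float) -> float:
    """MIPS = clock rate / (CPI * 10^6), with non-branch CPI = 1 (assumed model)."""
    cpi = 1.0 + branch_fraction * penalty
    return clock_hz / (cpi * 1e6)


branch_fraction = sum(BRANCH_MIX.values())   # 0.22 under the stated mix
rates = {scheme: mips_rate(CLOCK_HZ, branch_fraction, p)
         for scheme, p in PENALTY.items()}
for scheme, rate in rates.items():
    print(f"{scheme:<18} {rate:8.1f} MIPS")
```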
Dynamic Pipelining and Hardware Speculation
5. Assume a floating-point unit with 2 add, 2 multiply/divide, and 2 load/store units,
with execution latencies of 2 clock cycles for add, 10 for multiply, 40 for divide, and 3 for
load/store (1 for address calculation, 2 for memory access), plus an integer unit for ALU operations,
another unit for address calculation, and a third unit for branch condition evaluation. For the code
sequence below, answer the following questions. Note that the number of reservation stations equals
the number of functional units and that the processor is single-issue.
1. LD     F6,34(R2)
2. LD     F2,45(R3)
3. MULTD  F0,F2,F4
4. SUBD   F8,F6,F2
5. DIVD   F10,F0,F6
6. ADDD   F6,F8,F2
a. Identify all the data hazards in the above code fragment, along with the type of each hazard
identified. You can mark them appropriately on the code fragment and use acronyms to specify
the hazard type. (10)
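Hazard candidates can be listed mechanically by comparing each instruction's destination and source registers with those of every earlier instruction. A sketch, assuming the registers are transcribed by hand as below and that a dependence is reported only when no instruction in between rewrites the same register; whether a WAR or WAW pair actually stalls depends on the pipeline (Tomasulo's register renaming removes both). The names `Instr` and `find_hazards` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    text: str
    dest: str
    srcs: Tuple[str, ...]


CODE = [
    Instr("LD F6,34(R2)", "F6", ("R2",)),
    Instr("LD F2,45(R3)", "F2", ("R3",)),
    Instr("MULTD F0,F2,F4", "F0", ("F2", "F4")),
    Instr("SUBD F8,F6,F2", "F8", ("F6", "F2")),
    Instr("DIVD F10,F0,F6", "F10", ("F0", "F6")),
    Instr("ADDD F6,F8,F2", "F6", ("F8", "F2")),
]


def overwritten_between(code: List[Instr], reg: str, i: int, j: int) -> bool:
    """True if some instruction strictly between i and j writes `reg`."""
    return any(code[k].dest == reg for k in range(i + 1, j))


def find_hazards(code: List[Instr]) -> List[Tuple[str, int, int, str]]:
    """Return (type, earlier #, later #, register), numbering from 1."""
    hazards = []
    for j, later in enumerate(code):
        for i in range(j):
            earlier = code[i]
            # RAW: later reads what earlier wrote.
            if earlier.dest in later.srcs and not overwritten_between(code, earlier.dest, i, j):
                hazards.append(("RAW", i + 1, j + 1, earlier.dest))
            # WAR: later writes what earlier read.
            if later.dest in earlier.srcs and not overwritten_between(code, later.dest, i, j):
                hazards.append(("WAR", i + 1, j + 1, later.dest))
            # WAW: both write the same register.
            if later.dest == earlier.dest and not overwritten_between(code, later.dest, i, j):
                hazards.append(("WAW", i + 1, j + 1, later.dest))
    return hazards


for kind, a, b, reg in find_hazards(CODE):
    print(f"{kind}: {a} -> {b} on {reg}")
```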
b. For the above code sequence, show the status tables after all instructions have completed under
single-issue Tomasulo's algorithm. For the instruction status table, list the clock cycle in which
each event happens. (20)
Instruction Status
Instruction        Issue   Memory Access   Execute   Write Result
LD    F6,34(R2)
LD    F2,45(R3)
MULTD F0,F2,F4
SUBD  F8,F6,F2
DIVD  F10,F0,F6
ADDD  F6,F8,F2
Reservation Stations
Name       Busy   Op   Vj   Vk   Qj   Qk   A
ADD1
ADD2
MUL/DIV1
MUL/DIV2
LD/STR1
LD/STR2
Register Status
Field   F0   F2   F4   F6   F8   F10   F12   …   F30
Qi
6. Consider the execution of a loop on a two-issue processor. Assume there are two functional
units: one for effective address calculation and integer ALU operations, and the other for branch
condition evaluation. Also assume that there is 1 CDB and that up to two instructions of any type
can commit per clock cycle. Assume that branches are single-issue but that branch prediction is perfect.
(Latencies: integer ALU operation 1 cycle, load and store (memory access only) 2 cycles, FP ALU operation 3 cycles.)
a. Fill out the time table of a pipeline with Dynamic Scheduling. (20)
Instruction            Issue   Execute   Memory Access   Write CDB
L.D    F0, 0(R1)
ADD.D  F4, F0, F2
S.D    F4, 0(R1)
DADDIU R1, R1, #-8
BNE    R1, R2, LOOP
L.D    F0, 0(R1)
ADD.D  F4, F0, F2
S.D    F4, 0(R1)
DADDIU R1, R1, #-8
BNE    R1, R2, LOOP
b. Fill out the time table of a pipeline with Hardware Speculation. (20)
Instruction            Issue   Execute   Memory Access   Write CDB   Commit
L.D    F0, 0(R1)
ADD.D  F4, F0, F2
S.D    F4, 0(R1)
DADDIU R1, R1, #-8
BNE    R1, R2, LOOP
L.D    F0, 0(R1)
ADD.D  F4, F0, F2
S.D    F4, 0(R1)
DADDIU R1, R1, #-8
BNE    R1, R2, LOOP