CS252 QUIZ #2: 4/18/01
D. A. Patterson

Last Name _______________________   First Name _____________________

Question   Name                    Time (minutes)   Max Points   Your Points
1          David vs. Goliath       30               14
2          That’s Out of Order!    30               16
3          Who Needs Compilers?    50               18
TOTAL                              110              48
CS252 - Quiz #2, Spring 2001
Your last name: ______________________
Question #1: David vs. Goliath (14 points) [30 minutes]
The Intel Pentium III and the Transmeta Crusoe both translate 80x86 instructions into a different
instruction set for execution.
a) (4 points) List the following characteristics of each of the internal instruction sets:

                                         Pentium III          Transmeta Crusoe
Registers (approximate number, size)
Instruction (approximate size, style)

b) (2 points) What is the role of interpretation in each machine?
Pentium III:
Transmeta Crusoe:
c) (2 points) What are the methods of translation in each machine?
Pentium III:
Transmeta Crusoe:
d) (2 points) In addition to performance and cost, an increasingly important consideration is
power. What is the impact on power of each approach? Why?
Pentium III:
Transmeta Crusoe:
e) (2 points) Which is a better match to multithreading? Why?
Expected:
Pentium III:
Transmeta Crusoe:
Question #2: That’s Out of Order! (16 points) [30 minutes]
Using the MIPS code shown below, show the state of the Reservation stations, Reorder buffers,
and floating point (FP) register status for a speculative processor implementing Tomasulo’s
algorithm. Assume the following:

- Only one instruction can issue per cycle.
- The reorder buffer has 8 slots.
- The reorder buffer implements the functionality of the load buffers and store buffers.
- All function units are fully pipelined.
- There are 2 floating point multiply reservation stations.
- There are 3 floating point add reservation stations.
- There are 3 integer reservation stations, which also execute load and store instructions.
- No exceptions occur during the execution of this code.
- All integer operations require 1 execution cycle. Memory requests occur and complete in this cycle.
- All FP multiply operations require 4 execution cycles.
- All FP addition operations require 2 execution cycles.
- On a common data bus write conflict, the instruction issued earlier gets priority.
- Execution for a dependent instruction can begin on the cycle after its operand is broadcast on the common data bus.
- If any item changes from “Busy” to “Not Busy”, you should update the “Busy” column to reflect this, but you should not erase any other information in the row (unless another instruction then overwrites that information).
- Assume that all reservation stations, reorder buffers, and functional units were empty and not busy when the code shown below began execution.
- The “Value” column gets updated when the value is broadcast on the common data bus.
- Integer registers are not shown, and you do not have to show their state.
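
The two common data bus rules in the list above (the earlier-issued instruction wins a write conflict, and a dependent instruction may start executing the cycle after the broadcast) can be modeled with a short Python sketch. The class and function names are hypothetical, the two example instructions are made up rather than taken from the quiz code, and the sketch assumes a result can be broadcast no earlier than the cycle after its last execution cycle, which the quiz does not state.

```python
from dataclasses import dataclass


@dataclass
class Finished:
    name: str         # label for the instruction (hypothetical examples only)
    issue_order: int  # smaller value means issued earlier
    done_cycle: int   # last execution cycle of the instruction


def schedule_cdb(finished):
    """Return {name: broadcast cycle}, allowing one CDB broadcast per cycle.

    Assumption (not stated in the quiz): a result can be broadcast no
    earlier than the cycle after its last execution cycle.
    """
    waiting = sorted(finished, key=lambda f: f.issue_order)
    broadcast = {}
    cycle = 0
    while waiting:
        ready = [f for f in waiting if f.done_cycle < cycle]
        if ready:
            winner = ready[0]  # earliest-issued instruction wins the bus
            broadcast[winner.name] = cycle
            waiting.remove(winner)
        cycle += 1
    return broadcast


def earliest_dependent_start(broadcast_cycle):
    """A dependent instruction may begin executing one cycle after the broadcast."""
    return broadcast_cycle + 1


if __name__ == "__main__":
    # Two made-up instructions that finish executing in the same cycle.
    print(schedule_cdb([Finished("younger", 5, 10), Finished("older", 4, 10)]))
```

When both results are ready in the same cycle, the later-issued instruction simply waits one more cycle for the bus, which in turn delays anything that depends on it.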
For parts a) and b), fill in the new entry only when the entry value changes; leaving the
new column blank means it’s unchanged. Use a dash to indicate that the new entry value is empty.
For the instruction column in the reorder buffer, use the empty entry for any new instruction.
a) (7 points) Assume the tables below show the old state at the end of the cycle in which ADDI
from the code below is issued. Modify the tables to show the new state at the end of the next
clock cycle. Assume the execute states for the floating point instructions MULT.D F3, F1, F11 and
MULT.D F4, F1, F10 are at the first and second cycle of the execution stage, respectively.
(In case you mess up this version, there is an extra copy on the next page.)
L.D     F0, 0(R1)
MULT.D  F2, F0, F12
ADD.D   F0, F2, F1
MULT.D  F3, F1, F11
MULT.D  F4, F1, F10
ADDI    R3, R3, 1
SUBI    R1, R1, 8
Reservation stations
Name    Busy        Op              Vj          Vk          Qj          Qk          ROB dest
        old   new   old      new    old   new   old   new   old   new   old   new   old   new
Add1    Y           ADD.D           F2          F1                                  #3
Add2
Add3
Mult1   Y           MULT.D          F1          F10                                 #5
Mult2   Y           MULT.D          F1          F11                                 #4
Int1    N           L.D             F0          R1                                  #1
Int2    Y           ADDI            R3          1                                   #6
Int3

Reorder buffer
Entry   Busy        Instruction           State           Destination   Value
        old   new                         old       new   old   new     old          new
1       N           L.D F0, 0(R1)         Commit          F0            Mem[0(R1)]
2       N           MULT.D F2, F0, F12    Commit          F2            F0*F12
3       Y           ADD.D F0, F2, F1      Write           F0            F2+F1
4       Y           MULT.D F3, F1, F11    Execute         F3
5       Y           MULT.D F4, F1, F10    Execute         F4
6       Y           ADDI R3, R3, 1        Issue           R3
7
8

FP register status
Field       F0          F1          F2          F3          F4          F5          F6          …
            old   new   old   new   old   new   old   new   old   new   old   new   old   new   old   new
Reorder #   3                       2           4           5
Busy        Y           N           N           Y           Y           N           N           N
b) (9 points) The tables below, for a different program, show the state at the end of the cycle in
which the S.D from the code below is issued. Modify the tables to show the state at the end of
the next three clock cycles. Assume the execute states for the floating point instructions
MULT.D F2, F1, F11 and MULT.D F0, F0, F10 are at the end of the fourth and third cycles of
the execution stage, respectively. (There is an extra copy on the next page.)
L.D     F0, 0(R1)
MULT.D  F2, F1, F12
ADD.D   F0, F2, F0
MULT.D  F2, F1, F11
MULT.D  F0, F0, F10
ADD.D   F0, F0, F2
S.D     F0, 0(R1)
ADDI    R1, R1, 8
SUBI    R2, R2, 1
Name
Busy
old
Add1
Add2
Add3
Mult1
Mult2
Int1
Int2
Int3
N
Y
old
ADD.D
ADD.D
Y
Y
N
Y
MULT.D
MULT.D
L.D
S.D
Busy
old
1
2
3
4
5
6
7
8
1
5
Y
old
N
new
new
old
Qk
new
old
Reorder buffer
State
old
Commit
Commit
Commit
Execute
Execute
Issue
Issue
new
4
Y
new
old
N
new
#5
#4
#1
#7
Destination
old
new
F0
F2
F0
F2
F0
F0
F0
old
N
ROB dest
old new
#3
#6
#4
F10
F11
0
0
FP register status
F2
F3
F10
old
new
F0
F0
F1
R1
R1
F1
old
old
Qj
#5
L.D F0, 0(R1)
MULT.D F2, F1, F12
ADD.D F0, F2, F0
MULT.D F2, F1, F11
MULT.D F0, F0, F10
ADD.D F0, F0, F2
S.D F0, 0(R1)
new
new
F2
Instruction
F0
old
new
new
N
N
N
Y
Y
Y
Y
Field
Reorder #
Busy
Op
new
Entry
Reservation stations
Vj
Vk
new
Value
old
Mem[0(R1)]
F1*F12
F2+F0
F11
old
N
new
new
…
F12
old
N
new
old
N
new
Question #3: Who Needs Compilers? (18 points) [50 minutes]
In the following problem, use a simple pipelined RISC architecture with a single branch delay
cycle. The architecture has pipelined functional units with the following execution cycles:
1. Floating point op: 3 cycles (7 stages total)
2. Integer op: 1 cycle (5 stages total)
The following table shows the minimum number of intervening cycles between the producer and
consumer instructions to avoid stalls. Assume 0 intervening cycles for combinations not listed.
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          2
FP ALU op                      Store and move double      2
Load double                    FP ALU op                  1
Load double                    Store double               0
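
The latency table above can be read as a lookup from a (producer, consumer) pair to the number of intervening cycles needed to avoid a stall. The Python sketch below encodes it that way; the category names and the `stalls` helper are hypothetical labels chosen for illustration, and unlisted combinations default to 0 as the problem statement says.

```python
# Minimum intervening cycles between producer and consumer, from the table.
# The category strings are illustrative labels, not quiz terminology.
LATENCY = {
    ("fp_alu", "fp_alu"): 2,
    ("fp_alu", "store_double"): 2,
    ("fp_alu", "move_double"): 2,
    ("load_double", "fp_alu"): 1,
    ("load_double", "store_double"): 0,
}


def stalls(producer, consumer, intervening):
    """Stall cycles when `intervening` cycles separate a producer from its consumer."""
    required = LATENCY.get((producer, consumer), 0)  # unlisted pairs need 0
    return max(0, required - intervening)


if __name__ == "__main__":
    # An FP ALU op immediately followed by a dependent FP ALU op.
    print(stalls("fp_alu", "fp_alu", 0))
    # A load followed, one instruction later, by a dependent FP ALU op.
    print(stalls("load_double", "fp_alu", 1))
```

Summing `stalls` over each dependent pair in a straight-line instruction sequence, plus one cycle per instruction issued, gives the cycle count for that sequence under these assumptions.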
The following code computes a 3-tap filter. R1 contains the address of the next input to the
filter, and the output overwrites the input for the iteration. R2 contains the loop counter. The
tap values are contained in F10, F11, and F12.
LOOP:  L.D     F0, 0(R1)      #load the filter input for the iteration
       MULT.D  F2, F1, F12    #multiply elements
       ADD.D   F0, F2, F0     #add elements
       MULT.D  F2, F1, F11
       MOV.D   F1, F0         #move value in F0 to F1
       MULT.D  F0, F0, F10
       ADD.D   F0, F0, F2
       S.D     F0, 0(R1)      #store the result
       ADDI    R1, R1, 8      #increment pointer, 8 bytes per DW
       BNEZ    R2, LOOP       #continue till all inputs are processed
       SUBI    R2, R2, 1      #decrement element count
a) (4 points) How many cycles does the current code take for each iteration?
__________ cycles
b) (4 points) Rearrange the code without unrolling to achieve 2 fewer cycles per iteration. You can
reorder and drop any line of code, but do not change any line of code. To save writing, just
draw arrows in the copy of the code below to show any code movement. Show the
execution clock cycle number next to each code line. Assume initialization can be adjusted.
_________ cycles
c) (2 points) Can the original code be optimized with loop unrolling and software pipelining to
avoid stalls in the loop due to data dependencies? Why or why not?
Suppose the original code is modified to the following; the MOV.D instruction was removed.
LOOP:  L.D     F0, 0(R1)      #load the filter input for the iteration
       MULT.D  F2, F1, F12    #multiply elements
       ADD.D   F0, F2, F0     #add elements
       MULT.D  F2, F1, F11
       MULT.D  F0, F0, F10
       ADD.D   F0, F0, F2
       S.D     F0, 0(R1)      #store the result
       ADDI    R1, R1, 8      #increment pointer, 8 bytes per DW
       BNEZ    R2, LOOP       #continue till all inputs are processed
       SUBI    R2, R2, 1      #decrement element count
d) (2 points) Unroll the original loop twice (so that it contains 3 iterations) and schedule it to
avoid stalls. Assume the second iteration has F0 renamed to F3, F1 renamed to F4, and F2 renamed
to F5. Assume the third iteration has F0 renamed to F6, F1 renamed to F7, and F2 renamed to F8.
Write the code on the next page. Write the reference number when writing any instruction listed
below. If you need to use any instruction not listed below, write out the instruction explicitly.
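
The register renaming that part d) assumes for the second and third iterations can be expressed as a simple text substitution. This Python sketch shows only the renaming step; the `rename` helper and the two mapping names are hypothetical, and adjusting memory offsets, merging loop overhead, and scheduling are deliberately left out.

```python
import re

# Renaming stated in part d); tap registers F10-F12 are left alone.
SECOND_ITERATION = {"F0": "F3", "F1": "F4", "F2": "F5"}
THIRD_ITERATION = {"F0": "F6", "F1": "F7", "F2": "F8"}


def rename(instruction, mapping):
    """Replace whole FP register names in `instruction` according to `mapping`."""
    return re.sub(
        r"\bF\d+\b",
        lambda m: mapping.get(m.group(0), m.group(0)),  # unmapped names pass through
        instruction,
    )


if __name__ == "__main__":
    print(rename("ADD.D F0, F2, F0", SECOND_ITERATION))
    print(rename("MULT.D F2, F1, F12", THIRD_ITERATION))
```

Matching whole register tokens matters here: a naive string replace of F1 would also corrupt F10, F11, and F12.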
d) (continued) To save writing, just write the instruction number in the table below if the
instruction can be used as is from the prior page. If it’s not there, write out the new instruction.
If you need fewer instructions than the table has space for, just leave the rest blank.

Number (if instruction unchanged)      Instruction (if not on prior page)
e) (2 points) What is the effective number of cycles per iteration for the unrolled loop, where
an iteration refers to one iteration of the original code?
____________ cycles
f) (4 points) For DSP processors, special instructions are provided to speed up DSP applications
such as an n-tap filter. Suppose the following instructions are provided in addition. How can
one use them to speed up the original 3-tap filter code at the beginning of the question? Write
out the new code below, starting with the version in part a). How many cycles do the DSP
instructions save? How does this compare to your answer to part b)?
LP RX, LABEL
Zero-overhead loop that repeats the code segment the number of times specified in
register RX. This eliminates the branch delay.
LT FX, RY
Auto-increment load. Load MEM(RY) into register FX, then increment the base register RY
to the next element.
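
As a rough model of what the two DSP instructions do, the Python sketch below treats registers and memory as dictionaries. All function names and the sample data are hypothetical. It assumes an 8-byte element (matching the "8 bytes per DW" comment in the filter code) and assumes LP leaves RX unchanged, which the description above does not specify. It models semantics only, not timing.

```python
ELEMENT_BYTES = 8  # assumption: one double word per element


def lt(regs, mem, fx, ry):
    """LT FX, RY: load MEM(RY) into FX, then advance RY to the next element."""
    regs[fx] = mem[regs[ry]]
    regs[ry] += ELEMENT_BYTES


def lp(regs, rx, body):
    """LP RX, LABEL: run the loop body regs[rx] times with no branch instruction.

    Assumption: RX itself is left unchanged; the quiz does not say.
    """
    for _ in range(regs[rx]):
        body()


def run_demo():
    mem = {0: 1.5, 8: 2.5, 16: 3.5}  # hypothetical input samples
    regs = {"R1": 0, "R2": 3, "F0": 0.0}
    seen = []

    def body():
        lt(regs, mem, "F0", "R1")  # load one sample and bump the pointer
        seen.append(regs["F0"])

    lp(regs, "R2", body)
    return seen, regs["R1"]


if __name__ == "__main__":
    print(run_demo())
```

In this model LT folds a load and a pointer increment into one step, and LP replaces the explicit counter update and branch, which are the parts of the loop that do no filtering work.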