Instruction-Level Parallelism ELEC 5200-001/6200-001 Computer Architecture and Design Spring 2016

ELEC 5200-001/6200-001
Computer Architecture and Design
Spring 2016
Instruction-Level Parallelism
Vishwani D. Agrawal
James J. Danaher Professor
Department of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
http://www.eng.auburn.edu/~vagrawal
vagrawal@eng.auburn.edu
Spr 2016, Apr 20 . . .
ELEC 5200-001/6200-001 Lecture 12
1
A Computer System
[Figure: block diagram of a computer system. Labels: Interrupts, Processor, Cache, Memory – I/O bus, Main memory, Physical memory, Virtual memory, I/O controllers, Disks, Graphics output, Network.]
Advanced Architectures – ILP
Instruction level parallelism (ILP): multiple
instructions fetched and executed simultaneously.
ILP is used in addition to pipelining.
Processors with ILP are called multiple-issue
processors – multiple instructions launched in 1
clock cycle. Two ways:
– MIMD: Multiple Instructions Multiple Data
Superpipeline
Superscalar – dynamic multiple issue
Very long instruction word (VLIW) – static multiple issue
– SIMD: Single Instruction Multiple Data
Vector processor
Superpipeline and Superscalar
[Figure: timing diagrams of the five pipeline stages (IF, ID, EX, MEM, WB) plotted against system clock cycles 0 to 8 for three organizations:
Pipeline: 1 instruction/cycle
Superpipeline (pipeline clock is twice as fast as the system clock): 2 instructions per cycle
Superscalar: 2 (or more) instructions/cycle]
A Static Two-Issue MIPS Pipeline
Read two instructions per cycle:
An ALU or branch instruction, and
A load or store instruction
Insert one nop if above pair is not available
Added hardware (Figure 4.69, page 336):
A second instruction memory
Additional input/output ports in register file
Additional ALU in execute stage for address
calculation
An Example (Page 337)
Loop:   lw   $t0, 0($s1)
        addu $t0, $t0, $s2
        sw   $t0, 0($s1)
        addi $s1, $s1, -4
        bne  $s1, $0, Loop
Static Two-Issue Execution
        ALU or branch instruction    Data transfer instruction    Clock cycle
Loop:   nop                          lw   $t0, 0($s1)             1
        addi $s1, $s1, -4            nop                          2
        addu $t0, $t0, $s2           nop                          3
        bne  $s1, $0, Loop           sw   $t0, 4($s1)             4
Note code reordering and change in sw argument.
CPI = 4/5 = 0.8 > 0.5 (ideal)
Loop Unrolling (Index Multiple of 4)
        ALU or branch instruction    Data transfer instruction    Clock cycle
Loop:   addi $s1, $s1, -16           lw   $t0, 0($s1)             1
        nop                          lw   $t1, 12($s1)            2
        addu $t0, $t0, $s2           lw   $t2, 8($s1)             3
        addu $t1, $t1, $s2           lw   $t3, 4($s1)             4
        addu $t2, $t2, $s2           sw   $t0, 16($s1)            5
        addu $t3, $t3, $s2           sw   $t1, 12($s1)            6
        nop                          sw   $t2, 8($s1)             7
        bne  $s1, $0, Loop           sw   $t3, 4($s1)             8
CPI = 8/14 = 0.57 > 0.5 (ideal)
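The CPI figures quoted for the two-issue schedules can be checked with a few lines of Python. This is only an arithmetic sketch: the helper name `cpi` is an assumption made for illustration, and nop slots are not counted as instructions, which matches the 4/5 and 8/14 ratios on the slides.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles per useful instruction (nop slots are not counted)."""
    return cycles / instructions


# Original loop: 5 instructions scheduled in 4 two-issue cycles.
two_issue = cpi(4, 5)
# Loop unrolled four times: 14 instructions scheduled in 8 cycles.
unrolled = cpi(8, 14)
# A two-issue machine with both slots always full.
ideal = 1 / 2

print(f"two-issue: {two_issue:.2f}, unrolled: {unrolled:.2f}, ideal: {ideal:.2f}")
# prints: two-issue: 0.80, unrolled: 0.57, ideal: 0.50
```

Unrolling spreads the loop overhead (addi, bne) over four element updates, which is why the CPI moves closer to the ideal value.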
VLIW: Very Long Instruction Word
Static multiple issue, ILP determined by compiler.
Datapath contains multiple execution units.
Compiler groups instructions that have no data or resource
conflicts for parallel execution.
Grouped instructions are packed in very long words of a
wide instruction memory.
Speedup benefit of VLIW is highly program dependent.
J. A. Fisher, “Very Long Instruction Word Architecture and
ELI-512,” Proc. 10th Symp. on Computer Architecture,
Stockholm, June 1983, pp. 478-490.
J. A. Fisher, P. Faraboschi and C. Young, Embedded
Computing: A VLIW Approach to Architecture, Compilers
and Tools, Morgan Kaufmann.
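The grouping step can be illustrated with a small Python sketch. It is not the algorithm of any particular VLIW compiler: it assumes a hypothetical two-slot machine (one "alu" slot, one "mem" slot), packs instructions greedily in program order, and conservatively treats RAW, WAR and WAW between members of a bundle as conflicts. The names `Op` and `pack_bundles` and the sample instructions are invented for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class Op(NamedTuple):
    slot: str               # "alu" or "mem" (assumed slot types)
    dest: Optional[str]     # destination register, None for a store
    srcs: Tuple[str, ...]   # source registers


def conflicts(earlier: Op, later: Op) -> bool:
    """True if the two ops have a data or resource conflict."""
    raw = earlier.dest is not None and earlier.dest in later.srcs
    war = later.dest is not None and later.dest in earlier.srcs
    waw = later.dest is not None and later.dest == earlier.dest
    return raw or war or waw or earlier.slot == later.slot


def pack_bundles(program: List[Op]) -> List[List[int]]:
    """Greedy in-order packing; returns instruction indices per bundle."""
    bundles: List[List[int]] = []
    current: List[int] = []
    for i, op in enumerate(program):
        if any(conflicts(program[j], op) for j in current):
            bundles.append(current)   # close the bundle, start a new word
            current = []
        current.append(i)
    if current:
        bundles.append(current)
    return bundles


program = [
    Op("alu", "t0", ("s1", "s2")),   # t0 = s1 + s2
    Op("mem", "t1", ("s3",)),        # t1 = load [s3]
    Op("alu", "t2", ("t0",)),        # t2 = f(t0)
    Op("mem", None, ("t2", "s3")),   # store t2 to [s3]
]
print(pack_bundles(program))  # prints: [[0, 1], [2], [3]]
```

Only the first two instructions share a long word here; the last two are serialized by a resource conflict and a RAW dependence, which is one way to see why the speedup is so program dependent.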
Superscalar: Dynamic Scheduling
and Out-of-Order Execution
[Figure: an instruction fetch and decode unit feeds four reservation stations, one for each functional unit (integer, integer, floating point, load/store); the functional units feed a commit unit. Stage labels: In-order issue, Out-of-order issue, Out-of-order execution, In-order commit.]
Out of Order Execution (OOE)
A procedural programming language
sequences instructions.
Sequencing assumes an order of
execution – no parallelism.
OOE must preserve correctness of result.
Principle: Two instructions can be
executed in parallel if they do not have
dependences.
RAW Dependence
Read after write (RAW): A dependent
instruction reads from a register being
written to by another instruction.
Example:
add $s1, $s2, $s3
sub $s2, $s1, $s3
sub has RAW dependence on add
WAR Dependence
Write after read (WAR): A dependent
instruction writes to a register being read
by another instruction.
Example:
add $s1, $s2, $s3
sub $s2, $s1, $s3
sub has WAR dependence on add
WAW Dependence
Write after write (WAW): One instruction
writes to a register being written to by
another instruction.
Example:
add $s2, $s2, $s3
sub $s2, $s1, $s3
sub has WAW dependence on add
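The three dependence types can be checked mechanically by comparing the registers each instruction reads and writes. The following Python sketch assumes three-operand register instructions as in the slide examples; the names `Instr` and `dependences` are invented for illustration.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read


def dependences(first: Instr, second: Instr) -> Set[str]:
    """Dependences of `second` on `first`, where `first` is earlier in program order."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")    # second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")    # second writes what first reads
    if second.dest == first.dest:
        found.add("WAW")    # both write the same register
    return found


# RAW / WAR slide example: add $s1, $s2, $s3 ; sub $s2, $s1, $s3
add1 = Instr("add", "$s1", ("$s2", "$s3"))
sub1 = Instr("sub", "$s2", ("$s1", "$s3"))
print(sorted(dependences(add1, sub1)))  # prints: ['RAW', 'WAR']

# WAW slide example: add $s2, $s2, $s3 ; sub $s2, $s1, $s3
add2 = Instr("add", "$s2", ("$s2", "$s3"))
sub2 = Instr("sub", "$s2", ("$s1", "$s3"))
print(sorted(dependences(add2, sub2)))  # prints: ['WAR', 'WAW']
```

Note that one pair of instructions can carry more than one dependence: the first pair has both RAW (on $s1) and WAR (on $s2), and the WAW example also has a WAR dependence because add reads $s2.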
Superscalar Instruction Issue
Rules:
RAW dependence – If any operand is being written, do
not issue.
WAR dependence – If the result register is being read,
do not issue.
WAW dependence – If the result register is being
written, do not issue.
Scoreboard:
Cycle by cycle record of registers and execution units
showing how many instructions are using them.
Example 1: In-order issue (next 2 slides).
Example 2: Out-of-order issue (3rd slide).
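The issue rules and the scoreboard counters can be sketched in Python as follows. This is a minimal model under stated assumptions: eight registers, one counter per register for pending reads and one for pending writes, and no modeling of execution latency or of execution units. The class name `Scoreboard` and its methods are invented for illustration.

```python
from typing import List, Tuple


class Scoreboard:
    """Counts of in-flight instructions reading and writing each register."""

    def __init__(self, num_regs: int = 8) -> None:
        self.reads: List[int] = [0] * num_regs
        self.writes: List[int] = [0] * num_regs

    def can_issue(self, dest: int, srcs: Tuple[int, ...]) -> bool:
        raw = any(self.writes[s] > 0 for s in srcs)  # an operand is being written
        war = self.reads[dest] > 0                   # result register is being read
        waw = self.writes[dest] > 0                  # result register is being written
        return not (raw or war or waw)

    def issue(self, dest: int, srcs: Tuple[int, ...]) -> None:
        if not self.can_issue(dest, srcs):
            raise ValueError("instruction must wait")
        for s in srcs:
            self.reads[s] += 1
        self.writes[dest] += 1

    def retire(self, dest: int, srcs: Tuple[int, ...]) -> None:
        for s in srcs:
            self.reads[s] -= 1
        self.writes[dest] -= 1


sb = Scoreboard()
sb.issue(3, (0, 1))                 # Inst 1: R3 = R0 * R1
sb.issue(4, (0, 2))                 # Inst 2: R4 = R0 + R2
sb.issue(5, (0, 1))                 # Inst 3: R5 = R0 + R1
print(sb.can_issue(6, (1, 4)))      # Inst 4: R6 = R1 + R4 -> False (RAW on R4)
sb.retire(4, (0, 2))                # Inst 2 retires
print(sb.can_issue(6, (1, 4)))      # -> True
```

The retirement order used here is chosen only to exercise the rules; it does not reproduce the cycle-by-cycle timing of the scoreboard examples.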
Dynamic Scheduling
Consider an example:
First with in-order issue
Then with out-of-order issue
Assume:
Up to two instructions are fetched in a cycle
Instruction register can hold two instructions
An instruction is issued in its decode cycle, or must wait
until it has no RAW, WAR or WAW dependence
An instruction can retire two or three cycles after it is
issued
In-order Issue scoreboard
[Table: one row per clock cycle. Columns: Clock cycle, Inst #, Decoded instruction, Issue Inst#, Retire Inst#, Reg. to read (count per register 0 to 7), Reg. to write (count per register 0 to 7).]
Instructions decoded:
1  R3 = R0 * R1
2  R4 = R0 + R2
3  R5 = R0 + R1
4  R6 = R1 + R4
5  R7 = R1 * R2
6  R1 = R0 – R2
7  R3 = R3 * R1
In-order Issue scoreboard (Continued)
[Table continues for clock cycles 10 to 18 with the same columns: Clock cycle, Inst #, Decoded instruction, Issue Inst#, Retire Inst#, Reg. to read (0 to 7), Reg. to write (0 to 7).]
Instruction decoded:
8  R1 = R4 + R4
Out-of-order scoreboard (Next 2 Slides)
Questions?
RAW dependence: Inst# 4 (R6 = R1 + R4) could
not be issued until cycle 5. Should Inst# 5 (R7 = R1
* R2) wait in queue?
Answer: No. Inst# 5 can be issued in cycle 3 as
there is no register conflict (out-of-order issue).
WAR dependence: Must the issue of Inst# 6 (R1 =
R0 – R2) wait until cycle 9, when all instructions
reading R1 have retired?
Answer: No, provided the new result of Inst# 6 does not
affect the value of R1 being used by previous instructions
(register renaming).
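Register renaming can be sketched in Python as below. The sketch is a simplification, not the hardware mechanism: it looks at a window of instructions in program order and gives a destination a fresh name (S1, S2, ...) whenever an earlier instruction in the window reads or writes that register, ignoring timing. The function name `rename` and the choice of window (Inst# 4 to 7, assuming Inst# 1 to 3 have already retired) are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Instr = Tuple[str, str, str]   # (dest, src1, src2)


def rename(window: List[Instr]) -> List[Instr]:
    """Rename destinations that would cause WAR or WAW conflicts in the window."""
    mapping: Dict[str, str] = {}   # architectural register -> current name
    reads: Set[str] = set()        # names read by earlier instructions
    writes: Set[str] = set()       # names written by earlier instructions
    fresh = 0
    renamed: List[Instr] = []
    for dest, a, b in window:
        # Sources use the most recent name of each register.
        a2, b2 = mapping.get(a, a), mapping.get(b, b)
        d2 = mapping.get(dest, dest)
        if d2 in reads or d2 in writes:   # WAR or WAW with an earlier instruction
            fresh += 1
            d2 = f"S{fresh}"
            mapping[dest] = d2
        reads.update((a2, b2))
        writes.add(d2)
        renamed.append((d2, a2, b2))
    return renamed


window = [
    ("R6", "R1", "R4"),   # Inst 4: R6 = R1 + R4
    ("R7", "R1", "R2"),   # Inst 5: R7 = R1 * R2
    ("R1", "R0", "R2"),   # Inst 6: R1 = R0 - R2
    ("R3", "R3", "R1"),   # Inst 7: R3 = R3 * R1
]
for instr in rename(window):
    print(instr)
# prints:
# ('R6', 'R1', 'R4')
# ('R7', 'R1', 'R2')
# ('S1', 'R0', 'R2')
# ('R3', 'R3', 'S1')
```

After renaming, Inst# 6 writes S1 instead of R1, so it no longer conflicts with Inst# 4 and 5, which still read the old R1; Inst# 7 picks up the new value through S1.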
Out-of-order issue scoreboard
[Table: one row per clock cycle. Columns: Clock cycle, Inst #, Decoded instruction, Issue Inst#, Retire Inst#, Reg. to read (count per register 0 to 7), Reg. to write (count per register 0 to 7).]
Instructions decoded (S1 and S2 are renamed registers):
1  R3 = R0 * R1
2  R4 = R0 + R2
3  R5 = R0 + R1
4  R6 = R1 + R4
5  R7 = R1 * R2
6  S1 = R0 – R2
7  R3 = R3 * S1
8  S2 = R4 + R4
References
Previous example is from:
A. S. Tanenbaum, Structured Computer
Organization, Fifth Edition, Prentice-Hall, 2006, pp.
304-309, Section 4.5.3.
Further reading:
D. W. Anderson, F. J. Sparacio and R. M.
Tomasulo, “The IBM 360 Model 91: Processor
Philosophy and Instruction Handling,” IBM J. Res.
& Dev., vol. 11, no. 1, pp. 8-24, Jan. 1967.
Power Reduction by Slack Scheduling
Application: Superscalar, out-of-order execution:
An instruction is executed as soon as the required data and
resources become available.
A commit unit reorders the results.
Delay the completion of instructions whose result
is not immediately needed.
Example of RISC instructions:
add r0, r1, r2;   (A)
sub r3, r4, r5;   (B)
and r9, r1, r9;   (C)
or  r5, r9, r10;  (D)
xor r2, r5, r11;  (E)
J. Casmira and D. Grunwald, “Dynamic Instruction Scheduling Slack,” Proc. ACM Kool Chips Workshop, Dec. 2000.
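One way to quantify "not immediately needed" for instructions A to E is to compare the earliest and latest cycle in which each can execute. The Python sketch below is an illustration only and is not claimed to be the mechanism of the cited paper: it assumes unit latency, considers only RAW dependences (as if registers were renamed), and defines slack as latest minus earliest start cycle. The function name `slack` is invented.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, Tuple[str, ...]]   # (label, dest, sources)


def slack(program: List[Instr]) -> Dict[str, int]:
    """Slack per instruction = latest start cycle - earliest start cycle."""
    n = len(program)
    last_writer: Dict[str, int] = {}
    preds: List[List[int]] = [[] for _ in range(n)]
    for i, (_, dest, srcs) in enumerate(program):
        for s in srcs:
            if s in last_writer:               # RAW on an earlier result
                preds[i].append(last_writer[s])
        last_writer[dest] = i

    # Earliest cycle: one cycle after the latest producer.
    asap = [0] * n
    for i in range(n):
        asap[i] = 1 + max((asap[p] for p in preds[i]), default=0)

    # Latest cycle that does not lengthen the schedule.
    length = max(asap, default=0)
    alap = [length] * n
    for i in reversed(range(n)):
        for p in preds[i]:
            alap[p] = min(alap[p], alap[i] - 1)

    return {program[i][0]: alap[i] - asap[i] for i in range(n)}


program = [
    ("A", "r0", ("r1", "r2")),    # add r0, r1, r2
    ("B", "r3", ("r4", "r5")),    # sub r3, r4, r5
    ("C", "r9", ("r1", "r9")),    # and r9, r1, r9
    ("D", "r5", ("r9", "r10")),   # or  r5, r9, r10
    ("E", "r2", ("r5", "r11")),   # xor r2, r5, r11
]
print(slack(program))  # prints: {'A': 2, 'B': 2, 'C': 0, 'D': 0, 'E': 0}
```

Under these assumptions C, D and E form the critical chain, while A and B have two cycles of slack and are candidates for delayed completion.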
Slack Scheduling Example
[Figure: two schedules of instructions A to E from the previous slide, one under standard scheduling and one under slack scheduling.]
Slack Scheduling
[Figure: block diagram. Labels: Scheduling logic, Re-order buffer, Slack bit, Low-power execution units (Reduced voltage).]
Superscalar Design of P4 (CISC)
CISC shell:
– Processor fetches instructions from memory in the
order of static program.
– Each instruction is translated into one or more
fixed-length RISC instructions, known as
micro-operations (micro-ops).
RISC core:
– Micro-ops are executed out-of-order in a dynamically
scheduled pipeline.
– Processor commits the result of each micro-op
execution to register file in the order of original
program flow.
Superscalars
3 or more instruction issues per clock:
Intel P6
AMD K5
Sun UltraSPARC
Alpha 21164
MIPS R10000
PowerPC 604/620
HP 8000
References:
D. W. Anderson, F. J. Sparacio and R. M. Tomasulo, “The IBM
360 Model 91: Processor Philosophy and Instruction Handling,”
IBM J. Res. Dev., vol. 11, pp. 8-24, January 1967.
T. Agerwala and J. Cocke, “Reduced Instruction Set
Processors,” Technical Report RC12434 (#55845), Yorktown
Heights, NY: IBM T. J. Watson Research Center, January 1987.
Topics in Computer Architecture
Instruction set
Program execution through register transfer
See Lectures 13. Binary arithmetic (2’s complement, IEEE 754
floating point standard, addition, multiplication)
Datapaths (single-cycle, multicycle, pipeline)
Control (combinational logic, FSM, microcode)
Pipelining (throughput, hazards, forwarding, stall, branch
prediction)
Memory organization (cache, virtual memory)
Performance (benchmarks, energy efficiency, Amdahl’s law)
Advanced architectures (ILP, OOE, superscalar, etc.)
Not discussed in this course:
– Multiprocessors
– Compiler and software techniques – loop unrolling, trace execution, etc.
– Input and output
– Power management
One who claims to know much about computer
architecture speaks from ignorance . . . because
a lot is going to happen in the future, which is . . .
http://www.youtube.com/watch?v=xZbKHDPPrrc
Doris Day in Hitchcock’s 1956 Movie
“The Man Who Knew Too Much”