Pipelining


ECE 587

Advanced Computer Architecture I

Chapter 6

Pipelining

Herbert G. Mayer, PSU

Status 6/27/2015

1

Syllabus

 High-Level View of Pipelining

 Idealized Pipeline

 Definitions

 Goal of Pipelining

 Causes for Dependences

 Stalls and Hazards

 Realistic Constraints

 Reservation Tables in Pipelining

 Collision Vector

 Vertical Expansion

 Horizontal Expansion

 IBM Measurements

2

High-Level View of Pipelining

 HW pipelining is an old architectural design method, dating to the 1960s, for accelerating processor execution

 Pipelining improves performance not by adding HW, but by separating individual HW modules of a conventional uniprocessor (UP) architecture

 Instead of designing one composite, complex piece of HW, the architect for a pipelined machine designs a sequence of simpler and thus faster, consecutive modules

 Ideally all modules m_i would be of similar complexity

 These separate modules m_i are each significantly simpler than the original composite processor, and execute in an overlapped (pipelined) fashion, simultaneously progressing on more than one machine instruction at any one time

3

High-Level View of Pipelining

 Instead of executing one complex instruction in one longer cycle required for one complex step, a pipelined architecture executes a sequence of multiple, faster sub-instructions

 Each sub-instruction is much simpler and thus much faster to execute than the one complex single instruction

 Such single-cycle, pipelined sub-instructions are initiated once per –short– clock-cycle

 Each instruction then progresses to completion while migrating through the various stages of the separate hardware modules m_i, called the pipeline

4

High-Level View of Pipelining

Left: Traditional Hardware Architecture

Right: equivalent Pipelined Architecture

[Figure: the same stages (Decode, O1-Fetch, ALU op, ...) shown once inside a monolithic processor on the left, and as separate, overlapping pipeline stages on the right]

5

Idealized Pipeline

 Arithmetic Logic Unit (ALU) split into separate, sequential modules

 Each module can be initiated once per (shorter, pipelined) clock cycle

 Each module m_i is replicated in HW just once, like in a regular UP

 Note exceptions, when a module is used more than once by the original, non-pipelined instruction

 example: normalize operation in FP instruction

 also FPA in FP multiply operation

 Multiple modules operate in parallel on different instructions, at different stages

6

Idealized Pipeline

 Ideally, all modules require unit time (1 cycle)! Ideal only!

 Ideally, all original, complex operations (fp-add, divide, fetch, store, increment by integer 1, etc.) require the same number n of steps to completion

 But they do not! E.g. FP divide takes way longer than, say, an integer increment, or a no-op

 Differing numbers of cycles per instruction cause instructions to terminate at different times

 Operations may abort in intermediate stages, e.g. in case of a pipeline hazard, caused by: branch, call, return, conditional branch, exception

 An operation also stalls in case of operand dependence

7

Idealized Pipeline

Horizontal: time, in units of cycles. Vertical: consecutive instructions i1-i8, each passing through the stages if, de, op1, op2, exec, wb, with a new instruction fetched every cycle:

      1    2    3    4    5    6    7    8    9    10   11   12   13
i1:   if   de   op1  op2  exec wb
i2:        if   de   op1  op2  exec wb
i3:             if   de   op1  op2  exec wb
...
i8:                                 if   de   op1  op2  exec wb

From cycle 6 on, one instruction retires per clock: CPI = 1 in the steady state
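The timing in the diagram above can be checked with a short sketch; the stage names and the eight-instruction count mirror the slide, while the code itself is purely illustrative:

```python
# A minimal model of the idealized pipeline above: 6 unit-time stages
# (if, de, op1, op2, exec, wb), one new instruction entering per cycle,
# no hazards.

STAGES = ["if", "de", "op1", "op2", "exec", "wb"]

def retirement_cycle(i, n_stages=len(STAGES)):
    """Cycle in which instruction i (1-based) retires, hazard-free."""
    # Instruction i enters stage 1 at cycle i and leaves the last
    # stage at cycle i + n_stages - 1.
    return i + n_stages - 1

retire = [retirement_cycle(i) for i in range(1, 9)]
print(retire)        # [6, 7, 8, 9, 10, 11, 12, 13]

# After the 6-cycle priming phase, consecutive retirements are exactly
# one cycle apart, i.e. steady-state CPI = 1.
gaps = [b - a for a, b in zip(retire, retire[1:])]
assert all(g == 1 for g in gaps)
```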

8

Some Definitions

9

Definitions

Basic Block

 Sequence of instructions (one or more) with a single entry point and a single exit point; entry and exit w.r.t. transfer of control

 Entry point may be the destination of a branch, a fall through from a conditional branch, or the program entry point; i.e. destination of an OS jump

 Exit point may be an unconditional branch instruction, a call, a return, or a fall-through

 Fall-through means: one instruction is a conditional flow of control change, and the subsequent instruction is executed by default, if the change in control flow does not take place

 Or fall-through can mean: The successor of the exit point is a branch or call target

10

Definitions

Collision Vector

 Observation: An instruction requiring n cycles to completion may be initiated a second time n cycles after the first without possibility of conflict

 For each of the n-1 cycles before that, a further instruction of identical type causes a resource conflict, if initiated

 The Boolean vector of length n-1 that records, for each of these offsets, whether re-issue causes a conflict is called the collision vector

 It can be derived from the Reservation Table

11

Definitions

Cycles Per Instruction: cpi

 cpi quantifies how long it takes for a single instruction to execute

 Generally, the number of execution cycles per instruction cpi > 1

 However, on a pipelined UP architecture, where a new instruction can be initiated at each cycle, it is conceivable to reach a cpi rate of 1; assuming no hazards

 Note different meanings, i.e. durations, of cycle

 On a UP pipelined architecture the cpi rate cannot shrink below one

 Yet on an MP architecture, or superscalar machine that is pipelined, the rate may be cpi < 1

12

Definitions

Dependence

 If the logic of the underlying program imposes an order between two instructions, there exists dependence (data or control dependence) between them

 Generally, the order of execution cannot be permuted

 It is conventional in CS to call this dependence, not dependency

13

Definitions

Early Pipelined Computers/Processors:

1. CDC 6000 series of the late 1960s

2. CDC Cyber series of the 1970s

3. IBM 360/91 series

4. Intel® Pentium® 4 and Xeon™ processor families

14

Definitions

Flushing

 When a hazard occurs due to a change in flow of control, the partially executed instructions after the hazard are discarded

 This discarding is called flushing

 Antonym: priming

 Flushing is not needed in case of a stall caused by dependences; waiting instead will resolve this

15

Definitions

Hazard

 Instruction i+1 is pre-fetched under the assumption it would be executed after instruction i

 Yet after decoding i it becomes clear that instruction i is a control-transfer operation

 Hence all subsequently pre-fetched instructions i+1 … and on are wasted

 This is called a hazard

 A hazard causes part of the pipeline to be flushed, while a stall (caused by data dependence) also causes a delay, but a simple wait will resolve such a conflict

16

Definitions

ILP

 Instruction Level Parallelism: Architectural attribute, allowing multiple instructions to be executed at the same time

 Related: Superscalar

17

Definitions

Interlock

 If HW detects a conflict during execution of instructions i and j, where i was initiated earlier, such a conflict, called a stall, delays execution of some instructions

 Interlock is the architecture's way to respond to and resolve a stall, at the expense of degraded performance

 Synonym: delay or wait

18

Definitions

IPC

 Instructions per cycle: a measure for Instruction Level Parallelism. How many different instructions are being executed (not necessarily to completion) during one single cycle?

 Desired to have an IPC rate > 1, but ideally, given some parallelism, IPC >> 1

 On conventional UP CISC architectures it is typical to have IPC << 1

19

Definitions

Pipelining

 Mode of execution, in which one instruction is initiated every cycle and ideally one retires every cycle, even though each requires multiple (possibly many) cycles to complete

 Highly pipelined Xeon processors, for example, have a > 20-stage pipeline

20

Definitions

Prefetch (Instruction Prefetch)

 Bringing an instruction to the execution engine before it is reached by the instruction pointer (ip) is called instruction prefetch

 Generally this is done, because some other knowledge exists proving that the instruction will be executed soon

21

Definitions

Priming

 Filling the various modules of a pipelined processor (the stages) with different instructions to the point of retirement of the first instruction is called priming

 Antonym: flushing

22

Definitions

Register Definition

 If an arithmetic or logical operation places the result into register r i we say that r i is being defined

 Synonym: Writing a register

 Antonym: Register use

23

Definitions

Reservation Table

 Table that shows which hardware resource (module m_i) is being used at which cycle in a multi-cycle instruction

 Typically an X written in the Reservation Table matrix indicates use

 Empty field indicates the corresponding resource is free during that cycle

24

Definitions

Retire

 When all parts of an instruction have successfully migrated through all execution stages, that instruction is complete

 Hence it can be discarded; this is called being retired

 All results have been posted

25

Definitions

Stall

 If instruction i requires operand o that is being computed by another instruction j, and j is not complete when i needs o, there exists a dependence between the two instructions i and j; the wait thus created is called a stall

 A stall prevents the two instructions from being executed simultaneously, since the instruction at step i must wait for the other to complete. See also: hazard, interlock

 A stall can also be caused by a HW resource conflict: some earlier instruction i may use HW resource m while another instruction j needs m

 Generally j has to wait until i frees m, causing a stall

26

Goal & Obstacles to Pipelining

27

Goal of Pipelining

 Complete instructions at a rate higher than the cycles-per-instruction count of a single instruction would allow

 Program completion time on a pipelined architecture is shorter than on a non-pipelined architecture, achieved by having the separate hardware module progress on multiple instructions at the same time. The same hardware is being reused frequently

 Pipelined instructions must be retired in the original, sequential order, or in an order semantically equivalent

 Stalls and hazards should be minimized

 HW resolves dependence conflicts via interlocking

28

Causes for Dependences

 Load into register r_i in one instruction, followed by use of that register r_i: True Dependence

 Load into register r_i in one instruction, followed by use of any register (if hardware does not bother to check register ids; e.g. early HP PA)

 Definition of register r_i in one instruction (other than a load), followed by use of register r_i (on hardware designed with severe limitations)

 Store into memory followed by a load from memory; generally the memory subsystem does not know (i.e. does not check) whether the load comes from the same address as the earlier store; if it does, the long wait can be bypassed (note PCI and PCI-X protocols)

29

Basic Block: Find Dependences

-- result: is leftmost operand after opcode, except for st

-- other operands, if any, are instruction sources

-- Mxx is Memory address at xx, implies indirection for ld

-- The parens in (Mxx) render such indirection explicit

-- 8(sp) means indirect through sp register, offset by 8

-- #4 stands for literal value 4, decimal base

1 ld r2, (M0)

2 add sp, r2, #12

3 st r0, (M1)

4 ld r3, -4(sp)

5 ld r4, -8(sp)

6 add sp, sp, #4

7 st r2, 0(sp)

8 ld r5, (M2)

9 add r4, r0, #1

30

Basic Block: Find Dependences

1-2 load of a register followed by use of that register

3-4 store followed by load into any register

3-4 load from memory at -4(sp) while write to memory in progress (M1)

3-5 load from memory at -8(sp) while write to memory in progress (M1)

2-4, 2-5 define register sp, followed by use of same register; distance sufficient to avoid stall on typical architectures

6-7 define register sp before use; forces sequential execution, reduces pipelining

7-8 store followed by load!

8-9 load into register r5 followed by use of any register (on few, simple architectures)
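The register define/use part of this analysis can be sketched as follows. The tuple encoding of the block (result register first, registers read second; memory operands reduced to the registers they address) is a simplifying assumption, and the memory dependences 3-4, 3-5 and 7-8 would need a separate address check that this sketch does not attempt:

```python
# The 9-instruction basic block from the slide, as (opcode, dest, sources).
# st defines memory, not a register, so its dest is None.
block = [
    ("ld",  "r2", ["M0"]),
    ("add", "sp", ["r2"]),
    ("st",  None, ["r0", "M1"]),
    ("ld",  "r3", ["sp"]),        # -4(sp) reads sp
    ("ld",  "r4", ["sp"]),        # -8(sp) reads sp
    ("add", "sp", ["sp"]),
    ("st",  None, ["r2", "sp"]),
    ("ld",  "r5", ["M2"]),
    ("add", "r4", ["r0"]),
]

def raw_dependences(block):
    """Return (i, j) pairs where instruction j reads a register
    defined by instruction i (1-based, like the slide)."""
    deps = []
    for i, (_, dst, _) in enumerate(block):
        if dst is None:
            continue
        for j in range(i + 1, len(block)):
            _, dst_j, srcs_j = block[j]
            if dst in srcs_j:
                deps.append((i + 1, j + 1))
            if dst_j == dst:      # register redefined: later uses
                break             # depend on the new definition
    return deps

print(raw_dependences(block))
# [(1, 2), (1, 7), (2, 4), (2, 5), (2, 6), (6, 7)]
```

This recovers the register pairs listed above (1-2, 2-4, 2-5, 6-7) plus the equally real 1-7 and 2-6 uses of r2 and sp.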

31

Stalls and Hazards

 Hardware interlock slows down execution due to delay of dependent instruction, benefit is correct result

 Programmer can re-arrange instructions or insert delays at selected places

 Compiler can re-schedule instructions, or insert delays (like programmer)

 Unless the programmer's / compiler's efforts are provably complete, the hardware interlock must still be provided

32

Stalls and Hazards

 CDC 6000 and IBM 360/91 already used automatic hardware interlock

 Advisable to have compiler re-schedule the instruction sequence, since re-ordering may minimize the number of interlocks actually occurring

33

Stalls and Hazards

 Not all HW modules are used exactly once and for only a single cycle

 Some HW modules m i are used more than once in one instruction; e.g. normalizer in floating-point operations

 Basic Block analysis is insufficient to detect all stalls or hazards; dependences may span separate Basic Blocks

 Can add delay circuits to HW modules, which slows down execution, but ensures correct result

34

Instruction Reservation Tables & Collision Vectors

35

Reservation Tables in Pipelining

 Table 1 below, known as a reservation table, shows an ideal schedule using hardware modules m1 to m6, required for execution of one instruction

 Ideal: because each module m_i requires exactly 1 cycle, and each HW module is used exactly once

 Always 1 cycle is also unrealistic

 The time to complete 3 instructions i1, i2, and i3 is 8 cycles, while the time for any single instruction is 6 cycles, clearly a net saving over time:

 Since the completion time for a full instruction in the steady state is 1 cycle on the pipelined architecture

36

Reservation Tables in Pipelining

 For fairness' sake: on a non-pipelined architecture any one of these 3 instructions would NOT necessarily take the equivalent time of 6 cycles of the pipelined machine; perhaps 4, or 5

 Also for fairness' sake: it is not usual that 3 of the same instructions are arranged one after the other

 That model is used here to explain Reservation Tables

37

Reservation Tables in Pipelining

 Key learning: a pipelined architecture does NOT speed up execution of a single instruction, and may even slow it down; but it improves throughput of multiple instructions in a row, and only in the steady state

      t1  t2  t3  t4  t5  t6  t7  t8
m1:   i1  i2  i3
m2:       i1  i2  i3
m3:           i1  i2  i3
m4:               i1  i2  i3
m5:                   i1  i2  i3
m6:                       i1  i2  i3

Table 1: Instructions i1 to i3 use 6 HW modules m1 to m6

38

Reservation Tables in Pipelining

 Table 2 shows a more realistic schedule of HW modules required for a single instruction

 In this schedule some modules are used repeatedly; for example m3 is used 3 times in a row, and m6 4 times

 But all these cycles are contiguous; even that is not always a realistic constraint

 The schedule in Table 2 attempts to exploit the greedy approach for instruction i2: it is initiated as soon as possible, but we see that this doesn't help the time to completion

39

Reservation Tables in Pipelining

 We could have let instruction i2 start at cycle t4, so no additional delay would be caused by m3

 Or instruction i2 could start at t3 with just one additional delay due to m3

 However, in both cases m6 would cause a delay later anyway

 To schedule i2 we must consider these multi-cycle resources, m3 and m6, that are in use continuously

 In case of a load, the actual module would wait many more cycles, until the data arrive, and would not progress until then

40

Reservation Tables in Pipelining

[Table: instructions i1 to i3 pass through modules m1-m6 over cycles t1-t19; m3 is held for 3 consecutive cycles and m6 for 4 per instruction, and delay slots (d) mark where the greedily initiated i2 and i3 stall behind busy modules]

Table 2: Instructions i1 to i3 use 6 HW modules m_i for 1-4 cycles

41

Reservation Tables in Pipelining

 Instead of using the single resources m3 and m6 repeatedly and continuously, an architect can replicate them as many times as needed in a single instruction

 This costs more hardware , and does not speed up execution of a single instruction

 For a single operation, all would still have to progress in sequence

 But it avoids the delay of subsequent instructions needing the same type of HW module

 See Table 3: shaded areas indicate the duplicated hardware modules

42

Reservation Tables in Pipelining

        t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13
m1:     i1  i2  i3
m2:         i1  i2  i3
m3:             i1  i2  i3
m3,2:               i1  i2  i3
m3,3:                   i1  i2  i3
m4:                         i1  i2  i3
m5:                             i1  i2  i3
m6:                                 i1  i2  i3
m6,2:                                   i1  i2  i3
m6,3:                                       i1  i2  i3
m6,4:                                           i1  i2  i3

Table 3: Instructions i1 to i3 use replicated HW modules m3 and m6 (shaded in the original slide)

43

Reservation Tables in Pipelining

 These replicated circuits in Table 3 do not speed up the execution of any individual instruction

 But by avoiding the delay for other instructions, a higher degree of pipelining is enabled, and multiple instructions can retire earlier

 Even this is unrealistically simplistic

 Some of the modules are used for more than one cycle, but not necessarily continually

 Instead, a Reservation Table offers a more realistic representation

 Use the Reservation Table in Table 4 to figure out how closely identical instructions can be scheduled one after another

44

Collision Vector

 Best case: the next identical instruction can be scheduled at the next cycle

 Worst case: the next instruction is scheduled n cycles after the start of the first, if the first requires n cycles for completion

 The goal for a HW designer is to find how many can be scheduled in between on a regular basis

 To analyze this for speed, we use the Reservation Table and Collision Vector (CV)

45

Collision Vector

 Goal: find the Collision Vector by overlapping two identical Reservation Tables (e.g. plastic transparencies) within the window of the cycles of one operation

 If, after shifting second transparency i = 1..n-1 time steps, 2 resource-marks of one row land in the same field, we have a collision : both instructions claim that resource at the same time!

 Collision means: mark field i in the CV with a 1

 Otherwise mark it with a 0 , or leave blank

 Do so n-1 times, and the CV is complete. But do check for all rows, i.e. for all HW modules m j
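The transparency-overlay procedure above translates directly into a few lines of code; the function is an illustrative sketch, and the sample reservation table at the end is hypothetical, not one of the tables in this deck:

```python
# Compute a collision vector by shifting a reservation table against
# itself i = 1..n-1 cycles and checking every module row for a clash.

def collision_vector(table):
    """table: dict mapping module name -> set of busy cycles (1-based),
    for an n-cycle instruction. Returns the CV as a list of n-1 bits."""
    n = max(max(cycles) for cycles in table.values())
    cv = []
    for shift in range(1, n):
        # A collision at this shift: some row claims the same cycle
        # in both the original and the shifted copy.
        clash = any(cycles & {c + shift for c in cycles}
                    for cycles in table.values())
        cv.append(1 if clash else 0)
    return cv

# Hypothetical 4-cycle instruction: m1 busy at t1 and t3, m2 at t2 and t4.
print(collision_vector({"m1": {1, 3}, "m2": {2, 4}}))   # [0, 1, 0]
```

Re-issue two cycles after the first initiation collides (both copies want m1 at t3), while offsets 1 and 3 are free.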

46

Collision Vector

      t1  t2  t3  t4  t5  t6  t7
m1:   X
m2:       X   X
m3:           X   X
m4:                   X       X
m5:                       X

Table 4: Reservation Table for a 7-step, 5-module instruction

Table 5: Find Collision Vector for above instruction

Collision Vector has n-1 entries for n-cycle instruction

47

Collision Vector

 If a second instruction of the kind shown in Table 4 were initiated one cycle after the first, resource m2 would cause a conflict

 This is because instruction 2 requires m2 at cycles 3 and 4

 However, instruction 1 is already using m2 at cycles 2 and 3

 At step 3 there would be a conflict

 Also resource m3 would cause a conflict

 The good news, however, is that this double conflict causes no further entry in the CV

48

Collision Vector

 Similarly, a new instruction cannot be initiated 2 cycles after the start of the first

 This is because instruction 2 requires m4 at cycles t7 and t9

 However, instruction 1 is already using m4 at t5 and t7. At step t7 there would be a conflict

 At all other steps a second instruction may be initiated. See the completed CV in Table 6

1 1 0 0 0 0

Table 6: Collision Vector for the above 7-cycle, 5-module instruction

49

Reservation Table: Main Example

 The next example is a hypothetical instruction of a hypothetical architecture, characterized by Reservation Table 7: 7 cycles, 4 modules

 It will be used throughout this section

      t1  t2  t3  t4  t5  t6  t7
m1:   X       X   X
m2:       X               X
m3:                   X       X
m4:           X

Table 7: Reservation Table for 7-cycle, 4-module Main Example

50

Reservation Table: Main Example

 The Collision Vector for the Main Example says: we can start a new instruction of the same kind at step t6 or t7

 Of course, we can always start a new instruction, identical or of another type, after the current one has completed; no resource will be in use then

 The challenge is to start another one while the current is still executing

1 1 1 1 0 0

Table 8: Collision Vector for Main Example

 The goal now is to show that by adding delays we can sometimes speed up execution of pipelined ops!

51

Main Example Pipelined

 For the Main Example, initiate a second, pipelined instruction Y at step t6, 5 cycles after the start of X

 Greedy Approach to pipeline X and Y as follows:

      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12
m1:   X       X   X       Y       Y   Y       Z
m2:       X               X   Y               Y   Z
m3:                   X       X           Y       Y
m4:           X                   Y

Table 9: Pipelining 2 Instructions of Main Example

 Observe the two-cycle overlap. This is all the speed gain we can get. Starting Y earlier (greedy approach) would create delays, and not retire the second one any earlier

52

Main Example Pipelined

 The 3rd pipelined instruction Z can start at time step t11, by which time the first is retired, the second half-way through

 The fourth instruction can start at step t16, etc.

      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16 t17
m1:   X       X   X       Y       Y   Y       Z       Z   Z
m2:       X               X   Y               Y   Z               Z
m3:                   X       X           Y       Y           Z       Z
m4:           X                   Y                   Z

Table 10: Pipelining 3 Instructions of Main Example

53

Main Example Pipelined

 Though Reservation Table 7 for the Main Example is sparsely populated, no high degree of pipelining is possible

 The maximum overlap is 2 cycles

 Can one already infer this low degree of pipelining from the Collision Vector ? In pipelined execution there are 5 cycles per instruction retirement in the steady state, cpi = 5

 That means 5 cycles per completion of an instruction, assuming the same instruction is executed over and over, and only after the steady state has been reached, not during the priming phase!

 We'll come back to the Main Example and analyze it after further study of Examples 2 and 3

54

Pipeline Example 2

 The Reservation Table for Example 2 has 7 entries in 24 fields, 6 steps, density = 0.29166. The Main Example has 8 entries in 28 fields, density = 0.28571

 We'll attempt to pipeline as many identical Example 2 instructions as possible

[Reservation Table: 7 X entries for modules m1-m4 over cycles t1-t6]

Table 11: Reservation Table for Example 2

1 0 1 0 1

Table 12: Collision Vector for Example 2

55

Pipeline Example 2

 The Collision Vector suggests initiating a new pipelined instruction at times t3, t5, t7, etc.

 That would allow 3 instructions simultaneously, overlapped, pipelined. By step t7 the first instruction would already be retired

[Schedule: identical instructions X, Y, Z, A, B initiated every 2 cycles over t1-t14, with three instructions in flight at any time]

Table 13: Schedule for Pipelining Example 2

56

Pipeline Example 2

 We were just lucky to pipeline 3 identical instructions at the same time

 The CV is not a direct indicator. The reader was mildly misled into making inferences that don't strictly follow

 However, if all positions in the CV were marked 1, there would be no pipelining

 For Example 2 the number of cycles per instruction retirement is cpi = 2

 Even though the operation density is slightly higher than in the Main Example , the pipelining overlap is significantly higher, which is counterintuitive! On to Example 3!

57

Pipeline Example 3

 It is interesting to see for Example 3, analyzing the Collision Vector, how much we can pipeline. The Reservation Table has numerous resource fields filled, yet the Collision Vector is sparser than the one in Example 2

      t1  t2  t3  t4  t5  t6
m1:   X       X       X
m2:       X       X       X
m3:   X       X       X
m4:       X       X       X

Table 14: Reservation Table for Example 3

0 1 0 1 0

Table 15: Collision Vector for Example 3

58

Pipeline Example 3

 The Collision Vector suggests starting a new pipelined instruction 1, 3, or 5 cycles after initiation of the first

 The CV is less packed with 1s than in the previous case, Example 2, where we could overlap 3 identical instructions and get a rate of cpi = 2. The goal is to find the best cpi rate for Example 3

[Schedule: instruction pairs X, Y initiated at t1, t2, then X1, Y2 at t7, t8, and so on over t1-t12; every module is busy every cycle, and two instructions retire every 6 cycles]

Table 16: Schedule for Pipelining Example 3

59

Pipeline Example 3

 Example 2 earlier with Collision Vector 10101 allows a higher degree of pipelining

 Here in Example 3 , cpi = 3 , every 6 cycles two instructions can retire

 This is in contrast to cpi = 2 of Example 2

 The reason for the lower retirement rate is clear:

 All 4 HW modules are used every other cycle by one of two instructions, thus one cannot overlap more than twice

60

Vertical Expansion for Example 3

 If we need a higher degree of pipelining for Example 3 with its fill-factor of 0.5, we must pay: vertically with more hardware, or horizontally with more time for added delays

 Analyze the vertically expanded Reservation Table now with 8 modules; every hardware resource m1 to m4 is replicated; density = 0.25

[Reservation Table: the 12 X entries of Example 3 spread over 8 modules, m1-m4 plus duplicates m1,2-m4,2, across cycles t1-t6; density = 0.25]

Table 17: Reservation Table Example 3 with Replicated HW

61

Vertical Expansion for Example 3

 Let us pipeline multiple identical instructions for Reservation Table 17 as densely as possible

 With twice the HW, can we overlap perhaps twice as much? The previous rate with half the hardware was cpi = 3. A solution is shown in Table 18

[Schedule: instructions X, Y, Z, A overlapped on the replicated hardware over t1-t12; 4 initiations and 4 retirements every 8 cycles]

Table 18: Schedule for Pipelining Example 3

62

Vertical Expansion for Example 3

 The initiation rate and retirement rate are 4 instructions per 8 cycles, cpi = 2

 This is, as suspected, better than the rate of the original Example 3 , not surprising with double the hardware modules

 But this is not twice as good a retirement rate. The original rate was cpi = 3; the modified rate with double the hardware is cpi = 2

 Our next case, a variation of the Main Example , shows an expansion of the Reservation Table horizontally

 I.e. delays are built in, but HW modules are kept constant. Only the 4 modules m1 to m4 from the Main Example are provided

63

Horizontal Expansion, Main Example

 After Examples 2 and 3, we expand the Main Example, repeated below, by adding delays, AKA Horizontal Expansion

 If we insert delay cycles, clearly execution for a single instruction will slow down

 However, if this yields a sufficient increase in the degree of pipelining, it may still be a win

 Building circuits to delay an instruction is cheap

 We analyze this variation next:

64

Horizontal Expansion, Main Example

      t1  t2  t3  t4  t5  t6  t7
m1:   X       X   X
m2:       X               X
m3:                   X       X
m4:           X

Table 19: Original Reservation Table for Main Example

Inserting a delay cycle after t3 creates a new t4

65

Horizontal Expansion, Main Example

 We'll insert delays; but where? A methodical way to compute the optimal position and number exists, but is not shown here

 Instead, we'll suggest a sample position for a single delay and analyze the performance impact

 Table 20 shows the delay inserted after cycle 3

      t1  t2  t3  t4  t5  t6  t7  t8
m1:   X       X       X
m2:       X                       X
m3:                       X       X
m4:           X

Table 20: Reservation Table for Main Example with 1 Delay, at t4

0 1 0 1 1 0 0

Table 21: Collision Vector for Main Example with 1 Delay

66

Horizontal Expansion, Main Example

 The Greedy Approach is to schedule instruction Y as soon as possible, at the first 0 entry in the CV

 This would lead us to initiate a second instruction Y at time step t2, one cycle after instruction X. Is it optimal?

      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16
m1:   X   Y   X   Y   X   Y       Z   A   Z   A   Z   A
m2:       X   Y               X   Y   Z   A               Z   A
m3:                       X   Y   X   Y           Z   A   Z   A
m4:           X   Y                       Z   A

Table 22: Schedule for Pipelined Instructions X, Y, Z, A with Delay Slot, Using Greedy Approach

67

Horizontal Expansion, Main Example

 Initiation and retirement rates are 2 instructions every 7 cycles, or cpi = 3.5; see the retirements in Table 22: X after t8, Y after t9, then Z after t15, A after t16

 This is already better than cpi = 5 for the original Main Example without the delay

 Hence we have shown that adding delays can speed up the throughput of pipelined instructions

 But can we do better?

 After all, we have only tried out the first sample of a Greedy Approach !

68

Horizontal Expansion, Main Example

 In this experiment we start the second instruction at cycle t4, three cycles after the start of the first

 Which cpi shall we get? See Table 23

      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16 t17 t18
m1:   X       X   Y   X   Y   Z   Y   Z   X   Z   X   Y   X   Y   Z   Y   Z
m2:       X       Y       X   Z       Y   X       Z   Y       X   Z
m3:                       X       X   Y       Y   Z       Z   X       X   Y
m4:           X           Y           Z           X           Y           Z

Table 23: Schedule for Pipelined Main Example, with Delay. Initiation later than the first opportunity, resulting in better throughput. Message: start later, i.e. slower, to run faster

69

Horizontal Expansion, Main Example

 In the more patient schedule of Table 23 we complete one identical instruction every 3 cycles in the steady state

 The last table cell of each instruction marks its retirement

 X retires at completion of t8, Y after t11, and Z after t14

 Then X again after t17

 Now cpi = 3 with the Not-So-Greedy Approach

 Key learning: To speed up pipelined execution, one can sometimes enhance throughput by adding delay circuits, or by replicating hardware, or postponing instruction initiation, or a combination of the above

 The greedy approach is not necessarily optimal

 The collision vector only states, when one cannot initiate a new instruction (value 1); a 0 value is not a solid hint for initiating a new instruction
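The trade-off above (greedy initiation versus deliberately waiting) can be sketched directly from the Collision Vector alone. The two functions below are illustrative, not from the slides; the vector used is the one derived for the Main Example with one delay (Table 21):

```python
# cv[i-1] == 1 means an identical instruction must NOT be initiated
# i cycles after a previous one; the instruction is len(cv)+1 cycles long.

def greedy_schedule(cv, count):
    """Start each new instruction at the first non-colliding cycle."""
    n = len(cv) + 1
    starts = [0]
    while len(starts) < count:
        t = starts[-1] + 1
        while any(0 < t - s < n and cv[t - s - 1] for s in starts):
            t += 1
        starts.append(t)
    return starts

def best_constant_interval(cv):
    """Smallest k such that initiating every k cycles never collides."""
    n = len(cv) + 1
    for k in range(1, n + 1):
        if all(cv[off - 1] == 0 for off in range(k, n, k)):
            return k
    return n

cv = [0, 1, 0, 1, 1, 0, 0]            # Table 21
print(greedy_schedule(cv, 5))         # [0, 1, 7, 8, 14]: 2 per 7 cycles
print(best_constant_interval(cv))     # 3: one instruction every 3 cycles
```

The greedy schedule settles into 2 initiations per 7 cycles (cpi = 3.5), while the patient constant interval of 3 reproduces the cpi = 3 result above.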

70

IBM Measurements

Agarwala and Cocke 1987; see [1]

 Memory Bandwidth:

 1 word/cycle to fetch 1 instruction/cycle from I-cache

 40% of instructions are memory-access (load-store)

 Those would all benefit from access to D-cache

 Code Characteristics, dynamic:

 25% of all instructions: loads

 15% of all instructions: stores

 40% of all instructions: ALU/RR

 20% of all instructions: Branches

1/3 unconditional

1/3 conditional taken

1/3 conditional not taken
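As a back-of-envelope use of this mix, one can estimate an effective CPI; the stall penalties below are illustrative assumptions, not figures from the IBM study:

```python
# Effective CPI from the dynamic instruction mix above, on a simple
# pipeline with assumed per-class stall costs.

mix = {"load": 0.25, "store": 0.15, "alu": 0.40, "branch": 0.20}
assert abs(sum(mix.values()) - 1.0) < 1e-9

load_penalty   = 1      # assumed: 1 stall cycle per load (load-use)
branch_penalty = 3      # assumed: flush cost of a taken branch
taken_fraction = 2/3    # unconditional + conditional taken, per the mix

cpi = 1.0                                        # ideal pipelined base
cpi += mix["load"] * load_penalty                # + 0.25
cpi += mix["branch"] * taken_fraction * branch_penalty   # + 0.40
print(round(cpi, 2))                             # 1.65
```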

71

How Can Pipelining Work?

 About 1 out of 4 or 5 instructions is a branch

 Branches include all transfer-of-control instructions; these are: call, return, unconditional and conditional branch, abort, exception and similar machine instructions

 If a processor pipeline is deeper than, say, 5 stages, there will almost always be a branch in the pipeline, rendering several of the prefetched operations useless

 Some processors (e.g. Intel Willamette, [6]) have 20 stages. For such a processor a branch would essentially always be in flight, so pipelining would constantly cause stalls

72

How Can Pipelining Work?

 Remedy is branch prediction

 If the processor knows dynamically from which address to fetch, instead of blindly assuming the subsequent code address pc+1, most pipeline flushes are avoided

 Luckily, branch prediction in the 2010s has become about 97% accurate, only rarely causing the need to re-prime the pipe

 Also, processors are no longer designed with pipelines as deep as the Willamette's roughly 20 stages

 Here we see interesting interactions of several computer architecture principles: pipelining and branch prediction, one helping the other to become exceedingly advantageous
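This interaction can be put into numbers; apart from the 97% accuracy and the roughly 1-in-5 branch fraction quoted in the text, the penalty values below are illustrative assumptions:

```python
# Steady-state CPI as a function of branch fraction, predictor
# accuracy, and flush penalty (roughly the refill depth of the pipe).

def cpi_with_branches(branch_fraction, accuracy, flush_penalty):
    """Base CPI of 1 plus the expected flush cost per instruction."""
    return 1.0 + branch_fraction * (1.0 - accuracy) * flush_penalty

# ~1 in 5 instructions is a branch; assume a deep ~20-stage pipe loses
# 15 cycles per mispredict, a shallow pipe only 5.
deep  = cpi_with_branches(0.20, 0.97, 15)
short = cpi_with_branches(0.20, 0.97, 5)
print(round(deep, 3), round(short, 3))    # 1.09 1.03
```

Even at 97% accuracy, the deep pipeline pays roughly three times the misprediction overhead of the shallow one, which is why depth and prediction quality must be balanced together.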

73

Summary

 Pipelining can speed up execution

 Not due to the faster clock rate

 That fast clock rate applies to significantly simpler sub-instructions and cannot be equated with the original, i.e. non-pipelined, clock

 Pipelining may even benefit from inserting delays

 May also benefit from initiating an instruction later than possible

 And benefits, not surprisingly, from added HW resources

 Branch prediction is a necessary architecture attribute to make pipelining work

74

Bibliography

1. Cocke and Schwartz, Programming Languages and their Compilers, unpublished, 1969, http://portal.acm.org/itation.cfm?id=1097042

2. Harold Stone, High Performance Computer Architecture, AW 1993

3. cpi rate: http://en.wikipedia.org/wiki/Cycles_per_instruction

4. Introduction to PCI: http://electrofriends.com/articles/computerscience/protocol/introduction-to-pci-protocol/

5. Wiki PCI page: http://en.wikipedia.org/wiki/Conventional_PCI

6. http://en.wikipedia.org/wiki/NetBurst_(microarchitecture)

75
