Chapter One
Introduction to Pipelined Processors

Principle of Designing Pipeline Processors
(Design Problems of Pipeline Processors)

Data Buffering and Busing Structures

Speeding up of pipeline segments
• The processing speeds of pipeline segments are usually unequal.
• Consider the three-segment pipeline below, where segment Si has delay Ti:
    S1 (T1) → S2 (T2) → S3 (T3)
• If T1 = T3 = T and T2 = 3T, S2 becomes the bottleneck and must be removed.
• One method is to subdivide the bottleneck. Two subdivisions are possible:
  – First method: split S2 into two subsegments with delays T and 2T. The slowest stage now takes 2T.
      S1 (T) → S2 in two parts (T, 2T) → S3 (T)
  – Second method: split S2 into three subsegments with delay T each. Every stage now takes T.
      S1 (T) → S2 in three parts (T, T, T) → S3 (T)
• If the bottleneck is not subdivisible, we can duplicate S2 in parallel:
    S1 (T) → three parallel copies of S2 (3T each) → S3 (T)
• Control and synchronization are more complex with parallel segments.

Data Buffering
• Instruction and data buffering provides a continuous flow to the pipeline units.
• Example: 4X TI ASC
  – The system uses a memory buffer unit (MBU), which
      supplies the arithmetic unit with a continuous stream of operands, and
      stores results in memory.
  – The MBU has three double buffers X, Y and Z (one octet per buffer): X and Y for input, Z for output.
  – This sustains a high pipeline processing rate and alleviates the bandwidth mismatch between memory and the arithmetic pipeline.

Busing Structures
• Problem: Ideally, the subfunctions in a pipeline should be independent; otherwise the pipeline must be halted until the dependency is removed.
• Solution: An efficient internal busing structure.
• Example: TI ASC
  – Once an instruction dependency is recognized, update capability is provided by transferring the contents of the Z buffer to the X or Y buffer.
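The bottleneck argument from the segment speed-up slides can be checked numerically. The sketch below is only an idealized model: it assumes no latch overhead and that tasks are dealt evenly to parallel copies of a segment, and the function name steady_state_interval is hypothetical, not taken from the slides. Delays are expressed in units of T.

```python
from fractions import Fraction


def steady_state_interval(segments):
    """Time between successive results once the pipeline is full.

    segments: list of (delay, copies) pairs, with delay in units of T.
    Idealized assumption: a segment with c parallel copies accepts a new
    task every delay / c time units, and latch overhead is ignored.
    """
    return max(Fraction(delay, copies) for delay, copies in segments)


# Each tuple is (delay in units of T, number of parallel copies).
original = [(1, 1), (3, 1), (1, 1)]                        # S2 = 3T is the bottleneck
first_method = [(1, 1), (1, 1), (2, 1), (1, 1)]            # S2 split into T and 2T
second_method = [(1, 1), (1, 1), (1, 1), (1, 1), (1, 1)]   # S2 split into T, T, T
duplicated = [(1, 1), (3, 3), (1, 1)]                      # three copies of S2 in parallel

for name, segs in [
    ("original", original),
    ("first method", first_method),
    ("second method", second_method),
    ("duplicated S2", duplicated),
]:
    print(f"{name}: one result every {steady_state_interval(segs)}T")
```

Under these assumptions the original pipeline delivers one result every 3T, the first method every 2T, and both the second method and the duplicated-S2 design every T. The model says nothing about the extra control and synchronization cost of the parallel design.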
Internal Data Forwarding and Register Tagging

Internal Forwarding and Register Tagging
• Internal forwarding: replacing unnecessary memory accesses with register-to-register transfers.
• Register tagging: the use of tagged registers to exploit concurrent activities among multiple ALUs.

Internal Forwarding
• Memory access is slower than register-to-register operations.
• Performance can be enhanced by eliminating unnecessary memory accesses.
• This concept can be explored in 3 directions:
  1. Store – Load Forwarding: a load from a location that has just been stored from a register is replaced by a register-to-register transfer.
  2. Load – Load Forwarding: a second load from the same location is replaced by a transfer from the register loaded first.
  3. Store – Store Forwarding: when two stores to the same location occur in succession, the first store is redundant and can be eliminated.
[Figures: Store – Load Forwarding, Load – Load Forwarding, Store – Store Forwarding]

Register Tagging
Example: IBM Model 91 Floating Point Execution Unit (FPU)
• The floating point execution unit consists of:
  – Data registers
  – Transfer paths
  – Floating Point Adder Unit
  – Multiply-Divide Unit
  – Reservation stations
  – Common Data Bus
• There are 3 reservation stations for the adder, named A1, A2 and A3, and 2 for the multiplier, named M1 and M2.
• Each station has source and sink registers with their tag and control fields.
• The stations hold operands for the next execution.
• The 3 store data buffers (SDBs) and 4 floating point registers (FLRs) are tagged.
• The busy bit of an FLR indicates that the register is waiting for a result, so subsequent instructions that use it are dependent on an instruction still in execution.
• The Common Data Bus (CDB) is used to transfer operands.
• There are 11 units that supply information to the CDB: 6 floating point buffers (FLBs), 3 adders and 2 multiply/divide units.
• The tags for these units are:

    Unit   Tag      Unit   Tag
    FLB1   0001     ADD1   1010
    FLB2   0010     ADD2   1011
    FLB3   0011     ADD3   1100
    FLB4   0100     M1     1000
    FLB5   0101     M2     1001
    FLB6   0110

• Internal forwarding can be achieved with the tagging scheme on the CDB.
• Example:
  – Let F refer to an FLR and FLBi to the ith FLB, with contents (F) and (FLBi).
  – Consider the instruction sequence
      ADD F,FLB1    F ← (F) + (FLB1)
      MPY F,FLB2    F ← (F) × (FLB2)
• During the addition:
  – The busy bit of F is set to 1.
  – The contents of F and FLB1 are sent to adder A1.
  – The tag of F is set to 1010 (tag of the adder).
[Figure: Model 91 floating point execution unit during ADD. Blocks: Storage Bus, Instruction Unit, Floating Point Buffers (FLB) 1 to 6, Control, Floating Point Operand Stack (FLOS), Decoder, registers with busy bits and tags, adder and multiplier reservation stations (sink tag, source tag, CTRL), Store Data Buffers (SDB) 1 to 3, Common Data Bus. F has busy bit = 1 and tag = 1010; the adder station holds sink F (1010) and source FLB1 (0001).]
• Meanwhile, decoding MPY reveals that F is busy, so:
  – The tag 1010 (tag of the adder) held by F is given to M1, so that M1 waits for the adder's result.
  – F changes its tag to 1000 (tag of the multiplier).
  – The contents of FLB2 are sent to M1.
[Figure: the same unit after MPY is decoded. F has busy bit = 1 and tag = 1000; the multiplier station shows F (1000) and FLB2 (0010).]
• When the addition is done, the CDB finds that the result should be sent to M1.
• The multiplication is performed once both operands are available.

Hazard Detection and Resolution
• Hazards are caused by resource usage conflicts among various instructions.
• They are triggered by inter-instruction dependencies.

Terminologies:
• Resource objects: the set of working registers, memory locations and special flags.
• Data objects: the contents of resource objects.
• Each instruction can be considered as a mapping from a set of data objects to a set of data objects.
• Domain D(I): the set of resource objects whose data objects may affect the execution of instruction I.
• Range R(I): the set of resource objects whose data objects may be modified by the execution of instruction I.
• An instruction reads from its domain and writes to its range.
• Consider the execution of instructions I and J, where J appears immediately after I.
• There are 3 types of data-dependent hazards:
  1. RAW (Read After Write)
  2. WAW (Write After Write)
  3. WAR (Write After Read)

RAW (Read After Write)
• The necessary condition for this hazard is R(I) ∩ D(J) ≠ ∅
• Example:
    I1 : LOAD r1,a
    I2 : ADD r2,r1
• I2 cannot be executed correctly until r1 is loaded.
• Thus I2 is RAW dependent on I1.

WAW (Write After Write)
• The necessary condition is R(I) ∩ R(J) ≠ ∅
• Example:
    I1 : MUL r1,r2
    I2 : ADD r1,r4
• Here I1 and I2 write to the same destination, and hence they are WAW dependent.

WAR (Write After Read)
• The necessary condition is D(I) ∩ R(J) ≠ ∅
• Example:
    I1 : MUL r1,r2
    I2 : ADD r2,r3
• Here I2 has r2 as its destination while I1 uses it as a source, and hence they are WAR dependent.

Hazard Detection and Resolution
• Hazards can be detected in the fetch stage by comparing domains and ranges.
• Once a hazard is detected, there are two ways to proceed:
  1. Generate a warning signal to prevent the hazard.
  2. Allow the incoming instruction through the pipe and distribute detection to all pipeline stages.
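The three necessary conditions above are plain set intersections, so the domain/range comparison can be sketched in a few lines. This is only an illustration, not the detection logic of any particular machine. The names Instr and possible_hazards are hypothetical, and the domains and ranges below assume a two-address reading of the examples in which the first operand is both a source and the destination (for instance, ADD r2,r1 computes r2 ← r2 + r1).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str
    domain: frozenset  # D(I): resource objects the instruction may read
    range: frozenset   # R(I): resource objects the instruction may write


def possible_hazards(i, j):
    """Hazard types whose necessary condition holds for J following I.

    A non-empty intersection is necessary, not sufficient: whether a
    hazard really occurs depends on pipeline timing.
    """
    found = []
    if i.range & j.domain:   # R(I) ∩ D(J) ≠ ∅
        found.append("RAW")
    if i.range & j.range:    # R(I) ∩ R(J) ≠ ∅
        found.append("WAW")
    if i.domain & j.range:   # D(I) ∩ R(J) ≠ ∅
        found.append("WAR")
    return found


# Assumed two-address semantics: first operand is source and destination.
load = Instr("LOAD r1,a", frozenset({"a"}), frozenset({"r1"}))
add_r2_r1 = Instr("ADD r2,r1", frozenset({"r2", "r1"}), frozenset({"r2"}))
mul_r1_r2 = Instr("MUL r1,r2", frozenset({"r1", "r2"}), frozenset({"r1"}))
add_r1_r4 = Instr("ADD r1,r4", frozenset({"r1", "r4"}), frozenset({"r1"}))
add_r2_r3 = Instr("ADD r2,r3", frozenset({"r2", "r3"}), frozenset({"r2"}))

print(possible_hazards(load, add_r2_r1))       # RAW example
print(possible_hazards(mul_r1_r2, add_r1_r4))  # WAW example
print(possible_hazards(mul_r1_r2, add_r2_r3))  # WAR example
```

Under the assumed two-address reading, the WAW example also satisfies the RAW and WAR conditions, because r1 is read and written by both instructions. The slides single out only the WAW dependence for that pair.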