Principles of Pipelining

• The two major parametric considerations in designing a parallel computer architecture are:
  – executing multiple instructions in parallel, and
  – increasing the efficiency of the processors.
• There are various methods by which instructions can be executed in parallel:
  – Pipelining is one of the classical and effective methods of increasing parallelism, in which different stages perform repeated functions on different operands.
  – Vector processing applies arithmetic or logical computation to whole vectors, whereas scalar processing handles only one data item, or one pair of data items, at a time.
• Superscalar processing: improving the processor's speed by issuing multiple instructions per cycle is known as superscalar processing.
• Multithreading: a technique for increasing processor utilization, also used in parallel computer architecture.

OBJECTIVES
• Principles of linear pipelining
• Classification of pipeline processors
• Instruction and arithmetic pipelines
• Principles of designing pipeline processors
• Vector processing requirements

PARALLEL PROCESSING

Execution of concurrent events in the computing process to achieve faster computational speed.

Levels of parallel processing:
- Job or program level
- Task or procedure level
- Inter-instruction level
- Intra-instruction level

PARALLEL COMPUTERS

Architectural classification (Flynn's classification):
• Based on the multiplicity of instruction streams and data streams
• Instruction stream: sequence of instructions read from memory
• Data stream: operations performed on the data in the processor

                                      Number of Data Streams
                                      Single      Multiple
  Number of             Single       SISD        SIMD
  Instruction Streams   Multiple     MISD        MIMD

COMPUTER ARCHITECTURES FOR PARALLEL PROCESSING

Von Neumann based:
  SISD: superscalar processors, superpipelined processors, VLIW
  MISD: nonexistent
  SIMD: array processors, systolic arrays, associative processors
  MIMD: shared-memory multiprocessors (bus based, crossbar switch based, multistage IN based);
        message-passing multicomputers (hypercube, mesh, reconfigurable)
Dataflow
Reduction

PIPELINING

A technique of decomposing a sequential process into suboperations, with each subprocess being executed in a dedicated segment that operates concurrently with all other segments.

Example: Ai * Bi + Ci for i = 1, 2, 3, ..., 7

Segment 1: R1 <- Ai, R2 <- Bi        (load Ai and Bi)
Segment 2: R3 <- R1 * R2, R4 <- Ci   (multiply, and load Ci)
Segment 3: R5 <- R3 + R4             (add)

OPERATIONS IN EACH PIPELINE STAGE

Clock   Segment 1      Segment 2             Segment 3
Pulse   R1     R2      R3          R4        R5
1       A1     B1      --          --        --
2       A2     B2      A1 * B1     C1        --
3       A3     B3      A2 * B2     C2        A1 * B1 + C1
4       A4     B4      A3 * B3     C3        A2 * B2 + C2
5       A5     B5      A4 * B4     C4        A3 * B3 + C3
6       A6     B6      A5 * B5     C5        A4 * B4 + C4
7       A7     B7      A6 * B6     C6        A5 * B5 + C5
8       --     --      A7 * B7     C7        A6 * B6 + C6
9       --     --      --          --        A7 * B7 + C7

GENERAL PIPELINE

General structure of a 4-segment pipeline:

Input -> S1 -> R1 -> S2 -> R2 -> S3 -> R3 -> S4 -> R4

Space-time diagram (tasks T1-T6 on a 4-segment pipeline):

Segment   1    2    3    4    5    6    7    8    9    <- clock cycles
1         T1   T2   T3   T4   T5   T6
2              T1   T2   T3   T4   T5   T6
3                   T1   T2   T3   T4   T5   T6
4                        T1   T2   T3   T4   T5   T6
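The pattern in both diagrams above is the same: a k-segment pipeline finishes n tasks in n + k - 1 clocks. As a purely illustrative check, here is a minimal Python simulation of the three-segment Ai * Bi + Ci example; the operand values are made up for the demonstration.

```python
# Illustrative simulation of the 3-segment pipeline computing Ai * Bi + Ci.
# Segment 1 loads Ai, Bi; segment 2 multiplies and loads Ci; segment 3 adds.
A = [2, 4, 6, 8, 10, 12, 14]      # made-up operands, i = 1..7
B = [3, 3, 3, 3, 3, 3, 3]
C = [1, 1, 1, 1, 1, 1, 1]
n, k = len(A), 3

results = []
for clock in range(1, n + k):     # n + k - 1 clock pulses in total
    i1, i2, i3 = clock - 1, clock - 2, clock - 3   # task index in each segment
    s1 = f"A{i1+1},B{i1+1}"         if 0 <= i1 < n else "-"
    s2 = f"A{i2+1}*B{i2+1},C{i2+1}" if 0 <= i2 < n else "-"
    s3 = f"A{i3+1}*B{i3+1}+C{i3+1}" if 0 <= i3 < n else "-"
    if 0 <= i3 < n:                                # segment 3: R5 <- R3 + R4
        results.append(A[i3] * B[i3] + C[i3])
    print(f"pulse {clock:2d}:  S1={s1:7}  S2={s2:11}  S3={s3}")

assert results == [a * b + c for a, b, c in zip(A, B, C)]
```

Running it reproduces the clock-pulse table above: segment 3 emits its first result at pulse 3 and its last at pulse 9 = 7 + 3 - 1.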
PIPELINE PROCESSING

• Pipelining is a method of realizing overlapped parallelism, in the proposed solution of a problem, on a digital computer in an economical way.
• To introduce pipelining in a processor P, the following steps must be followed:
  – Subdivide the input process into a sequence of subtasks. These subtasks form the stages of the pipeline, also known as segments.
  – Each stage Si of the pipeline performs its subtask on a distinct set of operands.
  – When stage Si has completed its operation, the result is passed to the next stage Si+1 for the next operation.
  – Stage Si then receives a new set of inputs from the previous stage Si-1.
• In this way parallelism is achieved in a pipelined processor: m independent operations are performed simultaneously in m segments.

PIPELINE PROCESSOR

• A pipeline processor can be defined as a processor consisting of a sequence of processing circuits, called segments, through which a stream of operands (data) is passed.
• In each segment, partial processing of the data stream is performed, and the final output is obtained when the stream has passed through the whole pipeline.
• Any operation that can be decomposed into a sequence of well-defined subtasks can be realized through the pipelining concept.

Classification of Pipeline Processors
• Level of processing
• Pipeline configuration
• Type of instruction and data

Classification according to level of processing
• Instruction pipeline
• Arithmetic pipeline

Instruction Pipeline
• An instruction cycle may consist of many operations: fetch opcode, decode opcode, compute operand addresses, fetch operands, and execute the instruction.
• These operations of the instruction execution cycle can be realized through the pipelining concept: each of them forms one stage of a pipeline, and their overlapped execution provides a speedup over normal sequential execution. Pipelines used for the instruction cycle operations are therefore known as instruction pipelines.
• The process of executing an instruction involves the following major steps:
  – Fetch the instruction from main memory
  – Decode the instruction
  – Fetch the operands
  – Execute the decoded instruction

INSTRUCTION CYCLE

Six phases* in an instruction cycle:
[1] Fetch an instruction from memory
[2] Decode the instruction
[3] Calculate the effective address of the operand
[4] Fetch the operands from memory
[5] Execute the operation
[6] Store the result in the proper place

* Some instructions skip some phases.
* Effective address calculation can be done as part of the decoding phase.
* Storage of the operation result into a register is done automatically in the execution phase.

==> 4-stage pipeline:
[1] FI: Fetch an instruction from memory
[2] DA: Decode the instruction and calculate the effective address of the operand
[3] FO: Fetch the operand
[4] EX: Execute the operation

INSTRUCTION PIPELINE

Execution of three instructions in a 4-stage pipeline:

Conventional (sequential):
i    : FI DA FO EX
i+1  :             FI DA FO EX
i+2  :                         FI DA FO EX

Pipelined:
i    : FI DA FO EX
i+1  :    FI DA FO EX
i+2  :       FI DA FO EX

INSTRUCTION EXECUTION IN A 4-STAGE PIPELINE

Segment 1: fetch instruction from memory.
Segment 2: decode instruction and calculate effective address; if the instruction is a branch, update the PC and empty the pipe.
Segment 3: fetch operand from memory.
Segment 4: execute instruction; if an interrupt is pending, transfer to interrupt handling.

Timing when instruction 3 is a branch (its successor's fetch is squashed and the pipe refills only after the branch executes):

Step:        1   2   3   4   5   6   7   8   9   10  11  12  13
1            FI  DA  FO  EX
2                FI  DA  FO  EX
(Branch) 3           FI  DA  FO  EX
4                        FI  --  --  FI  DA  FO  EX
5                                        FI  DA  FO  EX
6                                            FI  DA  FO  EX
7                                                FI  DA  FO  EX
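The cost of the branch can also be sketched in Python. This is a toy model, not a description of any particular machine: squashed fetches are simply not drawn, and the flush length is tied to the pipeline depth.

```python
# Toy model of a 4-stage instruction pipeline (FI, DA, FO, EX) in which a
# taken branch squashes everything fetched behind it; fetching resumes
# only after the branch leaves EX.
STAGES = ["FI", "DA", "FO", "EX"]

def schedule(n_instr, branch=None):
    """Return (instruction, first-cycle) pairs for the surviving fetches."""
    rows, fetch = [], 1
    for i in range(1, n_instr + 1):
        rows.append((i, fetch))
        fetch += 1
        if i == branch:                    # resolved after the branch's EX
            fetch = rows[-1][1] + len(STAGES)
    return rows

def draw(rows):
    last = max(start for _, start in rows) + len(STAGES) - 1
    for i, start in rows:
        cells = ["--"] * last
        for s, name in enumerate(STAGES):
            cells[start + s - 1] = name
        print(f"I{i}: " + " ".join(cells))

draw(schedule(7, branch=3))   # instruction 3 is a taken branch; I4 starts
                              # again at cycle 7 (its squashed FI not shown)
```

The output reproduces the 13-step table above: without the branch the seven instructions would finish in 7 + 4 - 1 = 10 steps, so the branch costs three wasted cycles here.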
Instruction Buffers
• To take full advantage of pipelining, the pipeline should be kept filled continuously, so the instruction fetch rate must be matched with the pipeline consumption rate. To do this, instruction buffers are used.
• Instruction buffers in the CPU are high-speed memory for storing instructions; instructions are pre-fetched into the buffer from main memory.
• An alternative to the instruction buffer is a cache memory between the CPU and main memory.
• The advantage of a cache is that it can be used for both instructions and data, but a cache requires more complex control logic than an instruction buffer.

Arithmetic Pipeline
• Complex arithmetic operations, such as multiplication and floating-point operations, consume much of the ALU's time.
• These operations can also be pipelined by segmenting the operations of the ALU; as a consequence, high-speed performance may be achieved. Pipelines used for arithmetic operations are therefore known as arithmetic pipelines.
• The technique of pipelining can thus be applied to various complex and slow arithmetic operations to speed up processing time.
• Arithmetic pipelines are constructed for simple fixed-point and complex floating-point arithmetic operations. These operations are well suited to pipelining because they can be efficiently partitioned into subtasks for the pipeline stages.
• For implementing arithmetic pipelines we generally use the following two types of adder:
  – Carry propagation adder (CPA): adds two numbers such that the carries generated in successive digits are propagated.
  – Carry save adder (CSA): adds numbers such that the carries generated are not propagated but are saved in a carry vector.

Fixed-point Arithmetic Pipelines
• Example: multiplication of fixed-point numbers.
  – Two fixed-point numbers are multiplied by the ALU using add and shift operations; this sequential execution makes multiplication a slow process.
  – Multiplication is the process of adding multiple copies of shifted multiplicands, and can be pipelined as follows:
• The first stage generates the partial products, which form the six rows of shifted multiplicands.
• In the second stage, the six numbers pass through two CSAs, merging into four numbers.
• In the third stage, a single CSA merges the four numbers into three numbers.
• In the fourth stage, a single CSA merges the three numbers into two numbers.
• In the fifth stage, the last two numbers are added through a CPA to obtain the final product.

Floating-point Arithmetic Pipelines
• Floating-point computations are among the best candidates for pipelining.
• Example: addition of two floating-point numbers. The following stages are identified:
  – The first stage compares the exponents of the two numbers.
  – The second stage aligns the mantissas.
  – In the third stage, the mantissas are added.
  – In the last stage, the result is normalized.

ARITHMETIC PIPELINE

Floating-point adder for X = A x 2^a and Y = B x 2^b:
[1] Compare the exponents
[2] Align the mantissas
[3] Add/subtract the mantissas
[4] Normalize the result

Segment 1: compare the exponents by subtraction (difference a - b)
Segment 2: choose the exponent; align the mantissa of the smaller operand
Segment 3: add or subtract the mantissas
Segment 4: adjust the exponent and normalize the result
(A latch register R separates each pair of adjacent segments.)
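The four segments map naturally onto four small functions. The sketch below is a toy version with integer mantissas: to keep the arithmetic exact it left-shifts toward the smaller exponent, whereas real hardware right-shifts the smaller operand's mantissa and discards bits; everything here is illustrative only.

```python
# Toy 4-segment floating-point addition; a number is (m, e) meaning m * 2**e.
def compare_exponents(x, y):                  # segment 1
    return x, y, x[1] - y[1]                  # exponent difference a - b

def align_mantissas(x, y, diff):              # segment 2
    (mx, ex), (my, ey) = x, y
    if diff >= 0:                             # align to the smaller exponent
        return (mx << diff, ey), (my, ey)     # (exact; real HW shifts right)
    return (mx, ex), (my << -diff, ex)

def add_mantissas(x, y):                      # segment 3
    return x[0] + y[0], x[1]

def normalize(m, e):                          # segment 4 (toy rule: strip
    while m and m % 2 == 0:                   #  trailing zero bits)
        m, e = m // 2, e + 1
    return m, e

# X = 3 * 2**4 = 48 and Y = 5 * 2**2 = 20; expect 68 = 17 * 2**2.
x, y, d = compare_exponents((3, 4), (5, 2))
m, e = add_mantissas(*align_mantissas(x, y, d))
print(normalize(m, e))                        # -> (17, 2)
```

In a pipelined ALU each of these four functions would be a hardware segment separated by latches, with a new operand pair entering segment 1 every clock.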
Classification according to pipeline configuration
• Unifunction pipelines: when a fixed, dedicated function is performed through a pipeline, it is called a unifunction pipeline.
• Multifunction pipelines: when different functions are performed through the pipeline at different times, it is known as a multifunction pipeline. Multifunction pipelines are reconfigurable at different times according to the operation being performed.

Classification according to type of instruction and data
• Scalar pipelines: this type of pipeline processes scalar operands of repeated scalar instructions.
• Vector pipelines: this type of pipeline processes vector instructions over vector operands.

Performance and Issues in Pipelining

• Speedup: how much performance improvement we get through pipelining.
  – n: number of tasks to be performed
• Conventional (non-pipelined) machine:
  – tn: time to complete one task
  – t1: time required to complete the n tasks
  – t1 = n * tn
• Pipelined machine (k stages):
  – tp: clock cycle (time to complete each suboperation)
  – tk: time required to complete the n tasks
  – tk = (k + n - 1) * tp
• Speedup:
  – Sk = n * tn / [(k + n - 1) * tp]
  – As n -> infinity, Sk -> tn / tp  ( = k, if tn = k * tp )

PIPELINE AND MULTIPLE FUNCTION UNITS

Example:
- 4-stage pipeline; suboperation in each stage takes tp = 20 ns
- 100 tasks to be executed
- 1 task in the non-pipelined system takes k * tp = 4 * 20 = 80 ns

Pipelined system:     (k + n - 1) * tp = (4 + 99) * 20 = 2060 ns
Non-pipelined system: n * k * tp = 100 * 80 = 8000 ns
Speedup:              Sk = 8000 / 2060 = 3.88

A 4-stage pipeline is basically equivalent to a system with four identical function units P1-P4 operating in parallel on instructions Ii, Ii+1, Ii+2, Ii+3 (multiple functional units).

• Efficiency: the efficiency of a pipeline can be measured as the ratio of the busy time span to the total time span, including idle time.
• Let c be the clock period of an m-segment pipeline; the efficiency E is
  – E = (n * m * c) / (m * [m * c + (n - 1) * c]) = n / (m + n - 1)
• As n -> infinity, E approaches 1.
• Throughput: the number of results achieved per unit time,
  – T = (n / [m + (n - 1)]) / c = E / c
• Throughput denotes the computing power of the pipeline.
• Maximum speedup, efficiency and throughput are ideal-case figures.
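These formulas can be checked directly against the worked example above (k = 4 stages, tp = 20 ns, n = 100 tasks); a quick calculation:

```python
# Speedup, efficiency and throughput of a linear pipeline, using the
# figures from the example above (k stages = m segments here).
k, n = 4, 100
tp = 20e-9                          # clock period, 20 ns
tn = k * tp                         # non-pipelined time per task

t_seq  = n * tn                     # n * k * tp       = 8000 ns
t_pipe = (k + n - 1) * tp           # (k + n - 1) * tp = 2060 ns
S = t_seq / t_pipe                  # Sk = n*tn / [(k + n - 1)*tp]
E = n / (k + n - 1)                 # efficiency -> 1 as n -> infinity
T = E / tp                          # throughput, results per second

print(f"speedup    = {S:.2f}")      # 3.88 (approaches k = 4 for large n)
print(f"efficiency = {E:.3f}")      # 0.971
print(f"throughput = {T/1e6:.1f} million results/s")
```

Note how close the 100-task speedup (3.88) already is to the asymptotic limit k = 4; the gap is exactly the k - 1 startup cycles spent filling the pipe.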
Limitations to Speedup

• Data dependency between successive tasks: there may be dependencies between the instructions of two tasks in the pipeline. For example:
  – One instruction cannot be started until the previous instruction returns its result, as the two are interdependent.
  – Another instance of data dependency arises when both instructions try to modify the same data object. These are called data hazards.
• Resource constraints: when resources are not available at the time of execution, delays are caused in the pipeline. For example:
  – If one common memory is used for both data and instructions, and an operand read/write and an instruction fetch are needed at the same time, only one can be carried out and the other has to wait.
  – A limited resource, such as an execution unit, may be busy at the required time.
• Branch instructions and interrupts in the program: a program is not a straight flow of sequential instructions. Branch instructions alter the normal flow of the program, which delays pipeline execution and affects performance. Similarly, interrupts postpone the execution of the next instruction until the interrupt has been serviced. Branches and interrupts both have damaging effects on pipelining.

PRINCIPLES OF DESIGNING PIPELINE PROCESSORS

Contents:
• Instruction prefetch and branch handling
• Data buffering and busing structures
• Internal forwarding and register tagging
• Hazard detection and resolution

INSTRUCTION PREFETCH AND BRANCH HANDLING

• In designing pipelined instruction units, note that interrupts and branches produce damaging effects on the performance of pipeline computers.
• There are two possible paths for a conditional branch operation: 1) the yes (taken) path; 2) the no (not-taken) path.

Five segments of the instruction pipeline: Fetch Instruction, Decode, Fetch Operands, Execute, Store Results.

[Figures omitted: timing diagrams over clock cycles 0-22 showing (a) overlapped execution of instructions I1-I8 without branching, (b) the degraded timing of the same stream when branching occurs, and (c) two depictions of the effect of instruction 3 being a conditional branch to instruction 15.]

Instruction Prefetching Strategy
• Instruction words ahead of the one currently being decoded are fetched from memory before the instruction decoding unit requests them.
• Two prefetch buffers are used:
  – Sequential prefetch buffer (s words): holds instructions fetched during a sequential run of the program. When a branch is successful, the contents of this buffer are invalidated.
  – Target prefetch buffer (t words): holds instructions fetched from the target of a conditional branch. When the conditional branch is unsuccessful, the contents of this buffer are invalidated.
• Unconditional branch (jump): the instruction word at the branch target is requested immediately by the decoder, and decoding ceases until the target instruction returns from memory.
• Conditional branch: sequential prefetching is suspended, and instructions are prefetched from the target memory address of the conditional branch instruction. If the branch is successful, the target instruction stream becomes the sequential stream.
• Instruction prefetching reduces the damaging effect of branching.

[Figure omitted: an instruction pipeline with both a sequential prefetch buffer (s words) and a target prefetch buffer (t words) between a memory system of access time T and a decoder of delay r time units, feeding the execution pipeline E.]

DATA BUFFERING AND BUSING STRUCTURES

• The processing speeds of pipeline segments are usually unequal.
• The throughput of the pipeline is inversely proportional to the delay of the bottleneck segment, so it is desirable to remove the bottleneck, which causes unnecessary congestion.

Example: segments S1, S2, S3 with delays T1 = T3 = T and T2 = 3T; segment S2 is the bottleneck.

Subdivision of segment S2 (two different divisions of S2):
  S1(T) -> S2(T) -> S2(T) -> S2(T) -> S3(T),   or
  S1(T) -> S2(T) -> S2(2T) -> S3(T)

Replication of segment S2: if the bottleneck is not subdivisible, use duplicates of the bottleneck in parallel to smooth the congestion: three copies of S2 (3T each) between S1 and S3, fed in rotation.
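A one-line model makes the arithmetic of the two remedies concrete; the replication figure rests on the assumption, stated above, that the three copies of S2 accept new operands in rotation.

```python
# Throughput of the S1 -> S2 -> S3 pipeline with T1 = T3 = T, T2 = 3T.
T = 1.0                                  # base stage delay, arbitrary units

def throughput(stage_delays):
    # A synchronous pipeline is clocked at the pace of its slowest stage.
    return 1.0 / max(stage_delays)

print(throughput([T, 3 * T, T]))         # bottleneck: 1 result per 3T
print(throughput([T, T, T, T, T]))       # S2 subdivided: 1 result per T
# Replication: three 3T copies of S2, started in rotation one T apart,
# also deliver one result per T once the pipe is full.
```

Either remedy triples the steady-state throughput; subdivision does it with extra latches, replication with extra copies of the slow stage.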
Data and Instruction Buffers
• One way to smooth the traffic flow in a pipeline is to use buffers to close up the speed gap between the memory accesses for instructions or operands.
• Buffering can avoid unnecessary idling of the processing stages caused by memory-access conflicts or by unexpected branching or interrupts.

Busing Structures
• Ideally, the subfunction being executed by one stage should be independent of the other subfunctions being executed by the remaining stages; otherwise, some processes in the pipeline must be halted until the dependency is removed.
• Such problems cause additional time delays. An efficient internal busing structure is desired to route results to the requesting stations with minimum time delay.

Internal Forwarding and Register Tagging
• Internal forwarding refers to a "short circuit" technique for replacing unnecessary memory accesses with register-to-register transfers in a sequence of fetch-arithmetic-store operations.
• Register tagging refers to the use of tagged registers, buffers, and reservation stations for exploiting concurrent activities among multiple arithmetic units.

Internal Forwarding Examples

a) Store-fetch forwarding:
   Mi <- (R1)   (store)           Mi <- (R1)   (store)
   R2 <- (Mi)   (fetch)      =>   R2 <- (R1)   (register transfer)
   2 memory accesses              1 memory access

b) Fetch-fetch forwarding:
   R1 <- (Mi)   (fetch)           R1 <- (Mi)   (fetch)
   R2 <- (Mi)   (fetch)      =>   R2 <- (R1)   (register transfer)
   2 memory accesses              1 memory access

c) Store-store forwarding:
   Mi <- (R1)   (store)
   Mi <- (R2)   (store)      =>   Mi <- (R2)   (store)
   2 memory accesses              1 memory access

HAZARD DETECTION AND RESOLUTION

• Pipeline hazards are caused by resource-usage conflicts among the various instructions in the pipeline. Such hazards are triggered by inter-instruction dependencies.
• There are three classes of data-dependency hazards, according to the data update patterns:
  1) write after read (WAR); 2) read after write (RAW); 3) write after write (WAW).
• Hazard detection can be done in the instruction-fetch stage of a pipeline processor by comparing the domain and range of the incoming instruction with those of the instructions being processed in the pipe. A warning signal can then be generated to prevent the hazard from taking place.

MAJOR HAZARDS IN PIPELINED EXECUTION

• Structural hazards (resource conflicts): the hardware resources required by instructions in simultaneous overlapped execution cannot all be met.
• Data hazards (data dependency conflicts): an instruction scheduled to be executed in the pipeline requires the result of a previous instruction, which is not yet available. Example:
    ADD: R1 <- B + C
    INC: R1 <- R1 + 1
  The increment must wait (a bubble in the pipe) until the ADD has produced R1.
• Control hazards: branches and other instructions that change the PC delay the fetch of the next instruction (a bubble until the branch address is known).

Hazards in pipelines may make it necessary to stall the pipeline.
Pipeline interlock: detect hazards and stall until the hazard is cleared.

STRUCTURAL HAZARDS

• Occur when some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute.
• Example: with one memory port, a data fetch and an instruction fetch cannot be initiated in the same clock (two loads with a one-port memory):

  i    : FI DA FO EX
  i+1  :    FI DA FO EX
  i+2  :       stall stall FI DA FO EX

  The pipeline is stalled for a structural hazard; a two-port memory would serve both accesses without a stall.

DATA HAZARDS

• Occur when the execution of an instruction depends on the result of a previous instruction:
    ADD R1, R2, R3
    SUB R4, R1, R5
• Data hazards can be dealt with by either hardware or software techniques.
• Hardware techniques:
  – Interlock: hardware detects the data dependencies and delays the scheduling of the dependent instruction by stalling enough clock cycles.
  – Forwarding (bypassing, short-circuiting): accomplished by a data path that routes a value from a source (usually an ALU) to its user, bypassing the designated register. This allows the value being produced to be used at an earlier stage in the pipeline than would otherwise be possible.
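The domain/range comparison described above reduces to intersecting register sets. Here is a toy detector, assuming a made-up instruction format (destination, source1, source2):

```python
# Classify the hazard between an instruction already in the pipe (first)
# and an incoming one (second) by comparing domains (registers read) and
# ranges (registers written).
def hazard(first, second):
    d1, r1 = set(first[1:]),  {first[0]}
    d2, r2 = set(second[1:]), {second[0]}
    if r1 & d2:
        return "RAW"        # second reads what first writes
    if d1 & r2:
        return "WAR"        # second overwrites what first still reads
    if r1 & r2:
        return "WAW"        # both write the same register
    return None

ADD = ("R1", "R2", "R3")    # ADD R1, R2, R3
SUB = ("R4", "R1", "R5")    # SUB R4, R1, R5  -- needs R1 from ADD
print(hazard(ADD, SUB))     # -> RAW: stall (interlock) or forward the value
```

A RAW result corresponds to the ADD/SUB pair above, which the hardware resolves either by interlock (stalling) or by the forwarding path described next.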
• Software technique:
  – Instruction scheduling by the compiler (for delayed load).

FORWARDING HARDWARE

Example (3-stage pipeline; I: instruction fetch, A: decode, read registers, ALU operation, E: write the result to the destination register):

    ADD R1, R2, R3
    SUB R4, R1, R5

A bypass path routes the ALU output (held in the ALU result buffer) through a multiplexer back to the ALU input, in parallel with the register file and the result write bus, so SUB can use R1 before it has been written back:

Without bypassing:           With bypassing:
  ADD: I  A  E                 ADD: I  A  E
  SUB:    I  .  A  E           SUB:    I  A  E
  (SUB's A waits for E)        (result forwarded to SUB's A)

INSTRUCTION SCHEDULING

a = b + c; d = e - f;

Unscheduled code:        Scheduled code:
LW  Rb, b                LW  Rb, b
LW  Rc, c                LW  Rc, c
ADD Ra, Rb, Rc           LW  Re, e
SW  a, Ra                ADD Ra, Rb, Rc
LW  Re, e                LW  Rf, f
LW  Rf, f                SW  a, Ra
SUB Rd, Re, Rf           SUB Rd, Re, Rf
SW  d, Rd                SW  d, Rd

Delayed load: a load requiring that the following instruction not use its result.

CONTROL HAZARDS

• Branch instructions: the branch target address is not known until the branch instruction is completed.

  Branch instruction:  FI DA FO EX
  Next instruction:                FI DA FO EX
  (target address available only after EX; the stall wastes cycle times)

• Dealing with control hazards:
  – Prefetch target instruction
  – Branch target buffer
  – Loop buffer
  – Branch prediction
  – Delayed branch

• Prefetch target instruction: fetch instructions in both streams, branch taken and branch not taken; both are saved until the branch is executed, then the right instruction stream is selected and the wrong stream is discarded.
• Branch target buffer (BTB; associative memory): each entry holds the address of a previously executed branch, plus the target instruction and the next few instructions. When fetching an instruction, the BTB is searched: if found, fetch the instruction stream held in the BTB; if not, fetch the new stream and update the BTB.
• Loop buffer (high-speed register file): stores an entire loop, allowing a loop to execute without accessing memory.
• Branch prediction: guess the branch condition and fetch an instruction stream based on the guess; a correct guess eliminates the branch penalty.
• Delayed branch: the compiler detects the branch and rearranges the instruction sequence by inserting useful instructions that keep the pipeline busy in the presence of a branch instruction.

DELAYED LOAD

LOAD:  R1 <- M[address 1]
LOAD:  R2 <- M[address 2]
ADD:   R3 <- R1 + R2
STORE: M[address 3] <- R3

Three-segment pipeline timing with a data conflict (the ADD would read R2 in cycle 4, while the second LOAD is still writing it):

clock cycle    1  2  3  4  5  6
Load R1        I  A  E
Load R2           I  A  E
Add R1+R2            I  A  E
Store R3                I  A  E

Pipeline timing with delayed load (a NOP removes the conflict):

clock cycle    1  2  3  4  5  6  7
Load R1        I  A  E
Load R2           I  A  E
NOP                  I  A  E
Add R1+R2               I  A  E
Store R3                   I  A  E

The data dependency is taken care of by the compiler rather than by the hardware.

DELAYED BRANCH

The compiler analyzes the instructions before and after the branch and rearranges the program sequence by inserting useful instructions in the delay steps.

Using no-operation instructions:

clock cycle       1  2  3  4  5  6  7  8  9  10
1. Load           I  A  E
2. Increment         I  A  E
3. Add                  I  A  E
4. Subtract                I  A  E
5. Branch to X                I  A  E
6. NOP                           I  A  E
7. NOP                              I  A  E
8. Instr. in X                         I  A  E

Rearranging the instructions:

clock cycle       1  2  3  4  5  6  7  8
1. Load           I  A  E
2. Increment         I  A  E
3. Branch to X          I  A  E
4. Add                     I  A  E
5. Subtract                   I  A  E
6. Instr. in X                   I  A  E
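The compiler-side handling of delayed loads shown above can be sketched as a tiny pass over a made-up three-address instruction list; a real scheduler would first try to hoist an independent instruction into the slot (as in the rearranged sequences) before falling back to a NOP.

```python
# Toy delayed-load pass: if the instruction after a LOAD reads the loaded
# register, insert a NOP into the load delay slot.
def fill_load_delay(prog):
    out = []
    for i, ins in enumerate(prog):
        out.append(ins)
        if ins[0] == "LOAD" and i + 1 < len(prog):
            if ins[1] in prog[i + 1][2:]:      # next instr reads loaded reg
                out.append(("NOP",))
    return out

prog = [("LOAD",  "R1", "addr1"),
        ("LOAD",  "R2", "addr2"),              # R2 is used immediately below
        ("ADD",   "R3", "R1", "R2"),
        ("STORE", "addr3", "R3")]
for ins in fill_load_delay(prog):
    print(ins)     # a NOP appears between the second LOAD and the ADD
```

On the example program this reproduces the delayed-load timing above: only the second LOAD needs a slot filled, since the first LOAD's result is not used by the instruction immediately after it.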
Pipeline Throughput
• The average number of task initiations per clock cycle.

Dynamic Pipelines and Reconfigurability
• A dynamic pipeline may initiate tasks from different reservation tables simultaneously, allowing multiple initiations of different functions in the same pipeline.
• It is assumed that any computation step can be delayed by inserting a non-compute stage.
• Pipelines with perfect initiation cycles can be utilized better than those with non-perfect initiation cycles.
• Reconfigurability: reconfigurable pipelines with different function types are more desirable. Such an approach requires extensive resource sharing among the different functions; to achieve this, a more complicated structure of pipeline segments and their interconnection control is needed.
• A bypass technique can be used to avoid unwanted stages. This may cause a collision when one instruction, as a result of bypassing, attempts to use an operand fetched for a preceding instruction.

UNIVERSITY QUESTION BANK

1. Discuss the key design problems of a pipeline processor.
2. Discuss the various instruction prefetch and branch control strategies, with their effect on the performance of a pipeline processor.
3. Explain the internal forwarding and register tagging techniques.
4. Explain the causes, detection, avoidance and resolution of pipeline hazards.
5. For a pipeline processor system, explain: i) instruction prefetching; ii) data dependency hazards.
6. What are the factors affecting the performance of pipeline computers?
7. What are the different hazards in a pipeline processor? How are they detected and resolved?