Chapter 1 Summary

2013-14
Advanced Microprocessor by Vincy Joseph
TE CMPNA/B
Chapter One
Introduction to Pipelined Processors
Pipelining:
• It is a technique of decomposing a sequential process into sub-operations, each of which is executed in a dedicated segment that operates concurrently with all the other segments.
• It improves processor performance by overlapping the execution of multiple instructions.
Example of pipelining in a computer
• Consider that the process of executing an instruction involves four major steps:
1. Instruction Fetch (IF): from main memory
2. Instruction Decode (ID): identifies the operation to be performed
3. Operand Fetch (OF): if needed for execution
4. Execution (EX): of the decoded arithmetic/logic operation
• In a non-pipelined computer, these four steps must be completed before the next instruction can be issued.
• In a pipelined computer, successive stages execute in an overlapped fashion.
• Theoretically, a k-stage linear pipeline can be up to k times faster.
• But this ideal speedup cannot be achieved due to factors like:
– Data dependency
– Branches and interrupts
Principles of Linear Pipelining
• In pipelining, we divide a task into a set of subtasks.
• The precedence relation of a set of subtasks {T1, T2, …, Tk} for a given task T implies that a subtask Tj cannot start until some earlier subtask Ti finishes.
• The interdependencies of all subtasks form the precedence graph.
• With a linear precedence relation, subtask Tj cannot start until all earlier subtasks {Ti} (i < j) finish.
• A linear pipeline can process subtasks with a linear precedence graph.
Basic Linear Pipeline
• L: latches, the interface between different stages of the pipeline
• S1, S2, etc.: pipeline stages
• The pipeline consists of a cascade of processing stages.
• Stages: pure combinational circuits performing arithmetic or logic operations over the data flowing through the pipe.
• Stages are separated by high-speed interface latches.
• Latches: fast registers holding intermediate results between stages.
• Information flow is under the control of a common clock applied to all latches.
• The flow of data in a four-stage linear pipeline evaluating a function on five inputs is shown below:
– the vertical axis represents the four stages, and
– the horizontal axis represents time, in units of the pipeline clock period.
Clock Period (τ) for the pipeline
• Let τi be the time delay of the stage circuitry Si and tl be the time delay of a latch.
• Then the clock period of a linear pipeline is defined by
τ = max{τi : 1 ≤ i ≤ k} + tl = τm + tl
• The reciprocal of the clock period is called the clock frequency (f = 1/τ) of a pipeline processor.
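The clock-period rule above can be sketched in a few lines; the stage and latch delays below are made-up illustrative values, not figures from the text.

```python
# Clock period of a linear pipeline: the slowest stage delay plus the latch delay.

def clock_period(stage_delays, latch_delay):
    """tau = max(tau_i) + t_l"""
    return max(stage_delays) + latch_delay

tau = clock_period([10, 15, 12, 8], 2)  # hypothetical delays in ns -> 17 ns
f = 1 / tau                             # clock frequency is the reciprocal
```

Note that one slow stage sets the period for the whole pipe, which is why bottleneck stages are subdivided or duplicated later in the chapter.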
Performance of a linear pipeline
• Consider a linear pipeline with k stages.
• Let T be the clock period, and let the pipeline be initially empty.
• Starting at any time, we feed n inputs and wait till the results come out of the pipeline.
• The first input takes k clock periods, and the remaining (n - 1) results emerge one after another in successive clock periods.
• Thus the computation time of the pipeline is
Tp = kT + (n - 1)T = [k + (n - 1)]T
• For example, if the linear pipeline has four stages (k = 4) and five inputs (n = 5),
Tp = [k + (n - 1)]T = [4 + 4]T = 8T
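A minimal sketch of the Tp formula, reproducing the worked example:

```python
# T_p = [k + (n - 1)] * T: k periods for the first input, then one per input.

def pipeline_time(k, n, T=1):
    """Total time to push n inputs through a k-stage linear pipeline."""
    return (k + (n - 1)) * T

# Four stages and five inputs give 8 clock periods, as in the example above.
```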
Performance Parameters
• The main performance parameters of a pipeline are:
1. Speedup
2. Throughput
3. Efficiency
Speedup
Speedup is defined as
Speedup = (Time taken for a given computation by a non-pipelined functional unit) / (Time taken for the same computation by a pipelined version)
Assume a function of k stages of equal complexity, each taking the same time T.
A non-pipelined unit takes kT time per input, so n inputs take nkT.
Then Speedup = nkT / [k + (n - 1)]T = nk / (k + n - 1)
The maximum value of speedup is
lim (n→∞) Speedup = k
Efficiency
It is an indicator of how efficiently the resources of the pipeline are used.
If a stage is available during a clock period, its availability becomes the unit of resource.
Efficiency can be defined as
Efficiency = (Number of stage-time units actually used during the computation) / (Total number of stage-time units available during that computation)
Number of stage-time units actually used = nk (there are n inputs and each input uses k stages).
Total number of stage-time units available = k[k + (n - 1)]; it is the product of the number of stages (k) and the number of clock periods taken for the computation, k + (n - 1).
Thus efficiency is expressed as:
Efficiency = nk / k[k + (n - 1)] = n / (k + n - 1)
The maximum value of efficiency is 1.
Throughput
It is the average number of results computed per unit time.
For n inputs, a k-stage pipeline takes [k + (n - 1)]T time units.
Then,
Throughput = n / [k + (n - 1)]T = nf / (k + n - 1)
where f is the clock frequency.
The maximum value of throughput is
lim (n→∞) Throughput = f
Note that Throughput = Efficiency × Frequency.
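The three parameters can be sketched directly from their formulas; this also makes the identity Throughput = Efficiency × Frequency easy to check numerically.

```python
# Performance parameters of a k-stage linear pipeline processing n inputs.

def speedup(k, n):
    """nk / (k + n - 1); tends to k as n grows."""
    return n * k / (k + n - 1)

def efficiency(k, n):
    """n / (k + n - 1); tends to 1 as n grows."""
    return n / (k + n - 1)

def throughput(k, n, f=1.0):
    """nf / (k + n - 1) results per unit time, f = clock frequency."""
    return n * f / (k + n - 1)
```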
Example : Floating Point Adder Unit
• This pipeline is linearly constructed with 4 functional stages.
• The inputs to this pipeline are two normalized floating-point numbers of the form
A = a × 10^p
B = b × 10^q
where a and b are two fractions and p and q are their exponents.
• Our purpose is to compute the sum
C = A + B = c × 10^r = d × 10^s
where r = max(p, q) and 0.1 ≤ d < 1
• For example:
A = 0.9504 × 10^3
B = 0.8200 × 10^2
a = 0.9504, b = 0.8200
p = 3 and q = 2
• Operations performed in the four pipeline stages are:
1. Compare p and q, choose the larger exponent r = max(p, q), and compute t = |p - q|.
Example:
r = max(p, q) = 3
t = |p - q| = |3 - 2| = 1
2. Shift right the fraction associated with the smaller exponent by t units to equalize the two exponents before fraction addition.
Example:
fraction with the smaller exponent: b = 0.8200
shifting b right by 1 unit gives 0.0820
3. Perform fixed-point addition of the two fractions to produce the intermediate sum fraction c.
Example:
a = 0.9504, b = 0.0820
c = a + b = 0.9504 + 0.0820 = 1.0324
4. Count the number of leading zeros u in fraction c and shift c left by u units to produce the normalized fraction sum d = c × 10^u, with a nonzero leading digit. Update the exponent by s = r - u to produce the output exponent.
Example:
c = 1.0324, u = -1 (i.e. a right shift)
d = 0.10324, s = r - u = 3 - (-1) = 4
C = 0.10324 × 10^4
• The above 4 steps can all be implemented with combinational logic circuits, and the 4 stages are:
1. Comparator / Subtractor
2. Shifter
3. Fixed-Point Adder
4. Normalizer (leading-zero counter and shifter)
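The four stages can be sketched behaviorally in software; this is not the hardware design, just the per-stage transformation applied to the decimal example above.

```python
# Behavioral sketch of the 4-stage floating-point adder on decimal operands.

def fp_add(a, p, b, q):
    # Stage 1: comparator/subtractor -- r = max(p, q), t = |p - q|
    r, t = max(p, q), abs(p - q)
    # Stage 2: shifter -- align the fraction with the smaller exponent
    if p < q:
        a /= 10 ** t
    else:
        b /= 10 ** t
    # Stage 3: fixed-point adder
    c = a + b
    # Stage 4: normalizer -- bring the fraction into [0.1, 1)
    s = r
    while c >= 1:          # overflow: shift right, bump the exponent
        c, s = c / 10, s + 1
    while 0 < c < 0.1:     # leading zeros: shift left, drop the exponent
        c, s = c * 10, s - 1
    return c, s

d, s = fp_add(0.9504, 3, 0.8200, 2)   # the worked example: 0.10324 x 10^4
```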
Classification of Pipeline Processors
• There are various schemes for classifying pipeline processors.
• Two important schemes are:
1. Handler's Classification
2. Li and Ramamurthy's Classification
Handler’s Classification
• Based on the level of processing, pipelined processors can be classified into:
1. Arithmetic pipelining
2. Instruction pipelining
3. Processor pipelining
Arithmetic Pipelining
• The arithmetic logic units of a computer can be segmented for pipelined operations in various data formats.
• Example: Star-100
– It has two pipelines where arithmetic operations are performed.
– First: floating-point adder and multiplier.
– Second: multifunctional, handling all scalar instructions, with a floating-point adder, multiplier and divider.
– Both pipelines are 64-bit and can be split into four 32-bit pipelines at the cost of precision.
Instruction Pipelining
• The execution of a stream of instructions can be pipelined by overlapping the execution of the current instruction with the fetch, decode and operand fetch of subsequent instructions.
• It is also called instruction look-ahead.
• The organization of the 8086 into a separate BIU and EU allows the fetch and execute cycles to overlap.
Processor Pipelining
• This refers to the processing of the same data stream by a cascade of processors, each of which performs a specific task.
• The data stream passes through the first processor, with results stored in a memory block that is also accessible by the second processor.
• The second processor then passes the refined results to the third, and so on.
Li and Ramamurthy's Classification
• According to pipeline configurations and control strategies, Li and Ramamurthy classify pipelines under three schemes:
– Unifunctional vs. multifunctional pipelines
– Static vs. dynamic pipelines
– Scalar vs. vector pipelines
Unifunction v/s Multi-function Pipeline
Unifunctional Pipelines
• A pipeline unit with a fixed and dedicated function is called unifunctional.
• Example: Cray-1 (supercomputer, 1976)
• It has 12 unifunctional pipelines, described in four groups:
– Address Functional Units:
• Address Add Unit
• Address Multiply Unit
– Scalar Functional Units
• Scalar Add Unit
• Scalar Shift Unit
• Scalar Logical Unit
• Population/Leading Zero Count Unit
– Vector Functional Units
• Vector Add Unit
• Vector Shift Unit
• Vector Logical Unit
– Floating Point Functional Units
• Floating Point Add Unit
• Floating Point Multiply Unit
• Reciprocal Approximation Unit
Multifunctional Pipelines
• A multifunction pipe may perform different functions, either at different times or at the same time, by interconnecting different subsets of stages in the pipeline.
• Example: TI-ASC (supercomputer, 1973)
• It has four multifunction pipeline processors, each of which is reconfigurable for a variety of arithmetic or logic operations at different times.
• Its central processor is comprised of nine units:
– one instruction processing unit (IPU),
– four memory buffer units, and
– four arithmetic units.
• Thus it provides four parallel execution pipelines below the IPU.
• Any mixture of scalar and vector instructions can be executed simultaneously in the four pipes.
Static Vs Dynamic Pipeline
Static Pipeline
• It may assume only one functional configuration at a time.
• It can be either unifunctional or multifunctional.
• Static pipelines are preferred when instructions of the same type are to be executed continuously.
• A unifunctional pipe must be static.
Dynamic Pipeline
• It permits several functional configurations to exist simultaneously.
• A dynamic pipeline must be multifunctional.
• The dynamic configuration requires more elaborate control and sequencing mechanisms than static pipelining.
Scalar Vs Vector Pipeline
Scalar Pipeline
• It processes a sequence of scalar operands, typically under the control of a DO loop.
• Instructions in a small DO loop are often prefetched into the instruction buffer.
• The required scalar operands are moved into a data cache to continuously supply the pipeline with operands.
• Example: IBM System/360 Model 91
• In this computer, buffering plays a major role.
• Instruction fetch buffering:
– provides the capacity to hold program loops of meaningful size;
– upon encountering a loop which fits, the buffer locks onto the loop, and subsequent branching requires less time.
• Operand fetch buffering:
– provides a queue into which storage can dump operands and from which execution units can fetch operands;
– this improves operand fetching for storage-to-register and storage-to-storage instruction types.
Vector Pipelines
• They are specially designed to handle vector instructions over vector operands.
• Computers having vector instructions are called vector processors.
• The design of a vector pipeline is expanded from that of a scalar pipeline.
• The handling of vector operands in vector pipelines is under firmware and hardware control.
• Example: Cray-1
Linear pipeline (Static & Unifunctional)
• In a linear pipeline, data flows from one stage to the next, every stage is used exactly once per computation, and the pipeline serves one functional evaluation.
Nonlinear Pipeline
• In the floating-point adder, stages (2) and (4) both need a shift register.
• We can share the same shift register, and then there will be only 3 stages.
• Then we need a feedback path from the third stage to the second stage.
• Further, the same pipeline can then also be used to perform fixed-point addition.
• A pipeline with feed-forward and/or feedback connections is called nonlinear.
Example: 3-stage nonlinear pipeline
• It has 3 stages Sa, Sb and Sc, separated by latches.
• Multiplexers (the crossed circles) take more than one input and pass one of the inputs to their output.
• The output of each stage is tapped and used for feedback and feed-forward connections.
• The above pipeline can perform a variety of functions.
• Each functional evaluation can be represented by a particular sequence of usage of the stages. Some examples are:
– Sa, Sb, Sc
– Sa, Sb, Sc, Sb, Sc, Sa
– Sa, Sc, Sb, Sa, Sb, Sc
• Each functional evaluation can be represented using a diagram called a Reservation Table (RT).
• It is the space-time diagram of a pipeline corresponding to one functional evaluation:
– X axis: time units
– Y axis: stages
• For the sequence Sa, Sb, Sc, Sb, Sc, Sa, called function A, we have:
Time:  0   1   2   3   4   5
Sa:    A   .   .   .   .   A
Sb:    .   A   .   A   .   .
Sc:    .   .   A   .   A   .
• For the sequence Sa, Sc, Sb, Sa, Sb, Sc, called function B, we have:

Time:  0   1   2   3   4   5
Sa:    B   .   .   B   .   .
Sb:    .   .   B   .   B   .
Sc:    .   B   .   .   .   B
• After starting a function, the stages need to be reserved in the corresponding time units.
• Each function supported by a multifunction pipeline is represented by a different RT.
• The time taken for a functional evaluation, in units of the clock period, is the compute time (for both A and B it is 6).
• Markings in the same row mean the stage is used more than once.
• Markings in the same column mean more than one stage is in use at the same time.
• The hardware of a multifunction pipeline should be reconfigurable.
• A multifunction pipeline can be static or dynamic.
• Static:
– Initially configured for one functional evaluation.
– For another function, the pipeline needs to be drained and reconfigured.
– You cannot have two inputs of different functions at the same time.
• Dynamic:
– Can carry out different functional evaluations at the same time.
– It is difficult to control, as we need to be sure that there is no conflict in the usage of the stages.
Principle of Designing Pipeline Processors
Instruction Prefetch and Branch Handling
• The instructions in computer programs can be classified into 4 types:
– Arithmetic/load operations (60%): these require one or two operand fetches, and the execution of different operations requires a different number of pipeline cycles.
– Store-type instructions (15%): these require a memory access to store the data.
– Branch-type instructions (5%): these correspond to an unconditional jump.
– Conditional branch type (yes: 12%, no: 8%): the yes path requires the calculation of the new address, while the no path proceeds to the next sequential instruction.
• Arithmetic/load and store instructions do not alter the execution order of the program.
• Branch instructions and interrupts cause some damaging effects on the performance of pipelined computers.
Interrupts
• When instruction I is being executed, the occurrence of an interrupt postpones instruction I+1 until the ISR has been serviced.
• There are two types of interrupt:
– Precise: caused by illegal operation codes; can be detected at the decoding stage.
– Imprecise: caused by faults from the storage, address and execution functions.
• Precise: since decoding is the first stage, instruction I prohibits I+1 from entering the pipeline, and all preceding instructions are executed before the ISR.
• Imprecise: no new instructions are allowed in, and all incomplete instructions, whether they precede or follow I, are executed before the ISR.
Interrupt Handling Example: Cray-1
• The interrupt system is built around an exchange package.
• When an interrupt occurs, the Cray-1 saves 8 scalar registers, 8 address registers, the program counter and the monitor flags.
• These are packed into 16 words and swapped with a block whose address is specified by a hardware exchange address register.
• Since the exchange package does not hold all state information, the software interrupt handler has to store the remaining state.
In general, the higher the percentage of branch-type instructions in a program, the slower the program will run on a pipeline processor.
Estimation of the effect of branching on an n-segment instruction pipeline
• Consider an instruction cycle of n pipeline clock periods.
• Let
– p be the probability of a conditional branch (20%), and
– q be the probability that a branch is successful (12% out of the 20%, i.e. q = 12/20 = 0.6).
• Suppose there are m instructions.
• Then the number of instructions causing successful branches is m·p·q (m × 0.2 × 0.6).
• A delay of (n - 1)/n of an instruction cycle is required for each successful branch to flush the pipeline.
• Thus, the total time required for m instructions is
(1/n)[n + (m - 1)] + m·p·q·(n - 1)/n = [n + (m - 1) + m·p·q·(n - 1)]/n instruction cycles
• As m becomes large, the average number of instructions per instruction cycle is
lim (m→∞) m·n / [n + (m - 1) + m·p·q·(n - 1)] = n / [1 + p·q·(n - 1)]
• When p = 0, this measure reduces to n, which is the ideal.
• In reality, it is always less than n.
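The branching estimate above is easy to evaluate numerically; a sketch, with the limit behavior checked against the closed form:

```python
# Average number of instructions completed per instruction cycle for
# m instructions on an n-segment pipeline, with branch probabilities p, q.

def avg_instr_per_cycle(n, p, q, m):
    """m*n / [n + (m - 1) + m*p*q*(n - 1)]"""
    return m * n / (n + (m - 1) + m * p * q * (n - 1))

# As m grows this tends to n / (1 + p*q*(n - 1)); with p = 0 it tends to n.
```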
Solution: Multiple Prefetch Buffers
• Buffers can be used to match the instruction fetch rate to the pipeline consumption rate:
1. Sequential buffers: for in-sequence pipelining.
2. Target buffers: hold instructions from a branch target (for out-of-sequence pipelining).
• A conditional branch causes both the sequential and the target buffer to fill; based on the condition, one is selected and the other is discarded.
Data Buffering and Busing Structures
Speeding up of pipeline segments
• The processing speeds of pipeline segments are usually unequal.
• Consider the example given below:
• If T1 = T3 = T and T2 = 3T, then S2 becomes the bottleneck and we need to remove it.
• How? One method is to subdivide the bottleneck; two possible subdivisions are shown in the figures.
• If the bottleneck is not subdivisible, we can duplicate S2 in parallel.
• Control and synchronization are more complex for parallel segments.
Data Buffering
• Instruction and data buffering provides a continuous flow to the pipeline units.
• Example: TI-ASC
• This system uses a memory buffer unit (MBU) which:
– supplies the arithmetic units with a continuous stream of operands, and
– stores results back into memory.
• The MBU has three double buffers X, Y and Z (one octet per buffer): X and Y for input and Z for output.
• This supports pipeline processing at a high rate and alleviates the bandwidth mismatch between memory and the arithmetic pipelines.
• In the TI-ASC, once an instruction dependency is recognized, an update capability is provided by transferring the contents of the Z buffer to the X or Y buffer.
Internal Forwarding and Register Tagging
• Internal forwarding: replacing unnecessary memory accesses by register-to-register transfers.
• Register tagging: the use of tagged registers for exploiting concurrent activities among multiple ALUs.
• Memory access is slower than register-to-register operations.
• Performance can be enhanced by eliminating unnecessary memory accesses.
• This concept can be explored in 3 directions:
1. Store–load forwarding
2. Load–load forwarding
3. Store–store forwarding
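The first direction, store-load forwarding, can be illustrated with a toy peephole pass: a LOAD that reads the address written by the immediately preceding STORE becomes a register-to-register MOVE. The instruction encoding here is invented for the sketch.

```python
# Toy store-load forwarding: eliminate a memory read that immediately
# follows a store to the same address.

def forward(instrs):
    """instrs: list of (op, destination, source) triples."""
    out = []
    for op, dst, src in instrs:
        prev = out[-1] if out else None
        if op == "LOAD" and prev and prev[0] == "STORE" and prev[1] == src:
            # The value is still in the register the STORE read from.
            out.append(("MOVE", dst, prev[2]))
        else:
            out.append((op, dst, src))
    return out

code = [("STORE", "M", "R1"), ("LOAD", "R2", "M")]
```

Load-load and store-store forwarding follow the same pattern: detect two consecutive accesses to the same location and keep only the one that matters.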
Register Tagging
Example : IBM Model 91 : Floating Point Execution Unit
• The floating point execution unit consists of :
– Data registers
– Transfer paths
– Floating Point Adder Unit
– Multiply-Divide Unit
– Reservation stations
– Common Data Bus
• There are 3 reservation stations for the adder, named A1, A2 and A3, and 2 for the multiplier, named M1 and M2.
• Each station has source and sink registers along with their tag and control fields.
• The stations hold operands for the next execution.
• 3 store data buffers (SDBs) and 4 floating-point registers (FLRs) are tagged.
• Busy bits in the FLRs indicate the dependence of instructions in subsequent execution.
• The Common Data Bus (CDB) transfers operands.
• There are 11 units that can supply information to the CDB: 6 FLBs, 3 adder stations and 2 multiply/divide stations.
• The tags for these units are:

Unit   Tag     Unit   Tag
FLB1   0001    ADD1   1010
FLB2   0010    ADD2   1011
FLB3   0011    ADD3   1100
FLB4   0100    M1     1000
FLB5   0101    M2     1001
FLB6   0110
• Internal forwarding can be achieved with the tagging scheme on the CDB.
• Example: let F refer to an FLR, let FLBi stand for the i-th FLB, and let their contents be (F) and (FLBi). Consider the instruction sequence
ADD F, FLB1    F ← (F) + (FLB1)
MPY F, FLB2    F ← (F) × (FLB2)
• During the addition:
– the busy bit of F is set to 1,
– the contents of F and FLB1 are sent to adder station A1, and
– the tag of F is set to 1010 (the tag of the adder).
• Meanwhile, the decode of MPY reveals that F is busy, so:
– the source tag of M1 is set to 1010 (the tag of the adder),
– F changes its tag to 1000 (the tag of the multiplier), and
– the contents of FLB2 are sent to M1.
• When the addition is done, the CDB finds (by tag match) that the result should be sent to M1.
• The multiplication is done when both operands are available.
Hazard Detection and Resolution
• Hazards are caused by resource-usage conflicts among various instructions.
• They are triggered by inter-instruction dependencies.
Terminologies:
• Resource objects: the set of working registers, memory locations and special flags.
• Data objects: the contents of resource objects.
• Each instruction can be considered a mapping from a set of data objects to a set of data objects.
• Domain D(I): the set of resource objects whose data objects may affect the execution of instruction I.
• Range R(I): the set of resource objects whose data objects may be modified by the execution of instruction I.
• An instruction reads from its domain and writes into its range.
• Consider the execution of instructions I and J, where J appears immediately after I.
• There are 3 types of data-dependent hazards:
1. RAW (Read After Write)
The necessary condition for this hazard is
R(I) ∩ D(J) ≠ ∅
Example:
I1: LOAD r1, a
I2: ADD r2, r1
I2 cannot be executed correctly until r1 is loaded.
Thus I2 is RAW-dependent on I1.
2. WAW (Write After Write)
The necessary condition is
R(I) ∩ R(J) ≠ ∅
Example:
I1: MUL r1, r2
I2: ADD r1, r4
Here I1 and I2 write to the same destination, hence they are said to be WAW-dependent.
3. WAR (Write After Read)
The necessary condition is
D(I) ∩ R(J) ≠ ∅
Example:
I1: MUL r1, r2
I2: ADD r2, r3
Here I2 has r2 as its destination while I1 uses it as a source, hence they are WAR-dependent.
• Hazards can be detected in the fetch stage by comparing the domain and range of the incoming instruction with those of the instructions in the pipe.
• Once a hazard is detected, there are two methods:
1. Generate a warning signal to prevent the hazard.
2. Allow the incoming instruction through the pipe and distribute the detection to all pipeline stages.
Job Sequencing and Collision Prevention
• Consider the reservation table given below, with an initiation A made at t = 1:

Time:  1   2   3   4   5   6
Sa:    A   .   .   .   .   A
Sb:    .   A   .   A   .   .
Sc:    .   .   A   .   A   .

• Consider a next initiation made at t = 2:

Time:  1   2   3   4   5   6   7   8
Sa:    A1  A2  .   .   .   A1  A2  .
Sb:    .   A1  A2  A1  A2  .   .   .
Sc:    .   .   A1  A2  A1  A2  .   .

• The second initiation fits easily into the reservation table.
• Now consider the case when the first initiation is made at t = 1 and the second at t = 3:

Time:  1   2   3   4     5     6   7   8
Sa:    A1  .   A2  .     .     A1  .   A2
Sb:    .   A1  .   A1A2  .     A2  .   .
Sc:    .   .   A1  .     A1A2  .   A2  .

• Here the markings A1 and A2 fall in the same stage-time unit; this is called a collision, and it must be avoided.
Terminologies
• Latency: the time difference between two initiations, in units of the clock period.
• Forbidden latency: a latency that results in a collision.
• Forbidden latency set: the set of all forbidden latencies.
• Considering initiations A1, A2, … made in every successive cycle:

Time:  1   2   3   4     5     6     7     8     9   10  11
Sa:    A1  A2  A3  A4    A5    A6A1  A2    A3    A4  A5  A6
Sb:    .   A1  A2  A1A3  A2A4  A3A5  A4A6  A5    A6  .   .
Sc:    .   .   A1  A2    A1A3  A2A4  A3A5  A4A6  A5  A6  .

• The forbidden latencies are 2 and 5.
Shortcut Method of Finding Latencies
• Take the distances between marks within each row of the RT and form their union:
Forbidden latency set F = {5} ∪ {2} ∪ {2} = {2, 5}
• Latency sequence: a sequence of latencies between successive initiations.
• For an RT, the number of valid initiation sequences and latencies is infinite.
• Latency cycle: among the infinitely many possible latency sequences, the periodic ones are significant, e.g. {1, 3, 3, 1, 3, 3, …}. The subsequence that repeats itself is called the latency cycle, e.g. (1, 3, 3).
• Period of the cycle: the sum of the latencies in a latency cycle (1 + 3 + 3 = 7).
• Average latency (AL): the average taken over the latency cycle (AL = 7/3 ≈ 2.33).
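The shortcut method is exactly a union of within-row mark distances, which can be sketched directly; the RT encoding (stage name to list of used time units) is this sketch's own convention:

```python
# Forbidden latencies of a reservation table: the union, over all stages,
# of the distances between marks in the same row.

def forbidden_latencies(rt):
    """rt maps a stage name to the list of time units in which it is used."""
    forbidden = set()
    for times in rt.values():
        forbidden |= {abs(a - b) for a in times for b in times if a != b}
    return forbidden

# The 3-stage RT used in this section: Sa at t=1,6; Sb at t=2,4; Sc at t=3,5.
rt = {"Sa": [1, 6], "Sb": [2, 4], "Sc": [3, 5]}
```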
• To design a pipeline, we need a control strategy that maximizes the throughput (the number of results per unit time).
• Maximizing throughput means minimizing the average latency.
• A latency sequence that is aperiodic in nature is impractical to implement.
• Thus the design problem is to arrive at a latency cycle having the minimal average latency (MAL).
State Diagram
• The initial collision vector (ICV) is a binary vector formed from F such that
C = (Cn … C2 C1), where Ci = 1 if i ∈ F and Ci = 0 otherwise.
• Thus in our example,
F = {2, 5}
C = (1 0 0 1 0)
• The procedure is as follows:
1. Start with the ICV.
2. For each unprocessed state, and for each bit i in its collision vector CV that is 0, do the following:
a. Shift CV right by i bits.
b. Drop the i rightmost bits.
c. Append zeros to the left.
d. Logically OR the result with the ICV.
e. If step (d) results in a new state, form a new node for this state and join it to the node of CV by an arc with the marking i.
3. Continue this shifting process until no more new states can be generated.
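The procedure above can be sketched with collision vectors encoded as integers (bit i-1 set means latency i is forbidden); the function name and representation are this sketch's own choices.

```python
# Build the collision-vector state diagram from the forbidden latency set.

def build_state_diagram(forbidden, n):
    """n is the vector length; states map latency -> successor state."""
    icv = sum(1 << (i - 1) for i in forbidden)
    states, work = {icv: {}}, [icv]
    while work:
        cv = work.pop()
        for i in range(1, n + 1):
            if cv & (1 << (i - 1)):
                continue                 # latency i collides from this state
            nxt = (cv >> i) | icv        # shift right by i, then OR with ICV
            states[cv][i] = nxt
            if nxt not in states:
                states[nxt] = {}
                work.append(nxt)
    return icv, states

icv, states = build_state_diagram({2, 5}, 5)   # ICV = 10010
```

For F = {2, 5} this yields three states, 10010, 11011 and 10011, whose arcs contain the simple cycles discussed next.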
• The state with all zeros has a self-loop which corresponds to an empty pipeline; it is always possible to wait an indefinite number of cycles, giving latency cycles of the form (7), (8), (9), (10), etc.
• Simple cycle: a latency cycle in which each state is encountered only once.
• Complex cycle: consists of more than one simple cycle in it.
• It is enough to look for simple cycles.
• Greedy cycle: a simple cycle in which each latency contained in the cycle is the minimal latency (outgoing arc) from a state in the cycle.
• A good task-initiation sequence should include the greedy cycle.
• Here the simple cycles are (3), (5), (1, 3, 3), (4, 3) and (4), and the greedy cycle is (1, 3, 3).
• In the above example, the cycle that offers the MAL is (1, 3, 3) (MAL = (1 + 3 + 3)/3 ≈ 2.33).
• The task initiation sequence with the MAL (initiations A1, A2, A5, A8 at t = 1, 2, 5, 8) is given below:

Time:  1   2   3   4   5   6   7   8   9   10  11  12  13
Sa:    A1  A2  .   .   A5  A1  A2  A8  .   A5  .   .   A8
Sb:    .   A1  A2  A1  A2  A5  .   A5  A8  .   A8  .   .
Sc:    .   .   A1  A2  A1  A2  A5  .   A5  A8  .   A8  .
Superscalar Processors
• Scalar processors execute one instruction per cycle; in a superscalar processor, multiple instruction pipelines are used.
• Purpose: to exploit more instruction-level parallelism in user programs.
• Only independent instructions can be executed in parallel.
• In the fundamental structure (m = 3), the instruction decoding and execution resources are increased m-fold.
• Example: a dual-pipeline superscalar processor.
– It can issue two instructions per cycle.
– There are two pipelines with four processing stages: fetch, decode, execute and store.
– The two instruction streams come from a single I-cache.
– Assume each stage requires one cycle, except the execution stage.
– The four functional units of the execution stage are:
Functional Unit   Number of stages
Adder             2
Multiplier        3
Logic             1
Load              1
• Functional units are shared on a dynamic basis.
• A look-ahead window is used for out-of-order instruction issue.
Superscalar Performance
• The time required by the scalar base machine is
T(1,1) = k + N - 1
• The ideal execution time required by an m-issue superscalar machine is
T(m,1) = k + (N - m)/m
where k is the time required to execute the first m instructions (through the k pipeline stages) and (N - m)/m is the time required to execute the remaining (N - m) instructions, m per cycle.
• The ideal speedup of the superscalar machine is
S(m,1) = T(1,1)/T(m,1) = m(N + k - 1) / (N + m(k - 1))
• As N → ∞, the speedup S(m,1) → m.
Superpipeline Processors
• In a superpipelined processor of degree n, the pipeline cycle time is 1/n of the base cycle.
• The time to execute N instructions on a superpipelined machine of degree n with k stages is
T(1,n) = k + (N - 1)/n
• The speedup is
S(1,n) = T(1,1)/T(1,n) = n(k + N - 1) / (nk + N - 1)
• As N → ∞, S(1,n) → n.
Superpipelined Superscalar Processors
• This machine executes m instructions every cycle with a pipeline cycle of 1/n of the base cycle.
• The time taken to execute N independent instructions on a superpipelined superscalar machine of degree (m, n) is
T(m,n) = k + (N - m)/(mn)
• The speedup over the base machine is
S(m,n) = T(1,1)/T(m,n) = mn(k + N - 1) / (mnk + N - m)
• As N → ∞, S(m,n) → mn.
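All three timing models are special cases of the degree-(m, n) formula, with the scalar base machine as degree (1, 1); a sketch:

```python
# T(m, n) = k + (N - m) / (m * n); T(1, 1) reduces to k + N - 1.

def t_exec(k, N, m=1, n=1):
    """Execution time for N instructions on a degree-(m, n) machine."""
    return k + (N - m) / (m * n)

def speedup(k, N, m, n):
    """S(m, n) = T(1, 1) / T(m, n); tends to m * n as N grows."""
    return t_exec(k, N) / t_exec(k, N, m, n)
```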
Systolic Architecture
• Conventional architectures operate with load and store operations on memory.
• This requires many memory references, which slows down the system, as shown below.
• In systolic processing, the data to be processed flows through various operation stages and is finally put in memory, as shown below.
• The basic architecture is made up of processing elements (PEs) that are simple and identical in behavior at all instants.
• Each PE may have some registers and an ALU.
• PEs are interlinked in a manner dictated by the requirements of the specific algorithm, e.g. 2-D meshes, hexagonal arrays, etc.
• PEs at the boundary of the structure are connected to memory.
• Data picked up from memory is circulated among the PEs which require it in a rhythmic manner, and the result is fed back to memory; hence the name systolic.
Example: Multiplication of two n × n matrices
• Every element of the input is picked up n times from memory, as it contributes to n elements of the output.
• To reduce this memory traffic, a systolic architecture ensures that each element is fetched only once.
• Consider an example where n = 3.
• Conventional method, O(n³):
For I = 1 to N
For J = 1 to N
For K = 1 to N
C[I,J] = C[I,J] + A[I,K] * B[K,J];
• A systolic array can run this in O(n) time, but to do so it needs n × n processing units; in our example, n² = 9.
• For systolic processing, the input data needs to be modified (skewed) as follows:
and finally stagger the data sets for input.
At every tick of the global system clock, data is passed into each processor from two different directions, then multiplied and accumulated in a register.
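The staggering can be sketched in software: with the skewed feeds, the operand pair (A[i][k], B[k][j]) reaches PE (i, j) at tick t = i + j + k, so the whole product completes after 3n - 2 ticks. This is a simulation of the timing, not a hardware description.

```python
# Software sketch of a staggered n x n systolic matrix multiply.

def systolic_matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]      # one accumulator per PE (i, j)
    for t in range(3 * n - 2):           # ticks until the wavefront drains
        for i in range(n):
            for j in range(n):
                k = t - i - j            # operand pair arriving at PE (i, j)
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C
```

Each A and B element is read once per PE that needs it as it flows through, rather than being re-fetched from memory n times.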
Example: SAMBA, a Systolic Accelerator for Molecular Biological Applications.
This systolic array contains 128 processors spread over 32 full-custom VLSI chips. One chip houses 4 processors, and one processor computes 10 million matrix cells per second.
Very Long Instruction Word (VLIW) Architecture
VLIW Machine
• It consists of many functional units connected to a large central register file.
• Each functional unit has two read ports and one write port.
• The register file must have enough bandwidth to sustain the operand usage rate of the functional units.
VLIW characteristics
• A VLIW word contains multiple primitive instructions that can be executed in parallel.
• The compiler packs a number of primitive, independent instructions into a very long instruction word.
• The compiler must guarantee that the primitive instructions grouped together are independent, so that they can be executed in parallel.
• Example of a single VLIW instruction: F=a+b; c=e/g; d=x&y; w=z*h;
(Figure: a single VLIW instruction word whose four fields, F=a+b, c=e/g, d=x&y and w=z*h, are dispatched to four processing units (PUs) in parallel.)
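The compiler's packing step can be illustrated with a toy greedy pass: each primitive instruction, modeled as a destination plus a set of sources, joins the current word only if it is independent (no RAW, WAW or WAR relation) of everything already in it. The encoding and the `pack_vliw` helper are this sketch's own inventions, not a real VLIW compiler.

```python
# Toy greedy VLIW packing of (dest, {sources}) primitive instructions.

def pack_vliw(instrs, width=4):
    words, word = [], []
    for dst, srcs in instrs:
        independent = all(
            dst != d and dst not in s and d not in srcs
            for d, s in word
        )
        if independent and len(word) < width:
            word.append((dst, srcs))
        else:
            words.append(word)          # close the word, start a new one
            word = [(dst, srcs)]
    if word:
        words.append(word)
    return words

# The four independent operations from the example fit in one word:
prog = [("F", {"a", "b"}), ("c", {"e", "g"}),
        ("d", {"x", "y"}), ("w", {"z", "h"})]
```

A dependent pair, by contrast, is forced into two successive words, which is exactly the wasted-slot problem noted under the drawbacks below.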
VLIW Principles
1. The compiler analyzes dependence of all instructions among sequential code and
extracts as much parallelism as possible.
2. Based on analysis, the compiler re-codes the sequential code in VLIW instruction
words.(One VLIW instruction word contains maximum 8 primitive instructions)
3. Finally, the VLIW hardware:
– Fetches the VLIWs from cache,
– Decodes them,
– Dispatches the independent primitive instructions to the corresponding functional units, and
– Executes them.
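The packing step in principle 2 can be sketched in a few lines. The following is a toy model (the function `pack_vliw` and its tuple encoding are invented for this illustration, and the independence test is deliberately conservative: real compilers allow two instructions in a word to read the same register).

```python
def pack_vliw(instrs, width=8):
    """Greedy sketch of VLIW packing.  Each primitive instruction is
    (dest, (src, src, ...)), given in program order.  Instructions
    share a word only if they touch disjoint registers; anything
    conflicting is deferred to a later word, preserving dependences."""
    words, pending = [], list(instrs)
    while pending:
        word, used, blocked, deferred = [], set(), set(), []
        for dest, srcs in pending:
            regs = {dest, *srcs}
            if len(word) < width and not regs & (used | blocked):
                word.append((dest, srcs))
                used |= regs
            else:
                deferred.append((dest, srcs))
                blocked |= regs   # later uses of these regs must wait
        words.append(word)
        pending = deferred
    return words

# the example word from the text, plus one dependent instruction
prog = [("F", ("a", "b")), ("c", ("e", "g")),
        ("d", ("x", "y")), ("w", ("z", "h")),
        ("t", ("F", "c"))]        # t reads F and c -> goes to the next word
words = pack_vliw(prog)
```

The four independent operations land in one word; the dependent one is pushed to the next, which is exactly the guarantee the compiler must provide.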
Advantages of VLIW architecture
• Reduced hardware complexity: parallelism among the primitive instructions is resolved by the compiler, not by the hardware
• A higher clock rate is possible because of the reduced complexity
Drawbacks of VLIW Architecture
• The compiler has to be aware of technology-dependent parameters such as latencies and repetition rates. This prevents the same compiler from being used across a family of VLIW processors.
• Memory space and bandwidth are wasted when some execution units (EUs) are unused
• Performance depends on how well the compiler packs the VLIW words.
Data Flow Computers
• They are based on the concept of data-driven computation.
• Conventional computers, by contrast, are under program flow control.
Features of Control Flow Model
• Data is passed between instructions via shared memory
• Flow of control is implicitly sequential
• Program counters are used to sequence the execution of instructions
Features of Data Flow Model
• Intermediate or final results are passed directly as data tokens between instructions
• There is no concept of shared data storage
• Program sequencing is constrained only by data dependence among instructions
Data Flow Graph
A data flow graph is a directed graph whose nodes correspond to operators and whose arcs are pointers forwarding data tokens.
The graph expresses the sequencing constraints among instructions.
• In a DFC, the machine-level program is represented by DFGs.
• The firing rules of instructions are based on data availability.
• An operator of the schema is enabled when tokens are present on all of its input arcs.
• The enabled operator may fire at any time by
– removing the tokens on its input arcs,
– computing a value from the operands associated with the input tokens, and
– associating that value with the result token placed on its output arc.
• The result may be sent to more than one destination by means of a link.
Consider the following simple computation
input : a,b
y = (a+b)/x
x = (a*(a+b))+b
output: y,x
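This computation makes the data-driven idea concrete: y textually precedes x, yet x must be produced first. The sketch below (an illustrative interpreter, not a real machine model; the node names t1 and t2 are invented intermediate links) fires each operator as soon as tokens for all of its operands are available, so only data dependence, never textual order, constrains the sequencing.

```python
import operator

def run_dataflow(nodes, inputs):
    """Minimal dataflow interpreter: `nodes` maps a result name to
    (function, operand names).  Every operator whose operand tokens
    are all present is enabled and fires; the result token is then
    available to its consumers."""
    tokens = dict(inputs)                       # initial data tokens
    pending = dict(nodes)
    while pending:
        enabled = [n for n, (_, ops) in pending.items()
                   if all(o in tokens for o in ops)]
        if not enabled:
            raise RuntimeError("no operator enabled (cyclic dependence)")
        for n in enabled:                       # fire every enabled operator
            f, ops = pending.pop(n)
            tokens[n] = f(*(tokens[o] for o in ops))
    return tokens

# the computation above, one node per operator:
# t1 = a+b, t2 = a*t1, x = t2+b, y = t1/x
graph = {
    "t1": (operator.add,     ("a", "b")),
    "t2": (operator.mul,     ("a", "t1")),
    "x":  (operator.add,     ("t2", "b")),
    "y":  (operator.truediv, ("t1", "x")),
}
result = run_dataflow(graph, {"a": 2, "b": 3})
```

With a = 2 and b = 3 the operators fire in the order t1, t2, x, y, even though y is written first in the source.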
• The representation of conditionals and iterations requires additional types of links and actors.
They are as follows:
• Data link: data values pass through data links
• Control link: control tokens, each conveying a value of either TRUE or FALSE, are transmitted through control links
Actors
• Operator: removes the tokens on its input arcs, computes a value from the operands, and places the result token on its output arc.
• Decider: receives values from its input arcs, applies its associated predicate, and produces either a TRUE or a FALSE control token on its output arc.
• Boolean Operator: combines control tokens produced by deciders, allowing a decision to be built up from simpler decisions.
• Control tokens direct the flow of data tokens by means of T-gates, F-gates and merge actors.
• T-gate: passes the data token on its input arc to its output arc only when it receives a control token with the TRUE value on its control input.
• F-gate: passes the data token on its input arc to its output arc only when it receives a control token with the FALSE value on its control input.
• Merge Actor: has a true input, a false input and a control input. It passes to its output arc the data token from the input arc corresponding to the value of the control token received.
Example: Draw the data flow diagram corresponding to the following computation (xⁿ):
input x,n
y=1, i=n
while i > 0 do
begin y=y*x; i = i-1 end
z=y
output z
• The tokens carrying the FALSE value on the input arcs of the merge operators allow them to initiate.
• The decider emits a token carrying the TRUE value each time the execution of the loop body is required.
• When the firing of the decider yields a FALSE, the value of y is routed to the output link z.
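The token flow of this loop can be rendered sequentially as an ordinary function (an illustrative sketch, not machine code; comments mark which dataflow actor each step corresponds to):

```python
def power_dataflow(x, n):
    """Sequential rendering of the x**n dataflow loop above."""
    y, i = 1, n                  # initial tokens enter the merge actors'
                                 # FALSE inputs, letting them initiate
    while True:
        ctrl = i > 0             # decider: emits a TRUE or FALSE control token
        if not ctrl:
            return y             # F-gate routes y to the output link z
        y, i = y * x, i - 1      # loop body; T-gates recirculate the new
                                 # y and i tokens for the next iteration
```

Each pass around the `while` loop corresponds to one round trip of the y and i tokens through the merge actors, T-gates and loop-body operators.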
Data Flow Machine Architectures
Depending on the way data tokens are handled, DFCs can be divided into
– the Static Model
– the Dynamic Model
Static Data Flow Computer
• In this machine, data tokens are assumed to move along the arcs of the dataflow program graph to the operator nodes.
• A nodal operation is executed when all of its operands are present on its input arcs.
• Only one token is allowed to exist on any arc at any given time.
• This architecture is static because
– tokens are not labeled, and
– control tokens must be used for timing.
SDFC Example
• A data flow schema to be executed is stored in the memory of the processor.
• The memory is organized into instruction cells; each cell corresponds to one operator of the dataflow program.
• Each instruction cell is composed of 3 registers:
– the first holds the instruction
– the second and the third hold the operands
• The instruction specifies the operation to be performed and the address(es) of the register(s) to which the result of the operation is to be directed.
• When a cell contains an instruction and the necessary operands, it is enabled and transmits its content as an operation packet.
• The arbitration network directs each operation packet to an appropriate processing unit by decoding the instruction portion of the packet.
• The processing unit performs the desired function.
• The result of an operation leaves the processing unit as one or more data packets, each consisting of the computed value and the address of the register in memory to which the value is to be delivered.
• The distribution network accepts the data packets and uses the address to store the value of the operation.
• Many instruction cells may be enabled simultaneously; it is the task of the arbitration network to deliver operation packets to processing units efficiently and to queue them when necessary.
• Each arbitration unit passes one packet at a time, resolving conflicts.
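The cell / packet cycle above can be sketched as a small simulation. This is an illustrative model only (the class and function names are invented here, and a single loop stands in for the arbitration and distribution networks): a cell fires once both operand registers are filled, emits data packets, and the packets are routed back into other cells' operand registers.

```python
import operator

class InstructionCell:
    """Static-dataflow instruction cell: register 1 holds the operation
    and the destination addresses, registers 2 and 3 hold the operands.
    The cell is enabled once both operand registers are filled."""
    def __init__(self, op, dests):
        self.op, self.dests = op, dests   # dests: (cell name, slot) pairs
        self.operands = [None, None]
    def enabled(self):
        return None not in self.operands
    def fire(self):
        """Emit data packets (address, value); clearing the operand
        registers enforces at most one token per arc at a time."""
        a, b = self.operands
        self.operands = [None, None]
        return [(d, self.op(a, b)) for d in self.dests]

def run(cells):
    """Toy stand-in for the arbitration/distribution networks: fire
    every enabled cell and route its packets until nothing is enabled.
    Packets addressed outside the cell memory are collected as outputs."""
    outputs = {}
    fired = True
    while fired:
        fired = False
        for cell in cells.values():
            if cell.enabled():
                for (target, slot), value in cell.fire():
                    if target in cells:
                        cells[target].operands[slot] = value
                    else:
                        outputs[target] = value
                fired = True
    return outputs

# compute (3 + 4) * (3 - 4) with three instruction cells
cells = {
    "add": InstructionCell(operator.add, [("mul", 0)]),
    "sub": InstructionCell(operator.sub, [("mul", 1)]),
    "mul": InstructionCell(operator.mul, [("z", 0)]),
}
cells["add"].operands = [3, 4]
cells["sub"].operands = [3, 4]
print(run(cells))   # prints {'z': -7}
```

Note that `add` and `sub` are both enabled at the start and fire independently, mirroring how many instruction cells may be enabled simultaneously.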
Dynamic Dataflow Architecture
• It uses tagged tokens, so that more than one token can exist on an arc at a time.
• Tagging is achieved by attaching to each token a label which identifies the context of that token.
• Maximum parallelism can be exploited in this model.
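The tag-matching rule can be illustrated with a minimal matching store (a sketch with invented names, loosely in the spirit of tagged-token machines): a token is (tag, port, value), and an instance of a two-input node fires only when tokens carrying the SAME tag are present on all of its ports, so tokens from different loop iterations can share an arc without interfering.

```python
from collections import defaultdict

class MatchingStore:
    """Sketch of tagged-token matching for a node of fixed arity."""
    def __init__(self, arity):
        self.arity = arity
        self.waiting = defaultdict(dict)     # tag -> {port: value}
    def add(self, tag, port, value):
        """Store a token; return (tag, operands) when a full operand
        set for that tag is assembled, else None (still waiting)."""
        slot = self.waiting[tag]
        slot[port] = value
        if len(slot) == self.arity:          # all ports filled for this tag
            del self.waiting[tag]
            return tag, [slot[p] for p in sorted(slot)]
        return None

m = MatchingStore(2)
m.add(0, 0, "a")            # iteration 0, left operand: waits
m.add(1, 0, "x")            # iteration 1 token on the SAME arc: also waits
fire1 = m.add(1, 1, "y")    # completes iteration 1 first
fire0 = m.add(0, 1, "b")    # iteration 0 completes later, unharmed
```

Iteration 1 fires before iteration 0 even though both occupy the same arcs, which is exactly the extra parallelism a static (one-token-per-arc) machine cannot exploit.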