9 Advanced Processor Design

9.1 Pipelined Processors

9.1.1 Basic Concept

The key thing to grasp when considering pipelined processors is that we can have more than one instruction in-flight at a time: this is the concept of instruction-level parallelism (ILP) being put to practical use. To see why this might be the case, consider Table 9.1, which details an execution timeline for our original sequential data-path. In cycle t0 we fetch instruction #0, which is then decoded in cycle t1, executed in cycle t2 and finally writes its results in cycle t3; only then do we fetch instruction #1, in cycle t4, and so on.

Now imagine we split the sequential data-path into four pipeline stages, one each for fetch, decode, execute and write. A conceptual diagram of how this would look, which is used as a standard starting point for further discussion, is shown in Figure 39. With such a design the fetch stage, for example, can be fetching instruction #2 while the decode stage is dealing with instruction #1 and the execute stage is dealing with instruction #0, and so on. This is highlighted in Table 9.2. With all stages working in parallel with each other, we execute around four times as many instructions in the same period; clearly this depends on factors previously introduced, such as the requirement to keep the pipeline full, and on further complications we will introduce later.

Table 9.1 An execution timeline for a processor using a sequential data-path.

Table 9.2 An execution timeline for a processor using a pipelined data-path.

Figure 39 A conceptual diagram of a pipelined processor.

To make this scheme possible we need to design a pipelined data-path which splits the original data-path into stages, each of which can operate independently on a different instruction. Figure 40 describes the "classic" way to do this; the design is composed of five stages in total: the fetch, decode, execute and write stages already described, together with a dedicated memory access stage. There are several ways this basic design can be improved, but it offers a good, simple starting point. For brevity, we denote these stages FET, DEC, EXE, MEM and WRI. In the diagram, notice that the stages are separated by pipeline registers, just as in our basic pipelined circuit; we denote these registers FET−DEC, DEC−EXE, EXE−MEM and MEM−WRI. Each register holds several different values even though we have drawn it as a single block. When we need to, we subscript a pipeline register with the conventional register name from the sequential data-path: for example, FET−DEC_PC is the value of the program counter held in the pipeline register FET−DEC, and DEC−EXE_IR is the value of the instruction register held in DEC−EXE.

Figure 40 A 5-stage pipelined data-path.
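To make these pipeline registers concrete, the following C fragment sketches them as structures. This is only a minimal software model for illustration, not the hardware itself; the exact set of fields is an assumption based on the stage descriptions in Section 9.1.2 below. The field names mirror the subscript notation, so fet_dec.pc plays the role of FET−DEC_PC and dec_exe.ir that of DEC−EXE_IR.

#include <stdint.h>

/* Hypothetical model of the four pipeline registers; each holds
 * several values even though Figure 40 draws it as a single block. */
struct fet_dec_t {
    uint32_t ir;            /* the fetched instruction              */
    uint32_t pc;            /* the PC value of that instruction     */
} fet_dec;

struct dec_exe_t {
    uint32_t ir, pc;        /* instruction and PC, passed along     */
    uint32_t op_a, op_b;    /* operands read from GPR, or
                               sign-extended immediates             */
} dec_exe;

struct exe_mem_t {
    uint32_t ir;            /* instruction, passed along            */
    uint32_t result;        /* result produced by the ALU           */
} exe_mem;

struct mem_wri_t {
    uint32_t ir;            /* instruction; WRI uses it to find the
                               target register                      */
    uint32_t alu_result;    /* intermediate result from EXE         */
    uint32_t mem_result;    /* intermediate result of a load        */
} mem_wri;

When the pipeline advances, each stage copies its outputs into the register to its right; the model makes clear why, say, the write stage can recover its target register: the instruction travels down the pipeline alongside the values it produces.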
9.1.2 Control Strategy

By and large, the pipeline stages perform roles similar to the steps performed by the sequential data-path; they just operate in parallel. To expand on this, we describe the operation of each stage:

Fetch (FET) The fetch stage uses its dedicated memory interface to load the next instruction to be executed; it uses the address stored in the PC register, which is also located within the fetch stage. The PC is either incremented using an adder or, in the case of a taken branch, updated using values fed in from elsewhere. When the pipeline advances, the fetched instruction and the current value of PC are passed to the next stage by storing them in FET−DEC.

Decode (DEC) The decode stage takes the instruction from FET−DEC and decodes it to recover register addresses and immediate values, which it uses to generate operands for the execute stage. It does this by loading operands from GPR and by sign-extending any immediates. When the pipeline advances, these operands, together with the instruction and the current value of PC, are passed to the next stage by storing them in DEC−EXE. The act of moving instructions into phases of actual execution is sometimes called the issue of instructions; a design using this strategy is termed single-issue in the sense that only one instruction passes into the execution stage at a time.

Execute (EXE) Having taken opcode and operands from DEC−EXE, the execute stage fills a role similar to its role in the sequential data-path: it simply invokes the ALU to work on the operands and produce some result. Although the diagram shows a single unit, the ALU may notionally be split into two parts that operate at the same time: the main ALU performs arithmetic and logic operations while a secondary unit performs comparisons. The ALU is fed inputs from either the main operands in DEC−EXE or secondary operands such as the current PC value, which is also stored in DEC−EXE; multiplexers make this selection, which allows, for example, the PC to be incremented. When the pipeline advances, generated results are stored in EXE−MEM.

Memory (MEM) The memory access stage is simple. It takes the instruction stored in EXE−MEM and, if it is a memory access instruction (i.e., a load or a store), performs the corresponding operation. Notice that this requires the memory access stage to have a dedicated interface to the memory, and that the stage now generates two intermediate results which, along with the instruction, are stored in MEM−WRI when the pipeline advances. In this particular design, the memory access stage also feeds the branch condition and computed PC value back to the fetch stage should a branch be taken.

Write (WRI) Finally, the write stage takes any intermediate results which have been generated, either by the execution of some operation or by loading something from memory, and writes them into the general-purpose register file. Unlike most other signals in the pipeline, which are forward-facing, the write stage feeds signals back to the decode stage, where the registers are typically housed. Note that in the diagram, both the value and the instruction from MEM−WRI are fed back to the decode stage. Although this might seem odd, the instruction is required so that one can extract the target register operand and hence know which register to write the value into: the decode stage cannot know this because it will be operating on a different instruction at the time!

After the write stage has finished, i.e., execution of the instruction is complete, the instruction is said to have been retired. Note that with this scheme, strong processor and memory consistency are clearly guaranteed: instructions have to be retired in order, since no instruction can overtake another one in the pipeline, and there is no re-ordering of memory accesses.
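As a rough illustration of this control strategy as a whole, the following self-contained C sketch steps a toy model of the five stages cycle by cycle. The representation, an array of instruction numbers with -1 marking an empty stage (a bubble), is invented purely for this example; its output reproduces the overlap of Table 9.2 and shows why retirement is necessarily in order.

#include <stdio.h>

/* Toy cycle-by-cycle model of the 5-stage pipeline: stage[s] holds
 * the number of the instruction currently in stage s, or -1 if the
 * stage is empty (a bubble). */
enum { FET, DEC, EXE, MEM, WRI, STAGES };

int main(void)
{
    int stage[STAGES] = { -1, -1, -1, -1, -1 };
    int next = 0; /* number of the next instruction to fetch */

    for (int cycle = 0; cycle < 8; cycle++) {
        /* An instruction leaving the write stage is retired; since
         * no instruction can overtake another in the pipeline,
         * retirement is necessarily in fetch order. */
        if (stage[WRI] >= 0)
            printf("  instruction #%d retired\n", stage[WRI]);

        /* The pipeline advances: each stage hands its instruction
         * to the next via the pipeline registers... */
        for (int s = WRI; s > FET; s--)
            stage[s] = stage[s - 1];
        /* ...while the fetch stage brings a new one in-flight. */
        stage[FET] = next++;

        printf("cycle %d: FET=%d DEC=%d EXE=%d MEM=%d WRI=%d\n",
               cycle, stage[FET], stage[DEC], stage[EXE],
               stage[MEM], stage[WRI]);
    }
    return 0;
}

From cycle 4 onwards the pipeline is full: five instructions are in flight at once, and thereafter one instruction retires per cycle, in the order they were fetched.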
9.2 Vector Processors

9.2.1 Basic Concept

vector processor: an architecture and compiler model, popularized by supercomputers, in which high-level operations work on linear arrays of numbers.

In a scalar processor, the natural units of computation are single values held in registers. In a vector processor, one operates using registers that each contain p separate values. Consider a simplistic example where we want to execute a vector addition:

A_i ← B_i + C_i , for i ∈ {0, ..., p−1}

which we could implement in C using the function:

void add( int* A, int* B, int* C, int p )
{
  for( int i = 0; i < p; i++ )
  {
    A[ i ] = B[ i ] + C[ i ];
  }
}

where A, B and C are all arrays of, say, 32-bit integers. Often, this sort of operation is termed a pure vector operation in the sense that computation happens only component-wise: the i-th element of the result is generated using only the i-th elements of the inputs. On a scalar processor we could implement the loop as follows:

      GPR[2] ← 0
      GPR[3] ← p
loop: if GPR[2] ≥ GPR[3], PC ← exit
      GPR[4] ← MEM[&B + GPR[2]]
      GPR[5] ← MEM[&C + GPR[2]]
      GPR[6] ← GPR[4] + GPR[5]
      MEM[&A + GPR[2]] ← GPR[6]
      GPR[2] ← GPR[2] + 1
      PC ← loop
exit: ...

But imagine that instead of our standard register file GPR, in which each register stores a single 32-bit value, we had a vector register file where each register stores p such values. Let VR[i] denote the i-th vector register and VR[i][j] denote the j-th value, or subword, in the i-th register. Clearly we also need some vector instructions capable of operating on values held in the vector register file. Equipped as such, the whole loop collapses into a few instructions:

VR[4] ← MEM[&B + 0] ... MEM[&B + p−1]
VR[5] ← MEM[&C + 0] ... MEM[&C + p−1]
VR[6] ← VR[4] + VR[5]
MEM[&A + 0] ... MEM[&A + p−1] ← VR[6]
...

Of course, each vector instruction is hugely complex compared to the equivalent scalar instruction; in some sense, one might view a vector processor as an embodiment of CISC design. Important examples of this complexity are the vector add instruction, which uses VR[i] and VR[j] as source operands and produces a result in VR[k], essentially computing p additions, and the vector load and store instructions, which load or store p values from or to memory starting at some base address. One might classify the technique as using static scheduling, since the specification of parallelism is done at compile-time, and single-issue, since the processor issues one instruction per cycle (even though each instruction implies many similar operations). There are several important benefits that result:

1. A scalar loop over the vector operands contains branches (the loop test and the jump back to loop); the vector version eliminates them, so vector programs have fewer control dependencies.
2. Since parallelism within vector operations is explicit, less hardware is required to manage issues of data dependency and effectively exploit the parallelism.
3. The vector program captures the entire loop in a few instructions; the burden on the fetch and decode stages is therefore reduced.
4. Memory access patterns resulting from vector loads and stores are much more regular (e.g., vector elements are contiguous in memory).
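As a concrete model of these ideas, here is a minimal C sketch of a vector register file and the three kinds of vector instruction used above (load, add, store). The value p = 8, the register-file size and the helper names are assumptions made for this example rather than features of any particular vector ISA.

#include <stdint.h>

#define P 8 /* number of subwords per vector register (assumed) */

/* A vector register file: each register holds p separate values. */
static int32_t VR[8][P];

/* Vector load: VR[d] <- MEM[base + 0] ... MEM[base + p-1]. */
void vload(int d, const int32_t *base)
{
    for (int j = 0; j < P; j++)
        VR[d][j] = base[j];
}

/* Vector add: VR[k] <- VR[i] + VR[j], i.e., p component-wise
 * additions performed by a single instruction. */
void vadd(int k, int i, int j)
{
    for (int s = 0; s < P; s++)
        VR[k][s] = VR[i][s] + VR[j][s];
}

/* Vector store: MEM[base + 0] ... MEM[base + p-1] <- VR[s]. */
void vstore(int s, int32_t *base)
{
    for (int j = 0; j < P; j++)
        base[j] = VR[s][j];
}

/* The whole loop collapses into four "instructions". */
void add_vector(int32_t *A, const int32_t *B, const int32_t *C)
{
    vload(4, B);   /* VR[4] <- B[0..p-1]     */
    vload(5, C);   /* VR[5] <- C[0..p-1]     */
    vadd(6, 4, 5); /* VR[6] <- VR[4] + VR[5] */
    vstore(6, A);  /* A[0..p-1] <- VR[6]     */
}

Note how add_vector mirrors the four-instruction sequence above: the p-iteration scalar loop, with its per-iteration compare and branch, collapses into four straight-line vector operations.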