Computer Architecture
3rd stage
2-5-2012
Lec. (18)
College of Sciences for Women, Dept. of Computer Sciences
Year 2011-2012
9 Advanced Processor Design
9.1 Pipelined Processors
9.1.1 Basic Concept
The key thing to grasp when considering pipelined processors is that we can have more
than one instruction in flight at a time: this is the concept of ILP put to practical use.
To see why, consider Table 9.1, which details an execution timeline for our original
sequential data-path. In cycle t0 we fetch instruction #0, which is then decoded in
cycle t1, executed in cycle t2 and finally writes its results in cycle t3; only then do we
fetch instruction #1 in cycle t4, and so on. Now imagine we split the sequential data-path
into four pipeline stages, one each for fetch, decode, execute and write. A conceptual
diagram of how this would look, which is used as a standard starting point for further
discussion, is shown in Figure 39. With such a design the fetch stage, for example, can be
fetching instruction #2 while the decode stage is dealing with instruction #1 and the
execute stage is dealing with instruction #0, and so on. This is highlighted in Table 9.2.
With all stages working in parallel with each other, we execute around four times as many
instructions in the same period; clearly this depends on factors previously introduced,
such as the requirement to keep the pipeline full, and on further complications we will
introduce later.
Table 9.1 An execution timeline for a processor using a sequential data-path.

Stage      t0    t1    t2    t3    t4    t5    t6    t7
Fetch      #0                      #1
Decode           #0                      #1
Execute                #0                      #1
Write                        #0                      #1
Table 9.2 An execution timeline for a processor using a pipelined data-path.

Stage      t0    t1    t2    t3    t4    t5    t6
Fetch      #0    #1    #2    #3    #4    #5    #6
Decode           #0    #1    #2    #3    #4    #5
Execute                #0    #1    #2    #3    #4
Write                        #0    #1    #2    #3
Figure 39 A conceptual diagram of a pipelined processor.
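As a rough sanity check on the "around four times" claim (a standard back-of-the-envelope
calculation, not taken from the tables themselves): with k pipeline stages and n
instructions, and assuming the pipeline never stalls, the sequential design needs k × n
cycles while the pipelined design needs k + (n − 1) cycles, since one instruction completes
per cycle once the pipeline is full. For k = 4 and n = 100 this gives 400 cycles versus
103 cycles, a speedup of about 3.9, which approaches 4 as n grows.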
To make this scheme possible we need to design a pipelined data-path which splits the
original into stages, each of which can operate independently on a different instruction.
Figure 40 describes the "classic" way to do this; the design is composed of five stages in
total: the fetch, decode, execute and write already described, together with a dedicated
memory access stage. There are several ways this basic design can be improved, but it
offers a good, simple starting point. For brevity, we denote these stages FET, DEC, EXE,
MEM and WRI. In the diagram, notice that the stages are separated by pipeline registers
just like in our basic pipelined circuit; we denote these registers FET-DEC, DEC-EXE,
EXE-MEM and MEM-WRI. The registers hold several different values even though we have
drawn them as a single block. When we need to, we subscript the pipeline register with the
conventional register name from the sequential data-path. So, for example, FET-DEC_PC is
the value of the program counter register held in the pipeline register FET-DEC, and the
value of the instruction register at DEC-EXE is DEC-EXE_IR.
Figure 40 A 5-stage pipelined data-path.
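To make the register naming concrete, the following is a minimal C sketch of the four
pipeline registers; all field names (pc, ir and so on) are illustrative assumptions rather
than part of the original design, and the subscript notation FET-DEC_PC corresponds to a
field access such as fet_dec.pc.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical layout of the pipeline registers from Figure 40. Each
   struct models one register block; real hardware would hold these
   values in latches between the stages. */
typedef struct { uint32_t pc, ir; } fet_dec_t;                   /* FET-DEC */
typedef struct { uint32_t pc, ir, op_a, op_b, imm; } dec_exe_t;  /* DEC-EXE */
typedef struct { uint32_t ir, alu_out, store_val; } exe_mem_t;   /* EXE-MEM */
typedef struct { uint32_t ir, alu_out, mem_out; } mem_wri_t;     /* MEM-WRI */

int main(void)
{
    fet_dec_t fet_dec = { .pc = 0x100, .ir = 0 };
    /* FET-DEC_PC in the text's notation is simply fet_dec.pc here. */
    printf( "FET-DEC_PC = 0x%x\n", (unsigned) fet_dec.pc );
    return 0;
}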
9.1.2 Control Strategy
By and large the pipeline stages perform similar roles to the steps performed by the
sequential data-path; they just operate in parallel. To expand on this, we describe the
operation of each stage in turn (a code sketch at the end of this subsection ties the five
descriptions together):
Fetch (FET) The fetch stage uses its dedicated memory interface to load the next instruction
to be executed; it uses the address stored in the PC register, which is also located within
the fetch stage. The PC is either incremented using an adder or, in the case of a taken
branch, updated using values fed in from elsewhere. When the pipeline advances, the
instruction to be executed and also the current value of PC are passed to the next stage by
storing them in FET-DEC.
Decode (DEC) The decode stage takes the instruction from FET-DEC, decodes it to recover
register addresses and immediate values, and uses them to generate operands for the execute
stage. It does this by loading operands from GPR and by sign-extending any immediates.
When the pipeline advances, these operands, together with the instruction and the current
value of PC, are passed to the next stage by storing them in DEC-EXE. The act of moving
instructions into the phases of actual execution is sometimes called the issue of
instructions; a design using this strategy is termed single-issue in the sense that only
one instruction passes into the execution stage at a time.
Execute (EXE) Having taken the opcode and operands from DEC-EXE, the execute stage fills a
similar role to the one it fills in the sequential data-path: it simply invokes the ALU to
work on the operands and produce some result. Unlike in the diagram, the ALU may notionally
be split into two parts, since these operate at the same time: the main ALU performs
arithmetic and logic operations while a secondary unit performs comparisons. The ALU is fed
inputs from either the main operands in DEC-EXE or secondary operands such as the current
PC value, which is also stored in DEC-EXE; multiplexers are used to make this selection,
which allows, for example, the PC to be incremented. When the pipeline advances, generated
results are stored in EXE-MEM.
Memory (MEM) The memory access stage is simple. It takes the instruction that is stored in
EXE-MEM and, if it is a memory access instruction (i.e., a load or a store), performs the
corresponding operation. Notice this requires that the memory access stage has a dedicated
interface to the memory, and that the stage now generates two intermediate results which,
along with the instruction, are stored in MEM-WRI when the pipeline advances. In this
particular design, the memory access stage also feeds the branch condition and computed PC
value back to the fetch stage should a branch be taken.
Write (WRI) Finally, the write stage takes any intermediate results which have been
generated, either by the execution of some operation or by loading something from memory,
and writes them into the general-purpose register file. Unlike most other signals in the
pipeline, which are forward facing, the write stage feeds signals back to the decode stage,
where the registers are typically housed. Note that in the diagram, both the value and the
instruction from MEM-WRI are fed back to the decode stage. Although this might seem odd,
the instruction is required so that one can extract the target register operand and hence
know which register to write the value into: the decode stage cannot know this because it
will be operating on a different instruction at the time!
After the write stage has finished, i.e., execution of the instruction is complete, the
instruction is said to have been retired. Note that with this scheme, strong processor and
memory consistency are clearly guaranteed: instructions have to be retired in order, since
no instruction can overtake another in the pipeline, and there is no re-ordering of memory
accesses.
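The following toy C program ties the five stage descriptions together by simulating one
instruction moving through the pipeline per cycle. Everything here is an illustrative
assumption: the instruction encoding, field names and three-instruction program are
invented, there is no hazard handling (the example program deliberately contains no data
dependencies), and a real pipeline is of course hardware rather than software. The stages
are updated in reverse order so that each pipeline register is consumed before it is
overwritten.

#include <stdio.h>
#include <stdint.h>

enum { OP_ADD, OP_LOAD };                       /* toy opcodes             */

typedef struct { int op, rd, ra, rb; } insn_t;  /* invented encoding       */

/* one struct per pipeline register, each with a valid flag */
typedef struct { insn_t ir; int valid; } fd_t;                 /* FET-DEC */
typedef struct { insn_t ir; uint32_t a, b; int valid; } de_t;  /* DEC-EXE */
typedef struct { insn_t ir; uint32_t alu; int valid; } em_t;   /* EXE-MEM */
typedef struct { insn_t ir; uint32_t val; int valid; } mw_t;   /* MEM-WRI */

int main(void)
{
    uint32_t GPR[8] = { 0, 10, 20 };            /* register file           */
    uint32_t MEM[8] = { 5 };                    /* data memory             */
    insn_t prog[] = {                           /* no data dependencies    */
        { OP_ADD,  3, 1, 2 },                   /* GPR[3] <- GPR[1]+GPR[2] */
        { OP_LOAD, 4, 0, 0 },                   /* GPR[4] <- MEM[GPR[0]]   */
        { OP_ADD,  5, 1, 1 },                   /* GPR[5] <- GPR[1]+GPR[1] */
    };
    int n = 3;
    uint32_t pc = 0;
    fd_t fd = { 0 }; de_t de = { 0 }; em_t em = { 0 }; mw_t mw = { 0 };

    /* n instructions retire in n + 4 cycles on a 5-stage pipeline */
    for (int cycle = 0; cycle < n + 4; cycle++) {
        /* WRI: retire the oldest instruction */
        if (mw.valid) GPR[mw.ir.rd] = mw.val;
        /* MEM: perform the access for loads, else pass the ALU result */
        mw.valid = em.valid;
        if (em.valid) {
            mw.ir  = em.ir;
            mw.val = (em.ir.op == OP_LOAD) ? MEM[em.alu] : em.alu;
        }
        /* EXE: invoke the ALU on the operands from DEC-EXE */
        em.valid = de.valid;
        if (de.valid) {
            em.ir  = de.ir;
            em.alu = (de.ir.op == OP_ADD) ? de.a + de.b : de.a;
        }
        /* DEC: read source operands from the register file */
        de.valid = fd.valid;
        if (fd.valid) {
            de.ir = fd.ir;
            de.a  = GPR[fd.ir.ra];
            de.b  = GPR[fd.ir.rb];
        }
        /* FET: fetch the next instruction and increment PC */
        fd.valid = (pc < (uint32_t) n);
        if (fd.valid) fd.ir = prog[pc++];
    }
    printf("GPR[3]=%u GPR[4]=%u GPR[5]=%u\n", GPR[3], GPR[4], GPR[5]);
    return 0;                                   /* prints 30, 5 and 20     */
}

Running this prints GPR[3]=30 GPR[4]=5 GPR[5]=20, and the loop bound n + 4 mirrors the
timing formula given earlier: the last of n instructions retires in cycle (n − 1) + 5.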
9.2 Vector Processors
9.2.1 Basic Concept
A vector processor is an architecture and compiler model, popularized by supercomputers,
in which high-level operations work on linear arrays of numbers. In a scalar processor,
the natural units of computation are single values held in registers. In a vector
processor, one operates using registers that contain p separate values. Consider a
simplistic example where we want to execute a vector addition:

A_i ← B_i + C_i, for i ∈ {0, ..., p−1}
which we could implement in C using the function:
/* Component-wise addition: A[ i ] = B[ i ] + C[ i ] for 0 <= i < p. */
void add( int* A, int* B, int* C, int p )
{
    for( int i = 0; i < p; i++ ) {
        A[ i ] = B[ i ] + C[ i ];
    }
}
where A, B and C are all arrays of, say, 32-bit integers. Often, this sort of operation is termed
a pure vector operation in the sense that computation is happening only component-wise: the
i-th element of the result is generated using only the i-th elements of the inputs.
On a scalar processor we could implement the loop as follows:
         GPR[2] ← 0
         GPR[3] ← p
loop :   if GPR[2] ≥ GPR[3], PC ← exit
         GPR[4] ← MEM[&B + GPR[2]]
         GPR[5] ← MEM[&C + GPR[2]]
         GPR[6] ← GPR[4] + GPR[5]
         MEM[&A + GPR[2]] ← GPR[6]
         GPR[2] ← GPR[2] + 1
         PC ← loop
exit :   ...
But imagine that instead of our standard register file GPR, each register of which stores
a single 32-bit value, we had a vector register file where each register stores p such
values. Let VR[i] denote the i-th vector register and VR[i][j] denote the j-th value, or
subword, in the i-th register. Clearly we also need some vector instructions capable of
operating on values held in the vector register file. Equipped as such, the whole loop
collapses into a few instructions:
VR[4] ← MEM[&B + 0] ... MEM[&B + p−1]
VR[5] ← MEM[&C + 0] ... MEM[&C + p−1]
VR[6] ← VR[4] + VR[5]
MEM[&A + 0] ... MEM[&A + p−1] ← VR[6]
...
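As a concrete software analogue of these vector instructions, the sketch below uses the
GCC/Clang vector extension, in which a typedef with vector_size asks the compiler to map
operations onto the machine's vector registers where possible. This is only one possible
realization, and everything beyond the original add function is an assumption; in
particular, p is taken here to be a multiple of 4 and the arrays are assumed to be
16-byte aligned.

#include <stdio.h>

/* A 4-wide vector of ints: one "vector register" holding p = 4 subwords. */
typedef int v4si __attribute__(( vector_size( 16 ) ));

/* Vector version of add(): each iteration performs a vector load of 4
   elements from B and C, one vector add, and a vector store into A.
   Assumes p is a multiple of 4 and 16-byte-aligned arrays. */
void add_vec( int* A, int* B, int* C, int p )
{
    for( int i = 0; i < p; i += 4 ) {
        v4si b = *(v4si*) &B[ i ];
        v4si c = *(v4si*) &C[ i ];
        *(v4si*) &A[ i ] = b + c;   /* one add yields four results */
    }
}

int main( void )
{
    int B[8] __attribute__(( aligned( 16 ) )) = { 1, 2, 3, 4, 5, 6, 7, 8 };
    int C[8] __attribute__(( aligned( 16 ) )) = { 8, 7, 6, 5, 4, 3, 2, 1 };
    int A[8] __attribute__(( aligned( 16 ) ));
    add_vec( A, B, C, 8 );
    for( int i = 0; i < 8; i++ ) printf( "%d ", A[ i ] );  /* prints 9 eight times */
    printf( "\n" );
    return 0;
}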
Of course, each vector instruction is hugely complex compared to the equivalent scalar
instruction; in some sense, one might view a vector processor as an embodiment of CISC
design. Important examples of this complexity are the vector add instruction, which uses
VR[i] and VR[j] as source operands and produces a result in VR[k], essentially computing p
additions, and the vector load and store instructions, which load or store p values from or
to memory starting at some base address. One might classify the technique as using static
scheduling, since the specification of parallelism is done at compile-time, and as
single-issue, since the processor issues one instruction per cycle (even though each
instruction implies many similar operations). There are several important benefits that
result:
1. The scalar loop over the vector operands contains branches; the vector program does
without them and so has fewer control dependencies.
2. Since the parallelism within vector operations is explicit, less hardware is required to
manage data dependencies and to exploit the parallelism effectively.
3. The vector program captures the entire loop in a few instructions; the burden on the
fetch and decode stages is therefore reduced.
4. Memory access patterns resulting from vector loads and stores are much more regular
(e.g., vector elements are contiguous in memory).
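To put rough numbers on benefit 3 (an illustrative calculation, not from the original
text): the scalar loop above executes about seven instructions per element (a
compare-and-branch, two loads, an add, a store, an increment and a jump), so for p = 64 it
fetches and decodes roughly 7 × 64 = 448 instructions, whereas the vector version issues
only the four vector instructions shown.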