The Pentium 4 Architecture (Dr.C)

The Pentium 4 Architecture. An overview.
The Pentium 4 emerged from a long historical context. Its history started with the 8088/8086 microprocessor
technology, which was correctly conceived in the CISC mould, appropriate for that epoch. Later, when the
RISC concept became prevalent, Intel had to decide either to dump the 8086-CISC architecture or to be
clever. To dump it would render all software based on the 8086 obsolete, which would lose Intel its
customer base. But the 8086 technology was unsustainable in the face of RISC architectures such as the
SPARC employed by Sun in its SPARCstation. Intel was faced with a dilemma: software required the 8086-CISC
instruction set, hardware required a RISC concept. Their solution to this dilemma was pure magic: to invent a
CPU that would read the 8086-CISC code from the software and translate that code into RISC-like
micro-operations which would run on a RISC architecture. From the outside, the Pentium reads CISC code; on the
inside, it runs RISC code. The translation mechanism between the inside and the outside is rather beautiful.
This is the topic of this narrative.
There are two ways to approach an analysis of microprocessor architecture. The first takes a “systems”
point of view, and looks at how the various components of the microprocessor work together. The second
takes a “dynamic” point of view, looking at how the microprocessor deals with instructions as time flows, in
other words as instructions pass through a processing pipeline. Here we shall present both approaches. First
comes the systems view.
1. Summary of the Architecture (Systems View)
There are two major groups of components in the P4: (i) the “Front End”, which reads IA-32 CISC code and
translates it into a stream of micro-operations (“mu-ops”), and (ii) the “Execution Engine”, to which the
mu-ops are issued and which carries out the RISC operations. The Front End has branch prediction logic which
uses the past history of the program’s execution to speculate which IA-32 CISC instructions should be executed
next. It then fetches these from the L2 cache, decodes them and stores them in the L1 Instruction Cache as
mu-ops. It also knows which IA-32 CISC instructions have already been decoded and stored, so it does not
need to decode the same CISC instruction again when that instruction is encountered again, for example
inside a loop.
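The decode-once behaviour of the Front End can be illustrated with a minimal sketch. This is a toy model and not Intel’s design: the class `MuOpCache`, the function `toy_decoder`, the addresses and the mu-op strings are all hypothetical names chosen for illustration. It assumes decoded mu-ops are looked up by instruction address.

```python
from typing import Callable, Dict, List


class MuOpCache:
    """Toy model of a cache that stores already-decoded mu-ops by address."""

    def __init__(self, decoder: Callable[[str], List[str]]):
        self.decoder = decoder
        self.lines: Dict[int, List[str]] = {}
        self.decode_count = 0  # how many times the (slow) decoder was used

    def fetch(self, address: int, instruction: str) -> List[str]:
        # Decode only on a miss; on a hit, reuse the stored mu-ops.
        if address not in self.lines:
            self.decode_count += 1
            self.lines[address] = self.decoder(instruction)
        return self.lines[address]


def toy_decoder(instruction: str) -> List[str]:
    # Hypothetical translations from CISC instructions to mu-ops.
    table = {
        "add eax, [mem]": ["load tmp, [mem]", "add eax, tmp"],
        "inc ecx": ["add ecx, 1"],
    }
    return table[instruction]


cache = MuOpCache(toy_decoder)
loop_body = [(0x100, "add eax, [mem]"), (0x104, "inc ecx")]
for _ in range(3):  # three passes through the same loop
    for address, instruction in loop_body:
        cache.fetch(address, instruction)

print(cache.decode_count)  # 2: each instruction was decoded only once
```

The loop executes six instructions in total, but the decoder runs only twice, which is the saving described above.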
The Execution Engine uses out-of-order (OOO) logic: mu-ops are re-ordered so as to keep the machine’s
execution resources, such as the ALUs and the cache, as busy as possible. This allows several mu-ops to be
processed simultaneously, provided there are no dependencies between them. Following processing by the
ALUs, the mu-ops are put back into the original program order by the Retirement Unit. This logic also
reports branch history back to the Front End, which keeps the branch prediction up to date. Up to 126 mu-ops
can be in flight at the same time, including up to 48 loads and 24 stores. The Allocator searches for the
resources required by each mu-op (such as registers), and when these become available they are assigned by
the Allocator to the mu-op, which is then released into the pipeline. The Register Renamer takes the set of
8 IA-32 CISC registers (eax, ebx, ecx, etc.) and maps them onto the 128 RISC registers in the machine. This
is needed because, in the original program, a register such as eax may be referred to by many lines of code
which have been translated to mu-ops in the L1 cache. Each instance of eax could be unique, holding a
different value. The Register Renaming logic stores a table of names so that an instruction coming down the
pipeline knows which RISC register corresponds to its own instance of eax. The Schedulers decide when each
mu-op is ready to execute by monitoring the availability of its input operands and whether there is an
execution resource, such as an ALU, waiting for a job. There are multiple schedulers which feed the various
ALUs.
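The difference between out-of-order completion and in-order retirement can be sketched as follows. This is a simplified illustration under stated assumptions, not a model of the real P4: the mu-op names, dependencies and latencies are invented, the number of execution units is assumed to be unlimited, and every dependency is assumed to appear earlier in the program than the mu-op that uses it.

```python
from typing import Dict, List, Tuple

# Each mu-op: (name, names of mu-ops it depends on, latency in cycles).
# All values here are made up for illustration.
MuOp = Tuple[str, List[str], int]


def simulate(program: List[MuOp]):
    # A mu-op may start as soon as all of its inputs have been produced.
    finish: Dict[str, int] = {}
    for name, deps, latency in program:
        start = max((finish[d] for d in deps), default=0)
        finish[name] = start + latency

    # Order in which results actually become available (out of order).
    completion_order = sorted(finish, key=lambda n: finish[n])

    # Retirement follows program order: a mu-op cannot retire before
    # every earlier mu-op has retired.
    retire: Dict[str, int] = {}
    last_retired = 0
    for name, _, _ in program:
        last_retired = max(last_retired, finish[name])
        retire[name] = last_retired
    return finish, completion_order, retire


program = [
    ("load", [], 3),        # slow memory access
    ("add", ["load"], 1),   # must wait for the load
    ("inc", [], 1),         # independent, can run ahead
]
finish, completion_order, retire = simulate(program)
print(completion_order)  # ['inc', 'load', 'add']
print(retire)            # {'load': 3, 'add': 4, 'inc': 4}
```

The independent `inc` finishes first, yet it retires last because it comes last in program order.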
2. The Pentium 4 Pipeline
This is an initial discussion of the P4 architecture, which has been written to set the scene. Many (if not most)
details have been omitted and will be discussed in the document ??
Stages 1 and 2. These stages set the IP (instruction pointer) to point to the next x86-CISC instruction
present in the Instruction Cache (L2) and translate/decode it, writing micro-operations (“mu-ops”) into the
L1 instruction cache. The translation is not performed if the IA-32 instruction has already been decoded.
Stages 3 and 4. The micro-operations (“mu-ops”) are fetched from the Instruction Cache and are sent down
into the execution engine.
Stage 5. Drive. This stage is used to give the chip’s electronics enough time to get the electronic signals from
the Instruction Cache to the Execution Engine (allocator/renamer). It is necessary because the Pentium 4 is
clocked so fast that the signals cannot cover this distance within the stages that do the processing.
Stages 6 to 8. Allocation and Rename.
Each micro-operation within a “bundle” has its own requirements in order to execute, for example registers.
The Allocator checks these, and when the registers become available, the mu-op and its association with the
registers are passed down to the execution unit. Otherwise the Allocator will stall this part of the
machine. The Renaming logic has to deal with several x86-CISC instructions which refer to the same
x86-CISC register, such as EAX. There may be many references to EAX in the pipeline at the same time. These
are identified and are renamed onto the 128 registers of the RISC core.
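Renaming can be sketched with a rename table and a list of free registers. This is a minimal illustration, not the P4’s actual mechanism: the class `Renamer`, its method names and the physical register numbering are hypothetical, and the recycling of registers after retirement is left out.

```python
ARCH_REGS = ["eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"]


class Renamer:
    """Toy rename table mapping 8 architectural registers onto 128 physical ones."""

    def __init__(self, num_physical: int = 128):
        # Assumption: each architectural register starts out mapped to one
        # physical register; the rest form the free list.
        self.table = {reg: i for i, reg in enumerate(ARCH_REGS)}
        self.free = list(range(len(ARCH_REGS), num_physical))

    def rename(self, dest: str, sources: list):
        # Sources read whichever physical register currently holds their value.
        src_phys = [self.table[s] for s in sources]
        # The destination gets a fresh physical register, so older mu-ops
        # still in the pipeline keep their own instance of the register.
        new_phys = self.free.pop(0)
        self.table[dest] = new_phys
        return new_phys, src_phys


r = Renamer()
print(r.rename("eax", []))              # mov eax, 1   -> (8, [])
print(r.rename("ebx", ["ebx", "eax"]))  # add ebx, eax -> (9, [1, 8])
print(r.rename("eax", []))              # mov eax, 2   -> (10, [])
print(r.rename("ecx", ["ecx", "eax"]))  # add ecx, eax -> (11, [2, 10])
```

The two writes to eax land in physical registers 8 and 10, so the two `add` mu-ops read different instances of eax and do not interfere with each other.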
…
Stage 9. Queue. These queues are “parking places” where mu-ops wait before being dispatched to the
several execution units on the chip. They wait until a scheduler becomes available.
Stages 10 to 12. Schedule. Here the schedulers decide when a mu-op is ready to execute. If all of its required
register values are available, then the mu-op can be assigned to an execution unit. The schedulers look out
for any dependencies and wait until these are resolved.
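The readiness check made by a scheduler can be sketched as below. This is only an illustration: the function `pick_ready`, the dictionary layout of a mu-op and the register names such as `p8` are assumptions introduced for the example, and real schedulers are organised quite differently in hardware.

```python
def pick_ready(queue: list, ready_regs: set, free_units: int) -> list:
    """Remove and return the queued mu-ops whose source registers are all
    ready, oldest first, up to the number of free execution units."""
    issued = []
    for op in list(queue):  # iterate over a copy while removing
        if len(issued) == free_units:
            break
        if all(src in ready_regs for src in op["srcs"]):
            issued.append(op)
            queue.remove(op)
    return issued


queue = [
    {"name": "add", "srcs": ["p8", "p9"], "dest": "p10"},
    {"name": "mul", "srcs": ["p10"], "dest": "p11"},  # depends on add
    {"name": "sub", "srcs": ["p1"], "dest": "p12"},
]
ready = {"p1", "p8", "p9"}

first = pick_ready(queue, ready, free_units=2)
print([op["name"] for op in first])   # ['add', 'sub']

ready.add("p10")                      # the add has produced its result
second = pick_ready(queue, ready, free_units=2)
print([op["name"] for op in second])  # ['mul']
```

The `mul` is older than the `sub` but has to wait, because its input `p10` is not available until the `add` has executed.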
Stages 13 and 14. Dispatch. Mu-ops travel through one of four ports for execution by the RISC core.
Stages 15 and 16. Here the registers are read, getting the required operands for the mu-ops ready to be
processed by the execution units.
Stage 17. The mu-ops are executed by the units to which they have been assigned.
Stage 18. Flags. Flags are set at this stage (e.g. “overflow”, “zero”), which are typically input to branch
instructions.
Stage 19. Branch Check. Here the P4 checks the result of any branch instruction to see whether its
prediction was correct. If it was incorrect then 19 cycles of operation must be discarded.
Stage 20. Drive. One more clock cycle dedicated to propagating the result of the branch check back to the
Front End.
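The cost of a wrong prediction can be estimated with some simple arithmetic. The sketch below uses the 19 discarded cycles mentioned above as the penalty. The function name and all the other figures (instruction count, base cycles per instruction, fraction of branches, misprediction rate) are hypothetical values chosen for illustration, not measurements of a real P4.

```python
def cycles_with_mispredicts(instructions: int,
                            base_cpi: float,
                            branch_fraction: float,
                            mispredict_rate: float,
                            penalty_cycles: int = 19) -> float:
    """Rough cycle count: useful work plus cycles thrown away on flushes."""
    useful = instructions * base_cpi
    mispredicted_branches = instructions * branch_fraction * mispredict_rate
    return useful + mispredicted_branches * penalty_cycles


# Hypothetical workload: 1,000,000 instructions, 1 cycle each when all
# goes well, 20% branches, 5% of the branches predicted wrongly.
total = cycles_with_mispredicts(1_000_000, 1.0, 0.20, 0.05)
wasted = total - 1_000_000 * 1.0
print(round(total), round(wasted))
```

Under these assumed figures roughly 190,000 cycles, about 16% of the total, go to discarded work, which is why a long pipeline needs good branch prediction.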
…