The Pentium 4 Architecture. An overview.

The Pentium 4 emerged from a historical context. Its history started with the 8088/8086 microprocessors, which were conceived in the CISC mould, appropriate for that epoch. Later, when the RISC concept became prevalent, Intel had to decide either to dump the 8086 CISC architecture or to be clever. Dumping it would have rendered all software based on the 8086 obsolete, which would have lost Intel its customer base. But the 8086 technology was unsustainable in the face of RISC architectures such as the one employed by Sun in its SPARCstation. Intel was faced with a dilemma: software required the 8086 CISC instruction set, while hardware required a RISC concept.

Their solution to this dilemma was pure magic: to invent a CPU that reads the 8086 CISC code from the software and translates that code into RISC “microcode” operations which run on a RISC architecture. From the outside, the Pentium reads CISC code; on the inside, it runs RISC code. The translation mechanism between the inside and the outside is rather beautiful. This is the topic of this narrative.

There are two ways to approach an analysis of microprocessor architecture. The first takes a “systems” point of view, and looks at how the various components of the microprocessor work together. The second takes a “dynamic” point of view, looking at how the microprocessor deals with instructions as time flows, in other words as instructions pass through a processing pipeline. Here we shall present both approaches. First comes the systems view.

1. Summary of the Architecture (Systems View)

There are two major groups of components in the P4: (i) the “Front End”, which reads IA-32 CISC code and translates it into a stream of micro-operations (“mu-ops”), and (ii) the “Execution Engine”, to which the mu-ops are issued and which carries out the RISC operations.
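
As a rough illustration of the translation idea, the sketch below splits a few IA-32 style instructions into simpler load, compute and store steps. It is a toy model in Python, not Intel's decoder: the tuple-based instruction format, the helper names `is_memory` and `decode`, the mu-op names and the temporary register `tmp0` are all assumptions made for this example.

```python
# Toy model of CISC-to-mu-op translation. Everything here is illustrative:
# the instruction format, the mu-op names and the temporary register "tmp0"
# are assumptions for this sketch, not Intel's actual encoding.

def is_memory(operand):
    """An operand written as "[...]" refers to a memory location."""
    return operand.startswith("[") and operand.endswith("]")


def decode(instruction):
    """Translate one (opcode, dest, src) instruction into a list of mu-ops.

    Each mu-op either touches memory once (load/store) or works on
    registers only, which is the RISC-style restriction described above.
    """
    opcode, dest, src = instruction
    if is_memory(dest) and is_memory(src):
        raise ValueError("two memory operands are not allowed for these instructions")

    if opcode == "mov":
        if is_memory(src):
            return [("load", dest, src)]      # register <- memory
        if is_memory(dest):
            return [("store", dest, src)]     # memory <- register
        return [("mov", dest, src)]           # register <- register

    # Arithmetic of the form: dest = dest <opcode> src
    if is_memory(src):
        return [("load", "tmp0", src),
                (opcode, dest, dest, "tmp0")]
    if is_memory(dest):
        # Read-modify-write on memory becomes load, compute, store.
        return [("load", "tmp0", dest),
                (opcode, "tmp0", "tmp0", src),
                ("store", dest, "tmp0")]
    return [(opcode, dest, dest, src)]


program = [("mov", "eax", "[x]"), ("add", "[y]", "eax"), ("sub", "ebx", "eax")]
for instruction in program:
    print(instruction, "->", decode(instruction))
```

The point to notice is that a single CISC instruction such as `add [y], eax` expands into three mu-ops, while a register-to-register instruction maps onto just one.
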
The Front End has branch prediction logic which uses the past history of the program's execution to speculate which IA-32 CISC instructions should be executed next. It then fetches these from the L2 cache, decodes them and stores them in the L1 Instruction Cache as mu-ops. It also keeps track of which IA-32 CISC instructions have already been decoded and stored, so it does not need to decode the same CISC instruction again when it is executed later in the program.

The Execution Engine uses out-of-order (OOO) logic to re-order the mu-ops so as to keep the machine's execution components, such as the ALUs and the cache, as busy as possible. This allows several mu-ops to be processed simultaneously, provided there are no dependencies between them. Following processing by the ALUs, the instructions are re-ordered by the Retirement Unit to recover the original program order. This logic also reports branch history back to the Front End, allowing branch prediction to occur. Up to 126 mu-ops can be in flight at the same time, including up to 48 load and 24 store mu-ops.

The Allocator searches for the resources required by each mu-op (such as registers), and when these become available the Allocator assigns them to the mu-op, which is then released into the pipeline.

The Register Renamer takes the set of 8 IA-32 CISC registers (eax, ebx, ecx, etc.) and maps them onto the 128 RISC registers in the machine. This is needed because in the original program a register such as eax could be referred to by many lines of code which may have been translated to mu-ops in the L1 cache. Each instance of eax could be unique, holding a different value. The Register Renaming logic stores a table of names so that an instruction coming down the pipeline can know which RISC register corresponds to its own instance of eax.

The Schedulers decide when each mu-op is ready to execute by monitoring the availability of its input operands and whether there is an ALU resource waiting for a job.
There are multiple schedulers, which feed the various ALUs.

2. The Pentium 4 Pipeline

This is an initial discussion of the P4 architecture, which has been written to set the scene. Many (if not most) details have been omitted and will be discussed in the document ??

Stages 1 and 2. These stages set the IP (instruction pointer) to point to the next x86 CISC instruction present in the L2 cache and translate/decode it, writing micro-operations (“mu-ops”) into the L1 Instruction Cache. The translation is not performed if the IA-32 instruction has already been decoded.

Stages 3 and 4. The mu-ops are fetched from the Instruction Cache and are sent down into the Execution Engine.

Stage 5. Drive. This stage is used to give the chip's electronics enough time to get the signals from the Instruction Cache to the Execution Engine (allocator/renamer). It is necessary because the Pentium 4 is clocked so fast.

Stages 6 to 8. Allocation and Rename. Each micro-operation within a “bundle” has its own requirements in order to execute, for example registers. The Allocator checks these, and when the registers become available, the mu-op and its association with the registers are passed down to the execution unit. Otherwise the Allocator will stall this part of the machine. The Renaming logic has to deal with several x86 CISC instructions which refer to the same x86 CISC register, such as EAX. There may be many references to EAX in the pipeline at the same time. These are identified and are renamed onto the 128 registers of the RISC core. …

Stage 9. Queue. These queues are “parking places” where mu-ops wait before being dispatched to the several execution units on the chip. They wait until a scheduler becomes available.

Stages 10 to 12. Schedule. Here the schedulers decide when a mu-op is ready to execute. If all of its required register values are available, then the mu-op can be assigned to an execution unit.
The schedulers look out for any dependencies and wait until these are resolved.

Stages 13 and 14. Dispatch. Mu-ops travel through one of four ports for execution by the RISC core.

Stages 15 and 16. Here the registers are read, getting the required operands for the mu-ops ready to be processed by the execution units.

Stage 17. The mu-ops are executed by the units to which they have been assigned.

Stage 18. Flags. Flags are set at this stage (e.g. “overflow”, “zero”); these are typically inputs to branch instructions.

Stage 19. Branch Check. Here the P4 checks the result of any branch instruction to see whether its prediction was correct. If it was incorrect, then 19 cycles of operation must be discarded.

Stage 20. Drive. One more clock cycle is dedicated to propagating the result of the branch check back to the Front End.

…
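
To get a feel for what the branch check in stage 19 means for performance, the following back-of-the-envelope sketch in Python estimates the average cost per instruction. The 19-cycle penalty comes from the text above; the base cost of one cycle per instruction, the 20% share of branches and the prediction accuracies are made-up illustrative figures, not P4 measurements, and `cycles_per_instruction` is a hypothetical helper.

```python
# Rough cost model for branch misprediction. Only the 19-cycle penalty is
# taken from the text; all other numbers below are illustrative assumptions.

MISPREDICT_PENALTY = 19  # cycles of work discarded on a wrong prediction


def cycles_per_instruction(base_cpi, branch_fraction, mispredict_rate,
                           penalty=MISPREDICT_PENALTY):
    """Average cycles per instruction, including misprediction losses.

    base_cpi        -- cycles per instruction with perfect prediction
    branch_fraction -- share of instructions that are branches (0..1)
    mispredict_rate -- share of branches predicted wrongly (0..1)
    """
    return base_cpi + branch_fraction * mispredict_rate * penalty


for accuracy in (0.90, 0.95, 0.99):
    cpi = cycles_per_instruction(1.0, 0.20, 1.0 - accuracy)
    print(f"prediction accuracy {accuracy:.0%}: {cpi:.2f} cycles per instruction")
```

Under these assumed figures, a predictor that is right 90% of the time adds about 0.38 cycles to every instruction on average, which is why a long pipeline depends so heavily on good branch prediction.
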