PC Processor Microarchitecture

TABLE OF CONTENTS
• Introduction
• Building a Framework for Comparison
• What Does a Computer Really Do?
• The Memory Subsystem
• Exploiting ILP Through Pipelining
• Exploiting ILP Via Superscalar Processing
• Exploiting Data-Level Parallelism Via SIMD
• Where Should Designers Focus The Effort?
• A Closer Look At Branch Prediction
• Speculative, Out-of-Order Execution Gets a New Name
• Analyzing Some Real Microprocessors: P4
• Pentium 4's Cache Organization
• Pentium 4's Trace Cache
• The Execution Engine Runs Out Of Order
• AMD Athlon Microarchitecture
• AMD Athlon Scheduler, Data Access
• Centaur C3 Microarchitecture
• Overall Conclusions
• List of References

Introduction Isn't it interesting that new high-tech products seem so complicated, yet only a few years later we talk about how much simpler the old stuff was? This is certainly true for microprocessors. As soon as we finally figure out all the new features and feel comfortable giving advice to our family and friends, we're confronted with details about a brand-new processor that promises to obsolete our expertise on the "old" generation. Gone are the simple and familiar diagrams of the past, replaced by arcane drawings and cryptic buzzwords. For a PC technology enthusiast, this is like discovering a new world to be explored and conquered. While many areas will seem strange and unusual, much of the landscape resembles places we've traveled before. This article is meant to serve as a faithful companion for this journey, providing a guidebook of the many wondrous new discoveries we're sure to encounter. An Objective Tutorial and Analysis of PC Microarchitecture The goal of this article is to give the reader some tools for understanding the internals of modern PC microprocessors. In the article "PC Motherboard Technology", we developed some tools for analyzing a modern PC motherboard. This article takes us one step deeper, zooming into the complex world inside the PC processor itself. The internal design of a processor is called the "microarchitecture". Each CPU vendor uses slightly different techniques for getting the most out of their design, while meeting their unique performance, power, and cost goals. The marketing departments from these companies will often highlight microarchitectural features when promoting their newest CPUs, but it's often difficult for us PC technology enthusiasts to figure out what it really means. What is needed is an objective comparison of the design features for all the CPU vendors, and that's the goal of this article. We'll walk through the features of the latest x86 32-bit desktop CPUs from Intel, AMD, and VIA (Centaur). Since the Transmeta "Crusoe" processor is mostly targeted at the mobile market, we'll analyze their microarchitecture in another article. It will also be the task for another article to thoroughly explore Apple's PowerPC G4 microprocessor, and many of the analytical tools learned here will apply to all high-end processors. Building a Framework for Comparison Before we can dive right into the block diagram of a modern CPU, we need to develop some analytical tools for understanding how these features affect the operation of the PC system. We also need to develop a common framework for comparison. As you'll soon see, that is no easy task. There are some radical differences in architecture between these vendors, and it's difficult to make direct comparisons.
As it turns out, the best way to understand and compare these new CPUs is to go back to basic computer architectural concepts and show how each vendor has solved the common problems faced in modern computer design. In our last section, we'll gaze into the future of PC microarchitecture and make a few predictions. Let's Not Lose Sight of What Really Matters There is one issue that should be stressed right up front. We should never lose sight of the real objective in computer design. All that really matters is how well the CPU helps the PC run your software. A PC is a computer system, and subtle differences in CPU microarchitecture may not be noticeable when you're running your favorite computer program. We learned this in our article on motherboard technology, since a well-balanced PC needs to remove all the bottlenecks (and meet the cost goals of the user). The CPU designers are turning to more and more elaborate techniques to squeeze extra performance out of these machines, so it's still really interesting to peek in on the raging battle for even a few percent better system performance. For a PC technology enthusiast, it's just downright fascinating how these CPU architects mix clever engineering tricks with brute-force design techniques to take advantage of the enormous number of transistors available on the latest chips. What Does a Computer Really Do? It's easy to get buried too deeply in the complexities of these modern machines, but to really understand the design choices, let's think again about the fundamental operation of a computer. A computer is nothing more than a machine that reads a coded instruction, decodes the instruction, and executes it. If the instruction needs to load or store some data, the computer figures out the location for the data and moves it. That's it; that's all a computer does. We can break this operation into a series of stages:

The 5 Computer Operation Stages
Stage 1: Instruction Access (IA)
Stage 2: Instruction Decode (ID)
Stage 3: Execution (EX)
Stage 4: Data Access (DA)
Stage 5: Store (write back) Results (WB)

Some computer architects may re-arrange, combine, or break up the stages, but every computer microarchitecture does these five things. We can use this framework to build on as we work our way up to even the most complicated CPUs. For those of you who eat this stuff for breakfast and are anxious to jump ahead, remember that we haven't yet talked about pipelines. These stages could all be completely processed for a single instruction before starting the next one. If you think about that idea for a moment, you'll realize that almost all the complexity comes when we start improving on that limitation. Don't worry; the discussion will quickly ramp up in complexity, and some readers might appreciate a quick refresher. Let's see what happens in each of these stages: Instruction Access A coded instruction is read from the memory subsystem at an address that is determined by a program counter (PC). In our analysis, we'll treat memory as something that hangs off to the side of our CPU "execution core", as we show in the figure below. Some architects like to view memory and the system bus as an integral part of the microarchitecture, and we'll show how the memory subsystem interacts with the rest of the machine. Instruction Decode The coded instruction is converted into control information for the logic circuits of the machine.
Each "operation code (Opcode)" represents a different instruction and causes the machine to behave in different ways. Embedded in the Opcode (or stored in later bytes of the instruction) can be address information or "immediate" data to be processed. The address information can represent a new address that might need to be loaded into the PC (a branch address) or the address can represent a memory location for data (loads and stores). If the instruction needs data What Does a Computer Really Do? from a register, it is usually brought in during this stage. Execute This is the stage where the machine does whatever operation was directed by the instruction. This could be a math operation (multiply, add, etc.) or it might be a data movement operation. If the instruction deals with data in memory, the processor must calculate an "Effective Address (EA)". This is the actual location of the data in the memory subsystem (ignoring virtual memory issues for now), based on calculating address offsets or resolving indirect memory references (A simple example of indirection would be registers that house an address, rather than data). Data Access In this stage, instructions that need data from memory will present the Effective Address to the memory subsystem and receive back the data. If the instruction was a store, then the data will be saved in memory. Our simple model for comparison gets a bit frayed in this stage, and we'll explain in a moment what we mean. Write Back Once the processor has executed the instruction, perhaps having been forced to wait for a data load to complete, any new data is written back to the destination register (if the instruction type requires it). Was There a Question From the Back of the Room? Some of the x86 experts in the audience are going to point out the numerous special cases for the way a processor must deal with an instruction set designed in the 1970s. Our five-stage model isn't so simple when it must deal with all the addressing modes of an x86. A big issue is the fact that the x86 is what is called a "register-memory" architecture where even ALU (Arithmetic Logic Unit) instructions can access memory. This is contrasted with RISC (Reduced Instruction Set Computing) architectures that only allow Load and Store instructions to move data (register-register or more commonly called Load/Store architectures). The reason we can focus on the Load/Store architecture to describe what happens in each stage of a computer is that modern x86 processors translate their native CISC (Complex Instruction Set Computing) instructions into RISC instructions (with some exceptions). By translating the instructions, most of the special cases are turned into extra RISC instructions and can be more efficiently processed. RISC instructions are much easier for the hardware to optimize and run at higher clock rates. This internal translation to RISC is one of the ways that x86 processors were able to deal with the threat that higher-performance RISC chips would take over the desktop in the early 1990s. We'll talk about instruction translation more when we dig into the details of some specific processors, at which point we'll also show several ways in which our model is dramatically modified. To the questioner in the back of the room, there will be several things we're going to have to gloss over (and simplify) in order to keep this article from getting as long as a computer textbook. If you really want to dig into details, check out the list of references at the end of this article. 
The Memory Subsystem The memory subsystem plays a big part in the microarchitecture of a CPU. Notice that both the Instruction Access stage and the Data Access stage of our simple processor must get to memory. This memory can be split into separate sections for instructions and data, allowing each stage to have a dedicated (hence faster) port to memory. This is called a "Harvard Architecture", a term from work at Harvard University in the 1940s that has been extended to also refer to architectures with separate instruction and data caches--even though main memory (and sometimes L2 cache) is "unified". For some background on cache design, you can refer to the memory hierarchy discussion in the article "PC Motherboard Technology". That article also covers the system bus interface, an important part of the PC CPU design that is tailored to support the internal microarchitecture. Virtual Memory: Making Life Easier for the Programmer and Tougher for the Hardware Designer To make life simpler for the programmer, most addresses are "virtual addresses" that allow the software designer to pretend to have a large, linear block of memory. These virtual addresses are translated into "physical addresses" that refer to the actual addresses of the memory in the computer. In almost all x86 chips, the caches contain memory data that is addressed with physical addresses. Before the cache is accessed, any virtual addresses are translated in a "Translation Look-aside Buffer (TLB)". A TLB is like a cache of recently-used virtual address blocks (pages), responding with the physical address page that corresponds to the virtual address presented by the CPU core. If the virtual address isn't in one of the pages stored by the TLB (a TLB miss), then the TLB must be updated from a bigger table stored in main memory--a huge performance hit (especially if the page isn't in main memory and must be loaded from disk). Some CPUs have multiple levels of TLBs, similar to the notion of cache memory hierarchy. The size and structure of the TLBs and caches will be important during our CPU comparisons later, but we'll focus mainly on the CPU core for our analysis. Exploiting ILP Through Pipelining Instead of waiting until an instruction has completed all five stages of our model machine, we could start a new instruction as soon as the first instruction has cleared stage 1. Notice that we can now have five instructions progressing through our "pipeline" at the same time. Essentially, we're processing five instructions in parallel, referred to as "Instruction-Level Parallelism (ILP)". If it took five clock cycles to completely execute an instruction before we pipelined the machine, we're now able to execute a new instruction every single clock. We made our computer five times faster, just with this "simple" change. Let's Just Think About This a Minute We'll use a bunch of computer engineering terms in a moment, since we've got to keep that person in the back of the room happy. Before doing that, take a step back and think about what we did to the machine. (Even experienced engineers forget to do that sometimes.) Suddenly, memory fetches have to occur five times faster than before. This implies that the system and cache must now run five times as fast, even though each instruction still takes five cycles to completely execute.
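A quick back-of-the-envelope sketch shows what that overlap buys us. The little C program below (nothing vendor-specific, just our idealized five stages with no stalls and no branches) prints which instruction occupies each stage on each clock; once the pipe fills, one instruction completes every clock even though each one still spends five clocks in flight.

    #include <stdio.h>

    /* Print the stage occupancy of an ideal 5-stage pipeline: no stalls, no branches. */
    int main(void) {
        const char *stage[5] = { "IA", "ID", "EX", "DA", "WB" };
        const int insns = 7;                       /* number of instructions to run */

        printf("clock:");
        for (int s = 0; s < 5; s++) printf("%6s", stage[s]);
        printf("\n");
        for (int clock = 0; clock < insns + 4; clock++) {
            printf("%5d:", clock + 1);
            for (int s = 0; s < 5; s++) {
                int i = clock - s;                 /* instruction i+1 entered IA on clock i+1 */
                if (i >= 0 && i < insns) printf("    I%d", i + 1);
                else                     printf("     -");
            }
            printf("\n");                          /* I1 reaches WB on clock 5; after that, one instruction finishes per clock */
        }
        return 0;
    }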
We've also made a huge assumption that each stage was taking exactly the same amount of time, since that's the rule that our pipeline clock is enforcing. What about the assumption that the processor was even going to run the next four instructions in that order? We (usually) won't even know until the execute stage whether we need to branch to some other instruction address. Hey, what would happen if the sequence of instructions called for the processor to load some data from memory and then try to perform a math operation using that data in the next instruction? The math operation would likely be delayed, due to memory latency slowing down the process. They're Called Pipeline Hazards What we're describing are called "pipeline hazards", and their effects can get really ugly. There are three types of hazards that can cause our pipeline to come to a screeching halt--or cause nasty errors if we don't put in extra hardware to detect them. The first hazard is a "data hazard", such as the problem of trying to use data before it's available (a "data dependency"). Another type is a "control hazard", where the pipeline contains instructions that come after a branch. A "structural hazard" is caused by resource conflicts, where an instruction sequence can cause multiple instructions to need the same processor resource during a given clock cycle. We'd have a structural hazard if we tried to use the same memory port for both instructions and data. Modern Pipelines Can Have a Lot of Stall Cycles There are ways to reduce the chances of a pipeline hazard occurring, and we'll discuss some of the ways that CPU architects deal with the various cases. In a practical sense, there will always be some hazards that will cause the pipeline to stall. One way to describe the situation is to say that an instruction will "block" part of the pipe (something modern implementations help minimize). When the pipe stalls, every (blocked) instruction behind the stalled stage will have to wait, while the instructions fetched earlier can continue on their way. This opens up a gap (a "pipeline bubble") between the blocked instructions and the instructions proceeding down the pipeline in front of them. When the blocked instruction restarts, the bubble will continue down the pipeline. For some hazards, like the control hazard caused by a (mispredicted) branch instruction, the following instructions in the pipeline need to be killed, since they aren't supposed to execute. If the branch target address isn't in the instruction cache, the pipeline can stall for a large number of clock cycles. The stall would be extended by the latency of accesses to the L2 cache or, worse, accesses to main memory. Stalls due to branches are a serious problem, and this is one of the two major areas where designers have focused their energy (and transistor budget). The other major area, not surprisingly, is when the pipeline goes to memory to load data. Most of our analysis will focus in on these 2 latency-induced problems. Design Tricks To Reduce Data Hazards For some data hazards, one commonly-used solution is to forward a data result from a completed instruction straight to another instruction yet to execute in the pipeline (data "forwarding", though sometimes called "bypassing"). This is much faster than writing out the data and forcing the other instruction to read it back in.
Our case of a math operation needing data from a previous memory load instruction would seem to be a good candidate for this technique. The data loaded from memory into a register can also be forwarded straight to the ALU execute stage, instead of going all the way through the register write-back stage. An instruction in the write-back stage could forward data straight to an instruction in the execute stage. Why wait 2 cycles? Why not forward straight from the data access stage? In reality, the data load stage is far from instantaneous and suffers from the same memory latency risk as instruction fetches. The figure below shows how this can occur. What if the data is not in the cache? There would be a huge pipeline bubble. As it turns out, data access is even more challenging than an instruction fetch, since we don't know the memory address until we've calculated the Effective Address. While instructions are usually accessed sequentially, allowing several cache lines to be prefetched from the instruction cache (and main memory) into a fast local buffer near the execution core, data accesses don't always have such nice "locality of reference". The Limits of Pipelining If five stages made us run up to five times faster, why not chop up the work into a bunch more stages? Who cares about pipeline hazards when it gives the marketing folks some really high peak performance numbers to brag about? Well, every x86 processor we'll analyze has a lot more than five stages. Originally called "super-pipelining" until Intel (for no obvious reason) decided to rename it "hyper-pipelining" in their Pentium 4 design, this technique breaks up various processing stages into multiple clock cycles. This also has the architectural benefit of giving better granularity to operations, so there should be fewer cases where a fast operation waits around while slow operations throttle the clock rate. With some of the clever design techniques we'll examine, the pipeline hazards can be managed, and clock rates can be cranked into the stratosphere. The real limit isn't an architectural issue, but is related to the way digital circuits clock data between pipeline stages. To pipeline an operation, each new stage of the pipeline must store information passed to it from a prior stage, since each stage will (usually) contain information for a different instruction. This staged data is held in a storage device (usually a "latch"). As you chop up a task into smaller and smaller pipeline stages, the overhead time it takes to clock data into the latch ("set-up and hold" times and allowance for clock "skew" between circuits) becomes a significant percentage of the entire clock period. At some point, there is no time left in the clock cycle to do any real work. There are some exotic circuit tricks that can help, but they would burn a lot of power - not a good trade-off for chips that already exceed 70 watts in some cases. Exploiting ILP Via Superscalar Processing While our simple machine doesn't have any serious structural hazards, that's only because it is a "single-issue" architecture. Only a single instruction can be executed during a clock cycle. In a "superscalar" architecture, extra compute resources are added to achieve another dimension of instruction-level parallelism. The original Pentium provided 2 separate pipelines that Intel called the U and V pipelines. In theory, each pipeline could be working simultaneously on 2 different sets of instructions.
With a multi-issue processor (where multiple instructions can be dispatched each clock cycle to multiple pipelines in the single processor), we can have even more data hazards, since an operation in one pipeline could depend on data that is in another pipeline. The control hazards can get worse, since our "instruction fetch bandwidth" rises (doubled in a 2-issue machine, for example). A (mispredicted) branch instruction could cause both pipelines to need instructions flushed. Issue Restrictions Limit How Often Parallelism Can Be Achieved In practice, a superscalar machine has lots of "issue restrictions" that limit what each pipeline is capable of processing. This structural hazard limited how often both the U and V pipes of the Pentium could simultaneously execute 2 instructions. The limitations are caused by the cost of duplicating all the hardware for each pipeline, so the designers focus instead on exploiting parallelism in as many cases as practical. Combining Superscalar with Super-Pipelining to Get the Best of Both Another approach to superscalar is to duplicate portions of the pipeline. This becomes much easier in the new architectures that don't require instructions to proceed at the same rate through the pipeline (or even in the original program order). An obvious stage for exploiting superscalar design techniques is the execute stage, since PCs process three different types of data. There are integer operations, floating-point operations, and now "media" operations. We know all about integer and floating-point. A media instruction processes graphics, sound, or video data (as well as communications data). The instruction sets now include MMX, 3DNow!, Enhanced 3DNow!, SSE, and SSE2 media instructions. The execute stage could attempt to simultaneously process all three types of instructions, as long as there is enough hardware to avoid structural hazards. In practice, there are several structural hazards that require issue restrictions. Each new execution resource could also have its own pipeline. Many floating-point instructions and media instructions require multiple clocks and aren't fully pipelined in some implementations. We'll clear up any confusion when we analyze some real processors later. For now, it's only important to understand the fundamentals of superscalar design and realize that modern architectures include combinations of multiple pipelines running simultaneously. Exploiting Data-Level Parallelism Via SIMD We'll talk more about this later, but the new focus on media instructions has allowed CPU designers to recognize the inherent parallelism in the way data is processed. The same operation is often performed on independent data sets, such as multiplying data stored in a vector or a matrix. A single instruction is repeated over and over for multiple pieces of data. We can design special hardware to do this more efficiently, and we call this a "Single Instruction Multiple Data (SIMD)" computing model. More Pressure on the Memory System Once again, take a step back and think about the implications before that person in the back of the room gets us to dive into implementation details. With some intuitive analysis, we can observe that we've once again put tremendous pressure on our memory subsystem. A single instruction coming down our pipeline(s) could force multiple data load and store operations.
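To make the SIMD idea concrete, here is a small sketch using Intel's SSE intrinsics (the _mm_* names are the standard compiler intrinsics for SSE; the buffer and the gain value are invented for the example). Each loop iteration issues one multiply instruction that operates on four floating-point values at once, which also means one 16-byte load and one 16-byte store per instruction--exactly the memory pressure we just described.

    #include <stdio.h>
    #include <xmmintrin.h>                               /* SSE intrinsics */

    int main(void) {
        float samples[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };   /* a toy "media" buffer             */
        __m128 gain = _mm_set1_ps(0.5f);                 /* same scale factor in all 4 lanes */

        for (int i = 0; i < 8; i += 4) {
            __m128 v = _mm_loadu_ps(&samples[i]);        /* one 16-byte data load            */
            v = _mm_mul_ps(v, gain);                     /* one instruction, 4 multiplies    */
            _mm_storeu_ps(&samples[i], v);               /* one 16-byte data store           */
        }
        for (int i = 0; i < 8; i++) printf("%.1f ", samples[i]);
        printf("\n");                                    /* prints 0.5 1.0 1.5 ... 4.0       */
        return 0;
    }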
Thinking a bit further about the nature of media processing, some of the streaming media types (like video) have critical timing constraints, and the streams can last for a long time (i.e. as a viewer of video, you expect a continuous flow of the video stream over time, preferably without choppiness or interruptions). Our data caches may not do us much good, since the data may only get processed once before the next chunk of data wants to replace it (data caches are most effective when the same data is accessed over and over). Thus the CPU architects have some new challenges to solve. Where Should Designers Focus The Effort? By now, you've likely come to realize that every CPU vendor is trying to solve similar problems. They're all trying to take a 1970s instruction set and do as much parallel processing as possible, but they're forced to deal with the limitations of both the instruction set and the nature of memory systems. There is a practical limit to how many instructions can be processed in parallel, and it gets more and more difficult for the hardware to "dynamically" schedule instructions around any possible instruction blockage. The compilers are getting better at "statically" scheduling, based on the limited information available at compile time. However, the hardware is being pushed to the limits in an attempt to look as far ahead in the instruction stream as possible in the search for non-blocking instructions. It's All About Memory Latency As we've shown, there are 2 stages of our computer model where the designers can get the most return on their efforts. These are Instruction Fetch and Data Access, and both can cause an enormous performance loss if not handled properly. The problem is caused by the fact that our pipelines are now running at over one GHz, and it can take over 100 pipeline cycles to get something from main memory. The key to solving the problem is to make sure that the required instructions or data aren't sitting in main memory when you need them, but instead are already in a buffer inside your pipeline--or at least sitting in an upper level of your cache hierarchy. Branch Prediction Can Solve the Problem With I-Fetch Latency If we could predict with 100% certainty which direction a program branch is going (forward or backward in the instruction stream), then we could make sure that the instructions following the branch instruction are in the correct sequence in the pipeline. That's not possible, but improvements in the branch predictor can yield a dramatic performance gain for these modern, deeply-pipelined architectures. We'll analyze some branch prediction approaches later. Data Memory Latency is Much Tougher to Handle One way to deal with data latency is to have "non-blocking loads" so that other memory operations can proceed while we're waiting for the data for a specific instruction to come back from the memory system. Every x86 architecture does this now. Still, if the data is sitting in main memory when the load is being executed, the chip's performance will take a severe hit. The key is to pre-fetch blocks of data before they're needed, and special instructions have been added to directly allow the software to deal with the limited locality of data. There are also some ways that the pipeline can help, by buffering up load requests and using intelligent data pre-fetching techniques based on the processor's knowledge of the instruction stream. We'll analyze some of the vendor solutions to the problem of data access.
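Those "special instructions" include software pre-fetch hints (the SSE prefetch instructions are one example). The sketch below shows one hypothetical way a programmer or compiler might use the corresponding intrinsic; the distance to pre-fetch ahead is a made-up tuning knob, and whether any of this actually helps depends entirely on the loop, the data set, and the machine.

    #include <xmmintrin.h>              /* _mm_prefetch() and the _MM_HINT_* constants (SSE) */

    #define PREFETCH_AHEAD 256          /* bytes ahead of the current position; purely a guess to tune */

    /* Sum a large buffer while hinting the hardware to start pulling in data we will need shortly.
       Prefetch instructions are only hints: they never fault, so running past the end of the
       buffer is harmless. */
    float sum_with_prefetch(const float *data, int n) {
        float total = 0.0f;
        for (int i = 0; i < n; i++) {
            if ((i & 15) == 0)          /* issue one hint per 64-byte cache line, not per element */
                _mm_prefetch((const char *)&data[i] + PREFETCH_AHEAD, _MM_HINT_NTA);
            total += data[i];           /* _MM_HINT_NTA: streamed once, don't pollute the caches  */
        }
        return total;
    }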
A Closer Look At Branch Prediction The person in the back of the room will be happy to hear that things are about to get more complicated. We're now going to explore some of the recent innovations in CPU microarchitecture, starting with branch prediction. All the easy techniques have already been implemented. To get better prediction accuracy, microprocessor designers are combining multiple predictors and inventing clever new algorithms. There really are three different kinds of branches:

Forward conditional branches - based on a run-time condition, the PC (Program Counter) is changed to point to an address forward in the instruction stream.
Backward conditional branches - the PC is changed to point backward in the instruction stream. The branch is based on some condition, such as branching backwards to the beginning of a program loop when a test at the end of the loop states the loop should be executed again.
Unconditional branches - this includes jumps, procedure calls, and returns that have no specific condition. For example, an unconditional jump instruction might be coded in assembly language as simply "jmp", and the instruction stream must immediately be directed to the target location pointed to by the jump instruction. A conditional jump coded as "jne" would redirect the instruction stream only if the result of a previous "compare" instruction shows the two values to not be equal. (The segmented addressing scheme used by the x86 architecture adds extra complexity, since jumps can be either "near" (within a segment) or "far" (outside the segment). Each type has different effects on branch prediction algorithms.)

Using Branch Statistics for Static Prediction Forward branches dominate backward branches by about 4 to 1 (whether conditional or not). About 60% of the forward conditional branches are taken, while approximately 85% of the backward conditional branches are taken (because of the prevalence of program loops). Just knowing this data about average code behavior, we could optimize our architecture for the common cases. A "Static Predictor" can just look at the offset (distance forward or backward from the current PC) for conditional branches as soon as the instruction is decoded. Backward branches will be predicted to be taken, since that is the most common case. The accuracy of the static predictor will depend on the type of code being executed, as well as the coding style used by the programmer. These statistics were derived from the SPEC suite of benchmarks, and many PC software workloads will favor slightly different static behavior. Dynamic Branch Prediction with a Branch History Buffer (BHB) To refine our branch prediction, we could create a buffer that is indexed by the low-order address bits of recent branch instructions. In this BHB (sometimes called a "Branch History Table (BHT)"), for each branch instruction, we'd store a bit that indicates whether the branch was recently taken. A simple way to implement a dynamic branch predictor would be to check the BHB for every branch instruction. If the BHB's prediction bit indicates the branch should be taken, then the pipeline can go ahead and start fetching instructions from the new address (once it computes the target address). By the time the branch instruction works its way down the pipeline and actually causes a branch, the correct instructions are already in the pipeline.
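A minimal sketch in C may help pin the idea down (the table size and the indexing are arbitrary choices for illustration, not any shipping design): the BHB is just a small table of "was this branch taken last time?" bits, indexed by the low-order bits of the branch address.

    #include <stdbool.h>
    #include <stdint.h>

    #define BHB_ENTRIES 4096                        /* arbitrary power-of-two size for the example */

    static bool bhb[BHB_ENTRIES];                   /* one "taken last time?" bit per entry        */

    /* Index by low-order branch-address bits; note that distinct branches can collide. */
    static unsigned bhb_index(uint32_t branch_pc) {
        return branch_pc & (BHB_ENTRIES - 1);
    }

    /* Consulted early in the pipeline, before the branch actually executes. */
    bool predict_taken(uint32_t branch_pc) {
        return bhb[bhb_index(branch_pc)];
    }

    /* Called once the branch resolves; on a misprediction the pipeline is flushed
       and the stored bit simply flips to the actual outcome. */
    void record_outcome(uint32_t branch_pc, bool was_taken) {
        bhb[bhb_index(branch_pc)] = was_taken;
    }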
If the BHB was wrong, a "misprediction" occurred, and we'll have to flush out the incorrectly fetched instructions and invert the BHB prediction bit. Refining Our BHB by Storing More Bits It turns out that a single bit in the BHB will be wrong twice for a loop--once on the first pass of the loop and once at the end of the loop. We can get better prediction accuracy by using more bits to create a "saturating counter" that is incremented on a taken branch and decremented on an untaken branch. It turns out that a 2-bit predictor does about as well as you could get with more bits, achieving anywhere from 82% to 99% prediction accuracy with a table of 4096 entries. This size of table is at the point of diminishing returns for 2-bit entries, so there isn't much point in storing more. Since we're only indexing by the lower address bits, notice that 2 different branch addresses might have the same low-order bits and could point to the same place in our table--one reason not to let the table get too small. Two-Level Predictors and the GShare Algorithm There is a further refinement we can make to our BHB by correlating the behavior of other branches. Often called a "Global History Counter", this "two-level predictor" allows the behavior of other branches to also update the predictor bits for a particular branch instruction and achieve slightly better overall prediction accuracy. One implementation is called the "GShare algorithm". This approach uses a "Global Branch History Register" (a register that stores the taken/not-taken results of recent branches) that gets "hashed" with bits from the address of the branch being predicted. The resulting value is used as an index into the BHB, where the prediction entry at that location is used to dynamically predict the branch direction. Yes, this is complicated stuff, but it's being used in several modern processors. Using a Branch Target Buffer (BTB) to Further Reduce the Branch Penalty In addition to a large BHB, most predictors also include a buffer that stores the actual target address of taken branches (along with optional prediction bits). This table allows the CPU to look to see if an instruction is a branch and start fetching at the target address early on in the pipeline processing. Because the BTB stores the branch instruction's address along with its target address, the processor can know that an instruction is a branch even before decoding it. The figure below shows an implementation of a BTB. A large BTB can completely remove most branch penalties (for correctly-predicted branches) if the CPU looks far enough ahead to make sure the target instructions are pre-fetched. Using a Return Address Buffer to predict the return from a subroutine One technique for dealing with the unconditional branch at the end of a subroutine is to create a buffer of the most recent return addresses. There are usually some subroutines that get called quite often in a program, and a return address buffer can make sure that the correct instructions are in the pipeline after the return instruction. Speculative, Out-of-Order Execution Gets a New Name While RISC chips used the same terms as the rest of the computer engineering community, the Intel marketing department decided that the average consumer wouldn't like the idea of a computer that "speculates" or runs programs "out of order". A nice warm-and-fuzzy term was coined for the P6 architecture, and "Dynamic Execution" was added to our list of non-descriptive buzzwords.
Both AMD and Intel use a microarchitecture that, after decoding into simpler RISC instructions, tosses the instructions into a big hopper and allows them to execute in whatever order best matches the available compute resources. Once the instructions have finished executing out of order, results get "committed" in the original program order. The term "speculation" refers to instructions being speculatively fetched, decoded and executed. A useful analogy can be drawn to the stock market investor who "speculates" that a stock will go up in value and justify an investment. For a microprocessor speculating on instructions in advance, if the speculation turns out to be incorrect, those instructions are eliminated before any machine state changes are committed (written to processor registers or memory). Once Again, Let's Take a Step Back and Try Some More Intuitive Analysis By now that person in the back of the room has finally gotten used to these short pauses to look at the big picture. In this case, we just made a huge change to our machine, and it's hard to easily conceptualize. We've completely scrambled the notion of how instructions flow down a one-way pipeline. One thing that becomes obvious is the need for darn good branch prediction. All that speculation becomes wasted memory bandwidth, execution time, and power if we end up taking a branch we didn't expect. Following our stock investor analogy, if the value doesn't go up, then the investment was wasted and could have been more productively used elsewhere. In fact, the speculation could make us worse off. The need to wait before committing completed instructions to registers or memory should probably be obvious, since we could end up with incorrect program behavior and incorrect data--then have to try to unwind everything when a branch misprediction (or an exception) comes along. The real power of this approach would seem to be realized by having lots of superscalar stages, since we can reorder the instructions to better match the issue restrictions of multiple compute resources. OK, enough speculation, let's dig into the details: Register Renaming Creates Virtual Registers If you're going to have speculative instructions operating out of order, then you can't have them all trying to change the same registers. You need to create a "register alias table (RAT)" that renames and maps the eight x86 registers to a much larger set of temporary internal register storage locations, permitting multiple instances of any of the original eight registers. An instruction will load and store values using these temporary registers, while the RAT keeps track of what the latest known values are for the actual x86 registers. Once the instructions are completed and re-ordered so that we know the register state is correct, then the temporary registers are committed back to the real x86 registers. The Reorder Buffer (ROB) Helps Keep Instructions in Order After an instruction is decoded, it's allowed to execute out of order as soon as the operands (data) become available. A special Reorder Buffer is created to keep track of instruction status, such as when the operands become available for execution, or when the instruction has completed execution and results can be "committed" or "retired" to architectural registers or memory in the original program order. 
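Before we follow those instructions out to the execution units, a minimal renaming sketch may help (eight architectural registers, an invented pool of physical registers, and a trivial allocator--real designs recycle physical registers from a free list as instructions retire, which we skip here).

    #include <stdint.h>

    #define NUM_X86_REGS   8        /* the eight x86 integer registers                       */
    #define NUM_PHYS_REGS 40        /* made-up size for the internal physical register pool  */

    /* RAT: which physical register holds the newest (speculative) value of each x86 register. */
    static int rat[NUM_X86_REGS] = { 0, 1, 2, 3, 4, 5, 6, 7 };
    static int next_free = NUM_X86_REGS;    /* trivial allocator for the sketch */

    /* Rename one instruction of the form  dst = src1 op src2.
       Sources read whichever physical registers currently stand in for them; the destination
       gets a brand-new physical register, so older in-flight instructions that still need the
       previous value of dst are left undisturbed. */
    void rename(int dst, int src1, int src2,
                int *phys_dst, int *phys_src1, int *phys_src2) {
        *phys_src1 = rat[src1];
        *phys_src2 = rat[src2];
        *phys_dst  = next_free++;           /* sketch only: assumes we never exhaust NUM_PHYS_REGS */
        rat[dst]   = *phys_dst;             /* later readers of dst now see this newest copy       */
    }

With each destination getting its own physical copy, two in-flight instructions that both write the same x86 register no longer interfere with each other.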
These instructions use the renamed register set and are "dispatched" to the execution units as resources become available, perhaps spending some time in "reservation stations" that operate as instruction queues at the front of various execution units. After an instruction has finished executing, it can be "retired" by the ROB. However, the state still isn't committed until all the older instructions (with respect to program order) have been retired first. A neat thing about using register renaming, reservation stations, and the ROB is that a result from a completed instruction can be forwarded directly to the renamed register of a new instruction. Many potential data dependencies go away completely, and the pipelines are kept moving. Load and Store Buffering Tries to Hide Data Access Latency In the same way that instructions are executed as soon as resources become available, a load or a store instruction can get an early start by using this speculative approach. Obviously, the stores can't actually get sent all the way to memory until we're sure the data really should be changed (requiring we maintain program order). Instead, the stores are buffered, retired, and committed in order. The loads are a more interesting case, since they are directly affected by memory latency, the other key problem we highlighted earlier. The hardware will speculatively execute the load instruction, calculating the Effective Address out of order. Depending on the implementation, it may even allow out-of-order cache access, as long as the loads don't access the same address as a previous store instruction still in the processor pipeline, but not yet committed. If in fact the load instruction needs the results of a previous store that has completed but is still in the machine, the store data can get forwarded directly to the load instruction (saving the memory load time). Analyzing Some Real Microprocessors: P4 We've come to the end of our tutorial on processor microarchitecture. Hopefully, we've given you enough analytical tools so that you're now ready to dig into the details of some real products. There are a few common microarchitectural features (like instruction translation) that we decided would be easier to explain as we show some real implementations. We'll also look a bit deeper at the arcane science of branch prediction. Let's now take an objective look at the Intel P4, AMD Athlon, and VIA/Centaur C3. We'll then do some more big-picture analysis and gaze forward to predict the future of PC microarchitecture. Intel Pentium 4 Microarchitecture Intel is vigorously promoting the Pentium 4 as the preferred desktop processor, so we'll focus our Intel analysis on this microarchitecture. We'll make a few comparisons to previous processor generations, but our goal is to gain a detailed understanding of how the Pentium 4 meets its design goals. We'll leave it as an "exercise for the reader" to apply your new analytical tools to the Pentium III. The Pentium 4 is the first x86 chip to use some newer microarchitectural innovations, offering us an opportunity to explore some of these new approaches to dealing with the 2 key latency-induced challenges in CPU design. We should point out that our analysis only covers the "Willamette" version of the P4, while the forthcoming "Northwood" will move to a .13 micron process geometry and make slight changes to the microarchitecture (most likely improving the memory subsystem).
We'll update this article when we get more information on Northwood. The NetBurst™ Moniker Describes a Collection of Design Features What's the point of introducing a new product without adding a new Intel buzzword? In this case, the name doesn't refer to a single architectural improvement, but is really meant to serve as a name for this family of microprocessors. The NetBurst design changes include a deeper pipeline, new bus architecture, more execution resources, and changes to the memory subsystem. The figure below shows a block diagram of the Pentium 4, and we'll take a look at each major section. Deeply Pipelined for Higher Clock Rate The Pentium 4 has a whopping 20-stage pipeline when processing a branch misprediction. The figure below shows how this pipeline compares to the 10 stages of the Pentium III. The most interesting thing about the Pentium 4 pipe is that Intel has dedicated 2 stages for driving data across the chip. This is fascinating proof that the limiting factor in modern IC design has become the time it takes to transmit a signal across the wire connections on the chip. To understand why it's fascinating, consider that it wasn't so long ago that designers only worried about the speed of transistors, and the time it took to traverse such a short piece of metal was considered essentially instantaneous. Now we're moving from aluminum to copper interconnect, because copper's lower resistance lets signals propagate across the chip faster. (I can see that person in the back of the room is still with us and is nodding in agreement.) This is fascinating stuff, and Intel is probably the first vendor to design a pipeline with "Drive" stages. What About All Those Problems with Long Pipelines? Well, Intel has to work especially hard to make sure they avoid pipeline hazards. If that long pipeline needs to be flushed very often, then the performance will be much lower than other designs. We should remind ourselves that the longer pipeline actually results in less work being done on each clock cycle. That's the whole point of super-pipelining (or hyper-pipelining, if you prefer), since doing less work in a clock cycle is what allows the clock cycle time to be shortened. The pipeline has to run at a higher frequency just to do the same amount of work as a shorter pipeline. All other things being equal, you'd expect the Pentium 4 to have less performance than parts with shorter pipelines at the same frequency. Searching for Even More Instruction-Level Parallelism As we learned, there is another thing to realize about long pipelines (besides being able to run at the high clock rates that motivate uninformed buyers). Longer pipelines allow more instructions to be in process at the same time. The compiler (static scheduler) and the hardware (dynamic scheduler) must keep the faster and deeper pipeline fed with the instructions and data it needs during a larger instruction "window". The machine is going to have to search even further to find instructions that can execute in parallel. As you'll see, the Pentium 4 can have an incredible 126 instructions in-flight as it searches further and further ahead in the instruction stream for something to work on while waiting for data or resource dependencies to clear. Pentium 4's Cache Organization Cache Organization in the Memory Hierarchy As we described in our article on motherboard technology, there is usually a trade-off between cache size and speed.
This is mostly because of the extra capacitive loading on the signals that drive the larger SRAM arrays. Refer again to the block diagram of the Pentium 4. Intel has chosen to keep the L1 caches rather small so that they can reduce the latency of cache accesses. Even a data cache hit will take 2 cycles to complete (6 cycles for floating-point data). We'll talk about the L1 caches in a moment, but further down the hierarchy we find that the L2 cache is an 8-way, unified (includes both instruction and data), 256KB cache with a 128B line size. The 8-way structure means it has 8 tags per set, providing about the same cache miss rate as a "fully-associative" cache (as good as it gets). This makes the 256KB cache more effective than its size indicates, since the miss rate of this cache is approximately 60% of the miss rate for a direct-mapped (1-way) cache of the same size. The downside is that an 8-way cache will be slower to access. Intel states that the load latency is 7 cycles (this reflects the time it takes an L2 cache line to be fully retrieved to either the L1 data cache or the x86 instruction prefetch/decode buffers), but the cache is able to transfer new data every 2 cycles (which is the effective throughput assuming multiple concurrent cache transfers are initiated). Again, notice that the L2 cache is shared between instruction fetches and data accesses (unified). System Bus Architecture is Matched to Memory Hierarchy Organization One interesting change for the L2 cache is to make the line size 128 bytes, instead of the familiar 32 bytes. The larger line size can slightly improve the hit rate (in some cases), but requires a longer latency for cache line refills from the system bus. This is where the new Pentium 4 bus comes into play. Using a 100MHz clock and transferring data four times on each bus clock (which Intel calls a 400MHz data rate), the 64-bit system bus can bring in 32 bytes each bus clock. This translates to a bandwidth of 3.2 GB/sec. To fill an L2 cache line requires four bus cycles (the same number of cycles as the P6 bus needs for a 32-byte line). Note that the system bus protocol has a 64-byte access length (matching the line size of the L1 cache) and requires 2 main memory request operations to fill an L2 cache line. However, the faster bus only helps overcome the latency of getting the extra data into the CPU from the North Bridge. The longer line size still causes a longer latency before getting all the burst data from main memory. In fact, some analysts note that P4 systems have about 19% more memory latency than Pentium III systems (measured in nanoseconds for the demand word of a cache refill). Smart pre-fetching is critical, or else the P4 will end up with less performance on many applications. Pre-Fetching Hardware Can Help if Data Accesses Follow a Regular Pattern The L2 cache has pre-fetch hardware to request the next 2 cache lines (256 bytes) beyond the current access location. This pre-fetch logic has some intelligence to allow it to monitor the history of cache misses and try to avoid unnecessary pre-fetches (that waste bandwidth and cache space). We'll talk more about the pre-fetcher later, but let's take a quick pause for some of our patented intuitive analysis. We've described the problem of dealing with streaming media types (like video) that don't spend much time in the cache.
The hardware pre-fetch logic should easily notice the pattern of cache misses and then pre-load data, leading to much better performance on these types of applications. Designing for Data Cache Hits Intel boasts of "new algorithms" to allow faster access to the 8KB, four-way, L1 data cache. They are most likely referring to the fact that the Pentium 4 speculatively processes load instructions as if they always hit in the L1 data cache (and data TLB). By optimizing for this case, there aren't any extra cycles burned while cache tags are checked for a miss. The load instruction is sent on its merry way down the pipeline; if a cache miss delays the load, the processor passes temporarily incorrect data to dependent instructions that assumed the data arrived in 2 cycles. Once the hardware discovers the L1 data cache miss and brings in the actual data from the rest of the memory hierarchy, the machine must "replay" any instructions that had data dependencies and grabbed the wrong data. It's unclear how efficient this approach will be, since it obviously depends on the load pattern for the applications. The worst case would be an application that constantly loads data that is scattered around memory, while attempting to immediately perform an operation on each new data value. The hardware pre-fetch logic would (perhaps mercifully) never "trigger", and the pipeline would be constantly restarting instructions. Again, the Pentium 4 design seems to have been optimized for the case of streaming media (just as Intel claims), since these algorithms are much more regular and demand high performance. The designers probably hope that the pathological worst case only occurs for code that doesn't need high performance. When the L1 data cache does have a miss, it has a "fat pipe" (32 bytes wide) to the L2 cache, allowing each 64-byte cache line to be refilled in 2 clocks. However, there is a 7-cycle latency before the L2 data starts arriving, as we mentioned previously. The Pentium 4 can have up to four L1 data cache misses in process. Pentium 4's Trace Cache The Trace Cache Depends on Good Branch Prediction Instead of a classic L1 instruction cache, the Pentium 4 designers felt confident enough in their branch prediction algorithms to implement a trace cache. Rather than storing standard x86 instructions, the trace cache stores the instructions after they've already been decoded into RISC-style instructions. Intel calls them "µops" (micro-ops) and stores 6 µops for each "trace line". The trace cache can house up to 12K µops. Since the instructions have already been decoded, the hardware knows about any branches and fetches the instructions that follow the branch. As we learned, it's the conditional branches that could really cause a problem, since we won't know if we're wrong until the branch condition check in Arithmetic Logic Unit 0 (ALU0) of the execution core. By then, our trace cache could have pre-fetched and decoded a lot of instructions we don't need. The pipeline could also allow several out-of-order instructions to proceed if the branch instruction was forced to wait for ALU0. Hopefully, the alternative branch address is somewhere in the trace cache. Otherwise, we'll have to pay those 7 cycles of latency to get the proper instructions from the L2 cache (pity us if it's not there either, as the L2 cache would need to get the instructions from main memory) plus the time to decode the fetched x86 instructions.
Intel's reference to the 20-stage P4 pipeline actually starts with the trace cache, and does not include the cycles for instruction or data fetches from system memory or L2 cache. The Trace Cache has Several Advantages If our predictors work well, then the trace cache is able to provide (the correct) three µops per cycle to the execution scheduler. Since the trace cache is (hopefully) only storing instructions that actually get executed, it makes more efficient use of the limited cache space. Since the branch target instruction has already been decoded and fetched in execution order, there isn't any extra latency for branches. The person in the back of the room just reminded us of an interesting point. We never mentioned a TLB check for the trace cache, because it does not use one. So, the Pentium 4 isn't so complicated after all. Most of you correctly observed that this cache uses virtual addressing, so there isn't any need to convert to physical addresses until we access the L2 cache. Intel documents don't give the size of the instruction TLB for the L2 cache. Pentium 4 Decoder Relies on Trace Cache to Buffer µops The Pentium 4 decoder can only convert a single x86 instruction on each clock, fewer than other architectures. However, since the µops are cached in the trace buffer (and hopefully reused), the decode bandwidth is probably adequate to match the instruction issue rate (three µops/cycle). If an x86 instruction requires more than four µops, then the decoder fetches µops directly from a µops "Read-Only Memory (ROM)". All x86 processor architectures use some sort of ROM for infrequently used instructions or multi-cycle string operations. The Execution Engine Runs Out Of Order For an out-of-order machine, the main design goal is to provide enough parallel compute resources to make it worth all the extra complexity. In this case, the machine is working to schedule instructions for 7 different parallel units, shown in the figure below. Two of these units dispatch loads and stores (the Data Access stage of our original computer model). The other processing tasks use multiple schedulers and are dispatched through the 2 Exec Ports. Each port could have a fast ALU operation scheduled every half cycle, though other µops get scheduled every cycle. The figure below shows what each port can dispatch. Notice the numerous issue restrictions (structural hazards). If you were to have just fast ALU µops on both Exec Ports and a simultaneous Load and Store dispatch, then a total of 6 µops/cycle (four double-speed ALU instructions, a Load, and a Store) can be dispatched to execution units. The performance of the execution engine will depend on the type of program and how well the schedulers can align µops to match the execution resources. Retiring Instructions in Order and Updating the Branch Predictors The Reorder Buffer can retire three µops/cycle, matching the instruction issue rate. There are some subtle differences in the way the Pentium 4 ROB and register renaming are implemented compared to other processors like the Pentium III, but the operation is very similar. As we've shown, a key to performance is to avoid mispredicted branches. As instructions are retired from the ROB, the final branch addresses are used to update the Branch Target Buffer and Branch History Buffer. In case some of you have finally figured out modern branch predictors, Intel has chosen to rename the combination of a BTB and a BHB.
Intel calls the combination a "Branch Target Buffer (BTB)", ensuring extra confusion for our new students of computer microarchitecture. Branch Prediction Uses a Combination of Schemes While there isn't much public information about how the Pentium 4 does branch prediction, they likely use a two-level predictor and combine information from the Static Prediction we discussed earlier. They also include a Return Address Buffer of some undisclosed size. The specific algorithms are part of the "secret sauce" that processor vendors guard closely. In the past, we've seen various patent filings describing algorithmic mechanisms used in branch predictors and other processor subsystems. The patent details shed more light on their implementations than processor vendors would otherwise choose to disclose publicly. Branch Hints Can Allow Faster Performance on a Known Data Set The Pentium 4 also allows software-directed branch hints to be passed as prefixes to branch instructions. These branch hints allow the software to override the Static Predictor and can be a powerful tool. This is particularly true if the program is compiled and executed with special features enabled to collect information about program flow. The information from the prior run can be fed back to the compiler to create a new executable with Branch Hints that avoid the earlier mispredictions. There is some potential for marketing abuse of this feature, since benchmarks that use a repeatable data set can be optimized to avoid performance-killing branch mispredictions. Support for New Media Instructions The Pentium 4 has retained the earlier x86 instruction extensions (MMX and SSE) and added 144 new instructions they call SSE2. It will be the task for another article to give a complete analysis and comparison of the x86 instruction extensions and execution resources. However, as we've noted several times, the Pentium 4 is tuned for performance on streaming media applications. Poor Thermal Management Can Limit Performance One potentially troubling feature of the Pentium 4 is the "Thermal Monitor" that can be enabled to slow the internal clock rate to half speed (or less, depending on the setting) when the die temperature exceeds a certain value. On a 1.5 GHz Pentium 4 (Willamette), this trip point currently equates to 54.7 Watts of power dissipation (according to Intel's Thermal Design Guide and P4 datasheet). This is almost certainly a limitation of the package and heat sink, but the maximum power dissipation of a 1.5 GHz part is currently about 73 Watts. Intel would argue that this maximum would never be reached, but it is quite possible that demanding applications will cause a poorly-cooled CPU to exceed the current thermal cut-off point - losing performance at a time when you need it the most. As Intel moves to lower voltages in a more advanced manufacturing process, these limits will be less of a problem at current clock rates. As higher clock rate parts are introduced, the potential performance loss will again be an issue. Certainly, the Thermal Monitor is a good feature for ensuring that parts don't destroy themselves. It also is a clever solution to the problem of turning on fans quickly enough to match the high thermal ramp rates. The concerns may only arise for low-cost, inadequate heatsinks and fans. Customers may appreciate the system stability this feature offers, but not the uncertainty about whether they're getting all the performance they paid for.
We've heard from one of Intel's competitors that certain Dell and HP Pentium 4 systems they tested do not enable this clock slow-down feature. This is actually a good thing if Dell and HP are confident about their thermal solution. We plan to write a separate report on our testing of this feature soon.

Overall Conclusions About the Pentium 4

The large number of complex new features in this processor has required a lot of explanation. Clearly, this is a design that is intended to scale to dramatically higher clock rates, and only at higher clock rates will the benefits of the microarchitecture be fully realized. It is also likely that the designers were forced to make painful trade-offs in the sizes of the on-chip memory hierarchy. With a microarchitecture so sensitive to cache misses, it will be critical to increase the size of these memories as transistor budgets increase. With good thermal management, higher clock rates, and bigger caches, this chip should compete well in desktop systems in the future, while doing very well today with streaming media, memory bandwidth-intensive applications, and functions that use SSE2 instructions.

AMD Athlon Microarchitecture

The Athlon architecture is more similar to our earlier analysis of speculative, out-of-order machines. This similarity is partly due to the (comforting) maturity of the architecture, but it should be noted that the original design of the Athlon microarchitecture emphasized performance above other factors. That more aggressive initial design approach keeps the architecture sustainable while minor optimizations are implemented for clock speed or die cost. AMD will soon ship a new version of the Athlon, code-named "Palomino", possibly sporting bigger caches and subtle changes to the microarchitecture. For this article, we examine "Thunderbird", the design introduced in June 2000.

Parallel Compute Resources Benefit From Out-of-Order Approach

The extra complexity of creating an out-of-order machine is wasted if there aren't parallel compute resources available to take advantage of the exposed instructions. Here is where the Athlon really shines: the microarchitecture can execute 9 RISC instructions (what AMD calls "OPs") simultaneously. The figure below shows the block diagram of the Athlon. Note the extra resources for standard floating-point Ops, which likely explains why this processor does so well on FP-intensive programs. (Well, that person in the back of the room is still with us.) Yes, the comparative analysis does get more complex if we include the P4's SSE2 instructions for SIMD floating-point, but we'll have to leave that analysis for another article. The current Athlon architecture will certainly have higher performance for applications that don't have high data-level parallelism.

Cache Architecture Emphasizes Size to Achieve High Hit Rate

Note that AMD has chosen to implement large L1 caches. The L1 instruction and data caches are each 2-way, 64KB caches. The L1 instruction cache has a line size of 64 bytes with a 64-byte sequential pre-fetch. The L1 data cache provides a second data port to avoid structural hazards caused by the superscalar design. The L2 cache is a 16-way, 256KB unified cache, backed up by the fast EV6 bus we discussed in the motherboard article.
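To see what these cache parameters imply about the indexing structure, here is a minimal sketch that derives the number of sets and the address-bit split from capacity, associativity, and line size. The L1 figures (64KB, 2-way, 64-byte lines) come from the text; the 64-byte L2 line size is our assumption for the example.

    # Sketch: derive set-associative cache geometry from capacity, ways, line size.
    # L1 parameters are from the article; the 64-byte L2 line size is assumed.

    def cache_geometry(size_bytes, ways, line_bytes):
        """Return (sets, offset_bits, index_bits), assuming power-of-two sizes."""
        sets = size_bytes // (ways * line_bytes)
        offset_bits = line_bytes.bit_length() - 1   # log2 of the line size
        index_bits = sets.bit_length() - 1          # log2 of the number of sets
        return sets, offset_bits, index_bits

    if __name__ == "__main__":
        print("Athlon L1 (64KB, 2-way, 64B lines):", cache_geometry(64 * 1024, 2, 64))
        # -> (512, 6, 9): 512 sets, 6 offset bits, 9 index bits
        print("Athlon L2 (256KB, 16-way, 64B lines assumed):", cache_geometry(256 * 1024, 16, 64))
        # -> (256, 6, 8)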
If we take a step back and think about the differences between the P4 and Athlon memory hierarchies, we can make a few observations. Intel's documentation states that its 12K-µop trace cache will have the same hit rate as an "8K to 16K byte conventional instruction cache". By that measure, the Athlon will have much better hit rates, though its hits carry the extra latency of decoding instructions. An L1 miss is much worse for the P4's longer pipeline, though smart pre-fetching can overcome this limitation. Remember, at these high clock rates it doesn't take long to drain an instruction cache. It will eventually come down to the accuracy of the branch predictor, but the Pentium 4 will still need a bigger trace cache to match the Athlon's instruction fetch effectiveness.

Pre-Decoding Uses Extra Cache Bits

To deal with the complexities of the x86 instruction set, AMD does some early decoding of x86 instructions as they are fetched into the L1 instruction cache. These extra bits help mark the beginning and end of the variable-length instructions, as well as identify branches for the pre-fetcher (and predictor). These extra bits and early (partial) decoding give some of the benefits of a trace cache, though there is still latency for the completion of the decoding.

Final Decoding Follows 2 Different Paths

Figure 9 shows the decode pipeline for the Athlon. Notice that it matches the flow of our original computer model, breaking up the Instruction Access and Decode stages into 6 pipeline stages. AMD uses a fixed-length instruction format called a "MacroOp", containing one or more Ops; the instruction scheduler will turn MacroOps into Ops as it dispatches to the execution units. The "DirectPath Decoder" generates MacroOps that take one or two Ops. The "VectorPath Decoder" fetches longer instructions from ROM. Notice in the figure below that the Athlon can supply three MacroOps/cycle to the instruction decoder (the IDEC stage), and later they'll enter the instruction scheduler, equating to a maximum of 6 Ops/cycle of decode bandwidth. Note that the actual decode performance depends on the type of instructions.

AMD Athlon Scheduler, Data Access

Integer Scheduler Dispatches Ops to 6 Execution Units

The figure below shows how pipeline stage 7 buffers up to 18 MacroOps that are dispatched as Ops to the integer execution units. This (reservation station) is where instructions wait for operands (including data from memory) to become available before executing out of order. As you'll recall, there is a Reorder Buffer that keeps track of instruction status, operands, and results, ensuring the instructions are retired and committed in program order. Note that Integer Multiply instructions require more compute resources and force extra issue restrictions.

Data Access Forces Instructions to Wait

Even for an out-of-order machine, our original computer model still holds up well. Notice in the figure below that loads and stores use the "Address Generation Units (AGUs)" to calculate the Effective Address (cycle 9, the ADDGEN stage) and access the data cache (cycle 10, DC ACC). In cycle 11, the data cache sends back a hit/miss response (and potentially the data). If another instruction is waiting in the scheduler for this data, the data is forwarded. Cache misses will cause the instructions to wait. There is a separate 44-entry Load/Store Unit (LSU) that manages these instructions.
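To make that load timing concrete, here is a minimal sketch of when a dependent instruction can receive its data: address generation in stage 9, data-cache access in stage 10, and the hit/miss response with forwarding in stage 11. The stage numbers come from the text; treating an L1 miss as a fixed additional delay is a simplifying assumption of ours, since the real penalty depends on the L2 cache and the memory system.

    # Sketch of the Athlon load timing described above (Python).
    # Stages: 9 = ADDGEN, 10 = DC ACC, 11 = hit/miss response and forwarding.
    # The miss penalty below is an assumed placeholder, not an AMD figure.

    ASSUMED_MISS_PENALTY = 20   # extra cycles to fetch the line from L2/memory

    def data_forward_stage(l1_hit):
        """Pipeline stage at which a waiting consumer can receive the load data."""
        response_stage = 11
        return response_stage if l1_hit else response_stage + ASSUMED_MISS_PENALTY

    if __name__ == "__main__":
        print("L1 hit:  data forwarded at stage", data_forward_stage(True))
        print("L1 miss: dependent Op waits until stage", data_forward_stage(False))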
Floating Point Instructions Have Their Own Scheduler and Pipeline

The Athlon can simultaneously process three types of floating-point instructions (FADD, FMUL, and FSTORE), as shown in the figure below. The floating-point units are "fully pipelined", so new FP instructions can start while other instructions haven't yet completed. MMX/3DNow! instructions can be executed in the FADD and FMUL pipelines. The FP instructions execute out of order, and each of the three pipelines has several different execution units. There are some issue restrictions that apply to these pipelines. The performance of the Athlon's fully-pipelined FP units allows it to consistently outperform the Pentium III at similar clock speeds, and a 1.33GHz Athlon even performs better than a 1.5GHz Pentium 4 in some FP benchmarks. However, we haven't yet seen enough SSE2-optimized applications to draw a definitive conclusion for workloads that may benefit from SSE2.

Branch Prediction Logic is a Combination of the Latest Methods

There is a 2048-entry Branch Target Buffer that caches the predicted target address. This works in concert with a Global History Table that uses a "bimodal counter" to predict whether branches are taken. If the prediction is correct, there is a single-cycle delay to redirect the instruction fetcher to the new address. (Note that the P4 trace cache doesn't have any predicted-branch-taken delays.) If the predictor is wrong, the minimum delay is 10 cycles. There is also a 12-entry Return Address Buffer.

Overall Conclusions About the Athlon Microarchitecture

To prevent this article from becoming interminably long, we have to gloss over many features of the Athlon architecture, and undoubtedly several features will change as new versions are introduced. The main conclusion is that the Athlon is a more traditional, speculative, out-of-order machine and requires fewer pipeline stages than the Pentium 4. At the same clock rate, the Athlon should perform better than the Pentium 4 on many of today's mainstream applications. The actual comparison ratio will depend on how well the P4's SSE2 instructions are being used, how well the P4's branch predictors and pre-fetchers are working, and how well the system/memory bus is being utilized. Memory bandwidth-intensive applications favor the P4 today. There is a lot of room for optimizing code to match the microarchitecture, and both AMD and Intel are working with software developers to tune their applications. We look forward to seeing what enhancements AMD delivers with Palomino.

Centaur C3 Microarchitecture

Even though VIA/Centaur doesn't have the same market share as Intel and AMD, it has an experienced design team and some interesting architectural innovations. This architecture also makes a nice contrast with the Intel and AMD approaches, since Centaur has been able to stay with an in-order pipeline and still achieve good performance. The Centaur chips use the same P6 system bus and Socket 370 motherboards. A great cost advantage for the C3 is its diminutive size: only 52 square mm in its 0.18-micron process, compared to 120 square mm for the Athlon and 217 square mm for the P4. Also, the fastest C3 today, at 800MHz, consumes a very modest 17.4 watts maximum at 1.9V, with typical power measured at 10.4 watts. This is much more energy-efficient than the Athlon and P4.

Improving the Memory Subsystem to Solve the Key Problems

There are some philosophical differences of opinion on how best to spend the limited transistor budget, especially for architectures specifically designed for lower cost and power. Intel and AMD are battling for the high end, where the fastest CPUs command a price premium.
They can tolerate the expense of larger die sizes and more thermally-effective packages and heat sinks. However, when maximum performance drops to the number 2 or 3 priority behind power and cost, different design choices are made. Up until now, Intel and AMD have made slight modifications to their high-performance architectures to address these other markets. As the markets bifurcate further, AMD and Intel may introduce parts with microarchitectures that are more optimized for power and cost.

Centaur Uses Cache Design to Directly Deal with Latency

VIA (Centaur) made early design choices to target the low-cost markets. Centaur has stressed the value of optimizing the memory subsystem to solve the key problem of memory latency. If you're constraining your die size to reduce cost, many processor designers feel it's often a better trade-off to spend those transistors in the memory subsystem. Centaur's chip architects believe that their large L1 caches (four-way, 64KB each) give them a better performance return than if they had used the die area (and design time) to more aggressively reschedule instructions in the pipeline. If latency is the key problem, then clever cache design is a direct way to address it. The figure below shows the block diagram of the Centaur processor. The Cyrix name has recently been dropped, and this product is marketed as the "VIA C3" (internally referred to as C5B).

Decoupling the Pipeline to Reduce Instruction Blockage

Even with a pipeline that processes instructions in order, it is possible to solve many of the key design problems by allowing the different pipeline stages to process groups of instructions. At various stages of the pipeline, instructions are queued up while waiting for resources to become available. Called a "decoupled architecture", an in-order machine like the Centaur C3 processor will have the same performance as the out-of-order approach we've described, as long as no instructions block the pipeline. If a block occurs at a later stage of the pipeline, the in-order machine continues to fill queues earlier in the pipeline while waiting for the dependencies to clear, and it can then proceed again at full speed to drain the queues. This is somewhat analogous to the reservation stations in the out-of-order architectures. As Centaur continues to refine the architecture, it plans to further decouple the pipeline by adding queues to other stages and execution units; a small sketch of the idea follows the pipeline walk-through below.

Super-Pipelining an In-Order Microarchitecture

The 12 stages of the C3 pipeline are shown on the right-hand side of the block diagram in figure 13. By now, you're probably able to easily identify what happens in each stage. Instructions are fetched from the large I-cache and then pre-decoded (without needing extra pre-decode bits stored in the cache). The decoder works by first translating x86 instructions into an interim x86 format and placing them into a five-deep buffer, at which point enough is known about branches to enable static prediction. From this buffer, the interim instructions are translated into micro-instructions, either directly or from a microcode ROM. The micro-instructions are queued again before passing through the final decoder, where they also receive any data from registers. From there, the instructions are dispatched to the appropriate execution unit, unless they require access to the data cache.
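As promised, here is a minimal sketch of the decoupling idea: a front end produces instructions into a bounded queue, and a back end consumes them unless it is stalled. The queue depth and the stall pattern are invented for the example; the point is simply that the front end keeps filling the queue during a back-end stall, so the back end can keep working from a full queue once the block clears.

    # Sketch of a decoupled in-order pipeline (Python).
    # A front end fetches one instruction per cycle into a bounded queue;
    # the back end retires one per cycle unless it is blocked.
    # The queue depth and stall cycles are illustrative assumptions.

    from collections import deque

    QUEUE_DEPTH = 5
    STALL_CYCLES = {4, 5, 6}     # back end blocked, e.g. waiting on a cache miss

    def simulate(total_cycles=12):
        queue = deque()
        fetched = completed = 0
        for cycle in range(1, total_cycles + 1):
            if len(queue) < QUEUE_DEPTH:              # front end keeps fetching
                fetched += 1
                queue.append(fetched)
            if cycle not in STALL_CYCLES and queue:   # back end consumes unless stalled
                queue.popleft()
                completed += 1
            print(f"cycle {cycle:2d}: queued={len(queue)} completed={completed}")

    if __name__ == "__main__":
        simulate()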
Note that this pipeline has the Data Access stages before execution, quite different from our original computer model; we'll talk about the implications in a moment. The floating-point units are not designed for the highest performance, since they run at half the pipeline frequency and are not fully pipelined (a new FP instruction starts every other cycle). After the execution stage, all instructions proceed through a "Store-Branch" stage before the result registers are updated in the final pipeline stage. Note that the C3 supports MMX and 3DNow! instructions.

Breaking Our Simple Load/Store Computer Model

During the Store-Branch stage, a couple of interesting things occur. If a branch instruction is incorrectly predicted, the new target address is sent to the I-cache in this stage. The other operation is to move Store data into a store buffer. Since an instruction has to pass through this pipeline stage anyway, Centaur was able to directly implement the common Load-ALU and Load-ALU-Store instructions as single micro-instructions that execute in a single cycle (with the data required to be loaded before the execute stage). This completely removes the extra Load and Store instructions from the instruction stream (as found in other current x86 processors following internal RISC principles), speeding up execution time for these operations. No other modern x86 processor has this interesting twist to its microarchitecture. It also has the unfortunate side effect of complicating our original, simple model of a computer pipeline, since this is a register-memory operation.

A Sophisticated Branch Prediction Mechanism

Since the C3 pipeline is fairly deep (even if the P4's pipeline has changed our perspective on "deep"), good branch prediction becomes quite important. (That person in the back of the room is going to love this discussion, since Centaur uses every trick and invents some more.) Centaur takes the interesting approach of directly calculating the target for unconditional branches that use a displacement value (to an offset address). The designers decided that including a special adder early in the pipeline was better than relying on a Branch Target Buffer for these instructions (about 95% of all branches). Obviously, directly calculating the address will always give the correct target address, whereas the BTB may not always contain it. For conditional branches, Centaur uses the G-Share algorithm we described earlier. This uses a 13-bit Global Branch History that is XOR'd with the branch instruction address (an exclusive-OR of each pair of bits returns a 1 if only one of the input bits is a 1). The result indexes into the Branch History Buffer to look up the prediction for the branch. Centaur also uses the "agrees-mode" enhancement to encode a (single) bit that indicates whether the table look-up agrees with the static predictor. There is also another 4K-entry table that selects which predictor (simple or history-based) to use for a particular branch, based on the previous behavior of that branch. Basically, Centaur uses a static predictor and two different dynamic predictors, as well as a predictor to select which type of dynamic predictor to use. To that person in the back of the room: if you'd like to know more, check out Centaur's patent filings. A future ExtremeTech article will focus specifically on branch prediction methods.
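Here is a minimal sketch of the G-Share indexing just described: a 13-bit global branch history is XORed with low-order bits of the branch address, and the result indexes a table of saturating counters. The 13-bit history length comes from the text; the two-bit counters and the update policy follow the textbook G-Share formulation, and the agrees-mode bit and the 4K-entry selector table are omitted to keep the sketch short.

    # Sketch of a G-Share conditional-branch predictor (Python).
    # 13-bit global history per the article; two-bit saturating counters and
    # the training policy follow the standard textbook formulation.

    HISTORY_BITS = 13
    TABLE_SIZE = 1 << HISTORY_BITS        # 8192 counters
    counters = [2] * TABLE_SIZE           # start each counter at "weakly taken"
    global_history = 0

    def predict(branch_pc):
        """Return True if the branch is predicted taken."""
        index = (branch_pc ^ global_history) & (TABLE_SIZE - 1)
        return counters[index] >= 2

    def update(branch_pc, taken):
        """Train the indexed counter and shift the outcome into the history."""
        global global_history
        index = (branch_pc ^ global_history) & (TABLE_SIZE - 1)
        counters[index] = min(3, counters[index] + 1) if taken else max(0, counters[index] - 1)
        global_history = ((global_history << 1) | int(taken)) & (TABLE_SIZE - 1)

    if __name__ == "__main__":
        # A branch that alternates taken/not-taken becomes predictable once its
        # pattern is captured in the global history.
        pc, correct = 0x401000, 0
        for i in range(200):
            outcome = (i % 2 == 0)
            correct += (predict(pc) == outcome)
            update(pc, outcome)
        print("correct predictions:", correct, "out of 200")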
Overall Conclusions About the Centaur Architecture

This microarchitecture has some interesting innovations that are made possible by staying with an in-order pipeline and focusing on low-cost, single-processor systems. While these microarchitectural features are interesting, our analysis doesn't draw any conclusions about performance (except to note the half-speed FP unit). The performance will depend on the type of applications, and a CPU that is optimized for cost should really be viewed at the system level. If cost is a primary concern, then the entire system needs to be configured with the minimum hardware required to acceptably run the applications you care about. Stay tuned to ExtremeTech for benchmarks of these budget PCs.

Overall Conclusions

This ends our journey through the strange world inside modern CPUs. We started from basic concepts and went very rapidly through a lot of complicated material, and we hope you didn't have too much trouble digesting it all in one sitting. As we stated at the very beginning, the details of microarchitecture are only interesting to CPU architects and hard-core PC technology enthusiasts. As you've learned, the designers have made several trade-offs, and they've been forced to optimize for certain types of applications. If those applications are important to you, then check out the appropriate benchmarks running on real systems. In that way, the CPU microarchitecture can be analyzed in the context of the entire PC system.

The Future of PC Microarchitectures

It used to be easy to forecast the sort of microarchitectural features coming to PC processors: all one had to do was look at high-end RISC chips or large computer systems. Well, most of the high-end design techniques have already made their way into the PC processor world, and going forward will require new innovation by the PC CPU vendors.

Teaching an Old Dog New Tricks

One interesting trend is the return to older approaches that were not previously viable for the mainstream. The most noteworthy example is "Very Long Instruction Word (VLIW)" architectures. This is what is referred to as an "exposed pipeline", where the compiler must specifically encode separate instructions for each parallel operation in advance of execution. This is much different from forcing the processor to dynamically schedule instructions while it is running. The key enabler is that compiler technology has improved dramatically, and a VLIW architecture makes the compiler do more of the work of discovering and exploiting instruction-level parallelism. Transmeta has implemented an internal VLIW architecture for its low-power Crusoe CPUs, counting on its code-morphing software to exploit the parallel architecture. Intel's new 64-bit "Itanium" architecture uses a version of VLIW, but it has been slow to get to market. It will be several years before enough interesting desktop applications are ported to Itanium to make it a mainstream desktop CPU.

AMD Plans to Hammer Its Way into the High End of the Market

Instead of counting on new compilers and the willingness of software developers to support a radically new architecture (like Itanium), AMD is evolving the x86 instruction set to support full 64-bit processing. With a 64-bit architecture, the "Hammer" series of processors will be better at working on very large problems that require more addressing space (servers and workstations).
There will also be a performance gain for some applications, but the real focus will be support for large, multi-processor systems. Eventually, the Hammer family could make its way down into the mainstream desktop.

Still Some Features to Copy From RISC

Some new RISC chips have an interesting and exciting feature that hasn't yet made its way into the PC space. Called "Simultaneous Multithreading (SMT)", this approach duplicates all the registers and swaps register sets whenever a "thread" comes to a long-latency operation. A thread is just an independent instruction sequence, whether explicitly defined in a single program or part of a completely different process. This is how multi-processing works with advanced operating systems, dispatching threads to different processors. Imagine that future CPUs may take thousands of pipeline cycles for a main memory load. In an SMT machine, rather than have the processor sit idle while waiting for data from memory, it could simply "context switch" to a different register set and run code from a different thread. The more sets of registers, the more simultaneous threads the CPU could switch between. It is rumored that Intel's new Xeon processor based on the P4 core actually has SMT capability built in but not yet enabled.

Integration and a Change in Focus

Most of the recent architectural innovation has been directed at performing better on media-oriented tasks. Instead of just adding instructions for media processing, why not create a media processor that can also handle x86 instructions? A media processor is a class of CPU that is optimized for processing multiple streams of timing-critical media data. The shift in focus away from "standard" x86 processing will become even more likely as CPUs are more tightly integrated with graphics, video, sound, and communications subsystems. It's unlikely that vendors would market their products as x86-compatible media processors rather than just advanced x86 processors, but the shift in design focus is already underway.

Getting Comfortable with Complexity

In all too short a time, even these forthcoming technologies will seem like simple designs. We'll soon find it humorous that we thought a GHz processor was a fast chip. We'll eventually consider it quaint that most computers used only a single processor, since we could be working on machines with hundreds of CPUs on a chip. Someday we might be forced to pore through complicated descriptions of the physics of optical processing. We can easily imagine that, down the road, some people will long for the simple days when our computers sent data over metal traces on chips and circuit boards. In closing, if you've made it all the way through this article, you agree with that enthusiastic person in the back of the room: as PC technology enthusiasts, our hobby will just get better and better. These complex new technologies will open up yet more worlds for our discovery, and we'll be inspired to explore every new detail.

List of References

References and Suggestions for Further Reading:

1. Computer Architecture: A Quantitative Approach, 2nd Edition. Hennessy & Patterson. Morgan Kaufmann Publishers. This is a great book and a collaboration between John Hennessy (the Stanford professor who helped create the MIPS architecture) and Dave Patterson (the Berkeley professor who helped create the SPARC architecture).

2. Pentium Pro and Pentium II System Architecture, 2nd Edition. Tom Shanley. Mindshare, Inc.
This book is slightly out of date, but Tom does a great job of exposing extra details that aren't part of Intel's official documentation.

3. The Microarchitecture of the Pentium 4 Processor. Glenn Hinton, Dave Sager, Mike Upton, Darrell Boggs, Doug Carmean, Alan Kyker, and Patrice Roussel, Intel Corporation. Intel Technology Journal, First Quarter 2001. http://developer.intel.com/technology/itj/q12001/articles/art_2.htm This is a surprisingly detailed look at the Pentium 4 microarchitecture and its design trade-offs.

4. Other Intel links:
o ftp://download.intel.com/pentium4/download/netburstdetail.pdf
o ftp://download.intel.com/pentium4/download/nextgen.pdf
o ftp://download.intel.com/pentium4/download/netburst.pdf

5. AMD Athlon Processor x86 Code Optimization. http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf Appendix A of this document has an excellent walk-through of the Athlon microarchitecture.

6. Other AMD links:
o http://www.amd.com/products/cpg/athlon/techdocs/index.html

7. Other Centaur links:
o http://www.viatech.com
o http://www.centtech.com