Trends in High-Performance Microprocessor Design

by Artur Klauser

Artur Klauser
TU Graz, Telematik, Dipl.-Ing. 1994
Univ. of Colorado at Boulder, Computer Science, Ph.D. 1999
Intel Microprocessor Research Lab (MRL), 1997
DEC Western Research Lab (WRL), 1998
Compaq Alpha Advanced Development (VSSAD), since 1999
Email: Artur.Klauser@computer.org

Abstract
General-purpose microprocessors are the devices that have fueled the personal computer and internet revolution we have experienced over the past couple of decades. A processor is at the heart of every computer system in use today, from tiny autonomous embedded control systems to large-scale, powerful, networked supercomputers. The Compaq Alpha microprocessor line fits into the high-performance end of this spectrum, powering high-end workstations, servers, and supercomputers. Over the years, processor architecture and design have always had to respond to and move in lock-step with technological advances in processor implementation and integrated circuit technology, as well as programming paradigms and instruction sets. In this paper I review some of the trends that have driven the Compaq Alpha processor architecture in the past decade and give an outlook on current and future trends that will have an impact on the architecture of future Alpha processors.

Introduction
A processor is at the heart of every computer system that we build today. Around this processor, you find several other components that make up a computer: memory for instruction and data storage, and input-output devices to communicate with the rest of the world, like disk controllers, graphics cards, keyboard interfaces, network adapters, etc. The purpose of the processor is to execute machine instructions. Thus, the logical operation of a processor is defined by the instruction set architecture (ISA) that it executes. Multiple different processors can implement the same ISA. What differentiates such processors is their processor architecture, which is the way each processor is organized internally in order to achieve the goal of implementing its ISA. By changing the processor architecture, a processor designer can influence the performance characteristics and efficiency with which instructions are executed. Processor architecture also has to respond to implementation constraints imposed on it by the target circuit technology of the chip, in order to achieve a set performance goal. In the rest of this section I will give a short crash course in advanced computer architecture - an overview of the state of the art in processor architecture for general-purpose high-performance microprocessors.

Early Architectures
In early computer architectures, processor operation was very simple and strictly sequential. In the first step, for each instruction the program counter (PC) would be used to send the next instruction address to memory. Potentially several clock cycles later, the instruction is returned from memory. Then the instruction would be decoded. Decoding produces a list of source and destination operands that the instruction operates on, and a specific operation that is to be performed. In the next step, source operands would be accessed and delivered to the arithmetic-logic unit (ALU). The ALU eventually performs the operation that was specified in the instruction and delivers a result. The result is then written back to the destination that was decoded. Finally, the PC would be updated to advance to the next instruction that is to be executed, after which the whole process starts from the beginning for the next instruction. It is easy to see that in this type of design, many operations of the processor are unnecessarily serialized and large portions of the processor sit idle for a majority of the time. For example, the ALU is only busy during the period where the operation is performed on the source operands, but sits idle during the rest of the time it takes to execute an instruction. It is not uncommon for an instruction to consume on the order of 10 clock cycles to execute. Processor architects often quote processor performance in instructions executed per clock cycle (IPC). Therefore, this simple processor architecture would achieve a performance of 0.1 IPC [4].
A driving force behind this design style was scarce resources. The number of transistors available on a processor chip was low (tens of thousands). Much emphasis of the design had to be placed on limiting the number of transistors needed to implement each function of the processor. Efficiency could only be addressed once functionality was satisfied, which left little freedom.
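The strictly sequential execution model can be captured in a few lines of Python. The following is a minimal sketch only; the per-step cycle counts are illustrative assumptions chosen to add up to the 10 cycles per instruction mentioned above, not measurements of any real processor.

# Sketch of strictly sequential instruction execution (assumed cycle counts).
STEP_CYCLES = {
    "fetch": 3,         # send PC to memory, wait for the instruction
    "decode": 1,        # determine operation and operands
    "operand_read": 1,  # deliver source operands to the ALU
    "execute": 2,       # ALU performs the operation
    "writeback": 1,     # store the result to its destination
    "pc_update": 2,     # advance the PC to the next instruction
}

def sequential_ipc(num_instructions):
    # every step of one instruction completes before the next instruction starts
    cycles = num_instructions * sum(STEP_CYCLES.values())
    return num_instructions / cycles

print(sequential_ipc(1000))  # 0.1 IPC with the assumed 10 cycles per instruction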
Pipelining
To achieve higher performance, the various operations involved in executing a single instruction can be separated into different stages of a pipeline and performed in parallel for multiple instructions. Since the pipeline can only advance at the rate of its slowest stage, it is advantageous for all instructions to have approximately the same amount of work to do in each pipeline stage. The architectural shift to a pipelined design goes hand in hand with a shift in the predominant ISAs of the time from complex instruction sets (CISC) to simpler, reduced instruction sets (RISC). In RISC instruction sets, each instruction only performs a simple operation that can be executed in a short pipeline stage. All instructions have a similar amount of work to perform, supporting a shift to pipelined architectures.
A typical pipeline of early pipelined designs would have the following stages: a fetch stage to get the instruction to be executed; a decode stage to determine operands and operation; a register fetch stage to access the source operands; an execute stage to perform the operation; and finally a writeback stage where the result is stored into its destination. Several other restrictions support short, well-balanced pipelines, such as restricting memory access to a few specific memory instructions and allowing sources and destinations of other instructions to come only from the register file [12]. In the best case, a pipelined architecture can start to process a new instruction every cycle, for a throughput of 1.0 IPC.
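A quick Python sketch illustrates why an ideal pipeline approaches 1.0 IPC: once the five stages listed above are filled, one instruction completes every cycle. The stage names follow the text; everything else, including the optional stall parameter, is an illustrative assumption.

# Ideal pipeline throughput: fill latency plus one cycle per instruction.
STAGES = ["fetch", "decode", "regfetch", "execute", "writeback"]

def pipelined_cycles(num_instructions, bubbles_per_instruction=0.0):
    fill_latency = len(STAGES) - 1
    return fill_latency + num_instructions * (1 + bubbles_per_instruction)

n = 10_000
print(n / pipelined_cycles(n))  # ~1.0 IPC in the ideal, stall-free case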
With the advent of pipelining, however, comes a problem for control-flow. Control-flow instructions are those instructions that can affect the PC of the next instruction; they change the path the processor takes through the program. It is easy to see that this is fairly trivial in non-pipelined architectures, since the update of the PC is the last operation that is performed when an instruction is executed. It can take into consideration the outcome of the computation of the instruction. If the instruction branches to a different location depending on, e.g., whether the result of the ALU was zero or non-zero, the PC update logic has this information readily available in a non-pipelined processor. Now consider a pipelined processor. The instruction following the branch needs to be fetched, i.e. its address must be known, when the branch instruction is in the second pipeline stage (decode stage). However, at this point the outcome of the branch is not yet determined. Not until two stages later, in the execute stage, will the processor know whether the branch should be taken or not. This leaves the processor architect with the problem that it is not clear which instruction to fetch for the duration of three cycles after a branch instruction was fetched. Not fetching anything, however, largely decreases the performance of the processor. In integer-dominated code, approximately every 5th instruction is a branch. If the processor held off on instruction fetch for 3 cycles on every 5th instruction, its throughput would be reduced to 5/8 = 0.625 IPC, a reduction of 37.5% from its ideal.

Branch Prediction
Different architectures have solved the control-flow problem in different ways. One way is to introduce branch delay slots into the ISA. These are instructions after the branch that are always executed, independent of the branch outcome. With respect to their control-flow, these instructions behave as if they were located before the branch. Since the processor now has some time between fetching the branch and changing the PC, the branch can advance to the execute stage and no or fewer bubble cycles have to be inserted into the pipeline. Another approach is to use branch prediction: guess which way the branch will go and speculatively proceed down this predicted path. Since many branches turn out to be highly predictable, branch prediction is very successful in preventing pipeline bubbles after branch instructions. Dynamic branch predictors are generally based on the observation that the history of previous branch behavior is a good indicator of future branch behavior. If a branch was frequently taken in the past, it has a high likelihood of being taken in the future as well. However, some branches will be mispredicted. In that case, the misprediction is detected when the branch instruction is in the execute stage. The correct (computed) outcome of the branch is compared against the predicted outcome and misprediction recovery is initiated if the two mismatch. On misprediction recovery, all instructions in the pipeline after the branch are killed and the fetch stage is redirected to fetch instructions from the correct target address. In the case of a misprediction, the pipeline thus incurs a 3-cycle misfetch penalty during which instructions from the wrong path after the mispredicted branch were fetched. With a 90% accurate branch predictor, only one in 10 predicted branches is mispredicted and incurs the misfetch bubble. To continue our example from above, with 1 in 5 instructions being a branch and a 3-cycle bubble after mispredicted branches, we have an average performance of 50/53 = 0.943 IPC. Note that the performance is computed in terms of commit IPC, i.e. instructions committed per clock cycle, since only those instructions contribute to the computation, whereas killed instructions are overhead.
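The IPC figures quoted above follow from a simple analytical model, redone here in Python. The branch frequency, predictor accuracy, and 3-cycle penalty are exactly the assumptions stated in the text.

# Commit IPC of a scalar pipeline with branch bubbles.
def commit_ipc(branch_fraction, predictor_accuracy, bubble_cycles):
    # cycles per committed instruction: 1 issue cycle plus the average number
    # of bubble cycles caused by (mispredicted) branches
    mispredict_rate = branch_fraction * (1.0 - predictor_accuracy)
    return 1.0 / (1.0 + mispredict_rate * bubble_cycles)

print(commit_ipc(0.2, 0.0, 3))  # no prediction: 0.625 IPC
print(commit_ipc(0.2, 0.9, 3))  # 90% accurate predictor: ~0.943 IPC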
With pipelined architectures and branch prediction, the best possible performance is 1.0 IPC, or one useful instruction completed every clock cycle. To break this barrier, a new idea is needed.

Superscalar Execution
Pipelined architectures as described in the previous section are also called scalar architectures - they operate on one instruction per cycle in every pipeline stage. To increase performance further, superscalar architectures are used. In an n-way superscalar architecture, each pipeline stage operates on n instructions at the same time. Since such a processor fetches n instructions each cycle and can commit (up to) n instructions each cycle, the peak performance of the processor is n IPC. However, as we saw with the pipelined scalar design, mispredicted branches will decrease the average performance of the processor somewhat, so more accurate branch predictors are necessary to achieve good average performance. Besides managing control-flow obstacles with branch prediction, structural and data-flow hazards become a problem in wide superscalar processor designs. As described so far, instructions in a pipeline stay in order with respect to each other, i.e. an instruction cannot overtake its predecessors in the pipeline. Thus, this design style is called an in-order architecture. In-order architectures can experience pipeline stalls due to long execution latencies of instructions. For example, if a load instruction is held up in the pipeline because it has missed in the cache and has to get its data from memory, none of the following instructions can proceed either, even though many of those instructions might not be dependent on the load. Accessing data from memory can take from many tens to a couple hundred cycles, during which time all instructions in the pipeline are stalled. This is clearly an area where we could do better for the sake of performance.

Out-of-Order Execution and Register Renaming
To avoid stalling independent instructions unnecessarily behind other long-latency instructions, out-of-order execution comes to the rescue. In out-of-order execution architectures, sometimes also referred to as dynamically scheduled, instructions can overtake each other in the pipeline, i.e. they do not necessarily execute in program order. Instructions are still fetched in program order but are eventually delivered into a pool of instructions in the processor execution core. From this pool, instructions are taken in data-flow order, respecting only the availability of their source operands to determine if they are ready for execution. After execution, instructions are reordered back into program order and retire in this order.
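The data-flow selection just described can be sketched in a few lines of Python. The instructions, register names, dependences, and latencies below are made up purely for illustration; a real scheduler also enforces issue-width limits, functional unit constraints, and speculation, all of which this toy model omits.

# Toy data-flow scheduler: pick any instruction whose sources are ready.
# Each instruction: (name, source registers, destination register, latency).
program = [
    ("load r1 <- [a]",    [],           "r1", 100),  # cache miss: long latency
    ("add  r2 <- r1 + 1", ["r1"],       "r2",   1),  # depends on the load
    ("mul  r3 <- r4 * r5",["r4", "r5"], "r3",   1),  # independent of the load
    ("sub  r6 <- r3 - r4",["r3"],       "r6",   1),
]
ready_regs = {"r4", "r5"}   # register values available at the start
done_at = {}                # instruction name -> completion cycle
waiting = list(program)
cycle = 0
while waiting:
    cycle += 1
    # results that become visible this cycle
    for name, _, dst, _ in program:
        if done_at.get(name) == cycle:
            ready_regs.add(dst)
    # issue every instruction whose sources are ready (data-flow order)
    for instr in list(waiting):
        name, srcs, dst, lat = instr
        if all(s in ready_regs for s in srcs):
            done_at[name] = cycle + lat
            waiting.remove(instr)

print(done_at)  # the independent mul/sub pair finishes long before the dependent add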
With out-of-order execution comes the concept of register renaming. Register renaming is used to give each result that is produced by an instruction a unique new location, to avoid confusion between multiple results for the same register that might be live at the same time. Since out-of-order execution processors can change the execution order of instructions, two instructions I1 and I2 which produce results that have to be written into the same register R1 can be executed in the wrong order, i.e. they incur a write-after-write conflict. To maintain the illusion of sequential execution order for all instructions in the program, each instruction I1 and I2 writes its result into a new location, P1 and P2 respectively, such that the results of both instructions are available at the same time. All source operands R1 of instructions that depend on either the result of I1 or I2 are also renamed to use the correct copy of the result, P1 or P2 respectively. Out-of-order execution processors have a logical register set (Rx), which is the one referred to in the ISA. These logical registers are renamed to a much larger set of physical registers (Px), which are actually present in the implementation of the register file to support multiple versions of each logical register. Renaming is performed in a pipeline stage after instruction decoding and before instructions enter the out-of-order instruction pool. Renaming takes care of all write-after-write (WAW) and write-after-read (WAR) conflicts introduced by out-of-order execution [14].
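The core of register renaming fits in a small Python sketch. The rename table maps each logical register (R1, R2, ...) to the physical register (P...) holding its most recent value; every new result gets a fresh physical register, so the WAW conflict between I1 and I2 above disappears. Free-list management, commit, and misprediction recovery are omitted, and all names are hypothetical.

# Minimal register renaming: logical register -> current physical register.
rename_table = {}
next_phys = iter(f"P{i}" for i in range(1, 1000))

def rename(instr):
    op, dst, srcs = instr
    # sources read whichever physical register currently holds the logical value
    phys_srcs = [rename_table[s] for s in srcs]
    # the destination gets a brand-new physical register
    phys_dst = next(next_phys)
    rename_table[dst] = phys_dst
    return (op, phys_dst, phys_srcs)

rename_table["R2"] = next(next_phys)   # assume R2 already holds a value (P1)
print(rename(("I1", "R1", ["R2"])))    # ('I1', 'P2', ['P1'])
print(rename(("I2", "R1", ["R2"])))    # ('I2', 'P3', ['P1'])  no WAW conflict
print(rename(("I3", "R3", ["R1"])))    # ('I3', 'P4', ['P3'])  reads I2's copy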
Instructions write their results to the architectural machine state, corresponding to logical registers, only when they commit. This prevents the architectural state from becoming corrupted by instructions that have executed but will be killed later due to branch mispredictions or other reasons.
With out-of-order execution, architectures can better leverage the instruction level parallelism (ILP) that is inherent in the program being executed. Long-latency instructions, like floating point divides, no longer block progress for following instructions, and even very long latency operations like loading data from the second level cache or even memory can be sufficiently overlapped with the execution of following operations. However, since the number of instructions in flight, i.e. the number of instructions fetched but not yet committed, can be around a hundred in such an architecture, very accurate branch prediction is crucially important to achieve good performance.
Pipelined superscalar out-of-order architectures are the current state of the art in microprocessor design. Examples of such processors include the Alpha 21264 (EV6) [16], the Pentium 4 (Willamette) [9], the MIPS R10000 [20], and the HP PA-8000 [13].

Previous Trends - A History of Alpha Processors

Birth of the Alpha ISA
In the late 1980s, design teams that had been working on the DEC VAX line of processors had started to investigate possible successor architectures for their product line. As in many other computer companies at the time, it was apparent that just continuing with the implementation of the complex VAX instruction set would not provide the long-term performance improvements that were sought. The complex VAX instruction set had acquired a lot of baggage over the years that made implementations complicated and took a heavy toll on the clock cycle time achievable in its implementations. At this point the decision was made to go with a new, reduced, streamlined instruction set which would allow faster implementations and was to fuel a new processor line for the next couple of decades.
The new Alpha architecture and instruction set took into consideration studies on RISC instruction sets of the time, implementation complexities, and the experience from the VAX processor line [4] and the earlier in-house RISC designs SAFE and PRISM, as well as the designs of the Titan and MultiTitan [15] processors undertaken at DEC's Western Research Lab (WRL).
The Alpha ISA was formulated as the industry's first 64-bit instruction set, with 64-bit addressing, 32 64-bit integer registers, and 32 64-bit floating point registers. It is a pure RISC instruction set with fixed-length 32-bit instructions. It has dedicated memory load/store instructions with simple addressing modes. All computational instructions operate on sources and destinations in the register file. Integer instructions support data types up to 64 bit; floating point instructions support 32-bit and 64-bit IEEE as well as VAX floating point formats. A couple of instructions were added to support easy recompilation of VAX applications onto the new Alpha ISA.
The Alpha ISA avoids a few pitfalls made by competing RISC instruction sets, which had let first-implementation artifacts sneak into the ISA definition. The Alpha ISA does not provide for branch delay slots - an implementation can use branch prediction to speed up control-flow processing. The ISA does not contain any special complex operations - implementations count on multiple-issue superscalar operation to process many instructions at a time. The ISA only defines a weak memory consistency model, with explicit barrier instructions to be used if strict ordering is necessary, allowing more freedom in the memory system implementation. The ISA only guarantees imprecise arithmetic exceptions, with explicit trap barrier instructions to be used when ordering has to be guaranteed, which also allows more implementation freedom. Additionally, the Alpha ISA adheres to typical RISC traits such as the lack of global state and mode bits.
The new Alpha ISA had been formulated to keep its implementation simple yet give it a large degree of flexibility in implementation choices. Table 1 gives an overview of the variety of architecture and technology implementation parameters that the Alpha ISA has been and is being implemented on.
Tab.1: Alpha Implementation Overview.

Early Design Prototype - EV3
To prove the viability of the new architecture and to allow early system debugging and software development, the first in-house design was targeted at the circuit technology of the time, CMOS-3. Since the Alpha architecture was developed as an eventual replacement for the VAX architecture, the first design was termed EV3, EV for Extended VAX and 3 coming from its implementation technology, CMOS-3. This design was a simple 2-way superscalar in-order processor; however, due to time and area constraints it had only a simple memory system and did not yet implement a floating point unit. Floating point instructions were trapped as unimplemented opcodes and emulated in software. This initial design allowed architects and implementers to gain some experience with the new ISA and provided a convenient vehicle for efficient software development.
1st Generation - EV4 (Alpha 21064)
After the successful initial design study of EV3, proving the feasibility of the Alpha ISA and getting some experience in its implementation, the first for-sale architectural generation, the EV4 processor, was designed and built and was introduced to the public in 1992. This first generation was a 2-way superscalar in-order processor. With its 200 MHz operation in a 0.75 µm CMOS process technology, it was by far the fastest processor on the market. A lot of effort in the architecture design and implementation had gone into achieving the goal of a very high operating frequency for this process technology. This can mainly be attributed to the fact that both the instruction set and the architecture were kept simple, and that the chip was implemented in a full custom design with great attention to speed paths. The 1.68 million transistors of this first-generation implementation include an on-chip 8 kB first level instruction cache and an 8 kB first level data cache, for an on-chip cache bandwidth of 1.6 GB/s. The processor achieves an off-chip cache bandwidth of 600 MB/s and a memory bandwidth of 150 MB/s.
With so much attention paid to the clock speed of the processor, its architecture had to be kept simple so it would not get in the way of speed. The 2-way in-order design was the compromise that resulted. Nevertheless, it was clear that architectural advances could be leveraged in future generations, together with technological advances, to push ahead in the high-performance processor arena.

2nd Generation - EV5 (Alpha 21164)
The next Alpha architecture [6] was targeted at a 0.5 µm CMOS implementation technology that would allow around 9 million transistors to be integrated onto the die. With that many transistors available there was plenty of space for large on-chip caches as well as a more complicated processor core. The EV5 processor was designed as a 4-way superscalar in-order core with separate 8 kB first level instruction and data caches and a unified on-chip 96 kB second level cache. EV5 supports two concurrent read accesses to its first level data cache via a duplicated cache, providing a bandwidth of 4.8 GB/s to the first level cache. The majority of the increased transistor budget went into the caches, but the core was also widened from 2-way to 4-way superscalar execution. In order to reach the target clock frequency of 300 MHz at its introduction in 1995, the decision was made to stay with in-order execution for this generation. The processor includes control for a third level off-chip cache of up to 64 MB and can support a cache bandwidth of 1.2 GB/s and a memory bandwidth of 400 MB/s in order to keep its execution engine fed with data. Again, like in the previous generation, a lot of implementation attention was paid to processor speed, and the EV5 was again the fastest processor on the market when it was introduced [7].
Internal architectural studies with wider superscalar in-order processors had shown that in-order execution would find insufficient ILP in most programs to be pushed much beyond a 4-way superscalar design. Additional studies had revealed that a VLIW organization, where more scheduling complexity is pushed off from the architecture into the compiler, did not result in sufficient simplifications of the processor architecture to warrant a radical break in instruction set architecture. Furthermore, many of the VLIW compiler and instruction scheduling enhancements could be used even for a normal RISC architecture. At this point, the decision was made to take the next generation processor out-of-order to leverage the additional ILP found in many programs.

3rd Generation - EV6 (Alpha 21264)
The EV6 microprocessor [16, 17] was finally introduced in 1998 on a 0.35 µm CMOS process technology and with an operating frequency of 600 MHz. It is designed as a 4-way superscalar out-of-order architecture (see figure 1).
Fig.1: Alpha 21264 (EV6) processor architecture (pipeline stages 0-6: fetch, map, queue, register read, execute, data cache; 80 in-flight instructions plus 32 loads and 32 stores; 4 instructions fetched per cycle).
Now, although it can fetch only 4 instructions per cycle, its execution core is capable of executing 6 instructions per cycle (4 INT and 2 FP) and it can commit 8 instructions per cycle. This widening of the pipeline towards the end helps to catch up with instructions that delay the out-of-order core of the processor by incurring long operation latencies, like memory accesses. Its approximately 15 million transistors encompass a separate 64 kB first level instruction cache and a 64 kB first level data cache, for an on-chip cache bandwidth of 9.6 GB/s, an off-chip cache bandwidth of 6.4 GB/s, and a memory bandwidth of 3.2 GB/s. Like the previous generation, EV6 supports two concurrent accesses to the first level data cache, this time by a double-pumped cache running at twice the processor frequency. EV6 also supports up to 16 outstanding off-chip memory references in its out-of-order design. The large bandwidth improvement in the memory system proved to be pivotal to successfully feeding the execution core, in particular for demanding server-type applications like large database workloads, but also in the high-performance technical computing field.
Besides the advances in the memory system and taking the processor core architecture to an out-of-order design, some initial attempts were made at integrating more functionality onto the chip. The 21264 integrates the controller for an off-chip second level cache, using separate data busses for accessing the off-chip cache and memory systems in order to maximize memory throughput. Higher integration reduces the component count on the processor board and speeds up cache access, since fewer slow chip crossings are necessary to access the data. We will see that this is a trend that continues in coming generations.

Current Trends - System on a Chip
Previous Alpha processor implementations have focused mainly on improving processor core features for improved performance, like wider issue widths and out-of-order execution. With the third generation processor EV6, an increasing amount of attention has been given to the memory system, greatly improving its performance and thus lifting processor performance for memory intensive application areas like commercial and high-performance technical workloads. However, circuit technology has also improved in the meantime. Besides the increased device speeds that have resulted from smaller feature sizes, current generation process technologies also support a much larger number of transistors on the same size die. This leads to the potential for integrating more and more computer components on the processor die that used to be located in separate chips on the motherboard; in other words, we are at the point where it becomes possible to integrate whole systems on a chip (SoC).

4th Generation - EV7 (Alpha 21364)
The fourth generation Alpha processor [5], being implemented in a 0.18 µm CMOS process technology, allows for 152 million transistors on the die. With that many transistors available it starts to become feasible to use a SoC design for a high-performance microprocessor. Previously, SoC designs were mostly a domain of the embedded processor market, which could be realized with a much smaller transistor count.
Fig.2: Alpha 21364 (EV7) processor architecture (21264 core, 64 kB instruction and data caches with 16 L1 miss and 16 L1 victim buffers, 1.75 MB 7-set L2 cache with 16 L2 victim buffers, memory controller with Rambus interfaces, network interface and router with N/S/E/W and I/O ports).
The EV7 processor architecture leverages the EV6 processor core architecture and integrates a number of additional features on the periphery of the processor core. Following are the main features of this generation (see figure 2):
EV6 Processor Core: In order to leverage the work done on the EV6 out-of-order execution processor design, EV7 reuses the EV6 processor core with some minor enhancements, like supporting up to 48 outstanding memory requests, at an initial frequency of 1.2 GHz.
Integrated Second Level Cache: 138 of the 152 million transistors will be spent on RAM, integrating a 1.75 MB, 7-way set associative L2 cache onto the processor die. The cache is ECC protected and provides a bandwidth of 19.2 GB/s to the processor core and a 10 ns load-to-use latency.
Integrated Memory Controller: While previous processor generations were relying on an external support-logic chip set for controlling the computer's memory, EV7 integrates two 800 MHz Direct Rambus controllers onto the die. Each controller connects to 4 RDRAM channels. Like the L2 cache, memory is ECC protected. Integration of the memory controllers onto the processor die will boost the memory bandwidth to 12.8 GB/s with a 30 ns access latency for open memory pages. These enhancements in the memory system are major contributors to performance gains for today's memory intensive commercial workloads.
Integrated Network Interface: In order to make it easy to build large multiprocessor systems, EV7 contains an on-chip interface and router for a direct processor-to-processor interconnect. The interface consists of 4 32-bit ports with 6.4 GB/s bandwidth each and a 15 ns processor-to-processor latency. The interconnect forms a 2-D torus network that supports deadlock-free adaptive routing in the integrated interface. The network interface autonomously handles all protocol transactions necessary to build a cache-coherent non-uniform memory access (ccNUMA) shared memory multiprocessor (SMP) system. An additional 5th port on the processor provides a similar interface to support an I/O channel with 6.4 GB/s bandwidth. Each network router provides buffer space for over 300 packets.
Fig.3: Fraction of execution time spent in various system components (issue, mispredict, trap, cache, memory) for a TPC-C workload on EV5/600 MHz, EV6/575 MHz, EV6/1 GHz, and EV7/1 GHz.
To better understand what has driven the architectural decisions for EV6 and EV7, it helps to look at the time that various system components contribute to the execution of a commercial benchmark like TPC-C (see figure 3). On an EV5 architecture, approximately 1/3 of the time is spent in the memory system, 1/3 of the time is spent in the caches, and 1/3 of the time is spent on instruction execution in the core. With the introduction of the EV6 architecture and its out-of-order execution engine, the times spent in those three components have scaled roughly equally. However, scaling the same processor core to a higher clock frequency does not provide much additional benefit, since it mainly attacks the core execution time but does little to reduce the time spent in the memory system and caches. In order to attack this problem, it was necessary to allow for tighter integration of the caches and memory system with the processor core. With this higher integration, memory, cache, and core execution times are more evenly distributed again.
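A quick back-of-the-envelope check, in Python, of the argument above: with execution time split roughly into thirds between core, cache, and memory, speeding up only the core yields limited overall benefit. The fractions and speedup factors are illustrative assumptions, not measured TPC-C data.

# Amdahl's law generalized to several time components.
def overall_speedup(fractions, speedups):
    new_time = sum(f / s for f, s in zip(fractions, speedups))
    return 1.0 / new_time

parts = (1/3, 1/3, 1/3)   # core, cache, memory
print(overall_speedup(parts, (2.0, 1.0, 1.0)))  # 2x faster core only: ~1.2x overall
print(overall_speedup(parts, (2.0, 1.5, 2.0)))  # integration also helps cache/memory: ~1.8x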
Larger server systems are generally built as multiprocessor systems. For example, the EV6-powered Wildfire [8] server line supports up to 32 CPUs in a ccNUMA SMP architecture connected by a 2-level hierarchy of switches. To build such a system, however, a number of support chips are necessary for the network interfaces, routers, and coherency protocol engines. With more chip crossings on its way, latency for data packets becomes a problem [2]. On Wildfire, non-pipelined remote memory access latency is on the order of a thousand nanoseconds, much slower than local memory access, which only takes a few hundred nanoseconds. To make large SMP systems more efficient, tighter integration of the memory system and the networking interface to other processors (i.e. remote memory) is highly desirable. EV7 has the networking interface, router, and protocol engines pulled onto the processor chip in order to allow lower-latency remote accesses and a larger number of CPUs, reaching into the hundreds of processors.

Future Trends - Latency, Complexity, Clustering, and Parallelism
Perceived and expected trends over the next few years are starting to influence architectural decisions that are being made for future processor generations. In this section, I outline some of these trends and their influence on processor architectures.

5th Generation - EV8 (Alpha 21464)
The fifth generation Alpha microprocessor implementation is well underway in its architecture design and implementation. Analysis of typical Alpha workloads has shown that many programs have sufficient ILP to support scaling an out-of-order execution core to an issue width greater than 4. However, performance improvements from wider-issue execution cores start to level off. In order to find sufficient parallelism to warrant a wider issue core, parallelism has to be sought beyond just ILP. Looking at server workloads it quickly becomes apparent that there are several higher levels of parallelism above the instruction level parallelism of a single program. Typically, a server runs many different programs or processes at the same time. In many machines, this multi-process parallelism is harnessed with multiprocessor systems, each processor running one process at a time. However, many programs are also multi-threaded or can be converted to multi-threaded programs relatively easily. Multiple threads in a program all run in the same address space and cooperate more closely with each other than multiple programs typically do. Thus, there exists potential to utilize this thread level parallelism (TLP) together with the ILP of each thread within a single processor, and to utilize process level parallelism across processors in a multiprocessor system. Note, however, that the distinction between thread and process-level parallelism is a subtle one, since processes can also share address space in most operating systems and thus almost behave like threads in many respects.
Simultaneous multithreading (SMT) [18] is an architecture design that can utilize thread level parallelism (TLP). An SMT architecture is an extension of the superscalar out-of-order execution architecture. In a normal superscalar out-of-order execution architecture, all instructions that are fetched and executed at a time come from the same thread. Since wide issue processors do not tend to find sufficient ILP at all times, a large fraction of the available issue slots cannot be occupied by instructions from one thread and this potential execution bandwidth stays unused.
In an SMT architecture, the processor fetches instructions from multiple threads at the same time. Instructions from different threads go through decoding and register renaming separately. They are then delivered into the common pool of instructions that need to be executed. As in a conventional out-of-order architecture, instructions from the shared pool are picked in data-flow order and are executed irrespective of which thread they belong to. Since instructions from different threads are independent of each other, more instructions in the common issue pool are data-ready at any given time, thus allowing for better use of the existing functional units and issue bandwidth of the superscalar execution core. Finally, after execution, instructions retire in program order within their respective thread. It is easy to see that this architectural organization for SMT largely leverages the design that already exists in a conventional superscalar out-of-order execution processor.
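The issue-slot argument can be made concrete with a toy Python model: an 8-wide core fed by a single thread leaves many slots empty, while the same core filling its slots from 4 threads achieves much higher utilization. The per-thread ILP distribution assumed here (0 to 4 data-ready instructions per cycle) is an arbitrary choice for illustration only.

# Toy comparison of issue-slot utilization for 1 vs. 4 threads.
import random

ISSUE_WIDTH = 8
CYCLES = 100_000

def utilization(num_threads, seed=42):
    rng = random.Random(seed)
    used = 0
    for _ in range(CYCLES):
        slots = ISSUE_WIDTH
        for _ in range(num_threads):
            ready = rng.randint(0, 4)   # assumed data-ready instructions per thread
            take = min(ready, slots)
            used += take
            slots -= take
    return used / (CYCLES * ISSUE_WIDTH)

print(f"1 thread : {utilization(1):.0%} of issue slots used")   # roughly 25%
print(f"4 threads: {utilization(4):.0%} of issue slots used")   # roughly 85%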
EV8, the fifth generation Alpha architecture, is being designed as a 4-way SMT, 8-way superscalar out-of-order execution processor. It supports 4 concurrent contexts in its 4 thread processing units (TPUs). Note that in this context the TPUs are not required to run in the same address space. TPUs can be running multiple processes or threads or any mixture thereof. EV8 can sustain an aggregate execution bandwidth of 8 instructions per cycle. With this design, a single EV8 processor can harness ILP and TLP, whereas multiple EV8s in a multiprocessor configuration can be used to leverage multi-program parallelism. Like the previous generation, EV8 with its total of 250 million transistors is designed as a system on a chip, with a large second level cache, memory controller, and direct processor-to-processor network interface and router on the die, for glue-less ccNUMA SMP systems with up to 512 processors. It is designed to support higher cache and memory bandwidth than the previous generation to push performance ahead for memory intensive workloads. The initial implementation targets a 0.125 µm CMOS technology for a clock frequency of around 2.0 GHz.

Influence of Multi-threading and Design Complexity
Wide issue superscalar out-of-order execution processors like the EV8 are very complex architectures that require a large design team and a long design time. They reward the architect with superior performance in a wide range of single-threaded as well as multi-threaded applications. However, alternatives arise in some application areas that do not need this level of generality. For example, database applications are highly parallel at a thread or process level, but provide little ILP within a thread. In addition, this workload is very memory intensive and spends a very large fraction of its time in memory stalls. For these reasons, wide issue superscalar architectures have little advantage over simpler architectures. Important performance contributors are a good memory system and the ability to support a large number of concurrent memory accesses, which can be achieved for example with multiprocessing. Design complexity can be largely reduced if alternative architectures are used for such special application areas. An alternative that fits well with the requirements of database workloads is the chip multiprocessor (CMP), as implemented by IBM in their Power4 architecture [3] and in the Piranha research prototype [1] of Compaq's Western Research Lab.
Piranha integrates 8 processors on a single die. Each processor is a simple single-issue in-order implementation of the Alpha ISA with blocking private first level data and instruction caches. All processors on the die share a large on-chip second level cache, the memory system controller, and the integrated network interface that connects the chip to other chips, similar to the networks of the EV7 and EV8 architectures. Since each processor is relatively simple, design time and complexity for one such processor are largely reduced. The architecture relies on the fact that each processor is reused many times on the die to achieve a large aggregate execution bandwidth, without much additional design complexity. The complexity of designing the second level cache, memory system controller, and network interface, however, still has to be considered as well. This CMP achieves its performance by having 8 simple processor cores together with a tightly integrated memory system. It does not support high ILP, but does support thread and process-level parallelism very well, matching its targeted application domain.
Here we have seen how application domain restrictions that favor certain types of parallelism over others can drive the architecture design in a specific direction. Besides influences from the application areas that high-performance microprocessors are used for, there is also a strong influence from the process technology used for their implementation. Processor architects have to take into consideration the effects of process technology on the implementation of a processor architecture, in order to achieve the desired result in terms of performance, cost, power dissipation, etc. The following section describes how process technology changes the way processor architectures are designed.

Technological Influences on Architecture
Over the last decade, process technology has made steady, fast-paced advances which have largely helped to attain the performance improvements we have grown used to. In the Alpha processor line, process technology was at 0.75 µm for EV4 in 1992 and will be at 0.125 µm for EV8. Integration has shot up from 1.68 million transistors in EV4 to 250 million in EV8. Clock frequency has grown from 200 MHz in EV4 to about 2 GHz for EV8. However, all the performance enhancing effects of process technology also bring problems along.

Current Sourcing and Power Dissipation
EV4 had a power dissipation of 30 W at a supply voltage of 3.3 V. EV7 has a power dissipation of 125 W. Assuming EV8 will have the same power dissipation at a supply voltage of only 1.2 V, its current sourcing demands will increase significantly. If you do the math, EV4 draws an average current of 9 A, whereas EV8 is going to draw a current of more than 100 A. This current has to get onto the chip while incurring minimal IR drop on the supply voltage, especially since VDD has also gotten smaller. Dynamic current requirements are getting worse as well. EV4 has a total effective switching capacitance of 12.5 nF at 200 MHz; EV6 has an effective switching capacitance of 34 nF at 600 MHz. EV8 will be running at close to 2 GHz on a yet larger die with a more complicated architecture, prone to increase its switching capacitance even further. To help with the di/dt problem, EV6 has a total on-chip decoupling capacitance of 320 nF, taking up approximately 15-20% of the die area, plus an additional 1 µF wire-bond attached chip capacitor with 160 VDD/VSS bond wires to reduce inductive coupling.
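The current figures above follow directly from I = P / V, and the switching capacitance can be related to dynamic power via P = C * V^2 * f. The short Python sketch below redoes that arithmetic; the EV8 power and voltage are the assumptions stated in the text, not published specifications.

# Supply current and dynamic power estimates from the numbers in the text.
def supply_current(power_w, vdd_v):
    return power_w / vdd_v                     # average current in amperes

def dynamic_power(c_switch_f, vdd_v, freq_hz):
    return c_switch_f * vdd_v**2 * freq_hz     # P = C * V^2 * f

print(supply_current(30, 3.3))                # EV4: ~9 A
print(supply_current(125, 1.2))               # assumed EV8: ~104 A
print(dynamic_power(12.5e-9, 3.3, 200e6))     # EV4: ~27 W, close to its 30 W total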
Thermal issues also have to be considered. Sufficient cooling for an EV4 die could be sustained in a package with 1.1 K/W thermal resistance. For an EV6 or EV7 die, a package with a thermal resistance of 0.3 K/W has to be used to guarantee sufficient heat exchange from the die to the package surface and prevent damaging the die. However, packages with lower thermal resistance are a lot more expensive. Currently, 125-150 W dies are the absolute maximum power dissipation that can be sustained with passively cooled packaging technology. EV7 with its 125 W at 1.2 GHz of operation is currently at the limit of passive air-cooled packaging technology. Besides making sure the heat can be brought off the die reliably, high performance processors also face the problem of thermal variation across the die, i.e. avoiding hot spots. A chip can tolerate approximately a 20 K temperature differential across its surface. In order to reduce mechanical stress on the die due to thermal expansion, the power grid on EV6 is actually laid out as two meshes, one for VDD and one for VSS, taking up two complete metal layers. Covering a whole metal layer with a solid sheet was considered to produce too much mechanical stress on the chip, since the thermal expansion of the metal, silicon, and insulation layers is not exactly the same [11].
So what does this have to do with processor architecture? Well, in order to keep power dissipation in bounds, architectural and implementation tricks can be used. Predominantly what is employed here is conditional clocking: shutting down unused logic blocks on the chip for short periods, i.e. a few clock cycles, before they start to get used again. This reduces both the power dissipated by the clock distribution network in the conditionally clocked area as well as the power of the temporarily halted circuit, since it does not experience any transitions. On EV4 through EV7, clock distribution is responsible for approximately 30-40% of the total chip power dissipation. On EV6, the clock grid is partitioned into global and local clock distribution with a total load of 3 nF and 6 nF respectively. Global clock distribution is unconditional, whereas local clocks can be either unconditional or conditional [10]. EV7, since it is reusing the EV6 core design, has kept the EV6 clocking scheme for the core and has added another level of clocking for the other components, split into three additional clock domains. These clock domains are synchronized to the core clock domain with separate delay-locked loops (DLLs), so that global clock skew remains below 60 ps across its almost 4 cm² die [19].
In past and current designs, dynamic power, i.e. the power associated with switching activity, is the predominant source of power dissipation. However, both VDD and Vt (the threshold voltage) have been constantly reduced over the past decade. This results in a reduction of VDD-Vt, the voltage that is responsible for turning off a device. Since devices are no longer turned off „as hard“, leakage or off-current is increasing, giving rise to higher static power dissipation in newer process technologies.

Wire Scaling
Process geometries have steadily been shrinking over the past decade, from 0.75 µm in the implementation of EV4 to now 0.18 µm for EV7. This allows the integration of more transistors onto a die. However, processes do not shrink homogeneously. Whereas the properties of transistors behave favorably when shrunk, wires start to become a problem.
The important property of a wire in this respect is its RC delay. Under sub-micron shrinking rules, the RC delay of a wire usually increases slightly between process generations. For constant drive strength, this means that wires become slower than they used to be in the previous generation. Since transistors become faster at the same time, wires become much slower relative to transistor speeds, and their impact gets more and more noticeable with every process generation. If wires are „local“ to logic blocks, the effect of bad wire scaling is typically mitigated by the fact that the wire also shrinks in absolute length. However, going to newer designs also generally means more complexity. Although architects try to keep complexity localized, some communication will still be required across long distances on the chip. These distances might have the same absolute length as in previous process generations, but those „global“ wires are now slower than in previous process generations. This results in more and more paths in the processor becoming wire dominated, i.e. the majority of a signal's time can be attributed to its transmission delay incurred on a wire, rather than to propagation delay through logic gates. And the more wire dominated a path is, the worse it scales in future process shrinks, since the wire gets relatively slower with respect to transistor speeds. Switching from aluminum to copper (Cu) interconnects helps wires by reducing the wire's R. Using low-k dielectrics helps a wire's C. However, these process changes are one-time solutions, i.e. the one process shrink where the switch to Cu occurs is easier, but the following shrinks exhibit the same problem again. Alpha processor implementations have switched to Cu interconnect with the introduction of EV7. This has eased the reuse of the EV6 core in EV7's 0.18 µm process, a core that was originally designed two process generations earlier in 0.35 µm design rules.

Clustering
An architectural tool to cope with the challenges that smaller process geometries impose on a design is the increased use of clustering. Clustering serves to reduce local design complexity. In a clustered design, a tradeoff is made between local (intra-cluster) complexity and latency vs. global (inter-cluster) latency. Building clustered designs means breaking up the logical components of a design into smaller pieces - clusters. Each cluster is sufficiently small that wire delay problems within the cluster do not yet dominate the design. This allows for small intra-cluster latencies and supports a high clock frequency in each cluster. On the other hand, small clusters mean that logic complexity within a cluster is limited and decisions can only be made based on the local availability of the corresponding input data, i.e. each cluster is not as „smart“ as a monolithic design could be. Whenever information has to cross cluster boundaries, additional transmission latency will be incurred and needs to be accounted for.
Fig.4: Pipeline organizations: pipelining, super-pipelining, and clustering.
Clustered design can be understood as an extension of the idea of pipelining, as depicted in Fig. 4. With the introduction of pipelining, the operation of a processor was broken up into separate logical functions, each of which could operate independently from the others.
Each pipeline stage consumes some inputs from earlier stages and produces some outputs for later stages. The pipeline stages are ordered in a temporal sequence. Local complexity in each pipeline stage is limited in order to reach a high overall clock speed. Some designs, like the Pentium 4, have broken each pipeline stage into even smaller sequential pieces, sometimes referred to as super-pipelining. However, the whole pipeline still consists of a temporally sequential succession of (now smaller) stages, on the order of 20 stages for a Pentium 4. Complementary to pipelining, clustering breaks up the operation within a pipeline stage into multiple concurrent operations. Information from one pipeline stage fans out into multiple clusters operating in parallel. All clusters then feed their results back into a combined successor pipeline stage, as shown in Fig. 4.
An example where clustering was used in a previous design are the EV6 execution units and associated schedulers. In aggregate, the EV6 processor can issue 6 instructions each cycle out of a pool of 35 instruction queue entries. The picker loop has to complete in a single cycle: it has to determine which of the 35 instructions are data ready, pick 6 of them according to some priority scheme, remove the picked instructions from the instruction queue, and inform the remaining instructions when the results of the picked instructions will become ready. This proved to be an unmanageable task at the target clock speed. The solution employed in EV6 is clustering. The scheduler and execution units are broken up into 3 clusters. One cluster operates on all floating point instructions and each of the other two clusters operates on half of the integer instructions, as can be seen in the block diagram of EV6 in Fig. 1. Each integer picker is now responsible for picking 2 data-ready instructions out of a shared pool of 20 instruction queue entries. The floating point picker has to pick 2 data-ready instructions out of its pool of 15 instruction queue entries. This allows the picker loop to run within a short cycle time. The tradeoff that was bought into, however, is that results that have to cross cluster boundaries incur an additional cycle of delay. For example, assume instruction I1 issues in the left cluster in cycle t and has a nominal latency of 1 cycle. An instruction I2 in the left cluster that is data dependent on I1 can issue at cycle t + 1. However, if instruction I2 is in the right cluster, it cannot issue until cycle t + 2, incurring an additional 1-cycle cross-cluster communication penalty. Careful slotting of instructions into clusters is necessary in this case to avoid paying the cross-cluster penalty too frequently. With wire delays becoming more dominant in future process technologies, clustering will need to be used to an increasing degree in order to keep local complexity down and clock speed up.
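The cross-cluster penalty from the EV6 example can be written down in a few lines of Python. The 1-cycle producer latency and 1-cycle inter-cluster delay follow the example in the text; the tiny slotting helper at the end is only a sketch of the general idea, not the actual EV6 slotting logic.

# Earliest issue cycle of a dependent instruction, with cross-cluster penalty.
CROSS_CLUSTER_PENALTY = 1   # extra cycle when a result crosses clusters

def earliest_issue(producer_issue_cycle, producer_latency,
                   producer_cluster, consumer_cluster):
    cycle = producer_issue_cycle + producer_latency
    if producer_cluster != consumer_cluster:
        cycle += CROSS_CLUSTER_PENALTY
    return cycle

t = 0
print(earliest_issue(t, 1, "left", "left"))    # same cluster:  t + 1
print(earliest_issue(t, 1, "left", "right"))   # other cluster: t + 2

# naive slotting heuristic: place a dependent instruction in the cluster of the
# producer whose result it waits for longest, hiding the cross-cluster delay
def slot(consumer_deps, producer_cluster_of, ready_cycle_of):
    critical = max(consumer_deps, key=lambda p: ready_cycle_of[p])
    return producer_cluster_of[critical]

print(slot(["I1", "I3"],
           {"I1": "left", "I3": "right"},
           {"I1": 5, "I3": 2}))   # -> 'left', since I1 is the critical producer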
Speculation
Current microprocessor designs use speculation to varying degrees to cope with the fact that in some situations information is not available at the time it is needed. One example is branch prediction, as used in all Alpha implementations. Since the outcome of a branch is not yet known when the next instruction has to be fetched, a branch predictor is used to guess this outcome and the processor speculatively proceeds down the predicted direction. Another example of speculation is the memory system of EV6. The L1 data cache of EV6 has a 2-cycle access latency, resulting in a 3-cycle load-to-use latency for load instructions. If I1 is an integer load instruction issuing at cycle t, a dependent instruction I2 can issue as soon as cycle t + 3 if I1 hits in the data cache. However, EV6 does not know whether I1 is a hit or a miss in the data cache until cycle t + 5. Therefore, we are in the same dilemma again: we need information in cycle t + 3 that will not become available until cycle t + 5. Speculation is the solution used in EV6 [17]. The processor assumes that I1 will hit in the cache and issues I2 speculatively. If I1 hits, the speculation was correct and the schedule achieved maximum performance. However, if I1 misses, I2 and all instructions that have been issued since have to be squashed and reissued later. This guarantees architectural correctness, but costs performance because issue slots were wasted. The goal is to keep this waste to a minimum. In order to do this, EV6 uses a load hit/miss predictor. The hit/miss predictor is trained with the outcome of the load hit/miss speculation of previous encounters of a load instruction. If the hit/miss predictor has encountered many misspeculations in the past, it does not allow instruction I2 to issue until the outcome of the load instruction I1 is determined for sure in cycle t + 5. However, if the speculation was correct frequently in the past, I2 can issue as soon as possible, i.e. in cycle t + 3.
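A minimal sketch of a load hit/miss predictor of the kind described above, using one saturating counter per load (indexed here by the load's PC). The counter width, threshold, and update amounts are illustrative assumptions; the actual EV6 mechanism is described in [17].

# Per-load saturating counter deciding whether to speculate on a cache hit.
counters = {}                      # load PC -> 4-bit saturating counter

def predict_hit(load_pc, threshold=8):
    # high counter value: the load usually hit, so issue dependents at t + 3
    return counters.get(load_pc, 15) >= threshold

def train(load_pc, did_hit):
    c = counters.get(load_pc, 15)
    c = min(c + 1, 15) if did_hit else max(c - 4, 0)   # punish misses harder
    counters[load_pc] = c

pc = 0x1200
print(predict_hit(pc))     # True: start out assuming loads hit
for _ in range(3):
    train(pc, did_hit=False)
print(predict_hit(pc))     # False: after repeated misses, wait until t + 5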
The two cases above are simple examples where speculation has to be used because information is produced later than it is needed. With wire delays becoming more dominant and the increasing use of clustering, another form of information deficiency is on the rise. When multiple clusters operate in parallel with one or several cycles of communication latency between them, it occurs more frequently that one cluster has information that another cluster needs. Even though this information is available in a timely fashion, its producer and consumer are not in the same "time domain". The information is on time but in the wrong place. This effect will be seen increasingly with future technologies. On an EV6 die (0.35 µm) a signal could still cross the entire chip in one clock cycle. In a 0.1 µm technology, a signal would require 5-6 clock cycles to travel from one functional block to another across the die, assuming the historical trend of constant die sizes across process generations continues. Increased use of speculation is a solution to this delayed-information problem. The information receiver has to make a guess as to what information it is going to receive in order to make rapid progress and meet performance goals. In case of misspeculation, appropriate misspeculation recovery actions have to be performed on all affected components.
Process technology improvements provide the fuel to run the high-performance processor engine in the future. Besides the ever increasing transistor count and device speed, future implementation technologies also pose significant challenges to the processor architect to get the highest benefit and best possible performance from process technology advances.

Conclusions
In this article, I have presented an overview of the past developments, current state of the art, and near-future trends and problems in high-performance general-purpose microprocessor architecture and design. The paper has followed the Alpha RISC processor designs from the birth of the Alpha ISA and architecture in the late 1980s, through the past decade, to the current generation, and has given an outlook on the design plans for the Alpha architectures of the near future. Coarse trends and problem areas for current and future processor implementations have been reviewed in the areas of program characteristics and behavior as well as process technology, and some implications for processor architecture and design were pointed out.

Acknowledgements
I would like to extend my thanks to my colleagues at the Compaq Alpha Advanced Development Group (VSSAD) and the EV8 design team for providing answers to my pestering questions about ancient history (late 1980s, early 1990s), memories of which seem to be fading away rapidly.

References
[1] Luiz Barroso, Kourosh Gharachorloo, et al. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In 27th Intl. Symp. on Computer Architecture, pages 282-293, Vancouver, Canada, June 2000.
[2] Luiz Barroso, Kourosh Gharachorloo, Andreas Nowatzyk, and Ben Verghese. Impact of Chip-Level Integration on Performance of OLTP Workloads. In 6th Intl. Symp. on High Performance Computer Architecture, Toulouse, France, January 2000. IEEE.
[3] Keith Diefendorff. Power4 Focuses on Memory Bandwidth. Microprocessor Report, 13(10):11-17, October 1999.
[4] Joel Emer and Douglas Clark. Retrospective: Characterization of Processor Performance in the VAX-11/780. In 25 Years of the Intl. Symp. on Computer Architecture (selected papers), pages 37-38, Barcelona, Spain, June 1998.
[5] Anil Jain et al. A 1.2 GHz Alpha Microprocessor with 44.8 GB/s of Chip Pin Bandwidth. In Intl. Solid State Circuits Conf., San Francisco, CA, February 2001. IEEE.
[6] John H. Edmondson et al. Internal Organization of the Alpha 21164, a 300 MHz 64-bit Quad-issue CMOS RISC Microprocessor. Digital Technical Journal, 7(1):119-135, 1995.
[7] William Bowhill et al. Circuit Implementation of a 300 MHz 64-bit Second-generation CMOS Alpha CPU. Digital Technical Journal, 7(1):100-118, 1995.
[8] Kourosh Gharachorloo, Madhu Sharma, Simon Steely, and Stephen Van Doren. Architecture and Design of the AlphaServer GS320. In 9th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 13-24, Boston, MA, November 2000. ACM.
[9] Peter Glaskowsky. Pentium 4 (Partially) Reviewed. Microprocessor Report, 14(8):10-13, August 2000.
[10] Michael Gowan, Larry Biro, and Daniel Jackson. Power Considerations in the Design of the Alpha 21264 Microprocessor. In Design Automation Conference, pages 726-731, San Francisco, CA, 1998. ACM.
[11] Paul Gronowski, William Bowhill, Ronald Preston, Michael Gowan, and Randy Allman. High-Performance Microprocessor Design. IEEE Journal of Solid-State Circuits, 33(5):676-686, May 1998.
[12] John Hennessy and David Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 1990. ISBN 1-55860-069-8.
[13] Doug Hunt. Advanced Performance Features of the 64-bit PA-8000. In COMPCON, pages 123-128, San Francisco, CA, March 1995. IEEE.
[14] William Johnson. Superscalar Microprocessor Design. Prentice-Hall, 1991. ISBN 0-13-875634-1.
[15] Norman P. Jouppi. Architectural and Organizational Tradeoffs in the Design of the MultiTitan CPU. Research Report 89/9, Digital WRL, July 1989.
[16] R. E. Kessler, E. J. McLellan, and D. A. Webb. The Alpha 21264 Microprocessor Architecture. In ICCD'98, Intl. Conf.
on Computer Design: VLSI in Computers and Processors, Austin, TX, October 1998. IEEE.
[17] Richard Kessler. The Alpha 21264 Microprocessor. IEEE Micro, 19(2):24-36, March 1999.
[18] Jack L. Lo, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Rebecca L. Stamm, and Dean M. Tullsen. Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading. ACM Transactions on Computer Systems, 15(3):322-354, August 1997.
[19] T. Xanthopoulos, D. Bailey, A. Gangwar, M. Gowan, A. Jain, and B. Prewitt. The Design and Analysis of the Clock Distribution Network of a 1.2 GHz Alpha Microprocessor. In Intl. Solid State Circuits Conf., San Francisco, CA, February 2001. IEEE.
[20] Kenneth C. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro, 16(2):28-41, February 1996.