AMD OPTERON ARCHITECTURE

Omar Aragon, Abdel Salam Sayyad

Abstract— The AMD Opteron is a 64-bit processor with an x86 architecture. It has an on-chip double-data-rate (DDR) memory controller and three HyperTransport links. The processor is out-of-order and superscalar, and supports on-chip level-1 and level-2 caches. Uniprocessor benchmarks show that the Opteron is more efficient than the Intel Xeon processor.

Index Terms— benchmarks, microprocessor, Opteron, pipeline, superscalar, x86-64.

I. INTRODUCTION

This document presents the main characteristics of the AMD Opteron microprocessor. The processor was released on April 22, 2003 to compete in the server and workstation markets. Configurations released subsequently included 4-core, 8-core, and 12-core parts, the last code-named "Magny-Cours".

The information is presented as follows. Section II explains the main features of the processor. Section III describes the core microarchitecture, including a description of each functional unit. Section IV compares the processor with the Intel Xeon and defines the benchmarks. Finally, Section V gives some examples of where this processor is used.

II. FEATURES

This section describes the main properties of the Opteron processor. Figure 1 displays the block diagram for a single processor.

Figure 1. Block diagram for the Opteron processor.

A. x86-64 Architecture

This is an extension of the x86 instruction set. The x86-64 architecture defines up to 64-bit virtual and 52-bit physical addresses, allowing programmers to work with larger data sets. The Opteron implements 48-bit virtual and 40-bit physical addresses. The architecture doubles the number of registers and extends all the integer arithmetic and logical instructions to 64 bits. The number of integer general-purpose registers and streaming SIMD extension (SSE) registers increases from 8 to 16.
The width of the general-purpose registers increases from 32 to 64 bits. Compatibility with 32-bit code comes without speed penalties, supporting existing applications and operating systems without modification or recompilation.

B. Memory controller

The memory controller is a digital circuit that manages the data flow between the processor and main memory. Placing the memory controller on-chip reduces memory latency and increases bandwidth, but it ties the microprocessor to a specific type of memory.

C. Out-of-order execution

This paradigm, used in high-performance processors, makes use of instruction cycles that would otherwise be wasted by delays. The processor executes instructions in an order governed by the availability of their input data rather than by the original program order. The key point is that this allows the processor to avoid the stalls that occur when the data needed by the next instruction is not yet available.

D. Superscalar

This feature implements instruction-level parallelism within a single processor. The processor executes more than one instruction per clock cycle by dispatching multiple instructions to redundant functional units. Each functional unit is not a separate CPU but an execution resource within the single processor. A typical superscalar processor fetches and decodes several instructions at a time, and a branch prediction unit prevents interruptions of the instruction stream. The instructions are then inspected for data dependencies and distributed to the functional units, which are usually organized by instruction type.

E. HyperTransport links

HyperTransport is a technology for interconnecting processors and other I/O devices. The protocol is a bidirectional, serial/parallel, high-bandwidth, low-latency point-to-point link that runs at clock speeds between 200 MHz and 3.2 GHz. It sends data on both the rising and falling edges of the clock signal.

III. MICROARCHITECTURE

The AMD Opteron is an aggressive, out-of-order, three-way superscalar processor. Figure 2 shows the processor core block diagram. The processor can reorder up to 72 micro-ops, and fetches and decodes three x86-64 instructions each cycle. Instructions are transformed from variable-length x86-64 instructions into fixed-length micro-ops and dispatched to two independent schedulers: one for integer operations and one for floating-point and multimedia operations. A load/store unit handles the load and store micro-ops. The schedulers and the load/store unit can dispatch 11 micro-ops each cycle: three to the integer units, three to the address generation units, three to the floating-point and multimedia units, and two loads/stores to the data cache.

Figure 2. Processor core block diagram.

A. Pipeline

The Opteron pipeline is fully integrated from instruction fetch to DRAM access, as displayed in Figure 3. It is long enough for high frequency and short enough for a good CPI (cycles per instruction). Fetch and decode take 7 cycles, and the full pipeline is 12 stages for integer instructions and 17 stages for floating-point instructions. For loads, the data cache access occurs in stage 11 and the result returns in stage 12. On an L1 cache miss, the pipeline accesses the L2 cache and the system request queue in parallel; if the L2 access hits, the system request is cancelled. Otherwise, the memory controller schedules the request to DRAM.

Figure 3. Opteron pipeline.

B. Caches

Figure 4 shows the cache, DDR, and HyperTransport blocks.

Figure 4. Cache, DDR, and HyperTransport.

Two levels of cache are included:
1- L1: separate instruction and data caches, each 64 Kbytes. They are two-way set-associative, linearly indexed, and physically tagged, with a cache line size of 64 bytes.
2- L2: a unified 1-Mbyte cache for both data and instructions, 16-way set-associative, with a pseudo-least-recently-used (pseudo-LRU) replacement policy.
C. Translation Look-aside Buffers (TLBs)

The AMD Opteron has independent L1 and L2 translation look-aside buffers (TLBs). The L1 TLB is fully associative and stores thirty-two 4-Kbyte page translations and eight 2-Mbyte/4-Mbyte page translations. The L2 TLB is four-way set-associative with 512 4-Kbyte entries.

D. Instruction fetch and decode

The fetch unit provides 16 instruction bytes each cycle to the scan/align unit, which scans the data and aligns it on instruction boundaries. This process is complicated by the fact that x86 instructions can vary in length between 1 and 15 bytes. The processor decodes the variable-length instructions into internal fixed-length micro-ops using either fast-path decoders or the microcode engine. The fast-path decoders can issue three x86 instructions per cycle; the microcode engine, on the other hand, can issue only one x86 instruction per cycle.

E. Branch prediction

The Opteron uses a combination of branch prediction schemes. The branch selector array selects between static prediction and the history table, which uses 2-bit saturating up/down counters. To accelerate target calculation, the branch target array holds the branch target addresses.

F. Integer and floating-point execution units

The decoders dispatch three micro-ops each cycle to the instruction control unit, which has a 72-entry reorder buffer and controls the execution of the instructions. The decoders also generate micro-ops for the integer and floating-point schedulers, which issue micro-ops as their operands become available. The Opteron has three units for each of the following functions: integer execution, address generation, and floating-point execution. The integer units have a full 64-bit data path, and most integer instructions take a single cycle; a 32-bit multiply takes 3 cycles and a 64-bit multiply takes 5. The floating-point data path is a full 80-bit extended-precision path, and all floating-point units are fully pipelined. Operations like compares and MMX instructions take 2 cycles.

G. Load/store unit

The decoders also issue load/store micro-ops to the load/store unit, which manages access to the data cache and system memory. Each cycle, two 64-bit loads or stores can access the data cache as long as they are not in the same bank. The load-to-use latency is 3 cycles. Loads can return results to the execution units out of order, but stores cannot commit data until the corresponding instructions retire. A hardware prefetcher identifies patterns of cache misses to consecutive lines. The L2 cache can handle 10 outstanding requests for cache misses, state changes, and TLB misses.

H. Integrated memory controller

The on-chip memory controller is a low-latency, high-bandwidth DDR SDRAM controller that provides a dual-channel, 128-bit-wide interface to 333-MHz DDR memory composed of dual in-line memory modules (DIMMs). The controller's peak memory bandwidth is 5.3 Gbytes/s. It supports eight registered or unbuffered DIMMs, for a memory capacity of 16 Gbytes using 2-Gbyte DIMMs, and includes ECC checking with double-bit detection and single-bit correction.

I. HyperTransport

HyperTransport is a bidirectional, serial/parallel, scalable, high-bandwidth, low-latency bus that also facilitates power management. The AMD Opteron has three integrated HyperTransport links. Each link is 16 bits wide (3.2 Gbytes/s per direction) and can be configured as either coherent or non-coherent HyperTransport: the former is for connecting processors, and the latter is for connecting I/O devices. HyperTransport is flexible and can be scaled for different I/O topologies. Its configuration-space programming model is the same as that of the PCI bus specification, version 2.2, making it transparent to operating systems that support PCI.

J. InterCPU connections

It is common to use multiple AMD Opteron CPUs connected through a proprietary extension running on additional HyperTransport interfaces.
To facilitate this, the AMD Opteron supports a cache-coherent, non-uniform memory access (NUMA), multi-CPU memory access protocol. With NUMA, each processor has its own local memory, but any processor can access both its own and other processors' main memories. Memory access time depends on memory location: accessing local memory is faster than accessing non-local memory. Cache coherence guarantees the integrity of data stored in the local caches of a shared resource, so each CPU can access the main memory of another processor transparently to the programmer. The standard MOESI (modified, owned, exclusive, shared, invalid) protocol is used for cache coherence.

K. Reliability

The AMD Opteron processor includes several reliability features. Error-correcting code (ECC) or parity checking protects all large arrays. The L1 data cache, L2 cache, and DRAM are protected by ECC: eight check bits per 64 bits of data allow correction of single-bit errors and detection of double-bit errors. The L1 data cache, L2 cache tags, and DRAM also implement hardware scrubbers that steal idle cycles to clean up single-bit ECC errors. Parity checking protects the instruction cache data and the L1 and L2 TLBs. All of these errors are reported via the machine check architecture.

IV. BENCHMARKS

Benchmarks provide comparative results for systems that perform similar tasks; their purpose is to set performance expectations for real-world workloads. The Standard Performance Evaluation Corporation (SPEC) developed standard benchmarks of CPU performance. The tasks are evaluated in two categories: SPECint for integer performance and SPECfp for floating-point performance. The results for the AMD Opteron model 146 against the 3.0-GHz Intel Xeon are displayed below. Figure 5 shows the SPECint_peak2000 results for the uniprocessor under Windows. The SPEC rate measures the capability of running multiple tasks.
In this case, the AMD Opteron outperforms the Intel Xeon.

Figure 5. SPECint_peak2000 performance.

Figure 6 displays the SPECfp_peak2000 results for the uniprocessor under Windows. The AMD Opteron also performs well on this benchmark, ahead of Intel's Xeon.

Figure 6. SPECfp_peak2000 performance.

The High-Performance Linpack (HPL) benchmark has provided the metric for ranking the world's top supercomputers since 1993. One of the libraries used with the AMD processor and analyzed here is GOTO BLAS (Basic Linear Algebra Subprograms). As displayed in Table 1, these benchmarks demonstrate the AMD Opteron's efficiency: it achieves 87.1% of the theoretical peak floating-point operations per second (FLOPS) on a single-processor system, while a single-CPU 2.4-GHz Xeon shows 81.2% efficiency.

Table 1. GOTO library benchmarks for the AMD Opteron processor.

The AMD Opteron also scales well. Using HyperTransport technology between processors results in low-latency, high-bandwidth communication, and efficiency remains high when scaling up to four-processor systems: the HPL efficiency for a 4P system is almost 84%.

V. USES

This processor is mainly used in supercomputers, which handle highly calculation-intensive tasks thanks to their high processing capacity and calculation speed. Some examples among the top 20 fastest supercomputers in the world as of June 20, 2011 include:
#3. Oak Ridge National Laboratory, USA. AMD64 Opteron six-core 2600 MHz (10.4 GFlops/unit). Cray Inc. 224,162 total cores.
#6. Cielo - Cray XE6 8-core 2.4 GHz. Cray Inc. 142,272 total cores.
#8. Hopper - Cray XE6 12-core 2.1 GHz. Cray Inc. 153,408 total cores.
#10. The IBM Roadrunner at Los Alamos National Laboratory uses 6,912 Opteron dual-core processors.