AMD OPTERON ARCHITECTURE
Omar Aragon, Abdel Salam Sayyad

Abstract— The AMD Opteron processor is a 64-bit, x86 architecture. It has an on-chip double-data-rate memory controller and three
HyperTransport links. The processor is an out-of-order, superscalar design with on-chip level-1 and level-2 caches. Uniprocessor benchmarks
show that the Opteron is more efficient than the Intel Xeon processor.
Index Terms— benchmarks, microprocessor, Opteron, pipeline, superscalar, x86-64-bit.
I. INTRODUCTION
This document presents the main characteristics of the AMD Opteron microprocessor. The processor was released on April
22, 2003 to compete in the server and workstation markets. Configurations released subsequently
included 4-core, 8-core and 12-core parts; the 8-core and 12-core parts were code-named "Magny-Cours".
The information is presented as follows: Section II explains the main features of the processor.
Section III describes the core microarchitecture, including a description of each functional unit of the processor. Section
IV compares the processor with the Intel Xeon and defines the benchmarks used. Finally, Section V gives some
examples of where this processor is used.
II. FEATURES
The main properties of the Opteron processor are described below. Figure 1 displays the block diagram of a single
processor.
Figure 1. Block diagram for the Opteron processor.
A. x86-64 Architecture
x86-64 is an extension of the x86 instruction set. The architecture defines 64-bit virtual addresses and up to 52-bit physical addresses, allowing
programmers to work with larger data sets. The Opteron implementation supports 48-bit virtual and 40-bit physical
addresses.
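The address widths above translate directly into address-space sizes. The following is a minimal sketch of that arithmetic. The canonical-address check is not described in this paper; it is included on the assumption that the reader wants the standard x86-64 rule that the unused upper bits of a virtual address must all equal the top implemented bit. All function names are illustrative.

```python
# Address widths quoted in the text for the Opteron implementation.
VIRTUAL_BITS = 48
PHYSICAL_BITS = 40
TIB = 1 << 40  # one tebibyte, in bytes


def address_space_bytes(bits: int) -> int:
    """Number of bytes addressable with the given address width."""
    return 1 << bits


def is_canonical(addr: int, virtual_bits: int = VIRTUAL_BITS) -> bool:
    """True if bits 63..(virtual_bits - 1) of a 64-bit address are all equal.

    Assumption: this is the standard x86-64 canonical-form rule, which the
    paper itself does not spell out.
    """
    upper = (addr & 0xFFFF_FFFF_FFFF_FFFF) >> (virtual_bits - 1)
    all_ones = (1 << (64 - virtual_bits + 1)) - 1
    return upper in (0, all_ones)


if __name__ == "__main__":
    print(address_space_bytes(VIRTUAL_BITS) // TIB, "TiB virtual")
    print(address_space_bytes(PHYSICAL_BITS) // TIB, "TiB physical")
```

With these widths, the virtual address space is 256 TiB and the physical address space is 1 TiB.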
This architecture doubles the number of registers and extends all integer arithmetic and logical instructions to 64
bits. The number of integer general-purpose registers and of streaming SIMD extension (SSE) registers increases from 8 to 16. The width of the general-purpose
registers increases from 32 to 64 bits.
Compatibility with 32-bit code comes without speed penalties, so existing applications and operating systems run without
modification or recompilation.
B. Memory controller
The memory controller is a digital circuit that manages the data flow between the processor and the main memory. The fact
that the memory controller is on-chip reduces memory latency and increases bandwidth, but ties the microprocessor to a specific
type of memory.
C. Out-of-order execution
Out-of-order execution is a technique used in high-performance processors to make use of instruction cycles that would otherwise be wasted
waiting on a delay. The processor executes instructions in an order governed by the availability of their input data, rather than by their original order in the program. This
lets the processor avoid the stalls that occur when the data needed by the next instruction
is not yet available.
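The benefit can be illustrated with a toy model. The sketch below is not a model of the Opteron scheduler: it assumes one instruction issued per cycle, each register written only once (as if renamed), and a hypothetical four-instruction program. The latencies of 3 cycles for the load and the multiply echo figures quoted later in this paper; everything else is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str
    srcs: tuple
    latency: int


def simulate(program, out_of_order):
    """Issue at most one instruction per cycle; return (finish_cycle, issue_order)."""
    ready_at = {}            # register -> cycle at which its value becomes available
    pending = list(program)  # instructions not yet issued, in program order
    issue_order = []
    cycle = 0
    finish = 0
    while pending:
        # In-order issue may only consider the oldest pending instruction.
        window = list(pending) if out_of_order else pending[:1]
        for i, ins in enumerate(window):
            # An instruction must wait for any older, still-pending producer.
            older_dests = {p.dest for p in window[:i]}
            waits_on_older = any(s in older_dests for s in ins.srcs)
            operands_ready = all(ready_at.get(s, 0) <= cycle for s in ins.srcs)
            if operands_ready and not waits_on_older:
                pending.remove(ins)
                ready_at[ins.dest] = cycle + ins.latency
                finish = max(finish, cycle + ins.latency)
                issue_order.append(ins.name)
                break
        cycle += 1
    return finish, issue_order


# Hypothetical program: add_a depends on the load; mul and add_b are independent.
PROGRAM = [
    Instr("load", "r1", ("rbx",), 3),
    Instr("add_a", "r2", ("r1",), 1),
    Instr("mul", "r3", ("r4", "r5"), 3),
    Instr("add_b", "r6", ("r7", "r8"), 1),
]

if __name__ == "__main__":
    print("in-order:    ", simulate(PROGRAM, out_of_order=False))
    print("out-of-order:", simulate(PROGRAM, out_of_order=True))
```

In this toy program the in-order machine idles while `add_a` waits for the load, whereas the out-of-order machine fills those cycles with the independent `mul` and `add_b`.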
D. Superscalar
This feature implements instruction-level parallelism within a single processor. The processor executes more than
one instruction during a clock cycle by dispatching multiple instructions simultaneously to redundant functional units on the processor. Each
functional unit is not a separate CPU, but an execution resource inside the single processor.
A typical superscalar processor fetches and decodes several instructions at a time. A branch prediction unit is used to prevent
interruption of the instruction stream. The instructions are then inspected for data dependencies and distributed to functional
units, which are usually organized by instruction type.
E. HyperTransport links
HyperTransport is a technology for interconnecting processors and I/O devices. It is a bidirectional, serial/parallel,
high-bandwidth, low-latency point-to-point link that runs at clock speeds from 200 MHz to 3.2 GHz. It sends data on both
rising and falling edges of the clock signal.
III. MICROARCHITECTURE
The AMD Opteron processor is an aggressive, out-of-order, three-way superscalar processor. Figure 2 shows the processor
core block diagram. It fetches and decodes three x86-64 instructions each cycle and can reorder up to 72 micro-ops. The instructions are
transformed from variable-length x86-64 instructions into fixed-length micro-ops and dispatched to two independent schedulers: one for
integer and one for floating-point and multimedia. The load/store unit handles the load and store micro-ops. Together, the
schedulers and the load/store unit can issue up to 11 micro-ops each cycle to the following units:
• Three integer units.
• Three address generation units.
• Three floating-point and multimedia units.
• Two loads/stores to the data cache.
Figure 2. Processor core block diagram.
A. Pipeline
The pipeline of the Opteron processor is fully integrated from instruction fetch to DRAM access, as shown in
Figure 3. It is long enough for high frequency and short enough for a good CPI (cycles per instruction). Fetch and decode take
7 cycles. The pipeline has 12 stages for integer operations and 17 stages for floating-point operations. The data cache access occurs in stage 11 for
loads, and the result returns in stage 12.
When the L1 cache misses, the pipeline accesses the L2 cache and the system request queue at the same time. If the
L2 cache hits, the pipeline cancels the system request. Otherwise, the memory controller schedules the request to DRAM.
Figure 3. Opteron pipeline.
B. Caches
Figure 4 shows the cache, DDR and HyperTransport blocks.
Figure 4. Cache, DDR, and HyperTransport.
Two levels of cache are included:
1- L1, with separate instruction and data caches of 64 Kbytes each. They are two-way set-associative, linearly indexed, and
physically tagged, with a cache line size of 64 bytes.
2- L2, a unified cache for both data and instructions: 1 Mbyte, 16-way set-associative. It uses a pseudo-least-recently-used (pseudo-LRU)
replacement policy.
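The cache parameters above determine how an address is split into tag, set index and line offset. The following is a minimal sketch of that arithmetic. The 64-byte line size for the L2 cache is an assumption, since the text gives a line size only for L1, and the function names are illustrative.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int):
    """Return (number of sets, offset bits, index bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 of a power of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr: int, size_bytes: int, ways: int, line_bytes: int):
    """Split an address into (tag, set index, byte offset within the line)."""
    sets, offset_bits, index_bits = cache_geometry(size_bytes, ways, line_bytes)
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# L1 parameters as stated in the text.
L1 = dict(size_bytes=64 * 1024, ways=2, line_bytes=64)
# L2 size and associativity as stated; the 64-byte line size is assumed.
L2 = dict(size_bytes=1024 * 1024, ways=16, line_bytes=64)

if __name__ == "__main__":
    print("L1 (sets, offset bits, index bits):", cache_geometry(**L1))
    print("L2 (sets, offset bits, index bits):", cache_geometry(**L2))
    print("0x12345 in L1 (tag, index, offset):", split_address(0x12345, **L1))
```

For the L1 caches this gives 512 sets, 6 offset bits and 9 index bits. Because the L1 is linearly indexed and physically tagged, the index bits come from the linear address while the tag comparison uses the translated physical address.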
C. Translation Look-aside Buffers (TLBs)
The AMD Opteron has independent L1 and L2 translation look-aside buffers (TLBs). The L1 TLB is fully associative and
stores thirty-two 4-Kbyte page translations and eight 2-Mbyte/4-Mbyte page translations. The L2 TLB is four-way set-associative with 512 4-Kbyte entries.
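A useful way to read these entry counts is as TLB reach, the amount of memory that can be mapped without a TLB miss. The sketch below simply multiplies entries by page size using the figures above. The large-page entries are evaluated at 2 Mbytes only, which is a simplification, and the names are illustrative.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of memory covered when every entry maps one page of the given size."""
    return entries * page_bytes


L1_SMALL_PAGE_REACH = tlb_reach(32, 4 * KIB)   # thirty-two 4-Kbyte entries
L1_LARGE_PAGE_REACH = tlb_reach(8, 2 * MIB)    # eight large-page entries, taken as 2 Mbytes
L2_SMALL_PAGE_REACH = tlb_reach(512, 4 * KIB)  # 512 4-Kbyte entries

if __name__ == "__main__":
    print(L1_SMALL_PAGE_REACH // KIB, "KiB covered by L1 TLB 4-Kbyte entries")
    print(L1_LARGE_PAGE_REACH // MIB, "MiB covered by L1 TLB large-page entries")
    print(L2_SMALL_PAGE_REACH // MIB, "MiB covered by the L2 TLB")
```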
D. Instruction fetch and decode
The fetch unit provides 16 instruction bytes each cycle to the scan/align unit, which scans the data and aligns it on instruction
boundaries. This process is complicated because x86 instructions vary in length between 1 and 15 bytes.
The processor decodes variable-length instructions into internal fixed-length micro-ops using fast-path decoders or the
microcode engine. Fast-path decoders can issue three x86 instructions per cycle. Microcode, on the other hand, can issue only one
x86 instruction per cycle.
E. Branch prediction
Opteron uses a combination of branch prediction schemes. The branch selector array selects between static prediction and the
history table. The history table uses 2-bit saturating up/down counters. To speed up target calculation, the branch target array
holds the branch target addresses.
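A 2-bit saturating up/down counter of the kind mentioned above can be sketched as follows. This is the generic textbook counter, not the Opteron's history table: the state encoding (0 to 3, predicting taken at 2 or above), the initial state and the loop-branch pattern are all assumptions made for illustration.

```python
class TwoBitCounter:
    """Saturating counter: states 0-1 predict not taken, states 2-3 predict taken."""

    def __init__(self, state: int = 2):
        if not 0 <= state <= 3:
            raise ValueError("state must be in 0..3")
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Count up on taken, down on not taken, saturating at 3 and 0.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, initial_state: int = 2) -> int:
    """Run one counter over a sequence of branch outcomes and count wrong predictions."""
    counter = TwoBitCounter(initial_state)
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# Hypothetical loop branch: taken nine times, then not taken at loop exit, three times over.
LOOP_PATTERN = ([True] * 9 + [False]) * 3

if __name__ == "__main__":
    print(count_mispredictions(LOOP_PATTERN), "mispredictions in", len(LOOP_PATTERN), "branches")
```

The two bits give the counter hysteresis: a single loop exit moves it from strongly taken to weakly taken, so the next execution of the loop is still predicted correctly.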
F. Integer and floating-point execution units
Decoders dispatch three micro-ops each cycle to the instruction control unit, which has a 72-entry reorder buffer and controls
the execution of the instructions. The decoders also send micro-ops to the integer and floating-point schedulers, which
issue them when their operands are available. Opteron has three units for each of the following functions:
integer execution, address generation and floating-point execution.
The integer units have a full 64-bit data path and most of the instructions are single cycle. The 32-bit multiply takes 3 cycles
and the 64-bit multiply takes 5.
The floating-point data path is full 80-bit extended precision. All floating-point units are fully pipelined. Operations such as
compare and MMX instructions take 2 cycles.
G. Load/store unit
The decoders also issue load/store micro-ops to the load/store unit, which manages access to the data cache and system
memory. Each cycle, two 64-bit loads or stores can access the data cache as long as they are not to the same bank. Load-to-use
latency is 3 cycles. Loads can return results to the execution units out of order, but stores cannot commit data until they retire.
A hardware prefetcher identifies patterns of cache misses to consecutive lines. The L2 cache can handle 10 outstanding requests for
cache misses, state changes and TLB misses.
H. Integrated Memory controller
The on-chip memory controller is a low-latency, high-bandwidth, DDR SDRAM controller, which provides a dual-channel
128-bit wide interface to 333-MHz DDR memory composed of dual in-line memory modules (DIMMs). The controller’s peak
memory bandwidth is 5.3 Gbytes/s. It supports eight registered or unbuffered DIMMs, for a memory capacity of 16 Gbytes using
2-Gbyte DIMMs, and includes ECC checking with double-bit detect and single-bit correct.
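The peak bandwidth and capacity figures follow from the interface parameters. The sketch below reproduces the arithmetic, assuming that "333-MHz DDR" denotes 333 million transfers per second across the full 128-bit interface and that Gbytes/s is a decimal unit. The names are illustrative.

```python
INTERFACE_BITS = 128                # dual-channel interface width from the text
TRANSFERS_PER_SECOND = 333_000_000  # assumption: "333-MHz DDR" means 333 MT/s
DIMM_SLOTS = 8
DIMM_GBYTES = 2


def peak_bandwidth_bytes_per_s(width_bits: int, transfers_per_s: int) -> int:
    """Peak bandwidth = bytes moved per transfer x transfers per second."""
    return (width_bits // 8) * transfers_per_s


PEAK = peak_bandwidth_bytes_per_s(INTERFACE_BITS, TRANSFERS_PER_SECOND)
CAPACITY_GBYTES = DIMM_SLOTS * DIMM_GBYTES

if __name__ == "__main__":
    print(round(PEAK / 1e9, 1), "Gbytes/s peak")
    print(CAPACITY_GBYTES, "Gbytes maximum capacity")
```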
I. HyperTransport
HyperTransport is a bidirectional, serial/parallel, scalable, high-bandwidth, low-latency bus that also facilitates power
management.
The AMD Opteron has three integrated HyperTransport links. Each link is 16 bits wide (3.2 Gbytes/s per direction) and can be
configured as either coherent or non-coherent HyperTransport. Coherent links connect processors, and
non-coherent links connect I/O devices.
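The per-direction figure can be related to the link clock. Because HyperTransport transfers data on both clock edges, a 16-bit link that moves 3.2 Gbytes/s in one direction implies an 800 MHz link clock. That clock value is derived here rather than stated in the text, so the sketch below treats it as an assumption. The aggregate figure is simple multiplication and ignores protocol overhead.

```python
LINK_WIDTH_BITS = 16
LINK_CLOCK_HZ = 800_000_000  # assumption: implied by 3.2 Gbytes/s on a 16-bit link
EDGES_PER_CLOCK = 2          # data is sent on both rising and falling edges
LINKS = 3
DIRECTIONS = 2


def link_bandwidth_bytes_per_s(width_bits: int, clock_hz: int) -> int:
    """Raw one-direction bandwidth of a double-data-rate link."""
    return (width_bits // 8) * EDGES_PER_CLOCK * clock_hz


PER_DIRECTION = link_bandwidth_bytes_per_s(LINK_WIDTH_BITS, LINK_CLOCK_HZ)
AGGREGATE = PER_DIRECTION * DIRECTIONS * LINKS

if __name__ == "__main__":
    print(PER_DIRECTION / 1e9, "Gbytes/s per link per direction")
    print(AGGREGATE / 1e9, "Gbytes/s raw aggregate over three links")
```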
HyperTransport is flexible and can be scaled for different I/O topologies. The programming model for configuration space is
the same as that of the PCI bus specification, version 2.2, making it transparent to operating systems that support PCI.
J. InterCPU connections
It is common to use multiple AMD Opteron CPUs connected through a proprietary extension running on additional
HyperTransport interfaces.
To facilitate this, the AMD Opteron supports a cache-coherent, Non-Uniform Memory Access, multi-CPU memory access
protocol.
With Non-Uniform Memory Access (NUMA), each processor has its own cache memory, but any processor can access its
own and other processors’ main memories. Memory access time depends on memory location, i.e. accessing local memory
is faster than accessing non-local memory.
With cache coherence, the integrity of data from a shared resource that is stored in local caches is guaranteed. Each CPU can access the
main memory of another processor, transparently to the programmer.
The standard MOESI (modified, owner, exclusive, shared, invalid) protocol is used for cache coherence.
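The five MOESI states can be sketched as a small state machine for a single cache line. This is a simplified textbook version and not the exact protocol implemented in the Opteron: the four event names are assumptions, and bus transactions, write-backs and probe filtering are omitted.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    OWNED = "O"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class Event(Enum):
    LOCAL_READ = "local read"
    LOCAL_WRITE = "local write"
    REMOTE_READ = "remote read"    # another cache reads the line
    REMOTE_WRITE = "remote write"  # another cache writes the line


def next_state(state: State, event: Event, others_have_copy: bool = False) -> State:
    """Simplified MOESI transition for one line in one cache."""
    if event is Event.LOCAL_WRITE:
        # The writer gains the only valid copy; other copies are invalidated.
        return State.MODIFIED
    if event is Event.REMOTE_WRITE:
        return State.INVALID
    if event is Event.LOCAL_READ:
        if state is State.INVALID:
            return State.SHARED if others_have_copy else State.EXCLUSIVE
        return state
    # Remote read: a dirty line stays in this cache as OWNED and is supplied
    # to the reader; a clean exclusive line becomes SHARED.
    if state is State.MODIFIED:
        return State.OWNED
    if state is State.EXCLUSIVE:
        return State.SHARED
    return state


if __name__ == "__main__":
    s = State.INVALID
    for ev in (Event.LOCAL_READ, Event.LOCAL_WRITE, Event.REMOTE_READ, Event.REMOTE_WRITE):
        s = next_state(s, ev)
        print(f"{ev.value:>12} -> {s.value}")
```

The owned state is what distinguishes MOESI from MESI: a modified line can be shared with another cache without first being written back to memory.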
K. Reliability
The AMD Opteron processor includes several reliability features. Error-correcting code (ECC) or parity checking protects all large
arrays. The L1 data cache, L2 cache and DRAM are protected by ECC. Eight check bits for each 64 bits of data allow correction of
single-bit errors and detection of double-bit errors.
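The figure of eight check bits per 64 data bits matches what a single-error-correct, double-error-detect (SECDED) Hamming code requires. The sketch below computes that count from the Hamming bound. It assumes the ECC is an extended Hamming code, which the text does not state explicitly, and it does not implement the encoding or decoding itself.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 1
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """One extra overall parity bit adds double-error detection."""
    return hamming_check_bits(data_bits) + 1


if __name__ == "__main__":
    bits = secded_check_bits(64)
    print(bits, "check bits per 64 data bits")
    print(f"{bits / 64:.1%} storage overhead")
```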
The L1 data cache, L2 cache tags and DRAM implement hardware scrubbers that steal idle cycles to
clean up single-bit ECC errors. Parity checking protects the instruction cache data and the L1 and L2 TLBs. Errors are reported
via the machine check architecture.
IV. BENCHMARKS
Benchmarks provide comparative results of systems that perform similar tasks. The purpose of the benchmarks is to define
performance expectations of real-world tasks.
The Standard Performance Evaluation Corporation (SPEC) developed standard benchmarks of CPU performance. The tasks
are evaluated in two categories: SPECint for integer performance and SPECfp for floating point performance.
The results for the AMD Opteron model 146 against the Intel Xeon 3.0 GHz are displayed below.
Figure 5 shows the SPECint_peak2000 performance for the uniprocessor under Windows. The SPEC rate
measures the capability of running multiple tasks. In this case the AMD Opteron outperforms the Intel Xeon.
Figure 5. SPECint_peak2000 performance
Figure 6 displays the SPECfp_peak2000 performance for the uniprocessor under Windows. The AMD Opteron processor
performs well on this benchmark, ahead of the Xeon, Intel's x86 processor.
Figure 6. SPECfp_peak2000 performance
The High-Performance Linpack (HPL) benchmark has provided the metric for ranking the world's top supercomputers since
1993. One of the libraries used with the AMD processor and analyzed here is GOTO BLAS (Basic Linear Algebra Subprograms).
As shown in Table 1, these benchmarks demonstrate the AMD Opteron's efficiency: it achieves 87.1% of the theoretical peak
floating-point operations per second (FLOPS) on a single-processor system, while a single-CPU 2.4 GHz Xeon shows 81.2%
efficiency.
Table 1. GOTO Libraries benchmarks for the AMD Opteron processor.
The AMD Opteron also scales well. Using HyperTransport technology between processors results in low-latency, high-bandwidth communication. It remains efficient when scaled up to four-processor systems. The HPL efficiency for a 4P system is
almost 84%.
V. USES
This processor is widely used in supercomputers. Computers of this type are used for highly calculation-intensive tasks because of their
high processing capacity and speed of calculation. Some examples from the top 20 fastest supercomputers in the
world as of June 20, 2011 include:
• #3. Oak Ridge National Laboratory, USA. AMD64 Opteron six-core 2600 MHz (10.4 GFlops/unit). Cray Inc. 224,162
total cores.
• #6. Cielo - Cray XE6 8-core 2.4 GHz. Cray Inc. 142,272 total cores.
• #8. Hopper - Cray XE6 12-core 2.1 GHz. Cray Inc. 153,408 total cores.
• #10. The IBM Roadrunner at Los Alamos National Laboratory uses 6,912 Opteron Dual Core processors.
REFERENCES
[1] Ardsher Ahmed, Pat Conway, Bill Hughes, and Fred Weber. AMD Opteron Shared Memory MP Systems. In Proceedings of the 14th HotChips
Symposium, August 2002.
[2] P. Conway, N. Kalyanasundharam, G. Donley, K. Lepak, and B. Hughes. Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor. IEEE
Micro, 30(2):16-29, 2010.
[3] D. O’Flaherty, M. Goddard. AMD Opteron™ Processor Benchmarking for Clustered Systems.
http://www.opteronics.com/pdf/39497A_HPC_WhitePaper_2xCli.pdf. July, 2003.
[4] AMD Opteron™ Processor Product Data Sheet.
http://support.amd.com/us/Processor_TechDocs/23932.pdf. March, 2007.
[5] HyperTransport Consortium. HyperTransport I/O Technology Overview.
http://www.hypertransport.org/docs/wp/HT_Overview.pdf. June, 2004.