The Intel® processor roadmap for industry-standard servers
technology brief, 8th edition
Abstract
Introduction
Intel processor architecture and microarchitectures
NetBurst® microarchitecture
   Hyper-pipeline and clock frequency
   Hyper-Threading Technology
   NetBurst microarchitecture on 90nm silicon process technology
      Extended hyper-pipeline
      SSE3 instructions
      64-bit extensions — Intel 64
   Dual-core technology
Intel Core™ microarchitecture
   Processors
   Xeon dual-core processors
   Xeon quad-core processors
      Enhanced SpeedStep® Technology
      Intel® Virtualization Technology
Performance comparisons
   TPC-C performance
   SPEC performance
Intel Nehalem microarchitecture
Conclusion
For more information
Call to action
Abstract
Intel® continues to introduce processor technologies that boost the performance of x86 processors in
multi-threaded environments. This paper describes these processors and some of the more important
innovations as they affect HP industry-standard enterprise servers.
Introduction
As standards-based computing has pushed into the enterprise server market, the demand for
increased performance and greater variety in processor solutions has grown with it. To meet this
demand, Intel continues to introduce processor innovations and new speeds. This paper summarizes
the recent history and near-term plans for Intel processors as they relate to the industry-standard
enterprise server market.
Intel processor architecture and microarchitectures
The Intel processor architecture refers to its x86 instruction set and registers that are exposed to
programmers. The x86 instruction set is the list of all instructions and their variations that can be
executed by processors derived from the original 16-bit 8086 processor architecture. Processor
manufacturers, such as Intel and AMD, use a common processor architecture to maintain backward
and forward compatibility of the instruction set among generations of their processors. Intel refers to
its 32-bit x86 processor architecture as Intel Architecture, 32-bit (IA-32), and its 64-bit x86
extensions as Intel 64; IA-64 refers to the separate Itanium® architecture.
In comparison, the term “microarchitecture” refers to each processor’s physical design that implements
the instruction set. Processors with different microarchitectures, Intel and AMD x86 processors for
example, can still use a common instruction set.
Figure 1 shows the relationship between the x86 processor architecture and Intel’s evolving
microarchitectures, as well as processors based on these microarchitectures.
Figure 1. Intel processor architecture and microarchitectures for industry-standard enterprise servers
Intel processor sequences are intended to help developers select the best processor for a particular
platform design. Intel offers three processor number sequences for server applications (see Table 1).
Intel processor series numbers within a sequence (for example, 5100 series) help differentiate
processor features such as number of cores, architecture, cache, power dissipation, and embedded
Intel technologies.
Table 1. Intel processor sequences

Processor sequence                                           | Platform
Dual-Core Intel® Xeon™ processor 3000 sequence               | Uni-processor servers
Dual-Core and Quad-Core Intel® Xeon™ processor 5000 sequence | Dual-processor high-volume servers and workstations
Dual-Core and Quad-Core Intel® Xeon™ processor 7000 sequence | Enterprise servers with 4 to 32 processors
Intel enhances the microarchitecture of a family of processors over time to improve performance and
capability while maintaining compatibility with the processor architecture. One method to enhance
the microarchitectures involves changing the silicon process technology. For example, Figure 2 shows
that Intel enhanced NetBurst-based processors in 2004 by changing the manufacturing process from
130nm to 90nm silicon process technology.
In the second half of 2006, Intel launched the Core™ microarchitecture, which is the basis for the
multi-core Xeon 5000 Sequence processors, including the first quad-core Xeon processor
(Clovertown). Beginning with the Penryn family of processors, Intel plans to enhance the performance
and energy efficiency of Intel Core microarchitecture-based processors by switching from 65nm to
45nm Hi-k¹ process technology with the hafnium-based high-k + metal gate transistor design. In
2008, Intel plans initial production of processors based on the “next generation” Nehalem
microarchitecture.
Figure 2. Intel microarchitecture introductions and associated silicon process technologies for industry-standard
servers
¹ Hi-k, or high-k, stands for high dielectric constant, a measure of how much charge a material can hold. For
more information, refer to http://www.intel.com/technology/silicon/high-k.htm?iid=tech_arch_45nm+body_hik.
Table 2 includes more details about the release dates and features of previously released Intel x86
processors as well as processors projected to be available through 2007.
Table 2. Release dates and features of Intel x86 processors

Code name   | Market name | Feature size (nm) | Description                | Date available/projected | Cache           | Max. bus speed¹ (MT/s)
Smithfield  | Pentium D   | 90                | Dual-core uni-processor    | 2H2005                   | 1MB L2 per core | 800
Irwindale   | Xeon        | 90                | 2MB L2 version of Nocona   | 1Q2005                   | 2MB L2          | 800
Cranford    | Xeon MP     | 90                | Xeon MP                    | 1Q2005                   | 1MB L2          | 667
Prescott 2M | Xeon        | 90                | 2MB L2 version of Prescott | 1Q2005                   | 2MB L2          | 800
Potomac     | Xeon MP     | 90                | Xeon MP                    | 1Q2005                   | 8MB L3          | 667
Paxville    | Xeon MP     | 90                | Dual-core Xeon MP          | 4Q2005                   | 2x1MB L2        | 800
Paxville    | Xeon MP     | 90                | Dual-core Xeon MP          | 4Q2005                   | 2x2MB L2        | 800
Presler     | Pentium D   | 65                | Dual-core uni-processor    | 1Q2006                   | 2MB L2 per core | >800
Dempsey     | Xeon 5000   | 65                | Dual-core Xeon             | 1H2006                   | 2MB L2 per core | 1066
Woodcrest   | Xeon 5100   | 65                | Dual-core Xeon             | 1H2006                   | 4MB L2 shared   | 1333
Conroe      | Core 2 Duo  | 65                | Dual-core uni-processor    | Mid-2006                 | 4MB L2 shared   | 1333
Conroe      | Xeon        | 65                | Dual-core uni-processor    | 3Q2006                   | 4MB L2 shared   | 1333
Tulsa       | Xeon MP     | 65                | Dual-core Xeon MP          | 4Q2006                   | 16MB L3         | 800
Clovertown  | Xeon        | 65                | Quad-core Xeon             | 4Q2006                   | 2x4MB L2        | 1333
Tigerton    | Xeon        | 65                | Quad-core Xeon             | 2H2007                   | 8MB L2          | 1066
Wolfdale    | Xeon        | 45                | Dual-core Xeon             | 1Q2008                   | 1x6MB L2        | 1600*
Harpertown  | Xeon        | 45                | Quad-core Xeon             | 4Q2007                   | 2x6MB L2        | 1333/1600*

¹ MT/s is an abbreviation for mega-transfers per second. A bus operating at 200 MHz and
transferring four data packets on each clock (referred to as quad-pumped) would have 800 MT/s.
* Selected chipsets only
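The transfer-rate arithmetic in the footnote above can be checked directly: peak bandwidth is the clock rate, times transfers per clock, times bytes per transfer. A minimal sketch (illustrative only; the function name is ours, not Intel's):

```python
def bus_bandwidth_gbs(clock_mhz: float, transfers_per_clock: int, width_bits: int) -> float:
    """Peak bus bandwidth in GB/s: clock rate x transfers per clock x bytes per transfer."""
    mts = clock_mhz * transfers_per_clock          # mega-transfers per second (MT/s)
    return mts * 1e6 * (width_bits // 8) / 1e9     # bytes/s converted to GB/s

# A 200-MHz, 64-bit, quad-pumped bus moves 800 MT/s and 6.4 GB/s peak,
# matching the NetBurst-on-90nm figures cited later in this paper.
print(bus_bandwidth_gbs(200, 4, 64))   # 6.4
print(bus_bandwidth_gbs(100, 4, 64))   # 3.2
```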
NetBurst® microarchitecture
The NetBurst-based processor for low-cost, single-processor servers is the Pentium® 4 processor. The
original 180nm version of the Pentium 4 was known as Willamette, and the subsequent 130nm
version was known as Northwood. NetBurst-based processors intended for multi-processor
environments are referred to as Intel® Xeon™ (for dual-processor systems) and Xeon MP (for systems
using more than two processors).
The NetBurst microarchitecture included the following enhancements:
• Higher bandwidth for instruction fetches
• 256-KB Level 2 (L2) cache with 64-byte cache lines
• NetBurst system bus: a 64-bit, 100-MHz bus capable of providing 3.2 GB/s of bandwidth by
double pumping the address and quad pumping the data. The 100-MHz quad pumped data bus is
also referred to as a 400-MHz data bus. To provide higher levels of performance, Intel added
support for a 533-MHz front side bus to the Pentium 4 and Xeon processors and later added
support for 800 MHz to the Pentium 4.
• Integer arithmetic logic unit (ALU) running at twice the clock speed (double data rate)
• Modified floating point unit (FPU)
• Streaming SIMD extension 2 (SSE2): New instructions bring the total to 144 SIMD instructions to
manage floating point, application, and multimedia performance.
• Advanced dynamic execution
• Deeper instruction window for out-of-order, speculative execution and improved branch prediction
over the P6 dynamic execution core
• Execution trace cache (stores pre-decoded micro-operations)
• Enhanced floating point/multimedia engine
• Hyper-threading (HT) in Xeon processors and Pentium 4 processors (described below)
Hyper-pipeline and clock frequency
One performance-enhancing feature of the NetBurst microarchitecture was its hyper-pipeline, a 20-stage branch-prediction pipeline. Previous 32-bit processors had a 10-stage pipeline. The hyper-pipeline can contain more than 100 instructions at once and can handle up to 48 loads and stores
concurrently. The pipeline in a processor is analogous to a factory assembly line where production is
split into multiple stages to keep all factory workers busy and to complete multiple stages in parallel.
Likewise, the work to execute program code is split into stages to keep the processor busy and allow
it to execute more code during each clock cycle. In this case, the processor must complete the
operation for each stage within a single clock cycle. The processor can achieve this by splitting the
task into smaller tasks and using more (shorter) stages to execute the instructions (Figure 3). Thus,
each stage can be completed quicker, allowing the processor to have a higher clock frequency.
However, it is important to understand that splitting each stage into smaller stages to achieve a higher
clock frequency does not mean that more work is being done in the pipeline per clock cycle.
Figure 3. Decreasing the amount of work done in each stage allows the clock frequency to increase
A basic structure for a computer pipeline consists of the following four steps, which are performed
repeatedly to execute a program:
1. Fetch the next instruction from the address stored in the program counter.
2. Store that instruction in the instruction register and decode it, and increment the address in the
program counter.
3. Execute the instruction currently in the instruction register.
4. Write the results of that instruction from the execution unit back into the destination register.
Typical processor architectures split the pipeline into segments that perform those basic steps: the
“front end” of the microprocessor, the execution engine, and the retire unit, as shown in Figure 4. The
front end fetches the instruction and decodes it into smaller instructions (commonly referred to as
micro-ops). These decoded instructions are sent to one of the three types of execution units (integer,
load/store, or floating point) to be executed. Finally, the instruction is retired and the result is written
back to its destination register.
Figure 4. Basic 4-stage pipeline schematic
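The four-step flow above can be sketched as a toy cycle-by-cycle trace of an ideal (stall-free) pipeline. The stage names follow the text; the instruction names are hypothetical:

```python
# Illustrative cycle-by-cycle trace of the four-stage pipeline described above
# (fetch, decode, execute, write-back). Instruction names are hypothetical.
STAGES = ["Fetch", "Decode", "Execute", "Write-back"]

def pipeline_trace(instructions):
    """Return {cycle: {stage: instruction}} for an ideal (stall-free) pipeline."""
    trace = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            trace.setdefault(i + s + 1, {})[stage] = instr
    return trace

trace = pipeline_trace(["ADD", "LOAD", "SUB"])
# With 3 instructions and 4 stages, the ideal pipeline finishes in
# 3 + 4 - 1 = 6 cycles; an unpipelined design would need 3 x 4 = 12.
print(len(trace))   # 6
print(trace[2])     # {'Decode': 'ADD', 'Fetch': 'LOAD'}
```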
Keeping the pipeline busy requires that the processor begin executing a second instruction before the
first has traveled completely through the pipeline. However, suppose a program has an instruction
that requires summing three numbers:
X=A+B+C
If the processor already has A and B stored in registers but needs to get C from memory, this causes a
“bubble,” or stall, in the pipeline in which the processor cannot execute the instruction until it obtains
the value for C from memory. This bubble must propagate all the way through the pipeline, forcing
each stage that contains the bubble to sit idle, wasting execution resources during that clock cycle.
Clearly, the longer the pipeline, the more significant this problem becomes.
Processor stalls often occur as a result of one instruction being dependent on another. If the program
has a branch, such as an IF…THEN statement, the processor has two options. The processor either waits
for the critical instruction to finish (stalling the pipeline) before deciding which program branch to
take, or it predicts which branch the program will follow.
If the processor predicts the wrong code branch, it must flush the pipeline and start over again with
the IF… THEN statement using the correct branch. The longer the pipeline, the higher the performance
cost for branch mispredicts. For example, the longer the pipeline, the more the processor must execute
speculative instructions that must be discarded when a mispredict occurs. Specific to the NetBurst
design was an improved branch-prediction algorithm aided by a large branch target array that stored
branch predictions.
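The cost of a deeper pipeline can be put in rough numbers with a simple cycles-per-instruction (CPI) model. The branch frequency and mispredict rate below are hypothetical, chosen only to illustrate the trend, not measured NetBurst figures:

```python
def effective_cpi(base_cpi, branch_freq, mispredict_rate, pipeline_depth):
    """Crude CPI model: each mispredict costs roughly a pipeline refill.
    All rates here are hypothetical, for illustration only."""
    penalty = pipeline_depth  # flush + refill: roughly one cycle per stage
    return base_cpi + branch_freq * mispredict_rate * penalty

# The same workload (20% branches, 5% of them mispredicted) on a 20-stage
# versus a 31-stage pipeline: the deeper pipe pays a visibly larger tax.
print(round(effective_cpi(1.0, 0.20, 0.05, 20), 3))  # 1.2
print(round(effective_cpi(1.0, 0.20, 0.05, 31), 3))  # 1.31
```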
Hyper-Threading Technology
Intel Hyper-Threading (HT) Technology is a design enhancement for server environments. It takes
advantage of the fact that, according to Intel estimates, the utilization rate for the execution units in a
NetBurst processor is typically only about 35 percent. To improve the utilization rate, HT Technology
adds Multi-Thread-Level Parallelism (MTLP) to the design. In essence, MTLP means that the core
receives two instruction streams from the operating system (OS) to take advantage of idle cycles on
the execution units of the processor. For one physical processor to appear as two distinct processors
to the OS, the new design replicates the pieces of the processor with which the OS interacts to create
two logical processors in one package. These replicated components include the instruction pointer,
the interrupt controller, and other general-purpose registers―all of which are collectively referred to
as the Architectural State, or AS (see Figure 5).
Figure 5. Hyper-Threading technology enables one physical processor to appear as two distinct, logical
processors to the OS so that the OS sends two instruction streams to the processor core.
(The diagram contrasts a traditional dual-processor (DP) system, in which each physical processor core carries one Architectural State (AS), with an IA-32 processor with Hyper-Threading Technology, in which two Architectural States, AS1 and AS2, share a single processor core on the system bus and appear as two logical processors.)
Since multi-processing operating systems such as Microsoft Windows and Linux are designed to
divide their workload into threads that can be independently scheduled, these operating systems can
send two distinct threads to work their way through execution in the same device. This provides the
opportunity for a higher abstraction level of parallelism at the thread level rather than simply at the
instruction level, as in the Pentium 4 design. To illustrate this concept, refer to Table 3, where it can be
seen that instruction-level parallelism can take advantage of opportunities in the instruction stream to
execute independent instructions at the same time. Thread-level parallelism, shown in Table 4, takes
this a step further since two independent instruction streams are available for simultaneous execution
opportunities.
It should be noted that the performance gain from adding HT Technology does not equal the expected
gain from adding a second physical processor. The overhead to maintain the threads and the
requirement to share processor resources will necessarily limit the HT performance. Nevertheless, HT
Technology is a valuable and cost-effective addition to the Pentium 4 design.
Table 3. Instruction-level parallelism enables simultaneous execution of independent instructions.

Instruction number | Instruction      | Instruction execution
1                  | Read register A  | Operations 1, 2, and 3 are independent and can execute simultaneously if resources permit.
2                  | Write register B |
3                  | Read register C  |
4                  | Add A + B        | This operation must wait for instructions 1 and 2 to complete, but it can execute in parallel with operation 3.
5                  | Inc A            | This operation must wait for the completion of instruction 4 before executing.
Table 4. Thread-level parallelism supports two independent instruction streams for simultaneous execution.

Thread 1            | Thread 2     | Instruction execution
1a Read register A  | 1b Add D + E | None of the instructions in Thread 2 depend on those in Thread 1; therefore, to the extent that execution units are available, any of them can execute in parallel with those in Thread 1.
2a Write register B | 2b Inc E     | As an example, instruction 2b must wait for instruction 1b, but does not need to wait for 1a.
3a Read register C  | 3b Read F    |
4a Add A + B        | 4b Add E + F | Similarly, if two arithmetic units are available, 4a and 4b can execute at the same time.
5a Inc A            | 5b Write E   |
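The dependencies in Tables 3 and 4 can be treated as a small graph. A sketch that computes the earliest issue slot for each instruction (assuming unlimited execution units, which is an idealization; the dependency lists are transcribed from the tables above):

```python
# Earliest issue slot for each instruction: an instruction can issue one
# slot after the latest of its dependencies has issued.
def issue_slots(deps):
    slots = {}
    def slot(i):
        if i not in slots:
            slots[i] = 1 + max((slot(d) for d in deps[i]), default=0)
        return slots[i]
    for i in deps:
        slot(i)
    return slots

# Table 3: instruction 4 needs 1 and 2; instruction 5 needs 4.
table3 = {"1": [], "2": [], "3": [], "4": ["1", "2"], "5": ["4"]}
print(issue_slots(table3))  # {'1': 1, '2': 1, '3': 1, '4': 2, '5': 3}

# Table 4: thread b never depends on thread a, so 2b waits only on 1b.
table4 = {"1a": [], "2a": [], "3a": [], "4a": ["1a", "2a"], "5a": ["4a"],
          "1b": [], "2b": ["1b"], "3b": [], "4b": ["2b", "3b"], "5b": ["4b"]}
slots = issue_slots(table4)
print(slots["2b"], slots["1a"])  # 2 1  (2b waits for 1b; 1a is independent)
```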
According to Intel’s internal simulations, HT Technology achieves its objective of improving the
microarchitecture utilization rate significantly. Improved performance is the real goal, though, and Intel
reports that the performance gain can be as high as 30 percent.
The performance gained by these design changes is limited by the fact that two threads now share
and compete for processor resources, such as the execution pipeline and L1 and L2 caches. There is
some risk that data needed by one thread can be replaced in a cache by data that the other is using,
resulting in a higher turnover of cache data (referred to as thrashing) and a reduced hit rate. HT
Technology also puts a heavier load on the OS to allocate threads and switch contexts on the device.
The evaluation of the threads for parallelism and context switching are OS tasks and increase the
operating overhead.
Currently, HT Technology presents little in the way of software licensing issues. Intel asserts that the HT
design is still only a single-processor unit, so customers should not have to purchase two software
licenses for each processor. This is true for Microsoft SQL Server 2000 and Windows Server 2003,
which only require one license for each physical processor, regardless of how many logical
processors it contains. However, Windows 2000 Server does not make this distinction between
physical and logical processors and fills the licensing limit based on the number of processors the
BIOS discovers at boot time.²
According to Intel, the system requirements for HT Technology are as follows:
• A processor that supports HT Technology³
• HT Technology-enabled chipset
• HT Technology-enabled system BIOS
• HT Technology-enabled/optimized operating system
For more information, refer to http://www.intel.com/products/ht/hyperthreading_more.htm.
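As a practical aside, on Linux systems one common way to check whether Hyper-Threading is active is to compare the logical count ("siblings") against the physical count ("cpu cores") reported in /proc/cpuinfo. A minimal sketch that parses a sample excerpt (the field names are real /proc/cpuinfo keys; the sample values are hypothetical):

```python
def ht_active(cpuinfo_text: str) -> bool:
    """True if logical CPUs per package ("siblings") exceed physical
    cores ("cpu cores"), i.e. Hyper-Threading is enabled."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return int(fields["siblings"]) > int(fields["cpu cores"])

# Sample /proc/cpuinfo excerpt (values hypothetical): one core, two siblings.
sample = """
model name : Intel(R) Xeon(TM) CPU 3.20GHz
siblings   : 2
cpu cores  : 1
"""
print(ht_active(sample))  # True
```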
NetBurst microarchitecture on 90nm silicon process technology
In 2004, Intel introduced major improvements to the Pentium 4 and Xeon processor lines by changing
the manufacturing process from 130nm to 90nm silicon process technology. Enhancements for
NetBurst on 90nm technology included:
• Larger, more effective caches (1-MB or 2-MB L2 Advanced Transfer Cache compared to 512 KB on
the 0.13 micron Pentium 4 processor)
• Faster processor bus: a 64-bit, 200-MHz bus capable of providing 6.4 GB/s of bandwidth by
double pumping the address and quad pumping the data. The 200-MHz Quad-pumped data bus is
also referred to as an 800-MHz data bus.
• Extended hyper-pipeline (31 stages versus 20 stages) to enable high CPU core frequencies
(described below)
• Enhanced execution units including the addition of a dedicated integer multiplier, and support for
shift and rotate instruction execution on a fast ALU
• Improved branch prediction to help compensate for longer pipeline
• Streaming SIMD Extensions 3 (SSE3) instructions (described below)
• Larger execution schedulers and execution queues
• Improved hardware memory prefetcher
• Improved Hyper-Threading
• 64-bit extensions (described below)
• Dual-core (for Smithfield, Dempsey, and Paxville)
² For more information on Hyper-Threading technology, visit
www.microsoft.com/windows2000/docs/hyperthreading.doc.
³ Hyper-Threading Technology is supported in the dual-core Intel Xeon processor 5000 series only.
Extended hyper-pipeline
In keeping with its history of regularly increasing processor frequencies, Intel has extended the hyper-pipeline from 20 stages (in the earlier Pentium 4 design) to 31. The biggest drawback to this
approach is that, as the pipe gets longer, interruptions (stalls) to the regular flow of instructions in the
pipe become progressively more costly in terms of performance. To mitigate such stalls, Intel improved
the branch-prediction algorithm sufficiently to prevent this deeper pipeline from causing performance
degradation.
SSE3 instructions
The Prescott design added Streaming Single-Instruction-Multiple-Data (SIMD) Extensions 3, or Prescott
New Instructions. As they did in earlier processors, SIMD instructions provide the potential for
improved performance because each instruction permits operation on multiple data items at the same
time. For Prescott processors, there are newer versions of arithmetic, graphics, and HT
synchronization instructions.
The arithmetic group consists of one new instruction for converting x87 data into integer format, and
five instructions that simplify the process of performing complex arithmetic. Complex numbers actually
consist of two numbers, a real and an imaginary component. The additional instructions facilitate
complex operations because they are designed to operate on both parts of these complex pairs of
numbers at the same time. Using these instructions also simplifies coding complex arithmetic
operations because fewer instructions are needed to accomplish the goal.
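The pairing the text describes can be seen in scalar form: a complex multiply touches both the real and imaginary halves of each operand, using four multiplies and two adds. A plain-Python illustration of that arithmetic (not SSE3 itself, just the operation those paired instructions accelerate):

```python
# A complex multiply operates on both parts of each (real, imaginary) pair
# at once; SSE3-style instructions perform the paired mul/add steps in
# fewer instructions than separate scalar operations would need.
def complex_mul(a: tuple, b: tuple) -> tuple:
    """(ar + ai*i) * (br + bi*i) using the four-multiply formula."""
    ar, ai = a
    br, bi = b
    return (ar * br - ai * bi, ar * bi + ai * br)

print(complex_mul((1.0, 2.0), (3.0, 4.0)))  # (-5.0, 10.0)
# Cross-check against Python's built-in complex type:
print(complex(1, 2) * complex(3, 4))        # (-5+10j)
```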
The graphics group contains one instruction for video encoding and four that are specific to graphics
operations. Finally, two instructions facilitate HT operation, for example, by allowing one operational
thread to be moved to a higher priority than another.
64-bit extensions — Intel 64
In response to market demands, Intel added 64-bit extensions to the x86 architecture of the Xeon,
Xeon MP, and Pentium 4 processors. The key advantage of 64-bit processing is that the system can
address a much larger flat memory space (up to 16 exabytes). Even though the 32-bit architecture
can actually access up to 64 GB of memory, access above the standard 4 GB limit must go through a
slow and cumbersome windowing facility. Due to the complexities of this process, most 32-bit
applications have not made use of the higher address space. Today, few applications require more
than 1 or 2 GB of memory; however, this will eventually change. By adding 64-bit extensions to its
x86 processors, Intel has provided users with the same 64-bit addressing benefit at a much lower cost
than if users were forced to replace both the hardware and software.
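The address-space figures above follow directly from the address width, and a quick check confirms them:

```python
# Address-space arithmetic behind the paragraph above: the standard 32-bit
# flat space, the 36-bit windowed space behind the "64 GB" figure, and a
# full 64-bit flat space.
def addressable(bits: int) -> int:
    """Bytes addressable with an n-bit address."""
    return 2 ** bits

GB = 2 ** 30
EB = 2 ** 60
print(addressable(32) // GB)   # 4   (standard 32-bit limit, in GB)
print(addressable(36) // GB)   # 64  (36-bit windowed access, in GB)
print(addressable(64) // EB)   # 16  (64-bit flat space, in EB)
```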
AMD was first to release 64-bit extensions―called AMD64―with its Opteron processor in early
2003. Within the year, Intel responded with its own plans to deliver a similar solution called
Extended Memory 64 Technology, or EM64T, which is broadly compatible with AMD64. In late
2006, Intel began using the name Intel 64 for its implementation. Intel 64 and AMD64 use the same
register sets and definitions, and the 64-bit instructions are nearly identical. HP expects that any minor
differences will be handled by the OS and compiler, so that the average application writer or
customer should see no differences. New operating systems are required to make use of 64-bit
extensions. Red Hat, SuSE, and Microsoft provide AMD64 support and Intel 64 support.
Even though the larger memory addressing capability is the primary advantage of 64-bit extensions, it
is not the only one. 64-bit extensions also provide a larger register set with eight additional general
purpose registers (GPR) and 64-bit versions of the existing registers. With a total of 16 GPRs, 64-bit
extensions provide additional resources that compilers can use to increase performance. The 16-register limit was a tradeoff AMD chose as a good compromise between performance and cost.
Dual-core technology
Single-core processors that run multi-threaded applications become less cost effective with each
increase in frequency. This is because the multiple threads compete for available compute resources,
which limits the increase in performance at higher frequencies. Increasing the CPU core frequency not
only delivers lower incremental performance gains, but also increases power requirements and heat
generation. These factors create significant barriers for single-core architectures to keep pace with the
growing needs of data centers.
To address the performance, power, and heat issues, Intel announced its first dual-core processor
architecture in 2005. A dual-core processor is a single physical package that contains two full
processor cores per socket. Each core has its own functional execution units and cache
resources, and the OS recognizes each execution core as an independent processor.
Figure 6 illustrates the difference between single-core and dual-core processors with HT Technology.
In the case of the single-core processor, HT Technology allows the OS to schedule two threads on the
core by treating it as two separate "logical" processors with a shared 2-MB L2 cache. The dual-core
processor builds on HT Technology with two execution cores. Each core has its own 2-MB L2 cache
and separate interface to an 800-MHz front side bus. The dual-core architecture runs two threads on
each execution core, allowing the processor to run up to four threads simultaneously. The additional
capacity of the second core reduces competition for processor resources and increases processor
utilization. Thus, the performance improvement of a dual-core processor is in addition to the
improvement due to HT Technology.
A dual-core processor has better performance-per-watt than a single-core processor running at a
higher frequency. This is analogous to the way a wide pipe, by virtue of its volume, can carry more
water than a smaller pipe with a higher flow rate. Likewise, the dual-core architecture is designed to
make processors perform more efficiently at lower frequencies (and power). The dual-core processor
allows a better balance between performance and power requirements, and it is the first step in multi-core processor technology.
Figure 6. Implementation of Hyper-Threading Technology on single processor core (left) supports two threads
through a shared L2 cache. Implementation of Hyper-Threading Technology on dual-core processors (right)
supports four threads running simultaneously.
Intel Core™ microarchitecture
In 2006, Intel introduced the Core microarchitecture, which extends features of the NetBurst
microarchitecture and adds the energy-efficient features of Intel’s mobile microarchitecture. The Core
microarchitecture uses less power and produces less heat than previous generation Intel processors.
The Core microarchitecture features the following new technologies that improve per-watt
performance and energy efficiency⁴:
• Intel® Wide Dynamic Execution enables delivery of more instructions per clock cycle to improve
execution time and energy efficiency.
• Intel® Intelligent Power Capability reduces power consumption and design requirements.
• Intel® Smart Memory Access improves system performance by optimizing the use of the available
data bandwidth from the memory subsystem.
• Intel® Advanced Smart Cache is optimized for multi-core and dual-core processors to reduce
latency to frequently used data, providing a higher-performance, more efficient cache subsystem.
• Intel® Advanced Digital Media Boost improves performance when executing SSE, SSE2, and SSE3
instructions, accelerating a broad range of applications, including encryption, financial,
engineering, and scientific applications.
• Streaming SIMD Extensions 4 (SSE4) instructions, added with the 45nm (Penryn) generation
Processors
The dual-core Intel Xeon 3000 and 5000 Sequence and the 7300 series processors are based on the
Core microarchitecture. Using Hyper-Threading technology, dual-core processors (with the exception
of the Xeon 3000 Sequence processors) can simultaneously execute four software threads, thereby
increasing processor utilization. To avoid saturation of the Front Side Bus (FSB), the Intel 5000 chipset
widens the interface by providing dual independent buses. The Xeon 7300 series processors
introduce an independent point-to-point interface between the chipset and each processor that allows
full front-side-bus bandwidth.
Xeon dual-core processors
The 64-bit Intel Xeon 3000 Sequence processors combine performance and power efficiency to
enable smaller, quieter systems. Xeon 3000 Sequence processors run at a maximum frequency of
2.66 gigahertz (GHz), with 4 megabytes (MB) of shared L2 cache (Figure 7 left) and a maximum
front-side bus speed of 1333 megahertz. These processors are compatible with IA-32 software and
support single-processor operation. Xeon 3000 Sequence processors use the Intel 3000 or 3010
chipsets, which support Error Correction Code (ECC) memory for a high level of data integrity,
reliability, and system uptime. ECC can detect multiple-bit memory errors and locate and correct
single-bit errors to keep business applications running smoothly.
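The single-bit correction the paragraph describes can be illustrated with a toy Hamming(7,4) code. Server ECC DIMMs use wider SEC-DED codes (typically 72 bits protecting 64), so this is a sketch of the principle, not HP's or Intel's implementation:

```python
# Toy Hamming(7,4) code: three parity bits whose syndrome names the
# position of a flipped bit, allowing single-bit correction.
def encode(d):                     # d: 4 data bits [d1, d2, d3, d4]
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4              # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4              # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4              # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]   # codeword positions 1..7

def correct(c):                    # c: 7-bit codeword, at most 1 bit flipped
    s = 0
    for pos in range(1, 8):
        if c[pos - 1]:
            s ^= pos               # syndrome = XOR of set-bit positions
    if s:                          # nonzero syndrome names the bad bit
        c[s - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]       # recovered data bits

word = encode([1, 0, 1, 1])
word[4] ^= 1                       # flip one bit "in memory"
print(correct(word))               # [1, 0, 1, 1]
```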
The 64-bit Intel Xeon 5000 Sequence processors have two complete processor cores, including
caches, buses, and execution states. The Xeon 5000 Sequence processors run at a maximum
frequency of 3.73 GHz, with 2 MB of L2 cache per core. The processor supports maximum front-side
bus speeds of 1066 megahertz (Figure 7 center).
⁴ For more information, read the white paper “Introducing the 45nm next-generation Intel® Core™
microarchitecture at Intel® 45nm Hi-k silicon technology.”
The 64-bit Xeon 5100 series dual-core processor runs at a maximum frequency of 3.0 GHz with 4
MB of shared L2 cache and a maximum front-side bus speed of 1333 megahertz (Figure 7 right).
The Xeon 5000 Sequence and 5100 series processors use the Intel 5000 series chipsets. These
chipsets contain two main components: the Memory Controller Hub (MCH) and the I/O controller
hub. The new Northbridge MCH supports DDR2 Fully-Buffered DIMMs (dual in-line memory modules).
Figure 7. Diagram representing the major components of dual-core Intel Xeon 3000, 5000, and 5100
Sequence processors
Xeon quad-core processors
The quad-core Intel Xeon 5300 series processor (Clovertown) is the first quad-core processor for dual-socket platforms (Figure 8). The Xeon 5300 series processor contains two dual-core dies. Each pair of cores
shares an L2 cache; up to 4 MB of L2 cache can be allocated to one core. The processor runs at a
maximum frequency of 3.0 GHz, with 2 MB of L2 cache per core. This configuration delivers a
significant increase in processing capacity utilizing the Intel 5000 series chipsets. ProLiant 300 series
servers use the Intel 5000P and 5000Z chipsets. These chipsets support 1066-MHz and 1333-MHz
Dual Independent Buses, DDR2 FB-DIMMs, and PCI Express I/O slots.
The quad-core Xeon 5400 series processor (Harpertown) contains two dual-core dies, with each pair of cores sharing
a 6-MB L2 cache. The Xeon 5400 series processor runs at a maximum frequency of 3.0 GHz with a
1333-MHz or 1600-MHz FSB.
Figure 8. Quad-core Intel Xeon 5300 sequence processor
The quad-core Intel Xeon 7300 series processor (Tigerton) consists of two dual-core silicon chips on a
single ceramic module, similar to the Xeon 5300 series processors. Each pair of cores shares a L2
cache; up to 4 MB of L2 cache can be allocated to one core. Intel states the Xeon 7300 series
processors offer more than twice the performance and more than three times the performance-per-watt
of the previous generation 7100 series, which is based on the NetBurst microarchitecture. The Xeon
7300 series processors are empowered by the Intel® 7300 Chipset, which features Dedicated High-Speed Interconnects (DHSI). DHSI is an independent point-to-point interface between the chipset and
each processor that allows full front-side-bus bandwidth to each processor (Figure 9). The point-to-point interface significantly reduces data traffic on the DHSI, providing lower latencies and greater
available bandwidth. The chipset also features a 64 MB snoop filter that manages data coherency
across processors, eliminating unnecessary snoops and boosting available bandwidth.
Figure 9. The Xeon 7300 series processors and Intel 7300 Chipset enable 4-socket server architectures with up to
16 processor cores. It provides fast memory access through Dedicated High-Speed Interconnects.
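A snoop filter works by tracking which processors may hold a copy of each cache line, so coherency traffic is sent only where a copy could actually exist. The sketch below models that filtering idea as a simple directory; the actual snoop filter in the Intel 7300 Chipset is a set-associative tag array, not a Python dictionary, so treat this only as a conceptual illustration.

```python
# Toy snoop filter: a directory mapping cache-line address -> the set of
# buses whose caches may hold the line. An access only snoops buses that
# already share the line; untouched lines generate no snoops at all.

class SnoopFilter:
    def __init__(self):
        self.sharers = {}                     # line address -> set of bus ids

    def access(self, bus, line):
        """Record a cache fill; return the set of buses that must be snooped."""
        targets = self.sharers.get(line, set()) - {bus}
        self.sharers.setdefault(line, set()).add(bus)
        return targets

sf = SnoopFilter()
print(sf.access(0, 0x1000))   # first touch: no snoops needed -> set()
print(sf.access(1, 0x1000))   # line shared by bus 0 -> must snoop {0}
print(sf.access(1, 0x2000))   # different line, first touch -> set()
```

Filtering the first and third accesses is exactly the saved traffic the chipset turns into extra usable bandwidth.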
Enhanced SpeedStep® Technology
Quad-core Intel Xeon 5300 and 7300 series processors support Enhanced Intel SpeedStep
Technology. These processors have power state hardware registers that are available (exposed) to
allow IT organizations to control the performance and power consumption of the processor. These
capabilities are implemented through Intel’s Enhanced Intel SpeedStep® Technology and demand-based switching. With the appropriate ROM firmware or operating system interface, programmers
can use the exposed hardware registers to switch a processor between different performance states,
also called P-states,5 which have different power consumption levels. For example, HP developed a
power management feature called HP Power Regulator that utilizes P-state registers to control
processor power use and performance. These capabilities have become increasingly important for
power and heat management in high-density data centers. When combined with data-center
management tools like Insight Power Manager, IT organizations have more control over the power
consumption of all the servers in the data center.
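Because each P-state pairs a lower frequency with a lower voltage, and dynamic CPU power scales roughly with frequency times voltage squared, stepping down a P-state cuts power considerably faster than it cuts clock speed. The sketch below illustrates that scaling; the P-state table is purely illustrative and is not taken from any processor datasheet.

```python
# Dynamic power scales roughly as f * V^2, so a modest frequency drop
# combined with a voltage drop yields an outsized power saving.
# Illustrative (hypothetical) P-state table: state -> (GHz, core volts).

P_STATES = {
    "P0": (3.00, 1.30),
    "P1": (2.33, 1.15),
    "P2": (2.00, 1.05),
}

def relative_power(state: str, base: str = "P0") -> float:
    """Dynamic power of `state` as a fraction of the base P-state's power."""
    f, v = P_STATES[state]
    f0, v0 = P_STATES[base]
    return (f * v * v) / (f0 * v0 * v0)

for s in P_STATES:
    print(f"{s}: {relative_power(s):.0%} of P0 dynamic power")
```

This is why a tool like HP Power Regulator can hold a server in a low P-state during idle periods at a small performance cost but a large power saving.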
5. The ACPI body defines P-states as processor performance states. For Intel and AMD processors, a P-state is defined by a fixed operating frequency and voltage.
Intel® Virtualization Technology
Virtualization techniques that are completely enabled in software perform many complex translations
between the guest operating systems and the hardware. With software virtualization, the processor
overhead increases (performance decreases) as each guest OS and application vies for the host
machine’s physical resources, such as memory space and I/O devices. Also, memory latency
increases as the virtual machine monitor, or hypervisor, dynamically translates the memory addresses
sent to and received from the memory controller. The hypervisor does this so that each guest
application does not realize that it is being virtualized.
Quad-core Intel Xeon 5300 and 7300 series processors support Intel Virtualization Technology (VT-x),
a processor hardware enhancement designed to reduce this software overhead. Intel VT-x is a group
of extensions to the x86 instruction set that affect the processor, memory, and local I/O address
translations. The new instructions enable guest operating systems to run in the standard Ring-0
architectural layer.6
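On Linux, processor support for these VT-x extensions is reported as the “vmx” flag in /proc/cpuinfo (AMD-V appears as “svm”). The sketch below parses an embedded sample of that file so it is self-contained; on a live system you would read /proc/cpuinfo instead, and the model string shown is only an example.

```python
# Check a /proc/cpuinfo-style listing for the Intel VT-x ("vmx") flag.
# SAMPLE_CPUINFO is an illustrative excerpt, not captured from real hardware.

SAMPLE_CPUINFO = """\
model name : Intel(R) Xeon(R) CPU X7350 @ 2.93GHz
flags      : fpu vme de pse tsc msr pae mce lm vmx est tm2 ssse3 cx16
"""

def has_vt_x(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line lists the vmx feature bit."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return "vmx" in line.split(":", 1)[1].split()
    return False

print(has_vt_x(SAMPLE_CPUINFO))   # True
```

A hypervisor performs the same kind of check (via the CPUID instruction) before deciding whether it can use hardware-assisted virtualization or must fall back to software techniques.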
The Xeon 7300 series processors also include virtualization of the APIC Task Priority Register, a new Intel® VT
extension that improves interrupt handling to further optimize virtualization software efficiency.
Performance comparisons
TPC-C performance
The Transaction Processing Performance Council benchmark TPC-C results for Woodcrest,
Clovertown, Tulsa, and Tigerton processors are compared in Figure 10. TPC-C is measured in
transactions per minute (tpmC). The TPC-C results confirm the superior performance of multi-processor
dual-core and quad-core processors.
Figure 10. TPC-C performance comparisons for dual-processor (DP) and multi-processor (MP) Intel CPUs
6. For more information, refer to the technology brief “Server virtualization technologies for x86-based HP BladeSystem and HP ProLiant servers” at http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01067846/c01067846.pdf
SPEC performance
The Standard Performance Evaluation Corporation (SPEC) CPU2006 benchmark provides
performance measurements that can be used to compare compute-intensive workloads on different
computer systems. SPEC results for Woodcrest, Clovertown, Tulsa, and Tigerton processors are
compared in Figure 11. SPEC CPU2006 contains two benchmark suites: CINT2006 for measuring
and comparing compute-intensive integer performance, and CFP2006 for measuring and comparing
compute-intensive floating-point performance. The results show that the quad-core
processors, Clovertown and Tigerton, performed better in the SPEC tests.
Figure 11. SPEC CPU2006 performance comparisons for Intel processors show that the quad-core Clovertown
and Tigerton processors performed better than the dual-core Woodcrest and Tulsa processors.
Intel Nehalem microarchitecture
Beginning in 2008, new Intel processors will incorporate Intel’s next-generation microarchitecture
codenamed Nehalem. Nehalem will provide on-demand performance and feature a design-scalable
architecture for optimal price-performance and energy efficiency. The Nehalem microarchitecture will
offer scalable performance for one to sixteen (or more) threads and from one to eight (or more) cores,
scalable and configurable system interconnects, and integrated memory controllers. The introduction
of Nehalem will be followed by Intel’s 32nm silicon process technology.7
Conclusion
Intel processors continue to provide dramatic increases in the processing capability of HP industry-standard servers. In addition to improved system performance, multi-core Intel processors offer greater
energy efficiency to help HP customers manage power costs. HP ProLiant servers continue to offer
both AMD Opteron™ and Intel® Xeon™ processor architectures to deliver the best possible choice to
customers.
7. For more information, refer to http://www.intel.com/technology/architecture-silicon/32nm/index.htm?iid=tech_arch_45nm+rhc_32nm.
For more information
For additional information, refer to the resources listed below.
ProLiant servers home page: www.hp.com/servers/proliant
Power Regulator for ProLiant Servers: http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00593374/c00593374.pdf
ISS Technology Papers: www.hp.com/servers/technology
Call to action
Send comments about this paper to TechCom@HP.com.
© 2002, 2005, 2006, 2007 Hewlett-Packard Development Company, L.P. The
information contained herein is subject to change without notice. The only
warranties for HP products and services are set forth in the express warranty
statements accompanying such products and services. Nothing herein should be
construed as constituting an additional warranty. HP shall not be liable for technical
or editorial errors or omissions contained herein.
Intel, Intel Xeon, Pentium and Itanium are trademarks or registered trademarks of
Intel Corporation or its subsidiaries in the United States and other countries.
AMD and AMD Opteron are trademarks of Advanced Micro Devices, Inc.
Linux is a U.S. registered trademark of Linus Torvalds.
Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation.
TC071201TB, December 2007