Intel Core i7

Intel Core i7 (Nehalem) Performance Preview by Brandon Sandman Bell,
November 02, 2008
Extract from: http://www.firingsquad.com/hardware/intel_core_i7_nehalem_performance_preview/default.asp
It’s been a little over two years since Intel introduced the world to their first Core 2 processors
utilizing their next-generation Conroe microarchitecture. Loosely based on their Pentium M
“Yonah” CPU core, Conroe restored Intel’s leadership position in CPUs. The chip boasted a wider
execution core, allowing the processor to complete up to four full instructions simultaneously, along
with a more efficient 14-stage pipeline improving IPC (instructions per clock) in comparison to
Pentium 4/D.
If you recall, this was one of the chief weaknesses in Core 2’s predecessor, Pentium 4/D. Pentium 4
processors sacrificed the amount of work performed per clock in exchange for more pipeline stages,
31 in the case of later Pentium D processors. Essentially Intel made a conscious decision to
sacrifice IPC in exchange for higher clock speeds. Ultimately this decision came back to haunt them
when Pentium 4/D had trouble scaling to higher clock speeds of 4GHz and beyond.
Core 2 never hit the clock speeds of Pentium 4, but because of its improved IPC, it didn’t have to in
order to achieve breakthrough performance.
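To put the IPC-versus-clock trade-off in concrete terms, here is a minimal Python sketch that treats throughput as IPC multiplied by clock speed. The IPC and clock figures are hypothetical, picked only to illustrate the idea; they are not measurements of any Pentium 4 or Core 2 part.

```python
def instructions_per_ns(ipc: float, clock_ghz: float) -> float:
    """Throughput as instructions per nanosecond: IPC x clock (in GHz)."""
    return ipc * clock_ghz


# Hypothetical figures, for illustration only (not measured values).
deep_pipeline = instructions_per_ns(ipc=1.0, clock_ghz=3.6)  # high clock, low IPC
wide_core = instructions_per_ns(ipc=2.0, clock_ghz=2.4)      # lower clock, high IPC

ratio = wide_core / deep_pipeline
print(f"Wide core is {ratio:.2f}x the deep-pipeline design in this toy example")
```

In this toy model, the chip with the lower clock still comes out ahead because its per-clock work more than makes up for the frequency gap.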
But Intel didn’t stop there. To further enhance performance, Core 2 also featured more accurate
branch prediction, improved SSE/SSE2/SSE3 performance, and a unified L2 cache with more advanced
prefetchers residing in the L1 and L2 caches to reduce memory access.
Ultimately Core 2 was over two times faster than Intel’s previous Pentium processor, and it also
significantly outperformed AMD’s fastest Athlon X2 and FX processors, all while generating very little
power and with tons of frequency headroom for overclockers (in this context, "generating very little power" means the chip drew very little power). It wasn’t uncommon for Core 2 Duo
E6300 and E6400 chips to push 3GHz.
Late last year Intel gave Core 2 a midlife upgrade with their Penryn architecture. Besides its smaller
45-nm manufacturing process, Penryn also featured double the divider speed over Conroe when
handling math computations and a new super shuffle engine. This is a 128-bit wide, single-pass
shuffle unit that improved Penryn’s performance with SSE2, SSE3, and SSE4 instructions that have
shuffle-like operations.
Penryn was also the first Intel processor to support SSE4.
The final ingredients Intel added to Penryn to improve performance were faster bus speeds and a
larger L2 cache. Quad-core chips shipped with up to 12MB of L2 cache while dual-core parts
featured 6MB of L2.
As a result of all these improvements, Penryn generally performed around 10-15% faster than
Conroe/Kentsfield clock-for-clock. In apps that took advantage of SSE4, this advantage was even
greater. In comparison, AMD’s fastest Phenom CPU, the Phenom 9950, is just now approaching the
performance of Intel’s older quad-core Kentsfield CPUs like the Core 2 Quad Q6600 and Q6700.
And now, just as AMD’s approaching the eve of the arrival of their first 45-nm CPUs, Intel’s back
again with the “tock” of their tick-tock model that follows every process shrink (in this case Penryn)
with a next-generation microarchitecture (Nehalem) each year.
As you probably know by now, Intel’s next-generation microarchitecture (previously codenamed
Nehalem) was officially given a brand name by Intel in August of this year: Core i7. Over the course
of the past 18 months, Intel has slowly divulged most of the tech goodies that make up Core i7
including its integrated memory controller, Intel’s Quick Path Interconnect (Intel’s equivalent of AMD
HyperTransport that previously went under the codename CSI), its new L3 cache, the return of
Hyper-Threading, and Nehalem’s Turbo Mode, but we’re going to briefly go over these changes
before we take a look at the new Core i7 platform and the processors behind it.
Nehalem Architecture
Fundamentally Nehalem is designed to be scalable. In Core i7 form, the chip has four processing
cores, a triple-channel memory controller, bi-directional Quick Path Interconnect delivering up to
25.6GB/sec of bandwidth (12.8GB/sec in each direction), and 8MB of L3 cache. Server variants of
Nehalem could have more cores, larger L3 cache, and more QPI links (desktop chips feature one
link), while mobile variants could have fewer cores with less cache and a dual-channel (rather than
triple-channel) memory controller. Intel has indicated that they will even add graphics to the equation
at some point, taking yet another feature off the system chipset and onto the CPU itself.
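The QPI bandwidth figures quoted above can be reproduced with a little arithmetic. The sketch below assumes a link running at 6.4 gigatransfers per second and carrying 2 bytes of data per transfer in each direction; neither number appears in this article, so treat both as assumptions.

```python
def qpi_bandwidth_gb_s(gigatransfers_per_sec: float, bytes_per_transfer: int = 2):
    """Return (per-direction, bi-directional) bandwidth in GB/sec."""
    per_direction = gigatransfers_per_sec * bytes_per_transfer
    return per_direction, 2 * per_direction


# Assumed link speed: 6.4 GT/sec (not stated in the article).
per_direction, total = qpi_bandwidth_gb_s(6.4)
print(f"{per_direction:.1f} GB/sec each way, {total:.1f} GB/sec total")
```

Under those assumptions the result matches the 12.8GB/sec per direction and 25.6GB/sec total quoted for Core i7.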
This modular design helps to reduce power consumption. Features like the memory controller and
QPI all run at voltages independent of each other.
Intel has incorporated a number of improvements into Nehalem that are designed to improve IPC.
For instance, the number of micro-ops (microinstructions) in flight has increased from 96 in
Conroe/Penryn to 128 in Nehalem. Intel also increased the size of the load and store buffers to
ensure that they wouldn’t become a limiting factor.
Intel also improved Nehalem’s branch prediction. A new second-level branch target buffer has been
added to improve branch prediction in applications that have large footprints such as databases.
This second predictor has a much larger history table which should allow it to predict branches more
accurately than the first level predictor. Intel has also added a new renamed return stack buffer
(RSB). RSBs store forward and return pointers associated with call and return instructions. The RSB
should help Nehalem avoid return instruction mispredictions.
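As a rough illustration of how a return stack buffer predicts return addresses, here is a small Python model. It covers only the basic push-on-call, pop-on-return behavior; the renaming Intel added in Nehalem is not modeled, and the depth of 16 entries is a hypothetical value rather than a published figure.

```python
class ReturnStackBuffer:
    """Toy model of an RSB: a small stack of predicted return addresses."""

    def __init__(self, depth: int = 16):  # depth is a hypothetical value
        self.depth = depth
        self.entries = []

    def on_call(self, return_address: int) -> None:
        # When the buffer is full, the oldest entry is lost.
        if len(self.entries) == self.depth:
            self.entries.pop(0)
        self.entries.append(return_address)

    def predict_return(self):
        # The most recent call is the one expected to return first.
        return self.entries.pop() if self.entries else None


rsb = ReturnStackBuffer()
rsb.on_call(0x1004)  # outer function is called
rsb.on_call(0x2008)  # inner function is called
inner = rsb.predict_return()
outer = rsb.predict_return()
print(hex(inner), hex(outer))
```

Because calls and returns nest, the predictions come back in the reverse order of the calls, which is why a simple stack works well here.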
With its faster synchronization primitives, Nehalem has also been tweaked to handle threaded
software better.
Speaking of threading, with Nehalem we see the resurgence of simultaneous multi-threading
(Hyper-Threading). With Hyper-Threading, one processing core can run two threads at the same time. With
four processing cores inside Core i7, the OS “sees” eight logical cores and can schedule eight threads on
the CPU at once, effectively doubling the number of threads that Nehalem can run simultaneously compared
with a conventional quad-core CPU.
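The core-count arithmetic is simple enough to sketch in Python. The helper name `logical_processors` is made up for this example; `os.cpu_count()` is the standard library call that reports how many logical processors the OS sees on whatever machine runs the script.

```python
import os


def logical_processors(physical_cores: int, threads_per_core: int) -> int:
    """Number of logical processors the OS sees."""
    return physical_cores * threads_per_core


# Core i7 as described here: four cores, two threads per core.
core_i7_logical = logical_processors(physical_cores=4, threads_per_core=2)
print("Core i7 logical processors:", core_i7_logical)

# Logical processor count on the machine running this script (may be None).
print("This machine:", os.cpu_count())
```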
Whereas Hyper-Threading (HT) never really took off on the Pentium 4, Intel feels that Nehalem has
a distinctive HT advantage thanks to its larger cache and greater memory bandwidth, both of which
should allow it to deliver better HT performance. Additionally, there are also more apps capable of
taking advantage of HT than there were a few years ago. As you’ll see in our Lost Planet,
Cinebench, and Valve benchmarks, Nehalem delivers a significant performance increase in HT-aware apps.
New cache subsystem
While Nehalem has the same 32KB instruction/32KB data L1 cache configuration as previous Core
2 CPUs, Intel has totally revamped the L2 cache and added a new L3 cache.
Nehalem’s L2 cache is much smaller than Penryn’s. Each core has its own 256KB L2 cache for
handling data and instructions. While this is significantly less than previous processors, Nehalem’s L2
has lower latency than its predecessors.
In addition to the L1 and L2 caches, Nehalem, like AMD’s Phenom, also features an L3 cache that is
shared across all the cores. Unlike Phenom’s exclusive L3, however, Nehalem’s L3 is inclusive.
Intel feels that this inclusive architecture gives them an advantage over AMD, as an
exclusive architecture doesn’t store data from the lower level L1 and L2 caches. As a result, if a data
request misses on an exclusive L3 cache, each processor core must be snooped (searched) in case its L1 or
L2 cache has the requested data. This increases latency and snoop traffic between the cores.
With Nehalem these snoops are unnecessary, as an L3 miss already tells the CPU that the data doesn’t
reside in any L1 or L2. This helps to reduce latency and thus improve performance, while also reducing
power consumption.
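The difference in snoop traffic can be sketched with a toy Python model. This is a simplified illustration of the argument above, not a description of how either vendor's hardware is implemented; the function name, the cache contents, and the addresses are all made up for the example.

```python
def lookup(address, l3_lines, private_lines_per_core, inclusive):
    """Return (where the line was found, number of cores snooped)."""
    if address in l3_lines:
        return "L3", 0
    if inclusive:
        # An inclusive L3 holds a copy of every line in the private caches,
        # so a miss here rules out every core without snooping any of them.
        return "memory", 0
    snoops = 0
    for lines in private_lines_per_core:
        snoops += 1  # each core's L1/L2 has to be searched in turn
        if address in lines:
            return "core cache", snoops
    return "memory", snoops


# Hypothetical contents of four cores' private caches.
private = [{0x10}, {0x20}, {0x30}, {0x40}]
inclusive_l3 = {0x10, 0x20, 0x30, 0x40, 0x50}  # superset of the private caches
exclusive_l3 = {0x50}                          # holds none of the private lines

print(lookup(0x99, inclusive_l3, private, inclusive=True))
print(lookup(0x99, exclusive_l3, private, inclusive=False))
```

In this model, a miss on the inclusive L3 goes straight to memory with no snoops, while the same miss on the exclusive L3 searches all four cores first.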
Like its two-level branch prediction, Nehalem features a two-level, 512-entry translation lookaside
buffer (TLB). Nehalem is the first CPU to feature a second TLB. This is another improvement Intel
has incorporated into Nehalem to improve its performance with server apps like large databases.
SSE4
Nehalem is Intel’s first CPU to offer SSE4.2 support. Seven new application-targeted accelerators have
been added to the instruction set. Four of them improve performance in string and text processing
operations; one example Intel provides is the parsing of XML files at a much higher speed. Two
other instructions are focused on accelerated searching and pattern recognition in large data
sets (useful for voice/handwriting recognition), and the seventh is a CRC instruction focused on new
communications capabilities such as accelerated network-attached storage.
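To show what the CRC instruction speeds up, here is a plain software version in Python. It assumes the instruction computes CRC-32C (the Castagnoli polynomial), which the article does not spell out; the initial and final bit inversions are the usual software convention around the checksum rather than part of the instruction itself.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected form


def crc32c(data: bytes) -> int:
    """Bit-by-bit CRC-32C; hardware does the same work a chunk at a time."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial if that bit was set.
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


checksum = crc32c(b"123456789")
print(hex(checksum))
```

The inner loop runs eight times per byte here, which is the kind of per-byte overhead a dedicated instruction removes.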