Intel Core i7 (Nehalem) Performance Preview
by Brandon Sandman Bell, November 02, 2008
Extract from: http://www.firingsquad.com/hardware/intel_core_i7_nehalem_performance_preview/default.asp

It’s been a little over two years since Intel introduced the world to their first Core 2 processors utilizing their next-generation Conroe microarchitecture. Based in part on their Pentium M “Yonah” CPU core, Conroe restored Intel’s leadership position in CPUs. The chip boasted a wider execution core, allowing the processor to complete up to four full instructions simultaneously, along with a more efficient 14-stage pipeline that improved IPC (instructions per clock) in comparison to Pentium 4/D.

If you recall, low IPC was one of the chief weaknesses of Core 2’s predecessor, Pentium 4/D. Pentium 4 processors sacrificed the amount of work performed per clock in exchange for more pipeline stages, 31 in the case of later Pentium D processors. Essentially Intel made a conscious decision to sacrifice IPC in exchange for higher clock speeds. Ultimately this decision came back to haunt them when Pentium 4/D had trouble scaling to higher clock speeds of 4GHz and beyond. Core 2 never hit the clock speeds of Pentium 4, but because of its improved IPC, it didn’t have to in order to achieve breakthrough performance.

But Intel didn’t stop there. To further enhance performance, Core 2 also featured more accurate branch prediction, improved SSE/SSE2/SSE3 performance, and a unified L2 cache with more advanced prefetchers residing in the L1 and L2 caches to reduce memory access. Ultimately Core 2 was over two times faster than Intel’s previous Pentium processor, and it also significantly outperformed AMD’s fastest Athlon X2 and FX processors, all while consuming very little power and leaving tons of frequency headroom for overclockers. It wasn’t uncommon for Core 2 Duo E6300 and E6400 chips to push 3GHz.

Late last year Intel gave Core 2 a midlife upgrade with their Penryn architecture.
Besides its smaller 45-nm manufacturing process, Penryn also featured a divider twice as fast as Conroe’s for math computations and a new super shuffle engine. This is a 128-bit wide, single-pass shuffle unit that improved Penryn’s performance with SSE2, SSE3, and SSE4 instructions that have shuffle-like operations. Penryn was also the first Intel processor to support SSE4.

The final ingredients Intel added to Penryn to improve performance were faster bus speeds and a larger L2 cache. Quad-core chips shipped with up to 12MB of L2 cache while dual-core parts featured 6MB of L2. As a result of all these improvements, Penryn generally performed around 10-15% faster than Conroe/Kentsfield clock-for-clock. In apps that took advantage of SSE4, this advantage was even greater.

In comparison, AMD’s fastest Phenom CPU, the Phenom 9950, is just now approaching the performance of Intel’s older quad-core Kentsfield CPUs like the Core 2 Quad Q6600 and Q6700. And now, just as AMD approaches the arrival of their first 45-nm CPUs, Intel’s back again with the “tock” of their tick-tock model, which follows every process shrink (in this case Penryn) with a next-generation microarchitecture (Nehalem).

As you probably know by now, Intel’s next-generation microarchitecture (previously codenamed Nehalem) was officially given a brand name by Intel in August of this year: Core i7. Over the course of the past 18 months, Intel has slowly divulged most of the tech goodies that make up Core i7, including its integrated memory controller, Intel’s Quick Path Interconnect (Intel’s equivalent of AMD HyperTransport that previously went under the codename CSI), its new L3 cache, the return of Hyper-Threading, and Nehalem’s Turbo Mode. We’re going to briefly go over these changes before we take a look at the new Core i7 platform and the processors behind it.

Nehalem Architecture

Fundamentally, Nehalem is designed to be scalable.
In Core i7 form, the chip has four processing cores, a triple-channel memory controller, a bi-directional Quick Path Interconnect delivering up to 25.6GB/sec of bandwidth (12.8GB/sec in each direction), and 8MB of L3 cache. Server variants of Nehalem could have more cores, larger L3 cache, and more QPI links (desktop chips feature one link), while mobile variants could have fewer cores with less cache and a dual-channel (rather than triple-channel) memory controller. Intel has indicated that they will even add graphics to the equation at some point, moving yet another feature off the system chipset and onto the CPU itself. This modular design helps to reduce power consumption. Features like the memory controller and QPI all run at voltages independent of each other.

Intel has incorporated a number of improvements into Nehalem that are designed to improve IPC. For instance, the number of micro-ops (microinstructions) in flight has increased from 96 in Conroe/Penryn to 128 in Nehalem. Intel also increased the size of the load and store buffers to ensure that they wouldn’t become a limiting factor.

Intel also improved Nehalem’s branch prediction. A new second-level branch target buffer has been added to improve branch prediction in applications that have large footprints, such as databases. This second predictor has a much larger history table, which should allow it to predict branches more accurately than the first-level predictor. Intel has also added a new renamed return stack buffer (RSB). RSBs store forward and return pointers associated with call and return instructions. The RSB should help Nehalem avoid return instruction mispredictions.

With its faster synchronization primitives, Nehalem has also been tweaked to handle threaded software better. Speaking of threading, with Nehalem we see the resurgence of simultaneous multi-threading (Hyper-Threading). With Hyper-Threading, one processing core can run two threads at the same time.
With four processing cores inside Core i7, the OS “sees” eight logical cores and can schedule eight threads on the CPU at once, effectively doubling the number of threads that Nehalem can run simultaneously compared to a conventional quad-core CPU. Whereas Hyper-Threading (HT) never really took off on the Pentium 4, Intel feels that Nehalem has a distinct HT advantage thanks to its larger cache and greater memory bandwidth, both of which should allow it to deliver better HT performance. Additionally, there are more apps capable of taking advantage of HT than there were a few years ago. As you’ll see in our Lost Planet, Cinebench, and Valve benchmarks, Nehalem delivers a significant performance increase in HT-aware apps.

New cache subsystem

While Nehalem has the same 32KB instruction/32KB data L1 cache configuration as previous Core 2 CPUs, Intel has totally revamped the L2 cache and added a new L3 cache. Nehalem’s L2 cache is much smaller than Penryn’s. Each core has its own 256KB L2 cache for handling data and instructions. While this is significantly less than previous processors offered, Nehalem’s L2 has lower latency than its predecessors’.

In addition to the L1 and L2 caches, Nehalem, like AMD’s Phenom, also features an L3 cache that is shared across all the cores. Unlike Phenom’s, however, Nehalem’s L3 is inclusive rather than exclusive. Intel feels that this inclusive architecture gives them an advantage over AMD, as an exclusive L3 doesn’t hold copies of the data stored in the lower-level L1 and L2 caches. As a result, if a data request misses in an exclusive L3 cache, each processor core must be snooped (searched) in case its L1 or L2 cache has the requested data. This increases latency and snoop traffic between the cores. With Nehalem these snoops are unnecessary: on an L3 miss, the CPU already knows that the data doesn’t reside in any core’s L1 or L2. This helps to reduce latency, and thus improve performance, while also reducing power consumption.
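
The difference in snoop traffic on an L3 miss can be illustrated with a small sketch. This is a toy model, not a description of how either vendor’s hardware is actually built: the names `Core`, `lookup`, and the example addresses are all hypothetical, and it only models the miss case discussed above (it ignores what happens on an L3 hit and treats each cache as a simple set of addresses).

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    # Addresses currently held in this core's private L1/L2 caches (toy model).
    private_lines: set = field(default_factory=set)


def lookup(address, cores, l3_lines, inclusive):
    """Return (where the line was found, number of cores snooped)."""
    if address in l3_lines:
        return "L3", 0
    if inclusive:
        # An inclusive L3 holds a copy of every line in the private caches,
        # so an L3 miss proves that no core has the line: go straight to memory.
        return "memory", 0
    # Exclusive L3: the line may still sit in some core's private cache,
    # so each core has to be snooped before falling back to memory.
    snoops = 0
    for core in cores:
        snoops += 1
        if address in core.private_lines:
            return "peer core", snoops
    return "memory", snoops


# Four cores; two of them hold one line each in their private caches.
cores = [Core({0x100}), Core({0x200}), Core(), Core()]
inclusive_l3 = {0x100, 0x200, 0x300}  # superset of all private lines
exclusive_l3 = {0x300}                # holds only lines not in L1/L2

print(lookup(0x400, cores, inclusive_l3, inclusive=True))   # ('memory', 0)
print(lookup(0x400, cores, exclusive_l3, inclusive=False))  # ('memory', 4)
print(lookup(0x200, cores, exclusive_l3, inclusive=False))  # ('peer core', 2)
```

In this model the inclusive design pays for its snoop savings with duplicated capacity: every line in a private cache also occupies space in the L3.
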

Like its two-level branch prediction, Nehalem features a two-level, 512-entry translation lookaside buffer (TLB). Nehalem is the first CPU to feature a second TLB. This is another improvement Intel has incorporated into Nehalem to improve its performance with server apps like large databases.

SSE4

Nehalem is Intel’s first CPU to offer SSE4.2 support. Seven new application-targeted accelerators have been added to the new instruction set, providing improved performance in string and text processing operations. One example Intel provides is the parsing of XML files at a much higher speed. The other two instructions are focused on accelerated searching and pattern recognition of large data sets (useful for voice/handwriting recognition), and the seventh is a CRC instruction focused on new communications capabilities such as accelerated network attached storage.
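
To give a sense of the work the CRC instruction takes off the CPU’s hands, here is a minimal software sketch of the checksum it accelerates. It assumes the instruction in question is SSE4.2’s `CRC32`, which uses the CRC-32C (Castagnoli) polynomial; the article itself doesn’t name the polynomial. The function name `crc32c` is our own, and the bit-at-a-time loop is chosen for readability, not speed. The initial and final inversions follow the usual CRC-32C software convention and are not performed by the hardware instruction itself.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # bit-reflected form of the Castagnoli polynomial


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bit-at-a-time CRC-32C; pass a previous result as `crc` to continue a stream."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; if it was set, fold the polynomial back in.
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


print(hex(crc32c(b"123456789")))  # 0xe3069283, the standard CRC-32C check value
```

The inner loop runs eight shift-and-XOR steps per byte, whereas the hardware instruction folds in several bytes of input at a time, which is where the speedup for storage and networking checksums would come from.
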