Comparing Intel’s Core with AMD's K8 Microarchitecture
IS 3313, December 14th

Why is the Core Better at Prefetching and Caching?
- Three prefetchers per core: two for data, one for instructions
- Two further prefetchers for the shared L2 cache, so eight prefetchers are active in a Core 2 Duo CPU
- Load operations (demand bandwidth) get priority over prefetch traffic
- Data prefetch performs its tag lookup through the store port; this is one reason the core provides more load resources than store resources

Cache Comparison: The Memory Subsystem
- The K8 has a bigger 2 x 64 KB L1 cache, but Core's 8-way 32 KB cache has a hit rate close to that of a 2-way 64 KB cache
- The K8's on-die memory controller lowers the latency to RAM considerably
- However, Core CPUs have much bigger caches and much smarter prefetching
- Core's L1 cache delivers about twice as much bandwidth, and its L2 cache is about 2.5 times faster than that of the Athlon 64 or Opteron

Decoding
- "In almost every situation, the Core architecture has the advantage. It can decode 4 x86 instructions per cycle, and sometimes 5 thanks to x86 fusion. AMD's Hammer can do only 3."

Out of Order Execution
- Core's 96-entry reorder buffer (ROB) is, thanks to macro-op fusion, effectively bigger than the 72-entry macro-op buffer of the K8
- Core uses a central reservation station, while the Athlon uses distributed schedulers
- A central reservation station gives better utilization, while distributed schedulers allow more entries
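The associativity claim above (an 8-way 32 KB L1 approaching the hit rate of a 2-way 64 KB one) can be explored with a toy cache model. This is a minimal sketch, not a model of real silicon: the 64-byte line size, LRU replacement, and all names here are illustrative assumptions.

```c
/* Minimal set-associative cache model with LRU replacement. */
#include <stdlib.h>

#define LINE_BYTES 64  /* assumed cache-line size */

typedef struct {
    int sets, ways;
    unsigned long *tag;   /* sets * ways slots; ~0UL marks an empty way */
    unsigned long *last;  /* last-use tick per way, for LRU             */
    unsigned long tick;
} Cache;

Cache cache_new(int size_bytes, int ways) {
    Cache c = { size_bytes / LINE_BYTES / ways, ways, NULL, NULL, 0 };
    c.tag  = malloc(sizeof *c.tag  * (size_t)c.sets * c.ways);
    c.last = calloc((size_t)c.sets * c.ways, sizeof *c.last);
    for (int i = 0; i < c.sets * c.ways; i++) c.tag[i] = ~0UL;
    return c;
}

/* Returns 1 on hit, 0 on miss (miss fills the LRU way of the set). */
int cache_access(Cache *c, unsigned long addr) {
    unsigned long line = addr / LINE_BYTES;
    int set = (int)(line % (unsigned long)c->sets);
    unsigned long tag = line / (unsigned long)c->sets;
    unsigned long *t = c->tag  + (size_t)set * c->ways;
    unsigned long *u = c->last + (size_t)set * c->ways;
    c->tick++;
    int victim = 0;
    for (int w = 0; w < c->ways; w++) {
        if (t[w] == tag) { u[w] = c->tick; return 1; }  /* hit */
        if (u[w] < u[victim]) victim = w;               /* track LRU way */
    }
    t[victim] = tag;   /* miss: evict the least-recently-used way */
    u[victim] = c->tick;
    return 0;
}
```

Feeding the same address trace to `cache_new(32 * 1024, 8)` and `cache_new(64 * 1024, 2)` and counting hits lets you compare the two configurations on any access pattern you construct.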
Execution Resources
- Both architectures perform one branch prediction per cycle
- Core outperforms the K8 on 128-bit SSE2/SSE3 processing thanks to its three SSE units
- The K8 decodes each 128-bit SSE instruction into two separate 64-bit operations; Core handles them twice as fast
- Core can do 4 double-precision (64-bit) FP calculations per cycle, while the Athlon 64 can do just 3
- The K8 has a small advantage in having 3 AGUs to Core's 2
- However, Core's deeper, more flexible out-of-order buffers and its bigger, faster L2 cache should negate this small advantage in most integer workloads

A Tale of Two Cores: Better Out of Order Execution
- The K8 Athlon 64 can only move loads ahead of independent ALU operations (ADD etc.)
- Loads cannot be moved ahead far enough to hide the effect of a cache miss, and other loads cannot be used to keep the CPU busy while a load waits for a store to finish
- The K8 does some load/store reordering, but much later in the pipeline, and it is less flexible than the Core architecture's
- Core's technique for determining whether a load and a store share the same address is called memory disambiguation; it permits loads to move ahead of stores, giving a big performance boost
- Intel claims up to a 40% performance boost in some instances; more typically a 10-20% increase in performance is possible, helped by the fast L1 and L2 caches

HyperThreading and Integrated Memory Controller
- There is no Simultaneous Multi-Threading (SMT, i.e. HyperThreading) in the Core architecture, even though SMT can offer up to a 40% performance boost in server applications
- However, thread-level parallelism (TLP) is being addressed by increasing the number of cores on-die: e.g.
the 65 nm Tigerton, which is two Woodcrests in one package, giving 4 cores
- An integrated memory controller (IMC) was not adopted because the transistors were judged better spent on the 4 MB shared cache

Conclusions
- Compared to the AMD K8, Intel's Core is simply a wider, more efficient and more aggressively out-of-order CPU
- Memory disambiguation enables increases in ILP, and the massive bandwidth of the L1 and L2 caches delivers close to 33% more performance, clock-for-clock
- AMD could enhance its SSE/SIMD power by increasing the width of each execution unit, or by simply implementing more of them in the out-of-order FP pipeline, and could improve the bandwidth of its two caches further
- If AMD adopts a more flexible approach to reordering loads, even without memory disambiguation, a 5% increase in IPC is possible
- Core may provide a couple of free lunch vouchers to programmers building single-threaded applications... for now!
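The memory disambiguation point can be sketched at the source level. The C99 `restrict` qualifier gives the compiler the same no-alias guarantee that Core's disambiguation hardware establishes dynamically at run time; this is an analogy, not the hardware mechanism itself, and the function names are illustrative.

```c
/* Without aliasing information, the store to dst[i] forces later loads
 * of src[...] to wait until the addresses are known not to overlap --
 * the conservative case the K8 is stuck with for most loads. */
int sum_after_copy(int *dst, int *src, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        dst[i] = src[i];  /* store whose address may match a later load */
        sum += src[i];    /* load that would like to move ahead */
    }
    return sum;
}

/* With `restrict`, the no-alias guarantee is explicit, so loads are free
 * to move ahead of the stores -- the software analogue of what Core's
 * memory disambiguation predictor decides speculatively in hardware. */
int sum_after_copy_restrict(int * restrict dst,
                            const int * restrict src, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        dst[i] = src[i];
        sum += src[i];
    }
    return sum;
}
```

Both functions compute the same result; the difference is only in how much reordering freedom the compiler (or, by analogy, the out-of-order core) is granted.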
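The SSE width point in the conclusions can be made concrete with a short intrinsics kernel (assuming an x86 compiler with SSE2 support; the function name is illustrative). Each 128-bit SSE2 operation carries two double-precision lanes: Core issues it as a single micro-op, while the K8 cracks it into two 64-bit halves, so the same source code runs at half rate there.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* out[i] = a[i] * b[i] + c[i], two doubles per 128-bit instruction.
 * Assumes n is a multiple of 2; unaligned loads keep the sketch simple. */
void madd2(double *out, const double *a, const double *b,
           const double *c, int n) {
    for (int i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        __m128d vc = _mm_loadu_pd(c + i);
        /* one mul and one add per 128-bit vector: 4 DP FLOPs per trip */
        _mm_storeu_pd(out + i, _mm_add_pd(_mm_mul_pd(va, vb), vc));
    }
}
```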