Comparing Intel's Core with AMD's K8 Microarchitecture

IS 3313 December 14th
Why is the Core Better at Prefetching and Caching?

- 3 prefetchers per core: 2 for data, 1 for instructions
- 2 prefetchers for the shared L2 cache
- Eight prefetchers active in a Core 2 Duo CPU
- Demand load traffic gets bandwidth priority over data prefetch
- Data prefetch uses the store port for the tag lookup… possible because code typically issues more loads than stores, leaving the store port free
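The hardware prefetchers above detect sequential strides on their own; the sketch below makes the same idea explicit in software using GCC/Clang's `__builtin_prefetch` (a toolchain assumption, not something from the slides):

```c
#include <stddef.h>

/* Sketch of what a data prefetcher does for a sequential stride:
 * request a line ~16 elements ahead so it is already in cache by the
 * time the loop reaches it. Core's hardware prefetchers spot this
 * pattern automatically; the explicit hint just makes it visible.
 * __builtin_prefetch is a GCC/Clang builtin (toolchain assumption). */
long sum_with_prefetch(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], /*rw=*/0, /*locality=*/1);
        s += a[i];
    }
    return s;
}
```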
Cache Comparison: The Memory Subsystem

- K8 has a bigger 2 x 64 KB L1 cache, but Core's 8-way 32 KB cache has a hit rate close to that of a 2-way 64 KB cache
- K8's on-die integrated memory controller lowers the latency to RAM considerably
- But… Core CPUs have much bigger caches and much smarter prefetching
- Core's L1 cache delivers about twice as much bandwidth, and its L2 cache is about 2.5 times faster than that of the Athlon 64 or Opteron
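The hit-rate claim in the first bullet comes down to conflict misses: higher associativity lets more lines that map to the same set coexist. The toy LRU cache model below (sizes and trace are illustrative, not a model of the real K8 or Core hierarchies) shows the effect:

```c
#include <stdlib.h>

/* Toy LRU set-associative cache model with 64-byte lines. Returns the
 * number of hits for an address trace. Illustrative only - not a model
 * of the real K8 or Core cache hierarchies. */
static long simulate(const unsigned long *addrs, long n, int sets, int ways) {
    unsigned long *tags = calloc((size_t)sets * ways, sizeof *tags);
    long *stamp = calloc((size_t)sets * ways, sizeof *stamp); /* 0 = invalid */
    long hits = 0, tick = 0;
    for (long i = 0; i < n; i++) {
        unsigned long line = addrs[i] >> 6;          /* 64-byte line */
        int set = (int)(line % (unsigned long)sets);
        unsigned long tag = line / (unsigned long)sets;
        unsigned long *t = tags + (size_t)set * ways;
        long *s = stamp + (size_t)set * ways;
        int hit = -1, victim = 0;
        for (int w = 0; w < ways; w++) {
            if (s[w] && t[w] == tag) hit = w;
            if (s[w] < s[victim]) victim = w;        /* LRU or invalid way */
        }
        if (hit >= 0) { hits++; s[hit] = ++tick; }
        else { t[victim] = tag; s[victim] = ++tick; }  /* fill on miss */
    }
    free(tags);
    free(stamp);
    return hits;
}
```

Three addresses that collide in one set thrash a 2-way cache (512 sets of 64 B, i.e. 64 KB) but fit comfortably in one set of an 8-way cache (64 sets, i.e. 32 KB), so the smaller, more associative cache wins on this trace.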
Decoding

"In almost every situation, the Core architecture has the advantage. It can decode 4 x86 instructions per cycle, and sometimes 5 thanks to x86 fusion. AMD's Hammer can do only 3."
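The "sometimes 5" comes from fusing a compare with the conditional branch that follows it into one internal operation. A loop like the sketch below is a typical candidate: most compilers emit a cmp+jcc pair for the `if`, which Core's decoders can fuse (codegen is compiler-dependent; the function name is illustrative):

```c
/* Macro-op fusion sketch: the compare immediately followed by a
 * conditional branch typically compiles to a cmp + jcc pair, which
 * Core's decoders can fuse into a single internal op. That is how
 * 4 decoders occasionally sustain 5 x86 instructions per cycle. */
int count_below(const int *a, int n, int limit) {
    int c = 0;
    for (int i = 0; i < n; i++)
        if (a[i] < limit)   /* cmp + jcc: a fusion candidate */
            c++;
    return c;
}
```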
Out of Order Execution

- Core's 96-entry ROB is, thanks to macro-op fusion, effectively bigger than the K8's 72-entry macro-op buffer
- Core uses a central reservation station, while the Athlon uses distributed schedulers
- A central reservation station gives better utilization, while distributed schedulers allow more entries
- Both do 1 branch prediction per cycle
- Core outperforms K8 on 128-bit SSE2/3 processing due to its 3 SSE units
- K8 decodes each 128-bit SSE instruction into two separate 64-bit instructions; Core is twice as fast here
- Core can do 4 double-precision 64-bit FP calculations per cycle, while the Athlon 64 can do just 3
- K8 has a small advantage with its 3 AGUs compared to Core's 2
- However, Core's deeper, more flexible out-of-order buffers and bigger, faster L2 cache should negate this small advantage in most integer workloads
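The 128-bit SSE point above can be sketched with SSE2 intrinsics (x86 with SSE2 assumed; the function name is illustrative). Core issues each 128-bit operation as one internal op, while K8 splits it into two 64-bit halves:

```c
#include <emmintrin.h>  /* SSE2 intrinsics: x86 with SSE2 assumed */

/* One 128-bit multiply plus one 128-bit add per iteration, i.e. 4
 * double-precision FLOPs. Core executes each 128-bit op whole; K8
 * cracks it into two 64-bit operations, halving its throughput. */
static void saxpy_pd(double *y, const double *x, double a, int n) {
    __m128d va = _mm_set1_pd(a);
    for (int i = 0; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);
        __m128d vy = _mm_loadu_pd(y + i);
        vy = _mm_add_pd(vy, _mm_mul_pd(va, vx));  /* 2 mul + 2 add */
        _mm_storeu_pd(y + i, vy);
    }
}
```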
A Tale of Two Cores… Better Out of Order Execution

The K8 Athlon 64 can only move loads ahead of independent ALU operations (ADD, etc.). Loads cannot be moved ahead far enough to hide the effect of a cache miss, and other loads cannot be used to keep the CPU busy if a load has to wait for a store to finish. The K8 has some load/store reordering, but it happens much later in the pipeline and is less flexible than in the Core architecture.

Vs.

Core's approach to determining whether a load and a store share the same address is called memory disambiguation. The Core (P8) therefore permits loads to move ahead of stores, giving a big performance boost. Intel claims up to a 40% performance boost in some instances; a 10-20% increase in performance is possible in combination with the fast L1 and L2 caches.
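A minimal C sketch of the hazard both slides describe (names are illustrative): in the loop below, each load of `src[i]` sits behind the previous iteration's store to `dst[i-1]` in program order. If the pointers may alias, a conservative core must make the load wait; Core's memory disambiguation predicts "no alias" and hoists the load speculatively, replaying only on a misprediction. The C99 `restrict` qualifier gives the compiler the same no-alias guarantee statically:

```c
/* The store to dst[i] and the next iteration's load of src[i+1] may
 * alias if dst and src overlap. K8 must wait; Core's memory
 * disambiguation speculates that they do not and hoists the load.
 * restrict promises the compiler the same thing at compile time. */
void scale(double *restrict dst, const double *restrict src,
           long n, double k) {
    for (long i = 0; i < n; i++)
        dst[i] = src[i] * k;
}
```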
HyperThreading and Integrated Memory Controller

- There is no Simultaneous Multi-Threading (SMT, i.e. HyperThreading) in the Core architecture
- SMT can offer up to a 40% performance boost in server applications
- However, TLP is being addressed by increasing the number of cores on-die: e.g. the 65 nm Tigerton is two Woodcrests in one package, giving 4 cores
- An IMC was not adopted because the transistors were better spent on the 4 MB shared cache
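Exploiting TLP through extra cores rather than SMT looks the same to software: spawn one thread per core. A minimal POSIX-threads sketch (function and struct names are illustrative) splitting a sum across two threads:

```c
#include <pthread.h>
#include <stddef.h>

/* TLP via multiple cores, the route the slide describes: one worker
 * thread per core, each summing half of the array. */
typedef struct { const long *a; long n, sum; } Job;

static void *worker(void *p) {
    Job *j = p;
    long s = 0;
    for (long i = 0; i < j->n; i++) s += j->a[i];
    j->sum = s;
    return NULL;
}

long parallel_sum(const long *a, long n) {
    Job jobs[2] = { { a, n / 2, 0 }, { a + n / 2, n - n / 2, 0 } };
    pthread_t t[2];
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, &jobs[i]);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    return jobs[0].sum + jobs[1].sum;
}
```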
Conclusions

- Compared to the AMD K8, Intel's Core is simply a wider, more efficient, and more out-of-order CPU
- The increase in ILP enabled by memory disambiguation, plus the massive bandwidth of the L1 and L2 caches, delivers close to 33% more performance, clock-for-clock
- AMD could enhance its SSE/SIMD power by widening each execution unit, or by simply implementing more of them in the out-of-order FP pipeline, and by further improving the bandwidth of its two caches
- If AMD adopts a more flexible approach to reordering loads, even without memory disambiguation, a 5% increase in IPC is possible
- Core may provide a couple of free lunch vouchers to programmers building single-threaded applications… for now!