Chapter 11: The L2 cache

The level 2 cache is normally much bigger than the L1 cache (and unified), typically
256, 512 or 1024 KB. The purpose of the L2 cache is to constantly read in slightly
larger quantities of data from RAM, so that these are available to the L1 cache.
In the earlier processor generations, the L2 cache was placed outside the chip: either
on the motherboard (as in the original Pentium processors), or on a special module
together with the CPU (as in the first Pentium II’s).
As process technology has developed, it has become possible to make room for the
L2 cache inside the actual processor chip. Thus the L2 cache has been integrated and
that makes it function much better in relation to the L1 cache and the processor
core.
The L2 cache is not as fast as the L1 cache, but it is still much faster than normal
RAM.
Traditionally the L2 cache is connected to the front side bus. Through this bus it
connects to the chipset’s north bridge and RAM:
Fig. 73. The way the processor uses the L1 and L2 cache has crucial significance for
its utilisation of the high clock frequencies.
The level 2 cache takes up a lot of the chip’s die, as millions of transistors are needed
to make a large cache. The integrated cache is made using SRAM (static RAM), as
opposed to normal RAM which is dynamic (DRAM).
While DRAM can be made using one transistor per bit (plus a capacitor), it costs 6
transistors (or more) to make one bit of SRAM. Thus 256 KB of L2 cache would
require more than 12 million transistors. It was only when fine process
technology (such as 0.13 and 0.09 microns) was developed that it became feasible to
integrate a large L2 cache into the actual CPU. In Fig. 66 on page 27, the number of
transistors includes the CPU’s integrated cache.
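The transistor figure above can be checked with a little arithmetic. The following is a minimal sketch, assuming exactly 6 transistors per bit and ignoring tags, control logic and redundancy (so a real cache needs more); the function name is only illustrative.

```python
# Rough transistor estimate for the data array of an SRAM cache, using the
# chapter's figure of 6 transistors per bit. Tag, control and redundancy
# overhead is ignored (an assumption), so real caches need more than this.
TRANSISTORS_PER_SRAM_BIT = 6


def sram_transistors(size_kb: int) -> int:
    """Return the transistor count for a cache data array of size_kb KB."""
    bits = size_kb * 1024 * 8  # KB -> bytes -> bits
    return bits * TRANSISTORS_PER_SRAM_BIT


for size_kb in (256, 512, 1024):
    millions = sram_transistors(size_kb) / 1e6
    print(f"{size_kb:>5} KB: about {millions:.1f} million transistors")
```

For 256 KB this gives roughly 12.6 million transistors, in line with the "more than 12 million" stated above.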
Powerful bus
The bus between the L1 and L2 cache is presumably THE place in the processor
architecture which has the greatest need for high bandwidth. We can calculate the
theoretical maximum bandwidth by multiplying the bus width by the clock
frequency. Here are some examples:
CPU                  Bus width   Clock frequency   Theoretical bandwidth
Intel Pentium III    64 bits     1400 MHz          11.2 GB/sec
AMD Athlon XP+       64 bits     2167 MHz          17.3 GB/sec
AMD Athlon 64        64 bits     2200 MHz          17.6 GB/sec
AMD Athlon 64 FX     128 bits    2200 MHz          35.2 GB/sec
Intel Pentium 4      256 bits    3200 MHz          102 GB/sec
Fig. 74. Theoretical calculations of the bandwidth between the L1 and L2 cache.
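The figures in Fig. 74 can be reproduced with a short calculation. This is only a sketch of the rule "bus width times clock frequency", assuming one transfer per clock cycle and 1 GB = 10^9 bytes; the function name is illustrative.

```python
# Theoretical peak bandwidth = bus width (in bytes) x clock frequency.
# Assumes one transfer per clock cycle and 1 GB = 10**9 bytes.


def bandwidth_gb_per_sec(bus_width_bits: int, clock_mhz: float) -> float:
    """Return the theoretical bandwidth in GB/sec."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 / 1e9


# (CPU, bus width in bits, clock frequency in MHz), as listed in Fig. 74
FIG_74 = [
    ("Intel Pentium III", 64, 1400),
    ("AMD Athlon XP+", 64, 2167),
    ("AMD Athlon 64", 64, 2200),
    ("AMD Athlon 64 FX", 128, 2200),
    ("Intel Pentium 4", 256, 3200),
]

for cpu, width, clock in FIG_74:
    print(f"{cpu:<18} {bandwidth_gb_per_sec(width, clock):6.1f} GB/sec")
```

For the Pentium 4 the calculation gives 102.4 GB/sec, which the figure rounds to 102.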
Different systems
There are a number of different ways of using caches. Both Intel and AMD have
saved on L2 cache in some series in order to make cheaper products. But there is no
doubt that the better the cache (both L1 and L2), the more efficient the CPU will
be and the higher its performance.
AMD have settled on a fairly large L1 cache of 128 KB, while Intel continue to use
relatively small (but efficient) L1 caches.
On the other hand, Intel uses a 256 bit wide bus on the “inside edge” of the L2 cache
in the Pentium 4, while AMD only has a 64-bit bus (see Fig. 74).
AMD uses exclusive caches in all their CPUs. That means that the same data can’t be
present in both caches at the same time, which is a clear advantage because no cache
space is spent on duplicates. Intel’s caches do not work this way.
However, the Pentium 4 has a more advanced cache design, with the Execution Trace
Cache making up 12 KB of the 20 KB Level 1 cache. This instruction cache works with
decoded instructions, as described on page 35.
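The advantage of exclusive caches mentioned above can be illustrated with a small calculation. This is a sketch only: for comparison it assumes a design in which everything held in L1 is also duplicated in L2, which is an assumption and not a description of any particular Intel CPU. The sizes are the 128 KB L1 and 256 KB L2 mentioned in this chapter.

```python
# How much distinct data two cache levels can hold together.
# exclusive=True : no line is stored in both levels, so capacities add up.
# exclusive=False: assumed comparison case where L1 contents are duplicated
#                  in L2, so only the larger level counts.


def unique_capacity_kb(l1_kb: int, l2_kb: int, exclusive: bool) -> int:
    """Return the amount of distinct data (in KB) the two levels can hold."""
    if exclusive:
        return l1_kb + l2_kb
    return max(l1_kb, l2_kb)


L1_KB, L2_KB = 128, 256  # sizes taken from the chapter's Athlon examples

print("exclusive :", unique_capacity_kb(L1_KB, L2_KB, exclusive=True), "KB")
print("duplicated:", unique_capacity_kb(L1_KB, L2_KB, exclusive=False), "KB")
```

With a large L1 cache the difference is substantial, which may be one reason the exclusive design suits AMD's 128 KB L1 cache.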
Latency
A very important aspect of all RAM – cache included – is latency. All RAM storage has
a certain latency, which means that a certain number of clock ticks (cycles) must pass
between, for example, two reads. L1 cache has less latency than L2, which is why it is
so efficient.
When the cache is bypassed to read directly from RAM, the latency is many times
greater. In Fig. 77 the number of wasted clock ticks is shown for various CPUs.
Note that when the processor core has to fetch data from the actual RAM (when
both L1 and L2 have failed), it costs around 150 clock ticks. This situation is called
stalling and needs to be avoided.
Note that the Pentium 4 has a much smaller L1 cache than the Athlon XP, but its
cache is significantly faster. It simply takes fewer clock ticks (cycles) to fetch data:
Latency      Pentium II   Athlon     Pentium 4
L1 cache:    3 cycles     3 cycles   2 cycles
L2 cache:    18 cycles    6 cycles   5 cycles
Fig. 77. Latency leads to wasted clock ticks; the fewer there are of these, the faster
the processor will appear to be.
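To see how these latencies combine, one can estimate the average cost of a memory access. The sketch below is a simplification: it treats each figure as the total cost of an access served at that level, uses the roughly 150 clock ticks for RAM quoted above, takes an L1 hit rate of 97% (within the 96-98% range this chapter gives), and assumes a purely hypothetical L2 hit rate of 90%.

```python
# Expected clock ticks per memory access, weighted by where the data is found.
RAM_CYCLES = 150  # the chapter's approximate cost of fetching from RAM

# Latencies in cycles, as given in Fig. 77
LATENCY = {
    "Pentium II": {"L1": 3, "L2": 18},
    "Athlon": {"L1": 3, "L2": 6},
    "Pentium 4": {"L1": 2, "L2": 5},
}


def average_cycles(l1, l2, l1_hit=0.97, l2_hit=0.90, ram=RAM_CYCLES):
    """Return the expected cycles per access.

    l1_hit comes from the chapter's 96-98% range; l2_hit is a hypothetical
    value chosen only for illustration.
    """
    l1_miss = 1 - l1_hit
    return (l1_hit * l1
            + l1_miss * l2_hit * l2
            + l1_miss * (1 - l2_hit) * ram)


for cpu, lat in LATENCY.items():
    print(f"{cpu:<11} {average_cycles(lat['L1'], lat['L2']):.2f} cycles on average")
```

Even with a high L1 hit rate, the rare trips to RAM contribute noticeably to the average, which is why stalling needs to be avoided.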
Intelligent “data prefetch”
In CPUs like the Pentium 4 and Athlon XP, a handful of support mechanisms are also
used which work in parallel with the cache. These include:
A hardware auto data prefetch unit, which attempts to guess which data should be
read into the cache. This device monitors the instructions being processed and
predicts what data the next job will need.
Related to this is the Translation Look-aside Buffer (TLB), which is also a kind of
cache. It holds recently used translations from virtual to physical memory addresses,
which supports the supply of data to the L1 cache, and this buffer is also being
optimised in new processor designs. Both systems contribute to improved exploitation
of the limited bandwidth in the memory system.
Conclusion
L1 and L2 cache are important components in modern processor design. The cache is
crucial for the utilisation of the high clock frequencies which modern process
technology allows. Modern L1 caches are extremely effective. In about 96-98% of
cases, the processor can find the data and instructions it needs in the cache. In the
future, we can expect to keep seeing CPUs with larger L2 caches and more advanced
memory management, as this is the way forward if we want to achieve more
effective utilisation of the CPU’s clock ticks. Here is a concrete example:
In January 2002 Intel released a new version of their top processor, the Pentium 4
(with the codename, “Northwood”). The clock frequency had been increased by
10%, so one might expect a 10% improvement in performance. But because the
integrated L2 cache was also doubled from 256 to 512 KB, the gain was found to be
all of 30%.
CPU                              L2 cache   Clock freq.   Improvement
Intel Pentium 4 (0.18 micron)    256 KB     2000 MHz
Intel Pentium 4 (0.13 micron)    512 KB     2200 MHz      +30%
Fig. 79. Because of the larger L2 cache, performance increased significantly.
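The share of the gain that is not explained by the higher clock frequency can be estimated from the numbers in Fig. 79. This is a rough sketch, assuming that performance scales linearly with clock frequency and that the clock and cache effects multiply; both are simplifications, and the function name is illustrative.

```python
# Split the Northwood gain into a clock part and a remaining part.
# Assumes performance scales linearly with clock frequency and that the
# two effects multiply (a simplification, not a measured breakdown).


def remaining_speedup(clock_old_mhz, clock_new_mhz, total_gain):
    """Return the speedup factor left after removing the clock increase."""
    clock_factor = clock_new_mhz / clock_old_mhz
    return (1 + total_gain) / clock_factor


factor = remaining_speedup(2000, 2200, 0.30)
print(f"Speedup not explained by clock frequency: {(factor - 1) * 100:.0f}%")
```

Under these assumptions, roughly 18% of the improvement would be attributable to the doubled L2 cache and other changes in the new core.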
In 2003 AMD updated the Athlon processor with the new “Barton” core. Here the L2
cache was also doubled from 256 to 512 KB in some models. In 2004 Intel introduced
the “Prescott” core with 1024 KB L2 cache, which is the same size as in AMD’s Athlon
64 processors. Some Extreme Editions of the Pentium 4 even use 2 MB of L2 cache.
Xeon for servers
Intel produces special server models of their Pentium III and Pentium 4 processors.
These are called Xeon, and are characterised by very large L2 caches. In an Intel Xeon
the 2 MB L2 cache uses 149,000,000 transistors.
Xeon processors are incredibly expensive (about Euro 4,000 for the top models), so
they have never achieved widespread distribution. They are used in high-end
servers, in which the CPU only accounts for a small part of the total price.
Intel also has a 64-bit server CPU, the Itanium. The processor is supplied in
modules which include a 4 MB L3 cache made up of 300 million transistors.
Multiprocessors
Several Xeon processors can be installed on the same motherboard, using special
chipsets. By connecting 2, 4 or even 8 processors together, you can build a very
powerful computer.
These MP (Multiprocessor) machines are typically used as servers, but can also be
used as powerful workstations, for example, to perform demanding 3D graphics and
animation tasks. AMD has the Opteron processors, which are server versions of the
Athlon 64. Not all software can make use of the PC’s extra processors; the programs
have to be designed to do so. For example, there are professional versions of
Windows NT, 2000 and XP which support the use of several processors in one PC.
See also the discussion of Hyper-Threading, which allows a Pentium 4 processor to
appear as an MP system. Both Intel and AMD are also working on dual-core processors.