To appear in IEEE Journal of Solid State Circuits 1996
CACTI: An Enhanced Cache Access and Cycle Time Model
Steven J.E. Wilton
Department of Electrical and Computer Engineering
University of Toronto
Toronto, Ontario, Canada M5S 1A4
(416) 978-1652
wilton@eecg.toronto.edu
Norman P. Jouppi
Digital Equipment Corporation Western Research Lab
250 University Avenue
Palo Alto, CA 94301
(415) 617-3305
jouppi@pa.dec.com
May 14, 1996
Abstract
This paper describes an analytical model for the access and cycle times of on-chip direct-mapped and set-associative caches. The inputs to the model are the cache size, block size,
and associativity, as well as array organization and process parameters. The model gives
estimates that are within 6% of Hspice results for the circuits we have chosen.
This model extends previous models and fixes many of their major shortcomings. New features include models for the tag array, comparator, and multiplexor drivers; non-step stage input slopes; rectangular stacking of memory subarrays; a transistor-level decoder model; column-multiplexed bitlines controlled by an additional array organizational parameter; load-dependent transistor sizes for wordline drivers; and output of cycle times as well as access times.
Software implementing the model is available via ftp.
1 Introduction
Most computer architecture research involves investigating trade-offs between various alternatives. This cannot be done adequately without a firm grasp of the costs of each alternative. For example, it is impossible to compare two different cache organizations without considering the difference in access or cycle times. Similarly, the chip area and
power requirements of each alternative must be taken into account. Only when all the costs
are considered can an informed decision be made.
Unfortunately, it is often difficult to determine costs. One solution is to employ analytical models that predict costs based on various architectural parameters. In the cache
domain, both chip area models [1] and access time models [2] have been published.
In [2], Wada et al. present an equation for the access time of an on-chip cache as
a function of various cache parameters (cache size, associativity, block size) as well as
organizational and process parameters. Unfortunately, Wada's access time model has a
number of significant shortcomings. For example, the cache tag and comparator in set-associative memories are not modeled, and in practice, these often constitute the critical
path. Each stage in their model (e.g., bitline, wordline) assumes that the inputs to the
stage are step waveforms; actual waveforms in memories are far from steps and this can
greatly impact the delay of a stage. In the Wada model, all memory subarrays are stacked
linearly in a single file; this can result in aspect ratios of greater than 10:1 and overly
pessimistic access times. Wada's decoder model is a gate-level model which contains no
wiring parasitics. In addition, transistor sizes in Wada's model are fixed independent of the
load. For example, the wordline driver is always the same size independent of the number
of cells that it drives. Finally, Wada's model predicts only the cache access time, whereas
both the access and cycle time are important for design comparisons.
This paper describes a significant improvement and extension of Wada's access time
model. The enhanced model is called CACTI. Some of the new features are:
- a tag array model with comparator and multiplexor drivers
- non-step stage input slopes
- rectangular stacking of memory subarrays
- a transistor-level decoder model
- column-multiplexed bitlines and an additional array organizational parameter
- load-dependent transistor sizes for wordline drivers
- cycle times as well as access times
The enhancements to Wada's model can be classified into two categories. First, the assumed cache structure has been modified to more closely represent real caches. Some examples of these enhancements are the column-multiplexed bitlines and the inclusion of the tag array. The second class of enhancements involves the modeling techniques used to estimate the delay of the assumed cache structure (e.g., taking into account non-step input rise times). This paper describes both classes of enhancements. After discussing the overall
cache structure and model input parameters, the structural enhancements are described
in Section 4. The modeling techniques used are then described in Section 5. A complete
derivation of all the equations in the model is far beyond the scope of a journal paper but
is available in a technical report [3].
Any model needs to be validated before the results generated using the model can be
trusted. In [2], an Hspice model of the cache was used to validate the authors' analytical
model. The same approach was used here; Section 6 compares the model predictions to
Hspice measurements. Of course, this only shows that the analytical model matches the
Hspice model; it does not address the issue of how well the assumed cache structure (and
hence the Hspice model) reflects a real cache design. When designing a real cache, many different circuit implementations are possible. In architecture studies, however, the relative differences in access/cycle times between different cache sizes or configurations are usually
more important than absolute access times. Thus, even though our model predicts absolute
access times for a specific cache implementation, it can be used in a wide variety of situations
when an estimate of the cost of varying an architectural parameter is required. Section 7
gives some examples of how the model can be used in architectural studies.
The model described in this paper has been implemented, and the software is available
via ftp. Appendix A explains how to obtain and use the CACTI software.
2 Cache Structure
Figure 1 shows the organization of the SRAM cache being considered. The decoder first
decodes the address and selects the appropriate row by driving one wordline in the data
array and one wordline in the tag array. Each array contains as many wordlines as there
are rows in the array, but only one wordline in each array can go high at a time. Each memory cell along the selected row is associated with a pair of bitlines; each bitline is initially precharged high. When a wordline goes high, each memory cell in that row pulls down one of its two bitlines; the value stored in the memory cell determines which bitline goes low.

Figure 1: Cache structure (the address input drives the decoder, which selects one wordline in the tag array and one in the data array; each array has bitlines, column muxes, and sense amps; comparators and mux drivers steer the output drivers that produce the data and valid outputs)
Each sense amplifier monitors a pair of bitlines and detects when one changes. By detecting which line goes low, the sense amplifier can determine the contents of the selected memory cell. It is possible for one sense amplifier to be shared among several pairs of bitlines. In this case, a multiplexor is inserted before the sense amps; the select lines of the multiplexor are driven by the decoder. The number of bitlines that share a sense amplifier
depends on the layout parameters described in the next section.
The information read from the tag array is compared to the tag bits of the address. In an
A-way set-associative cache, A comparators are required. The results of the A comparisons
are used to drive a valid (hit/miss) output as well as to drive the output multiplexors.
These output multiplexors select the proper data from the data array (in a set-associative
cache or a cache in which the data array width is larger than the output width), and drive
4
To appear in IEEE Journal of Solid State Circuits 1996
the selected data out of the cache.
3 Cache and Array Organization Parameters
Table 1 shows the five model input parameters.
Parameter   Meaning
C           Cache size in bytes
B           Block size in bytes
A           Associativity
bo          Output width in bits
baddr       Address width in bits

Table 1: CACTI input parameters
In addition, there are six array organization parameters that are used to estimate the
cache access and cycle time. In the basic organization discussed by Wada [2], a single
set shares a common wordline. Figure 2-a shows this organization, where B is the block
size (in bytes), A is the associativity, and S is the number of sets (S = C/(B·A)). Clearly,
such an organization could result in an array that is much larger in one direction than the
other, causing either the bitlines or wordlines to be very slow. To alleviate this problem,
Wada describes how the array can be broken horizontally and vertically, and defines two parameters, Ndwl and Ndbl, which indicate to what extent the array has been divided.
Figure 2-b introduces another organization parameter, Nspd. This parameter indicates how
many sets are mapped to a single wordline, and allows the overall access time of the array
to be changed without breaking it into smaller subarrays. The optimum values of Ndwl ,
Ndbl, and Nspd depend on the cache and block sizes, as well as the associativity.
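As a worked example (numbers ours, for illustration only): a C = 16384-byte, direct-mapped (A = 1) cache with B = 16-byte blocks has S = C/(B·A) = 1024 sets; with Ndwl = Ndbl = 1 and Nspd = 2, the single data array has 8·B·A·Nspd = 256 columns and S/Nspd = 512 rows.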
We assume that the tag array can be configured independently of the data array. Thus, there are also three tag array parameters: Ntwl, Ntbl, and Ntspd.

Figure 2: Cache organization parameter Nspd (a: the original array, S rows of 8·B·A bits each; b: Nspd = 2, S/2 rows of 16·B·A bits each)

4 Model Components
This section gives an overview of key portions of the cache read access and cycle time model. The access and cycle times were derived by estimating delays due to the following components:
- decoder
- wordlines (in both the data and tag arrays)
- bitlines (in both the data and tag arrays)
- sense amplifiers (in both the data and tag arrays)
- comparators
- multiplexor drivers
- output drivers (data output and valid signal output)
The delay of each of these components is estimated separately and the results are combined to estimate the access and cycle time of the entire cache. A complete description of each component can be found in [3]. In this paper, we focus on those parts that differ significantly
from Wada's model.
4.1 Decoder
Wada's model contains a gate-level decoder model without any parasitic capacitances or
resistances. It also assumes all memory sub-arrays are stacked single file in a linear array.
We have used a detailed transistor-level decoder that includes both parasitic capacitances
and resistances. We have also assumed that sub-arrays are placed in a two-dimensional
array to minimize critical wiring parasitics.
Figure 3 shows the logical structure of the decoder architecture used in this model. The
decoder in Figure 3 contains three stages. Each block in the first stage takes three address
bits (in true and complement), and generates a 1-of-8 code, driving a precharged decoder
bus. These 1-of-8 codes are combined using NOR gates in the second stage. The final stage is an inverter that drives each wordline driver. We also model separate decoder driver buffers for driving the 3-to-8 decoders of the data arrays and the tag arrays.

Figure 3: Single decoder structure (address bits, in true and complement, feed 3-to-8 predecode blocks; NOR gates combine the predecoded lines and drive the wordline drivers)
Estimating the wire lengths in the decoder requires knowledge of the memory tile layout.
As mentioned in Section 3, the memory is divided into Ndwl × Ndbl subarrays; each of these arrays is 8·B·A·Nspd/Ndwl cells wide. If these arrays were placed side-by-side, the total memory width would be 8·B·A·Ndbl·Nspd cells. Instead, we assume they are grouped in two-by-two blocks, with the 3-to-8 predecode NAND gates at the center of each block; Figure 4 shows one of these blocks. This reduces the length of the connection between the decoder driver and the predecode block to approximately one quarter of the total memory width, or 2·B·A·Ndbl·Nspd cells. The length of the connection between the predecode block and the NOR gate is then (on average) half of the subarray height, which is C/(B·A·Ndbl·Nspd) cells. In large memories with many groups, the bits in the memory are arranged so that all bits driving the same data output bus are in the same group, shortening the data bus.

Figure 4: Memory block tiling assumptions (four subarrays in a two-by-two block around a central predecode block; the address arrives from the decoder driver, and a channel is reserved for the data output bus)
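To make the tiling arithmetic concrete, the sketch below (in C, the language of the released CACTI software; the cell-pitch constants are illustrative assumptions, not the paper's values) computes the two wire lengths just derived:

    /* Decoder wire lengths implied by the two-by-two tiling of Figure 4. */
    #define BIT_WIDTH_UM  7.5   /* assumed data cell width, um */
    #define BIT_HEIGHT_UM 6.0   /* assumed data cell height, um */

    /* decoder driver to predecode block: ~1/4 of the total memory width */
    double predecode_wire_um(int B, int A, int Ndbl, int Nspd) {
        return 2.0 * B * A * Ndbl * Nspd * BIT_WIDTH_UM;
    }

    /* predecode block to NOR gate: half the subarray height on average */
    double nor_wire_um(int C, int B, int A, int Ndbl, int Nspd) {
        return 0.5 * ((double)C / (B * A * Ndbl * Nspd)) * BIT_HEIGHT_UM;
    }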
4.2 Variable Size Wordline Driver
The size of the wordline driver in Wada's model is independent of the number of cells
attached to the wordline; this severely overestimates the wordline delay of large arrays.
Our model assumes a variable-sized wordline driver. Normally, a cache designer would
choose a target wordline rise time, and adjust the driver size appropriately. Rather than
assuming a constant rise time for caches of all sizes, however, we assume the desired rise time (to a 50% wordline swing) is:

desired rise time = krise · ln(cols) · 0.5

where

cols = 8·B·A·Nspd / Ndwl

and krise is a constant that depends on the implementation technology. To obtain the transistor size that would give this rise time, it is necessary to work backwards, using an equivalent RC circuit to find the required driver resistance, and then finding the transistor width that would give this resistance. This is described in Section 5.5.
4.3 Bitlines and Sense Amplifiers
Wada's model does not apply to memories with column multiplexing. Our model allows
column multiplexing using NMOS pass transistors between several pairs of bitlines and a
shared sense amp. In our model, the degree of column multiplexing (number of pairs of
bitlines per sense amp) is Nspd · Ndbl.
Although we use the same sense amp as Wada's model, we precharge the bitlines to
two NMOS diodes less than Vdd, since the sense amp performs poorly with a common-mode voltage of Vdd.

Figure 5: Comparator (a precharged output discharged through pull-down paths gated by the an/an-bar address tag bits and the bn/bn-bar sense amp outputs; the EVAL virtual-ground node is driven by a small-p/large-n inverter at the end of a timing chain fed by a sense amp on a dummy row in the tag array)
4.4 Comparator
Although Wada's model gives access times for set-associative caches, it only models the data
portion of a set-associative memory. However, the tag portion of a set-associative memory
is often the critical path. Our model assumes the tag memory array circuits are similar to
those on the data side, with the addition of comparators to choose between different sets.
The comparator that was modeled is shown in Figure 5. The outputs from the sense amplifiers are connected to the inputs labeled bn and bn-bar. The an and an-bar inputs are
driven by tag bits in the address. Initially, the output of the comparator is precharged high;
a mismatch in any bit will close one pull-down path and discharge the output. In order to
ensure that the output is not discharged before the bn bits become stable, node EVAL is
held high until roughly three inverter delays after the generation of the bn-bar signals. This is accomplished by using a timing chain driven by a sense amp on a dummy row in the tag array. The output of the timing chain is used as a "virtual ground" for the pull-down paths of the comparator. When the large NMOS transistor in the final inverter in the timing
chain begins to conduct, the virtual ground (and hence the comparator output if there is a
mismatch) begins to discharge.
4.5 Set Multiplexor and Output Drivers
In a set-associative cache, the result of the A comparisons must be used to select which
of the A possible data blocks are to be sent out of the cache. Since the width of a block
(8B ) is usually greater than the cache output width (bo ), it is also necessary to choose part
of the selected block to drive the output lines. An A-way set-associative cache contains
A multiplexor driver blocks, as shown in Figure 6. Each multiplexor driver uses a single
comparator output bit, along with address bits, to determine which bo data array outputs
drive the output bus.
Figure 6: Overview of data bus output driver multiplexors (A mux driver circuits, each combining one comparator output with address bits to select bo of the 8B data array outputs onto the data bus)
5 Model Derivation
The delay of each component was estimated by decomposing each component into several
equivalent RC circuits, and using simple RC equations to estimate the delay of each stage.
This section shows how resistances and capacitances were estimated, as well as how they were combined and the delay of a stage calculated. The stage delay in our model depends on the slope of its inputs; this section also describes how this dependence was modeled.
5.1 Estimating Resistances
To use the RC approximations described in Sections 5.5 and 5.6, it is necessary to estimate
the full-on resistance of a transistor. The full-on resistance is the resistance seen between
drain and source of a transistor assuming the gate voltage is constant and the gate is fully
conducting. This resistance can also be used for pass transistors that (as far as the critical
path is concerned) are fully conducting.
It is assumed that the equivalent resistance of a conducting transistor is inversely proportional to the transistor width (only minimum-length transistors were used). Thus,
equivalent resistance = R / W

where R is a constant (different for NMOS and PMOS transistors) and W is the transistor width.
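In C this is a one-line helper (a sketch; the per-width constants below are placeholders, one each for NMOS and PMOS):

    /* Full-on equivalent resistance of a minimum-length transistor. */
    #define R_NCH_OHM_UM  9700.0   /* assumed NMOS constant, ohm-um */
    #define R_PCH_OHM_UM 22400.0   /* assumed PMOS constant, ohm-um */

    double trans_res(double width_um, int is_nmos) {
        return (is_nmos ? R_NCH_OHM_UM : R_PCH_OHM_UM) / width_um;
    }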
5.2 Estimating Gate Capacitances
The RC approximations in Sections 5.5 and 5.6 also require an estimate of a transistor's gate and drain capacitances. The gate capacitance of a transistor consists of two parts: the capacitance of the gate itself, and the capacitance of the polysilicon line going into the gate. If Leff is the effective length of the transistor, Lpoly is the length of the poly line going into the gate, Cgate is the capacitance of the gate per unit area, and Cpolywire is the poly line capacitance per unit area, then a transistor of width W has a gate capacitance of:

gatecap(W) = W · Leff · Cgate + Lpoly · Leff · Cpolywire
The same formula holds for both NMOS and PMOS transistors.
The value of Cgate depends on whether the transistor is being used as a pass transistor, or as a pull-up or pull-down transistor in a static gate. Thus, two different values of Cgate are required.
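A corresponding sketch for the gate capacitance (placeholder process constants; a pass-transistor use would substitute the second Cgate value):

    /* Gate capacitance: gate area term plus the poly wire into the gate. */
    #define LEFF_UM      0.8    /* assumed effective channel length, um */
    #define CGATE_FF_UM2 1.95   /* assumed gate cap, fF per um^2 */
    #define CPOLY_FF_UM2 0.25   /* assumed poly cap, fF per um^2 */

    double gatecap_ff(double width_um, double lpoly_um) {
        return width_um * LEFF_UM * CGATE_FF_UM2
             + lpoly_um * LEFF_UM * CPOLY_FF_UM2;
    }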
5.3 Drain Capacitances
Figure 7 shows typical transistor layouts for small and large transistors. We have assumed that if the transistor width is larger than 10 µm, the transistor is split as shown in Figure 7-b.
The drain capacitance is composed of both an area and a perimeter component. Using the geometries in Figure 7, the drain capacitance for a single transistor can be obtained. If the width is less than 10 µm,

draincap(W) = 3·Leff·W·Cdiffarea + (6·Leff + W)·Cdiffside + W·Cdiffgate

where Cdiffarea, Cdiffside, and Cdiffgate are process-dependent parameters (there are two values for each of these: one for NMOS and one for PMOS transistors). Cdiffgate is the sum of the junction capacitance due to the diffusion and the oxide capacitance due to the gate/source or gate/drain overlap.

Figure 7: Transistor geometries (a: width < 10 µm, a single gate finger of width W with 3·Leff-wide source and drain diffusions; b: width ≥ 10 µm, the transistor folded into two fingers of width W/2)
If the width is larger than 10 µm, we assume the transistor is folded (see Figure 7-b), reducing the drain capacitance to:

draincap(W) = 3·Leff·(W/2)·Cdiffarea + 6·Leff·Cdiffside + W·Cdiffgate
Now, consider two transistors (with widths less than 10 µm) connected in series, with only a single Leff × W region acting as both the source of the first transistor and the drain of the second. If the first transistor is on, and the second transistor is off, the capacitance seen looking into the drain of the first is:

draincap(W) = 4·Leff·W·Cdiffarea + (8·Leff + W)·Cdiffside + 3·W·Cdiffgate
Figure 8 shows the situation if the transistors are wider than 10 µm. In this case, the capacitance seen looking into the drain of the inner transistor (x in the diagram), assuming it is on but the outer transistor is off, is:

draincap(W) = 5·Leff·(W/2)·Cdiffarea + 10·Leff·Cdiffside + 3·W·Cdiffgate
Figure 8: Two stacked transistors when each width ≥ 10 µm (both folded into W/2 fingers; x marks the drain of the inner transistor)
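The two single-transistor cases fold into one helper; a sketch under the Figure 7 geometry assumptions (placeholder constants; the NMOS/PMOS distinction and the stacked cases are omitted for brevity):

    /* Drain capacitance of a single transistor (Figure 7). */
    #define LEFF_UM   0.8    /* assumed effective length, um */
    #define CDIFFAREA 0.14   /* assumed area cap, fF/um^2 */
    #define CDIFFSIDE 0.27   /* assumed sidewall cap, fF/um */
    #define CDIFFGATE 0.50   /* assumed junction+overlap cap, fF/um */

    double draincap_ff(double w) {
        if (w < 10.0)   /* single finger, Figure 7-a */
            return 3.0*LEFF_UM*w*CDIFFAREA
                 + (6.0*LEFF_UM + w)*CDIFFSIDE + w*CDIFFGATE;
        /* folded into two W/2 fingers, Figure 7-b */
        return 3.0*LEFF_UM*(w/2.0)*CDIFFAREA
             + 6.0*LEFF_UM*CDIFFSIDE + w*CDIFFGATE;
    }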
5.4 Other parasitic effects
Parasitic resistances and capacitances of the bitlines, wordlines, predecode lines, and various other signals within the cache are also modeled. These resistances and capacitances are fixed
values per unit length; the capacitance includes an expected value for the area and sidewall
capacitances to the substrate and other layers.
5.5 Simple RC Circuits
Each component described in Section 4 can be decomposed into several first- or second-order RC circuits. Figure 9-a shows a typical first-order circuit. The time for node x to rise or fall can be determined using the equivalent circuit of Figure 9-b. Here, the pull-down path (assuming a rising input) of the first stage is replaced by a resistance, and the gate capacitances of the second stage and the drain capacitance of the first stage are replaced by a single capacitor. The resistances and capacitances are calculated as shown in Sections 5.1
to 5.3. In stages in which the two gates are separated by a long wire, parasitic capacitances
and resistances of the wire are included in Ceq and Req .
The delay of the circuit in Figure 9 can be estimated using an equation due to Horowitz [4] (assuming a rising input):

delay = tf · sqrt( (ln(vth/Vdd))^2 + 2·trise·b·(1 - vth/Vdd)/tf )

where vth is the switching voltage of the inverter, trise is the input rise time, tf is the output time constant assuming a step input (tf = Req·Ceq), and b is the fraction of the input swing in which the output changes (we used b = 0.5). For a falling input with a fall time of tfall, the above equation becomes:

delay = tf · sqrt( (ln(1 - vth/Vdd))^2 + 2·tfall·b·(vth/Vdd)/tf )

In this case, we used b = 0.4.

Figure 9: Example stage (a: schematic, an inverter driving a second gate through node x; b: equivalent circuit, a resistance Req driving a lumped capacitance Ceq at node x)
The delay of a gate is defined as the time between the input reaching the switching voltage (threshold voltage) of the gate, and the output reaching the threshold voltage of the following gate. If the gate drives a second gate with a different switching voltage, the above equations need to be modified slightly. If the switching voltage of the switching gate is vth1 and the switching voltage of the following gate is vth2, then:

delay = tf · sqrt( (ln(vth1/Vdd))^2 + 2·trise·b·(1 - vth1/Vdd)/tf ) + tf · ( ln(vth1/Vdd) - ln(vth2/Vdd) )
for a rising input, and
delay = tf · sqrt( (ln(1 - vth1/Vdd))^2 + 2·tfall·b·(vth1/Vdd)/tf ) + tf · ( ln(1 - vth1/Vdd) - ln(1 - vth2/Vdd) )
for a falling input.
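These four delay expressions collapse into one routine; a C sketch (the name and normalization are ours; voltages are fractions of Vdd, so setting vs2 = vs1 recovers the single-threshold forms):

    #include <math.h>

    /* Horowitz delay approximation [4]; vs1, vs2 are the switching
       voltages of this gate and the following gate, as fractions of Vdd.
       tin: input rise/fall time; tf: step-input time constant Req*Ceq. */
    double horowitz_delay(double tin, double tf, double vs1, double vs2,
                          int rising_input) {
        double a = tin / tf;
        if (rising_input) {
            double b = 0.5;
            return tf * sqrt(log(vs1)*log(vs1) + 2.0*a*b*(1.0 - vs1))
                 + tf * (log(vs1) - log(vs2));
        } else {
            double b = 0.4;
            return tf * sqrt(log(1.0-vs1)*log(1.0-vs1) + 2.0*a*b*vs1)
                 + tf * (log(1.0-vs1) - log(1.0-vs2));
        }
    }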
As described in Section 4.2, the size of the wordline driver depends on the number of
cells being driven. For a given array width, the capacitance driven by the wordline driver
can be estimated by summing the gate capacitance of each pass transistor being driven by
the wordline, as well as the metal capacitance of the line. Using this, and the desired rise
time, the required pull-up resistance of the driver can be estimated by:

Rp = -(desired rise time) / (Ceq · ln(0.5))

(recall that the desired rise time is assumed to be the time until the wordline reaches 50% of its maximum value).
Once Rp is found, the required transistor width can be found using the equation in Section 5.1. Since this "backwards analysis" did not take into account the non-zero input fall time, we then use Rp and the wordline capacitance to calculate the adjusted delay using Horowitz's equations as described earlier. These transistor widths are also used to estimate the delay of the final gate in the decoder.
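Combining the target rise time of Section 4.2 with the resistance relation of Section 5.1, the backwards sizing step looks roughly like this (a sketch; KRISE and the PMOS resistance constant are technology-dependent placeholders):

    #include <math.h>

    #define KRISE_S      0.4e-9    /* assumed rise-time constant, s */
    #define R_PCH_OHM_UM 22400.0   /* assumed PMOS ohm-um constant */

    /* Wordline driver pull-up width for the desired 50% rise time. */
    double wordline_driver_width_um(int B, int A, int Nspd, int Ndwl,
                                    double ceq_wordline_farads) {
        double cols  = 8.0 * B * A * Nspd / Ndwl;
        double trise = KRISE_S * log(cols) * 0.5;           /* Section 4.2 */
        double rp = -trise / (ceq_wordline_farads * log(0.5)); /* required R */
        return R_PCH_OHM_UM / rp;                           /* W = const / R */
    }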
5.6 RC-Tree Solutions
All but two of the stages along the cache's critical path can be approximated by simple first-order stages as in the previous section. The bitline and comparator equivalent circuits,
however, require more complex solutions. Figure 10 shows an equivalent circuit that can
be used for the bitline and comparator circuits. From [5], the delay of this circuit can be
written as:

Tstep = [R2·C2 + (R1 + R2)·C1] · ln(vstart/vend)

where vstart is the voltage at the beginning of the transition, and vend is the voltage at which the stage is considered to have "switched" (vstart > vend). For the comparator, vstart is Vdd and vend is the threshold voltage of the multiplexor driver. For the bitline subcircuit, vstart is the precharged voltage of the bitlines (vpre), and vend is the voltage which causes the sense amplifier output to fully switch (vpre - vsense).
Figure 10: Equivalent circuit for bitline and comparator (a second-order RC tree with resistances R1 and R2 and capacitances C1 and C2, arranged as in the Tstep equation above)
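As code, the step-input delay of this circuit is direct (a sketch; for the bitline, vstart = vpre and vend = vpre - vsense):

    #include <math.h>

    /* Second-order RC delay for the circuit of Figure 10 [5]. */
    double rc_tree_tstep(double r1, double r2, double c1, double c2,
                         double vstart, double vend) {
        return (r2*c2 + (r1 + r2)*c1) * log(vstart / vend);
    }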
When estimating the bitline delay, a non-zero wordline rise time is taken into account as follows. Figure 11 shows the wordline voltage as a function of time for a step input. The time difference Tstep shown on the graph is the time after the input rises (assuming a step input) until the output reaches vpre - vsense (the output voltage is not shown on the graph). An equation for Tstep is given above. During this time, we can consider the bitline being "driven" low. Because the current sourced by the access transistor can be approximated as

i ≈ gm·(Vgs - Vt)

the shaded area in the graph can be thought of as the amount of charge discharged before the output reaches vpre - vsense. This area can be calculated as:

area = Tstep · (Vdd - Vt)

(Vt is the voltage at which the NMOS transistor begins to conduct).
Figure 11: Step input (wordline voltage versus time; after the wordline passes Vt, the shaded area Tstep·(Vdd - Vt) between the waveform and Vt represents the charge discharged from the bitline)
If we assume that the same amount of "drive" is required to drive the output to vpre - vsense regardless of the shape of the input waveform, then we can calculate the output delay for an arbitrary input waveform. Consider Figure 12-a. If we assume the area is the same as in Figure 11, then we can calculate the value of T (delay adjusted for input rise time). Using simple algebra, it is easy to show that

T = sqrt( 2·Tstep·(Vdd - Vt) / m )

where m is the slope of the input waveform (this can be estimated using the delay of the previous stage). Note that the bitline delay is measured from the time the wordline reaches
Vt. Unlike the version of the model described in [3], the wordline rise time in this model is defined as the time until the bitlines begin to discharge; this happens when the wordline reaches Vt.

Figure 12: Non-zero input rise time (a: slow-rising input of slope m, which is still ramping when the required area Tstep·(Vdd - Vt) has been accumulated at time T; b: fast-rising input, which reaches Vdd before time T)
If the wordline rises quickly, as shown in Figure 12-b, then the algebra is slightly different. In this case,

T = Tstep + (Vdd - Vt) / (2·m)

The cross-over point between the two cases for T occurs when:

m = (Vdd - Vt) / (2·Tstep)
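The two cases and their cross-over combine into one helper (a sketch, names ours):

    #include <math.h>

    /* Bitline delay adjusted for a wordline ramp of slope m (Figure 12). */
    double adjusted_delay(double tstep, double m, double vdd, double vt) {
        double m_crit = (vdd - vt) / (2.0 * tstep);   /* cross-over slope */
        if (m <= m_crit)                              /* slow-rising, 12-a */
            return sqrt(2.0 * tstep * (vdd - vt) / m);
        return tstep + (vdd - vt) / (2.0 * m);        /* fast-rising, 12-b */
    }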
The non-zero input rise time of the comparator can be taken into account similarly. The delay of the comparator is composed of two parts: the delay of the timing chain and the delay discharging the output (see Figure 5). The delay of the first three inverters in the timing chain can be approximated using simple first-order RC stages as described in Section 5.5. The time to discharge the comparator output through the final inverter can be estimated using the equivalent circuit of Figure 10, taking into account the non-zero input rise time using the same technique that was used for the bitline subcircuit. In this case, the "input" is the output of the third inverter in the timing chain (we assume the timing chain is long enough that the an and bn lines are stable). The discharging delay of the comparator output is measured from the time the input reaches the threshold voltage of the final timing chain inverter. The equations for this case can be found in [3].
6 Total Access and Cycle Time
This section describes how delays of the model components described in Section 4 are
combined to estimate the cache read access and cycle times.
6.1 Access Time
There are two potential critical paths in a cache read access. If the time to read the tag
array, perform the comparison, and drive the multiplexor select signals is larger than the
time to read the data array, then the tag side is the critical path, while if it takes longer to
read the data array, then the data side is the critical path. In many cache implementations,
the designer would try to margin the cache design such that the tag path is slightly faster
than the data path so that the multiplexor select signals are valid by the time the data
is ready. Often, however, this is not possible. Therefore, either side could determine the
access time, meaning both sides must be modeled in detail.
In a direct-mapped cache, the access time is the larger of the two paths:

Taccess,dm = max(Tdataside + Toutdrive,data, Ttagside,dm + Toutdrive,valid)

where Tdataside is the delay of the decoder, wordline, bitline, and sense amplifier for the data array, Ttagside,dm is the delay of the decoder, wordline, bitline, sense amplifier, and comparator for the tag array, Toutdrive,data is the delay of the cache data output driver, and Toutdrive,valid is the delay of the valid signal driver.

In a set-associative cache, the tag array must be read before the data signals can be driven. Thus, the access time is:

Taccess,sa = max(Tdataside, Ttagside,sa) + Toutdrive,data

where Ttagside,sa is the same as Ttagside,dm, except that it includes the time to drive the select lines of the output multiplexors.
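In code form the two access-time equations reduce to a few lines (a sketch; the component delays come from the stage models of Section 5):

    static double max2(double a, double b) { return a > b ? a : b; }

    /* Read access time, direct-mapped (assoc == 1) or set-associative. */
    double access_time(int assoc, double t_dataside, double t_tagside,
                       double t_outdrive_data, double t_outdrive_valid) {
        if (assoc == 1)   /* the two paths are independent and race */
            return max2(t_dataside + t_outdrive_data,
                        t_tagside  + t_outdrive_valid);
        /* tag path must finish before the data output drivers turn on */
        return max2(t_dataside, t_tagside) + t_outdrive_data;
    }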
Figures 13 to 16 show analytical and Hspice estimates of the data and tag sides for direct-mapped and 4-way set-associative caches. A 0.8 µm CMOS process was assumed [6]. To gather these results, the model was first used to find the array organization parameters which resulted in the lowest access time via exhaustive search for each cache size. These
optimum parameters are shown in the figures (the six numbers associated with each point correspond to Ndwl, Ndbl, Nspd, Ntwl, Ntbl, and Ntspd in that order). The parameters were then used in the Hspice model. As the graphs show, the difference between the analytical and Hspice results is less than 6% in every case.

Figure 13: Direct mapped: Tdataside + Toutdrive,data (analytical vs. Hspice; cache sizes 2048 to 131072 bytes, B = 16 bytes, A = 1; each point labeled with the optimal Ndwl:Ndbl:Nspd and Ntwl:Ntbl:Ntspd)

Figure 14: Direct mapped: Ttagside,dm (analytical vs. Hspice; cache sizes 2048 to 131072 bytes, B = 16 bytes, A = 1)
6.2 Cycle Time
The difference between the access and cycle time of a cache varies widely depending on the circuit techniques used. Usually the cycle time is a modest percentage larger than the access time, but in pipelined or post-charge circuits [7, 8] the cycle time can be less than
the access time. We have chosen to model a more conventional structure with the cycle time equal to the access time plus the precharge.

Figure 15: 4-way set associative: Tdataside + Toutdrive,data (analytical vs. Hspice; cache sizes 2048 to 131072 bytes, B = 16 bytes, A = 4)

Figure 16: 4-way set associative: Ttagside,sa (analytical vs. Hspice; cache sizes 2048 to 131072 bytes, B = 16 bytes, A = 4)
There are three elements in our assumed cache organization that need to be precharged:
the decoders, the bitlines, and the comparator. The precharge times for these elements
are somewhat arbitrary, since the precharging transistors can be scaled in proportion to
the loads they are driving. We have assumed that the time for the wordline to fall and
bitline to rise in the data array is the dominant part of the precharge delay. Assuming
properly ratioed transistors in the wordline drivers, the wordline fall time is approximately
the same as the wordline rise time. It is assumed that the bitline precharging transistors are
scaled such that a constant (over all cache organizations) bitline charge time is obtained.
This constant will, of course, be technology dependent. In the model, we assume that this
constant is equal to four inverter delays (each with a fanout of four). Thus, the cycle time
of the cache can be written as:
Tcache = Taccess + Twordline delay + 4 · (inverter delay)
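Mirrored in code (a sketch; the fanout-of-four inverter delay is a technology input):

    /* Cycle time = access time + wordline fall + fixed bitline precharge. */
    double cycle_time(double t_access, double t_wordline, double t_inv_fo4) {
        return t_access + t_wordline + 4.0 * t_inv_fo4;
    }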
7 Applications of the Model
This section gives examples of how the analytical model can be used to quickly gather data
that can be used in architectural studies.
7.1 Cache Size
First consider Figure 17. These graphs show how the cache size affects the cache access and cycle times in a direct-mapped and a 4-way set-associative cache. In these graphs (and all graphs in this paper), bo = 64 and baddr = 32. For each cache size, the optimum array
organization parameters were found (these optimum parameters are shown in the graphs
as before; the six numbers associated with each point correspond to Ndwl , Ndbl , Nspd , Ntwl,
Ntbl, and Ntspd in that order), and the corresponding access and cycle times were plotted.
In addition, the graph breaks down the access time into several components.
There are several observations that can be made from the graphs. Starting from the
bottom, it is clear that the time through the data array decoders is always longer than the
time through the tag array decoders. For all but one of the organizations selected, there
are more data subarrays (Ndwl × Ndbl) than tag subarrays (Ntwl × Ntbl). This is because the
total tag storage is usually much less than the total data storage.
In all caches shown, the comparator is responsible for a significant portion of the access
time. Another interesting trend is that the tag side is always the critical path in the
cache access. In the direct-mapped cases, organizations are found which result in very
closely matched tag and data sides, while in the set-associative case, the paths are not
matched nearly as well. This is due primarily to the delay driving the select lines of the output
multiplexor.
Figure 17: Access/cycle time as a function of cache size (a: direct-mapped, cache sizes 4096 to 262144 bytes, B = 16 bytes, A = 1; b: set-associative, A = 4). Each graph plots the cycle time, the access time, and the component delays (tag and data decode, tag and data wordline, tag and data bitline & sense, compare, and the data/tag side or mux driver & select inverter totals), with each point labeled with the optimal Ndwl:Ndbl:Nspd and Ntwl:Ntbl:Ntspd.
7.2 Block Size
Figure 18 shows how the access and cycle times are affected by the block size (the cache size is kept constant). In the direct-mapped graph, the access and cycle times drop as the block size increases. Most of this is due to a drop in the decoder delay (a larger block decreases the depth of each array and reduces the number of tags required).
In the set-associative case, the access and cycle times begin to increase as the block size grows beyond 32 bytes. This is due to the output driver; a larger block size means more drivers share the same cache output line, so there is more loading at the output of each driver. This trend can also be seen in the direct-mapped case, but it is much less pronounced. The number of output drivers that share a line is proportional to A, so the proportion of the total output capacitance that is the drain capacitance of other output drivers is smaller in a direct-mapped cache than in the 4-way set-associative cache. Also, in the direct-mapped case, the slower output driver only affects the data side, and it is the tag side that dictates
the access time in all the organizations shown.

Figure 18: Access/cycle time as a function of block size (a: direct-mapped, b: set-associative; block sizes 4 to 64 bytes, C = 16 Kbytes, A = 1 or 4; the same delay components as in Figure 17, with each point labeled with the optimal array organization parameters)
7.3 Associativity
Finally, consider Figure 19, which shows how the associativity affects the access and cycle time of a 16KB and a 64KB cache. As can be seen, there is a significant step between a direct-mapped and a 2-way set-associative cache, but a much smaller jump between a 2-way and a 4-way cache (this is especially true in the larger cache). As the associativity increases further, the access and cycle time begin to increase more dramatically.
The real cost of associativity can be seen by looking at the tag path curve in either
graph. For a direct-mapped cache, this is simply the time to output the valid signal, while
in a set-associative cache, this is the time to drive the multiplexor select signals. Also, in
a direct-mapped cache, the output driver time is hidden by the time to read and compare
the tag. In a set-associative cache, the tag array access, compare, and multiplexor driver
must be completed before the output driver can begin to send the result out of the cache.

Figure 19: Access/cycle time as a function of associativity (a: 16KB cache, b: 64KB cache; associativities 1 to 32, B = 16 bytes; each graph shows cycle time, access time, the tag path, and the component delays, with each point labeled with the optimal array organization parameters)
Looking at the 16KB cache results, there seems to be an anomaly in the data array decoder at A = 2. This is due to a larger Ndwl at this point. This does not affect the overall access time, however, since the tag access is the critical path.
8 Conclusions
In this paper, we have presented an analytical model for the access and cycle time of a cache. By comparing the model to an Hspice model, CACTI was shown to be accurate to within 6%. Its computational cost, however, is considerably less than that of Hspice.
Although previous models have also given results close to Hspice predictions, the underlying assumptions in previous models have been very different from typical on-chip cache memories. Our extended and improved model fixes many major shortcomings of previous
models. Our model includes the tag array, comparator, and multiplexor drivers; non-step stage input slopes; rectangular stacking of memory subarrays; a transistor-level decoder model; column-multiplexed bitlines; an additional array organizational parameter; and load-dependent transistor sizes for wordline drivers. It also produces cycle times as well as access times. This makes the model much closer to real memories.
It is dangerous to draw too many conclusions directly from the graphs without considering miss rate data. Figure 19 seems to imply that a direct-mapped cache is always the best. While it is always the fastest, it is important to remember that the direct-mapped cache will have the lowest hit rate. Hit rate data obtained from a trace-driven simulation
(or some other means) must be included in the analysis before the various cache alternatives
can be fairly compared. Similarly, a small cache has a lower access time, but will also have
a lower hit rate. In [9], it was found that when the hit rate and cycle time are both taken
into account, there is an optimum cache size between the two extremes.
Appendix A: Obtaining and Using the CACTI Software
A program that implements the CACTI model described in this paper is available. To
obtain the software, log into gatekeeper.dec.com using anonymous ftp (use "anonymous" as the login name and your machine name as the password). The files for the program are stored together in "/archive/pub/DEC/cacti.tar.Z". Get this file, "uncompress" it, and extract the files using "tar".
The program consists of a number of C files; time.c contains the model. Transistor widths and process parameters are defined in def.h. A makefile is provided to compile the program.
Once the program is compiled, it can be run using the command:
cacti C B A
where C is the cache size (in bytes), B is the block size (in bytes), and A is the associativity.
The output width and internal address width can be changed in def.h.
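For example, to model a 16-Kbyte direct-mapped cache with 16-byte blocks (an illustrative invocation):

    cacti 16384 16 1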
When the program is run, it will consider all reasonable values for the array organization
parameters (discussed in Section 3) and choose the organization that gives the smallest
access time. The values of the array organization parameters chosen are included in the
output report.
The extended description of the model details is available on the World Wide Web at URL "http://nsl.pa.dec.com/wrl/techreports/93.5.html#93.5".
Acknowledgements
The authors wish to thank Stefanos Sidiropoulos, Russell Kao, and Ken Schultz for their
helpful comments on various drafts of the manuscript. Financial support was provided
by the Natural Sciences and Engineering Research Council of Canada and the Walter C.
Sumner Memorial Foundation.
References
[1] J. M. Mulder, N. T. Quach, and M. J. Flynn, "An area model for on-chip memories and its application," IEEE Journal of Solid-State Circuits, Vol. 26, No. 2, pp. 98-106, Feb. 1991.

[2] T. Wada, S. Rajan, and S. A. Przybylski, "An analytical access time model for on-chip cache memories," IEEE Journal of Solid-State Circuits, Vol. 27, No. 8, pp. 1147-1156, Aug. 1992.

[3] S. J. Wilton and N. P. Jouppi, "An access and cycle time model for on-chip caches," Tech. Rep. 93/5, Digital Equipment Corporation Western Research Lab, 1994.

[4] M. A. Horowitz, "Timing models for MOS circuits," Tech. Rep. SEL83-003, Integrated Circuits Laboratory, Stanford University, 1983.

[5] J. Rubinstein, P. Penfield, and M. A. Horowitz, "Signal delay in RC tree networks," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 2, No. 3, pp. 202-211, July 1983.

[6] M. G. Johnson and N. P. Jouppi, "Transistor model for a synthetic 0.8um CMOS process," Class notes for Stanford University EE371, Spring 1990.

[7] R. J. Proebsting, "Post charge logic permits 4ns 18K CMOS RAM," 1987.

[8] T. I. Chappell, B. A. Chappell, S. E. Schuster, J. W. Allen, S. P. Klepner, R. V. Joshi, and R. L. Franch, "A 2ns cycle, 3.8ns access 512kb CMOS ECL RAM with a fully pipelined architecture," IEEE Journal of Solid-State Circuits, Vol. 26, No. 11, pp. 1577-1585, Nov. 1991.

[9] N. P. Jouppi and S. J. Wilton, "Tradeoffs in two-level on-chip caching," in Proceedings of the 21st Annual International Symposium on Computer Architecture, pp. 34-45, April 1994.