IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 1, JANUARY 2013
An Energy-Efficient L2 Cache Architecture Using
Way Tag Information Under Write-Through Policy
Jianwei Dai and Lei Wang, Senior Member, IEEE
Abstract—Many high-performance microprocessors employ the cache write-through policy for performance improvement while at the same time achieving good tolerance to soft errors in on-chip caches. However, the write-through policy also incurs a large energy
overhead due to the increased accesses to caches at the lower
level (e.g., L2 caches) during write operations. In this paper,
we propose a new cache architecture referred to as way-tagged
cache to improve the energy efficiency of write-through caches.
By maintaining the way tags of L2 cache in the L1 cache during
read operations, the proposed technique enables L2 cache to
work in an equivalent direct-mapping manner during write hits,
which account for the majority of L2 cache accesses. This leads
to significant energy reduction without performance degradation.
Simulation results on the SPEC CPU2000 benchmarks demonstrate that the proposed technique achieves 65.4% energy savings
in L2 caches on average with only 0.02% area overhead and no
performance degradation. Similar results are also obtained under
different L1 and L2 cache configurations. Furthermore, the idea
of way tagging can be applied to existing low-power cache design
techniques to further improve energy efficiency.
Index Terms—Cache, low power, write-through policy.
I. INTRODUCTION
MULTI-LEVEL on-chip cache systems have been widely
adopted in high-performance microprocessors [1]–[3].
To keep data consistent throughout the memory hierarchy,
write-through and write-back policies are commonly employed.
Under the write-back policy, a modified cache block is copied
back to its corresponding lower level cache only when the
block is about to be replaced. Under the write-through
policy, all copies of a cache block are updated immediately after
the cache block is modified at the current cache, even though
the block might not be evicted. As a result, the write-through
policy maintains identical data copies at all levels of the cache
hierarchy throughout most of the execution lifetime. This
feature is important as CMOS technology is scaled into the
nanometer range, where soft errors have emerged as a major
reliability issue in on-chip cache systems. It has been reported that single-event multi-bit upsets are getting worse in on-chip memories [7]–[9]. Currently, this problem has been addressed at different levels of design abstraction. At the architecture level, an effective solution is to keep data consistent among different levels of the memory hierarchy to prevent the system from collapsing due to soft errors [10]–[12]. Benefiting from the immediate update, the cache write-through policy is inherently tolerant to soft errors because the data at all related levels of the cache hierarchy are always kept consistent. Due to this feature, many high-performance microprocessor designs have adopted the write-through policy [13]–[15].

Manuscript received March 28, 2011; revised July 15, 2011 and October 08, 2011; accepted December 07, 2011. Date of publication January 26, 2012; date of current version December 19, 2012. This work was supported in part by the National Science Foundation under Grant CNS-0954037 and Grant CNS-1127084, and in part by the University of Connecticut Faculty Research Grant 443874.
J. Dai is with Intel Corporation, Hillsboro, OR 97124 USA (e-mail: jianwei.dai@engr.uconn.edu).
L. Wang is with the Department of Electrical and Computer Engineering, University of Connecticut, Storrs, CT 06269 USA (e-mail: leiwang@engr.uconn.edu).
Digital Object Identifier 10.1109/TVLSI.2011.2181879
While enabling better tolerance to soft errors, the
write-through policy also incurs large energy overhead.
This is because under the write-through policy, caches at the
lower level experience more accesses during write operations.
Consider a two-level (i.e., Level-1 and Level-2) cache system
for example. If the L1 data cache implements the write-back
policy, a write hit in the L1 cache does not need to access the L2
cache. In contrast, if the L1 cache is write-through, then both
L1 and L2 caches need to be accessed for every write operation.
Obviously, the write-through policy incurs more write accesses
in the L2 cache, which in turn increases the energy consumption
of the cache system. Power dissipation is now considered
one of the critical issues in cache design. Studies have shown
that on-chip caches can consume about 50% of the total power
in high-performance microprocessors [4]–[6].
In this paper, we propose a new cache architecture, referred
to as way-tagged cache, to improve the energy efficiency of
write-through cache systems with minimal area overhead and
no performance degradation. Consider a two-level cache hierarchy, where the L1 data cache is write-through and the L2 cache
is inclusive for high performance. It is observed that all the data
residing in the L1 cache will have copies in the L2 cache. In
addition, the locations of these copies in the L2 cache will not
change until they are evicted from the L2 cache. Thus, we can
attach a tag to each way in the L2 cache and send this tag information to the L1 cache when the data is loaded to the L1 cache.
By doing so, for all the data in the L1 cache, we will know exactly the locations (i.e., ways) of their copies in the L2 cache.
During the subsequent accesses when there is a write hit in the
L1 cache (which also initiates a write access to the L2 cache
under the write-through policy), we can access the L2 cache in
an equivalent direct-mapping manner because the way tag of the
data copy in the L2 cache is available. As this operation accounts
for the majority of L2 cache accesses in most applications, the
energy consumption of L2 cache can be reduced significantly.
The basic idea of way-tagged cache was initially proposed in
our past work [26] with some preliminary results. In this paper,
we extend this work by making the following contributions.
First, a detailed VLSI architecture of the proposed way-tagged
cache is developed, where various design issues regarding
timing, control logic, operating mechanisms, and area overhead
have been studied. Second, we demonstrate that the idea of way
tagging can be extended to many existing low-power cache
design techniques so that better tradeoffs of performance and
energy efficiency can be achieved. Third, a detailed energy
model is developed to quantify the effectiveness of the proposed technique. Finally, a comprehensive suite of simulations
is performed with new results covering the effectiveness of the
proposed technique under different cache configurations. It is
also shown that the proposed technique can be integrated with
existing low-power cache design techniques to further improve
energy efficiency.
The rest of this paper is organized as follows. In Section II, we
provide a review of related low-power cache design techniques.
In Section III, we present the proposed way-tagged cache. In
Section IV, we discuss the detailed VLSI architecture of the
way-tagged cache. Section V extends the idea of way tagging
to existing cache design techniques to further improve energy
efficiency. An energy model is presented in Section VI to study
the effectiveness of the proposed technique. Simulation results
are given in Section VII.
II. RELATED WORK
Many techniques have been developed to reduce cache
power dissipation. In this section, we briefly review some
existing work related to the proposed technique.
In [16], Su et al. partitioned cache data arrays into subbanks. During each access, only the subbank containing the
desired data is activated. Ghose et al. further divided cache
bitlines into small segmentations [17]. When a memory cell
is accessed, only the associated bitline segmentations are
evaluated. By modifying the structure of cache systems, these
techniques effectively reduce the energy per access without
changing cache architectures. At the architecture level, most
work focuses on set-associative caches due to their low miss
rates. In conventional set-associative caches, all tag and data
arrays are accessed simultaneously for performance consideration. This, however, comes at the cost of energy overhead.
Many techniques have been proposed to reduce the energy
consumption of set-associative caches. The basic idea is to
activate fewer tag and data arrays during an access, so that
cache power dissipation can be reduced. In the phased cache
[18] proposed by Hasegawa et al., one cache access is divided
into two phases. Cache tag arrays are accessed in the first phase
while in the second phase only the data array corresponding to
the matched tag, if any, is accessed. Energy consumption can
be reduced because at most one data array is accessed, as compared to $N$ data arrays in a conventional $N$-way set-associative cache. Due to the increase in access cycles, phased caches are usually employed in the lower-level memory to
minimize the performance impact. Another technique referred
to as way concatenation was proposed by Zhang et al. [19]
to reduce the cache energy in embedded systems. With the
necessary software support, this cache can be configured as direct-mapping, two-way, or four-way set-associative according
to the system requirement. By accessing fewer tag and data
arrays, better energy efficiency is attained. Although effective
for embedded systems, this technique may not be suitable
for high-performance general purpose microprocessors due to
the induced performance overhead. Other techniques include
way-predicting set-associative caches, proposed by Inoue et al.
[20]–[22], that make a prediction on the ways of both tag and
data arrays in which the desired data might be located. If the
prediction is correct, only one way is accessed to complete the
operation; otherwise, the remaining ways of the cache are accessed to
collect the desired data. Because of the improved energy efficiency, many way-prediction based techniques are employed in
microprocessor designs. Another similar approach proposed by
Min et al. [23] employs a redundant cache (referred to as the location cache) to predict the incoming cache references. The location cache needs to be triggered for every operation in the L1 cache (including both read and write accesses), which wastes energy
if the hit rate of L1 cache is high.
Among the above related work, phased caches and way-predicting caches are commonly used in high-performance
microprocessors. Compared with these techniques, the proposed way-tagged cache achieves better energy efficiency with
no performance degradation. Specifically, the basic idea of
way-predicting caches is to keep a small number of the most
recently used (MRU) addresses and make a prediction based
on these stored addresses. Since L2 caches are usually unified
caches, the MRU-based prediction has a poor prediction rate
[24], [25], and mispredictions introduce performance degradation. In addition, applying way prediction to L2 caches
introduces large overheads in timing and area [23]. For phased
caches, the energy consumption of accessing tag arrays accounts for a significant portion of total L2 cache energy. As
shown in Section V, applying the proposed technique of way
tagging can reduce this energy consumption. Section VII-D
provides more details comparing the proposed technique with
these related techniques.
III. WAY-TAGGED CACHE
In this section, we propose a way-tagged cache that exploits
the way information in L2 cache to improve energy efficiency.
We consider a conventional set-associative cache system in which, when the L1 data cache loads/writes data from/into the L2 cache, all ways in the L2 cache are activated simultaneously for
performance consideration at the cost of energy overhead. In
Section V, we will extend this technique to L2 caches with
phased tag-data accesses.
Fig. 1 illustrates the architecture of the two-level cache. Only
the L1 data cache and L2 unified cache are shown as the L1
instruction cache only reads from the L2 cache. Under the write-through policy, the L2 cache always maintains the most recent copy of the data. Thus, whenever data is updated in the L1
cache, the L2 cache is updated with the same data as well. This
results in an increase in the write accesses to the L2 cache and
consequently more energy consumption.
Here we examine some important properties of write-through
caches through statistical characterization of cache accesses.
TABLE I
EQUIVALENT L2 ACCESS MODES UNDER DIFFERENT OPERATIONS IN THE L1 CACHE

Fig. 1. Illustration of the conventional two-level cache architecture.

Fig. 2. Read and write accesses in the L2 cache running SPEC CPU2000 benchmarks.

Fig. 2 shows the simulation results of L2 cache accesses based on the SPEC CPU2000 benchmarks [30]. These results are obtained from SimpleScalar (http://www.simplescalar.com/) for the cache configuration given in
Section VII-A. Unlike the L1 cache where read operations account for a large portion of total memory accesses, write operations are dominant in the L2 cache for all but three benchmarks
(galgel, ammp, and art). This is because read accesses in the
L2 cache are initiated by the read misses in the L1 cache, which
typically occur much less frequently (the miss rate is less than
5% on average [27]). For galgel, ammp, and art, L1 read
miss rates are high resulting in more read accesses than write
accesses. Nevertheless, write accesses still account for about
20%–40% of the total accesses in the L2 cache. From the results
in Section VII, each L2 read or write access consumes roughly
the same amount of energy on average. Thus, reducing the energy consumption of L2 write accesses is an effective way for
memory power management.
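For readers who wish to reproduce this kind of characterization, the short sketch below counts the L2 reads and writes generated by a write-through L1 data cache from a memory-reference trace. It is only a rough illustration: the trace format and the no-write-allocate choice are assumptions made here, and the numbers in Fig. 2 were obtained with SimpleScalar rather than this code.

```python
# Sketch: classify L2 accesses produced by a write-through L1 data cache.
# Hypothetical trace format: ("R"|"W", byte address); not the SimpleScalar format.
from collections import OrderedDict

def l2_access_mix(trace, l1_size=16 * 1024, line_size=64, ways=4):
    """Return (L2 reads, L2 writes) for a write-through, no-write-allocate L1."""
    sets = l1_size // (line_size * ways)
    lru = [OrderedDict() for _ in range(sets)]      # per-set tags in LRU order
    l2_reads = l2_writes = 0
    for op, addr in trace:
        block = addr // line_size
        idx, tag = block % sets, block // sets
        hit = tag in lru[idx]
        if hit:
            lru[idx].move_to_end(tag)               # refresh LRU position
        if op == "W":
            l2_writes += 1                          # write-through: every store reaches L2
        elif not hit:
            l2_reads += 1                           # read miss: fetch the block from L2
            lru[idx][tag] = True
            if len(lru[idx]) > ways:
                lru[idx].popitem(last=False)        # evict the least recently used line
    return l2_reads, l2_writes

# Tiny synthetic example trace
print(l2_access_mix([("R", 0x0), ("W", 0x0), ("W", 0x40), ("R", 0x10000)]))
```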
As explained in the introduction, the locations (i.e., way tags)
of L1 data copies in the L2 cache will not change until the data
are evicted from the L2 cache. The proposed way-tagged cache
exploits this fact to reduce the number of ways accessed during
L2 cache accesses. When the L1 data cache loads data from the L2 cache, the way tag of the data in the L2 cache is also sent
to the L1 cache and stored in a new set of way-tag arrays (see
details of the implementation in Section IV). These way tags
provide the key information for the subsequent write accesses
to the L2 cache.
In general, both write and read accesses in the L1 cache may
need to access the L2 cache. These accesses lead to different
operations in the proposed way-tagged cache, as summarized in
Table I. Under the write-through policy, all write operations of
the L1 cache need to access the L2 cache. In the case of a write
hit in the L1 cache, only one way in the L2 cache will be activated because the way tag information of the L2 cache is available, i.e., from the way-tag arrays we can obtain the L2 way of
the accessed data. While for a write miss in the L1 cache, the
requested data is not stored in the L1 cache. As a result, its corresponding L2 way information is not available in the way-tag
arrays. Therefore, all ways in the L2 cache need to be activated simultaneously. Since write hit/miss is not known a priori,
the way-tag arrays need to be accessed simultaneously with all
L1 write operations in order to avoid performance degradation.
Note that the way-tag arrays are very small and the involved energy overhead can be easily compensated for (see Section VII).
For L1 read operations, neither read hits nor misses need to access the way-tag arrays. This is because read hits do not need to
access the L2 cache; while for read misses, the corresponding
way tag information is not available in the way-tag arrays. As
a result, all ways in the L2 cache are activated simultaneously
under read misses.
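The access modes of Table I can be restated compactly in code. The following Python sketch is illustrative only; the enum and function names are ours, and an 8-way L2 cache is assumed.

```python
# Sketch of Table I: which L2 ways are activated for each L1 cache event.
from enum import Enum, auto

class L1Event(Enum):
    WRITE_HIT = auto()
    WRITE_MISS = auto()
    READ_HIT = auto()
    READ_MISS = auto()

def l2_ways_to_activate(event, way_tag=None, n_l2_ways=8):
    """Return the set of L2 way indices to activate (empty set = no L2 access)."""
    if event is L1Event.READ_HIT:
        return set()                       # the L2 cache is not accessed at all
    if event is L1Event.WRITE_HIT:
        assert way_tag is not None         # tag was stored when the line was filled
        return {way_tag}                   # equivalent direct-mapping access: one way
    return set(range(n_l2_ways))           # write miss / read miss: all ways

print(l2_ways_to_activate(L1Event.WRITE_HIT, way_tag=3))   # {3}
print(l2_ways_to_activate(L1Event.READ_MISS))              # {0, 1, ..., 7}
```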
From Fig. 2, write accesses account for the majority of L2
cache accesses in most applications. In addition, write hits are
dominant among all write operations. Therefore, by activating
fewer ways in most of the L2 write accesses, the proposed way-tagged cache is very effective in reducing memory energy consumption.
Fig. 3 shows the system diagram of the proposed way-tagged
cache. We introduce several new components: way-tag arrays,
way-tag buffer, way decoder, and way register, all shown in the
dotted line. The way tags of each cache line in the L2 cache
are maintained in the way-tag arrays, located with the L1 data
cache. Note that write buffers are commonly employed in write-through caches (and even in many write-back caches) to improve performance. With a write buffer, the data to be written
into the L1 cache is also sent to the write buffer. The operations
stored in the write buffer are then sent to the L2 cache in sequence. This avoids write stalls when the processor waits for
write operations to be completed in the L2 cache. In the proposed technique, we also need to send the way tags stored in
the way-tag arrays to the L2 cache along with the operations in
the write buffer. Thus, a small way-tag buffer is introduced to
buffer the way tags read from the way-tag arrays. A way decoder is employed to decode way tags and generate the enable
signals for the L2 cache, which activate only the desired ways in the L2 cache. Each way in the L2 cache is encoded into a way tag. A way register stores way tags and provides this information to the way-tag arrays.

Fig. 3. Proposed way-tagged cache.

Fig. 4. Way-tag arrays.

TABLE II
OPERATIONS OF WAY-TAG ARRAYS
IV. IMPLEMENTATION OF WAY-TAGGED CACHE
In this section, we discuss the implementation of the proposed
way-tagged cache.
A. Way-Tag Arrays
In the proposed way-tagged cache, each cache line in the L1
cache keeps its L2 way tag information in the corresponding
entry of the way-tag arrays, as shown in Fig. 4, where only
one L1 data array and the associated way-tag array are shown
for simplicity. When data is loaded from the L2 cache to the
L1 cache, the way tag of the data is written into the way-tag
array. At a later time when updating this data in the L1 data
cache, the corresponding copy in the L2 cache needs to be updated as well under the write-through policy. The way tag stored
in the way-tag array is read out and forwarded to the way-tag
buffer (see Section IV-B) together with the data from the L1
data cache. Note that the data arrays in the L1 data cache and
the way-tag arrays share the same address as the mapping between the two is exclusive. The write/read signal of way-tag
arrays, WRITEH_W, is generated from the write/read signal
of the data arrays in the L1 data cache as shown in Fig. 4.
A control signal referred to as UPDATE is obtained from the
cache controller. When the write access to the L1 data cache is caused by an L1 cache miss, UPDATE is asserted and allows WRITEH_W to enable the write operation to the way-tag arrays (see Table II). If a STORE instruction accesses the L1 data cache, UPDATE remains deasserted and WRITEH_W indicates a read operation to the way-tag arrays. During the read operations of the L1 cache, the way-tag arrays do not need to be accessed and thus are deactivated to reduce energy overhead. To achieve this, the wordline selection signals generated by the decoder are disabled by WRITEH through AND gates. The above operations are summarized in Table II.
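The control just described amounts to a small piece of combinational logic. The sketch below is illustrative only; the signal polarities (UPDATE asserted on a miss-induced fill, WRITEH active during L1 data-array writes) are assumptions consistent with the text, not the actual RTL.

```python
# Sketch of the way-tag-array control summarized in Table II (polarities assumed).
def way_tag_array_control(writeh, update):
    """writeh: an L1 data-array write is in progress.
    update: that write is a fill caused by an L1 cache miss.
    Returns (array_enabled, array_write) for the way-tag arrays."""
    enabled = writeh            # L1 reads leave the way-tag arrays deactivated
    write = writeh and update   # miss-induced fill: store the new way tag
    # writeh and not update -> STORE: read the stored way tag out for the L2 write
    return enabled, write

# L1 read: arrays idle; STORE: way-tag read; fill from L2: way-tag write
for writeh, update in [(False, False), (True, False), (True, True)]:
    print(way_tag_array_control(writeh, update))
```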
Note that the proposed technique does not change the cache
replacement policy. When a cache line is evicted from the L2
cache, the status of the cache line changes to “invalid” to avoid
future fetching and thus prevent cache coherence issues. A
read or write operation to this cache line will lead to a miss,
which can be handled by the proposed way-tagged cache (see
Section III). Since the way-tag arrays will be accessed only when data is written into the L1 data cache (either when the CPU updates data in the L1 data cache or when data is loaded from the L2 cache), they are not affected by cache misses.
It is important to minimize the overhead of way-tag arrays.
The size of a way-tag array can be expressed as

$\mathrm{SIZE}_{\text{way-tag array}} = \dfrac{S_{L1}}{L_{L1} \cdot N_{L1}} \cdot B_{way}$  (1)

where $S_{L1}$, $L_{L1}$, and $N_{L1}$ are the size of the L1 data cache, the cache line size, and the number of ways in the L1 data cache, respectively. Each way in the L2 cache is represented by $B_{way} = \lceil \log_2 N_{L2} \rceil$ bits, assuming the binary code is applied, where $N_{L2}$ is the number of ways in the L2 cache. As shown in (1), the overhead increases linearly with the size of the L1 data cache $S_{L1}$ and sublinearly with the number of ways in the L2 cache $N_{L2}$. In addition, since $B_{way}$ is very small compared with the cache line size (i.e., $B_{way} \ll 8L_{L1}$), the overhead accounts for a very small portion of the L1 data cache. Clearly, the proposed technique shows good scalability trends with increasing L1 and L2 cache sizes.
As an example, consider a two-level cache hierarchy where
the L1 data cache and instruction cache are both 16 kB 2-way
set-associative with cache line size of 32 B. The L2 cache is
4-way set-associative with 32 kB and each cache line has 64
B. Thus, $S_{L1} = 16$ kB, $L_{L1} = 32$ B, $N_{L1} = 2$, and $N_{L2} = 4$. The size of each way-tag array is $\frac{16\ \text{kB}}{32\ \text{B} \times 2} \times \lceil \log_2 4 \rceil = 512$ bits, and two way-tag arrays are needed for the L1 data cache. This introduces an overhead of only 1024 bits, i.e., about 0.78% of the 16-kB L1 data cache, or about 0.2% of the entire 64-kB L1 and L2 caches.

Fig. 5. Way-tag buffer.

Fig. 6. Timing diagram of way-tag buffer.
To avoid performance degradation, the way-tag arrays are
operated in parallel with the L1 data cache. Due to their small
size, the access delay is much smaller than that of the L1 cache.
On the other hand, the way-tag arrays share the address lines
with the L1 data cache. Therefore, the fan-out of address lines
will increase slightly. This effect can be well-managed via
careful floorplan and layout during the physical design. Thus,
the way-tag arrays will not create new critical paths in the L1
cache. Note that accessing way-tag arrays will also introduce a
small amount of energy overhead. However, the energy savings
achieved by the proposed technique can offset this overhead,
as shown in Section VII.
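As an illustration of (1) and the example above, the short Python sketch below computes the way-tag storage for both the example configuration and the baseline configuration of Section VII-A; the function name and unit handling are ours.

```python
# Sketch: evaluate (1), the total way-tag-array storage overhead, in bits.
from math import ceil, log2

def way_tag_overhead_bits(l1_size, l1_line, l1_ways, l2_ways):
    """One entry per L1 data-cache line, ceil(log2(l2_ways)) bits per entry."""
    entries_per_array = l1_size // (l1_line * l1_ways)   # lines per L1 way
    bits_per_entry = ceil(log2(l2_ways))                 # binary-encoded way tag
    return l1_ways * entries_per_array * bits_per_entry  # l1_ways arrays in total

# Example from Section IV-A: 16-kB 2-way L1 (32-B lines), 4-way L2
ex = way_tag_overhead_bits(16 * 1024, 32, 2, 4)
print(ex, ex / (16 * 1024 * 8))                 # 1024 bits, ~0.78% of the L1 data cache

# Baseline of Section VII-A: 16-kB 4-way L1 (64-B lines), 8-way L2
base = way_tag_overhead_bits(16 * 1024, 64, 4, 8)
print(base, base / ((16 + 16 + 512) * 1024 * 8))  # 768 bits, ~0.02% of all caches
```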
B. Way-Tag Buffer
The way-tag buffer temporarily stores the way tags read from the
way-tag arrays. The implementation of the way-tag buffer is
shown in Fig. 5. It has the same number of entries as the write
buffer of the L2 cache and shares the control signals with it.
Each entry of the way-tag buffer has $B_{way}$ bits, where $B_{way}$ is the line size of the way-tag arrays. An additional status bit indicates
whether the operation in the current entry is a write miss on
the L1 data cache. When a write miss occurs, all the ways in
the L2 cache need to be activated as the way information is
not available. Otherwise, only the desired way is activated. The
status bit is updated with the read operations of way-tag arrays
at the same clock cycle.
Similar to the write buffer of the L2 cache, the way-tag buffer
has separate write and read logic in order to support parallel
write and read operations. The write operations in the way-tag
buffer always occur one clock cycle later than the corresponding
write operations in the write buffer. This is because the write
buffer, L1 cache, and way-tag arrays are all updated at the same
clock cycle when a STORE instruction accesses the L1 data
cache (see Fig. 4). Since the way tag to be sent to the way-tag
buffer comes from the way-tag arrays, this tag will be written
into the way-tag buffer one clock cycle later. Thus, the write
signal of the way-tag buffer can be generated by delaying the
write signal of the write buffer by one clock cycle, as shown in
Fig. 5.
The proposed way-tagged cache needs to send the operation
stored in the write buffer along with its way tag to the L2 cache.
This requires sending the data in the write buffer and its way
tag in the way-tag buffer at the same time. However, simply
using the same read signal for both the write buffer and the
way-tag buffer might cause write/read conflicts in the way-tag
buffer. This problem is shown in Fig. 6. Assume that at the $n$th clock cycle an operation is stored into the write buffer while the way-tag buffer is empty. At the $(n+1)$th clock cycle, a read
signal is sent to the write buffer to get the operation while its
way tag just starts to be written into the way-tag buffer. If the
same read signal is used by the way-tag buffer, then read and
write will target the same location of the way-tag buffer at the
same time, causing a data hazard.
One way to fix this problem is to insert one cycle delay to
the write buffer. This, however, will introduce a performance
penalty. In this paper, we propose to use a bypass multiplexer
(MUX in Fig. 5) between the way-tag arrays and the L2 cache.
If an operation in the write buffer is ready to be processed while
the way-tag buffer is still empty, we bypass the way-tag buffer
and send the way tag directly to the L2 cache. The EMPTY
signal of the way-tag buffer is employed as the enable signal
for read operations; i.e., when the way-tag buffer is empty, a
read operation is not allowed. During normal operations, the
write operation and the way tag will be written into the write
buffer and way-tag buffer, respectively. Thus, when this write
operation is ready to be sent to the L2 cache, the corresponding
way tag is also available in the way-tag buffer, both of which can
be sent together, as indicated in Fig. 6. With
this bypass multiplexer, no performance overhead is incurred.
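The timing relationship between the write buffer and the way-tag buffer, including the bypass path, can be modeled at a high level as follows. This is a behavioral sketch under our own simplifying assumptions (one store per cycle, unbounded queues), not the actual buffer logic.

```python
# Sketch of the write-buffer / way-tag-buffer timing with the bypass MUX of Fig. 5.
from collections import deque

class WayTagPath:
    def __init__(self):
        self.write_buf = deque()   # pending L2 write operations
        self.tag_buf = deque()     # way tags, written one cycle later than the data
        self.pending_tag = None    # tag currently on the way-tag-array output

    def store(self, data, way_tag):
        self.write_buf.append(data)   # cycle n: data enters the write buffer
        self.pending_tag = way_tag    # tag appears on the array output this cycle

    def cycle(self):
        """Issue at most one L2 write and do the delayed way-tag-buffer write."""
        issued, bypassed = None, False
        if self.write_buf:
            data = self.write_buf.popleft()
            if self.tag_buf:
                tag = self.tag_buf.popleft()   # normal path: tag already buffered
            else:
                tag, bypassed = self.pending_tag, True  # bypass MUX: take the tag directly
            issued = (data, tag)
        if self.pending_tag is not None and not bypassed:
            self.tag_buf.append(self.pending_tag)       # delayed write into the tag buffer
        self.pending_tag = None
        return issued

p = WayTagPath()
p.store("A", 2)
print(p.cycle())   # ('A', 2): issued via the bypass path, no extra cycle of delay
```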
C. Way Decoder
The function of the way decoder is to decode way tags and activate only the desired ways in the L2 cache. As the binary code
is employed, the line size of the way-tag arrays is $B_{way} = \lceil \log_2 N_{L2} \rceil$ bits, where $N_{L2}$ is the number of ways in the L2 cache. This minimizes the energy overhead from the additional wires, and the impact on chip area is negligible. For an L2 write access caused by a write hit in the L1 cache, the way decoder works as a $B_{way}$-to-$N_{L2}$ decoder
that selects just one way-enable signal. The technique proposed
in [19] can be employed to utilize the way-enable signal to activate the corresponding way in the L2 cache. The way decoder
operates simultaneously with the decoders of the tag and data arrays in the L2 cache. For a write miss or a read miss in the L1 cache, we need to assert all way-enable signals so that all ways in the L2 cache are activated. To achieve this, the way decoder can be implemented by the circuit shown in Fig. 7. Two signals, read and write miss, determine the operation mode of the way decoder. Signal read will be "1" when a read access is sent to the L2 cache. Signal write miss will be "1" if the write operation accessing the L2 cache is caused by a write miss in the L1 cache.

Fig. 7. Implementation of the way decoder.

Fig. 8. Architecture of the WT-based phased access cache.

Fig. 9. Operation modes of the WT-based phased access cache.
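A behavioral sketch of the way decoder follows; the function signature is ours, and an 8-way L2 cache is assumed.

```python
# Sketch of the way decoder in Fig. 7: a log2(N)-to-N decoder whose outputs are
# forced high for read accesses and write-miss accesses (all ways enabled).
def way_decoder(way_tag, read, write_miss, n_ways=8):
    """way_tag: binary-encoded way index; returns an N-bit list of way-enable signals."""
    if read or write_miss:
        return [1] * n_ways                                 # activate every way
    return [1 if w == way_tag else 0 for w in range(n_ways)]  # one-hot enable

print(way_decoder(5, read=False, write_miss=False))  # only way 5 enabled
print(way_decoder(0, read=True,  write_miss=False))  # all ways enabled
```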
D. Way Register
The way register provides way tags for the way-tag arrays.
For a 4-way L2 cache, labels “00”, “01”, “10”, and “11” are
stored in the way register, each tagging one way in the L2 cache.
When the L1 cache loads data from the L2 cache, the corresponding way tag in the way register is sent to the way-tag arrays.
With these new components, the proposed way-tagged cache
operates under different modes during read and write operations
(see Table I). Only the way containing the desired data is activated in the L2 cache for a write hit in the L1 cache, making the
L2 cache equivalently a direct-mapping cache to reduce energy
consumption without introducing performance overhead.
V. APPLICATION OF WAY TAGGING IN PHASED ACCESS CACHES
In this section, we will show that the idea of way tagging can
be extended to other low-power cache design techniques such
as the phased access cache [18]. Note that since the processor
performance is less sensitive to the latency of L2 caches, many
processors employ phased accesses of tag and data arrays in L2
caches to reduce energy consumption. By applying the idea of
way tagging, further energy reduction can be achieved without
introducing performance degradation.
In phased caches, all ways in the cache tag arrays need to be
activated to determine which way in the data arrays contains
the desired data (as shown in the solid-line part of Fig. 8). In
the past, the energy consumption of cache tag arrays has been
ignored due to their relatively small sizes. Recently, Min et al.
show that this energy consumption has become significant [33].
As high-performance microprocessors start to utilize longer addresses, cache tag arrays become larger. Also, high associativity
is important for L2 caches in certain applications [34]. These
factors lead to the higher energy consumption in accessing cache
tag arrays [35]. Therefore, it has become important to reduce the
energy consumption of cache tag arrays.
The idea of way tagging can be applied to the tag arrays of
the phased access cache used as an L2 cache. Note that the tag arrays
do not need to be accessed for a write hit in the L1 cache (as
shown in the dotted-line part in Fig. 9). This is because the destination way of data arrays can be determined directly from the
output of the way decoder shown in Fig. 7. Thus, by accessing
fewer ways in the cache tag arrays, the energy consumption of
phased access caches can be further reduced.
Fig. 8 shows the architecture of the phased access L2 cache
with way-tagging (WT) enhancement. The operation of this
cache is summarized in Fig. 9. Multiplexor M1 is employed to
generate the enable signal for the tag arrays of the L2 cache.
When the status bit in the way-tag buffer indicates a write hit,
M1 outputs “0” to disable all the ways in the tag arrays. As
mentioned before, the destination way of the access can be
obtained from the way decoder and thus no tag comparison is
needed in this case. Multiplexor M2 chooses the output from
the way decoder as the selection signal for the data arrays. If
on the other hand the access is caused by a write miss or a read
miss from the L1 cache, all ways are enabled by the tag array
decoder, and the result of tag comparison is selected by M2 as
the selection signal for the data arrays. Overall, fewer ways
in the tag arrays are activated, thereby reducing the energy
consumption of the phased access cache.
Note that the phased access cache divides an access into two
phases; thus, M2 is not on the critical path. Applying way tagging does not introduce performance overhead in comparison
with the conventional phased cache.
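The selection logic of Fig. 8 (multiplexors M1 and M2) can be sketched as follows; the argument names are ours, and the sketch abstracts away the two-phase timing.

```python
# Sketch of the WT-enhanced phased access in Figs. 8 and 9 (names assumed).
def phased_access_wt(write_hit_status, way_decoder_sel, tag_match_sel, n_ways=8):
    """Phase 1: tag arrays; Phase 2: one data way.
    M1 gates the tag-array enables, M2 picks the data-way select source."""
    if write_hit_status:
        tag_enables = [0] * n_ways        # M1 = 0: skip the tag arrays entirely
        data_select = way_decoder_sel     # M2: way already known from the way decoder
    else:
        tag_enables = [1] * n_ways        # write/read miss: compare all tags
        data_select = tag_match_sel       # M2: use the tag-comparison result
    return tag_enables, data_select

print(phased_access_wt(True, way_decoder_sel=2, tag_match_sel=None))
print(phased_access_wt(False, way_decoder_sel=None, tag_match_sel=6))
```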
VI. ENERGY MODEL
To study the effectiveness of the proposed way-tagged cache,
we utilize an energy model that describes the major components of cache energy consumption. In general, a cache system
consists of address decoders and data multiplexers, which are
shared by all ways. Each way contains several components such
as tag arrays, data arrays, precharging circuit, way comparators,
and sense amplifier circuit. Thus, the energy consumption per
access of a conventional $N_{L2}$-way set-associative L2 cache can be expressed as

$E_{conv} = E_{dec} + E_{mux} + N_{L2} \cdot E_{way}$  (2)

where $E_{dec}$, $E_{mux}$, and $E_{way}$ denote the energy consumption of the address decoders, the data multiplexers, and one way in the cache, respectively. Note that in the conventional L2 cache, all ways are activated during each access. Given the number of accesses $N_{acc}$, the total energy consumption can be determined as

$E_{conv,total} = N_{acc}(E_{dec} + E_{mux} + N_{L2} \cdot E_{way})$.  (3)

Different from conventional caches, the proposed way-tagged cache activates different components depending on the type of cache access. As shown in Table I, if the access is caused by a read miss or a write miss in the L1 cache, the L2 cache works as a conventional cache, of which the energy consumption can be obtained from (2). On the other hand, if the access is caused by a write hit in the L1 cache, only one way in the L2 cache will be activated, of which the energy consumption is given by

$E_{wh} = E_{dec} + E_{mux} + E_{way}$.  (4)

Assuming the numbers of read misses, write misses, and write hits in the L1 cache are $N_{rm}$, $N_{wm}$, and $N_{wh}$, respectively, the energy consumption of the proposed way-tagged L2 cache can be expressed as

$E_{WT} = (N_{rm} + N_{wm})(E_{dec} + E_{mux} + N_{L2} \cdot E_{way}) + N_{wh}(E_{dec} + E_{mux} + E_{way}) + N_{write}(E^{rd}_{wta} + E_{other}) + N_{read} \cdot E^{wr}_{wta}$  (5)

where

$N_{write} = N_{wm} + N_{wh}$  (6)

$N_{read} = N_{rm}$.  (7)

Note that read hits in the L1 cache do not need to access the L2 cache and thus they are not included in (5). The energy overheads introduced by reading the way-tag arrays, writing the way-tag arrays, and operating the other components (the bypass multiplexer, way-tag buffer, way decoder, and way register) are denoted as $E^{rd}_{wta}$, $E^{wr}_{wta}$, and $E_{other}$, respectively. $N_{write}$ and $N_{read}$ are the numbers of write and read accesses, respectively, to the L2 cache.

Since the proposed way-tagged cache does not affect the cache miss rate, the energy consumption related to cache misses, such as replacement, off-chip memory accesses, and microprocessor stalls, will be the same as that of the conventional cache. Therefore, we do not include these components in (5). Note that the energy components in (5) represent the switching power. Leakage power reduction is an important topic for our future study.

We define the efficiency of the proposed way-tagged cache as the relative energy reduction over the conventional cache for the same $N_{acc} = N_{rm} + N_{wm} + N_{wh}$ accesses

$\eta = (E_{conv,total} - E_{WT}) / E_{conv,total}$.  (8)

Substituting (2)–(7) into (8), we obtain

$\eta = \dfrac{R_{wh}(N_{L2} - 1)E_{way} - R_{write}(E^{rd}_{wta} + E_{other}) - R_{read} \cdot E^{wr}_{wta}}{E_{dec} + E_{mux} + N_{L2} \cdot E_{way}}$  (9)

where

$R_{wh} = N_{wh}/N_{acc}$, $R_{write} = N_{write}/N_{acc}$, $R_{read} = N_{read}/N_{acc}$.  (10)
From (9), it is clear that the efficiency of the proposed way-tagged cache is affected by a number of factors, such as the number of ways in the L2 cache and the configuration of the L1 cache [e.g., its size and number of ways, which affect $R_{wh}$ in (9)]. The impact of these factors will be evaluated in the next
section. Note that in this paper we choose to evaluate the energy
efficiency at the cache level as the proposed technique focuses
exclusively on cache energy reduction. Smaller energy savings
will be expected at the processor level because the L2 cache
only consumes a portion of the total energy, e.g., around 12% in
Pentium Pro CPU [31]. Reducing the total power of a processor
that consists of different components (ALU, memory, busses,
I/O, etc.) is an important research topic but is beyond the scope
of this work.
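For a quick sanity check of the model in (2)–(9), the following Python sketch evaluates the energy reduction for an assumed access mix. The per-component energies are placeholders chosen here for illustration, not the CACTI values used in Section VII.

```python
# Sketch: evaluate the energy model (2)-(9) for placeholder component energies.
def energy_reduction(n_rm, n_wm, n_wh, n_l2_ways,
                     e_dec=1.0, e_mux=0.5, e_way=10.0,
                     e_wta_rd=0.05, e_wta_wr=0.05, e_other=0.02):
    e_all_ways = e_dec + e_mux + n_l2_ways * e_way            # (2): all ways activated
    e_one_way = e_dec + e_mux + e_way                         # (4): write-hit access
    e_conv = (n_rm + n_wm + n_wh) * e_all_ways                # (3): conventional total
    e_wt = ((n_rm + n_wm) * e_all_ways + n_wh * e_one_way     # (5): way-tagged total
            + (n_wm + n_wh) * (e_wta_rd + e_other)            # (6): L2 write accesses
            + n_rm * e_wta_wr)                                # (7): L2 read accesses (fills)
    return (e_conv - e_wt) / e_conv                           # (8): efficiency

# Toy access mix dominated by write hits, 8-way L2 cache
print(f"{energy_reduction(n_rm=10_000, n_wm=5_000, n_wh=85_000, n_l2_ways=8):.1%}")
```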
VII. EVALUATION AND DISCUSSION

In this section, we evaluate the proposed technique by comparing energy savings, area overhead, and performance with existing cache design techniques.

A. Simulation Setup

We consider separate L1 instruction and data caches, both 16 kB 4-way set-associative with a cache line size of 64 B. The L2 cache is a unified 8-way set-associative cache of 512 kB with a cache line size of 128 B. There are eight banks in the L2 cache. The L1 data cache utilizes the write-through policy and the L2 cache is inclusive. This cache configuration, used in the Pentium 4 [23], will be used as the baseline system for comparison with the proposed technique under different cache configurations.
TABLE III
ENERGY CONSUMPTION PER READ AND WRITE ACCESS OF THE CONVENTIONAL
SET-ASSOCIATIVE L2 CACHE AND THE PROPOSED L2 CACHE
In these simulations, SimpleScalar is employed to obtain the cache access statistics and performance. The energy consumption is estimated by CACTI 5.3 (http://www.hpl.hp.com/research/cacti/) for a 90-nm CMOS process. All
the simulations are based on the SPEC CPU2000 benchmarks
collected from the stream-based trace compression (SBC) [32],
where trace files of 23 benchmarks are available. All the benchmarks were simulated for at least two billion memory references.
B. Results of Baseline Cache Configuration
1) Energy Efficiency: Table III compares the average energy consumption of a read access and a write access in the
conventional 8-way set-associative L2 cache and the proposed
way-tagged L2 cache. Because fewer ways are activated, the average energy consumption of the proposed way-tagged L2 cache is only about 12.9% of that of the conventional L2 cache for a write access under a write hit. Since the way-tag arrays are very small, they introduce only 0.01% energy overhead per read and write access. The energy overheads due to the way-tag buffer,
bypass multiplexer, way decoder, and way register are much
smaller and thus are not shown in Table III.
Based on the cache access statistics obtained from SimpleScalar, we estimate the values of $N_{rm}$, $N_{wm}$, $N_{wh}$, $N_{read}$, and $N_{write}$ in (5). Employing (2)–(8), we can determine the energy efficiency of the proposed way-tagged
cache. Fig. 10 shows that the energy reduction achieved by
the proposed technique ranges from 83.4% (mesa) to 13.1%
(ammp) as compared to the conventional L2 cache. On average,
the proposed technique can reduce 65.4% of the L2 energy
consumption, or equivalently 7.5% of total processor power
if applied in Pentium Pro CPU [31], where the L2 cache
consumes about 12% of total processor power. These results
demonstrate that by reducing the unnecessary way accesses in
the L2 cache, our technique is very effective in reducing L2
cache power consumption. It is noted that the energy reduction
achieved by the different applications is not uniform. This is
because different applications have different write hit rates,
which affect $R_{wh}$ in (9) and in turn affect the energy savings.
2) Area Overhead and Performance Impact: The area overhead of the proposed technique comes mainly from four components: way-tag arrays, way-tag buffer, bypass multiplexer, and
way decoder. As discussed in Section IV, these components
are very small. For example, the size of the way-tag arrays,
the largest component, is only about 0.02% of the whole cache
system. Also, only three additional wires are introduced for way
tag delivery between the L1 and L2 caches. Thus, we expect that the area overhead can be easily accommodated.

Fig. 10. Energy reduction of the way-tagged L2 cache compared with the conventional set-associative L2 cache.
The proposed way-tagged cache does not affect the hit
rate, i.e., no performance degradation, as it does not change
the cache placement policy. Furthermore, the way-tag arrays,
way-tag buffer, and way decoder are operated in parallel with
the L1 data cache, write buffer, and decoders of tag and data
arrays in the L2 cache, respectively. Due to their small sizes,
the access delay can be fully covered by the delay of the L1
data cache, i.e., no new critical paths are created. As a result,
the proposed technique does not introduce any performance
overhead at the architecture and circuit levels.
C. Energy Reduction under Different Cache Configurations
As discussed in Section VI, the effectiveness of the proposed
technique varies with the configurations of L1 and L2 caches. In
this subsection, we will study this effect by assessing the energy
reduction achieved under different cache configurations in terms
of the associativity and the sizes of L1 and L2 caches.
Fig. 11 shows the energy reduction of the proposed technique
for a 4-way set-associative L1 cache with cache sizes of 8, 16,
and 32 kB, while the size of L2 cache is 512 kB. The block size
in these L1 caches is 64 B while that of the L2 cache is 128 B.
A larger energy reduction in the L2 cache is observed with the
increase in the size of L1 cache. This is because by increasing
the size of L1 cache, the miss rate will decrease which leads
to a larger $R_{wh}$. This enables a larger energy reduction according to (9). We also performed simulations by varying the
size of the L2 cache from 256 to 1024 kB. Since the proposed
technique does not target the misses in the L2 cache, changing
L2 cache size has little effect on the relative energy reduction
(i.e., both energy consumption and energy savings change proportionally).
Fig. 12 shows the energy reduction under the 16 kB L1 cache
and 512 kB L2 cache, where the number of ways in the L1 cache
varies from 2, 4, to 8. The L2 cache is 8-way set-associative. It
is shown that the proposed technique becomes more effective
as the associativity of L1 cache increases. This comes from the
fact that a higher associativity in general results in a smaller miss
rate, which enables better energy efficiency in the L2 cache (i.e.,
more write hits and thus fewer accessed ways in the L2 cache).
A similar trend can also be found in Fig. 13, which demonstrates
the effectiveness of the proposed technique for a 512-kB L2 cache with the associativity ranging from 4, 8, to 16, while the 16-kB L1 cache is 4-way associative. As the number of ways in the L2 cache increases, $N_{L2}$ in (9) becomes larger and thus enables better energy efficiency. In other words, each time there is a write hit in the L1 cache, only a small part of the L2 cache is activated as the number of ways in the L2 cache increases.

Fig. 11. Energy reduction of the way-tagged L2 cache compared with the conventional set-associative L2 cache under different L1 sizes.

Fig. 12. Energy reduction of the way-tagged L2 cache compared with the conventional set-associative L2 cache under different L1 set associativity.

Fig. 13. Energy reduction of the way-tagged L2 cache compared with the conventional set-associative L2 cache under different L2 set associativity.

Fig. 14. Comparison of the MRU-based way-predicting cache and the proposed cache.

Fig. 15. Energy reduction of the WT-based phased access L2 cache compared with the conventional phased access L2 cache.

D. Comparison With Existing Low-Power Cache Techniques
In this subsection, we compare the proposed way-tagged
cache with two existing low-power cache design techniques:
phased access cache and MRU-based way-predicting cache.
Figs. 14 and 15 show the energy reduction achieved by the
three techniques for a 256-kB 16-way set-associative L2 cache.
It is shown that the proposed way-tagged cache is much more
effective than the way-predicting cache in energy reduction.
Specifically, our technique can achieve 32.6% more energy
reduction on average. This is because the L2 cache is a unified
cache, which in general leads to a poor prediction rate in the
way-predicting cache. To compare with the phased cache,
we employ the proposed WT-based phased access cache (see
Section V) as it has the same number of access cycles as the
phased cache. As shown in Fig. 15, the proposed technique
achieves energy savings ranging from 45.1% to 8.1% with an
average of 34.9% for the whole L2 cache at the same performance level. These results indicate that the energy consumption
of tag arrays accounts for a significant portion of the total L2
cache energy. Thus, applying the technique of way tagging in
the phased cache is quite effective.
We also study the performance of these three cache design
techniques. As discussed before, the proposed way-tagged
cache in Section III has no performance degradation compared
with the conventional set-associative L2 cache with simultaneous tag-data accesses. Using SimpleScalar, we observed that the
performance degradation of the phased cache is very small for
most applications, below 0.5% in terms of instruction per cycle
(IPC). This is expected as L2 cache latency is not very critical
to the processor performance. However, nontrivial performance
degradation was observed in some applications. For example,
benchmark perlbmk sees a 3.7% decrease in IPC, while the
IPC of gzip decreases by 1.7%. The performance degradation
may be more significant for other applications that are sensitive
to L2 cache latency, such as TPC-C as indicated in [29]. As
a result, L2 caches with simultaneous tag-data accesses are
still preferred in some high-performance microprocessors [23],
[28]. A similar trend was also observed in the way-predicting
cache.
VIII. CONCLUSION
This paper presents a new energy-efficient cache technique for high-performance microprocessors employing the
write-through policy. The proposed technique attaches a tag to
each way in the L2 cache. This way tag is sent to the way-tag
arrays in the L1 cache when the data is loaded from the L2
cache to the L1 cache. Utilizing the way tags stored in the
way-tag arrays, the L2 cache can be accessed as a direct-mapping cache during the subsequent write hits, thereby reducing
cache energy consumption. Simulation results demonstrate a significant reduction in cache energy consumption with minimal
area overhead and no performance degradation. Furthermore,
the idea of way tagging can be applied to many existing
low-power cache techniques such as the phased access cache to
further reduce cache energy consumption. Future work is being
directed towards extending this technique to other levels of
cache hierarchy and reducing the energy consumption of other
cache operations.
REFERENCES
[1] G. Konstadinidis, K. Normoyle, S. Wong, S. Bhutani, H. Stuimer, T.
Johnson, A. Smith, D. Cheung, F. Romano, S. Yu, S. Oh, V. Melamed,
S. Narayanan, D. Bunsey, C. Khieu, K. J. Wu, R. Schmitt, A. Dumlao,
M. Sutera, J. Chau, and K. J. Lin, “Implementation of a third-generation
1.1-GHz 64-bit microprocessor,” IEEE J. Solid-State Circuits, vol. 37,
no. 11, pp. 1461–1469, Nov. 2002.
[2] S. Rusu, J. Stinson, S. Tam, J. Leung, H. Muljono, and B. Cherkauer,
“A 1.5-GHz 130-nm itanium 2 processor with 6-MB on-die L3 cache,”
IEEE J. Solid-State Circuits, vol. 38, no. 11, pp. 1887–1895, Nov. 2003.
[3] D. Wendell, J. Lin, P. Kaushik, S. Seshadri, A. Wang, V. Sundararaman, P. Wang, H. McIntyre, S. Kim, W. Hsu, H. Park, G.
Levinsky, J. Lu, M. Chirania, R. Heald, and P. Lazar, “A 4 MB
on-chip L2 cache for a 90 nm 1.6 GHz 64 bit SPARC microprocessor,”
in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers,
2004, pp. 66–67.
[4] S. Segars, “Low power design techniques for microprocessors,” in
Proc. Int. Solid-State Circuits Conf. Tutorial, 2001, pp. 268–273.
[5] A. Malik, B. Moyer, and D. Cermak, “A low power unified cache architecture providing power and performance flexibility,” in Proc. Int.
Symp. Low Power Electron. Design, 2000, pp. 241–243.
[6] D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: A framework for architectural-level power analysis and optimizations,” in Proc. Int. Symp.
Comput. Arch., 2000, pp. 83–94.
[7] J. Maiz, S. Hareland, K. Zhang, and P. Armstrong, "Characterization of
multi-bit soft error events in advanced SRAMs,” in Proc. Int. Electron
Devices Meeting, 2003, pp. 21.4.1–21.4.4.
[8] K. Osada, K. Yamaguchi, and Y. Saitoh, “SRAM immunity to
cosmic-ray-induced multierrors based on analysis of an induced
parasitic bipolar effect,” IEEE J. Solid-State Circuits, pp. 827–833,
2004.
[9] F. X. Ruckerbauer and G. Georgakos, “Soft error rates in 65 nm
SRAMs: Analysis of new phenomena,” in Proc. IEEE Int. On-Line
Test. Symp., 2007, pp. 203–204.
[10] G. H. Asadi, V. Sridharan, M. B. Tahoori, and D. Kaeli, “Balancing performance and reliability in the memory hierarchy,” in Proc. Int. Symp.
Perform. Anal. Syst. Softw., 2005, pp. 269–279.
[11] L. Li, V. Degalahal, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin,
“Soft error and energy consumption interactions: A data cache perspective,” in Proc. Int. Symp. Low Power Electron. Design, 2004, pp.
132–137.
[12] X. Vera, J. Abella, A. Gonzalez, and R. Ronen, “Reducing soft error
vulnerability of data caches,” presented at the Workshop System Effects Logic Soft Errors, Austin, TX, 2007.
[13] P. Kongetira, K. Aingaran, and K. Olukotun, “Niagara: A 32-way multithreaded Sparc processor,” IEEE Micro, vol. 25, no. 2, pp. 21–29,
Mar. 2005.
[14] J. Mitchell, D. Henderson, and G. Ahrens, “IBM POWER5 processor-based servers: A highly available design for business-critical
applications,” IBM, Armonk, NY, White Paper, 2005. [Online].
Available:
http://www03.ibm.com/systems/p/hardware/whitepapers/power5_ras.pdf
[15] N. Quach, “High availability and reliability in the Itanium processor,”
IEEE Micro, pp. 61–69, 2000.
[16] C. Su and A. Despain, “Cache design tradeoffs for power and performance optimization: A case study,” in Proc. Int. Symp. Low Power
Electron. Design, 1997, pp. 63–68.
[17] K. Ghose and M. B. Kamble, “Reducing power in superscalar processor
caches using subbanking, multiple line buffers and bit-line segmentation,” in Proc. Int. Symp. Low Power Electron. Design, 1999, pp.
70–75.
[18] A. Hasegawa, I. Kawasaki, K. Yamada, S. Yoshioka, S. Kawasaki, and
P. Biswas, “Sh3: High code density, low power,” IEEE Micro, vol. 15,
no. 6, pp. 11–19, Dec. 1995.
[19] C. Zhang, F. Vahid, and W. Najjar, “A highly-configurable cache architecture for embedded systems,” in Proc. Int. Symp. Comput. Arch.,
2003, pp. 136–146.
[20] K. Inoue, T. Ishihara, and K. Murakami, “Way-predicting set-associative cache for high performance and low energy consumption,” in Proc.
Int. Symp. Low Power Electron. Design, 1999, pp. 273–275.
[21] A. Ma, M. Zhang, and K. Asanović, "Way memoization to reduce fetch
energy in instruction caches,” in Proc. ISCA Workshop Complexity Effective Design, 2001, pp. 1–9.
[22] T. Ishihara and F. Fallah, “A way memoization technique for reducing
power consumption of caches in application specific integrated processors,” in Proc. Design Autom. Test Euro. Conf., 2005, pp. 358–363.
[23] R. Min, W. Jone, and Y. Hu, “Location cache: A low-power L2 cache
system,” in Proc. Int. Symp. Low Power Electron. Design, 2004, pp.
120–125.
[24] B. Calder, D. Grunwald, and J. Emer, "Predictive sequential associative
cache,” in Proc. 2nd IEEE Symp. High-Perform. Comput. Arch., 1996,
pp. 244–254.
[25] T. N. Vijaykumar, "Reactive-associative caches," in Proc. Int. Conf. Parallel Arch. Compiler Tech., 2001, pp. 49–61.
[26] J. Dai and L. Wang, “Way-tagged cache: An energy efficient L2 cache
architecture under write through policy,” in Proc. Int. Symp. Low
Power Electron. Design, 2009, pp. 159–164.
[27] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 4th ed. New York: Elsevier Science & Technology
Books, 2006.
[28] B. Brock and M. Exerman, “Cache Latencies of the PowerPC
MPC7451,” Freescale Semiconductor, Austin, TX, 2006. [Online].
Available: cache.freescale.com
[29] T. Lyon, E. Delano, C. McNairy, and D. Mulla, “Data cache design considerations for Itanium 2 processor,” in Proc. IEEE Int. Conf. Comput.
Design, 2002, pp. 356–362.
[30] Standard Performance Evaluation Corporation, Gainesville, VA,
“SPEC CPU2000,” 2006. [Online]. Available: http://www.spec.
org/cpu
[31] “Pentium Pro Family Developer’s Manual,” Intel, Santa Clara, CA,
1996.
[32] A. Milenkovic and M. Milenkovic, “Exploiting streams in instruction
and data address trace compression,” in Proc. IEEE 6th Annu. Workshop Workload Characterization, 2003, pp. 99–107.
[33] R. Min, W. Jone, and Y. Hu, “Phased tag cache: An efficient low power
cache system,” in Proc. Int. Symp. Circuits Syst., 2004, pp. 23–26.
[34] M. K. Qureshi, D. Thompson, and Y. N. Patt, “The V-way cache: Demand based associativity via global replacement,” in Proc. Int. Symp.
Comput. Arch., 2005, pp. 544–555.
[35] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective. Englewood Cliffs, NJ: Prentice-Hall, 1996.
Jianwei Dai received the B.S. degree from Beijing
University of Chemical Technology, Beijing, China,
in 2002, the M.Eng. degree from Beihang University,
Beijing, China, in 2005, and the Ph.D. degree from
the University of Connecticut, Storrs, in 2011.
Currently, he is with Intel Corporation, Hillsboro,
OR, where he is participating in designing next generation processors. His research interests include low
power VLSI design, error and reliability-centric statistical modeling for emerging technology, and nanocomputing.
Lei Wang (M’01–SM’11) received the B.Eng. degree and the M.Eng. degree from Tsinghua University, Beijing, China, in 1992 and 1996, respectively,
and the Ph.D. degree from the University of Illinois at
Urbana-Champaign, Urbana, in 2001.
During the Summer of 1999, he worked with
Microprocessor Research Laboratories, Intel Corporation, Hillsboro, OR, where his work involved
development of high-speed and noise-tolerant VLSI
circuits and design methodologies. From December
2001 to July 2004, he was with Microprocessor
Technology Laboratories, Hewlett-Packard Company, Fort Collins, CO, where
he participated in the design of the first dual-core multi-threaded Itanium Architecture Processor, a joint project between Intel and Hewlett-Packard. Since
August 2004, he has been with the Department of Electrical and Computer
Engineering, University of Connecticut, where he is presently an Associate
Professor.
Dr. Wang was a recipient of the National Science Foundation CAREER
Award in 2010. He is a member of IEEE Signal Processing Society Technical
Committee on Design and Implementation of Signal Processing Systems.
He currently serves as an Associate Editor for the IEEE TRANSACTIONS ON
COMPUTERS. He has served on Technical Program Committees of various
international conferences.