EMBEDDED SYSTEMS AND VLSI

VLSI Architecture
CONTENTS
I. Introduction
II. Related Work
   - Dynamically Resizable Instruction Cache
   - Cache Decay
   - Partitioned Cache Architecture
   - Selective Cache Ways
III. Time Based Leakage Control in Partitioned Cache Architecture
   - Overview
   - Block Diagram
   - Implementation
   - Placement Strategies
   - Prediction Strategies
   - Deciding Cache Decay Interval
IV. Conclusion
REFERENCES
SECTION I: INTRODUCTION

The advance in semiconductor technology has paved the way for increasing the density of transistors per chip. The amount of information storable on a given amount of silicon has roughly doubled every year since the technology was invented. Thus the performance of processors improved, and the chips' energy dissipation increased, with each processor generation. This created awareness of the need for designing low-power circuits. Low power is important in portable devices because the weight and size of the device are determined by the amount of battery needed, which in turn depends on the amount of power dissipated in the circuit. The cost involved in providing power and the associated cooling, reliability issues, and expensive packaging have made low power a concern in non-portable applications like desktops and servers too. Even though most power dissipation in CMOS CPUs is dynamic power dissipation (a function of the operating frequency of the device and the switching capacitance), leakage power (a function of the number of on-chip transistors) is becoming increasingly significant, as leakage current flows in every transistor that is powered on, irrespective of signal transitions. Most of the leakage energy comes from memories: since the cache occupies much of a CPU chip's area and contains a large number of transistors, reducing leakage in the cache will result in a significant reduction in the overall leakage energy of the processor. This paper suggests an architectural approach for reducing leakage energy in caches.

Various approaches have been suggested, at both the architecture and circuit levels, to reduce leakage energy. One approach is to count the total number of misses in a cache and upsize or downsize the cache depending on whether the miss count is greater or less than a preset value; the cache dynamically resizes to the application's required size and the unused sections of the cache are shut off. Another method, called cache decay, turns off cache lines when they hold data not likely to be reused. The cache lines are shut off during their dead time, that is, the time after the last access and before the eviction: if a preset number of cycles has elapsed and the data is still unused, the cache line is shut off. Another approach is to disable a portion of the cache ways, called selective cache ways. This method, which is application sensitive, enables all the cache ways (a way is one of the n sections in an n-way set-associative cache) when high performance is required and enables only a subset of the ways when cache demands are not high.

This paper is organized as follows: Section II narrates the work done related to this problem, Section III describes our approach, which uses a time-based decay policy in a partitioned level-2 cache architecture, and Section IV presents the conclusion.
SECTION II: RELATED WORK

Dynamically resizable instruction cache:
This method exploits the utilization of the cache; cache utilization varies depending on the application requirements. By shutting off the portion of the cache that is unused, leakage energy can be reduced significantly. It uses a dynamically resizable I-cache architecture, which resizes in accordance with the application requirements, and uses a circuit-level technique called gated-Vdd to turn off unused portions of the cache. The number of misses is counted periodically (say, every 1 million instructions), and the cache size is increased or decreased depending on whether the count is more or less than a preset value. The cache is also prevented from thrashing by fixing a minimum size beyond which the cache cannot be decreased.

Merits:
- Reduces the average size of a 64 K cache by 62%, thus lowering leakage energy, while the performance degradation is within 4%.
- By employing a wide NMOS dual-Vt gated-Vdd implementation, the leakage is virtually eliminated (connecting the gated-Vdd transistor in series with the SRAM cells adds a stacking effect) with only a 5% area increase.
- By controlling the miss rate with reference to a preset value, the performance degradation and the increase in the lower cache levels' energy dissipation (due to misses in the L1 cache) are kept low.
- The dynamic energy of the counter hardware used is small, as the average number of bits switching on a counter increment is less than two (the ith bit of a counter switches only once every 2^i increments).

Demerits:
- Resizing affects the miss rate; a miss in the L1 cache will lead to dynamic energy dissipation in the L2 cache, so the number of accesses to the L2 cache should be kept low.
- There is extra L1 dynamic energy due to the resizing bits.
- Resizing circuitry may increase energy dissipation, offsetting the gains from cache resizing, so the resizing frequency should be low.
- A longer resizing interval will span multiple application phases, reducing the opportunity for resizing, while a shorter resizing interval may result in increased overhead.
- Resizing from one size to another will modify the set-mapping
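The counter-energy merit above is easy to verify with a short simulation: in a binary counter, bit i toggles once every 2^i increments, so the average number of bits flipping per increment is bounded by 2. This sketch (illustrative, not from the paper) counts the actual bit flips:

```python
def avg_bits_flipped(n_increments, width=32):
    """Average number of bits that toggle per increment of a binary counter."""
    flips = 0
    prev = 0
    for i in range(1, n_increments + 1):
        cur = i & ((1 << width) - 1)
        flips += bin(prev ^ cur).count("1")  # bits that changed on this increment
        prev = cur
    return flips / n_increments

# The average approaches 2 (the sum 1 + 1/2 + 1/4 + ...) but never reaches it.
print(avg_bits_flipped(1_000_000))
```

This is why the miss-counter hardware adds so little dynamic energy: on average fewer than two flip-flops switch per event counted.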
function for blocks and may result in an incorrect lookup.
- For an application which requires a small I-cache, the dynamic component will be large due to the large number of resizing tag bits.
- The gated-Vdd transistor must be large enough to sink the current flowing through the SRAM cells during read/write operations, but a too-large gated-Vdd transistor reduces the stacking effect and increases the area overhead.

Cache decay:
In this technique, cache lines which hold data that are not likely to be reused are turned off. It exploits the fact that cache lines are used frequently when data is first brought in, after which there is a period of dead time before the data is evicted. So, by turning off the cache lines during their dead time, leakage energy can be reduced significantly without incurring additional misses; thus performance will be comparable to a conventional cache. The policy used here is a time-based policy that turns a cache line off if a preset number of cycles has elapsed since its last access.

Fig A
As seen in Fig A, the access interval is the time between two hits; the dead time is the time between the last hit and the time at which the data is evicted.

Fig B
As seen in Fig B, the dead time for most of the benchmarks is high.

Merits:
- A 70% reduction in L1 data cache leakage energy is achieved.
- Program performance and dynamic power dissipation are not affected much, as a cache line is turned off only during its dead time.
- Results show that dead times are long, and thus moderately easy to identify.
- Very successful if the application has poor reuse of data, as in streaming applications.
- Can be applied to the outer levels of the cache hierarchy (as outer levels are likely to have longer generations with larger dead-time intervals).

Demerits:
- There might be additional L1 misses due to early shut-off of a cache line.
- Shorter decay intervals (the time after which a cache line is shut off) reduce leakage energy but may increase the miss rate, leading to dynamic energy dissipation in the lower-level memory.
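The generational behaviour that cache decay exploits can be seen in a toy single-line simulation (the access trace and numbers below are illustrative assumptions, not the paper's data): the dead time is the gap between the last access and the eviction, and a decay policy recovers most of it as powered-off cycles, at the risk of an induced miss if the interval is too short.

```python
def decay_savings(access_cycles, evict_cycle, decay_interval):
    """Cycles a decayed line spends powered off, and whether any access
    arrives after the line has already been shut off (an induced miss)."""
    off_cycles, induced_miss = 0, False
    last = access_cycles[0]
    for t in access_cycles[1:] + [evict_cycle]:
        gap = t - last
        if gap > decay_interval:        # the line decays during this gap
            off_cycles += gap - decay_interval
            if t != evict_cycle:
                induced_miss = True     # a later access finds the line off
        last = t
    return off_cycles, induced_miss

# Accesses cluster early, then a long dead time until eviction at cycle 100_000.
accesses = [0, 50, 120, 300]
print(decay_savings(accesses, 100_000, decay_interval=10_000))
```

With a 10,000-cycle interval almost the entire dead time is recovered with no induced miss; shrinking the interval to 100 cycles recovers slightly more but turns one genuine reuse into a miss, which is exactly the trade-off the demerits list describes.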
Partitioned Cache Architecture:
This method partitions the cache into smaller units (subcaches), each of which acts as a cache, and selectively disables the unused components. Since the partition is at the architecture level, the data placement and data probing mechanisms are more sophisticated than those at the circuit level. The topology of the subcaches may differ: the subcaches can all be the same, or they can be caches with different topologies. The cache predictor tells the cache controller which of the subcaches should be activated; the probing strategy determines the number of subcaches accessed per data reference, and thus the energy consumed.

Merits:
- Reduces per-access energy costs
- Improves locality behavior
- Smaller and fewer energy-consuming components
- Both performance and energy can be optimized
- Breaking up into subcaches or subbanks reduces the wiring and diffusion capacitances of the bit lines and the wiring and gate capacitances of the word lines. Thus the dynamic energy consumed when accessing the cache is lower.
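The last merit follows from the usual dynamic-energy relation E ∝ C·Vdd², with the bit-line capacitance roughly proportional to the number of rows the bit line spans. A back-of-the-envelope sketch (capacitance and voltage values are illustrative assumptions, not from the paper):

```python
def bitline_energy(c_per_row_fF, rows, vdd):
    """Energy (fJ) to swing one bit line: E = C * Vdd^2, with C proportional
    to the number of rows the bit line spans."""
    return c_per_row_fF * rows * vdd ** 2

full = bitline_energy(2.0, 1024, 1.8)    # monolithic 1024-row array
banked = bitline_energy(2.0, 256, 1.8)   # same cache split into 4 subbanks
print(full / banked)                     # per-access energy scales with bit-line length
```

Splitting the array four ways cuts the switched bit-line capacitance, and hence the access energy, by the same factor in this simple model.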
Fig C: Partitioned cache architecture. The virtual address is translated in the TLB while, in parallel, the cache predictor (with a default predictor as fallback) supplies a subcache ID to the cache controller; the re-probe logic and the cache miss and placement logic handle mispredictions and misses across subcaches 1 to N.
Fig D: Architecture of a subcache. Each cache line (data + tag, with a valid bit) has a local 2-bit counter driven by the line-access signal WRD and a cascaded tick pulse T from the global counter; on saturation it asserts a power-off signal to the gated-Vdd transistor, moving the line from always-powered to switched power. The state diagram is that of a 2-bit (S1, S0) saturating Gray-code counter with the two inputs (WRD, T).
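The counter of Fig D can be sketched in software. It counts in Gray code (00 → 01 → 11 → 10), saturates in state 10, where the next idle tick asserts PowerOff, and any access (WRD) resets it; the exact encoding is read off the figure caption, so treat it as an assumption.

```python
GRAY_NEXT = {"00": "01", "01": "11", "11": "10", "10": "10"}  # saturating Gray sequence

def step(state, wrd, tick):
    """One step of the per-line decay counter.
    Returns (next_state, power_off). WRD (a line access) resets the counter;
    a tick in the saturated state asserts the power-off signal."""
    if wrd:
        return "00", False             # any access restarts the decay interval
    if tick:
        if state == "10":
            return state, True         # saturated: shut the line off via gated-Vdd
        return GRAY_NEXT[state], False
    return state, False

# Four idle ticks shut the line off; a WRD in between would have reset it.
s, off = "00", False
for _ in range(4):
    s, off = step(s, wrd=False, tick=True)
print(s, off)
```

Note that only one bit changes per Gray-code transition, which keeps the counter's own switching energy minimal, in line with the counter-energy argument made earlier for the resizable cache.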
Demerits:
- If a large number of cycles is spent in servicing a memory request because of a poor probing strategy, performance will be degraded.
- Performance depends on the effectiveness of the probing policy; if the probing policy is not good, there will be a reprobing penalty.
- Energy depends on the number of subcaches accessed per data reference.

Selective cache ways:
This method exploits the subarray partitioning that is usually already present, enabling all the cache ways when required to achieve high performance but only a subset of the cache ways when cache demands are not high. Since only a subset of the cache ways is active, leakage energy can be reduced significantly. This strategy exploits the fact that cache requirements vary considerably between applications as well as within an application. A software-visible register called the cache way select register (CWSR) signals the hardware to enable or disable particular ways, and special instructions are provided for writing and reading the cache way select register. Software also plays a role in analyzing application cache requirements, enabling cache ways, and saving the cache way select register; thus this is a combination of hardware and software techniques. The degree to which ways are disabled depends on the relative energy dissipation of the different memory hierarchy levels and how they are affected by disabling ways.

SECTION III: TIME BASED LEAKAGE CONTROL IN PARTITIONED CACHE ARCHITECTURE

Overview:
The level 2 cache is larger in size than the level 1 cache, so the level 2 cache dissipates more leakage energy than the level 1 cache; by reducing leakage energy in the L2 cache, the overall leakage energy can be reduced to a great extent. This paper combines two existing strategies to reduce leakage power in the level 2 cache, exploiting the advantages of the partitioning and time-based cache decay techniques. The level 2 cache is partitioned into smaller units, each of which is a cache by itself, called a subcache. Methods were previously proposed for partitioning the cache structure, for shutting off part of the cache ways during their dead time, and for partitioning the subarrays of a cache structure. In this paper we suggest partitioning the cache structure into small caches called subcaches and implementing cache decay (shutting off portions of the cache ways) in each subcache. This can reduce the leakage energy significantly.

The subcache architecture enjoys the following benefits:
- Reduces per-access energy costs
- Improves locality behavior
- Smaller and fewer energy-consuming components
- Both performance and energy can be optimized
- Breaking up into subcaches or subbanks reduces the wiring and diffusion capacitances of the bit lines and the wiring and gate capacitances of the word lines, so the dynamic energy consumed when accessing the cache is lower.

This architecture selectively disables unused subcaches and activates the one holding the data, so leakage energy can be reduced significantly. By applying the time-based cache decay technique to each subcache, only part of the cache ways will be enabled within a subcache, and the power wasted on dead times (when a cache way is idle) can be avoided; thus this combination of partitioning and selective cache ways can reduce the leakage energy more than when one technique is applied alone. It is an appropriate technique to use in a subbank for the following reasons:
- Program performance will not be affected much, as a cache line is turned off only during its dead time.
- Time-based cache decay works well if the reuse of data is poor; reuse of data in the L2 cache will be less than that in the L1 cache, so it is appropriate to apply this technique to the L2 cache.
- Outer levels of the hierarchy are likely to have longer generations with larger dead-time intervals, which is what this time-based cache decay technique requires.
- The fraction of time a cache way is dead increases with higher miss rates, as the lines spend more of their time about to be evicted.

Implementation:
The block diagram of the hardware implementation is shown in Fig C. The level 2 cache is divided into smaller units, each acting like a cache by itself; these are called SUBBANKS or SUBCACHES. The subcache that needs to be activated is decided by a logic block called the CACHE PREDICTOR. This operation is performed concurrently with the TLB lookup operation in order to avoid delay in the critical path. The output of the cache predictor will be either the subcache ID or a no-prediction. If the output is a no-prediction, then a logic block called the DEFAULT PREDICTOR will be used to
select the cache for activation. Based on the cache predictor output, the CACHE CONTROLLER will activate the appropriate subcache. The check will be made only against the cache ways that are active within the subcache; not all cache ways will be enabled within the subcache. Disabling the cache ways within a subcache is done by means of a time-based decay policy (Fig D).

The time-based decay policy is implemented in each subcache. Each cache line within a subcache is connected to a counter; this is a 2-bit counter (local counter) which increments its value after receiving the tick pulse from a global counter. The two inputs the local counter receives are the global tick signal T and the cache-line access signal WRD. When the 2-bit counter reaches its maximum value, the decay interval (it is found that for L2 caches the decay interval should be in the range of tens of thousands of cycles), which is the time allowed before the line is shut off, has elapsed. On every access to the cache line, the 2-bit counter is reset to its initial value. Once the counter saturates at its maximum value, the cache line is shut off using the gated-Vdd technique: the gated-Vdd transistor connected in series with the SRAM cell is turned off, disabling that cache way. Thus the cache ways that are idle will be disabled.

If the cache controller cannot find the data in the selected subcache, then the CACHE MISS logic informs the RE-PROBE logic, which determines the next subcache to probe. The re-probe logic will be active until the data is found. When the data cannot be found in any of the subcaches, the cache miss and placement logic becomes active and brings the block from main memory. The information in the cache predictor and re-probe logic is updated, as one of the blocks needs to be evicted. The cache predictor is also updated whenever there is a cache hit that it did not predict.

The global counter can be common to all subcaches. The tick signal is cascaded from one local counter to another with one clock cycle of latency, so that writebacks cannot all take place at the same time. Each cache line implements the state machine shown in Fig D. The output of the local counter is a power-off signal that goes to the gated-Vdd transistor, which is turned off when the signal is asserted.

The subcache architecture adopts certain policies for placing data if a miss is encountered, for predicting the subcache (probing) in which the data will be present, and for reprobing subcaches in case the first probing fails. These are described below.
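Before turning to those policies, the probe flow just described (predict, probe, re-probe, then fetch and place on a global miss) can be summarised in a behavioural sketch; the data structures and the probe-count bookkeeping here are illustrative assumptions, not the paper's hardware:

```python
def lookup(block, subcaches, predictor, default_predictor):
    """Probe flow of the partitioned cache: try the predicted subcache first,
    re-probe the rest, and fall back to memory on a global miss.
    Returns (subcache holding the block, number of probes made)."""
    pred = predictor.get(block)                  # CACHE PREDICTOR (None = no prediction)
    first = pred if pred is not None else default_predictor
    order = [first] + [i for i in range(len(subcaches)) if i != first]  # RE-PROBE order
    for probes, idx in enumerate(order, start=1):
        if block in subcaches[idx]:              # hardware checks only the active ways
            if idx != pred:
                predictor[block] = idx           # update predictor on an unpredicted hit
            return idx, probes
    victim = min(range(len(subcaches)), key=lambda i: len(subcaches[i]))
    subcaches[victim].add(block)                 # CACHE MISS AND PLACEMENT LOGIC
    predictor[block] = victim
    return victim, len(order) + 1                # +1 models the main-memory access

subcaches = [set("ab"), set("cd"), set()]
predictor = {}
print(lookup("c", subcaches, predictor, default_predictor=0))
```

A good predictor keeps the probe count near one per reference, which is exactly why energy in this architecture depends on the number of subcaches accessed per data reference.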
Placement strategies:
This tells how data from memory is placed in the subcache system. Selecting a good placement policy has both energy and performance implications, and it depends on the amount of past history maintained by the system. Once the subcache is selected, the data is placed inside the subcache according to the subcache's own topology. The different policies are random, least-recently-used (LRU), spatial-temporal (ST) and modified-spatial-temporal (MST). In the random policy, one of the subcaches is selected at random; in LRU, the subcache which was least recently used is selected for placement. Usually the spatial data and the temporal data are stored in separate subcaches; if a bypass data item comes in, it is stored in either the spatial or the temporal subcache instead of fixing a location for the bypass data. This strategy is called MST, and performance is found to be better this way, as the number of misses is reduced. The data is stored in whichever subcache (spatial or temporal) has the smaller number of misses, for better load balance. MST improves performance, as the number of misses is lower than with the other strategies.

Prediction strategies:
This is the strategy used to probe for the data in the subcaches. It needs to be good, as otherwise there is a penalty when a miss occurs, because reprobing has to be done. The 'All' strategy accesses all subcaches concurrently, so it provides no energy savings during the probing stage. The MRU/WP strategy accesses the most recently used subcache first; this most-recently-used information can be maintained in a single register. The CIB (Cache Identifier Buffer) strategy holds a list of the most recently used virtual addresses and the corresponding physical subcaches holding those blocks. Whenever the CIB is not able to make a prediction, a default predictor predicts the subcache. The CIB entries are updated whenever the corresponding cache line is accessed, and when it is evicted by the cache miss and placement logic.

To reduce the reprobe penalty when the first probe misses, it is good to probe all subcaches other than the one already accessed simultaneously. CIB is an effective probing strategy: when the program exhibits good locality, CIB has a high probability of making a prediction.

The decay interval, which determines the time after which each cache way is shut off within a subcache, has to be chosen properly in order to avoid extra misses in the cache. A technique for this is described below.
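The CIB strategy above can be modelled as a small most-recently-used map from virtual block addresses to subcache IDs; the capacity and the LRU eviction used here are illustrative assumptions, not the paper's sizing:

```python
from collections import OrderedDict

class CacheIdentifierBuffer:
    """Tiny MRU list: recently used virtual addresses -> subcache holding them."""
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = OrderedDict()           # virtual block address -> subcache id

    def predict(self, vaddr):
        """Predicted subcache id, or None (caller falls back to the default predictor)."""
        if vaddr in self.entries:
            self.entries.move_to_end(vaddr)    # keep MRU ordering
            return self.entries[vaddr]
        return None

    def update(self, vaddr, subcache_id):
        """Called on a line access, and on eviction by the miss-and-placement logic."""
        self.entries[vaddr] = subcache_id
        self.entries.move_to_end(vaddr)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # drop the least recently used entry

cib = CacheIdentifierBuffer(capacity=2)
cib.update(0x40, 1)
cib.update(0x80, 0)
print(cib.predict(0x40), cib.predict(0xC0))
```

When the program exhibits good locality, most lookups hit the MRU list and the predictor returns a subcache ID; poor locality produces `None` and the default predictor takes over, as in the block diagram of Fig C.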
Deciding cache decay interval:
Two things can be done: either the cache line can be turned off at some fixed reference point, deciding that the cache line is worth turning off, or the line can be watched over a period of time and turned off only if no further access occurs. One policy is to turn the line off at the point in time where the extra cost incurred by waiting is precisely equal to the extra cost that would occur if the action taken is wrong. If we wait longer, more leakage energy is dissipated; on the other hand, if the decay interval is short, the number of L2 misses will be higher, which cannot be tolerated, especially if the miss penalty is paid off chip. One good time to turn a line off is when the static energy dissipated since the last access is precisely equal to the dynamic energy that would be dissipated if turning the line off induces an extra miss. The decay interval for the L1 cache was found to be about 10,000 cycles; since a higher-level cache tends to have longer generations with large dead-time intervals, the decay interval for the L2 cache will be greater.

SECTION IV: CONCLUSION
Thus, by partitioning the cache into smaller units and shutting off a portion of the cache ways in each subcache using a time-based decay policy, the leakage energy can be reduced more than when a single technique is implemented, without much performance degradation. Partitioning the cache into smaller units offers many benefits, like reducing per-access energy, improving locality behavior, and lowering dynamic energy consumption. The subcache architecture uses efficient placement and prediction schemes. The placement scheme is MST (modified spatial-temporal), which reduces the miss rate by placing data in the appropriate subcache (spatial or temporal); the prediction scheme is CIB (cache identifier buffer), which predicts well, especially when the program exhibits good locality. Since the time-based decay policy turns off a cache line only during its dead time, performance will not be affected much, as the number of extra misses in the cache will be small. The time-based decay policy exploits the generational characteristics of cache-line usage: individual cache lines are turned off during their dead period (the time between the last successful access and the line's eviction). A global counter provides a tick pulse for the local counters of all cache lines, and a cache line is turned off when its local counter reaches its maximum value. Compared to standard caches of various sizes, a decay cache offers a better active size for the same miss rate, or a better miss rate for the same active size. The decay interval can also be varied as per the utilization of each cache line, thus saving more energy: the adaptive scheme involves selecting among a multitude of decay intervals per cache line, and by varying the decay interval per line depending upon its utilization, additional leakage energy can be saved. The time-based cache decay technique can also be applied to level 1 caches.
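The break-even policy described in "Deciding cache decay interval" equates the leakage spent waiting with the dynamic cost of a wrongly induced miss: leak_per_cycle × t* = E_extra_miss, so t* = E_extra_miss / leak_per_cycle. A numeric sketch with made-up energy values (chosen only so the result lands near the ~10,000-cycle figure quoted above; they are illustrative assumptions):

```python
def break_even_interval(extra_miss_energy_pj, leak_per_cycle_pj):
    """Decay interval (in cycles) at which the leakage energy spent waiting
    equals the dynamic energy of one induced extra miss."""
    return extra_miss_energy_pj / leak_per_cycle_pj

# Hypothetical: a 1000 pJ miss penalty and 0.1 pJ/cycle of per-line leakage.
print(break_even_interval(1000.0, 0.1))  # on the order of 10,000 cycles
```

Since an L2 miss (often serviced off chip) costs far more dynamic energy than an L1 miss, the break-even interval for L2 comes out correspondingly larger, consistent with the tens-of-thousands-of-cycles range mentioned in Section III.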
REFERENCES:
1. Michael Powell, Se-Hyun Yang, Babak Falsafi, Kaushik Roy, and T. N. Vijaykumar, "Reducing Leakage in a High-Performance Deep-Submicron Instruction Cache," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 9, No. 1, February 2001.
2. Stefanos Kaxiras, Zhigang Hu, and Margaret Martonosi, "Cache Decay: Exploiting Generational Behaviour to Reduce Cache Leakage Power."
3. David H. Albonesi, "Selective Cache Ways: On-Demand Cache Resource Allocation," Journal of Instruction-Level Parallelism 2 (2000) 1-6, May 2000.
4. S. Kim, N. Vijaykrishnan, M. Kandemir, A. Sivasubramaniam, M. J. Irwin and E. Geethanjali, "Power-aware Partitioned Cache Architectures."
5. John L. Hennessy and David A. Patterson, "Computer Architecture: A Quantitative Approach," second edition.
6. Anantha P. Chandrakasan and Robert W. Brodersen, "Minimizing Power Consumption in Digital CMOS Circuits," Proceedings of the IEEE, Vol. 83, No. 4, April 1995.