Memory System Midterm Report
Low Power Non-uniform Memory Access (NUMA)
0251808 周德玉
Abstract
NUMA refers to a computer memory design choice for multiprocessors in which some regions of memory take longer to access than others. This work explains what NUMA is, the developments behind it, and how memory access time depends on the memory location relative to a processor. It presents a background on multiprocessor architectures and some hardware trends that accompany NUMA. It then briefly discusses the changes NUMA demands in two key areas: the policies the operating system should implement for scheduling and run-time memory allocation for threads, and the programming approach developers should take to harness NUMA’s full potential. It also presents some numbers comparing UMA and NUMA performance.
Introduction
Non-uniform memory access (NUMA) is a computer
memory design used in multiprocessing, where the
memory access time depends on the memory location
relative to a processor. Under NUMA, a processor can
access its own local memory faster than non-local
memory (memory local to another processor or memory
shared between processors). The benefits of NUMA are
limited to particular workloads, notably on servers where
the data are often associated strongly with certain tasks or
users. NUMA architectures logically follow in scaling
from symmetric multiprocessing (SMP) architectures.
Multiprocessor systems without NUMA make the problem of keeping processors supplied with data considerably worse: such a system can starve several processors at the same time, notably because only one processor can access the computer's memory at a time.
NUMA attempts to address this problem by providing
separate memory for each processor, avoiding the
performance hit when several processors attempt to
address the same memory. For problems involving spread
data (common for servers and similar applications),
NUMA can improve the performance over a single shared
memory by a factor of roughly the number of processors
(or separate memory banks).
Of course, not all data ends up confined to a single task,
which means that more than one processor may require
the same data. To handle these cases, NUMA systems
include additional hardware or software to move data
between memory banks. This operation slows the
processors attached to those banks, so the overall speed
increase due to NUMA depends heavily on the nature of
the running tasks.
Intel announced NUMA compatibility for its x86 and
Itanium servers in late 2007 with
its Nehalem and Tukwila CPUs. Both CPU families share
a common chipset; the interconnection is called
Intel Quick Path Interconnect (QPI). AMD implemented
NUMA with its Opteron processor (2003),
using HyperTransport. [8]
Background
Modern CPUs operate considerably faster than the main
memory they use. In the early days of computing and data
processing, the CPU generally ran slower than its own
memory. The performance lines of processors and
memory crossed in the 1960s with the advent of the first
supercomputers. Since then, CPUs increasingly have
found themselves "starved for data" and having to stall
while waiting for data to arrive from memory. Many
supercomputer designs of the 1980s and 1990s focused on
providing high-speed memory access as opposed to faster
processors, allowing the computers to work on large data
sets at speeds other systems could not approach.
Hardware Goals / Performance Criteria
There are three criteria on which the performance of a multiprocessor system can be judged: scalability, latency and bandwidth. Scalability is the ability of a system to demonstrate a proportionate increase in parallel speedup as more processors are added. Latency is the time taken to send a message from node A to node B, while bandwidth is the amount of data that can be communicated per unit of time. The goal of a multiprocessor system is therefore to be highly scalable with low latency and high bandwidth.
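To make the latency and bandwidth criteria concrete, the short C sketch below models the time to deliver a message as latency plus size divided by bandwidth; the latency and bandwidth figures are assumed values for illustration, not measurements from any system discussed here.

    #include <stdio.h>

    /* Illustrative model only: transfer_time = latency + size / bandwidth.
       The numbers below are assumptions for demonstration, not measured values. */
    int main(void) {
        double latency_s = 1e-6;                   /* 1 microsecond per message */
        double bandwidth = 10e9;                   /* 10 GB/s link              */
        double sizes[]   = { 64, 4096, 1 << 20 };  /* message sizes in bytes    */

        for (int i = 0; i < 3; i++) {
            double t = latency_s + sizes[i] / bandwidth;
            printf("%8.0f bytes -> %.3f us\n", sizes[i], t * 1e6);
        }
        return 0;
    }

Small messages are dominated by latency and large transfers by bandwidth, which is why the two are listed as separate performance criteria.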
Limiting the number of memory accesses provided the key to extracting high performance from a modern computer. For commodity processors, this meant installing an ever-increasing amount of high-speed cache memory and using increasingly sophisticated algorithms to avoid cache misses. But the dramatic increase in size of the operating systems and of the applications run on them has generally overwhelmed these cache-processing improvements.
Parallel Architectures
Typically, there are two major types of parallel architectures prevalent in the industry: shared memory architecture and distributed memory architecture. Shared memory architecture, in turn, is of two types: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA).
Shared Memory Architecture
As seen in Figure 1 (more details are given in the "Hardware Trends" section), all processors share the same memory and treat it as a global address space. The major challenge in such an architecture is cache coherency (i.e., every read must reflect the latest write). This architecture is usually adopted in general-purpose CPUs found in laptops and desktops.
Figure 1. Shared Memory Architecture. [1]
Distributed Memory Architecture
In this type of architecture, shown in Figure 2 (more details are given in the "Hardware Trends" section), all processors have their own local memory, and there is no mapping of memory addresses across processors, so there is no global address space and no cache coherency. To access data held by another processor, processors use explicit communication. One example of this architecture is a cluster, with the different nodes connected over a network.
Figure 2. Distributed Memory. [1]
Shared Memory Architecture – NUMA
Figure 4 shows this type of shared memory architecture: identical processors are connected to a scalable network, and each processor has a portion of memory attached directly to it. The primary difference between NUMA and a distributed memory architecture is that in the distributed case no processor can map memory attached to other processors, whereas under NUMA a processor may do so. NUMA also introduces a classification of memory into local and remote, based on the access latency to each memory region as seen from each processor. Such systems are often built by physically linking SMP machines. UMA, by contrast, has the major disadvantage of not scaling beyond a certain number of processors [6].
Figure 4. NUMA Architecture Layout. [3]
Hardware Trends
Shared Memory Architecture – UMA
As noted earlier, shared memory architecture is of two distinct types: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA).
Figure 3 shows a sample layout of processors and memory across a bus interconnect. All the processors are identical and have equal access times to all memory regions. Such machines are also known as Symmetric Multiprocessor (SMP) machines. Architectures that take care of cache coherency at the hardware level are known as CC-UMA (cache-coherent UMA).
Figure 3. UMA Architecture Layout. [3]
We now discuss two practical implementations of the memory architectures just described: one based on the Front Side Bus (FSB) and the other on Intel's Quick Path Interconnect (QPI).
Traditional FSB Architecture (used in UMA)
As shown in Figure 5, the FSB-based UMA architecture has a Memory Controller Hub (MCH) to which all the memory is connected. The CPUs interact with the MCH whenever they need to access memory, and the I/O controller hub is also connected to the MCH. The major bottleneck in this implementation is therefore the bus, which has a finite speed and scalability issues: for any communication, a CPU needs to take control of the bus, which leads to contention problems.
Figure 5. Intel's FSB based UMA Arch. [4]
OS Design Goals
Operating systems try to achieve two major goals: usability and utilization. By usability, we mean that the OS should abstract the hardware for the programmer's convenience. By utilization, we mean optimal resource management and the ability to multiplex the hardware amongst different applications.
Quick Path Interconnect Architecture (used in NUMA)
The key point of this implementation is that memory is connected directly to the CPUs rather than to a central memory controller: instead of accessing memory via a Memory Controller Hub, each CPU has a memory controller embedded inside it. The CPUs are also connected to an I/O hub and to each other. In effect, this implementation addresses the common-channel contention problem.
Features of NUMA aware OS
The basic requirements of a NUMA-aware OS are the ability to discover the underlying hardware topology and to calculate NUMA distances accurately. NUMA distances tell the processors (and/or the programmer) how much time it takes to access a particular region of memory.
Besides these, the OS should provide a mechanism for processor affinity, i.e., ensuring that certain threads are scheduled on certain processor(s) to preserve data locality. This not only avoids remote accesses but can also take advantage of a hot cache. The operating system also needs to exploit the first-touch memory allocation policy, under which a page is placed on the node of the processor that first touches it.
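As a rough sketch of what these facilities look like to user code on Linux with libnuma (an assumption of this example; the node number and buffer size are likewise arbitrary), the code below queries the topology and the NUMA distances, pins the calling thread to node 0, and then relies on first-touch allocation by touching a freshly allocated buffer from that node.

    /* Sketch assuming Linux + libnuma (compile with -lnuma). */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0) {              /* kernel/library without NUMA */
            fprintf(stderr, "NUMA not available\n");
            return 1;
        }
        int nodes = numa_max_node() + 1;
        for (int i = 0; i < nodes; i++)
            for (int j = 0; j < nodes; j++)      /* relative access cost matrix */
                printf("distance(%d,%d) = %d\n", i, j, numa_distance(i, j));

        numa_run_on_node(0);                     /* processor affinity: run on node 0 */

        size_t bytes = 64UL * 1024 * 1024;       /* arbitrary 64 MB buffer */
        char *buf = malloc(bytes);
        if (!buf) return 1;
        memset(buf, 0, bytes);                   /* first touch from node 0, so the
                                                    pages end up in node 0's memory */
        free(buf);
        return 0;
    }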
Optimized Scheduling Decisions
The operating system needs to make sure that load is balanced amongst the different processors (by distributing data amongst the CPUs for large jobs) and to implement dynamic page migration (i.e., use the latency topology to make page migration decisions).
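A minimal sketch of explicit page migration, assuming Linux with libnuma and a machine with at least two NUMA nodes (the node numbers here are arbitrary): numa_migrate_pages moves the pages of a process from one set of nodes to another, and is the user-level counterpart of the dynamic page migration policy discussed above.

    /* Sketch assuming Linux + libnuma (compile with -lnuma). */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) return 1;

        struct bitmask *from = numa_allocate_nodemask();
        struct bitmask *to   = numa_allocate_nodemask();
        numa_bitmask_setbit(from, 0);     /* pages currently on node 0 ...      */
        numa_bitmask_setbit(to,   1);     /* ... should be moved to node 1
                                             (assumes the machine has a node 1) */

        /* pid 0 means "the calling process"; a negative return indicates error. */
        if (numa_migrate_pages(0, from, to) < 0)
            perror("numa_migrate_pages");

        numa_free_nodemask(from);
        numa_free_nodemask(to);
        return 0;
    }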
Conflicting Goals
The goals the operating system tries to achieve conflict with each other: on one hand we are trying to optimize memory placement (for load balancing), and on the other hand we would like to minimize the migration of data (to limit resource contention). Eventually there is a trade-off, which is decided on the basis of the type of application.
Programming Paradigms
NUMA Aware Programming Approach
The main goals of a NUMA-aware programming approach are to reduce lock contention and to maximize memory allocation on the local node. Programmers also need to manage their own memory for maximum portability. This can prove to be quite a challenge, since most languages do not have a built-in NUMA-aware memory manager.
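A hedged sketch of node-aware allocation with libnuma on Linux (the buffer size and node choice are arbitrary assumptions): numa_alloc_local places memory on the node the calling thread is running on, while numa_alloc_onnode pins it to an explicit node.

    /* Sketch assuming Linux + libnuma (compile with -lnuma). */
    #include <numa.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0) return 1;

        size_t bytes = 1 << 20;                    /* arbitrary 1 MB              */
        double *local  = numa_alloc_local(bytes);  /* memory on the caller's node */
        double *remote = numa_alloc_onnode(bytes, numa_max_node());
                                                   /* highest-numbered node, often
                                                      remote on a multi-node box  */
        if (!local || !remote) return 1;

        memset(local, 0, bytes);                   /* touch to commit the pages */
        memset(remote, 0, bytes);

        numa_free(local, bytes);
        numa_free(remote, bytes);
        return 0;
    }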
Figure 6. Intel's QPI based NUMA Arch. [4]
New Cache Coherency Protocol
The QPI-based implementation also introduces a new cache coherency protocol, MESIF, in place of MESI. The new state "F" stands for Forward and designates the single cache that should act as the responder for requests to a shared line.
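The role of the F state can be pictured with a toy state machine; this is only an illustrative sketch of the read-sharing path (the writeback details of the M state are ignored), not Intel's actual QPI implementation, and the function and state names are invented for the example.

    #include <stdio.h>

    /* Toy MESIF sketch: only the read-sharing path is modelled. */
    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED, FORWARD } mesif_t;

    /* A remote cache 'requester' reads a line held by 'holder'. Only a cache in
       FORWARD (or EXCLUSIVE/MODIFIED) answers; plain SHARED copies stay silent,
       so exactly one cache responds instead of all sharers. A real protocol
       would also write M data back to memory. */
    static void remote_read(mesif_t *holder, mesif_t *requester) {
        if (*holder == FORWARD || *holder == EXCLUSIVE || *holder == MODIFIED) {
            *holder    = SHARED;   /* old owner keeps an ordinary shared copy      */
            *requester = FORWARD;  /* newest sharer becomes the designated
                                      responder for future requests                */
        } else {
            *requester = SHARED;   /* no forwarder present: data comes from memory */
        }
    }

    int main(void) {
        mesif_t c0 = EXCLUSIVE, c1 = INVALID, c2 = INVALID;
        remote_read(&c0, &c1);     /* c0: E -> S, c1: I -> F */
        remote_read(&c1, &c2);     /* c1: F -> S, c2: I -> F */
        printf("c0=%d c1=%d c2=%d\n", c0, c1, c2);
        return 0;
    }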
Operating System Policies
Support for Programmers
Programmers rely on tools and libraries for application development, so the tools and libraries need to help programmers achieve maximum efficiency and implement implicit parallelism. The user or system interface, in turn, needs programming constructs for associating virtual memory ranges with particular nodes, and it also needs to provide functions for obtaining the page residency of allocated memory.
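On Linux, one way such a page-residency query is exposed is the move_pages call, which, when given a NULL target-node array, only reports the node on which each page currently resides; the sketch below assumes that interface and an arbitrary four-page buffer.

    /* Sketch assuming Linux; move_pages comes from libnuma's numaif.h (-lnuma). */
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NPAGES 4

    int main(void) {
        long page = sysconf(_SC_PAGESIZE);
        char *buf = aligned_alloc(page, NPAGES * page);
        if (!buf) return 1;
        memset(buf, 0, NPAGES * page);           /* touch so the pages exist */

        void *pages[NPAGES];
        int   status[NPAGES];
        for (int i = 0; i < NPAGES; i++)
            pages[i] = buf + i * page;

        /* With nodes == NULL, move_pages only reports, in status[], the node
           each page currently resides on (or a negative errno value). */
        if (move_pages(0, NPAGES, pages, NULL, status, 0) != 0)
            perror("move_pages");
        else
            for (int i = 0; i < NPAGES; i++)
                printf("page %d on node %d\n", i, status[i]);

        free(buf);
        return 0;
    }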
Programming Approach
Programmers should explore the various NUMA libraries that are available to help simplify the task. If the data allocation pattern is analyzed properly, first-touch allocation can be exploited fully. There are also several lock-free approaches available that can be used.
Besides these approaches, programmers can exploit various parallel programming paradigms, such as threads, message passing, and data parallelism.
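As one hedged illustration of a lock-free, locality-friendly pattern (the thread and iteration counts are invented for the example), each thread below updates its own cache-line-padded counter using C11 atomics, so no lock and no single hot cache line is contended; a reader simply sums the per-thread values afterwards.

    /* Sketch using POSIX threads and C11 atomics (compile with -pthread). */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define ITERS    1000000

    /* One counter per thread, padded to a 64-byte cache line so that updates
       from different threads (possibly on different NUMA nodes) never share
       a line (no false sharing). */
    struct padded { _Atomic long value; char pad[64 - sizeof(_Atomic long)]; };
    static struct padded counters[NTHREADS];

    static void *worker(void *arg) {
        int id = *(int *)arg;
        for (int i = 0; i < ITERS; i++)
            atomic_fetch_add_explicit(&counters[id].value, 1,
                                      memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        int ids[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) {
            ids[i] = i;
            pthread_create(&t[i], NULL, worker, &ids[i]);
        }
        long total = 0;
        for (int i = 0; i < NTHREADS; i++) {
            pthread_join(t[i], NULL);
            total += atomic_load(&counters[i].value);
        }
        printf("total = %ld\n", total);          /* NTHREADS * ITERS */
        return 0;
    }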
Scalability – UMA vs NUMA
We can see from Figure 7 that the UMA-based implementation has scalability issues: initially both architectures scale linearly, until the shared bus reaches its limit and UMA performance stagnates. Since there is no shared bus in NUMA, it is more scalable.
Figure 7. UMA vs. NUMA – Scalability. [6]
Cache Latency
Figure 8 compares the cache latency numbers of UMA and NUMA. There is no level 3 cache in the UMA system compared. For main memory and the level 2 cache, NUMA shows a considerable improvement; only for the level 1 cache does UMA marginally beat NUMA.
Figure 8. UMA vs NUMA - Cache Latency. [4]
Cache coherent NUMA (ccNUMA)
Nearly all CPU architectures use a small amount of
very fast non-shared memory known as cache to
exploit locality of reference in memory accesses. With
NUMA, maintaining cache coherence across shared
memory has a significant overhead. Although simpler to
design and build, non-cache-coherent NUMA systems
become prohibitively complex to program in the
standard von Neumann architecture programming model.
Typically, ccNUMA uses inter-processor communication between cache controllers to keep a consistent memory image when more than one cache stores the same memory location. For this reason, ccNUMA may perform poorly when multiple processors attempt to access the same memory area in rapid succession. Operating-system support for NUMA attempts to reduce the frequency of this kind of access by allocating processors and memory in NUMA-friendly ways and by avoiding scheduling and locking algorithms that make NUMA-unfriendly accesses necessary. Alternatively, cache coherency protocols such as the MESIF protocol attempt to reduce the communication required to maintain cache coherency. Scalable Coherent Interface (SCI) is an IEEE standard defining a directory-based cache coherency protocol to avoid the scalability limitations found in earlier multiprocessor systems. SCI is used as the basis for the Numascale NumaConnect technology.
As of 2011, ccNUMA systems are multiprocessor systems based on the AMD Opteron processor, which can be implemented without external logic, and the Intel Itanium processor, which requires the chipset to support NUMA. Examples of ccNUMA-enabled chipsets are the SGI Shub (Super hub), the Intel E8870, the HP sx2000 (used in the Integrity and Superdome servers), and those found in NEC Itanium-based systems. Earlier ccNUMA systems such as those from Silicon Graphics were based on MIPS processors and the DEC Alpha 21364 (EV7) processor. [8]
Non-Uniform Distribution of Memory Accesses on Cache Sets Affects the System Performance of Chip Multiprocessors
Extension to CMP Platforms
There are typically two ways to organize the non-first-level cache in chip multiprocessors (CMPs): shared cache or private cache. We study how the non-uniform memory access distribution across sets affects each organization by adapting SBC to each of them respectively.
A. Shared Cache
In a shared-cache organization, a central cache (a distributed shared cache is also a typical shared organization, but we focus on a central shared cache here) is shared among all cores. Each of its sets must therefore serve misses from the cache level above, or bypassed requests, from multiple applications running on all cores [11]. The overall memory access distribution across the sets of the shared cache is the cumulative consequence of that of each application. To study this distribution, we treat all accesses to each set as a whole and simply apply SBC directly to it, regardless of where each access comes from. To distinguish this scheme from SBC on single-core platforms, we call it Shared SBC (SSBC).
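The per-set pressure that SSBC reacts to can be pictured with one saturating counter per set, as in the sketch below; the update rule (increment on a miss, decrement on a hit, both saturating) and the parameter values are illustrative assumptions, and the exact bookkeeping of SBC in [10] may differ.

    #include <stdio.h>

    #define NSETS 64          /* assumed number of cache sets   */
    #define SAT   23          /* assumed saturation limit       */

    static int sc[NSETS];     /* one saturating counter per set */

    /* One plausible update rule: a miss raises the set's pressure,
       a hit lowers it, and both directions saturate. */
    static void on_access(unsigned set, int miss) {
        if (miss) { if (sc[set] < SAT) sc[set]++; }
        else      { if (sc[set] > 0)   sc[set]--; }
    }

    int main(void) {
        /* Skewed toy trace: set 5 misses constantly, set 6 mostly hits. */
        for (int i = 0; i < 100; i++) on_access(5, 1);
        for (int i = 0; i < 100; i++) on_access(6, i % 4 == 0);
        printf("pressure: set 5 = %d, set 6 = %d\n", sc[5], sc[6]);
        return 0;
    }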
There is an impressive body of research on shared cache optimization [12] [14] [15] [18]. However, we do not attempt to adapt SBC to one of those schemes, for two reasons: (1) most of those optimization schemes control either the share of ways each application is allowed to use [12] [13] or the timing with which a block is evicted from the cache [14] [15], both of which modify the replacement policy and make SBC difficult to adapt to them; (2) experimental results for several different cache configurations indicate that SSBC provides little to no performance boost over the baseline shared cache architecture.
B. Private Cache
In a private-cache design, the non-first-level cache is composed of multiple slices, each private to and closely coupled with a different core both logically and physically, serving only local misses and requests. As a result, the memory accesses of different applications are well isolated from each other, making it easy to adapt non-uniform-distribution-based schemes on top of this design. We therefore put forward three schemes on the private-cache design to exploit the non-uniform memory access distribution across sets and to study how this non-uniformity affects the system performance of a private cache design:
(a) Private Set Balancing Cache (PSBC): Sets of private cache slices work exactly the way they do on single-core platforms, except that they may sometimes need to handle a few coherence operations. Therefore, we first introduce SBC directly to each private slice, just as on single-core platforms, but with small modifications to the coherence protocol to ensure that a coherence request to a source set is also directed to its destination set on a miss. We call this simple scheme Private SBC (PSBC).
(b) Balanced Private NUCA (BP-NUCA): The static partitioning inherent in a private cache design may lead to undesirably low utilization of the precious on-chip cache resources [11]. To address this limitation, many private-cache enhancement schemes use a spilling technique [16] [17] to improve capacity utilization, allowing blocks evicted from one private slice to be saved in a peer private slice [16].
In fact, the spilling technique shares a similar idea with SBC: both seek to move some blocks of highly accessed sets to underutilized sets. The main difference is that SBC moves those blocks to sets in the same private cache slice but with a different index address, whereas a typical spilling technique moves them to sets with the same index address but in a remote peer cache. To distinguish the two kinds of "move", we call the former displacement and the latter spill.
Recall that the Saturation Counter (SC) adopted in SBC measures the memory access pressure experienced by the corresponding cache set. We put forward a new spilling technique for private caches based on the following insight: the SC can also be used in a spilling technique to guide the spill process with the memory access pressure information it detects. We call this new spilling technique Balanced Private Non-uniform Cache Architecture (BP-NUCA), because a private cache design is by nature a NUCA, with varying access latency to different cache slices, and because this SC-based spilling technique seeks to balance the cache space utilization of the different private slices.
BP-NUCA works in the following way: on a miss to a set, if the SC value of this set is larger than the migration limit (denoted ThM), then the victim of the miss is allowed to spill into one of the sets of the peer caches with the same index (called peer sets). A peer set qualifies as a receiver of the spilled block if its SC value is smaller than the receiver limit (denoted ThR). When there is more than one potential receiver set, the one with the shortest access latency is selected.
(c) Balanced Private NUCA+ (BP-NUCA+): We also adapt SBC to BP-NUCA and obtain BP-NUCA+. We expect BP-NUCA+ to be beneficial because it attempts to balance cache capacity utilization both horizontally (by spilling blocks to peer sets) and vertically (by displacing blocks to different sets of the same cache slice). As a result, blocks of highly accessed sets can also borrow the space of sets that have a different index address and reside in remote cache slices.
Consequently, BP-NUCA+ has five policy parameters,
namely (SAT, ThM, ThA, ThL, ThR). Those parameters
should follow SAT ≥ ThM ≥ ThA. SAT is the upper bound of
SC. Spilled blocks have longer access latency than
displaced ones, so we prefer to displace blocks first.
Therefore, we let ThM ≥ ThA.
Figure 9 compares the basic ideas of PSBC, BP-NUCA
and BP-NUCA+.
Figure 9. Comparison of PSBC, BP-NUCA and BP-NUCA+
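Before moving to the evaluation, the BP-NUCA spill rule described above can be sketched directly; the slice count, latency table and threshold values below are assumptions for illustration only.

    #include <stdio.h>

    #define NSLICES 4            /* assumed private cache slices (one per core) */
    #define NSETS   64
    #define TH_M    15           /* assumed migration limit ThM */
    #define TH_R     8           /* assumed receiver limit  ThR */

    static int sc[NSLICES][NSETS];             /* per-set saturation counters */
    static int latency_from[NSLICES][NSLICES]; /* assumed slice-to-slice hops */

    /* Returns the slice whose peer set should receive the spilled victim on a
       miss, or -1 if the victim is simply evicted (no spilling). */
    static int choose_receiver(int local, unsigned set) {
        if (sc[local][set] <= TH_M) return -1;   /* local set not under pressure */
        int best = -1;
        for (int s = 0; s < NSLICES; s++) {
            if (s == local || sc[s][set] >= TH_R) continue;  /* not a valid receiver */
            if (best < 0 || latency_from[local][s] < latency_from[local][best])
                best = s;                        /* keep the closest qualified peer */
        }
        return best;
    }

    int main(void) {
        for (int i = 0; i < NSLICES; i++)
            for (int j = 0; j < NSLICES; j++)
                latency_from[i][j] = (i > j) ? i - j : j - i;   /* toy distances */
        sc[0][10] = 20;  sc[1][10] = 3;  sc[2][10] = 12;  sc[3][10] = 2;
        printf("spill receiver for slice 0, set 10: slice %d\n",
               choose_receiver(0, 10));          /* prints slice 1 (closest, low SC) */
        return 0;
    }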
Experimental Methodology
The study uses the g-cache module of Virtutech Simics [9], a full-system simulator, for the performance studies. Evaluation is performed on a 4-core CMP with the parameters given in Table I. An in-order core model is used so that the proposals can be evaluated within a reasonable time.
For the study, 23 SPEC CPU2006 benchmarks are used to randomly create 16 four-benchmark multiprogrammed workloads, as listed in Table II. All workloads are simulated until each benchmark in the workload executes at least 250M instructions.
Experimental Results and Findings
We examine the performance of SSBC relative to the baseline shared cache (Shared). For the private cache design, we compare the performance of PSBC, BP-NUCA and BP-NUCA+ to that of the baseline private cache (Private).
A. Results for Shared Cache
To examine the performance of SSBC across a spectrum of cache configurations (specifically, different associativities A), we run the simulation on three configurations: an 8-way 2MB cache, a 16-way 2MB cache and a 32-way 2MB cache. We set (SAT, ThM, ThA) to (3A − 1, 2A − 1, A); for the 8-way cache, for example, this gives (23, 15, 8).
Figure 10 shows the throughput of SSBC on these three configurations, normalized to that of Shared for each corresponding configuration. Geomean is the geometric mean over all 16 workloads.
As can be observed from Figure 10, although SSBC does outperform the baseline shared cache for a few workloads in each configuration, it degrades throughput for the majority of the workloads, most obviously for the 32-way 2MB cache. In general, SSBC provides little to no average performance benefit for any of the three configurations, and shows an obvious average degradation for the 32-way 2MB cache. Moreover, SSBC does not show the kind of stability that SBC exhibits on single-core platforms.
Based on the experimental results, we draw two conclusions: (a) on CMP platforms with multiprogrammed workloads, the unpredictable interaction between memory accesses from different applications complicates the memory access distribution across sets, reducing its non-uniformity in general; (b) schemes that seek to exploit the non-uniformity of the memory access distribution across sets for a performance boost are not necessary for the shared cache on CMP platforms.
B. Results for Private Cache
Figure 11 shows the throughput of PSBC, BP-NUCA and BP-NUCA+, normalized to Private; Geomean is the geometric mean over all 16 workloads. We expect PSBC to outperform Private because, owing to the static partitioning of cache space, the memory access pattern of each private cache slice is similar to that of a single-core cache, except for a few coherence requests. The results in Figure 11 confirm this expectation: PSBC outperforms Private for all workloads except MIX 12, with an average performance boost of 2.0% across all 16 workloads.
As previously stated, BP-NUCA has three policy parameters, (SAT, ThM, ThR), while BP-NUCA+ has five, (SAT, ThM, ThA, ThL, ThR). The values of these parameters not only affect the accuracy of the per-set cache pressure estimation but also tune the aggressiveness of the two schemes. It would be unrealistic to search the whole parameter space for the optimal values. Based on the initial conclusions on (ThA, ThL) from [10] and Section III-B, we construct several parameter configurations for each scheme empirically and search for the best among them experimentally. The detailed parameter study is omitted due to limited space.
BP-NUCA and BP-NUCA+ achieve a surprisingly high performance boost, far beyond our expectation, outperforming the baseline Private by 7.7% and 7.6% on average, respectively. BP-NUCA outperforms PSBC for 14 of the 16 workloads, while BP-NUCA+ outperforms PSBC for 13 of the 16 workloads. BP-NUCA and BP-NUCA+ also demonstrate good stability, in that they work better than the baseline Private for all 16 workloads simulated.
One notable fact is that, although BP-NUCA+ takes into account the non-uniformity of memory access distribution across sets, this enhancement over BP-NUCA does not necessarily yield much additional performance benefit. In fact, although BP-NUCA+ outperforms BP-NUCA for 8 of the 16 workloads, it achieves a slightly lower geometric-mean throughput than BP-NUCA.
Based on the above results and analysis, we draw several conclusions: (a) on a private cache design, direct adaptation of schemes that exploit the non-uniform memory access distribution across sets for a performance boost proves beneficial; (b) however, if those schemes are used in conjunction with a traditional private-cache enhancement technique, namely spilling, they fail to provide additional benefit across a wide spectrum of multiprogrammed workloads. [19]
Figure 10. Throughput Performance of SSBC (on 8-way
2MB cache, 16-way 2MB cache and 32-way 2MB cache
respectively)
Figure 11. Throughput Performance of PSBC, BP-NUCA and BP-NUCA+, normalized to that of Private
A Non-Uniform Cache Architecture for Low Power
System Design
Non-Uniform Cache Architecture
We determine the optimum number of cache-ways for
each cache-set at design time. Although the number of
active cache-ways can be changed dynamically by using a
sleep transistor during the course of running an
application program, we do not consider it in this work.
The power supply of unused cache-ways (the gray portion of Figure 12) can be disconnected by eliminating the vias used to connect the power supply to the memory cells. Unused memory cells can also be disconnected from the bit and word lines in the same fashion.
Figure 12. Deactivating sense amplifiers
One possible way of marking unused cache blocks is to use a second valid bit [20]. If this bit is one, the corresponding cache block will not be used for replacement on a cache miss, and accessing an unused block will always cause a cache miss. To reduce the dynamic power consumption of the non-uniform cache, it is possible to deactivate the sense amplifiers of cache-ways that are marked as unused for the accessed cache-set. This can be easily implemented by checking the set-index field of the memory address register. For example, in Figure 12, the sense amplifiers for tag1 and way1 are deactivated when the target cache-set is 4, 5, 6, or 7. Similarly, the sense amplifiers for tag2, way2, tag3, and way3 are deactivated when one of sets 2-7 is accessed.
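A hedged sketch of this selective-activation logic: given a per-set table of how many ways are kept powered (a design-time decision), the set-index field of the address decides which way and tag sense amplifiers to enable. The table contents and sizes below are illustrative, loosely in the spirit of Figure 12.

    #include <stdio.h>

    #define NSETS    8
    #define LINESIZE 32                 /* assumed cache-line size in bytes */

    /* Number of active ways chosen per set at design time (illustrative values:
       later sets keep fewer ways powered). */
    static const int active_ways[NSETS] = { 4, 4, 2, 2, 1, 1, 1, 1 };

    /* Decide which ways (and their tags) to enable for one access. */
    static unsigned ways_to_enable(unsigned addr) {
        unsigned set  = (addr / LINESIZE) % NSETS;  /* set-index field */
        unsigned mask = 0;
        for (int w = 0; w < active_ways[set]; w++)
            mask |= 1u << w;            /* power the sense amps of way w */
        return mask;                    /* ways outside the mask stay off */
    }

    int main(void) {
        unsigned addr = 0x1234;         /* arbitrary example address */
        printf("set %u -> way-enable mask 0x%x\n",
               (addr / LINESIZE) % NSETS, ways_to_enable(addr));
        return 0;
    }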
Reducing Redundant Cache Accesses
In [21], Panwar et al. have shown that cache-tag access and tag comparison do not need to be performed for every instruction fetch. Consider an instruction j executed immediately after an instruction i. There are three cases:
1. Intra-cache-line sequential flow
This occurs when instructions i and j reside in the same cache line and i is a non-branch instruction or an untaken branch.
2. Inter-cache-line sequential flow
This case is similar to the first one; the only difference is that i and j reside in different cache lines.
3. Non-sequential flow
In this case, i is a taken branch instruction and j is its target.
In the first case (intra-cache-line sequential flow), it is easy to detect that j resides in the same cache-way as i. Therefore, there is no need to perform a tag lookup for instruction j [21][22][23]. On the other hand, a tag lookup and a cache-way access are required for a non-sequential fetch such as a taken branch (non-sequential flow) or a sequential fetch across a cache-line boundary (inter-cache-line sequential flow). As a consequence, the power consumption of the cache memory can be reduced by deactivating the memory modules of the tags and cache-ways in the case of intra-cache-line sequential flow. Several embedded processors, including ARM [22][23], use this technique. We refer to it as Inter-Line Way Memoization (ILWM) and use it in our approach.
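The ILWM condition itself reduces to a single comparison of line addresses, as in the sketch below (the line size is an assumed value): the tag lookup can be skipped only when control flow is sequential and the next fetch stays within the previous fetch's cache line.

    #include <stdio.h>

    #define LINESIZE 32     /* assumed cache-line size in bytes */

    /* Returns 1 when the fetch of next_pc can reuse the way found for prev_pc,
       i.e. intra-cache-line sequential flow: same line and no taken branch. */
    static int skip_tag_lookup(unsigned prev_pc, unsigned next_pc, int taken_branch) {
        return !taken_branch && (prev_pc / LINESIZE) == (next_pc / LINESIZE);
    }

    int main(void) {
        /* Sequential fetch inside one line: lookup skipped.             */
        printf("%d\n", skip_tag_lookup(0x100, 0x104, 0));   /* 1 */
        /* Sequential fetch crossing a line boundary: full lookup.       */
        printf("%d\n", skip_tag_lookup(0x11C, 0x120, 0));   /* 0 */
        /* Taken branch: full lookup even if the target shares the line. */
        printf("%d\n", skip_tag_lookup(0x104, 0x110, 1));   /* 0 */
        return 0;
    }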
Figure 13. A code placement technique for reducing
redundant cache-way and cache-tag accesses
Assume a basic block "a" consists of 7 instructions and that its last instruction, a7, which is a taken branch, resides in the fourth word of cache line "n" (see Figure 13). Further assume that the last instruction of cache line "n" is not a branch instruction. A tag lookup is required when a3 or a7 is executed, because in either case it is not clear whether the next instruction resides in the cache or not. However, if the location of basic block "a" in the address space is changed so that it is not located across a cache-line boundary, the cache and tag accesses for instruction a3 can be eliminated (see Figure 13). Therefore, we change the placement of basic blocks in main memory so that frequently accessed basic blocks are not located across a cache-line boundary. To the best of our knowledge, this is the first code placement technique which reduces the number of redundant cache-way and cache-tag accesses.
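The placement rule can be sketched as a simple greedy pass over basic blocks: if a frequently executed block would straddle a cache-line boundary at its current address, pad up to the next boundary first. The line size, block sizes and hot/cold labels below are assumptions for illustration; the authors' actual algorithm also co-optimizes the cache configuration.

    #include <stdio.h>

    #define LINESIZE 32     /* assumed cache-line size in bytes */

    /* Returns the placement address for a basic block of 'size' bytes starting
       no earlier than 'addr': hot blocks that would straddle a line boundary
       are pushed to the next boundary. */
    static unsigned place_block(unsigned addr, unsigned size, int hot) {
        unsigned off = addr % LINESIZE;
        if (hot && size <= LINESIZE && off + size > LINESIZE)
            addr += LINESIZE - off;      /* pad: start at the next line boundary */
        return addr;
    }

    int main(void) {
        unsigned addr   = 0;
        unsigned size[] = { 20, 28, 12 };   /* toy basic-block sizes in bytes */
        int      hot[]  = {  1,  1,  0 };
        for (int i = 0; i < 3; i++) {
            unsigned at = place_block(addr, size[i], hot[i]);
            printf("block %d (%u bytes) placed at %u\n", i, size[i], at);
            addr = at + size[i];
        }
        return 0;
    }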
Figure 14 shows the power breakdown for a cache. For example, in "JPEG-enc" the inter-cache-line sequential flow is responsible for 10% of cache accesses. Note that for inter-cache-line sequential flows, all cache-ways and cache-tags are activated. Therefore, the power consumption of the cache memory due to the inter-cache-line sequential flow is large, especially for highly associative caches. Assuming a 16-way set-associative cache, more than 50% of the cache power in "JPEG-enc" is due to the inter-cache-line sequential flow. Therefore, decreasing the number of inter-cache-line sequential flows substantially reduces the cache power consumption. Another way of reducing the number of times the inter-cache-line sequential flow occurs is to increase the size of the cache lines. However, increasing the cache-line size increases the number of off-chip memory accesses in case of a cache miss. Our algorithm takes this trade-off into account and explores different cache-line sizes to minimize the total power consumption of the memory hierarchy.
Figure 14. Power breakdown for a cache
Experimental Results
We compared the following four techniques: (a) cache sizing for a uniform cache; (b) cache sizing for a uniform cache followed by conventional code placement; (c) our code placement and cache sizing for a uniform cache performed concurrently; and (d) concurrent optimization for the non-uniform cache. Redundant cache-way and cache-tag access elimination (ILWM [21]) was used for all four techniques. The number of cache-sets in all experiments was 8. The power consumption results optimized without and with a time constraint are shown in Figure 8 of [24]. The time constraint Tconst is set to the execution time of the target application program with the original cache configuration. "Low Leak" and "High Leak" correspond to low- and high-leakage scenarios, respectively.
Since conventional code placement techniques reduce only the number of cache misses, they may increase the number of cache-way and tag accesses if the processor uses the ILWM technique [21]; for example, compare cases (a) and (b) in the time-constrained optimization results in Figure 8 of [24]. On the other hand, our methods (c) and (d) always reduce the dynamic power consumption of the cache memories. Optimizing without a time constraint reduced the power consumption for "Compress" by 29% (17% on average), while in the presence of a time constraint, up to a 76% (52% on average) reduction in power consumption was achieved. The reason the time-constrained case gives better results is that it requires a higher number of ways, so there is more opportunity for our method to reduce the average number of cache-ways accessed.
Table III shows the number of ways, the cache-line size and the cache size (in bytes) in the high-leakage case in our experiment. As one can see, in many cases our approach (d) reduces the effective size (the total size of the blocks used) of the cache memory as well.
TABLE III. The cache configuration results
Since the behavior of a program depends on its input values, an object code and cache configuration optimized for a specific input value are not necessarily optimal for other input values. To see the effect of changing the input value on the cache behavior, we calculated the power consumption of the memory system for different input values. We calculated the following three values for six different input values:
1. the power consumption for the original object code
executed with a uniform cache optimized for Data0.
2. the power consumption (Ptotal) for the optimized
object code executed with a non-uniform cache optimized
for Data0.
3. the total execution time (Ttotal) for the optimized
object code running on a processor with the non-uniform
cache. The performance value is normalized to the
performance for Data0.
Figure 15 shows the results for six different input values
for each benchmark program. The left and right vertical
axes represent the power consumption of memories and
the normalized performance of a processor with the
non-uniform cache, respectively. The object code and
cache configuration were optimized for Data0 using our
algorithm for non-uniform caches. As one can see, the
object code and the cache configuration optimized for
Data0 achieve very good results for other input values as
well.
Figure 15. Input Data Dependency
Table IV shows the computation time (in seconds) of the four optimization methods, executed on a dual-CPU UltraSPARC-II workstation running Solaris 8 at 450 MHz with 2 GB of memory. Since the optimization time is very large in some cases, our future plan is to reduce it substantially. [24]
Table IV. CPU-time for cache optimization (second)
Conclusion
The hardware industry has adopted NUMA as an architectural design choice, primarily because of characteristics such as scalability and low latency. However, this hardware change also demands changes in programming approaches (development libraries, data analysis) as well as in operating system policies (processor affinity, page migration). Without these changes, the full potential of NUMA cannot be exploited.
The second study explores the feasibility of taking advantage of the non-uniform memory access distribution across sets in non-first-level cache management to improve the system performance of CMPs. It presents four cache management schemes for CMP platforms, namely SSBC, PSBC, BP-NUCA and BP-NUCA+, based on the single-core scheme SBC [10] and aiming to balance the memory access distribution across cache sets on both shared and private caches.
Experimental results using a full-system CMP simulator indicate that on CMP platforms with multiprogrammed workloads: (a) for shared caches, the non-uniform memory access distribution across cache sets is biased by the fact that multiple applications run concurrently and share the cache capacity, and the proposed scheme SSBC, derived by adapting SBC to shared caches, proves to be of little to no benefit and can even lead to degradation; (b) for caches organized as private caches, simple adaptation of SBC to each private cache, namely PSBC, outperforms the baseline private organization by 2% on average; (c) however, for the private-cache enhancement scheme proposed there, BP-NUCA, further adaptation of SBC on top of it (BP-NUCA+) is of little to no additional benefit.
The conclusion is therefore that on CMP platforms with multiprogrammed workloads, the distribution of memory accesses across cache sets is less non-uniform, or the non-uniformity cannot easily be taken advantage of, due to the interactions between multiple applications. As a result, special efforts to seek further benefit from this kind of non-uniformity are not really necessary.
The third study proposed the non-uniform cache architecture, a code placement technique for reducing the power consumption of caches, and an algorithm for simultaneous cache configuration optimization and code placement. Future work there includes enhancing the method by dynamically disabling cache-ways during the course of running an application program.
References
[1] “Introduction to Parallel Computing”:
https://computing.llnl.gov/tutorials/parallel_comp/
[2] “Optimizing software applications for NUMA”:
http://software.intel.com/en-us/articles/optimizing-software-applications-for-numa/
[3] “Parallel Computer Architecture - Slides”:
http://www.eecs.berkeley.edu/~culler/cs258-s99/
[4] “Cache Latency Comparison”:
http://arstechnica.com/hardware/reviews/2008/11/nehalem-launch-review.ars/3
[5] “Intel – Processor Specifications”:
http://www.intel.com/products/processor/index.htm
[6] “UMA-NUMA Scalability”:
www.cs.drexel.edu/~wmm24/cs281/lectures/ppt/cs282_lec12.ppt
[7] “Non-Uniform Memory Access (NUMA)”:
http://cs.nyu.edu/~lerner/spring10/projects/NUMA.pdf
[8] “Non-Uniform Memory Access”:
http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access
[9] P S Magnusson, M. Christensson, J. Eskilson, D.
Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A.
Moestedt, B. Werner. Simics: a full system simulation
platform. Computer, 2002, 35(2): 50–58.
[10] Dyer Rolán, Basilio B. Fraguela, Ramón Doallo.
Adaptive line placement with the set balancing cache,
Proceedings of the 42nd Annual IEEE/ACM International
Symposium on Microarchitecture, New York, NY, ACM,
2009, pp. 529–540.
[11] Xiaomin Jia, Jiang Jiang, Tianlei Zhao, Shubo Qi,
Minxuan Zhang. Towards Online Application Cache
Behaviors Identification in CMPs, The 12th IEEE
International Conference on High Performance
Computing and Communications, Melbourne, Australia,
IEEE Computer Society, 2010.
[12] G. E. Suh, S. Devadas, et al. A New Memory
Monitoring Scheme for Memory-Aware Scheduling and
Partitioning. Proc Int Symposium on High Performance
Computer Architecture. Washington, DC, USA: IEEE
Computer Society, 117–128. 2002.
[13] S. Kim, D. Chandra, Y. Solihin. Fair Cache Sharing
and Partitioning in a Chip Multiprocessor Architecture,
Proceedings of the 13th International Conference on
Parallel Architectures and Compilation Techniques,
Antibes, Juan-les-Pins, France, IEEE Computer Society,
2004, pp. 111-122.
[14] Aamer Jaleel, William Hasenplaugh, Moinuddin
Qureshi, Julien Sebot, Simon Steely, Joel Emer. Adaptive
insertion policies for managing shared caches on CMPs,
Proc Int Conference on Parallel Architectures and
Compilation Techniques, Toronto, CANADA, ACM,
2008, pp. 208–219.
[15] Yuejian Xie and G. H. Loh. PIPP:
promotion/insertion pseudo-partitioning of multi-core
shared caches. Proceedings of the 36th Annual
IEEE/ACM International Symposium on Computer
Architecture(ISCA-36), Austin, TX, USA, ACM. 2009.
[16] J Chang, G S Sohi. Cooperative caching for chip
multiprocessors. Proc Int Symposium on Computer
Architecture. IEEE Computer Society: Boston, MS, USA,
2006. 264–276.
[17] Qureshi, M. K.: Adaptive spill-receive for robust
high-performance caching in CMPs, Proc Int Symposium
on High Performance Computer Architecture, Raleigh,
North Carolina, USA, IEEE Computer Society, 45–54.
2009.
[18] Moinuddin K. Qureshi, David Thompson, Yale N.
Patt. The V-Way Cache: Demand Based Associativity via
Global Replacement. SIGARCH Comput. Archit. News,
Vol. 33, 2005, No. 2, pp. 544-555.
[19] “Understanding How Non-Uniform Distribution of
Memory Accesses on Cache Sets Affects the System
Performance of Chip Multiprocessors”
[20] D. A. Patterson, et al., “Architecture of a VLSI
instruction cache for a RISC”, In Proc. 10th Annual Int’l
Symposium on Computer Architecture, vol. 11, no. 3,
pp. 108-116, June 1983.
[21] R. Panwar, and D. Rennels, “Reducing the Frequency
of Tag Compares for Low Power I-Cache Design”, In
Proc. of ISLPED, pp.57-62, August 1995.
[22] S. Segars, “Low Power Design Techniques for
Microprocessors”, ISSCC Tutorial note, February 2001.
[23] M. Muller, “Power Efficiency & Low Cost: The
ARM6 Family”, In Proc. of Hot Chips IV, August 1992.
[24] “A Non-Uniform Cache Architecture for Low Power
System Design”