Performance of Cached DRAM Organizations in Vector Supercomputers

W.-C. Hsu¹
Hewlett-Packard Company
19447 Pruneridge Ave.
Cupertino, CA 95014

J. E. Smith
Cray Research, Inc.
900 Lowater Rd.
Chippewa Falls, WI 54729

¹ This work was done while W.-C. Hsu was with Cray Research, Inc.

Abstract

DRAMs containing cache memory are studied in the context of vector supercomputers. In particular, we consider systems where processors have no internal data caches and memory reference streams are generated by vector instructions. For this application, we expect that cached DRAMs can provide high bandwidth at relatively low cost.
We study both DRAMs with a single, long cache line and DRAMs with multiple, smaller cache lines. Memory interleaving schemes that increase data locality are proposed and studied. The interleaving schemes are also shown to lead to non-uniform bank accesses, i.e. hot banks. This suggests an important optimization problem: methods that increase locality should be used to improve performance, but not so much that hot banks diminish performance. We show that for uniprocessor systems, both types of cached DRAMs work well with the proposed interleave methods. For multiprogrammed multiprocessors, the multiple cache line DRAMs work better.
1. Introduction
After years of using simple DRAM organizations that provide data at rates keyed to the performance of internal transistor arrays, chip makers are now in the process of introducing innovative organizations for commodity DRAM parts [Jone92]. These are aimed at reducing the disparity between processor and memory performance in desktop systems, while keeping the costs of such systems low. Consequently, a common characteristic of these new DRAM parts is that they use some form of internal data caching.
Vector supercomputers have significantly different characteristics from the desktop systems that are driving the development of new DRAM organizations. Besides the obvious difference in raw processing speeds, vector supercomputers differ from desktop systems in the following ways.
(1) Supercomputers typically do not use data caches for vectors, and sometimes (as in the case of Cray Research systems) they do not cache scalar data, either. There are at least three reasons for this. First, in vector machines memory latencies are amortized over pipelined streams of data references, i.e. vectors. Second, supercomputer-class problems often do not exhibit the locality required to make a data cache effective (although reprogramming and cache blocking compiler algorithms may help in some cases). Third, maintaining cache coherence is perceived to be a difficult problem in vector multiprocessors.
(2) Vector supercomputer systems usually contain multiple processors which either operate in parallel on the same job or independently on different jobs. The new DRAM organizations typically depend on locality that may be significantly reduced in multiprocessor situations [Come92].
(3) The per-processor bandwidth requirements in vector supercomputers are much greater than in PCs and workstations. For example, a vector machine needs sustained bandwidth of several words of data per clock period per processor (6 in the Cray Y-MP C90 [Cray91], for example). This means that highly interleaved memory systems with many banks are necessary.
Despite the significant differences between desktop systems and vector supercomputers, we feel that the new DRAM parts may still yield cost-performance improvements for vector supercomputers, provided the memory system is properly designed. In the supercomputer context, we stress the cost aspect, because current systems often use SRAM for main memory. SRAM provides performance at least equivalent to the best of the new DRAM technologies, but costs much more. As a point of reference, the CRAY Y-MP C90 uses 1024 banks of 15 ns SRAM memory. A total of 20,000 SRAM chips are used in the largest size C90 main memory. Consequently, SRAM costs make up a majority of system-wide part costs. Although using traditional DRAM memory can dramatically reduce the cost for the same size memory, the memory bandwidth will also be much lower.
For instance, if conventional 140 ns DRAM memory were used in the C90 instead of 15 ns SRAM, many times (8 to 16) more banks would be required to provide comparable memory bandwidth. Any savings in per-chip costs would likely be lost due to the larger number of chips and higher logic and interconnect costs. However, with the new cached DRAM parts it may now be possible to build an affordable memory system with both large size and high memory bandwidth.
In this paper, we look at ways the new cache-oriented DRAM parts and techniques can be adapted for use in high-end vector supercomputers. These methods use the caching capabilities of DRAM chips and employ unorthodox interleaving techniques to improve locality, especially in multiprocessor situations. Section 2 provides an overview of the new DRAM organizations. Section 3 proposes memory interleaving methods that are directed toward improving performance. Section 4 describes the system model we are studying, as well as the simulation benchmarks and performance measures we use. Section 5 gives results for our trace-driven simulations. Finally, Section 6 contains conclusions.
2. Cached DRAM Organizations
A wide variety of high performance DRAM parts
are possible, but we divide the ones of interest to us into
two generic classes.
(1) The first class is a simple outgrowth of static column DRAMs, where an entire row of the memory array is latched and may be accessed repeatedly by modifying the column address lines only (as long as consecutive accesses are to addresses within the latched row). In effect, these DRAMs have an internal cache that consists of one large line, i.e. the latched row.
(2) The second class is made up of parts that contain multiple cache lines of conventional length. These on-chip caches can be accessed using direct-mapped or set associative methods, with the tags being held in the off-chip memory controller.
Fig. 1a illustrates the class of single line cached DRAMs, and Fig. 1b illustrates the class of multiple line cached DRAMs. Parts from RAMtron [Bond92] and RAMbus [Farm92] (RAMbus essentially puts two memory banks on a chip) belong to the first class, and DRAMs from Mitsubishi [Hart92] belong to the second class. A common property of both types of cached DRAMs is that the cache-fill bandwidth is very high because the bus connecting the cache and the DRAM memory array is very wide. This feature encourages the use of large line sizes to exploit spatial locality.
For our study, we look at generic versions of these DRAM part types. We want to avoid becoming bogged down in the details (and quirks) of specific implementations and timing. There is currently no standard, de facto or otherwise, so such details could change anyway.
Fig. 1. Basic DRAM organizations: (a) a single line organization; (b) a multi-line organization.
For both types of RAMs, we assume the internal cache lines are write-back with write-allocate. To make comparisons easier, we simulate parts where the total cache sizes are the same and are of similar size to commercially available parts.
3. Memory Interleaving
To simplify our discussion, we assume word addressing (adjustments for byte addressing are straightforward). If a memory system has b banks, each containing m words, then there are log b + log m address bits. Traditionally, memories are interleaved on low-order address bits. That is, the low order log b bits are used to select the bank and the high order log m bits are used to read a word within the bank (see Fig. 2a). This puts all the addresses modulo b in the same bank. Such interleaving is very common in high performance systems, especially when the bank cycle time is slower than the system clock cycle. This interleave scheme has the advantage that each bank will tend to be uniformly referenced (except for vector strides that are a multiple of a power of two). However, since words at consecutive addresses are placed in different banks, this scheme prevents DRAM caches from effectively exploiting spatial locality. That is, a single cache line held in a DRAM chip contains words that are at logical addresses separated by a large power of two (the number of banks).
An alternative, which increases spatial locality, is to interleave on high order address bits, as shown in Fig. 2b. In this case the high order log b address bits are used to select the bank. This does increase locality, since consecutive addresses are in the same bank. However, it is also likely to produce highly non-uniform bank references and thus degrade performance for single jobs. For example, consider the vector loop A(I)=B(I)+C(I) with stride one references. While executing this loop, we typically exercise at most three memory banks, because there are three reference streams, each with its own spatial locus of reference. Another disadvantage that offsets improvements in spatial locality is that when the cache hit time is slower than the system clock, such interleaving cannot deliver one word of data per clock period for sequential accesses. Some supercomputers may have clock rates much faster than the hit time of the DRAM cache.

We attempt to blend the two interleaving schemes by distributing the interleave bits in positions other than the highest and lowest order positions. To do this we consider cache line boundaries and/or memory bandwidth capabilities.
One such way of interleaving is to use address bits at the cache line level. If a cache line has p words, then the low order log p bits and the high order log m − log p bits are used to select the word within a bank. The bank address bits are the log b bits above the low order log p bits (Fig. 2c). That is, all the words in a cache line are from consecutive memory addresses, and the interleaving then moves to the next bank (Fig. 3a). Such a scheme still has significant hot bank problems, as we shall see, especially for DRAMs with a single large line. Also, there is reduced bandwidth for sequential accesses when the cache hit time is slower than the processor cycle time, as explained above.
To overcome these problems we suggest "block" interleaving, where the interleave bits are split. For n-way block interleaving, the low order log n bits are used to address a bank within a bank group, the next log p bits are used to address the word within the bank, the next log b − log n bits are used to select the bank group, and the remaining high order bits are also used to address the word within the bank (Fig. 2d). The addresses placed in each bank are illustrated in Fig. 3b. By this definition, cache line interleaving is also 1-way block interleaving.
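To make the schemes concrete, the following sketch (our illustration, not part of the original paper) computes the bank selected by each interleave for a given word address, assuming b, m, p, and n are powers of two:

```python
# Address-to-bank mappings for the four interleaving schemes.
# b = number of banks, m = words per bank, p = words per cache line,
# n = banks per group (n-way block interleaving).

def log2(x):
    return x.bit_length() - 1

def low_order_bank(addr, b):
    # Fig. 2a: the low order log b bits select the bank.
    return addr % b

def high_order_bank(addr, b, m):
    # Fig. 2b: the high order log b bits select the bank.
    return (addr >> log2(m)) % b

def cache_line_bank(addr, b, p):
    # Fig. 2c: the log b bits above the low order log p bits select the
    # bank, so all p words of a line sit at consecutive addresses.
    return (addr >> log2(p)) % b

def block_bank(addr, b, p, n):
    # Fig. 2d: the low order log n bits pick a bank within a group, and
    # the log b - log n bits above the line offset pick the bank group,
    # so sequential addresses rotate across the n banks of a group.
    group = (addr >> (log2(n) + log2(p))) % (b // n)
    return group * n + addr % n
```

With n = 1, block_bank reduces to cache_line_bank, matching the observation that cache line interleaving is 1-way block interleaving; with n > 1, sequential accesses rotate across n banks, which is what restores sequential bandwidth when the cache hit time exceeds the processor cycle time.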
Fig. 2. Different memory interleaving schemes, assuming there are b banks, m words per bank, and cache lines of p words. Bits used for the bank address are enclosed in bold lines: (a) low order interleaving; (b) high order interleaving; (c) cache line interleaving; (d) n-way block interleaving.

4. Simulation Framework

4.1. System Model
Fig. 4 illustrates the systems we simulate. To simplify the simulation model (this is important because of the lengths of the reference streams), the system model assumes a conflict-free interconnection network with uniform, fixed delays. That is, we apply streams of addresses from the processor(s) directly to the memory banks, and then determine if there is a hit or miss within the banks' caches. For simplicity, we assume one cache per bank.
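To illustrate the model, here is a minimal per-bank cache sketch (our code, not the authors' simulator), assuming the direct-mapped, write-back, write-allocate organization described in Section 2, and taking word_addr to be the word index as seen by the bank:

```python
class BankCache:
    """Direct-mapped cache model for one memory bank; counts hits,
    clean misses, and dirty misses (the tags would sit in the
    off-chip memory controller in a real part)."""

    def __init__(self, num_lines, line_words):
        self.num_lines = num_lines
        self.line_words = line_words
        self.tags = [None] * num_lines
        self.dirty = [False] * num_lines
        self.hits = self.clean_misses = self.dirty_misses = 0

    def access(self, word_addr, is_write=False):
        line = word_addr // self.line_words
        index = line % self.num_lines
        tag = line // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
        else:
            # Write-allocate: fetch the new line on both read and write
            # misses, writing back the victim first if it is dirty.
            if self.dirty[index]:
                self.dirty_misses += 1
            else:
                self.clean_misses += 1
            self.tags[index] = tag
            self.dirty[index] = False
        if is_write:
            self.dirty[index] = True
```

A single line cached DRAM corresponds to BankCache(1, 512) and the multi-line part to BankCache(32, 16), the two configurations compared in Section 5.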
We consider both uniprocessor and multiprogrammed multiprocessor performance. For uniprocessor performance, a single processor applies its stream of references to the memory system. For multiprogrammed performance, multiple processors apply
independent streams of addresses to the memory banks. In this case, we pick an address from each of the processors in turn, in round-robin fashion. We tried simulations with randomly selected processors and with methods using a small number of consecutive addresses from each processor before moving on. These showed slightly improved performance, but no significant differences from the single-reference-per-processor, round-robin method. Therefore, we decided to use the single-reference-per-processor, round-robin method throughout our simulations.
Hence, in multiprocessor systems, we simulate a throughput mode of operation, i.e. multiprogramming, by interleaving streams from independent jobs into the memory system. While we do not simulate parallel processing, its performance would likely be better, because parallel processing address streams would be likely to exhibit higher locality than independent address streams.
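A round-robin merge of the independent streams can be sketched as follows (our illustration; it reuses the hypothetical BankCache model above, with bank_of standing for whichever interleave mapping is in effect):

```python
def simulate_throughput(streams, bank_of, caches):
    # streams: one iterator of (word_addr, is_write) pairs per processor;
    # one reference is taken from each live stream in turn.
    streams = [iter(s) for s in streams]
    live = list(range(len(streams)))
    while live:
        for pid in list(live):
            try:
                addr, is_write = next(streams[pid])
            except StopIteration:
                live.remove(pid)
                continue
            caches[bank_of(addr)].access(addr, is_write)
```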
For each of the simulations, we vary the number of memory banks. In an actual well-balanced system, the number of banks would be a loose function of the bank reservation time, the processor clock period, and the rate at which the processor can make memory requests. For a simple example, consider the case where the average bank reservation time is eight processor clock periods. Then a system with eight banks would be matched to a processor that makes a memory request every cycle. Sixteen banks would be needed for two such processors, etc. On the other hand, if the bank reservation time is 16 clock periods, then 16 memory banks would be needed for a single processor system. Machines with multiple pipelines can make memory requests at a higher rate; for example, a 4-pipe processor can request four words per cycle with a single vector load instruction. Therefore, a single 4-pipe processor can be considered to make memory requests at the same rate as four single-pipe processors. By varying the number of banks in our simulations, we take into account different ratios of total processor request rates and bank cycle times.
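This rule of thumb reduces to a one-line calculation (our sketch of the matching argument above, not a formula from the paper):

```python
def matched_banks(reservation_cycles, requests_per_cycle):
    # Each bank is busy for reservation_cycles per request, so sustaining
    # requests_per_cycle requests needs at least this many banks.
    return reservation_cycles * requests_per_cycle

matched_banks(8, 1)   # 8 banks: one request per cycle, 8-cycle banks
matched_banks(8, 2)   # 16 banks: two such processors
matched_banks(16, 1)  # 16 banks: a single processor, 16-cycle banks
matched_banks(8, 4)   # 32 banks: one 4-pipe processor, 8-cycle banks
```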
Fig. 3. Examples of memory addresses for (a) cache line interleaving and (b) 4-way block interleaving. There are b banks, m words per bank, and 16-word cache lines.
Fig. 4. System model.
4.2. Performance Measures
To measure performance, we primarily use cache miss rates and memory bandwidths. The cache miss rate, the percentage or fraction of references that miss in the DRAM-resident data cache, is a traditional measure of cache performance. We consider that a hit occurs when a reference is contained in the on-chip cache; otherwise there is a miss. Cache miss rates are sometimes considered a less meaningful measure of performance in situations where cache miss latency can be overlapped with instruction execution. However, since we are primarily concerned with memory bandwidth, which is determined by bank reservation times instead of access latency, cache miss rates are appropriate.
For simulations of uniprocessors, we average the miss rates of the various streams. Using the arithmetic average is an accurate way to determine aggregate miss rate performance. It is as if we normalize by considering the same number of references from each stream, then take the total number of misses divided by the total number of references.
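Spelled out (our notation): if each of the k traces is normalized to N references and trace i misses M_i of them, then

\frac{1}{k} \sum_{i=1}^{k} \frac{M_i}{N} = \frac{\sum_{i=1}^{k} M_i}{k N},

i.e. the arithmetic mean of the per-trace miss rates equals total misses divided by total references.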
Although miss rate is related to memory bandwidth, we combine miss rate with hit/miss timings and measure bandwidth more directly. We first derive a memory bandwidth measure that indicates an upper bound on the bandwidth a particular memory system can provide. We do this by using the measured number of cache hits/misses and the corresponding cache/DRAM timing parameters to compute the time it would take to serve all the memory requests. Dividing the total number of memory requests by this time gives us a bandwidth number.

We represent the number of hits in memory bank i as H_i, the number of clean misses in bank i as MC_i, and the number of dirty misses in bank i as MD_i. H, MC, and MD denote the total number of hits, clean misses, and dirty misses summed over all the banks. We represent the hit time as t_h cycles, the clean miss time as t_m cycles, and the dirty miss time as t_d cycles. Recall that the number of memory banks is b; then the total time required to service all the requests is

\sum_i (t_h H_i + t_m MC_i + t_d MD_i)

If we assume that the memory banks receive equal numbers of accesses and equal numbers of misses, then this sum divided by the number of banks gives the minimum time required to service all the requests. Dividing the total number of requests by this minimum time gives us an upper-bound estimate of memory bandwidth. That is:

potential bandwidth = \frac{b (H + MC + MD)}{\sum_i (t_h H_i + t_m MC_i + t_d MD_i)}

We refer to the above bandwidth as "potential" because it assumes all memory banks are always active. If a bank can be idle, then the actual bandwidth is less than the best-case bandwidth calculated with the formula.

As stated above, potential bandwidth also assumes the memory banks are accessed an equal number of times, and that the numbers of hits and misses are equally distributed. That is, there are no "hot" banks that get many more requests than the others (or which have a higher fraction of misses). While we will find that this is generally true for low-order interleaved memory systems, it is not true for the other interleaving schemes, where some banks may be much busier than others. To investigate how much impact the hot bank problem has on bandwidth, we define the effective bandwidth as follows. First, we keep track of which requests, hits, and misses are handled by each memory bank. Then we compute the time required by each bank and use the longest bank time as the overall time required; the total of all requests is divided by the longest bank time:

effective bandwidth = \frac{H + MC + MD}{\max_i (t_h H_i + t_m MC_i + t_d MD_i)}

In this case, only the most heavily loaded bank is never idle, and other banks may have some idle time. If the banks are uniformly accessed, then the effective bandwidth should equal the potential bandwidth. On the other hand, the difference between potential and effective bandwidth provides a measure of the performance impact of hot banks.
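In terms of the per-bank counters of the hypothetical BankCache sketch above, both measures are direct to compute (our code; the default timings are the ones used for Figs. 6 and 7 in Section 5):

```python
def bandwidths(caches, t_h=1, t_m=14, t_d=28):
    # Per-bank busy time: hits, clean misses, and dirty misses weighted
    # by the hit, clean miss, and dirty miss times in cycles.
    times = [t_h * c.hits + t_m * c.clean_misses + t_d * c.dirty_misses
             for c in caches]
    requests = sum(c.hits + c.clean_misses + c.dirty_misses for c in caches)
    potential = len(caches) * requests / sum(times)  # all banks busy
    effective = requests / max(times)                # hottest bank busy
    return potential, effective
```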
4.3. Benchmarks
To measure performance we use the set of 10 benchmarks shown in Table 1. Among the 10 programs, ARC2D, ARC3D, MDG, MG3D, SPEC77 and TRFD are selected from the optimized Perfect suite [Cybe90]. APPBT, APPLU, MG and FFT are chosen from the NAS parallel benchmark set [Bai191]. Table 1 characterizes the problems according to the number of different words referenced, the total data set size in millions of words, and the trace length (number of memory references). Each trace is generated by simulating a single processor CRAY Y-MP for 40 million instructions (both scalar and vector instructions). Because some applications, like APPBT, MG, FFT, and ARC2D, have larger vector lengths, their memory traces are longer than others. The trace lengths vary from about 32 million memory references to 652 million references. The first six benchmarks are relatively small (by supercomputer standards) and have problem sizes of about 1 to 7 million words (8 to 56 Mbytes). The last four benchmarks are much larger, and range from 32 to 56 Mwords (256 to 448 Mbytes).
Table 1. Characteristics of benchmark programs.

Program   Different Words   Problem Size   Trace Length
          Referenced        (MW)           (Millions)
ARC2D     4314768           4.9            128.64
ARC3D     798048            1.3            147.71
MDG       1118928           1.4            32.33
MG3D      2085024           7.4            241.41
SPEC77    522112            1.3            51.79
TRFD      218384            3.6            —
APPBT     37052912          42.2           652.55
MG        51850160          56.7           319.06
APPLU     33893280          42.9           419.46
FFT       29327760          32.2           232.40
5. Simulation Results

5.1. Single Processor Performance
We begin with uniprocessor simulations. We simulate both a single line cache and a multiple line cache. To allow us to more easily compare results, we assume exactly the same total cache size in both cases. In particular, we assume a 512 word cache per bank, organized both as a single 512 word line and as 32 lines of 16 words each. In the multi-line case we begin with a direct-mapped cache. We consider four address interleaving methods:
(1) interleave on low order address bits (traditional interleave),
(2) interleave on cache lines,
(3) 2-way block interleaving,
(4) 4-way block interleaving.
Fig. 5 illustrates performance for single jobs. Each
of the jobs was run individually,
and the miss rates for the
ten jobs were then averaged. Overall, we see that miss
rates of 10 percent and lower are possible with large single line caches and block or cache line interleaving.
Interleaving at cache lines intuitively exploits the highest spatial locality of the methods we consider. In Fig. 5, however, for single line caches, 4-way and 2-way block interleaving have lower average miss rates than interleaving at cache lines. This anomaly is due to program ARC2D, which has a lot of memory accesses with stride 588 (a multiple of 4). With such strides, spatial locality is reduced for cache line interleaving but is enhanced for block interleaving.
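A small calculation (our illustration, using the 512-word single line configuration) shows the effect. With cache line interleaving, successive references 588 words apart always fall in different 512-word lines; under 4-way block interleaving a bank's line covers 4 × 512 = 2048 consecutive addresses (every fourth word), and since 588 is a multiple of 4 the stream stays within one line for several references:

```python
refs = [588 * k for k in range(4)]              # 0, 588, 1176, 1764
lines_cl  = {a // 512 for a in refs}            # cache line interleave:
                                                # 4 distinct lines -> 4 misses
lines_blk = {(a // 2048, a % 4) for a in refs}  # 4-way block: one line
                                                # -> 1 miss, then 3 hits
```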
When a single processor is running, the single line cache performs better than the multiple line cache for all cases except where low order bit interleaving is used. Furthermore, with low order interleaving, performance of the single line cache gets worse as more banks are used. With this type of interleaving, spatial locality is reduced for systems with more memory banks.
For the multiple line cache, more banks provide better opportunities to exploit temporal locality, hence there is generally a lower miss rate with more banks. However, there is still a slight increase in miss rate when going from 16 banks to 64 banks with low order interleaving. In this case, the increase in temporal locality is less than the loss of spatial locality.

For 256 banks, spatial locality is almost non-existent with low order bit interleaving; only temporal locality can be exploited. Although the multiple line cache works better than the single line cache, its miss rate is still much higher than with the non-traditional interleaving methods. This demonstrates that for large scientific jobs such as our benchmark programs, a cached DRAM memory system cannot rely merely on temporal locality to be effective. Non-traditional interleave schemes must be considered.
Fig. 6 and Fig. 7 show the potential and effective bandwidths for single jobs. For generating these graphs, we assume a cache hit takes one cycle, a clean miss takes 14 cycles, and a dirty miss takes 28 cycles. These numbers are consistent with a 10 ns clock period and the Mitsubishi TP-10 CDRAM chip. Fig. 6 shows that single line caches with block interleaving have higher potential bandwidth than multiple line caches. However, due to the hot bank problem, the effective bandwidth of single line caches tends to become lower as the number of banks is increased (Fig. 7). For larger memory systems, the effective bandwidth of a single line cache is reduced by more than a factor of 4 from the potential bandwidth.
This difference between effective and potential bandwidths is an indicator of non-uniform memory bank references. Fig. 8 illustrates the distribution of requests to banks for one of the benchmark programs. This particular benchmark was chosen because it has a particularly obvious hot bank problem; not all the benchmarks are this bad. We see that for cache line interleaving, there is a single hot bank that gets about eight times as many references as any of the others. We also see that for 2-way block interleaving, the hot spot becomes spread over two banks, and for 4-way block interleaving it becomes spread over four banks. This is as one might expect, and it demonstrates the value of using block interleaving for reducing (but, unfortunately, not eliminating) hot bank problems. However, if two consecutive banks are hot when cache line interleaving is used, 2-way block interleaving may not reduce the problem. For the same reason, 4-way block interleaving may not be an improvement when there are four consecutive banks that are hot. This explains why all three block interleaving schemes (including cache line interleaving) have similar effective bandwidths for multiple line caches, as in Fig. 7.
Fig. 5. Miss rates for single processors; average of 10 traces.
In the ten benchmark programs, we have determined that hot banks occur due to one of the following three cases: (1) there are small active working arrays; (2) vector registers are spilled and reloaded from the run-time stack; (3) data blocking algorithms cause intensive data reuse. It seems that some form of local memory might be able to minimize such redundant memory references and reduce the hot bank problem associated with cache line interleaving. In other words, systems with processor data caches that are sufficiently large to exploit high temporal locality might not have hot bank problems as severe as the cacheless systems we are considering.
To make sure that our bandwidth results are interpreted correctly, consider ways the bandwidth graphs can be used for designing systems. Recall that our bandwidth graphs are based on the assumption of no idle memory cycles; that is, memory is a saturated resource (either all of memory for potential bandwidth, or at least one memory bank for effective bandwidth). This is a desirable design point when the memory system is the most expensive resource, as is typically the case in vector supercomputers. The saturated memory assumption means that for specific processor configurations, only certain regions of the bandwidth graphs may be valid. In particular, they are only valid in regions where the processor demand for memory exceeds or equals memory's ability to deliver data. This occurs when the number of processors times the number of memory reference streams per processor is greater than the memory bandwidth.
Fig. 6. Potential bandwidth for single processors; average of 10 traces.
Fig. 7. Effective bandwidth for single processors; average of 10 traces.

Fig. 8. Hot bank distribution in multiple line caches for program TRFDLG; number of references from a sample of 100K.
For example, in Figs. 6 and 7 we are assuming single processors. A processor's memory demand is a function of the number of memory pipelines and its clock rate. Figs. 6 and 7 assume a system clock period that is the same as the DRAM cache hit time. In this case, a vector uniprocessor operating with eight memory pipelines has a maximum demand of eight references per clock period. A processor with a clock four times as fast and two memory pipelines would have a similar demand. In either case, the memory can be saturated only if it delivers fewer than eight words per cycle. Applying this to Fig. 7, we see that the region of the graph where the bandwidth is eight words per cycle or less is applicable to our example system. Using the applicable region of the graph, it appears that 16 memory banks with a single cache line DRAM and block interleaving is a reasonably good match to the processor demand. On the other hand, if no DRAM caching were used, and the memory cycle time were always 14 clock periods, then about 8 × 14 = 112 (i.e. 128) memory banks would be required to match the potential processor demand. The cost savings of using the cached DRAM organization is very evident.
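In terms of the matched_banks sketch from Section 4 (our illustration), the comparison in this example is:

```python
matched_banks(14, 8)  # 112 -> rounded up to 128 banks with no DRAM cache
matched_banks(1, 8)   # 8 banks if every reference hit a one-cycle cache
```

The cached parts land between these extremes; Fig. 7 suggests 16 banks suffice with a single line cache and block interleaving.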
5.2. Multiple Processor Performance

Fig. 9. Miss rate for multi-programmed runs; 8 processes time shared.
For multi-program performance, we chose eight of the benchmark programs: APPBT, APPLU, MG, FFT, ARC3D, SPEC77, TRFDLG, and MG3DLG, and ran them on eight different processors with memory requests being made in round-robin fashion. Note that the first four benchmarks are large and the last four are (relatively) small. The eight jobs were placed next to each other in memory, aligned to 2KW boundaries. The miss rate results are in Fig. 9. As might be expected, performance is worse than for a single job. This is because locality tends to be disrupted by the multiple independent memory streams. However, in systems with larger numbers of memory banks, performance is significantly better, in some cases reaching the same level as a single job.
For single job (uniprocessor) runs, we observed that the single line cache with block interleaving is quite effective in exploiting spatial locality. However, in multi-program runs, the competition from several processes destroys the effectiveness of holding a large cache line. As shown in Fig. 9, the multiple line cache is more robust and more effective than the single line cache unless the number of banks becomes relatively large. We also observe that the performance differences in miss rate among the non-traditional interleavings for single line caches are significant, whereas for the uniprocessor the differences were small. This is because several processes are competing for the banks and for the cache line in each bank. A higher degree of block interleaving spreads an active line across more banks, increasing the likelihood of conflict and thrashing.
Fig. 10 gives results for the small jobs alone (where one might expect more temporal locality), and Fig. 11 shows performance for the large jobs alone. The miss rates for small jobs are much better. For large jobs, low order interleaving performs poorly: single line caches have nearly a 100% miss rate, and multiple line caches have a 45% miss rate even when the number of banks is as high as 256. For large jobs, where caches may not exploit a high degree of temporal locality, exploiting spatial locality becomes critical. For instance, block interleaving can bring the miss rate down to near 10% with 256 banks.
Figs. 12 and 13 show potential bandwidth and effective bandwidth, respectively, for multi-programmed jobs. For multiple line caches, cache line interleaving works well for both potential and effective bandwidth. For single line caches, cache line interleaving has the highest potential bandwidth but the lowest performance in effective bandwidth, due to the hot bank problem. Four-way block interleaving has performance close to the better performing one in terms of both potential and effective bandwidth. Fig. 14 illustrates the distribution of requests to banks for multi-program runs with single large line caches. The curve for cache line interleaving has several spikes, and block interleaving smooths out some of the spikes.

Fig. 14. Hot bank distribution in single line caches; 8 processes time shared.
Fig. 10. Miss rate for multi-programmed runs; 4 small processes time shared.

Fig. 11. Miss rate for multi-programmed runs; 4 large processes time shared.

Fig. 12. Potential bandwidth for multi-programmed runs; 8 processes time shared.

Fig. 13. Effective bandwidth with multi-programming; 8 processes time shared.
6. Conclusions
In this paper we considered the usefulness of new cache-oriented DRAMs for delivering cost-effective bandwidth in vector supercomputers. It is apparent from our simulations that traditional low-order-bit memory interleaving will not take full advantage of the new DRAM parts; spatial locality is reduced too much. We have shown that using other, less common interleave schemes, which place successive words in the same DRAM chips, can increase locality and improve performance. Unfortunately, they also tend to produce non-uniform bank usage; the presence of hot banks can reduce overall performance. Consequently, we feel that memory system designs will involve optimization to increase memory bank locality up to a certain point, but no further.
In uniprocessor environments, cached DRAMs with block interleaving can provide more cost-effective bandwidth. In multiprogramming multiprocessor environments, we have shown that single line cached DRAMs are unlikely to be effective, consistent with the opinions expressed in [Come92]. On the other hand, we have also shown that multi-line caches with block interleaving schemes can be made to work well.
Single line cached DRAMs are more sensitive to the memory interleaving scheme that is used; we recommend that block interleaving be used. The performance of multi-line caches is less sensitive to interleaving methods, and block interleaving is called for when the cache hit time is slower than the processor cycle time.

A disadvantage of non-conventional interleaving schemes that we have not yet discussed is that they lead to systems where the addressing logic external to the chip becomes tuned to internal chip characteristics, i.e. the cache characteristics. If DRAM chips in an optimized system are later replaced with chips having different characteristics, the interleave scheme may no longer be optimal.
Finally, we feel that our results will extend qualitatively to systems other than those we have specifically studied. However, as we observed earlier, the types of program constructs that lead to hot bank problems may have a smaller effect when processors contain data caches that are able to reduce redundant memory traffic significantly. Non-standard interleaving schemes will likely become a key component of memory system design, and the tradeoff between increased spatial locality and hot banks will be an important design issue.
7. References
[Bai191] David H. Bailey et al., "The NAS Parallel Benchmarks: Summary and Preliminary Results," IEEE Supercomputing '91, pp. 158-165, Nov. 1991.
[Bond92] Bondurant, D., "Enhanced Dynamic RAM," IEEE Spectrum, p. 49, in [Jone92], Oct. 1992.
[Come92] Comerford, R. and G. Watson, "Memory catches up," IEEE Spectrum, pp. 34-35, Oct. 1992.
[Cray91] Cray Research Inc., "CRAY Y-MP C90 Functional Description Manual," HR-04028, March 1992.
[Cybe90] Cybenko, G., et al., "Supercomputer Performance Evaluation and the Perfect Benchmarks," CSRD Report No. 965, University of Illinois, March 1990.
[Farm92] Farmwald, M. and D. Mooring, "A Fast Path to One Memory," IEEE Spectrum, pp. 50-51, Oct. 1992.
[Hart92] Hart, C., "Dynamic RAM as Secondary Cache," IEEE Spectrum, p. 48, in [Jone92], Oct. 1992.
[Jone92] Jones, Fred, "A New Era of Fast Dynamic RAMs," IEEE Spectrum, pp. 43-49, Oct. 1992.