Performance of Cached DRAM Organizations in Vector Supercomputers

W.-C. Hsu(1)
Hewlett-Packard Company
19447 Pruneridge Ave.
Cupertino, CA 95014

J. E. Smith
Cray Research, Inc.
900 Lowater Rd.
Chippewa Falls, WI 54729

(1) This work was done while W.-C. Hsu was with Cray Research, Inc.

0884-7495/93 $3.00 © 1993 IEEE

Abstract

DRAMs containing cache memory are studied in the context of vector supercomputers. In particular, we consider systems where processors have no internal data caches and memory reference streams are generated by vector instructions. For this application, we expect that cached DRAMs can provide high bandwidth at relatively low cost.

We study both DRAMs with a single, long cache line and DRAMs with smaller, multiple cache lines. Memory interleaving schemes that increase data locality are proposed and studied. The interleaving schemes are also shown to lead to non-uniform bank accesses, i.e. hot banks. This suggests there is an important optimization problem involving methods that increase locality to improve performance, but not so much that hot banks diminish performance. We show that for uniprocessor systems, both types of cached DRAMs work well with the proposed interleave methods. For multiprogrammed multiprocessors, the multiple cache line DRAMs work better.

1. Introduction

After years of using simple DRAM organizations that provide data at rates keyed to the performance of internal transistor arrays, chip makers are now in the process of introducing innovative organizations for commodity DRAM parts [Jone92]. They are aimed at reducing the disparity between processor and memory performance in desktop systems, while keeping the costs of such systems low. Consequently, a common characteristic of these new DRAM parts is that they use some form of internal data caching.

Vector supercomputers have significantly different characteristics from the desktop systems that are driving the development of new DRAM organizations. Besides the obvious difference in raw processing speeds, vector supercomputers differ from desktop systems in the following ways.

(1) Supercomputers typically do not use data caches for vectors, and sometimes (as in the case of Cray Research systems) they do not cache scalar data, either. There are at least three reasons for this. First, in vector machines memory latencies are amortized over pipelined streams of data references, i.e. vectors. Second, supercomputer-class problems often do not exhibit the locality required to make a data cache effective (although reprogramming and cache blocking compiler algorithms may help in some cases). Third, maintaining cache coherence is perceived to be a difficult problem in vector multiprocessors.

(2) Vector supercomputer systems usually contain multiple processors which either operate in parallel on the same job or independently on different jobs. The new DRAM organizations typically depend on locality that may be significantly reduced in multiprocessor situations [Come92].

(3) The per-processor bandwidth requirements in vector supercomputers are much greater than in PCs and workstations. For example, a vector machine needs sustained bandwidth of several words of data per clock period per processor (6 in the Cray Y-MP C90 [Cray91], for example). This means that highly interleaved memory systems with many banks are necessary.

Despite the significant differences between desktop systems and vector supercomputers, we feel that the new DRAM parts may still yield cost-performance improvements for vector supercomputers, provided the memory system is properly designed. In the supercomputer context, we stress the cost aspect, because current systems often use SRAM for main memory. SRAM provides performance at least equivalent to the best of the new DRAM technologies, but costs much more. As a point of reference, the CRAY Y-MP C90 uses 1024 banks of 15ns SRAM memory. A total of 20,000 SRAM chips are used in the largest size C90 main memory. Consequently, SRAM costs make up a majority of system-wide part costs. Although using traditional DRAM memory can dramatically reduce the cost for the same size memory, the memory bandwidth will also be much lower.
For instance, if conventional 140ns DRAM memory were used in the C90 instead of 15ns SRAM, many times (8 to 16) more banks would be required to provide comparable memory bandwidth. Any savings in per-chip costs would likely be lost due to the larger number of chips and higher logic and interconnect costs. However, with the new cached DRAM parts it may now be possible to build an affordable memory system with both large size and high memory bandwidth.

In this paper, we look at ways the new cache-oriented DRAM parts and techniques can be adapted for use in high-end vector supercomputers. These methods use the caching capabilities of DRAM chips and employ unorthodox interleaving techniques to improve locality, especially in multiprocessor situations. Section 2 provides an overview of the new DRAM organizations. Section 3 proposes memory interleaving methods that are directed toward improving performance. Section 4 describes the system model we are studying, as well as the simulation benchmarks and performance measures we use. Section 5 gives results for our trace-driven simulations. Finally, Section 6 contains conclusions.

2. Cached DRAM Organizations

A wide variety of high performance DRAM parts are possible, but we divide the ones of interest to us into two generic classes.

(1) The first class is a simple outgrowth of static column DRAMs where an entire row of the memory array is latched and may be accessed repeatedly by modifying the column address lines only (as long as consecutive accesses are to addresses within the latched row). In effect, these DRAMs have an internal cache that consists of one large line, i.e. the latched row.

(2) The second class is made up of parts that contain multiple cache lines of conventional length. These on-chip caches can be accessed using direct-mapped or set associative methods, with the tags being held in the off-chip memory controller.

Fig. 1a illustrates the class of single line cached DRAMs, and Fig. 1b illustrates the class of multiple line cached DRAMs. Parts from RAMtron [Bond92] and RAMbus [Farm92] (RAMbus essentially puts two memory banks on a chip) belong to the first class, and DRAMs from Mitsubishi [Hart92] belong to the second class. A common property of both types of cached DRAMs is that the cache-fill bandwidth is very high because the bus connecting the cache and the DRAM memory array is very wide. This feature encourages the use of large line sizes to exploit spatial locality.

[Fig. 1. Basic DRAM organizations: a) a single line organization, b) a multi-line organization.]

For our study, we look at generic versions of these DRAM part types. We want to avoid becoming bogged down in the details (and quirks) of specific implementations and timing. There is currently no standard, de facto or otherwise, so such details could change anyway. For both types of RAMs, we assume the internal cache lines are write-back with write-allocate. To make comparisons easier, we simulate parts where the total cache sizes are the same and are of similar size to commercially available parts.
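To make the two classes concrete, the following C sketch (our own illustration; the type and field names are invented, not taken from the paper or any actual part) models the per-bank hit/miss behavior: a single line part hits when the referenced row matches the currently latched row, while a multi-line part behaves like a small direct-mapped cache whose tags would physically reside in the off-chip memory controller.

```c
#include <stdint.h>
#include <stdbool.h>

#define ROW_WORDS  512   /* words per DRAM row (= the one large line)  */
#define NUM_LINES   32   /* lines in the multi-line organization       */
#define LINE_WORDS  16   /* words per line (32 x 16 = 512 words total) */

/* Class 1: one large line -- the most recently latched row.
 * word_addr is the word address within this bank, i.e. after the
 * interleave scheme of Section 3 has removed the bank-selection bits. */
typedef struct {
    uint32_t last_row;
    bool     valid;
} single_line_bank;

bool single_line_access(single_line_bank *b, uint32_t word_addr)
{
    uint32_t row = word_addr / ROW_WORDS;
    bool hit = b->valid && (b->last_row == row);
    b->last_row = row;   /* a miss latches the new row over the wide internal bus */
    b->valid = true;
    return hit;
}

/* Class 2: multiple conventional lines, direct-mapped, tags held off-chip. */
typedef struct {
    uint32_t tag[NUM_LINES];
    bool     valid[NUM_LINES];
} multi_line_bank;

bool multi_line_access(multi_line_bank *b, uint32_t word_addr)
{
    uint32_t line  = word_addr / LINE_WORDS;
    uint32_t index = line % NUM_LINES;
    uint32_t tag   = line / NUM_LINES;
    bool hit = b->valid[index] && (b->tag[index] == tag);
    b->tag[index] = tag;            /* allocate on miss (write-allocate) */
    b->valid[index] = true;
    return hit;
}
```

With 32 lines of 16 words, the multi-line organization above matches the 512 word per-bank cache size simulated in Section 5.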
3. Memory Interleaving

To simplify our discussion, we assume word addressing (adjustments for byte addressing are straightforward). If a memory system has b banks, each containing m words, then there are logb plus logm address bits. Traditionally, memories are interleaved on low-order address bits. That is, the low order logb bits are used to select the bank and the high order logm bits are used to select a word within the bank (see Fig. 2a). This puts all the addresses modulo b in the same bank. Such interleaving is very common in high performance systems, especially when the bank cycle time is slower than the system clock cycle. This interleave scheme has the advantage that each bank will tend to be uniformly referenced (except for vector strides that are a multiple of a power of two). However, since words at consecutive addresses are placed in different banks, this scheme prevents DRAM caches from effectively exploiting spatial locality. That is, a single cache line held in a DRAM chip contains words that are at logical addresses separated by a large power of two (the number of banks).

An alternative, which increases spatial locality, is to interleave on high order address bits, as shown in Fig. 2b. In this case the high order logb address bits are used to select the bank. This does increase locality, since consecutive addresses are in the same bank. However, it is also likely to produce highly non-uniform bank references and thus degrade performance for single jobs. For example, consider the vector loop A(I)=B(I)+C(I) with stride one references. While executing this loop, we typically exercise at most three memory banks because there are three reference streams, each with its own spatial locus of reference. Another disadvantage that offsets improvements in spatial locality is that when the cache hit time is slower than the system clock, such interleaving cannot deliver one word of data per clock period for sequential accesses. Some supercomputers may have clock rates much faster than the hit time of the DRAM cache.

We attempt to blend the two interleaving schemes by distributing the interleave bits in positions other than the highest and lowest order positions. To do this we consider cache line boundaries and/or memory bandwidth capabilities. One such way of interleaving is to use address bits at the cache line level. If a cache line has p words, then the low order logp bits and the high order logm - logp bits are used to select the word within a bank. The bank address bits are the logb bits above the low order logp bits (Fig. 2c). That is, all the words in a cache line are from consecutive memory addresses; the interleaving then moves to the next bank (Fig. 3a). Such a scheme still has significant hot bank problems, as we shall see, especially for DRAMs with a single large line. Also, there is reduced bandwidth for sequential accesses when the cache hit time is slower than the processor cycle time, as explained above.

To overcome these problems we suggest "block" interleaving, in which the interleave bits are split. For n-way block interleaving, the low order logn bits are used to address a bank within a bank group, the next logp bits are used to address the word within the bank, the next logb - logn bits are used to select the bank group, and the remaining high order bits are also used to address the word within the bank (Fig. 2d). The addresses placed in each bank are illustrated in Fig. 3b. By this definition, cache line interleaving is also 1-way block interleaving.

[Fig. 2. Different memory interleaving schemes, assuming there are b banks, m words per bank, and cache lines with p words. Bits used for the bank address are enclosed in bold lines. a) Low order interleaving, b) high order interleaving, c) cache line interleaving, d) n-way block interleaving.]

[Fig. 3. Examples of memory addresses for cache line interleaving and block interleaving. There are b banks, m words per bank, and 16 word cache lines. a) Cache line interleaving, b) 4-way block interleaving.]
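The bit layouts of Fig. 2 translate directly into code. The following C helpers are a sketch under the assumption that b, m, p, and n are powers of two; the function names are ours, not the paper's.

```c
#include <stdint.h>

/* b = banks, m = words per bank, p = words per cache line, n = block factor. */

uint32_t bank_low_order(uint32_t addr, uint32_t b)
{
    return addr % b;                   /* Fig. 2a: low order logb bits */
}

uint32_t bank_high_order(uint32_t addr, uint32_t b, uint32_t m)
{
    return (addr / m) % b;             /* Fig. 2b: high order logb bits */
}

uint32_t bank_cache_line(uint32_t addr, uint32_t b, uint32_t p)
{
    return (addr / p) % b;             /* Fig. 2c: logb bits above the low logp bits */
}

/* Fig. 2d: n-way block interleaving. The low logn bits pick a bank within a
 * bank group, the next logp bits address the word within the bank's line,
 * and the next logb - logn bits pick the bank group. */
uint32_t bank_block(uint32_t addr, uint32_t b, uint32_t p, uint32_t n)
{
    uint32_t within_group = addr % n;
    uint32_t group        = (addr / (n * p)) % (b / n);
    return group * n + within_group;
}
```

Note that bank_block(addr, b, p, 1) reduces to bank_cache_line(addr, b, p), matching the observation that cache line interleaving is 1-way block interleaving.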
4. Simulation Framework

4.1. System Model

Fig. 4 illustrates the systems we simulate. To simplify the simulation model (this is important because of the lengths of the reference streams), the system model assumes a conflict-free interconnection network with uniform, fixed delays. That is, we apply streams of addresses from the processor(s) directly to the memory banks, and then determine if there is a hit or miss within the banks' caches. For simplicity, we assume one cache per bank.

[Fig. 4. System model.]

We consider both uniprocessor performance and multi-programmed multiprocessor performance. For uniprocessor performance, a single processor applies its stream of references to the memory system. For multiprogrammed performance, multiple processors apply independent streams of addresses to the memory banks. In this case, we pick an address from each of the processors in turn, in round-robin fashion. We tried simulations with randomly selected processors and with methods using a small number of consecutive addresses from each processor before moving on. These showed slightly improved performance, but no significant differences from the single reference per processor, round-robin method. Therefore, we decided to use the single reference per processor, round-robin method throughout our simulations.

Hence, in multiprocessor systems, we simulate a throughput mode of operation, i.e. multiprogramming by interleaving streams from independent jobs into the memory system. While we do not simulate parallel processing, its performance would likely be better because parallel processing address streams would be likely to exhibit higher locality than independent address streams.

For each of the simulations, we vary the number of memory banks. In an actual well-balanced system, the number of banks would be a loose function of the bank reservation time, the processor clock period, and the rate at which the processor can make memory requests. For a simple example, consider the case where the average bank reservation time is eight processor clock periods. Then a system with eight banks would be matched to a processor that makes a memory request every cycle. Sixteen banks would be needed for two such processors, etc. On the other hand, if the bank reservation time is 16 clock periods, then 16 memory banks would be needed for a single processor system. Machines with multiple pipelines can make memory requests at a higher rate; for example, a 4-pipe processor can request four words per cycle with a single vector load instruction. Therefore, a single 4-pipe processor can be considered to make memory requests at the same rate as four single-pipe processors. By varying the number of banks in our simulations, we take into account different ratios of total processor request rates and bank cycle times.
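The simulation loop implied by this model is simple. The sketch below is our own reconstruction, not code from the paper; next_address, bank_of, and cache_access are assumed helpers standing in for the trace reader, an interleave function from Section 3, and a per-bank cache model. It merges the processor streams one reference per processor in round-robin order and tallies per-bank hits, clean misses, and dirty misses for the measures of Section 4.2.

```c
#include <stdint.h>

#define NPROC 8
#define NBANK 64

/* Per-bank tallies used by the bandwidth formulas of Section 4.2. */
typedef struct { uint64_t hits, clean_misses, dirty_misses; } bank_stats;

/* Assumed to be supplied elsewhere (hypothetical interfaces):            */
extern int      next_address(int p, uint32_t *addr);   /* 0 at end of trace   */
extern uint32_t bank_of(uint32_t addr);                /* interleave function */
extern int      cache_access(int bank, uint32_t addr); /* 0=hit,1=clean,2=dirty */

void simulate(bank_stats stats[NBANK])
{
    int active = NPROC;
    while (active > 0) {
        active = 0;
        /* One reference per processor, round-robin, as in Section 4.1. */
        for (int p = 0; p < NPROC; p++) {
            uint32_t addr;
            if (!next_address(p, &addr))
                continue;
            active++;
            int bank = (int)bank_of(addr);   /* conflict-free network assumed */
            switch (cache_access(bank, addr)) {
            case 0:  stats[bank].hits++;         break;
            case 1:  stats[bank].clean_misses++; break;
            default: stats[bank].dirty_misses++; break;
            }
        }
    }
}
```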
4.2. Performance Measures

To measure performance, we primarily use cache miss rates and memory bandwidths. The cache miss rate, the percentage or fraction of references that miss in the DRAM-resident data cache, is a traditional measure of cache performance. We consider that a hit occurs when a reference is contained in the on-chip cache; otherwise there is a miss. Cache miss rates are sometimes considered a less meaningful measure of performance in situations where cache miss latency can be overlapped with instruction execution. However, since we are primarily concerned with memory bandwidth, which is determined by bank reservation times rather than access latency, cache miss rates are appropriate.

For simulations of uniprocessors, we average the miss rates of the various streams. Using the arithmetic average is an accurate way to determine aggregate miss rate performance. It is as if we normalize by considering the same number of references from each stream, then take the total number of misses divided by the total number of references.

Although miss rate is related to memory bandwidth, we combine miss rate with hit/miss timings and measure bandwidth more directly. We first derive a memory bandwidth measure that indicates an upper bound on the bandwidth a particular memory system can provide. We do this by using the measured numbers of cache hits/misses and the corresponding cache/DRAM timing parameters to compute the time it would take to serve all the memory requests. Dividing the total number of memory requests by this time gives us a bandwidth number.

We represent the number of hits in memory bank i as Hi, the number of clean misses in bank i as MCi, and the number of dirty misses in bank i as MDi. H, MC, and MD denote the total numbers of hits, clean misses, and dirty misses summed over all the banks. We represent the hit time as th cycles, the clean miss time as tm cycles, and the dirty miss time as td cycles. Recall that the number of memory banks is b; then the total time required to service all the requests is:

    SUM_i (th*Hi + tm*MCi + td*MDi)

If we assume that the memory banks receive equal numbers of accesses and equal numbers of misses, then this sum divided by the number of banks gives the minimum time required to service all the requests. Dividing the total number of requests by this minimum time gives us an upper-bound estimate of memory bandwidth. That is:

    potential bandwidth = b*(H + MC + MD) / SUM_i (th*Hi + tm*MCi + td*MDi)

We refer to the above bandwidth as "potential" because it assumes all memory banks are always active. If a bank can be idle, then the actual bandwidth is less than the best-case bandwidth calculated with the formula.

As stated above, potential bandwidth also assumes the memory banks are accessed an equal number of times, and that the numbers of hits and misses are equally distributed. That is, there are no "hot" bank(s) that get many more requests than the others (or which have a higher fraction of misses). While we will find that this is generally true for low-order interleaved memory systems, it is not true for the other interleaving schemes, where some banks may be much busier than others.

To investigate how much impact the hot bank problem has on bandwidth, we define the effective bandwidth as follows. First, we keep track of which requests, hits, and misses are handled by each memory bank. Then we can compute the time required by each bank and use the longest bank time as the overall time required. The total of all requests is divided by the longest bank time:

    effective bandwidth = (H + MC + MD) / MAX_i (th*Hi + tm*MCi + td*MDi)

In this case, only the most heavily loaded bank is never idle, and other banks may have some idle time. If the banks are uniformly accessed, then the effective bandwidth should equal the potential bandwidth. On the other hand, the difference between potential and effective bandwidth provides a measure of the performance impact of hot banks.
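Both measures are straightforward to compute from the per-bank tallies. A minimal C sketch of the two formulas, using the timing values assumed later in Section 5 (th = 1, tm = 14, td = 28 cycles):

```c
#include <stdint.h>

#define NBANK 64
#define TH  1    /* hit time, cycles        */
#define TM 14    /* clean miss time, cycles */
#define TD 28    /* dirty miss time, cycles */

typedef struct { uint64_t hits, clean_misses, dirty_misses; } bank_stats;

/* Potential bandwidth: every bank is always busy, so divide total requests
 * by the average per-bank service time. Effective bandwidth: only the most
 * heavily loaded (hot) bank is never idle, so divide by the longest
 * per-bank time. Assumes at least one request was recorded. */
void bandwidths(const bank_stats s[NBANK], double *potential, double *effective)
{
    uint64_t requests = 0, total_time = 0, max_time = 0;
    for (int i = 0; i < NBANK; i++) {
        uint64_t t = TH * s[i].hits + TM * s[i].clean_misses + TD * s[i].dirty_misses;
        requests   += s[i].hits + s[i].clean_misses + s[i].dirty_misses;
        total_time += t;
        if (t > max_time) max_time = t;
    }
    *potential = (double)requests * NBANK / (double)total_time;  /* b*(H+MC+MD)/SUM */
    *effective = (double)requests / (double)max_time;            /* (H+MC+MD)/MAX   */
}
```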
4.3. Benchmarks

To measure performance we use the set of 10 benchmarks shown in Table 1. Among the 10 programs, ARC2D, ARC3D, MDG, MG3D, SPEC77 and TRFD are selected from the optimized Perfect suite [Cybe90]. APPBT, APPLU, MG and FFT are chosen from the NAS parallel benchmark set [Bail91]. Table 1 characterizes the problems according to the number of different words referenced, the total data set size in millions of words, and the trace length (number of memory references). Each trace is generated by simulating a single processor CRAY Y-MP for 40 million instructions (both scalar and vector instructions). Because some applications like APPBT, MG, FFT, and ARC2D have larger vector lengths, their memory traces are longer than others. The trace lengths vary from about 32 million memory references to 652 million references. The first six benchmarks are relatively small (by supercomputer standards) and have problem sizes of about 1 to 7 million words (8 to 56 Mbytes). The last four benchmarks are much larger, and range from 32 to 56 Mwords (256 to 448 Mbytes).

Table 1. Characteristics of benchmark programs.

    Programs   Different Words   Problem Size   Trace Length
               Referenced        (MW)           (Millions)
    ARC2D       4314768           4.9             —
    ARC3D        798048           1.3           128.64
    MDG         1118928           1.4            32.33
    MG3D        2085024           7.4           241.41
    SPEC77       522112           1.3           147.71
    TRFD         218384           3.6            51.79
    APPBT      37052912          42.2           652.55
    MG         51850160          56.7           319.06
    FFT        33893280          42.9           419.46
    APPLU      29327760          32.2           232.40

5. Simulation Results

5.1. Single Processor Performance

We begin with uniprocessor simulations. We simulate both a single line cache and a multiple line cache. To allow us to compare results more easily, we assume exactly the same total cache size in both cases. In particular, we assume a 512 word cache per bank, organized both as a single 512 word line and as 32 lines of 16 words each. In the multi-line case we begin with a direct-mapped cache.
We consider four address interleaving methods: (1) interleaving on low order address bits (traditional interleave), (2) interleaving on cache lines, (3) 2-way block interleaving, and (4) 4-way block interleaving.

Fig. 5 illustrates performance for single jobs. Each of the jobs was run individually, and the miss rates for the ten jobs were then averaged. Overall, we see that miss rates of 10 percent and lower are possible with large single line caches and block or cache line interleaving.

[Fig. 5. Miss rates for single processors; average of 10 traces.]

Interleaving at cache lines intuitively exploits the highest spatial locality of the methods we consider. In Fig. 5, however, for single line caches, 4-way and 2-way block interleaving have lower average miss rates than interleaving at cache lines. This anomaly is due to program ARC2D, which has many memory accesses with stride 588 (a multiple of 4). With such strides, spatial locality is reduced for cache line interleaving but is enhanced for block interleaving.
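This stride effect is easy to reproduce in isolation. The self-contained sketch below is our own construction, not the paper's experiment: for a strided address stream it measures the fraction of references that fall in the same bank and cache line as the immediately preceding reference, a rough proxy for exploitable spatial locality. Stride 4 is used here to illustrate the multiple-of-4 effect; the 4-way block scheme keeps roughly 15 of every 16 consecutive references in one line, versus 3 of every 4 for cache line interleaving.

```c
#include <stdio.h>
#include <stdint.h>

#define B 64   /* banks                   */
#define P 16   /* words per cache line    */
#define N 4    /* block-interleave factor */

/* Bank and line-within-bank touched by word address a (Fig. 2c vs. Fig. 2d). */
void cache_line_map(uint32_t a, uint32_t *bank, uint32_t *line)
{
    *bank = (a / P) % B;
    *line = a / (P * B);
}

void block_map(uint32_t a, uint32_t *bank, uint32_t *line)
{
    uint32_t group = (a / (N * P)) % (B / N);
    *bank = group * N + a % N;
    *line = a / (P * B);
}

/* Fraction of references landing in the same (bank, line) as the previous
 * reference of a strided stream. */
double same_line_fraction(void (*map)(uint32_t, uint32_t *, uint32_t *),
                          uint32_t stride, int len)
{
    uint32_t pb = 0, pl = 0;
    int same = 0;
    for (int k = 0; k < len; k++) {
        uint32_t bank, line;
        map((uint32_t)k * stride, &bank, &line);
        if (k > 0 && bank == pb && line == pl)
            same++;
        pb = bank; pl = line;
    }
    return (double)same / (len - 1);
}

int main(void)
{
    printf("cache line interleave:  %.2f\n", same_line_fraction(cache_line_map, 4, 4096));
    printf("4-way block interleave: %.2f\n", same_line_fraction(block_map, 4, 4096));
    return 0;
}
```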
When a single processor is running, the single line cache performs better than the multiple line cache in all cases except where low order bit interleaving is used. Furthermore, with low order interleaving, performance of the single line cache gets worse as more banks are used. With this type of interleaving, spatial locality is reduced for systems with more memory banks.

For the multiple line cache, more banks provide better opportunities to exploit temporal locality, hence there is generally a lower miss rate with more banks. However, there is still a slight increase in miss rate when going from 16 banks to 64 banks with low order interleaving. In this case, the increase in temporal locality is less than the loss of spatial locality.

For 256 banks, spatial locality is almost non-existent with low order bit interleaving; only temporal locality can be exploited. Although the multiple line cache works better than the single line cache, its miss rate is still much higher than with the non-traditional interleaving methods. This demonstrates that for large scientific jobs such as our benchmark programs, a cached DRAM memory system cannot rely merely on temporal locality to be effective. Non-traditional interleave schemes must be considered.

Fig. 6 and Fig. 7 show the potential and effective bandwidths for single jobs. For generating these graphs, we assume a cache hit takes one cycle, a clean miss takes 14 cycles, and a dirty miss takes 28 cycles. These numbers are consistent with a 10 ns clock period and the Mitsubishi TP-10 CDRAM chip. Fig. 6 shows that single line caches with block interleaving have higher potential bandwidth than multiple line caches. However, due to the hot bank problem, the effective bandwidth of single line caches tends to become lower as the number of banks is increased (Fig. 7). For larger memory systems, the effective bandwidth of a single line cache is reduced by more than a factor of 4 from the potential bandwidth.

[Fig. 6. Potential bandwidth for single processors; average of 10 traces.]

[Fig. 7. Effective bandwidth for single processors; average of 10 traces.]

This difference between effective and potential bandwidths is an indicator of non-uniform memory bank references. Fig. 8 illustrates the distribution of requests to banks for one of the benchmark programs. This particular benchmark was chosen because it has a particularly obvious hot bank problem; not all the benchmarks are this bad. We see that for cache line interleaving, there is a single hot bank that gets about eight times as many references as any of the others. We also see that for 2-way block interleaving, the hot spot becomes spread over two banks, and for 4-way block interleaving it becomes spread over four banks. This is as one might expect, and it demonstrates the value of using block interleaving for reducing (but, unfortunately, not eliminating) hot bank problems. However, if two consecutive banks are hot when cache line interleaving is used, 2-way block interleaving may not reduce the problem. For the same reason, 4-way block interleaving may not be an improvement when there are four consecutive banks that are hot. This explains why all three block interleaving schemes (including cache line interleaving) have similar effective bandwidths for multiple line caches in Fig. 7.

[Fig. 8. Hot bank distribution in multiple line caches for program TRFDLG; number of references from a sample of 100K.]

In the ten benchmark programs, we have determined that hot banks occur due to one of the following three causes: (1) there are small active working arrays, (2) vector registers are spilled and reloaded from the run time stack, and (3) data blocking algorithms cause intensive data reuse. It seems that some form of local memory might be able to minimize such redundant memory references and reduce the hot bank problem associated with cache line interleaving. In other words, systems with processor data caches that are sufficiently large to exploit high temporal locality might not have hot bank problems as severe as the cacheless systems we are considering.

To make sure that our bandwidth results are interpreted correctly, consider ways the bandwidth graphs can be used for designing systems. Recall that our bandwidth graphs are based on the assumption of no idle memory cycles; that is, memory is a saturated resource (either all of memory for potential bandwidth, or at least one memory bank for effective bandwidth). This is a desirable design point when the memory system is the most expensive resource, as is typically the case in vector supercomputers. The saturated memory assumption means that for specific processor configurations, only certain regions of the bandwidth graphs may be valid. In particular, they are only valid in regions where the processor demand for memory exceeds or equals memory's ability to deliver data. This occurs when the number of processors times the number of memory reference streams per processor is greater than the memory bandwidth.

For example, in Figs. 6 and 7 we are assuming single processors. A processor's memory demand is a function of the number of memory pipelines and its clock rate. Figs. 6 and 7 assume a system clock period that is the same as the DRAM cache hit time. In this case, a vector uniprocessor operating with eight memory pipelines has a maximum demand of eight references per clock period. A processor with a clock four times as fast and two memory pipelines would have a similar demand. In either case, the memory can be saturated only if it delivers fewer than eight words per cycle. Applying this to Fig. 7, we see that the region of the graph where the bandwidth is eight words per cycle or less is applicable to our example system. Using the applicable region of the graph, it appears that 16 memory banks with a single cache line DRAM and block interleaving is a reasonably good match to the processor demand. On the other hand, if no DRAM caching were used, and the memory cycle time were always 14 clock periods, then about 8x14=112 (i.e. 128) memory banks would be required to match the potential processor demand. The cost savings of using the cached DRAM organization is very evident.
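The same arithmetic can be packaged as a rule of thumb: a saturated memory matches processor demand when the bank count covers the demand rate times the per-reference bank busy time, rounded up to a power of two. A minimal sketch (our own helper, not from the paper; the cached-DRAM line optimistically assumes a bank is busy for only the one-cycle hit time, which is why it yields 8 banks where the graphs, which include misses, suggest about 16):

```c
#include <stdio.h>

/* Banks needed so that demand (words/cycle) * busy time (cycles/reference)
 * can be serviced, rounded up to the next power of two. */
unsigned banks_needed(unsigned words_per_cycle, unsigned busy_cycles)
{
    unsigned need = words_per_cycle * busy_cycles;
    unsigned banks = 1;
    while (banks < need)
        banks *= 2;
    return banks;
}

int main(void)
{
    /* Uncached DRAM: every reference busies a bank for 14 cycles. */
    printf("no cache:    %u banks\n", banks_needed(8, 14)); /* 8*14=112 -> 128 */
    /* Cached DRAM, assuming mostly one-cycle hits. */
    printf("cached DRAM: %u banks\n", banks_needed(8, 1));  /* 8*1=8 -> 8 */
    return 0;
}
```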
5.2. Multiple Processor Performance

For multi-program performance, we chose eight of the benchmark programs: APPBT, APPLU, MG, FFT, ARC3D, SPEC77, TRFDLG, and MG3DLG, and ran them on eight different processors with memory requests being made in round-robin fashion. Note that the first four benchmarks are large, and the last four are (relatively) small. The eight jobs were placed next to each other in memory, aligned to 2KW boundaries.

The miss rate results are in Fig. 9. As might be expected, performance is worse than for a single job. This is because locality tends to be disrupted by the multiple independent memory streams. However, in systems with larger numbers of memory banks, performance is significantly better, in some cases reaching the same level as a single job.

[Fig. 9. Miss rates for multi-programmed runs; 8 processes time shared.]

For single job (uniprocessor) runs, we observed that the single line cache with block interleaving is quite effective in exploiting spatial locality. However, in multi-program runs, the competition from several processes destroys the effectiveness of holding a large cache line. As shown in Fig. 9, the multiple line cache is more robust and more effective than the single line cache unless the number of banks becomes relatively large. We also observe that for single line caches the performance differences in miss rate among the non-traditional interleaving schemes are significant; for the uniprocessor, the differences were small. This is because several processes are competing for the banks and the cache line in each bank. A higher degree of block interleaving spreads an active line across more banks, increasing the likelihood of conflict and thrashing.

Fig. 10 gives results for the small jobs alone (where one might expect more temporal locality), and Fig. 11 shows performance for the large jobs alone. The miss rates for small jobs are much better. For large jobs, low order interleaving performs poorly: single line caches have nearly a 100% miss rate and multiple line caches have a 45% miss rate even when the number of banks is as high as 256. For large jobs where caches may not exploit a high degree of temporal locality, exploiting spatial locality becomes critical. For instance, block interleaving can bring the miss rate down to near 10% with 256 banks.

[Fig. 10. Miss rates for multi-programmed runs; 4 small processes time shared.]

[Fig. 11. Miss rates for multi-programmed runs; 4 large processes time shared.]

Figs. 12 and 13 show potential bandwidth and effective bandwidth, respectively, for multi-programmed jobs. For multiple line caches, cache line interleaving works well for both potential and effective bandwidth. For single line caches, cache line interleaving has the highest potential bandwidth, but the lowest effective bandwidth due to the hot bank problem.
Four-way block interleaving has performance close to the better performing scheme in terms of both potential and effective bandwidth. Fig. 14 illustrates the distribution of requests to banks for multi-program runs with single large line caches. The curve for cache line interleaving has several spikes, and block interleaving smooths out some of the spikes.

[Fig. 12. Potential bandwidth for multi-programmed runs; 8 processes time shared.]

[Fig. 13. Effective bandwidth with multi-programming; 8 processes time shared.]

[Fig. 14. Hot bank distribution in single line caches; 8 processes time shared.]

6. Conclusions

In this paper we considered the usefulness of new cache-oriented DRAMs for delivering cost-effective bandwidth in vector supercomputers. It is apparent from our simulations that traditional low-order-bit memory interleaving will not take full advantage of the new DRAM parts; spatial locality is reduced too much. We have shown that using other, less common interleave schemes, which place successive words in the same DRAM chips, can increase locality and improve performance. Unfortunately, they also tend to produce non-uniform bank usage; the presence of hot banks can reduce overall performance. Consequently, we feel that memory system designs will involve optimization to increase memory bank locality up to a certain point, but no further. However, as we observed earlier, the types of program constructs that lead to hot bank problems may have a smaller effect when processors contain data caches that are able to reduce redundant memory traffic significantly. Non-standard interleaving schemes will likely become a key component of memory system design, and the tradeoff between increased spatial locality and hot banks will be an important design issue.

In uniprocessor environments, cached DRAMs with block interleaving can provide more cost-effective bandwidth. In multiprogrammed multiprocessor environments, we have shown that single line cached DRAMs are unlikely to be effective, consistent with the opinions expressed in [Come92]. On the other hand, we have also shown that multi-line caches with block interleaving schemes can be made to work well.

Single line cached DRAMs are more sensitive to the memory interleaving scheme that is used. We recommend that block interleaving be used. The performance of multi-line caches is less sensitive to interleaving methods, and block interleaving is called for when the cache hit time is slower than the processor cycle time.

A disadvantage of non-conventional interleaving schemes that we have not yet discussed is that they lead to systems where the addressing logic external to the chip becomes tuned to internal chip characteristics, i.e. the cache characteristics. If the DRAM chips in an optimized system are later replaced with chips having different characteristics, the interleave scheme may no longer be optimal.

Finally, we feel that our results will extend qualitatively to systems other than those we have specifically studied.
7. References

[Bail91] Bailey, D. H., et al., "The NAS Parallel Benchmarks: Summary and Preliminary Results," Proceedings of Supercomputing '91, pp. 158-165, Nov. 1991.

[Bond92] Bondurant, D., "Enhanced Dynamic RAM," IEEE Spectrum, p. 49, in [Jone92], Oct. 1992.

[Come92] Comerford, R. and G. Watson, "Memory Catches Up," IEEE Spectrum, pp. 34-35, Oct. 1992.

[Cray91] Cray Research, Inc., "CRAY Y-MP C90 Functional Description Manual," HR-04028, March 1992.

[Cybe90] Cybenko, G., et al., "Supercomputer Performance Evaluation and the Perfect Benchmarks," CSRD Report No. 965, University of Illinois, March 1990.

[Farm92] Farmwald, M. and D. Mooring, "A Fast Path to One Memory," IEEE Spectrum, pp. 50-51, Oct. 1992.

[Hart92] Hart, C., "Dynamic RAM as Secondary Cache," IEEE Spectrum, p. 48, in [Jone92], Oct. 1992.

[Jone92] Jones, F., "A New Era of Fast Dynamic RAMs," IEEE Spectrum, pp. 43-49, Oct. 1992.