Memory System Midterm Report
Low Power Non-uniform Memory Access (NUMA)
0251808 周德玉

Abstract
NUMA refers to a computer memory design choice available for multiprocessors. NUMA means that it will take longer to access some regions of memory than others. This work explains what NUMA is, the background developments, and how the memory access time depends on the memory location relative to a processor. It presents a background of multiprocessor architectures and some trends in hardware that exist alongside NUMA. It then briefly discusses the changes NUMA demands in two key areas. One is the policies the Operating System should implement for scheduling and for the run-time memory allocation scheme used for threads; the other is the programming approach programmers should take in order to harness NUMA's full potential. It also presents some numbers comparing UMA and NUMA performance.

Introduction
Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors). The benefits of NUMA are limited to particular workloads, notably on servers where the data are often associated strongly with certain tasks or users.
NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures. Multi-processor systems without NUMA make the problem considerably worse: now a system can starve several processors at the same time, notably because only one processor can access the computer's memory at a time. NUMA attempts to address this problem by providing separate memory for each processor, avoiding the performance hit when several processors attempt to address the same memory. For problems involving spread data (common for servers and similar applications), NUMA can improve the performance over a single shared memory by a factor of roughly the number of processors (or separate memory banks).
Of course, not all data ends up confined to a single task, which means that more than one processor may require the same data. To handle these cases, NUMA systems include additional hardware or software to move data between memory banks. This operation slows the processors attached to those banks, so the overall speed increase due to NUMA depends heavily on the nature of the running tasks.
Intel announced NUMA compatibility for its x86 and Itanium servers in late 2007 with its Nehalem and Tukwila CPUs. Both CPU families share a common chipset; the interconnection is called Intel Quick Path Interconnect (QPI). AMD implemented NUMA with its Opteron processor (2003), using HyperTransport. [8]

Background
Modern CPUs operate considerably faster than the main memory they use. In the early days of computing and data processing, the CPU generally ran slower than its own memory. The performance lines of processors and memory crossed in the 1960s with the advent of the first supercomputers. Since then, CPUs increasingly have found themselves "starved for data" and having to stall while waiting for data to arrive from memory. Many supercomputer designs of the 1980s and 1990s focused on providing high-speed memory access as opposed to faster processors, allowing the computers to work on large data sets at speeds other systems could not approach.
Hardware Goals / Performance Criteria
There are 3 criteria on which the performance of a multiprocessor system can be judged, viz. scalability, latency and bandwidth. Scalability is the ability of a system to demonstrate a proportionate increase in parallel speedup with the addition of more processors. Latency is the time taken to send a message from node A to node B, while bandwidth is the amount of data that can be communicated per unit of time. So, the goal of a multiprocessor system is to achieve a highly scalable, low latency, high bandwidth system.
Limiting the number of memory accesses provided the key to extracting high performance from a modern computer. For commodity processors, this meant installing an ever-increasing amount of high-speed cache memory and using increasingly sophisticated algorithms to avoid cache misses. But the dramatic increase in the size of operating systems and of the applications run on them has generally overwhelmed these cache-processing improvements.

Parallel Architectures
Typically, there are 2 major types of parallel architectures that are prevalent in the industry: Shared Memory Architecture and Distributed Memory Architecture. Shared Memory Architecture, again, is of 2 types: Uniform Memory Access (UMA), and Non-Uniform Memory Access (NUMA).

Shared Memory Architecture
As seen from Figure 1 (more details shown in the "Hardware Trends" section), all processors share the same memory and treat it as a global address space. The major challenge to overcome in such an architecture is the issue of cache coherency (i.e. every read must reflect the latest write). Such an architecture is usually adopted in the hardware model of general purpose CPUs in laptops and desktops.
Figure 1. Shared Memory Architecture. [1]

Distributed Memory Architecture
In this type of architecture, shown in Figure 2 (more details shown in the "Hardware Trends" section), all the processors have their own local memory, and there is no mapping of memory addresses across processors. So, we do not have any concept of a global address space or of cache coherency. To access data in another processor, processors use explicit communication. One example where this architecture is used is clusters, with different nodes connected over the internet as the network.
Figure 2. Distributed Memory. [1]

Shared Memory Architecture – UMA
Figure 3 shows a sample layout of processors and memory across a bus interconnection. All the processors are identical and have equal access times to all memory regions. These are also sometimes known as Symmetric Multiprocessor (SMP) machines. Architectures that take care of cache coherency at the hardware level are known as CC-UMA (cache coherent UMA).
Figure 3. UMA Architecture Layout. [3]

Shared Memory Architecture – NUMA
Figure 4 shows this type of shared memory architecture: we have identical processors connected to a scalable network, and each processor has a portion of memory attached directly to it. The primary difference between NUMA and a distributed memory architecture is that no processor can have mappings to memory connected to other processors in a distributed memory architecture, whereas in NUMA a processor may have such mappings. NUMA also introduces a classification into local memory and remote memory, based on the access latency to the different memory regions as seen from each processor. Such systems are often made by physically linking SMP machines. UMA, however, has the major disadvantage of not being scalable beyond a certain number of processors [6].
Figure 4. NUMA Architecture Layout. [3]
Hardware Trends
We now discuss 2 practical implementations of the memory architectures that we just saw: one is the Front Side Bus (FSB) based implementation and the other is Intel's Quick Path Interconnect (QPI) based implementation.

Traditional FSB Architecture (used in UMA)
As shown in Figure 5, the FSB based UMA architecture has a Memory Controller Hub (MCH), to which all the memory is connected. The CPUs interact with the MCH whenever they need to access memory. The I/O controller hub is also connected to the MCH. Hence the major bottleneck in this implementation is the bus, which has a finite speed and has scalability issues: for any communication, the CPUs need to take control of the bus, which leads to contention problems.
Figure 5. Intel's FSB based UMA Arch. [4]

Quick Path Interconnect Architecture (used in NUMA)
The key point to be observed in this implementation is that the memory is directly connected to the CPUs instead of to a central memory controller. Instead of accessing memory via a Memory Controller Hub, each CPU now has a memory controller embedded inside it. Also, the CPUs are connected to an I/O hub, and to each other. So, in effect, this implementation tries to address the common-channel contention problems.
Figure 6. Intel's QPI based NUMA Arch. [4]

New Cache Coherency Protocol
This new QPI based implementation also introduces a new cache coherency protocol, "MESIF", instead of "MESI". The new state "F" stands for Forward, and is used to denote that a cache should act as the designated responder for any requests.

Operating System Policies

OS Design Goals
Operating Systems, basically, try to achieve 2 major goals, viz. usability and utilization. By usability, we mean that the OS should be able to abstract the hardware for the programmer's convenience. The other goal is to achieve optimal resource management, and the ability to multiplex the hardware amongst different applications.

Features of NUMA aware OS
The basic requirements of a NUMA aware OS are to be able to discover the underlying hardware topology and to calculate the NUMA distances accurately. NUMA distances tell the processors (and/or the programmer) how much time it would take to access a particular region of memory. Besides these, the OS should provide a mechanism for processor affinity. This is basically done to make sure that some threads are scheduled on certain processor(s), to ensure data locality. This not only avoids remote access, but can also take advantage of a hot cache. Also, the operating system needs to exploit the first touch memory allocation policy.

Optimized Scheduling Decisions
The operating system needs to make sure that load is balanced amongst the different processors (by making sure that data is distributed amongst CPUs for large jobs), and also to implement dynamic page migration (i.e. use the latency topology to make page migration decisions).

Conflicting Goals
The goals that the Operating System is trying to achieve are conflicting in nature, in the sense that on one hand we are trying to optimize memory placement (for load balancing), and on the other hand we would like to minimize the migration of data (to overcome resource contention). Eventually, there is a trade-off, which is decided on the basis of the type of application.
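The mechanisms described above (topology discovery, NUMA distances, processor affinity and local, first-touch allocation) are exposed to applications on Linux through the libnuma library. The following minimal sketch is an illustration added for this report, not part of the original material; it assumes a Linux system with libnuma installed and is linked with -lnuma.

#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int nodes = numa_max_node() + 1;
    printf("%d NUMA node(s)\n", nodes);

    /* NUMA distances as reported by the OS: 10 means local,
       larger values mean farther (slower) memory. */
    for (int i = 0; i < nodes; i++)
        for (int j = 0; j < nodes; j++)
            printf("distance(%d,%d) = %d\n", i, j, numa_distance(i, j));

    /* Processor affinity: run this thread on the CPUs of node 0 ... */
    numa_run_on_node(0);

    /* ... and allocate memory on the same node so accesses stay local. */
    size_t size = 64UL * 1024 * 1024;
    char *buf = numa_alloc_onnode(size, 0);
    if (buf == NULL)
        return 1;
    memset(buf, 0, size);   /* the first write ("first touch") commits the physical pages */
    numa_free(buf, size);
    return 0;
}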
Programming Paradigms

NUMA Aware Programming Approach
The main goals of a NUMA aware programming approach are to reduce lock contention and to maximize memory allocation on the local node. Also, programmers need to manage their own memory for maximum portability. This can prove to be quite a challenge, since most languages do not have a built-in memory manager.

Support for Programmers
Programmers rely on tools and libraries for application development. Hence the tools and libraries need to help programmers achieve maximum efficiency, and also to implement implicit parallelism. The user or system interface, in turn, needs to have programming constructs for associating virtual memory addresses with memory on particular nodes. They also need to provide certain functions for obtaining page residency.

Programming Approach
Programmers need to explore the various NUMA libraries that are available to help simplify the task. If the data allocation pattern is analyzed properly, "First Touch Access" can be exploited fully. There are several lock-free approaches available which can be used. Besides these approaches, programmers can exploit various parallel programming paradigms, such as threads, message passing, and data parallelism.

Scalability – UMA vs NUMA
We can see from Figure 7 that UMA based implementations have scalability issues. Initially both architectures scale linearly, until the bus reaches its limit and the UMA curve stagnates. Since there is no concept of a "shared bus" in NUMA, it is more scalable.
Figure 7. UMA vs. NUMA – Scalability. [6]

Cache Latency
Figure 8 shows a comparison of the cache latency numbers of UMA and NUMA. There is no Layer 3 cache in UMA. However, for main memory and the Layer 2 cache, NUMA shows a considerable improvement. Only for the Layer 1 cache does UMA marginally beat NUMA.
Figure 8. UMA vs NUMA – Cache Latency. [4]

Cache coherent NUMA (ccNUMA)
Nearly all CPU architectures use a small amount of very fast non-shared memory known as cache to exploit locality of reference in memory accesses. With NUMA, maintaining cache coherence across shared memory has a significant overhead. Although simpler to design and build, non-cache-coherent NUMA systems become prohibitively complex to program in the standard von Neumann architecture programming model. Typically, ccNUMA uses inter-processor communication between cache controllers to keep a consistent memory image when more than one cache stores the same memory location. For this reason, ccNUMA may perform poorly when multiple processors attempt to access the same memory area in rapid succession. Operating-system support for NUMA attempts to reduce the frequency of this kind of access by allocating processors and memory in NUMA-friendly ways and by avoiding scheduling and locking algorithms that make NUMA-unfriendly accesses necessary. Alternatively, cache coherency protocols such as the MESIF protocol attempt to reduce the communication required to maintain cache coherency. The Scalable Coherent Interface (SCI) is an IEEE standard defining a directory-based cache coherency protocol to avoid the scalability limitations found in earlier multiprocessor systems. SCI is used as the basis for the Numascale NumaConnect technology.
As of 2011, ccNUMA systems are multiprocessor systems based on the AMD Opteron processor, which can be implemented without external logic, and the Intel Itanium processor, which requires the chipset to support NUMA. Examples of ccNUMA-enabled chipsets are the SGI Shub (Super hub), the Intel E8870, the HP sx2000 (used in the Integrity and Superdome servers), and those found in NEC Itanium-based systems. Earlier ccNUMA systems such as those from Silicon Graphics were based on MIPS processors and the DEC Alpha 21364 (EV7) processor. [8]

Non-Uniform Distribution of Memory Accesses on Cache Sets Affects the System Performance of Chip Multiprocessors

Extension to CMP Platforms
There are typically two ways to organize the non-first-level cache in chip multiprocessors (CMPs), i.e. shared cache or private cache. We study how the non-uniform memory access distribution across sets affects each organization by adapting SBC to each of them respectively.
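Before looking at the two organizations, the per-set non-uniformity that SBC-style schemes [10] exploit can be made visible by simply histogramming a memory access trace by set index. The sketch below is an illustration added for this report, not code from the paper; the 64-byte line size, 1024-set cache, and trace-on-stdin format are assumptions.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64
#define NUM_SETS   1024

static unsigned long set_hits[NUM_SETS];

/* Index function of a physically indexed cache: drop the block offset,
   keep log2(NUM_SETS) bits. */
static unsigned set_index(uint64_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}

int main(void)
{
    uint64_t addr;

    /* One hexadecimal address per line on stdin, e.g. from a memory trace. */
    while (scanf("%" SCNx64, &addr) == 1)
        set_hits[set_index(addr)]++;

    /* Per-set access counts: a flat histogram means a uniform distribution,
       large peaks mean heavily pressured sets. */
    for (unsigned s = 0; s < NUM_SETS; s++)
        printf("%u %lu\n", s, set_hits[s]);
    return 0;
}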
A. Shared Cache
In a shared cache, a central cache is shared among all cores (a distributed shared cache is also a typical shared cache organization; however, we focus on a central shared cache in this paper). Each of its sets is thus required to serve the misses from the cache level above, or the bypassed requests, of multiple applications running on all cores [11]. The overall memory access distribution across sets on the shared cache is a cumulative consequence of that of each application. To study the memory access distribution across sets on the shared cache, we treat all accesses to each set as a whole and simply apply SBC directly to it, regardless of where each access comes from. To distinguish this scheme from SBC on single-core platforms, we call it Shared SBC (SSBC).
There is an impressive body of research on shared cache optimization [12] [14] [15] [18]. However, we do not attempt to adapt SBC to one of those schemes, for the following reasons: (1) most of those optimization schemes seek to smartly control either the share of ways each application is allowed to use [12] [13] or the timing of a block being evicted from the cache [14] [15], both of which modify the replacement policy, making it difficult for SBC to be adapted to them; (2) experimental results for several different cache configurations indicate that SSBC has little to no performance boost over the baseline shared cache architecture.

B. Private Cache
In a private based design, the non-first-level cache is composed of multiple slices, with each one private to and closely coupled to a different core, both logically and physically, serving only local misses and requests. As a result, the memory accesses of applications are well isolated from each other, making it easy for us to adapt non-uniform distribution based schemes on top of it. Therefore, we put forward three schemes on the private based design, to try to make use of the non-uniform memory access distribution across sets and to study how this kind of non-uniformity affects the system performance of a private based cache design:
(a) Private Set Balancing Cache (PSBC): Sets of private cache slices work exactly the way they do on single-core platforms, except that they may sometimes need to deal with a few coherence operations. Therefore, we first directly introduce SBC to each private slice just the way it is used for single-core platforms, but with small modifications to the coherence protocol to ensure that a coherence request to a source set will also be directed to its destination set on a miss. We call this simple scheme Private SBC (PSBC).
(b) Balanced Private NUCA (BP-NUCA): The static partitioning style in the private based cache design may lead to undesirably low utilization of the precious on-chip cache resources [11].
To address this limitation, many private cache enhancement schemes use a spilling technique [16] [17] to improve the capacity utilization of the private cache design by allowing evicted blocks of one private slice to be saved at a peer private slice [16]. In fact, the spilling technique shares similar ideas with SBC, as both seek to move some blocks of highly accessed sets to underutilized sets. The main difference is that SBC seeks to move those blocks to sets in the same private cache slice but with a different index address, whereas the typical spilling technique attempts to move those blocks to sets with the same index address but in a remote peer cache. To distinguish between the two different "moves", we call the former "move" operation a displacement and the latter a spill.
Recall that the Saturation Counter (SC) adopted in SBC can measure the memory access pressure experienced by the corresponding cache set. This work puts forward a new spilling technique for private caches based on the following insight: the SC can also be used in the spilling technique to guide the spill process with the memory access pressure information it detects. We call this new spilling technique Balanced Private Non-uniform Cache Architecture (BP-NUCA), because the private based cache design is in nature a NUCA itself, with varying access latency to different cache slices, and also because this SC-based spilling technique seeks to balance the cache space utilization of the different private slices. BP-NUCA works in the following way: on a miss to a set, if the SC value of this set is larger than the migration limit (denoted as ThM), then the victim of the miss is allowed to spill into one of the sets of the peer caches that have the same index (called peer sets). A peer set is qualified to serve as a receiver of the spilled block on the condition that its SC value is smaller than a receiver limit (denoted as ThR). When there is more than one potential receiver set, the one with the shortest access latency is selected.
(c) Balanced Private NUCA+ (BP-NUCA+): We also adapt SBC to BP-NUCA and get BP-NUCA+. We expect BP-NUCA+ to be beneficial because it attempts to balance the cache capacity utilization both horizontally (by spilling blocks to peer sets) and vertically (by displacing blocks to different sets of the same cache slice). As a result, blocks of highly accessed sets are also able to borrow the space of sets that have a different index address and are located at remote cache slices. Consequently, BP-NUCA+ has five policy parameters, namely (SAT, ThM, ThA, ThL, ThR). Those parameters should follow SAT ≥ ThM ≥ ThA. SAT is the upper bound of the SC. Spilled blocks have a longer access latency than displaced ones, so we prefer to displace blocks first; therefore, we let ThM ≥ ThA. Figure 9 compares the basic ideas of PSBC, BP-NUCA and BP-NUCA+.
Figure 9. Comparison of PSBC, BP-NUCA and BP-NUCA+

Experimental Methodology
The study uses the g-cache module of Virtutech Simics [9], a full system simulator, for the performance studies. Evaluation is performed on a 4-core CMP with the parameters given in Table I. An in-order core model is used so that the proposals can be evaluated within a reasonable time. For the study, 23 SPEC CPU2006 benchmarks are used to randomly create 16 4-benchmark multiprogrammed workloads, as listed in Table II. All workloads are simulated until each benchmark in the workload executes at least 250M instructions.
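To make the spill policy concrete before turning to the results, the following sketch restates the BP-NUCA decision described above as C-style simulator code. It is an illustration written for this report, not code from the paper; the data structures, the per-slice latency field, and the 4-core constant are assumptions.

#define NUM_CORES 4

typedef struct {
    unsigned sc;          /* saturation counter: access-pressure estimate for this set */
    /* ... tags, data, replacement state ... */
} cache_set_t;

typedef struct {
    cache_set_t *sets;    /* sets of this core's private slice */
    unsigned latency;     /* access latency of this slice from the requesting core */
} private_slice_t;

/* On a miss to set `index` of core `self`, decide where the victim block goes:
   -1 means keep it local (no spill), otherwise the id of the peer slice whose
   same-index set receives the spilled block. */
int bp_nuca_spill_target(private_slice_t slice[NUM_CORES], int self,
                         unsigned index, unsigned ThM, unsigned ThR)
{
    if (slice[self].sets[index].sc <= ThM)
        return -1;                      /* set not under pressure: no spill */

    int best = -1;
    unsigned best_lat = ~0u;
    for (int peer = 0; peer < NUM_CORES; peer++) {
        if (peer == self)
            continue;
        cache_set_t *peer_set = &slice[peer].sets[index];   /* same index: a peer set */
        /* a peer set qualifies as a receiver only if it is under low pressure */
        if (peer_set->sc < ThR && slice[peer].latency < best_lat) {
            best = peer;
            best_lat = slice[peer].latency;
        }
    }
    return best;   /* closest qualifying peer set, or -1 if none qualifies */
}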
Experimental Results and Findings
We examine the performance of SSBC against Shared. For the private cache based design, we compare the performance of PSBC, BP-NUCA and BP-NUCA+ to that of Private.

A. Results for Shared Cache
To examine the performance of SSBC across a spectrum of memory configurations (specifically, different associativities), we run the simulation on three different configurations, namely an 8-way 2MB cache, a 16-way 2MB cache and a 32-way 2MB cache. We set (SAT, ThM, ThA) as (3A − 1, 2A − 1, A). Figure 10 shows the throughput of SSBC on those three configurations, normalized to that of Shared for each corresponding configuration. Geomean is the geometric mean over all 16 workloads. As can be observed from Figure 10, although for each memory configuration SSBC does outperform the baseline shared cache for a few workloads, it degrades the throughput for the majority of the workloads. The most obvious example is the case of the 32-way 2MB cache. In general, SSBC has little to no average performance benefit for all three configurations, and even shows an obvious average degradation for the 32-way 2MB cache. Besides, SSBC does not show the kind of stability that SBC exhibits on single-core platforms. Based on the experimental results, we draw two conclusions: (a) on CMP platforms with multiprogrammed workloads, the unpredicted interaction between memory accesses from different applications complicates the memory access distribution across sets, reducing the non-uniformity of the memory access distribution across sets in general; (b) schemes that seek to make use of the non-uniformity of the memory access distribution across sets for a performance boost are not necessary for shared caches on CMP platforms.

B. Results for Private Cache
Figure 11 shows the throughput of PSBC, BP-NUCA and BP-NUCA+, normalized to Private. Geomean is the geometric mean over all 16 workloads. We expect PSBC to outperform Private because the memory access pattern for each private cache slice is similar to that of a single-core cache due to the static partitioning of cache space, except for a few coherence requests. The results shown in Figure 11 confirm our expectation, with PSBC outperforming Private for almost all the workloads except MIX 12, and an average performance boost of 2.0% across all 16 workloads.
As previously stated, BP-NUCA has three policy parameters, i.e. (SAT, ThM, ThR), while BP-NUCA+ has five policy parameters, i.e. (SAT, ThM, ThA, ThL, ThR). The values of those parameters not only impact the accuracy of the cache pressure level estimation for each set, but also tune the aggressiveness of the two schemes. It would be unrealistic to try the whole exploration space to find the optimal values for all the parameters. Based on the initial conclusions on (ThA, ThL) from [10], we come up with several parameter configurations for each scheme empirically and search for the best configurations amongst them experimentally. The detailed parameter study is omitted due to limited space. BP-NUCA and BP-NUCA+ achieve a surprisingly high performance boost, far beyond our expectation, outperforming the baseline Private by 7.7% and 7.6% on average, respectively. BP-NUCA outperforms PSBC for 14 out of the 16 workloads, while BP-NUCA+ outperforms PSBC for 13 out of the 16 workloads. BP-NUCA and BP-NUCA+ also demonstrate considerably good stability, in that they work better than the baseline Private for all 16 workloads simulated.
One notable fact is that, although BP-NUCA+ takes into account the non-uniformity of the memory access distribution across sets, this kind of enhancement over BP-NUCA does not necessarily result in much performance benefit. In fact, although BP-NUCA+ outperforms BP-NUCA for 8 out of the 16 workloads, it achieves a slightly lower geometric mean throughput than BP-NUCA. Based on the above results and analysis, we draw several conclusions: (a) on a private based cache design, direct adaptation of schemes that make use of the non-uniform memory access distribution across sets for a performance boost is proved to be beneficial; (b) however, if those schemes are used in conjunction with the traditional private cache enhancement technique, namely spilling, they fail to bring further benefit across a wide spectrum of multiprogrammed workloads. [19]
Figure 10. Throughput Performance of SSBC (on 8-way 2MB cache, 16-way 2MB cache and 32-way 2MB cache respectively)
Figure 11. Throughput Performance of PSBC, BP-NUCA and BP-NUCA+, normalized to that of Private

A Non-Uniform Cache Architecture for Low Power System Design

Non-Uniform Cache Architecture
We determine the optimum number of cache-ways for each cache-set at design time. Although the number of active cache-ways can be changed dynamically by using a sleep transistor during the course of running an application program, we do not consider this in this work. The power supply of unused cache-ways (the gray portion of Figure 12) can be disconnected by eliminating the vias used for connecting the power supply to the memory cells. Unused memory cells can also be disconnected from the bit and word lines in the same fashion.
Figure 12. Deactivating sense amplifiers
One possible way of marking unused cache blocks is to use a second valid bit [20]. If the bit is one, the corresponding cache block will not be used for replacement in case of a cache miss. Accessing an unused block will always cause a cache miss. To reduce the dynamic power consumption of the non-uniform cache, it is possible to deactivate the sense-amplifiers of cache-ways which are marked as unused for the accessed cache-set. This can be easily implemented by checking the set-index field of the memory address register. For example, in Figure 12, the sense-amplifiers for tag1 and way1 are deactivated when the target cache-set is 4, 5, 6, or 7. Similarly, the sense-amplifiers for tag2, way2, tag3, and way3 are deactivated when one of sets 2-7 is accessed.

Reducing Redundant Cache Accesses
In [21], Panwar et al. have shown that cache-tag access and tag comparison do not need to be performed for all instruction fetches. Consider an instruction j executed immediately after an instruction i. There are three cases:
1. Intra-cache-line sequential flow: This occurs when instructions i and j reside on the same cache-line and i is a non-branch instruction or an untaken branch.
2. Inter-cache-line sequential flow: This case is similar to the first one; the only difference is that i and j reside on different cache-lines.
3. Non-sequential flow: In this case, i is a taken branch instruction and j is its target.
In the first case (intra-cache-line sequential flow), it is easy to detect that j resides in the same cache-way as i. Therefore, there is no need to perform a tag lookup for instruction j [21][22][23]. On the other hand, a tag lookup and a cache-way access are required for a non-sequential fetch such as a taken branch (non-sequential flow) or a sequential fetch across a cache line boundary (inter-cache-line sequential flow).
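The three cases above boil down to a simple classification of consecutive fetch addresses. The sketch below is an illustration added for this report (not code from [21]); the 32-byte instruction cache line is an assumed parameter.

#include <stdint.h>
#include <stdbool.h>

#define ILINE_BYTES 32

typedef enum {
    INTRA_LINE_SEQUENTIAL,   /* same line, not a taken branch: tag lookup can be skipped */
    INTER_LINE_SEQUENTIAL,   /* sequential fetch crossing a line boundary                */
    NON_SEQUENTIAL           /* taken branch to its target                               */
} fetch_kind_t;

fetch_kind_t classify_fetch(uint64_t prev_pc, uint64_t next_pc, bool taken_branch)
{
    if (taken_branch)
        return NON_SEQUENTIAL;
    if (prev_pc / ILINE_BYTES == next_pc / ILINE_BYTES)
        return INTRA_LINE_SEQUENTIAL;    /* i and j reside on the same cache-line */
    return INTER_LINE_SEQUENTIAL;
}

/* Only the intra-line sequential case lets the cache reuse the way of the
   previous fetch and keep the tag arrays deactivated. */
bool tag_lookup_needed(fetch_kind_t k)
{
    return k != INTRA_LINE_SEQUENTIAL;
}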
As a consequence, the power consumption of the cache memory can be reduced by deactivating the memory modules of tags and cache-ways in the case of the intra-cache-line sequential flow. Several embedded processors including ARM [22][23] use this technique. We refer to this technique as Inter-Line Way Memoization or ILWM. We use ILWM in our approach.
Figure 13. A code placement technique for reducing redundant cache-way and cache-tag accesses
Assume a basic block "a" consists of 7 instructions and its last instruction, a7, which is a taken branch, resides in the fourth word of the cache line "n" (see Figure 13). Further, assume the last instruction of the cache line "n" is not a branch instruction. A tag lookup is required when a3 or a7 is executed, because in either case it is not clear whether the next instruction resides in the cache or not. However, if the location of the basic block "a" in the address space is changed so that the basic block "a" is not located across a cache-line boundary, the cache and tag accesses for instruction a3 can be eliminated (see Figure 13). Therefore, we change the placement of basic blocks in the main memory so that frequently accessed basic blocks are not located across a cache-line boundary. To the best of our knowledge, this is the first code placement technique which reduces the number of redundant cache-way and cache-tag accesses.
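The placement decision illustrated in Figure 13 reduces to checking whether a basic block straddles a cache-line boundary and, if so, padding its start address. The following sketch is an added illustration, not the paper's algorithm; the 32-byte line size and the block descriptor are assumptions, and it only applies to blocks no larger than one line.

#include <stdint.h>
#include <stdbool.h>

#define LINE_BYTES 32u

typedef struct {
    uint32_t start;    /* start address of the basic block in the binary   */
    uint32_t size;     /* size of the block in bytes (<= LINE_BYTES here)  */
} basic_block_t;

/* True if the block spans two instruction cache lines, which forces an
   extra tag lookup in the middle of the block under ILWM. */
static bool crosses_line(const basic_block_t *bb)
{
    return bb->start / LINE_BYTES != (bb->start + bb->size - 1) / LINE_BYTES;
}

/* Bytes of padding (e.g. nops or relocated cold code) to place before the
   block so that its start is aligned to the next line and it no longer
   crosses a boundary. */
static uint32_t padding_to_fit(const basic_block_t *bb)
{
    if (!crosses_line(bb))
        return 0;
    return LINE_BYTES - (bb->start % LINE_BYTES);
}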
Figure 14 shows the power breakdown for a cache. For example, in "JPEG-enc" the inter-cache-line sequential flow is responsible for 10% of the cache accesses. Note that for inter-cache-line sequential flows, all cache-ways and cache-tags are activated. Therefore, the power consumption of the cache memory due to the inter-cache-line sequential flow is large, especially for highly associative caches. Assuming a 16-way set associative cache, more than 50% of the cache power in "JPEG-enc" is due to the inter-cache-line sequential flow. Therefore, decreasing the number of inter-cache-line sequential flows substantially reduces the cache power consumption. Another way of reducing the number of times the inter-cache-line sequential flow occurs is increasing the size of the cache-lines. However, increasing the cache-line size increases the number of off-chip memory accesses in case of a cache miss. Our algorithm presented in the next section takes this trade-off into account and explores different cache-line sizes to minimize the total power consumption of the memory hierarchy.
Figure 14. Power breakdown for a cache

Experimental Results
We compared the following four techniques: (a) performing cache sizing for a uniform cache, (b) performing cache sizing for a uniform cache and the conventional code placement after that, (c) performing our code placement and cache sizing for a uniform cache concurrently, and (d) concurrent optimization for the non-uniform cache. Redundant cache-way and cache-tag access elimination (ILWM) [21] was used for all four techniques. The number of cache-sets in all experiments was 8. The power consumption results optimized without and with a time constraint are shown in Figure 8 of [24] (not reproduced here). The time constraint Tconst is set to the execution time of the target application program with the original cache configuration. "Low Leak" and "High Leak" in those results correspond to the low- and high-leakage scenarios, respectively. Since conventional code placement techniques reduce only the number of cache misses, they may increase the number of cache-way and tag accesses if the processor uses the ILWM technique [21]; for example, compare cases (a) and (b) in the time-constrained optimization results. On the other hand, our methods (c) and (d) always reduce the dynamic power consumption of the cache memories. Optimizing without a time constraint reduced the power consumption for "Compress" by 29% (17% on average), while in the presence of a time constraint, up to 76% (52% on average) reduction in the power consumption was achieved. The reason for the better results in the time-constrained case is that it requires a higher number of ways; therefore, there is more opportunity for our method to reduce the average number of cache-ways accessed. Table III shows the number of ways, the cache-line size and the cache size (in bytes) for the high-leakage case in our experiment. As one can see, in many cases our approach (d) reduces the effective size (the total size of the blocks used) of the cache memory as well.
TABLE III. The cache configuration results
Since the behavior of a program depends on its input values, an object code and cache configuration optimized for a specific input value is not necessarily optimal for other input values. To see the effect of changing the input value on the cache behavior, we calculated the power consumption of the memory systems for different input values. We calculated the following three values for six different input values: 1. the power consumption for the original object code executed with a uniform cache optimized for Data0; 2. the power consumption (Ptotal) for the optimized object code executed with a non-uniform cache optimized for Data0; 3. the total execution time (Ttotal) for the optimized object code running on a processor with the non-uniform cache. The performance value is normalized to the performance for Data0. Figure 15 shows the results for six different input values for each benchmark program. The left and right vertical axes represent the power consumption of the memories and the normalized performance of a processor with the non-uniform cache, respectively. The object code and cache configuration were optimized for Data0 using our algorithm for non-uniform caches. As one can see, the object code and the cache configuration optimized for Data0 achieve very good results for the other input values as well.
Figure 15. Input Data Dependency
Table IV shows the computation time (in seconds) of the four optimization methods executed on an UltraSPARC-II dual CPU workstation running Solaris 8 at 450MHz with 2GB of memory. Since the optimization time in some cases is very large, our future plan is to substantially reduce it. [24]
Table IV. CPU-time for cache optimization (seconds)

Conclusions
The hardware industry has adopted NUMA as an architecture design choice, primarily because of its characteristics like scalability and low latency. However, modern hardware changes also demand changes in programming approaches (development libraries, data analysis) as well as in Operating System policies (processor affinity, page migration). Without these changes, the full potential of NUMA cannot be exploited.
The chip multiprocessor study [19] explores the feasibility of taking advantage of the non-uniform memory access distribution across sets in non-first-level cache management to improve the system performance of CMPs. It presents four cache management schemes for CMP platforms, namely SSBC, PSBC, BP-NUCA and BP-NUCA+, based on the single-core scheme SBC [10], aiming to balance the memory access distribution across cache sets, on both shared and private caches.
Experimental results using a full system CMP simulator indicate that on CMP platforms with multiprogrammed workloads: (a) for shared caches, the non-uniform memory access distribution across different cache sets is biased by the fact that multiple applications are running concurrently and sharing the cache capacity; the proposed scheme SSBC, which is derived by adapting SBC to shared caches, is proved to be of little to no benefit, or even to lead to degradation; (b) for caches that are organized as private caches, the simple adaptation of SBC to each private cache, namely PSBC, is proved to outperform the baseline private organization by 2% on average; (c) however, for the private cache based enhancement scheme proposed there, i.e. BP-NUCA, a further adaptation of SBC on top of it, i.e. BP-NUCA+, is of little to no benefit. Therefore, the conclusion is that on CMP platforms with multiprogrammed workloads, the distribution of memory accesses across cache sets is less non-uniform, or the non-uniformity cannot easily be taken advantage of, due to the interactions between multiple applications. As a result, special efforts to seek more benefit by taking advantage of this kind of non-uniformity are not really necessary.
The low power cache study [24] proposed the non-uniform cache architecture, a code placement technique for reducing the power consumption of caches, and an algorithm for simultaneous cache configuration optimization and code placement. In the future, the authors plan to enhance the method by dynamically disabling cache-ways during the course of running an application program.

References
[1] "Introduction to Parallel Computing": https://computing.llnl.gov/tutorials/parallel_comp/
[2] "Optimizing software applications for NUMA": http://software.intel.com/en-us/articles/optimizing-software-applications-for-numa/
[3] "Parallel Computer Architecture - Slides": http://www.eecs.berkeley.edu/~culler/cs258-s99/
[4] "Cache Latency Comparison": http://arstechnica.com/hardware/reviews/2008/11/nehalem-launch-review.ars/3
[5] "Intel - Processor Specifications": http://www.intel.com/products/processor/index.htm
[6] "UMA-NUMA Scalability": www.cs.drexel.edu/~wmm24/cs281/lectures/ppt/cs282_lec12.ppt
[7] "Non-Uniform Memory Access (NUMA)": http://cs.nyu.edu/~lerner/spring10/projects/NUMA.pdf
[8] "Non-Uniform Memory Access": http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access
[9] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, B. Werner. Simics: a full system simulation platform. Computer, 2002, 35(2): 50-58.
[10] Dyer Rolán, Basilio B. Fraguela, Ramón Doallo. Adaptive line placement with the set balancing cache. Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, New York, NY, ACM, 2009, pp. 529-540.
[11] Xiaomin Jia, Jiang Jiang, Tianlei Zhao, Shubo Qi, Minxuan Zhang. Towards Online Application Cache Behaviors Identification in CMPs. The 12th IEEE International Conference on High Performance Computing and Communications, Melbourne, Australia, IEEE Computer Society, 2010.
[12] G. E. Suh, S. Devadas, et al. A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning. Proc. Int. Symposium on High Performance Computer Architecture, Washington, DC, USA, IEEE Computer Society, 2002, pp. 117-128.
[13] S. Kim, D. Chandra, Y. Solihin.
Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture. Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, Antibes Juan-les-Pins, France, IEEE Computer Society, 2004, pp. 111-122.
[14] Aamer Jaleel, William Hasenplaugh, Moinuddin Qureshi, Julien Sebot, Simon Steely, Joel Emer. Adaptive insertion policies for managing shared caches on CMPs. Proc. Int. Conference on Parallel Architectures and Compilation Techniques, Toronto, Canada, ACM, 2008, pp. 208-219.
[15] Yuejian Xie, G. H. Loh. PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches. Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA-36), Austin, TX, USA, ACM, 2009.
[16] J. Chang, G. S. Sohi. Cooperative caching for chip multiprocessors. Proc. Int. Symposium on Computer Architecture, Boston, MA, USA, IEEE Computer Society, 2006, pp. 264-276.
[17] M. K. Qureshi. Adaptive spill-receive for robust high-performance caching in CMPs. Proc. Int. Symposium on High Performance Computer Architecture, Raleigh, North Carolina, USA, IEEE Computer Society, 2009, pp. 45-54.
[18] Moinuddin K. Qureshi, David Thompson, Yale N. Patt. The V-Way Cache: Demand Based Associativity via Global Replacement. SIGARCH Comput. Archit. News, Vol. 33, No. 2, 2005, pp. 544-555.
[19] "Understanding How Non-Uniform Distribution of Memory Accesses on Cache Sets Affects the System Performance of Chip Multiprocessors"
[20] D. A. Patterson, et al. "Architecture of a VLSI instruction cache for a RISC". In Proc. 10th Annual Int'l Symposium on Computer Architecture, vol. 11, no. 3, pp. 108-116, June 1983.
[21] R. Panwar and D. Rennels. "Reducing the Frequency of Tag Compares for Low Power I-Cache Design". In Proc. of ISLPED, pp. 57-62, August 1995.
[22] S. Segars. "Low Power Design Techniques for Microprocessors". ISSCC Tutorial note, February 2001.
[23] M. Muller. "Power Efficiency & Low Cost: The ARM6 Family". In Proc. of Hot Chips IV, August 1992.
[24] "A Non-Uniform Cache Architecture for Low Power System Design"