EE5364 – Project Proposal, Fall 2003

ACCESS REGION LOCALITY AND DATA-DECOUPLED ARCHITECTURE FOR INCREASING THE MEMORY BANDWIDTH OF A MULTI-ISSUE SUPERSCALAR PROCESSOR

Amruta P Inamdar (2771311)
Shivaraju B Gowda (2477474)
Shubha B. (2908256)

Abstract:
Present and future multi-issue processors place a tremendous load on memory bandwidth. To achieve high performance, memory systems must handle these demands efficiently. This can be done in two ways: by adding hardware in the form of multiple cache ports, or by exploiting the behavior of data references to handle memory accesses efficiently. Adding more ports to the cache significantly increases hardware complexity and cost. In this project we propose to evaluate and/or extend access region locality [2] and the data-decoupled architecture [1] to obtain high bandwidth for multi-issue superscalar processors. These techniques exploit the behavior of the data references in a typical program to increase memory bandwidth.

1. Introduction:
Microprocessor performance is improving at a much faster rate than memory subsystem performance. Issuing multiple instructions per cycle to exploit instruction-level parallelism (ILP) places a greater demand on the memory system to service multiple requests per cycle. Efficient handling of memory references is one of the keys to achieving high performance through ILP, so high data bandwidth becomes essential. Memory systems must go beyond the traditional exploitation of temporal and spatial data locality to meet these requirements. With the small feature sizes supported by today's technologies, a multi-ported cache is the most obvious solution. However, as the number of cache ports increases, hardware complexity grows, lengthening the critical path and potentially forcing a reduction in clock frequency. An alternative or complementary approach is to exploit the behavior of the data references in order to increase memory bandwidth. The data-decoupled architecture [1] and access region locality [2] are techniques used for this purpose.

2. Access Region Locality:
The concept of access region locality is based on the premise that a memory reference instruction typically accesses a single region (such as the stack or the non-stack region) at run time, so the region it will access is predictable. Using this information we can effectively increase the memory bandwidth. Cho et al. [2] showed that the premise holds for a set of SPEC95 benchmark programs. They also developed an effective run-time access region predictor, which is similar in structure to a branch predictor.
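To make the idea concrete, the following is a minimal sketch, in C (the language of the SimpleScalar simulator we plan to modify), of the kind of run-time access region predictor described above: a small table indexed by the low-order bits of a load/store PC, where each entry remembers the region that instruction accessed the last time it executed. The table size, the simple last-region prediction policy, and the two-way stack/non-stack split are our own illustrative assumptions, not the exact design of [2].

/* Sketch of a PC-indexed access region predictor, structured like a
 * simple branch predictor.  All sizes and names are illustrative. */

#include <stdio.h>

#define PRED_ENTRIES 1024          /* assumed predictor table size */

typedef enum { REGION_STACK = 0, REGION_NONSTACK = 1 } region_t;

typedef struct {
    region_t last_region;          /* last observed region for this PC */
} pred_entry_t;

static pred_entry_t pred_table[PRED_ENTRIES];

/* Predict the region a memory instruction will access, indexed by its PC. */
region_t predict_region(unsigned long pc)
{
    return pred_table[(pc >> 2) % PRED_ENTRIES].last_region;
}

/* Once the effective address resolves, classify the region it falls in.
 * stack_limit/stack_base delimit the (downward-growing) stack segment. */
static region_t classify_region(unsigned long addr,
                                unsigned long stack_limit,
                                unsigned long stack_base)
{
    return (addr >= stack_limit && addr <= stack_base)
               ? REGION_STACK : REGION_NONSTACK;
}

/* Train the table with the observed region. */
static void update_predictor(unsigned long pc, region_t observed)
{
    pred_table[(pc >> 2) % PRED_ENTRIES].last_region = observed;
}

int main(void)
{
    unsigned long pc = 0x400100UL;
    unsigned long sp = 0x7fff0000UL, stack_top = 0x80000000UL;

    region_t guess = predict_region(pc);                       /* predict at dispatch */
    region_t seen  = classify_region(sp + 16, sp, stack_top);  /* resolve the address */
    update_predictor(pc, seen);                                /* train the table     */

    printf("predicted region %d, observed region %d\n", guess, seen);
    return 0;
}

The last-region policy shown here is the simplest possible variant; per-entry saturating counters, as in a bimodal branch predictor, would make the prediction more resistant to occasional region changes.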
3. Data-Decoupled Architecture:

Concept: An effective way to reduce hardware complexity is to partition the instruction window among the functional units so that only certain kinds of instructions can use a particular functional unit. In the MIPS R10000, the window is partitioned into an integer queue, a floating-point queue, and an address queue based on instruction type, and each queue has its own functional units. The data-decoupled architecture further partitions the window based on memory access: the data memory stream is divided into two independent memory streams before the instructions enter the reservation stations, and each stream is fed to a separate memory unit. Hence two data caches are used, each with its own dedicated pool of reservation stations, or access queues. As an example, in figure (1) the instructions are divided into two sets based on whether the memory access goes through network 1 (N/W 1) or network 2 (N/W 2). The two main issues of the data-decoupled architecture are memory stream partitioning and load balancing.

[Figure (1) – Data-decoupled architecture: dispatch feeds two memory access queues (LSQ 1, LSQ 2), which reach two caches (Cache 1, Cache 2) through networks N/W 1 and N/W 2; each cache has two ports.]

Issues:
Memory Stream Partitioning – The instructions must be partitioned into two independent streams. This can be done either at run time or at compile time. At run time, speculation must be used, and a hardware mechanism is needed to purge an instruction from the queue in case of a misprediction. Partitioning at compile time reduces hardware complexity but increases the compiler's burden. Intuitively, a combination of the two is likely to be the most efficient and cost-effective approach.

Load Balancing – If the load on the queues is not balanced, one queue is overused while the other sits nearly idle, so for this scheme to make a difference the load must be balanced. If it is not, one way to improve performance is to subdivide the memory streams into smaller sub-streams and feed them to separate caches. A simple cache policy is likely to increase efficiency.

Motivation: The main advantage of this architecture is that it enables the use of simple, cost-effective caches with a small number of ports. Furthermore, the control logic between the ports and the reservation stations is simplified, and each stream can be optimized separately. It is important to note that this architecture does not reduce IPC; instead it greatly reduces hardware complexity. The approach works well when accesses to the stack and non-stack regions are interleaved, but certain floating-point programs demand much more bandwidth for the non-stack region than for the stack, which limits performance. We can enhance this version of the data-decoupled architecture by further dividing the stack and non-stack memory streams into sub-streams and feeding them to separate cache banks, improving performance when stack and non-stack accesses are not interleaved or are bursty. This is the main idea behind this project.

4. Simulations to be done:

Implementation: As in figure (1), the data memory stream is divided into two windows – stack and non-stack – according to the region each reference accesses. The stack-memory instruction window is called the "local variable access queue" (LVAQ) and the non-stack-memory instruction window is called the "load-store queue" (LSQ). An ordinary cache is connected to the LSQ, while a small stack cache called the "local variable cache" (LVC) is connected to the LVAQ. We will look for opportunities to further divide these two memory access queues into multiple independent queues, each feeding a dedicated cache bank, in order to balance the load; a sketch of the steering logic we have in mind appears below. A decoupled architecture with multiple cache banks can deliver more performance by eliminating an unbalanced bandwidth demand on a single queue. Memory access predictors and their accuracy must also be studied so that references are steered to the correct queues and cache banks.
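As a concrete illustration of this steering logic, the following sketch dispatches memory instructions into the LVAQ or into per-bank LSQ sub-queues, reusing the predict_region() routine from the earlier sketch. The queue sizes, the two-bank split, and the occupancy-based bank selection are our own illustrative assumptions, not the exact mechanism of [1]; misprediction recovery is noted but not shown.

/* Sketch of dispatch-time steering into the LVAQ (stack stream) or one of
 * several LSQ banks (non-stack stream).  Sizes, bank count, and the
 * occupancy-based bank choice are illustrative assumptions only. */

#include <stdbool.h>

#define LVAQ_SIZE 16               /* assumed local variable access queue size */
#define LSQ_SIZE  32               /* assumed per-bank load-store queue size   */
#define NUM_BANKS  2               /* assumed number of non-stack cache banks  */

typedef enum { REGION_STACK = 0, REGION_NONSTACK = 1 } region_t;

/* Provided by the access region predictor sketched in Section 2. */
extern region_t predict_region(unsigned long pc);

typedef struct {
    unsigned long pc;
    bool          is_load;
} mem_inst_t;

typedef struct {
    mem_inst_t entries[LSQ_SIZE];
    int        count;
} access_queue_t;

static access_queue_t lvaq;              /* backed by the small LVC (stack cache)  */
static access_queue_t lsq[NUM_BANKS];    /* one sub-queue per non-stack cache bank */

/* Choose the least-occupied non-stack sub-queue, to absorb bursty demand. */
static int pick_lsq_bank(void)
{
    int best = 0;
    for (int b = 1; b < NUM_BANKS; b++)
        if (lsq[b].count < lsq[best].count)
            best = b;
    return best;
}

/* Steer one memory instruction at dispatch using the predicted region.
 * Returns false if the target queue is full (dispatch would stall). */
static bool dispatch_mem_inst(mem_inst_t inst)
{
    if (predict_region(inst.pc) == REGION_STACK) {
        if (lvaq.count >= LVAQ_SIZE)
            return false;
        lvaq.entries[lvaq.count++] = inst;
        return true;
    }
    int b = pick_lsq_bank();
    if (lsq[b].count >= LSQ_SIZE)
        return false;
    lsq[b].entries[lsq[b].count++] = inst;
    return true;
}

In a real out-of-order model, a full queue would simply stall dispatch, and a mispredicted region, detected once the effective address resolves, would trigger a selective purge of that entry from the wrong queue, as discussed under Issues above.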
Experiments: We will study the current SimpleScalar simulator [5], which supports wide-issue, out-of-order issue and execution. We will implement the access region locality and data-decoupled architecture techniques to validate the performance reported in the reference papers, and we will look for opportunities to extend and modify them for additional performance gains. We will then modify the simulator to meet our architecture's requirements and measure the performance of the new architecture.

Benchmark programs: We plan to use the SPEC2000 benchmark suite to evaluate the performance of this architecture and to compare the results with the SPEC95 results in the reference papers.

5. Conclusions:
An architecture involving multiple caches and prediction schemes will be challenging to implement. Since the architecture has some inherent issues, it will be interesting to observe the performance variation under both ideal and non-ideal (unbalanced load) conditions. In the ideal case, we want to study the effect of predictor accuracy, cache size, and the number of cache ports on performance. We could also implement sub-streaming of the queues and measure the performance under imbalanced loads. It would also be interesting to compare the access region locality of the SPEC2000 and SPEC95 benchmark suites.

References:
[1] G. Lee, S. Cho, P. C. Yew, "Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor", Proc. of the 26th Int'l Symp. on Computer Architecture, May 1999.
[2] G. Lee, S. Cho, P. C. Yew, "Access Region Locality for High-Bandwidth Processor Memory System Design".
[3] J. A. Rivers, G. S. Tyson, E. S. Davidson, T. M. Austin, "On High-Bandwidth Data Cache Design for Multi-Issue Processors", Proc. of MICRO-30, December 1997.
[4] H. Neefs, H. Vandierendonck, K. De Bosschere, "A Technique for High-Bandwidth and Deterministic Low Latency Load/Store Accesses to Multiple Cache Banks", IEEE, 1999.
[5] D. Burger, T. M. Austin, "The SimpleScalar Tool Set, Version 2.0", Computer Science Dept. Technical Report No. 1342, Univ. of Wisconsin, June 1997.