EE5364 – Project Proposal
Fall 2003
ACCESS REGION LOCALITY AND DATA DECOUPLED ARCHITECTURE
FOR INCREASING THE MEMORY BANDWIDTH OF A MULTI-ISSUE
SUPERSCALAR PROCESSOR
Amruta P Inamdar (2771311)
Shivaraju B Gowda (2477474)
Shubha B. (2908256)
Abstract:
Present and future multi-issue processors place tremendous demands on memory bandwidth. To achieve high performance, the memory system must handle these demands efficiently. This can be done in two ways: by adding hardware in the form of multiple cache ports, or by exploiting the behavior of data references to handle memory accesses more efficiently. Adding more ports to the cache significantly increases hardware complexity and cost. In this project we propose to evaluate and/or extend access region locality [2] and the data-decoupled architecture [1] to obtain high bandwidth for multi-issue superscalar processors. Both techniques use the behavior of data references in a typical program to increase memory bandwidth.
1. Introduction:
Microprocessor performance is improving at a much faster rate than memory subsystem performance. Issuing multiple instructions per cycle to extract more instruction-level parallelism (ILP) places a greater demand on the memory system to service multiple requests per cycle. Efficient handling of memory references is one of the keys to achieving high performance through ILP, so high data bandwidth becomes essential. Memory systems must do much more than exploit traditional temporal and spatial data locality to meet these requirements. With the smaller feature sizes supported by today's technologies, a multi-ported cache is the most obvious solution. However, as the number of cache ports increases, so does the hardware complexity, lengthening the critical path and potentially reducing the clock frequency. An alternative or complementary approach is to exploit the behavior of data references to increase memory bandwidth. The data-decoupled architecture [1] and access region locality [2] are techniques developed for this purpose.
2. Access Region Locality:
The concept of access region locality is based on the premise that a memory reference instruction typically accesses a single region (e.g., stack, global, or heap) at run time, so the region it will access is predictable. Using this information, we can effectively increase memory bandwidth. Cho et al. [2] showed that this premise holds for a set of SPEC95 benchmark programs, and they developed a very effective run-time access region predictor, similar in structure to a branch predictor.
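As a concrete illustration (our own sketch, not the exact organization used in [2]), such a predictor could be a PC-indexed table of two-bit saturating counters, updated once the effective address resolves; the table size and indexing below are assumptions:

    #include <stdint.h>

    #define PRED_ENTRIES 1024          /* assumed table size (power of two) */

    /* Regions a load/store may touch; the stack/non-stack split is the one
       used by the data-decoupled architecture. */
    typedef enum { REGION_STACK = 0, REGION_NONSTACK = 1 } region_t;

    /* One two-bit saturating counter per entry: 0,1 -> predict stack;
       2,3 -> predict non-stack (analogous to a bimodal branch predictor). */
    static uint8_t pred_table[PRED_ENTRIES];

    static unsigned pred_index(uint32_t pc) {
        return (pc >> 2) & (PRED_ENTRIES - 1);   /* drop byte offset, mask to table size */
    }

    region_t predict_region(uint32_t pc) {
        return (pred_table[pred_index(pc)] >= 2) ? REGION_NONSTACK : REGION_STACK;
    }

    /* Train the predictor once the true region of the access is resolved. */
    void update_region_predictor(uint32_t pc, region_t actual) {
        uint8_t *ctr = &pred_table[pred_index(pc)];
        if (actual == REGION_NONSTACK) {
            if (*ctr < 3) (*ctr)++;
        } else {
            if (*ctr > 0) (*ctr)--;
        }
    }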
3. Data-decoupled Architecture:
Concept:
An effective way to reduce hardware complexity is to partition the instruction window among the functional units so that only certain kinds of instructions can use a particular functional unit. In the MIPS R10000, the window is partitioned into an integer queue, a floating-point queue, and an address queue based on instruction type, each with its own associated functional units. The data-decoupled architecture further partitions the window based on memory access: the data memory stream is divided into two independent streams before the instructions enter the reservation stations, and each stream is fed to a separate memory unit. Hence, two data caches are used, each with its own dedicated pool of reservation stations (access queues).
As an example, in Figure (1) the instructions are divided into two sets based on whether the memory access goes through Network 1 or Network 2. The two main issues of the data-decoupled architecture are memory stream partitioning and load balancing.
[Figure (1) – Data-decoupled architecture with two memory access queues and caches with two ports each: dispatch feeds LSQ 1 and LSQ 2, which connect through N/W 1 and N/W 2 to Cache 1 and Cache 2.]
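To make the partitioning concrete, the following sketch (an illustration under our own assumptions, building on the predictor sketch in Section 2, not the mechanism of [1]) shows how dispatch might steer each memory instruction into one of the two access queues of Figure (1):

    /* Hypothetical dispatch-time steering into two independent access queues;
       the queue structure, depth, and stall policy are illustrative assumptions. */
    typedef struct {
        uint32_t pc;
        int      is_load;        /* 1 = load, 0 = store */
        region_t predicted;      /* region guessed at dispatch, verified later */
    } mem_inst_t;

    typedef struct {
        mem_inst_t entries[32];  /* assumed queue depth */
        int        head, tail, count;
    } access_queue_t;

    static access_queue_t lsq1, lsq2;    /* the two queues of Figure (1) */

    int dispatch_mem_inst(mem_inst_t *mi) {
        mi->predicted = predict_region(mi->pc);
        access_queue_t *q = (mi->predicted == REGION_STACK) ? &lsq1 : &lsq2;
        if (q->count == 32)
            return 0;                    /* structural stall: chosen queue is full */
        q->entries[q->tail] = *mi;
        q->tail = (q->tail + 1) % 32;
        q->count++;
        return 1;
    }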
Issues:
Memory Stream Partitioning – The instructions must be partitioned into two independent streams. This can be done either at run time or at compile time. At run time, speculation must be used, and a hardware mechanism is needed to purge instructions from the wrong queue in case of a misprediction. If partitioning is done at compile time, hardware complexity is reduced but the compiler's load increases. Intuitively, a combination of both approaches is likely to be the most efficient and cost-effective.
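For the run-time case, the verification step might look like the following sketch; it builds on the structures above, and the stack-bounds check and the squash_and_redirect() hook are our assumptions, since the reference papers do not commit to a specific recovery interface here:

    /* Assumed recovery hook: purge the entry from the wrong queue and
       re-dispatch it to the queue matching the actual region. */
    extern void squash_and_redirect(mem_inst_t *mi, region_t actual);

    /* Resolve the true region once the effective address is known; the simple
       bounds check against the current stack extent is an assumption. */
    region_t resolve_region(uint32_t addr, uint32_t stack_low, uint32_t stack_high) {
        return (addr >= stack_low && addr <= stack_high) ? REGION_STACK
                                                         : REGION_NONSTACK;
    }

    /* Verify the speculation and train the predictor with the resolved region. */
    void verify_region(mem_inst_t *mi, uint32_t addr,
                       uint32_t stack_low, uint32_t stack_high) {
        region_t actual = resolve_region(addr, stack_low, stack_high);
        update_region_predictor(mi->pc, actual);
        if (actual != mi->predicted)
            squash_and_redirect(mi, actual);
    }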
Load Balancing – Clearly, if the load on the queues is not balanced, one queue is overused while the other sits nearly idle. For this scheme to make a difference, the load must be balanced. If it is not, one way to improve performance is to subdivide the memory streams into smaller sub-streams and feed them to separate caches. Keeping the cache policy simple is likely to increase efficiency.
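As a hedged example of such sub-streaming (the bank count and line size below are assumptions, not values from [1] or [2]), a cache bank within a stream could be chosen by interleaving cache lines across the banks:

    #include <stdint.h>

    #define NUM_BANKS        4    /* assumed banks per stream; power of two */
    #define LINE_OFFSET_BITS 5    /* assumed 32-byte cache lines */

    /* Interleave consecutive cache lines across the banks of one stream so
       that bursty accesses to a single region spread over several banks. */
    static inline unsigned select_bank(uint32_t effective_addr) {
        return (effective_addr >> LINE_OFFSET_BITS) & (NUM_BANKS - 1);
    }

With line interleaving, two same-cycle references conflict only when they fall on the same bank; other hash functions could be tried if bank conflicts turn out to be frequent.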
Motivation:
The main advantage of this architecture is that it enables the use of simple, cost-effective caches with a small number of ports. Furthermore, the control logic between the ports and the reservation stations is simplified, and each stream can be optimized separately. It is important to note that this architecture does not reduce IPC; rather, it greatly reduces hardware complexity. The approach works well if accesses to the stack and non-stack regions are interleaved. However, certain floating-point programs demand more bandwidth for the non-stack region than for the stack, which limits performance. We can enhance this version of the data-decoupled architecture by further dividing the stack and non-stack memory streams into sub-streams and feeding them to separate cache banks. This is done to improve the performance of the data-decoupled architecture when stack and non-stack accesses are not interleaved or when they are bursty. This is the main idea behind this project.
4. Simulations to be done:
Implementation:
Similar to Figure (1), the data memory stream is divided into two windows – the stack stream and the non-stack stream. The stack-memory instruction window is called the "local variable access queue" (LVAQ), and the non-stack-memory instruction window is called the "load-store queue" (LSQ). An ordinary cache is connected to the LSQ, while a small stack cache called the "local variable cache" (LVC) is connected to the LVAQ. We will look for opportunities to further divide the two memory access queues into multiple independent queues and feed them to multiple dedicated cache banks in order to balance the load.
A decoupled architecture with multiple cache banks can deliver higher performance by eliminating the unbalanced bandwidth demand that would otherwise fall on a single queue. Memory access predictors and their accuracy must be studied so that each reference is steered to the correct queue and cache bank.
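The following sketch lists the parameters we expect to sweep in the simulations; every value is a placeholder assumption, not a figure taken from the reference papers:

    /* Hypothetical parameter set for the proposed organization; all values
       below are placeholders to be varied in the experiments. */
    typedef struct {
        int lvaq_entries;      /* local variable access queue (stack stream) depth */
        int lsq_entries;       /* load-store queue (non-stack stream) depth        */
        int lvc_size_kb;       /* small local variable cache backing the LVAQ      */
        int lvc_ports;
        int dl1_size_kb;       /* ordinary data cache backing the LSQ              */
        int dl1_ports;
        int banks_per_stream;  /* sub-streams per queue when load balancing is on  */
        int predictor_entries; /* access region predictor table size               */
    } decoupled_config_t;

    static const decoupled_config_t baseline = {
        .lvaq_entries = 16, .lsq_entries = 32,
        .lvc_size_kb = 4,   .lvc_ports = 1,
        .dl1_size_kb = 32,  .dl1_ports = 2,
        .banks_per_stream = 1, .predictor_entries = 1024
    };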
Experiments:
We will study the current SimpleScalar simulator [5], which supports wide-issue, out-of-order issue and execution. We will implement the access region locality and data-decoupled architecture techniques to validate the performance reported in the reference papers, and we will also look for opportunities to extend and modify them for additional performance gains. We will then modify the simulator as required by our architecture and measure the performance of the new architecture.
Benchmark programs:
We plan to use the SPEC2000 benchmark suite to evaluate the performance of this architecture and to compare the results with those reported for the SPEC95 benchmarks in the reference papers.
5. Conclusions:
The architecture, with its different caches and prediction schemes, will be challenging to implement. Since the architecture inherently has open issues, it will be interesting to see how performance varies under both ideal and non-ideal (unbalanced-load) conditions. In the ideal case, we want to study the effect of predictor accuracy, cache size, and the number of cache ports on performance. We could also implement sub-streaming of the queues and measure performance under imbalanced loads. Finally, it will be interesting to compare access region locality across the SPEC2000 and SPEC95 benchmark suites.
References:
[1] G. Lee, S. Cho, P. C. Yew, "Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor", Proc. of the 26th Int'l Symp. on Computer Architecture, May 1999.
[2] G. Lee, S. Cho, P. C. Yew, "Access Region Locality for High-Bandwidth Processor Memory System Design".
[3] J. A. Rivers, G. S. Tyson, E. S. Davidson, T. M. Austin, "On High-Bandwidth Data Cache Design for Multi-Issue Processors", Proc. of MICRO-30, December 1997.
[4] H. Neefs, H. Vandierendonck, K. De Bosschere, "A Technique for High-Bandwidth and Deterministic Low Latency Load/Store Accesses to Multiple Cache Banks", IEEE, 1999.
[5] D. Burger, T. M. Austin, "The SimpleScalar Tool Set, Version 2.0", Computer Science Dept. Technical Report No. 1342, Univ. of Wisconsin, June 1997.