
University of Massachusetts, Dartmouth
CIS 570
Survey Paper
Fall 2010
Memory Hierarchies-Basic Design and Optimization Techniques
Submitted by
Rahul Sawant, Bharath H. Ramaprasad
Sushrut Govindwar, Neelima Mothe
Computer and Information Science Department.
Survey on Memory Hierarchies – Basic Design and Cache
Optimization Techniques
Abstract - In this paper we provide a comprehensive survey of past and current work on memory hierarchies and their optimizations, with a focus on cache optimizations. First, we discuss various types of memory hierarchies and the basic optimizations possible. We then shift our focus to cache optimizations and discuss the motivation for this survey. Next, we discuss different types of cache memories and their mapping policies. To avoid the various categories of cache misses and to achieve high performance and low energy consumption, we discuss the basic and advanced cache optimizations, as well as further techniques such as trace caches [12]. We then compare the various cache optimizations discussed and also examine a few novel memory hierarchies and cache optimizations that deviate from conventional hierarchies, such as Dynamic Memory Hierarchy Performance Optimization [6] and the Way-Predicting Set-Associative Cache for High Performance and Low Energy Consumption [7]. Lastly, we discuss a few open and challenging issues faced in various cache optimization techniques.
Index Terms – Column Associative caches, Cache Misses, Temporal, Spatial, Multi-Level
Cache, Blocking
INTRODUCTION
Most modern CPUs are so fast that, for most program workloads, the bottleneck is the locality of reference of memory accesses and the efficiency of caching and memory transfer between the different levels of the hierarchy. As a result, the CPU spends much of its time idling, waiting for memory I/O to complete. This is sometimes called the space cost, since a larger memory object is more likely to overflow a small/fast level and require use of a larger/slower level. This is where memory hierarchies come into the picture.
A memory hierarchy in computer storage distinguishes each level in the hierarchy by response time. Since response time, complexity, and capacity are related, the levels may also be distinguished by the controlling technology. A conventional memory hierarchy is depicted in Figure 1 below.
The design of memory hierarchies and their optimization has been heavily researched in the past couple of decades. The basic problem statement is how to design a memory system of reasonable cost that can deliver data at speeds close to the CPU's consumption rate. The most obvious answer is to design a memory hierarchy with slow (inexpensive, large) components at the levels far from the CPU and the fastest (most expensive, smallest) components at the level closest to it [12].
Cache optimizations are required to overcome the various types of cache misses - compulsory, capacity, and conflict - across the various mapping policies. They improve the performance of the cache and bring memory access latencies down, which in turn speeds up memory fetches and reduces the miss rate and miss penalty, so that the memory system can deliver data at close to the CPU clock rate.
Figure 1. Conventional memory hierarchy (levels, from slowest/largest to fastest/smallest: tape, disk, main memory, L2 cache, split L1 D-cache and I-cache, CPU)
Motivation
The well-known plot of processor and memory performance trends over time shows that the gap between the CPU and memory curves has grown far apart, indicating a serious need for improvement and advancement in memory technologies and optimizations. In the recent past, different application areas have required techniques that promise high performance and low power consumption, with wide usage in Content Delivery Networks (CDNs), grid networks, and gaming applications. It is therefore of utmost importance to look into the issues of memory hierarchies and cache mapping policies, and this is the motivation for our survey on the topic.
SURVEY
The survey discusses basic optimization techniques such as column-associative caches [1] and the Predictive Sequential Associative Cache [2] for achieving higher associativity to reduce the miss rate; improving direct-mapped cache performance by adding a small fully-associative cache and prefetch buffers [3]; and Characteristics of Performance-Optimal Multi-Level Cache Hierarchies [4] for using multi-level caches to reduce the miss penalty. Fixed and Adaptive Sequential Prefetching in Shared Memory Multiprocessors [11] is covered for achieving larger effective block sizes to reduce the miss rate. Further, the cache performance and optimization of blocked algorithms [5],[3] is discussed, in which loop nests are restructured (blocked) so that the data reused by a computation fits in the cache, reducing the miss rate; a sketch of this technique follows below.
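As a concrete illustration of the blocking (tiling) idea surveyed from [5], here is a generic sketch, not the code studied in that paper; the matrix size N and tile size BS are assumed tuning parameters, chosen so that roughly three BS x BS tiles fit in the cache.

```c
/* Blocked (tiled) matrix multiply: C += A * B for N x N matrices.
 * Each BS x BS tile of the operands is reused while it is resident in the
 * cache, turning capacity misses into hits compared to the naive loop. */
#define N  512
#define BS 32                       /* tile size (assumed tuning parameter) */

void matmul_blocked(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int kk = 0; kk < N; kk += BS)
                /* multiply one BS x BS tile */
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double a = A[i][k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i][j] += a * B[k][j];
                    }
}
```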
Types of Basic Cache Optimizations
1. Larger block sizes to reduce miss rate
Miss rate reduction represents the classical approach to improving cache behavior and has therefore received a lot of interest over the last decade. Table 1 below shows how the average memory access time varies with block size for several cache sizes: larger blocks exploit spatial locality and reduce compulsory misses, but beyond a point the growing miss penalty (and, in small caches, the growing number of conflict misses) outweighs the benefit. The advantages and disadvantages of this technique are stated below.
Advantages
1. Exploits spatial locality of data/instructions.
2. Reduces compulsory misses.
Disadvantages
1. Increasing sensitivity to the miss penalty as block size increases.
2. Conflict misses increase (fewer blocks fit in a small cache).
Table 1 – Average memory access time (clock cycles) against block size for four cache sizes

Block size   Miss penalty   4K       16K      64K      256K
16           82             8.027    4.231    2.673    1.894
32           84             7.082    3.411    2.134    1.558
64           88             7.160    3.323    1.933    1.447
128          96             8.469    3.659    1.979    1.470
256          112            11.961   4.685    2.288    1.549
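The entries in Table 1 are average memory access times computed from the standard relation below; treating the hit time as one clock cycle (an assumption consistent with the usual textbook presentation of this data) lets one read the implied miss rates back out of the table.

$$\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}$$

For example, the 4K cache with 16-byte blocks has AMAT = 8.027 cycles with a miss penalty of 82 cycles, which implies a miss rate of roughly (8.027 - 1)/82 ≈ 8.6%.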
2. Higher associativity to reduce miss rate
Higher associativity is one of the good techniques for reducing the miss rate. Consider a cache in which the replacement policy could have mapped a memory location to any of ten places; all ten places must be checked to see whether that location is in the cache. Checking more places takes more power, space, and time. On the other hand, caches with more associativity suffer fewer misses, so the CPU wastes less time reading from the slow main memory. The rule of thumb is that doubling the associativity, from direct mapped to 2-way, or from 2-way to 4-way, has about the same effect on hit rate as doubling the cache size. Associativity increases beyond 4-way have much less effect on the hit rate.
In order of increasing hit time and decreasing miss rate:
Direct-mapped cache: the best (fastest) hit time
2-way set-associative cache
2-way skewed-associative cache
4-way set-associative cache
Fully associative cache: the best (lowest) miss rate
Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, relative to the CCT of a direct-mapped cache.
Average memory access time (clock cycles) against associativity for various cache sizes

Cache size (KB)   1-way   2-way   4-way   8-way
1                 2.33    2.15    2.07    2.01
2                 1.98    1.86    1.76    1.68
4                 1.72    1.67    1.61    1.53
8                 1.46    1.48    1.47    1.43
16                1.29    1.32    1.32    1.32
32                1.20    1.24    1.25    1.27
64                1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20
(Red entries in the original table indicate cache sizes for which the average memory access time is not improved by more associativity.)
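To make the example above concrete (a sketch under the stated clock-cycle-time assumptions; the per-configuration miss rates behind the table are not reproduced here), each entry combines the slower hit time of the more associative cache with its lower miss rate:

$$\text{AMAT}_{n\text{-way}} = \text{CCT}_{n\text{-way}} \times \text{Hit time}_{1\text{-way}} + \text{Miss rate}_{n\text{-way}} \times \text{Miss penalty}$$

A 2-way cache therefore pays a 10% longer hit time (CCT = 1.10) and only wins overall when its miss-rate reduction recovers that extra hit cost, which is why the larger caches in the lower rows stop benefiting from added associativity.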
3. Multi-level caches to reduce miss penalty
The increasing speed of new generation processors will exacerbate the already large difference between CPU cycle times and main memory access times. As this difference grows, it will be increasingly difficult to build single-level caches that are both fast enough to match these fast cycle times and large enough to effectively hide the slow main memory access times. One solution to this problem is to use a multi-level cache hierarchy. The authors of [4] examine the relationship between cache organization and program execution time for multi-level caches.
They show that a first-level cache dramatically reduces the number of references seen by a second-level cache, without having a large effect on the number of second-level cache misses. This reduction in the number of second-level cache hits changes the optimal design point by decreasing the importance of the cycle time of the second-level cache relative to its size. The lower the first-level cache miss rate, the less important the second-level cycle time becomes. This change in the relative importance of cycle time and miss rate makes associativity more attractive and increases the optimal cache size for second-level caches over what they would be for an equivalent single-level cache system [4].
They then infer that the multi-level hierarchy design problem is substantially more complex than the
single-level case. Not only is there an additional set of design decisions introduced for each level in
the hierarchy, but the optimal choices at each level depend on the characteristics of the
adjacent caches. The direct influence of upstream caches is obvious in the miss ratio simulations that
they plot.
Multi-level caches provide one means of dealing with the large difference between the CPU cycle time and the access time of the main memory [4]. By providing a second level of caching, one can reduce the cost of first-level misses, which in turn improves overall system performance. Furthermore, by reducing the L1 cache miss penalty, the optimal L1 size is also reduced, increasing the viability of high-performance RISC CPUs coupled with small, short-cycle-time L1 caches.
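The benefit of a second level is usually summarized by the standard two-level average-memory-access-time relation (a textbook identity stated here for reference, not a formula reproduced from [4]):

$$\text{AMAT} = \text{Hit time}_{L1} + \text{Miss rate}_{L1} \times \left(\text{Hit time}_{L2} + \text{Miss rate}_{L2} \times \text{Miss penalty}_{L2}\right)$$

where Miss rate_{L2} is the local miss rate of the second-level cache, i.e. the fraction of L1 misses that also miss in L2.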
Advanced Cache Optimization Techniques
1. Way Prediction
Many modern processors use set-associative caches as L1 or L2 caches. Increasing cache associativity increases the hit rate, but it also results in a longer cache access time because of the delay for way selection. To solve this problem, researchers proposed set-associative caches with way prediction.
Way prediction is one approach to reducing hit time. Using way prediction, high performance is achieved and energy consumption is kept relatively low for set-associative caches. Instead of accessing all the ways in a set, only the one predicted cache way is accessed, which reduces energy consumption. It also improves the energy-delay (ED) product by 60-70% compared to a conventional set-associative cache [18].
Based on a speculative selection, a way-predicting cache chooses one way before it starts a normal cache access. When the prediction is accurate, energy consumption is reduced without any performance degradation. Figure 1(a) shows how the way-predicting cache accesses the predicted way. If the prediction is correct, the cache access completes successfully. Otherwise, the cache searches the remaining ways, as shown in Figure 1(b). [18]
1(a) Prediction Hit
1(b) Prediction Miss
Figure 1: Way-Predicting Set Associative Cache
When a way is predicted, as shown in Fig. 1(a), the way-predicting cache consumes energy for activating the predicted way only, and the cache access can be completed in one cycle. On a prediction miss, however, the access time of the way-predicting cache increases due to the successive two-phase process shown in Fig. 1(b), and more energy is consumed. The performance and energy efficiency of a way-predicting cache therefore depend largely on how accurately the way is predicted [18].
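A minimal simulation sketch of the way-prediction behaviour described above is given below (this is not the circuit of [7]/[18]; the MRU-based prediction, the four-way organization, and the phase counting are illustrative assumptions).

```c
#include <stdint.h>
#include <stdbool.h>

#define WAYS 4

/* One cache set of a way-predicting set-associative cache (illustrative). */
typedef struct {
    uint32_t tag[WAYS];
    bool     valid[WAYS];
    int      predicted_way;   /* e.g. the most-recently-used way */
} Set;

/* Returns the number of probe phases spent: 1 on a correct prediction,
 * 2 when the remaining ways must be searched (as in Fig. 1(b)). */
int access_set(Set *s, uint32_t tag, bool *hit)
{
    /* Phase 1: probe only the predicted way -> one way's worth of energy. */
    int w = s->predicted_way;
    if (s->valid[w] && s->tag[w] == tag) {
        *hit = true;
        return 1;
    }
    /* Phase 2: probe the remaining ways. */
    for (int i = 0; i < WAYS; i++) {
        if (i != w && s->valid[i] && s->tag[i] == tag) {
            s->predicted_way = i;   /* update prediction to the MRU way */
            *hit = true;
            return 2;
        }
    }
    *hit = false;   /* miss: the line must be fetched from the next level
                       (the fill itself is not modeled in this sketch) */
    return 2;
}
```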
2. Multi-Bank Cache
A multi-bank cache, shown in Figure 2(a), uses a crossbar interconnection to distribute the memory reference stream among multiple cache banks. Each bank of a multi-bank cache can independently service one cache request per cycle, and each bank has a single port. This technique delivers high-bandwidth access as long as simultaneous accesses map to independent banks. The design has a lower latency and area requirement than a replicated cache, especially for large cache sizes: its small, single-ported banks are less costly and faster, although its crossbar interconnect has a high cost.
Figure 2(b) Multi-Bank Cache
For mapping reference addresses onto the corresponding cache banks, a bank selection function is necessary. This function can affect the bandwidth delivered by the multi-bank implementation, since it influences the distribution of accesses across the banks. An inefficient function may increase bank conflicts, reducing the delivered bandwidth. Efficient functions must be weighed against implementation complexity and the possibility of lengthening the cache access time; this renders accurate but complex selection functions unattractive for cache design. [19]
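A concrete, deliberately simple example of a bank selection function is plain bit extraction from the block address; this sketch is illustrative and is not the function evaluated in [19].

```c
#include <stdint.h>

#define NUM_BANKS         8     /* must be a power of two for this scheme */
#define BLOCK_OFFSET_BITS 5     /* 32-byte cache blocks (assumed) */

/* Simple bit-extraction bank selection: consecutive cache blocks are
 * interleaved across banks, so unit-stride streams hit different banks. */
static inline unsigned bank_select(uint64_t addr)
{
    return (unsigned)((addr >> BLOCK_OFFSET_BITS) & (NUM_BANKS - 1));
}
```

More elaborate functions (for example, XOR-folding higher address bits into the bank index) spread pathological strides across banks better, at the cost of the extra selection logic the text warns about.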
3. Pipelined Cache
A study was done during the design of a microprocessor, based on the MIPS instruction set architecture, to be implemented in GaAs direct-coupled FET logic with multi-chip module packaging. Figure 3 shows a processor with ranges of parameters. A two-level cache organization is necessary to hide the large disparity in speed between the processor and main memory. The level-one (L1) cache is split into instruction (L1-I) and data (L1-D) halves to provide an instruction or data access every cycle. [20]
Figure 3: The access paths of pipelined instruction cache
Increasing the number of cache pipeline stages can reduce the CPU cycle time, but it will increase the number of load and branch delay cycles due to pipeline hazards. If no useful instructions are executed during these delay cycles, they will increase the number of cycles a program takes to execute, potentially negating the benefit of pipelining. Research shows that as many as three load or branch delay cycles can be hidden by static compile-time instruction scheduling or by dynamic hardware-based methods. Static schemes are more effective at hiding branch delay cycles, but dynamic methods are more effective at filling load delay slots. By combining these schemes, caches with two to three pipeline stages achieve higher performance than caches with fewer pipeline stages. [20]
4. Non-Blocking Cache
Non-blocking caches are used to hide memory latency by overlapping processor computation with data access. A non-blocking cache allows execution to proceed concurrently with cache misses as long as dependency constraints are observed. The research considered the following: 1) load operations are non-blocking, 2) write operations are non-blocking, and 3) the cache is capable of servicing multiple cache miss requests. In order to allow these non-blocking operations and multiple outstanding misses, the authors introduced Miss Information/Status Holding Registers (MSHRs), which record the information pertaining to outstanding requests. Each MSHR entry includes the data block address, the cache line of the block, the word in the block that caused the miss, and the functional unit or register to which the data is to be routed. [21]
Non-blocking loads require extra support in the execution unit of the processor in addition to the MSHRs associated with a non-blocking cache. If static instruction scheduling is used in the processor pipeline, some form of register interlock is needed to preserve correct data dependencies.
A consistency problem can arise when the processor allows non-blocking writes, since a later read may be needed before a previously buffered write is performed. If these two operations are on the same data block, an associative check in the write buffer or the MSHRs must be done to provide the correct value to the following read. [21]
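A minimal data-structure sketch of the MSHR bookkeeping described above follows (the field names, table size, and merge policy are illustrative assumptions, not the design in [21]).

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_MSHRS 8

/* One Miss Information/Status Holding Register entry: records an
 * outstanding miss so later accesses need not block the processor. */
typedef struct {
    bool     valid;
    uint64_t block_addr;     /* data block address of the outstanding miss */
    int      cache_line;     /* cache line reserved for the incoming block */
    int      word_in_block;  /* which word in the block caused the miss */
    int      dest_reg;       /* register / functional unit awaiting the data */
} MSHR;

static MSHR mshr[NUM_MSHRS];

/* On a miss: reuse an existing entry for the same block (secondary miss),
 * or allocate a new MSHR; if none is free, the cache must stall. */
int handle_miss(uint64_t block_addr, int line, int word, int dest)
{
    int free_slot = -1;
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (mshr[i].valid && mshr[i].block_addr == block_addr)
            return i;                        /* already pending for this block */
        if (!mshr[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;                           /* structural stall: MSHRs full */
    mshr[free_slot] = (MSHR){true, block_addr, line, word, dest};
    return free_slot;
}
```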
5. Software prefetching
Software prefetching is one of the better techniques for improving the performance of the memory subsystem to match today's high-performance processors. A compiler algorithm is used to insert prefetch instructions into scientific codes that operate on dense matrices. The algorithm issues prefetches only for the references that are most likely to cause cache misses. By generating fully functional code, the authors measure the improvement in cache miss rates; the algorithm also improves the execution speed of the programs. [22]
A few important concepts underlie these prefetch algorithms: prefetches are possible only if the memory addresses can be determined ahead of time, and prefetches are unnecessary if the data are already in the cache at the time of the prefetch. [22]
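As an illustration of what inserted prefetches look like, here is a hand-written sketch using the GCC/Clang __builtin_prefetch intrinsic rather than output of the compiler algorithm in [22]; the prefetch distance of 16 elements is an assumed tuning parameter.

```c
/* Sum a large array, prefetching ahead so that the miss latency overlaps
 * with useful work. PREFETCH_DIST should roughly cover the memory latency. */
#define PREFETCH_DIST 16

double sum_array(const double *a, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&a[i + PREFETCH_DIST], /*rw=*/0, /*locality=*/1);
        s += a[i];
    }
    return s;
}
```

Roughly speaking, the algorithm in [22] goes further: it uses locality analysis and loop splitting to isolate the iterations predicted to miss, so that data already in the cache is not prefetched unnecessarily.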
6. Hardware Prefetching
Several schemes have been proposed for prefetching that are strictly hardware-based. Porterfield evaluated several cache-line-based hardware prefetching schemes. These schemes are quite effective at reducing miss rates, but at the same time they often increase memory traffic. Lee proposed a scheme for prefetching in a multiprocessor where all shared data is uncacheable; he found that the effectiveness of these schemes depends on branch prediction and synchronization behavior. Baer and Chen proposed a scheme that uses a history buffer to detect strides. In this scheme a "lookahead PC" runs through the program ahead of the normal PC using branch prediction. When the lookahead PC finds a matching stride entry in the table, it issues a prefetch. They evaluated the scheme in a memory system with a 30-cycle miss latency and found good results. [22]
Some of the advantages of hardware prefetching over software prefetching are that the hardware has better dynamic information, so it can recognize things such as unexpected cache conflicts that are difficult to predict in the compiler, and that it does not add any instruction overhead to issue prefetches. [22]
Its disadvantage is the difficulty of detecting more complex memory access patterns. [22]
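The stride-detection idea attributed to Baer and Chen above can be sketched with a small reference prediction table; the table size, field layout, and two-step confidence policy below are illustrative assumptions rather than the original design.

```c
#include <stdint.h>
#include <stdbool.h>

#define RPT_ENTRIES 64

/* Reference prediction table entry: per load PC, remember the last address
 * and the stride between its last two accesses. */
typedef struct {
    uint64_t pc_tag;
    uint64_t last_addr;
    int64_t  stride;
    int      confidence;   /* issue prefetches only after repeated strides */
} RPTEntry;

static RPTEntry rpt[RPT_ENTRIES];

/* Called for each load; returns true and sets *prefetch_addr when a
 * stride has been confirmed and a prefetch should be issued. */
bool rpt_access(uint64_t pc, uint64_t addr, uint64_t *prefetch_addr)
{
    RPTEntry *e = &rpt[(pc >> 2) % RPT_ENTRIES];
    if (e->pc_tag != pc) {                       /* new load: initialize entry */
        *e = (RPTEntry){pc, addr, 0, 0};
        return false;
    }
    int64_t new_stride = (int64_t)(addr - e->last_addr);
    if (new_stride == e->stride && new_stride != 0) {
        if (e->confidence < 2) e->confidence++;
    } else {
        e->stride = new_stride;
        e->confidence = 0;
    }
    e->last_addr = addr;
    if (e->confidence >= 2) {
        *prefetch_addr = addr + (uint64_t)e->stride;
        return true;
    }
    return false;
}
```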
Cache Optimization Algorithms
Belady's Algorithm
The most efficient caching algorithm would be to always discard the information that will not be needed for the longest time in the future. This optimal policy is referred to as Belady's optimal algorithm. It is generally not implemented in practice because of one disadvantage: it is impossible to predict how far in the future information will be needed.
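Because the full future reference string is required, Belady's policy is usually only computed offline as an upper bound on achievable hit rate; a small illustrative simulation is sketched below (a three-entry fully associative cache and an arbitrary reference string are assumptions of the example).

```c
/* Belady (OPT) replacement: given the full future reference string, evict
 * the resident block whose next use is farthest in the future. */
#include <stdio.h>

#define CACHE_SLOTS 3

static int next_use(const int refs[], int n, int from, int block)
{
    for (int j = from; j < n; j++)
        if (refs[j] == block) return j;
    return n;                       /* never used again */
}

int count_misses_opt(const int refs[], int n)
{
    int cache[CACHE_SLOTS], used = 0, misses = 0;
    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int k = 0; k < used; k++)
            if (cache[k] == refs[i]) { hit = 1; break; }
        if (hit) continue;
        misses++;
        if (used < CACHE_SLOTS) { cache[used++] = refs[i]; continue; }
        /* evict the resident block whose next use is farthest away */
        int victim = 0, farthest = -1;
        for (int k = 0; k < used; k++) {
            int nu = next_use(refs, n, i + 1, cache[k]);
            if (nu > farthest) { farthest = nu; victim = k; }
        }
        cache[victim] = refs[i];
    }
    return misses;
}

int main(void)
{
    int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    printf("OPT misses: %d\n", count_misses_opt(refs, 12));
    return 0;
}
```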
Least Recently Used
Discards the least recently used items first. This algorithm requires keeping track of what was used when, which is expensive if one wants to make sure the algorithm always discards the least recently used item. General implementations of this technique require keeping "age bits" for cache lines and tracking the least recently used cache line based on those age bits. In such an implementation, every time a cache line is used, the age of all other cache lines changes. LRU is actually a family of caching algorithms.
Most Recently Used
This algorithm discards, in contrast to LRU, the most recently used items first. According to [15], when a file is being repeatedly scanned in a looping sequential reference pattern, MRU is the best replacement algorithm. Other authors have pointed out that for random access patterns and repeated scans over large datasets, MRU cache algorithms have more hits than LRU due to their tendency to retain older data [16].
Pseudo-LRU
For caches with large associativity (generally >4 ways), the implementation cost of LRU becomes
prohibitive. If a scheme that almost always discards one of the least recently used items is sufficient,
the PLRU algorithm can be used which only needs one bit per cache item to work.
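A common realization is the binary-tree PLRU; the 4-way sketch below (three tree bits per set) is a standard textbook-style illustration rather than any particular processor's implementation.

```c
/* Tree-PLRU for a 4-way set: 3 bits b0,b1,b2 arranged as a binary tree.
 *        b0
 *       /  \
 *     b1    b2
 *    /  \  /  \
 *   w0  w1 w2  w3
 * Convention: each bit points toward the side that should be victimized.
 * On an access, set the bits on the path to point away from the used way;
 * to find a victim, follow the bits down from the root. */
#include <stdint.h>

typedef struct { uint8_t b0, b1, b2; } PlruSet;

void plru_touch(PlruSet *s, int way)
{
    s->b0 = (way < 2) ? 1 : 0;                  /* victim comes from other half */
    if (way < 2) s->b1 = (way == 0) ? 1 : 0;    /* point to the sibling leaf    */
    else         s->b2 = (way == 2) ? 1 : 0;
}

int plru_victim(const PlruSet *s)
{
    if (s->b0 == 0)
        return (s->b1 == 0) ? 0 : 1;
    else
        return (s->b2 == 0) ? 2 : 3;
}
```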
Segmented LRU
A modification of LRU, called Segmented LRU (SLRU), divides the cache into two segments:
1. Probationary segment
2. Protected segment (Finite)
Lines in each segment are ordered from the most to the least recently accessed. Data from misses is
added to the cache at the most recently accessed end of the probationary segment. Hits are removed
from wherever they currently reside and added to the most recently accessed end of the protected
segment. The protected segment is finite. So migration of a line from the probationary segment to the
protected segment may force the migration of the LRU line in the protected segment to the most
recently used (MRU) end of the probationary segment, giving this line another chance to be accessed
before being replaced. The size limit on the protected segment is an SLRU parameter that varies according to the I/O workload patterns. Whenever data must be discarded from the cache, lines are obtained from the LRU end of the probationary segment [17].
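A compact sketch of the SLRU behaviour described above is given below (the segment sizes, the array-shifting implementation, and the use of non-negative integers as line identifiers are illustrative simplifications).

```c
/* Segmented LRU sketch: two MRU-ordered arrays (index 0 = MRU end). */
#include <string.h>

#define PROB_SIZE 4   /* probationary segment capacity (assumed) */
#define PROT_SIZE 4   /* protected segment capacity (assumed)    */

static int prob[PROB_SIZE], prob_n = 0;   /* probationary segment */
static int prot[PROT_SIZE], prot_n = 0;   /* protected segment    */

/* Insert 'line' at the MRU end; report the line pushed off the LRU end
 * (or -1 if nothing was evicted). */
static void push_mru(int *seg, int *n, int cap, int line, int *evicted)
{
    *evicted = (*n == cap) ? seg[cap - 1] : -1;
    if (*n < cap) (*n)++;
    memmove(&seg[1], &seg[0], (size_t)(*n - 1) * sizeof(int));
    seg[0] = line;
}

static int remove_line(int *seg, int *n, int line)
{
    for (int i = 0; i < *n; i++)
        if (seg[i] == line) {
            memmove(&seg[i], &seg[i + 1], (size_t)(*n - i - 1) * sizeof(int));
            (*n)--;
            return 1;
        }
    return 0;
}

/* Access one line; returns 1 on hit, 0 on miss. */
int slru_access(int line)
{
    int evicted, discarded;
    if (remove_line(prot, &prot_n, line) || remove_line(prob, &prob_n, line)) {
        /* Hit: move (or promote) the line to the MRU end of the protected
         * segment; a line squeezed out of the protected segment gets another
         * chance at the MRU end of the probationary segment. */
        push_mru(prot, &prot_n, PROT_SIZE, line, &evicted);
        if (evicted >= 0)
            push_mru(prob, &prob_n, PROB_SIZE, evicted, &discarded);
        return 1;
    }
    /* Miss: insert at the MRU end of the probationary segment; whatever
     * falls off its LRU end leaves the cache entirely. */
    push_mru(prob, &prob_n, PROB_SIZE, line, &discarded);
    return 0;
}
```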
Research Work in the recent years
1. Analytical Model for Hierarchical Cache Optimization in IPTV Network [8]
In an IPTV network, Video on Demand and other video services generate large amount of unicast
traffic from Video Hub Office (VHO) to subscribers and, therefore, require additional bandwidth and
equipment resources in the network. To reduce this traffic and overall network cost, a portion of the
video content (the most popular titles) may be stored in caches closer to subscribers, e.g., in a Digital
Subscriber Line Access Multiplexer (DSLAM), a Central Office (CO), or in an Intermediate Office
(IO). The problem is then to minimize overall cost by optimizing the placement and amount of cache memory.
They then present an analytical model of hierarchical cache optimization. This model depends on
several basic parameters: traffic volume, cache hit rate as a function of memory size, topology (i.e.
number of DSLAMs, service routers at CO, service switches at IO locations), and cost parameters.
Some reasonable assumptions about the network cost structure and the hit rate function allow the authors to obtain an analytically optimal solution for the problem. The authors then analyze the factors that affect this solution [8].
2. Peer Assisted Video Streaming With Supply-Demand-Based Cache Optimization. [9]
The authors consider a hybrid P2P video on-demand architecture that utilizes both the server and the
peer resources for efficient transmission of popular videos. They propose a system architecture
wherein each peer dedicates some cache space to store a particular segment of a video file as well as
some of its upload bandwidth to serve the cached segment to other peers. Peers join the system and issue a streaming request to a control server. The control server directs the peers to streaming servers or to other peers who have the desired video segments [9]. The control server also decides which peer should cache which video segment. Their main objective is to determine the proper caching strategies at peers such that the average load on the streaming servers is minimized. To minimize the server load, the authors pose the caching problem as a supply-demand-based utility optimization problem. By exploiting the inherent structure of a typical on-demand streaming application as well as the availability of a global view of the current supply and demand at the control server, they demonstrate how the system performance can be significantly improved over brute-force caching decisions. They mainly consider three caching mechanisms.
In the first mechanism (cache prefetching), a segment is pre-fetched to a given peer for caching purposes upon the peer's arrival to the system, regardless of whether that segment is currently demanded by that peer. In the second mechanism (opportunistic cache update), a peer has the option of replacing the segment that is currently in its cache with the last segment that it finished streaming. In the third mechanism, both mechanisms are combined as a hybrid caching strategy. In particular, they find that a dynamic-programming (DP)-based utility maximization solution using only the cache update method performs significantly better in reducing the server load. Furthermore, their findings suggest that even less sophisticated cache update solutions can perform almost as well as prefetching strategies in interesting regions of operation.
3. Web Cache Optimization in Semantic based Web Search Engine [10]
With the tremendous growth of information available to end users through the Web, search engines come to play an ever more critical role. Nevertheless, because of their general-purpose approach, it is increasingly common that the result sets obtained contain a burden of useless pages. The next-generation Web architecture, represented by the Semantic Web, provides a layered architecture that may allow this limitation to be overcome. In the surveyed system [10], an ontology covering multiple search engines is written so that, for a single query, the final result is obtained from several search engines; after obtaining the query results, clustering is applied and the results are arranged alphabetically. Several semantic search engines have been proposed that increase information retrieval accuracy by exploiting a key feature of Semantic Web resources, namely relations. Web cache optimization can be used in a search engine for fast retrieval of user query results, and the authors use web cache optimization based on an eviction method for their semantic web search engine. They analyze the advantages and disadvantages of several current web cache replacement algorithms, including the lowest relative value, least weighted usage, and least unified-value algorithms. The authors propose a new algorithm, called least grade replacement (LGR), which takes recency, frequency, perfect history, and document size into account for web cache optimization.
They cite the work of Bahn et al., who proposed a web cache replacement algorithm called Least Unified Value (LUV) that uses the complete reference history of documents, in terms of reference frequency and recency.
Disadvantages of the existing system:
• Text-based searching only (e.g., Google, Yahoo, MSN, Wikipedia).
• No semantic relationships are used to give exact results.
• A query is directed at only a single search engine.
• Most existing search engines provide poor support for accessing the web results.
• No analysis of stop words in the user query.
• Results are often not relevant or exact.
• The number of iterations is high.
• A replacement policy is required for replacing a page in the web cache to make room for a new page.
To overcome these disadvantages, the authors propose a proxy server with the above-mentioned features, which inherently helps speed up browsing of web pages by using the least grade page replacement algorithm.
Conclusion
A brief introduction to memory hierarchy design and optimization and the motivation to develop such techniques was given. Following that, a detailed survey of basic and advanced cache optimization techniques was presented. Further, cache replacement algorithms were discussed, as well as applications of memory hierarchies and cache optimization techniques in different domains.
References
[1] Anant Agarwal and Steven D. Pudar, "Column-Associative Caches: A Technique for Reducing the Miss Rate of Direct-Mapped Caches," Proceedings of the Annual International Symposium on Computer Architecture, 1993.
[2] B. Calder, D. Grunwald, and J. Emer, "Predictive Sequential Associative Cache," Proceedings of the Second International Symposium on High-Performance Computer Architecture (HPCA), pp. 244-253, Feb. 1996. doi:10.1109/HPCA.1996.501190
[3] N. P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," Proceedings of the 17th Annual International Symposium on Computer Architecture (ISCA), pp. 364-373, May 1990. doi:10.1109/ISCA.1990.134547
[4] Steven Przybylski, Mark Horowitz, and John Hennessy, "Characteristics of Performance-Optimal Multi-Level Cache Hierarchies," ISCA '89: Proceedings of the 16th Annual International Symposium on Computer Architecture, 1989. ISBN 0-89791-319-1
[5] Monica S. Lam, Edward E. Rothberg, and Michael E. Wolf, "The Cache Performance and Optimizations of Blocked Algorithms," ASPLOS-IV: Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems.
[6] Rajeev Balasubramonian, David Albonesi, Alper Buyuktosunoglu, and Sandhya Dwarkadas, "Dynamic Memory Hierarchy Performance Optimization," 2000.
[7] Koji Inoue et al., "Way-Predicting Set-Associative Cache for High Performance and Low Energy Consumption," ISLPED '99: Proceedings of the 1999 International Symposium on Low Power Electronics and Design.
[8] L. B. Sofman and B. Krogfoss, "Analytical Model for Hierarchical Cache Optimization in IPTV Network," IEEE Transactions on Broadcasting, vol. 55, no. 1, pp. 62-70, March 2009.
[9] U. C. Kozat, O. Harmanci, S. Kanumuri, M. U. Demircin, and M. R. Civanlar, "Peer Assisted Video Streaming With Supply-Demand-Based Cache Optimization," IEEE Transactions on Multimedia, vol. 11, no. 3, pp. 494-508, April 2009. doi:10.1109/TMM.2009.2012918
[10] S. N. Sivanandam, M. Rajaram, and S. Latha Shanmuga Vadivu, "Web Cache Optimization in Semantic based Web Search Engine," International Journal of Computer Applications (0975-8887), vol. 10, no. 9, November 2010.
[11] Fredrik Dahlgren, Michel Dubois, and Per Stenstrom, "Fixed and Adaptive Sequential Prefetching in Shared Memory Multiprocessors," Proceedings of the 1993 International Conference on Parallel Processing (ICPP), vol. 1, pp. 56-63, Aug. 1993.
[12] Weifeng Zhang, S. Checkoway, B. Calder, and D. M. Tullsen, "Dynamic Code Value Specialization Using the Trace Cache Fill Unit," Proceedings of the International Conference on Computer Design (ICCD 2006), pp. 10-16, Oct. 2007.
[13] D. Patterson and J. Hennessy, Computer Architecture: A Quantitative Approach.
[14] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach (3rd ed.), Memory Hierarchy Design.
[15] Hong-Tai Chou and David J. DeWitt, "An Evaluation of Buffer Management Strategies for Relational Database Systems," VLDB, 1985.
[16] Shaul Dar, Michael J. Franklin, Björn Þór Jónsson, Divesh Srivastava, and Michael Tan, "Semantic Data Caching and Replacement," VLDB, 1996.
[17] Ramakrishna Karedla, J. Spencer Love, and Bradley G. Wherry, "Caching Strategies to Improve Disk System Performance," IEEE Computer, 1994.
[18] Koji Inoue, Tohru Ishihara, and Kazuaki Murakami, "Way-Predicting Set-Associative Cache for High Performance and Low Energy Consumption."
[19] Jude A. Rivers, Gary S. Tyson, and Edward S. Davidson, "On High-Bandwidth Data Cache Design for Multi-Issue Processors."
[20] Kunle Olukotun, Trevor Mudge, and Richard Brown, "Performance Optimization of Pipelined Primary Caches."
[21] Tien-Fu Chen and Jean-Loup Baer, "Reducing Memory Latency via Non-Blocking and Prefetching Caches."
[22] Todd C. Mowry, Monica S. Lam, and Anoop Gupta, "Design and Evaluation of a Compiler Algorithm for Prefetching."