Analytical Cache Models with Applications to
Cache Partitioning in Time-Shared Systems
by
Gookwon Edward Suh
B.S., Seoul National University (1999)
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
February 2001
© Massachusetts Institute of Technology. All rights reserved.
Author: Department of Electrical Engineering and Computer Science, February 5, 2001

Certified by: Srinivas Devadas, Professor of Computer Science, Thesis Supervisor

Certified by: Larry Rudolph, Principal Research Scientist, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Committee on Graduate Students, Department of Electrical Engineering and Computer Science
Analytical Cache Models with Applications to Cache
Partitioning in Time-Shared Systems
by
Gookwon Edward Suh
Submitted to the Department of Electrical Engineering and Computer Science
on February 5, 2001, in partial fulfillment of the
requirements for the degree of
Master of Science
Abstract
This thesis proposes an analytical cache model for time-shared systems, which estimates the overall cache miss-rate from the isolated miss-rate curve of each process
when multiple processes share a cache. Unlike previous models, our model works for
any cache size and any time quantum. Trace-driven simulations demonstrate that
the estimated miss-rate is very accurate for fully-associative caches. Since the model
provides a fast and accurate way to estimate the effect of context switching, it is useful for both understanding the effect of context switching on caches and optimizing
cache performance for time-shared systems.
The model is applied to the cache partitioning problem. First, the problems of the LRU replacement policy are studied based on the model. The study shows that proper cache partitioning can enable substantial performance improvements for short or mid-range time quanta. Second, the model-based partitioning mechanism is implemented and evaluated through simulations. In our example, the model-based partitioning improves the cache miss-rate by up to 20% over the normal LRU replacement policy.
Thesis Supervisor: Srinivas Devadas
Title: Professor of Computer Science
Thesis Supervisor: Larry Rudolph
Title: Principal Research Scientist
Acknowledgments
It has been my pleasure to work under the expert mentorship of Doctor Larry Rudolph
and Professor Srinivas Devadas. Larry worked with me on a daily basis, discussing and
advising every aspect of my research work.
His encouragement and advice were
essential to finishing this thesis. Srini challenged me to organize my ideas and focus
on real problems. His guidance was a tremendous help in finishing this thesis. I would
like to thank both my advisors for their support and advice throughout my master's
work.
The work described in this thesis would not have been possible without the help
of my fellow graduate students in the Computational Structures Group. Dan Rosenband
has been a great officemate, helping me to get used to the life at LCS, both professionally and personally. Derek Chiou was always generous with his deep technical
knowledge, and helped me to choose the thesis topic. My research started from his
Ph.D. work. Enoch Peserico provided very helpful comments on my cache models
with his knowledge in algorithms and theory. Peter Portante taught me about real
operating systems.
Beyond my colleagues in CSG, I was fortunate to have great friends such as Todd
Mills and Ali Tariq on the same floor. We shared junky lunches, worked late nights
and hung out together. Todd also had a tough time reading through all my thesis
drafts and fixing my broken English writing. He provided very detailed comments
and suggestions. I would like to thank them for their friendship and support.
My family always supported me during these years. Their prayer, encouragement and love for me is undoubtedly the greatest source of my ambition, inspiration,
dedication and motivation. They have given me far more than words can express.
Most of all, I would like to thank God for His constant guidance and grace throughout my life. His words always encourage me:
See, I am doing a new thing! Now it springs up; do you not perceive
it? I am making a way in the desert and streams in the wasteland (Isaiah
43:19).
Contents

1 Introduction
  1.1 Motivation
  1.2 Previous Work
  1.3 The Contribution of This Research
  1.4 Organization of This Thesis

2 Analytical Cache Model
  2.1 Overview
  2.2 Assumptions
  2.3 Fully-Associative Cache Model
    2.3.1 Transient Cache Behavior
    2.3.2 The Amount of Data in a Cache Starting with an Empty Cache
    2.3.3 The Amount of Data in a Cache for the General Case
    2.3.4 Overall Miss-rate
  2.4 Experiments
    2.4.1 Cache Flushing Cases
    2.4.2 General Cases

3 Cache Partitioning Based on the Model
  3.1 Specifying a Partition
  3.2 Optimal Partitioning Based on the Model
    3.2.1 Estimation of Optimal Partitioning
  3.3 Tradeoffs in Cache Partitioning

4 Cache Partitioning Using LRU Replacement
  4.1 Cache Partitioning by the LRU Policy
  4.2 Comparison between Normal LRU and LRU with Optimal Partitioning
    4.2.1 The Problem of Allocation Size
    4.2.2 The Problem of Cache Layout
    4.2.3 Summary
  4.3 Problems of the LRU Replacement Policy for Each Level of Memory Hierarchy
  4.4 Effect of Low Associativity
  4.5 More Benefits of Cache Partitioning

5 Partitioning Experiments
  5.1 Implementation of Model-Based Partitioning
    5.1.1 Recording Memory Reference Patterns
    5.1.2 Cache Partitioning
  5.2 Cache Partitioning Based on Look-Ahead Information
  5.3 Experimental Results

6 Conclusion

A The Exact Calculation of E[x(t)]
Chapter 1
Introduction
The cache is an important component of modern memory systems, and its performance is often a crucial factor in determining the overall performance of the system.
Processor cycle times have been reduced dramatically, but cycle times for memory
access remain high. As a result, the penalty for accessing main memory has been
increasing, and the memory access latency has become the bottleneck of modern processor performance.
The cache is a small and fast memory that keeps frequently
accessed data near the processor, and can therefore reduce the number of main memory accesses [13].
Caches are managed by bringing in a new data block on demand, and evicting a
block based on a replacement policy. The most common replacement policy is the
least recently used (LRU) policy, which evicts the block that has not been accessed
for the longest time. The LRU replacement policy is very simple and at the same
time tends to be efficient for most practical workloads, especially if only one process
is using the cache.
When multiple running processes share a cache, they experience additional cache
misses due to conflicts among processes. In the past, the effect of context switches
was negligible because both caches and workloads were small. For small caches and
workloads, reloading useful data into the cache only takes a small amount of time.
As a result, the number of misses caused by context switches was relatively small
compared to the total number of misses over a usual time quantum.
However, things have changed, and context switches can severely degrade cache performance in modern microprocessors. First, caches are much larger than before. Level 1 (L1) caches range up to a few MB [10], and L2 caches are up to several MB
[6, 21]. Second, workloads have also become larger. Multimedia processes such as
video or audio clips often consume hundreds of MB. Even many SPEC CPU2000
benchmarks now have a memory footprint larger than 100 MB [14]. Finally, context
switches tend to cause more cache pollution than before due to the increased number
of processes and the presence of streaming processes.
As a result, it can be crucial for modern microprocessors to minimize inter-process conflicts by proper cache
partitioning [29, 17] or scheduling [25, 30].
A method of evaluating cache performance is essential both to predict miss-rate
and to optimize cache performance.
Traditionally, simulations have been used virtually exclusively for cache performance evaluation [32, 24, 20]. Although simulations
provide accurate results, simulation time is often too long. Moreover, simulations
do not provide an intuitive understanding of the problem that would help improve cache performance. To predict the cache miss-rate faster, hardware monitoring can also be used [33].
However, hardware monitoring is limited to the particular cache configuration that
is implemented. As a result, both simulations and hardware monitoring can only be
used to evaluate the effect of context switches [22, 18].
To provide both performance prediction and the intuition to improve it, analytical
cache models have been developed. One such approach is to analyze a particular form
of source code, such as nested loops [12]. Although this method is both accurate and intuitive, it is limited to particular types of algorithms. The other way to analytically model caches is to extract parameters of processes and combine them with other parameters defining the cache [1, 28]. These methods can predict the general trend of cache behavior. For the problem of context switching, however, they only focus on very long time quanta. Moreover, their input parameters are difficult to extract
and almost impossible to obtain on-line.
This thesis presents an analytical cache model for context switches, which can estimate the overall miss-rate accurately for any cache size and any time quantum. The characteristics of each process are given by the miss-rate as a function of cache size
when the process is isolated, which can be easily obtained either on-line or off-line.
The time quanta for each process and cache size are also given as inputs to the model.
With this information, the model estimates the overall miss-rate of a given cache size
running an arbitrary combination of processes. The model provides good estimates
for any cache size and any time quantum, and is easily applied to real problems
since the input miss-rate curves are both intuitive and easy to obtain in practice.
Therefore, we believe that the model is useful for any study related to the effect of
context switches on cache memory.
The problem of cache partitioning among time-shared processes serves as an example of the model's applications. Optimal cache partitioning is studied based on
the model, which shows the way to determine the best cache partition and the problems with the LRU replacement policy. From the study, a model-based partitioning
mechanism is implemented. The mechanism obtains the miss-rate characteristic of
each process by off-line profiling, and performs on-line cache allocation according to
the combination of processes that are executing at any given time.
1.1 Motivation
Figure 1-1 (a) shows the trace-driven simulation results when six processes are sharing
the same cache. In the figure, the x-axis represents the number of memory references
per time quantum, and the y-axis represents the cache miss-rate expressed as a percentage. The number of memory references per time quantum is assumed to be the
same for all processes. The L1 caches are 256-KB 8-way associative, separate for the instruction stream and the data stream. The L2 cache is a 2-MB 16-way associative
unified cache. Two copies of three benchmarks are simulated, which are the image
understanding program from the data intensive systems benchmark suite [23], vpr
and twolf from the SPEC CPU2000 benchmark suite [14].
In the figure, the miss-rate of the L1 instruction cache is very close to 0% and the miss-rate of the L1 data cache is around 5%. However, the L2 cache miss-rate shows
Figure 1-1: (a) Cache miss-rate for different lengths of time quanta, 256-KB 8-way L1, 2-MB 16-way L2 (L1 Inst is almost 0%). (b) Comparison of cache miss-rates with and without partitioning.
that the impact of context switches can be very severe even for realistic time quanta.
For the time quantum of a million memory references, the L2 cache miss-rate is about
80%, but for longer time quanta the miss-rate is almost 0%. This means that the benchmarks would get miss-rates close to 0% if they used the cache alone, and therefore context switches cause most of the L2 cache misses. Even for the longer time quanta, the
inter-process conflict misses are a significant fraction of the total L2 misses. Therefore,
this example shows that context switches can severely degrade the cache performance
even for systems with a very high clock speed.
Figure 1-1 (b) shows the comparison of L2 miss-rates with and without cache
partitioning for the same cache configuration and the same benchmark set as before.
The normal LRU replacement policy is used for the case without cache partitioning.
For the cache partitioning case, the next-reference time of each block is recorded from the trace, and a new block replaces another process's block only if the new block is accessed again before that block. The result shows that for a certain range
of time quanta, such as one million memory references, partitioning can significantly
reduce the miss-rate.
1.2 Previous Work
Several early investigations of the effects of context switches use analytical models.
Thiebaut and Stone [28] modeled the amount of additional misses caused by context
switches for set-associative caches. Agarwal, Horowitz and Hennessy [1] also included
the effect of conflicts between processes in their analytical cache model and showed
that inter-process conflicts are noticeable for a mid-range of cache sizes that are large
enough to have a considerable number of conflicts but not large enough to hold all
the working sets. However, these models work only for long enough time quanta, and
require information that is hard to collect on-line.
Mogul and Borg [22] studied the effect of context switches through trace-driven
simulations. Using a timesharing system simulator, their research shows that system
calls, page faults, and the scheduler are the main sources of context switches. They
also evaluate the effect of context switches on cycles per instruction (CPI) as well
as the cache miss-rate. Depending on cache parameters, the cost of a context switch
appears to be in the thousands of cycles, or tens to hundreds of microseconds in their
simulations.
Stone, Turek and Wolf [26] investigated the optimal allocation of cache memory
among two or more competing processes that minimizes the overall miss-rate of a
cache. Their study focuses on the partitioning of instruction and data streams, which
can be thought of as multitasking with a very short time quantum. Their model for this
case shows that the optimal allocation occurs at a point where the miss-rate derivatives of the competing processes are equal. The LRU replacement policy appears to
produce cache allocations very close to optimal for their examples. They also describe
a new replacement policy for longer time quanta that only increases cache allocation
based on time remaining in the current time quantum and the marginal reduction
in miss-rate due to an increase in cache allocation. However, their policy simply assumes that the probability for an evicted block to be accessed in the next time quantum is a constant; this assumption is neither validated, nor is it described how the probability is obtained.
Thiebaut, Stone and Wolf applied their partitioning work [26] to improve disk
cache hit-ratios [29].
The model for tightly interleaved streams is extended to be
applicable for more than two processes. They also describe the problems in applying
the model in practice, such as approximating the miss-rate derivative, non-monotonic
miss-rate derivatives, and updating the partition. Trace-driven simulations for 32-MB
disk caches show that the partitioning improves the relative hit-ratios in the range of
1% to 2% over the LRU policy.
1.3 The Contribution of This Research
Our analytical model differs from previous efforts in that previous work has tended to
focus on some specific cases of context switches. The new model works for any time
quanta, whereas the previous models focus only on long time quanta. Therefore, the
model can be used to study multi-tasking related problems on any level of caches.
Since a lower-level cache only sees memory references that miss in the higher level, the effective time quantum decreases as we move down the memory hierarchy, away from the processor. Moreover, the model is based on a miss-rate curve that is much easier to obtain compared
to footprints or the number of unique cache blocks that previous models require,
which makes the model very helpful for developing solutions for cache optimization
problems.
Our partitioning work based on the analytical cache model also has advantages
over previous work. First, our partitioning works for any level of the memory hierarchy and for any range of time quanta, unlike, for instance, Thiebaut, Stone and Wolf's
partitioning algorithm which can only be applied to disk caches. Further, the model
provides a more complete understanding of optimal partitioning for different cache
sizes and time quanta.
1.4 Organization of This Thesis
The rest of this thesis is organized as follows. In Chapter 2, we derive an analytical
cache model for time-shared systems. Chapter 3 discusses cache partitioning based on
the model. This optimal partitioning is compared to the standard LRU replacement
policy and the problems of the LRU replacement policy are discussed in Chapter 4.
In Chapter 5, we implement a model-based cache partitioning algorithm and evaluate
the implementation by simulations. Finally, Chapter 6 concludes the thesis.
Chapter 2
Analytical Cache Model
2.1 Overview
The analytical cache model estimates the cache miss-rate for a multi-process situation
when the cache size, the length of each time quantum, and a miss-rate curve for each
process as a function of the cache size are known. The cache size is given by the
number of cache blocks, and the time quantum is given by the number of memory
references. Both are assumed to be constants (See Figure 2-1 (a)).
2.2 Assumptions
The derivation of the model is based on several assumptions. First, the memory
reference pattern of each process is assumed to be represented by a miss-rate curve
that is a function of the cache size, and this miss-rate curve does not change over
time. Although real applications do have dynamically changing memory reference patterns, our results show that using the average miss-rate works very well. For
abrupt changes in the reference pattern, multiple miss-rate curves can be used to
estimate an overall miss-rate.
Second, we assume that there is no shared address space among processes. Otherwise, it is very difficult to estimate the amount of cache pollution since a process can
use data brought into the cache by other processes.

Figure 2-1: (a) The overview of an analytical cache model. (b) Round-robin schedule.

Although processes may share
a small amount of shared library code, this assumption is true for common cases
where each process has its own virtual address space and the shared memory space
is negligible compared to the entire memory space that is used by a process.
Finally, we assume that processes are scheduled in a round-robin fashion with a
fixed time quantum for each process as shown in Figure 2-1 (b). Also, we assume
the least recently used (LRU) replacement policy is used. Note that although the
round-robin scheduling and the LRU policy are assumed for the model, models for
other scheduling methods and replacement policies can be easily derived in a similar
manner.
2.3 Fully-Associative Cache Model
This section derives a model for fully-associative caches. Although most real caches
are set-associative caches, a model for fully-associative caches is very useful for understanding the effect of context switches because the model is simple. Moreover, since
a set-associative cache is a group of fully-associative caches, a set-associative cache
model can be derived based on the fully-associative model [27].
2.3.1 Transient Cache Behavior
Once a cache gets filled with valid data, a process can be considered to be in a steady
state and, by our assumption, the miss-rate for the process does not change. The initial burst of cache misses before the steady state is reached will be referred to as the transient miss-rate behavior.
For special situations where a cache is dedicated to a single process for its entire
execution, the transient misses are not important because the number of misses in
the transient state is negligible compared to the number of misses over the entire
execution, for any reasonably long execution.
For multi-process cases, a process experiences transient misses whenever it restarts
from a context switch. Therefore, the effect of transient misses could be substantial, causing performance degradation.
from the given miss-rate curves, we can estimate the effect of context switching once
we know the transient behavior.
We make use of the following notation:

  t : the number of memory references from the beginning of a time quantum.
  x(t) : the number of cache blocks belonging to a process after t memory references.
  m(x) : the miss-rate for a process with cache size x.
Note that our time t starts at the beginning of a time quantum, not the beginning
of execution. Since all time quanta for a process are identical by our assumption, we
consider only one time quantum for each process.
Figure 2-2 (a) shows a snapshot of a cache at time t_0, where time t is the number of memory references from the beginning of the current time quantum. Although the cache size is C, only part of the cache is filled with the current process' data at that time. Therefore, the effective cache size at time t_0 can be thought of as the amount of the current process' data x(t_0).
Figure 2-2: (a) The probability of a miss at time t_0. (b) The number of misses obtained by integrating the P_miss(t) curve.

The probability of a cache miss in the next memory
reference is given by
\[
P_{miss}(t_0) = m(x(t_0))
\tag{2.1}
\]
where m(x) is the steady-state miss-rate for the current process when the cache size is x.
Once we have P_miss(t_0), it is easy to estimate the miss-rate over the time quantum. Since one memory reference can cause at most one miss, the number of misses for the process over a time quantum can be expressed as a simple integral, as shown in Figure 2-2 (b):
\[
\mathit{misses} = \int_0^T P_{miss}(t)\,dt = \int_0^T m(x(t))\,dt
\tag{2.2}
\]
where T is the number of memory references in a time quantum. The miss-rate of the process is simply the number of misses divided by the number of memory references:
\[
\textit{miss-rate} = \frac{1}{T}\int_0^T P_{miss}(t)\,dt = \frac{1}{T}\int_0^T m(x(t))\,dt.
\tag{2.3}
\]
2.3.2 The Amount of Data in a Cache Starting with an Empty Cache
Now we need to estimate x(t), the amount of data in the cache as a function of time. We shall use the case where a process starts executing with an empty cache to estimate cache performance for cases when the cache gets flushed at every context switch. Virtual address caches without process IDs are good examples of such a case. We will show how to estimate x(t) for the general case as well.
Consider x^∞(t), the amount of the current process' data at time t for an infinite-size cache. We assume that the process starts with an empty cache at time 0. There are two possibilities for x^∞(t+1) at time t+1. If the (t+1)-th memory reference results in a cache miss, a new cache block is brought into the cache; as a result, the amount of the process' cached data increases by one. Otherwise, the amount of data remains the same. Therefore, the amount of the process' data in the cache at time t+1 is given by
\[
x^{\infty}(t+1) =
\begin{cases}
x^{\infty}(t) + 1 & \text{when the } (t+1)\text{-th reference misses} \\
x^{\infty}(t) & \text{otherwise.}
\end{cases}
\tag{2.4}
\]
Since the probability for the (t+1)-th memory reference to miss is m(x^∞(t)) from Equation 2.1, the expected value of x^∞(t+1) can be written as
\[
\begin{aligned}
E[x^{\infty}(t+1)] &= E[x^{\infty}(t)\cdot(1 - m(x^{\infty}(t))) + (x^{\infty}(t)+1)\cdot m(x^{\infty}(t))] \\
&= E[x^{\infty}(t) + m(x^{\infty}(t))] \\
&= E[x^{\infty}(t)] + E[m(x^{\infty}(t))].
\end{aligned}
\tag{2.5}
\]
Assuming that m(x) is convex (if a replacement policy is smart enough, the marginal gain of having one more cache block monotonically decreases as we increase the cache size), we can use Jensen's inequality [8] and rewrite the equation as a function of E[x^∞(t)]:
\[
E[x^{\infty}(t+1)] \geq E[x^{\infty}(t)] + m(E[x^{\infty}(t)]).
\tag{2.6}
\]
Usually, a miss-rate changes slowly. As a result, for a short interval such as from x
to x + 1, m(x) can be approximated as a straight line. Since the equality in Jensen's
inequality holds if the function is a straight line, we can approximate the amount of
data at time t + 1 as
\[
E[x^{\infty}(t+1)] \approx E[x^{\infty}(t)] + m(E[x^{\infty}(t)]).
\tag{2.7}
\]
We can calculate the expectation of x^∞(t) more accurately by calculating the
probability for every possible value at time t (See Appendix A). However, calculating
a set of probabilities is computationally expensive. Also, our experiments show that
the approximation closely matches simulation results.
If we approximate the amount of data x^∞(t) to be the expected value E[x^∞(t)], x^∞(t) can be expressed with a difference equation:
\[
x^{\infty}(t+1) - x^{\infty}(t) = m(x^{\infty}(t)),
\tag{2.8}
\]
which can be easily calculated in a recursive manner.

For the continuous variable t, we can rewrite the discrete form of Equation 2.8 in a continuous form:
\[
\frac{dx^{\infty}}{dt} = m(x^{\infty}).
\tag{2.9}
\]
Solving the differential equation by separating variables, we obtain
\[
t = \int_{x^{\infty}(0)}^{x^{\infty}(t)} \frac{dx'}{m(x')}.
\tag{2.10}
\]
We define a function M(x) as an integral of 1/m(x), which means that dM(x)/dx = 1/m(x); then x^∞(t) can be written as a function of t:
\[
x^{\infty}(t) = M^{-1}\left(t + M(x^{\infty}(0))\right)
\tag{2.11}
\]
where M^{-1}(x) represents the inverse function of M(x).
Finally, for a finite-size cache, the amount of data in the cache is limited by the cache size C. Therefore, x^0(t), the amount of a process' data starting from an empty cache, is written as
\[
x^{0}(t) = \min[x^{\infty}(t),\, C] = \min[M^{-1}(t + M(0)),\, C].
\tag{2.12}
\]
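The recursion in Equation 2.8, capped at the cache size as in Equation 2.12, lends itself to a direct numerical evaluation of Equations 2.2 and 2.3. The following Python sketch is our own illustration, not code from the thesis; the function name, the optional x0 argument, and the exponential miss-rate curve in the example are assumptions made here for concreteness.

import math

def expected_data_and_misses(m, T, C, x0=0.0):
    """Expected (data in cache, misses) for one quantum of T references,
    starting with x0 blocks of the process' own data in a cache of C blocks."""
    x = min(float(x0), float(C))
    misses = 0.0
    for _ in range(int(T)):
        p_miss = m(x)              # Equation 2.1: miss probability at this point
        misses += p_miss           # Equation 2.2: accumulate expected misses
        x = min(x + p_miss, C)     # Equations 2.8 and 2.12: expected growth, capped at C
    return x, misses

# Example with an exponential miss-rate curve (an assumption for illustration):
m = lambda size: math.exp(-0.01 * size)
_, misses = expected_data_and_misses(m, T=10000, C=512)
print("estimated miss-rate over the quantum:", misses / 10000)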
2.3.3 The Amount of Data in a Cache for the General Case
In Section 2.3.1, it is shown that the miss-rate of a process can be estimated if the
amount of the process' data as a function of time x(t) is given.
In the previous
section, x(t) is estimated when a process starts with an empty cache. In this section,
the amount of a process' data at time t is estimated for the general case when a
cache is not flushed at a context switch. Since we now deal with multiple processes,
a subscript i is used to represent Process i. For example, xi(t) represents the amount
of Process i's data at time t.
The estimation of xi(t) is based on round-robin scheduling (See Figure 2-1 (b))
and the LRU replacement policy. Process i runs for a fixed length time quantum Ti.
During each time quantum, the running process is allowed to use the entire cache. For
simplicity, processes are assumed to be of infinite length so that there is no change
in the scheduling. Also, the initial startup transient from an empty cache is ignored
since it is negligible compared to the steady state.
To estimate the amount of a process' data at a given time, imagine the snapshot
of an infinite size cache after executing Process i for time t as shown in Figure 2-3.
We assume that time is 0 at the beginning of the process' time quantum. In the
figure, the blocks on the left side show recently used data, and blocks on the right
side show old data. P_{j,k} represents the data of Process j, and the subscript k specifies
the most recent time quantum when the data are referenced. From the figure, we can obtain x_i(t) once we know the sizes of all P_{j,k} blocks.

Figure 2-3: The snapshot of a cache after running Process i for time t.
The size of each block can be estimated using the x_i^0(t) curve from Equation 2.12, which is the amount of Process i's data when the process starts with an empty cache. Since x_i^0(t) can also be thought of as the expected amount of data that is referenced from time 0 to time t, x_i^0(T_i) is the expected amount of data referenced over one time quantum. Similarly, we can estimate the amount of data referenced over the k most recent time quanta to be x_i^0(k · T_i). As a result, the expected size of block P_{j,k} can be written as
\[
P_{j,k} =
\begin{cases}
x_j^{0}(t + (k-1)\cdot T_j) - x_j^{0}(t + (k-2)\cdot T_j) & \text{if } j \text{ is executing} \\
x_j^{0}(k\cdot T_j) - x_j^{0}((k-1)\cdot T_j) & \text{otherwise}
\end{cases}
\tag{2.13}
\]
where we assume that x_j^0(t) = 0 if t < 0.
x_i(t) is the sum of the P_{i,k} blocks that are inside the cache of size C in Figure 2-3. If we define l_i(t) as the maximum integer value that satisfies the following inequality, then l_i(t) + 1 represents how many P_{i,k} blocks are in the cache:
\[
x_i^{0}(t + (l_i(t) - 1)\cdot T_i) + \sum_{j=1,\, j\neq i}^{N} x_j^{0}(l_i(t)\cdot T_j) < C
\tag{2.14}
\]
where N is the number of processes. From l_i(t) and Figure 2-3, the expected value of x_i(t) is
\[
x_i(t) =
\begin{cases}
x_i^{0}(t + l_i(t)\cdot T_i) & \text{if } x_i^{0}(t + l_i(t)\cdot T_i) + \sum_{j=1,\, j\neq i}^{N} x_j^{0}(l_i(t)\cdot T_j) < C \\[4pt]
C - \sum_{j=1,\, j\neq i}^{N} x_j^{0}(l_i(t)\cdot T_j) & \text{otherwise.}
\end{cases}
\tag{2.15}
\]

Figure 2-4: The relation between x_i^0(t) and x_i(t).
Figure 2-4 illustrates the relation between x_i^0(t) and x_i(t). In the figure, l_i(t) is assumed to be 2. Unlike the cache flushing case, a process can start with some of its data left in the cache. The amount of initial data x_i(0) is given by Equation 2.15. If the least recently used (LRU) data in the cache does not belong to Process i, x_i(t) increases the same way as x_i^0(t). However, if the LRU data belongs to Process i, x_i(t)
does not increase on cache misses since Process i's block gets replaced.
Define t_start(j, k) as the time when the k-th MRU block of Process j (P_{j,k}) becomes the LRU part of the cache, and t_end(j, k) as the time when P_{j,k} gets completely replaced from the cache (see Figure 2-3). t_start(j, k) and t_end(j, k) specify the flat segments in Figure 2-4 and can be estimated from the following equations, based on Equation 2.13:
\[
x_j^{\infty}(t_{start}(j,k) + (k-1)\cdot T_j) + \sum_{p=1,\, p\neq j}^{N} x_p^{0}((k-1)\cdot T_p) = C.
\tag{2.16}
\]
\[
x_j^{\infty}(t_{end}(j,k) + (k-2)\cdot T_j) + \sum_{p=1,\, p\neq j}^{N} x_p^{0}((k-1)\cdot T_p) = C.
\tag{2.17}
\]
t_start(j, l_j(t)+1) would be zero if the above equation results in a negative value for given j and l_j(t), which means that block P_{j, l_j(t)+1} is already in the LRU part of the cache at the beginning of a time quantum. Also, t_end(j, k) is only defined for k from 2 to l_j(t)+1.

2.3.4 Overall Miss-rate
2.3.4
Overall Miss-rate
Section 2.3.1 explains how to estimate the miss-rate of each process when the amount
of the process' data as a function of time xi(t) is given. The previous sections showed
how to estimate xi(t). This section presents the overall miss-rate calculation.
When a cache uses virtual address tags and gets flushed for every context switch,
each process starts a time quantum with an empty cache. In this case, the miss-rate of
a process can be estimated from the results of Section 2.3.1 and 2.3.2. From Equation
2.3 and 2.12, the miss-rate for Process i can be written as
\[
\textit{miss-rate}_i = \frac{1}{T_i}\int_0^{T_i} m_i\!\left(\min[M_i^{-1}(t + M_i(0)),\, C]\right)dt.
\tag{2.18}
\]
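In terms of the hypothetical expected_data_and_misses sketch from Section 2.3.2, Equation 2.18 is simply the empty-start case: call it with x0 = 0 and divide the returned miss count by T_i to obtain the estimated miss-rate.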
If a cache uses physical address tags or has process ID's with virtual address tags,
it does not have to be flushed at a context switch. In this case, the amount of data
x_i(t) is estimated in Section 2.3.3. The miss-rate for Process i can be written as
\[
\textit{miss-rate}_i = \frac{1}{T_i}\int_0^{T_i} m_i(x_i(t))\,dt
\tag{2.19}
\]
where x_i(t) is given by Equation 2.15.
For actual calculation of the miss-rate, t_start(j, k) and t_end(j, k) from Equations 2.16 and 2.17 can be used. Since t_start(j, k) and t_end(j, k) specify the flat segments in Figure 2-4, the miss-rate of Process i can be rewritten as
\[
\textit{miss-rate}_i = \frac{1}{T_i}\left\{\int_0^{T_i'} m_i\!\left(\min[M_i^{-1}(t + M_i(x_i(0))),\, C]\right)dt
+ \sum_{k=d_i}^{l_i(t)+1} m_i\!\left(x_i^{0}(t_{start}(i,k) + (k-1)\cdot T_i)\right)\cdot\left(\min[t_{end}(i,k),\, T_i] - t_{start}(i,k)\right)\right\}
\tag{2.20}
\]
where d_i is the minimum integer value that satisfies t_start(i, d_i) < T_i, and T_i' is the time during which Process i's data actually grow:
\[
T_i' = T_i - \sum_{k=d_i}^{l_i(t)+1}\left(\min[t_{end}(i,k),\, T_i] - t_{start}(i,k)\right).
\tag{2.21}
\]
As shown above, calculating a miss-rate can be complicated if we do not flush the cache at a context switch. If we assume that the executing process' data left in the cache are all in the most recently used part of the cache, we can use the equation for estimating the amount of data starting with an empty cache. Therefore, the calculation can be simplified as follows:
\[
\textit{miss-rate}_i = \frac{1}{T_i}\int_0^{T_i} m_i\!\left(\min[M_i^{-1}(t + M_i(x_i(0))),\, C]\right)dt
\tag{2.22}
\]
where x_i(0) is estimated from Equation 2.15. The effect of this approximation is evaluated in the experiment section (cf. Section 2.4).
Once we calculate the miss-rate of each process, the overall miss-rate is straightforwardly calculated from those miss-rates:
\[
\text{Overall miss-rate} = \frac{\sum_{i=1}^{N} \textit{miss-rate}_i \cdot T_i}{\sum_{i=1}^{N} T_i}.
\tag{2.23}
\]
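As a concrete illustration of how Equations 2.22 and 2.23 combine, the sketch below estimates the overall miss-rate from the per-process curves. It is an assumption-laden sketch rather than the thesis' implementation: the initial amounts x_i(0) are taken as given (in the full model they come from Equation 2.15), and the inner loop repeats the growth recursion of Section 2.3.2.

def overall_miss_rate(miss_curves, quanta, initial_data, C):
    """miss_curves[i]: m_i(x); quanta[i]: T_i; initial_data[i]: x_i(0); C: cache size."""
    total_misses, total_refs = 0.0, 0.0
    for m, T, x0 in zip(miss_curves, quanta, initial_data):
        x = min(float(x0), float(C))      # Equation 2.22: leftover data treated as MRU
        for _ in range(int(T)):
            p = m(x)                      # miss probability with x blocks cached
            total_misses += p
            x = min(x + p, C)
        total_refs += T
    return total_misses / total_refs      # Equation 2.23: T_i-weighted combination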
2.4 Experiments
In this section, we validate the fully-associative cache model by comparing estimated
miss-rates with simulation results. A few different combinations of benchmarks are
modeled and simulated for various time quanta. First, we simulate cases when a
cache gets flushed at every context switch, and compare the results with the model's
estimation. Cases without cache flushing are also tested. For the cases without cache
flushing, both the complete model (Equation 2.20) and the approximation (Equation 2.22) are used to estimate the overall miss-rate. Based on the simulation results,
the error of the approximation is discussed.
2.4.1 Cache Flushing Cases
The results of the cache model and simulations are shown in Figure 2-5 in cases when
a process starts its time quantum with an empty cache. Four benchmarks from SPEC
CPU2000 [14], which are vpr, vortex, gcc and bzip2, are tested. The cache is a 32-KB
fully-associative cache with a block size of 32 Bytes. The miss-rate of a process is
plotted as a function of the length of a time quantum. In the figure, the simulation
results at several time quanta are also shown. The figure shows a good agreement
between the model's estimation and the simulation result.
As inputs to the cache model, the average miss-rate of each process has been
obtained from simulations. Each process has been simulated for 25 million memory
references, and the miss-rates of the process for various cache sizes have been recorded.
The simulation results were also obtained by simulating benchmarks for 25 million
memory references with flushing a cache every T memory references. As the result
shows, the average miss-rate works very well.
Figure 2-5: The result of the cache model for cache flushing cases. (a) vpr. (b) vortex. (c) gcc. (d) bzip2.
Figure 2-6: The result of the cache model when two processes (vpr, vortex) are sharing a cache (32-KB fully-associative). (a) The overall miss-rate. (b) The initial amount of data x_i(0).
2.4.2 General Cases
Figure 2-6 shows the result of the cache model when two processes are sharing a
cache. The two benchmarks are vpr and vortex from SPEC CPU2000, and the cache
is a 32-KB fully-associative cache with 32-B blocks. The overall miss-rates are shown
in Figure 2-6 (a). As shown in the figure, the miss-rate estimated by the model shows
a good agreement with the results of the simulations.
The figure also shows the interesting fact that a certain range of time quanta can be very problematic for cache performance. For short time quanta, the overall miss-rate is relatively small. For very long time quanta, context switches do not matter since a process spends most of its time in the steady state. However, medium time quanta
could severely degrade cache miss-rates as shown in the figure. This problem occurs
when a time quantum is long enough to pollute the cache but not long enough to
compensate for the misses caused by context switches. The problem becomes clear in
Figure 2-6 (b). The figure shows the amount of data left in the cache at the beginning of a time quantum. Comparing Figure 2-6 (a) and (b), we can see that the problem occurs when the initial amount of data rapidly decreases.

Figure 2-7: The overall miss-rate when four processes (vpr, vortex, gcc, bzip2) are sharing a cache (32-KB, fully-associative).
The error caused by our approximation method (Equation 2.22) can be seen in
Figure 2-6. In the approximation, we assume that the data left in the cache at the
beginning of a time quantum are all in the MRU region of the cache.
In reality,
however, the data left in the cache could be in the LRU region and get replaced before other processes' blocks in the cache, although the current process' data are likely
to be accessed in the time quantum. As a result, the approximated miss-rate is lower
than the simulation result when the initial amount of data is not zero.
A four-process case is also modeled in Figure 2-7. Two more benchmarks, gcc and bzip2, from SPEC CPU2000 [14] are added to vpr and vortex, and the same cache configuration is used as in the two-process case. The figure also shows a very close
agreement between the miss-rate estimated by the cache model and the miss-rate
from simulations. The problematic time quanta and the effect of the approximation
have changed. Since there are more processes polluting the cache compared to the
two process case, a process experiences an empty cache in shorter time quanta. As
a result, the problematic time quanta become shorter. On the other hand, the effect
of the approximation is less harmful in this case. This is because the error in one
process' miss-rate becomes less important as we have more processes.
Chapter 3
Cache Partitioning Based on the Model
In this chapter, we discuss optimal cache partitioning for multiple processes running
on one shared cache. The discussion is mainly based on the analytical cache model
presented in Chapter 2, and the presented partition is optimal in the sense that it
minimizes the overall miss-rate of the cache.
The discussion in the chapter is based on similar assumptions as the analytical
model. First, the memory reference characteristic of each process is assumed to be
given by a miss-rate curve that is a function of cache size. To incorporate the effect
of dynamically changing behavior of a process, we assume that the memory reference
pattern of a process in each time quantum is specified by a separate miss-rate curve.
Therefore, each process has a set of miss-rate curves, one for each time quantum.
Within a time quantum, we assume that one miss-rate curve is sufficient to represent
a process.
Second, caches are assumed to be fully-associative. Although most real caches are
set-associative caches, it is much easier to understand the tradeoffs of cache partitioning using a simple fully-associative model rather than a complicated set-associative
model. Also, since a set-associative cache is a group of fully-associative caches, the
optimal partition could be obtained by applying a fully-associative partition for each
set.
Figure 3-1: An example of a cache partition with N processes sharing a cache.
Finally, processes are assumed to be scheduled in a round-robin fashion. In this
chapter, we use the term "the length of a time quantum" to mainly represent the
number of memory references in a time quantum rather than a real time period.
3.1 Specifying a Partition
To study cache partitioning, we need a method to express the amount of data in a
cache at a certain time as shown in Figure 3-1. The figure illustrates the execution
of N processes with round-robin scheduling; we define the term round as one cycle
of N processes being executed. A round starts from Process 1 and ends with Process N. Using a round number (r) and a process number (p), TQ(r,p) represents Process p's time quantum in the r-th round. Finally, we define X^(r,p) as the number of cache blocks belonging to Process p at the beginning of TQ(r,p).
We specify a cache partition by dedicated areas and a shared area. The dedicated
area of Process i is cache space that only Process i uses over different time quanta.
If we define D_i^r as the size of Process i's dedicated area between TQ(r-1, i) and TQ(r, i), D_i^r equals X^(r,i). For static partitions, when memory reference patterns
do not change over time, we can ignore the round number and express Process i's
dedicated area as Di.
The rest of the cache is called a shared area. A process brings in its data into
the shared area during its own time quantum. After the time quantum, the process's
blocks in this area are evicted by other processes. Therefore, a shared area can be thought of as additional cache space a process can use while it is active. S^(r,i), the size of the shared area in TQ(r, i), can be written as
\[
S^{(r,i)} = C - \sum_{p=1}^{i} D_p^{r+1} - \sum_{p=i+1}^{N} D_p^{r}
\tag{3.1}
\]
where C is the cache size. The size of the shared area is expressed by S for static partitioning.
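To make the indexing in Equation 3.1 concrete, the short sketch below computes S^(r,i) from the dedicated-area sizes. It is only an illustration of the convention that processes 1..i have already moved to their round-(r+1) dedicated areas when TQ(r,i) executes; the data layout D[round][process] is an assumption of the sketch, not the thesis' code.

def shared_area(C, D, r, i, N):
    """Equation 3.1: S^(r,i) for a cache of C blocks; D[r][p] holds D_p^r (p = 1..N)."""
    used = sum(D[r + 1][p] for p in range(1, i + 1))    # sum_{p=1}^{i} D_p^{r+1}
    used += sum(D[r][p] for p in range(i + 1, N + 1))   # sum_{p=i+1}^{N} D_p^{r}
    return C - used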
3.2 Optimal Partitioning Based on the Model
A cache partition can be evaluated by the transient cache model that is presented
in Section 2.3.1. To estimate the number of misses for a process in a time quantum,
the model needs the cache size (C), the initial amount of data (x_i(0)), the number of memory references in a time quantum (T_i), and the miss-rate curve of the process (m_i(x)). A cache partition effectively specifies the cache size C and the initial amount of data x_i(0) for each process. Therefore, we can estimate the total number of misses for a particular cache partition from the model.
From the definition of a dedicated area, Process i starts its time quantum TQ(r, i) with D_i^r cache blocks in the cache. In the cache model, the cache size C stands for the maximum number of blocks a certain process can use in its time quantum. Therefore, the effective cache size for Process i in time quantum TQ(r, i) is (D_i^r + S^(r,i)). For each time quantum TQ(r, i), a miss-rate curve (m_i^r(x)) and the number of memory references (T_i^r) are assumed to be given as inputs.
From Equation 2.3 and 2.12, we can write the number of misses over time quantum
TQ(r, i) as follows:
\[
\begin{aligned}
\mathit{miss}_{TQ(r,i)} &= \mathit{miss}(m_i^r(x),\, T_i^r,\, D_i^r,\, D_i^r + S^{(r,i)}) \\
&= \int_0^{T_i^r} m_i^r(x_i(t))\,dt \\
&= \int_0^{T_i^r} m_i^r\!\left(\min[(M_i^r)^{-1}(t + M_i^r(D_i^r)),\, D_i^r + S^{(r,i)}]\right)dt
\end{aligned}
\tag{3.2}
\]
where dM_i^r(x)/dx = 1/m_i^r(x) as in Equation 2.11. In the equation, miss(m(x), T, D, C) represents the number of misses for a time quantum with a miss-rate curve (m(x)), the length of the time quantum (T), the initial amount of data (D), and cache size (C).
If we assume Process i terminates after R_i rounds of time quanta, the total number of misses can be written as
\[
\text{total misses} = \sum_{p=1}^{N}\sum_{r=1}^{R_p} \mathit{miss}_{TQ(r,p)}.
\tag{3.3}
\]
In practical cases, when processes execute for a long enough time, the number of rounds R_i can be considered infinite.
Note that this equation is based on two assumptions.
First, the replacement
policy assumed by the equation is slightly different from the LRU replacement policy.
Our augmented LRU replacement policy replaces the dormant process' data in the
shared area before starting to replace the executing process' data. Once dormant
process' data are all replaced, our replacement policy is the same as the normal LRU
replacement policy. The equation also assumes that the dedicated area of a process
keeps the most recently used blocks of the process.
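Under these assumptions, miss(m(x), T, D, C) and the total of Equation 3.3 can be evaluated numerically in the same way as in Chapter 2. The sketch below is our own illustration (names and the static-partition simplification are assumptions, not the thesis' code).

def miss(m, T, D, C_eff):
    """Equation 3.2: expected misses over T references, starting with D dedicated
    blocks and allowed to grow up to the effective cache size C_eff."""
    x = min(float(D), float(C_eff))
    misses = 0.0
    for _ in range(int(T)):
        p = m(x)
        misses += p
        x = min(x + p, C_eff)          # growth recursion of Equation 2.8, capped
    return misses

def total_misses_static(miss_curves, quanta, dedicated, shared, rounds):
    """Equation 3.3 for a static partition: dedicated[p] = D_p and shared = S."""
    per_round = sum(miss(m, T, dedicated[p], dedicated[p] + shared)
                    for p, (m, T) in enumerate(zip(miss_curves, quanta)))
    return rounds * per_round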
Since S^(r,i) is determined by the D_i^r values, the total number of misses (Equation 3.3) can be thought of as a function of the D_i^r. The optimal partition is achieved by the set of integers D_i^r that minimizes Equation 3.3.
For a cache without prefetching, D_i^1 is always zero since the cache is empty when a process starts its execution for the first time. Also, D_i^r cannot be larger than the amount of Process i's data at the end of its previous time quantum.
3.2.1 Estimation of Optimal Partitioning
Although we can evaluate a given cache partition, it is not an easy task to find the
partition that minimizes the total number of misses. Since the size of a dedicated
area can be different for each round, an infinite number of combinations are possible
for a partition. Moreover, the optimal size of dedicated areas for a round is dependent
on the size of dedicated areas for the next round. For example, consider the optimal value of D_1^r: D_1^r affects the number of misses in the r-th round (miss_{TQ(r,2)}), which is also dependent on the size of Process 1's dedicated area in the (r+1)-th round. As a result,
determining the optimal partition requires the comparison of an infinite number of
combinations.
However, comparing all the combinations is intractable. This problem can be
solved by breaking the dependency between different rounds. By assuming that the
size of the dedicated areas changes slowly, we can approximate the size of the dedicated areas in the next round by the size of the dedicated areas in the current round, D_i^{r+1} ≈ D_i^r,
which enables us to separate the partition of the current round from the partition of
the next round.
The number of misses over one round that is affected by D_i^{r+1} can be written as follows from Equation 3.2:
\[
\begin{aligned}
\mathit{missD}_i(D_i^{r+1}) ={}& \sum_{p=1}^{i} \mathit{miss}\!\left(m_p^{r+1},\, T_p^{r+1},\, D_p^{r+1},\, C - \sum_{q=1}^{p} D_q^{r+2} - \sum_{q=p+1}^{N} D_q^{r+1}\right) \\
&+ \sum_{p=i+1}^{N} \mathit{miss}\!\left(m_p^{r},\, T_p^{r},\, D_p^{r},\, C - \sum_{q=1}^{p} D_q^{r+1} - \sum_{q=p+1}^{N} D_q^{r}\right).
\end{aligned}
\]
Since we assumed that D_i^{r+1} ≈ D_i^r and the size of the dedicated areas for the first round is given (D_i^1 = 0), we can determine the optimal values of the D_i^r round by round.
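For intuition, the round-by-round choice can be pictured with a brute-force search in the simplest setting of two identical processes and a static partition, which is also the setting of Figure 3-4. The sketch below is our own simplification (not the thesis' partitioning algorithm) and reuses the hypothetical miss() helper sketched in Section 3.2.

def best_symmetric_partition(m, T, C, step=16):
    """Search the dedicated-area size D given to each of two identical processes
    (coarse grid for speed); the rest of the cache, S = C - 2D, is shared."""
    best = None
    for D in range(0, C // 2 + 1, step):
        S = C - 2 * D
        round_misses = 2 * miss(m, T, D, D + S)   # both processes behave identically
        if best is None or round_misses < best[0]:
            best = (round_misses, D, S)
    return best   # (modeled misses per round, dedicated size D, shared size S)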
3.3 Tradeoffs in Cache Partitioning
In the previous section, we discussed a mathematical way to determine the optimal
partition based on the analytical cache model. However, the discussion is mainly
36
based on the equations, and does not provide an intuitive understanding of cache
partitioning. This section discusses the same optimal partitioning focusing on tradeoffs in cache partitioning rather than a method to determine the optimal partition.
For simplicity, the discussion in this section assumes one miss-rate curve per process
and therefore static cache partitionig.
Cache partitioning can be thought of as allocating cache blocks among dedicated
and shared areas. Each cache block can be used as either a dedicated area for one
process or a shared area for all processes. Dedicated cache blocks can reduce interprocess conflicts and capacity misses for a process. Shared cache blocks can reduce
capacity misses for multiple processes. Unfortunately, there are only a fixed number
of cache blocks. As a result, there is a tradeoff between (i) reducing inter-process
conflicts and (ii) reducing capacity misses for one process and reducing capacity misses
for multiple processes.
A dedicated area can improve the cache miss-rate of a process by reducing misses
caused by either conflicts among processes or lack of cache space. Specifically, if
data left in the dedicated area from past time quanta are accessed, we can avoid
inter-process conflict misses that would have occurred without the dedicated area.
A dedicated area also provides more cache space for the process, and may reduce
capacity misses within a time quantum. As a result, a dedicated area can improve
both transient and steady-state behavior of the process. However, the advantage of
a dedicated area is limited to one process. Also, a shared area has the same benefit
of reducing capacity misses.
The advantage of having a dedicated area is shown in Figure 3-2 based on the
analytical cache model. The dashed line in the figure represents the probability of a
miss at time t for a process without a dedicated area. Time 0 represents the beginning
of a time quantum. Without a dedicated area, the process starts a time quantum
with an empty cache, and therefore the miss probability starts from 1 and decreases
as the cache is filled with more data. Once the process fills the entire shared area,
the miss probability becomes a constant m(S), where S is the size of a shared area.
Since time corresponds to the number of memory references, the area below the curve
37
1
Advantage of Having a Dedicated Area
XD
m(D)
Cn
m(S) ---------------------m(D+S)--------------------
------
0
T
0
Time
Figure 3-2: The advantage of having a dedicated area.
from time 0 to T represents the number of misses expected for the process without a
dedicated area.
Now let us consider a case with a dedicated area D. Since the the amount of data
at the beginning of a time quantum is D, the miss probability starts from a value
m(D) and flattens at m(D + S) as shown by the solid line in the figure. The area
below the solid line between time 0 and T shows the number of misses expected in this
case. If m(D) is smaller than 1, the dedicated area improves the transient miss-rate
of the process. If m(D + S) is smaller than m(S), the dedicated area improves the
steady-state miss-rate of the process. Even though a dedicated area can improve both
transient and steady-state miss-rates, the improvement of the transient miss-rate is
the unique advantage of a dedicated area.
A larger shared area can reduce the number of capacity misses by allowing a
process to keep more data blocks while it is active. A process starts a time quantum
with data in its dedicated area, and brings new data into the cache when it misses.
The amount of the active process' data increases until both the dedicated area for
the process and the shared area are full with its data. After that, the active process
recycles the cache space it has. As a result, the size of shared area with the size of a
Figure 3-3: The advantage of having a larger shared area for a process.
dedicated area determines the maximum cache size a certain process can use. A larger
shared area can increase cache space for all processes, and therefore can improve the
steady-state miss-rate of all processes.
Figure 3-3 illustrates the advantage of having a larger shared area for a process.
The dashed line represents the probability of a miss as a function of time with a shared
area SI. Time 0 represents the beginning of a time quantum. The process' expected
number of misses is the area below the curve from time 0 to T, where T is the length
of a time quantum. The miss probability starts from a certain value and becomes
a constant once both the dedicated and shared area are full. A cache with a larger
shared area S2 can store more data and therefore the steady-state miss probability
becomes m(D + S2) as represented by a solid line in the figure.
A larger shared
area can reduce the number of misses if m(D + S 2 ) is smaller than m(D + S 1 ). The
improvement can be significant if the time quantum is long and therefore a process
stays in a steady-state for a long time. On the other hand, a larger shared area may
not improve the miss-rate at all if the process does not reach a steady-state. Since a
shared area is for all processes, the advantage of having a larger shared area is the
sum of advantages for all the processes.
Figure 3-4: Optimal cache partitioning for various cache sizes.
For small caches, the advantage of increasing a shared area usually outweighs the
advantage of increasing a dedicated area. If a cache is small, reloading cache blocks
only takes a relatively short period compared to the length of time quantum, and
therefore steady-state cache behavior dominates. At the same time, the working set
for a process is likely to be larger than the cache, which means that more cache
space can actually improve the steady-state miss-rate of the process. As a result,
the optimal partitioning is to allocate the whole cache as a shared area as shown in
Figure 3-4. The figure shows optimal cache partitioning for various cache sizes when
two identical processes are running with a time quantum of ten thousand memory
references. The miss-rate curve of the processes, m(x), is assumed to be e^{-0.01x}.
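For this exponential curve the model's quantities have a closed form, which makes the figure easy to reproduce. As a worked example (our own derivation from the definitions in Section 2.3.2, with b = 0.01):
\[
m(x) = e^{-bx} \;\Rightarrow\; M(x) = \int \frac{dx}{m(x)} = \frac{e^{bx}}{b}, \qquad M^{-1}(y) = \frac{\ln(by)}{b},
\]
\[
x^{\infty}(t) = M^{-1}\!\left(t + M(x^{\infty}(0))\right) = \frac{1}{b}\ln\!\left(bt + e^{\,b\,x^{\infty}(0)}\right),
\]
so a process that starts a quantum with an empty cache holds x^0(t) = min[ln(bt + 1)/b, C] blocks after t references, and the modeled miss probability at that point is e^{-b x^0(t)}.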
Assuming the same time quanta and processes, a shared area becomes less important for larger caches.
As the size of the shared area increases, a steady-state
miss-rate becomes less important since a process starts to spend more time to refill
the shared area. Also, the marginal gain of having a larger cache usually decreases as
a process has larger usable cache space (a shared area and a dedicated area). Once the
processes are unlikely to be helped by having more shared cache blocks, it is better to
Figure 3-5: Optimal cache partitioning for various time quanta. (a) The amount of
Process l's dedicated area. (b) The amount of shared area.
use those blocks as dedicated ones. For very large caches that can hold the working
sets for all the processes, it is optimal to allocate the whole cache to dedicated areas
so that no process experiences any inter-process conflict.
The length of time quanta also affects optimal partitioning. Generally, longer
time quanta make a shared area more important. As a time quantum for a process
becomes longer, a steady-state miss-rate becomes more important since the process
is likely to spend more time in a steady-state for a given cache space. As a result, the
process needs more cache space while it is active unless it already has large enough
space to keep all of its data.
An example of optimal partitioning for various time quanta is shown in Figure 3-5.
The figure shows the optimal amount of a dedicated area for Process 1 and a shared
area when two identical processes running on a cache of size 512 blocks. Time quanta
of two processes, expressed as numbers of references T_1 and T_2, are shown on the x-axis and y-axis respectively. The miss-rate curve of the processes, m(x), is assumed to be e^{-0.01x}.
As discussed above, the figure shows that a shared area should be larger for longer
Figure 3-6: Optimal cache partitioning for various miss-rate curves. (a) The amount
of Process l's dedicated area. (b) The amount of shared area.
time quanta. If both time quanta are short, the majority of cache blocks are used as
dedicated ones since processes are not likely to grow within a time quantum. In the
case of long time quanta, the whole cache is used as a shared area.
Finally, the characteristics of processes affect optimal partitioning. Generally,
processes with large working sets need a large shared area and vice versa. For example,
if processes are streaming applications, having a large shared area may not improve
the performance at all even though time quanta are long. But if processes have a
random memory access pattern with a large working set, having a larger cache usually
improves the performance. In that case, it may make sense to allocate a larger shared
area.
Figure 3-6 shows optimal partitioning between two processes with various miss-rate
curves when the cache size is 512 blocks and the length of a time quantum is
10000 memory references. The miss-rate curve of Process i, m_i(x), is assumed to be
e^{-b_i x}. If a process has a large b_i, the miss-rate curve drops rapidly and the size of the
working set is small. If a process has a small b_i, the miss-rate curve drops slowly and
the size of the working set is large.
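To make the relationship between b_i and the working-set size explicit (an illustrative step, not from the original text), the cache size at which the assumed curve falls below a miss-rate threshold ε follows directly:

$$m_i(x) = e^{-b_i x} \le \epsilon \quad\Longleftrightarrow\quad x \ge \frac{\ln(1/\epsilon)}{b_i},$$

so the effective working set is inversely proportional to b_i.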
Chapter 4
Cache Partitioning Using LRU
Replacement
The least recently used (LRU) replacement policy is the most popular cache management scheme that is used in modern microprocessors. In this chapter, we evaluate
the LRU replacement policy in terms of cache partitioning. Cache partitioning by the
LRU replacement policy and its problems are discussed. Based on the cache model,
we also discuss how and when the LRU replacement policy can be improved by proper
cache partitioning.
4.1
Cache Partitioning by the LRU Policy
The LRU replacement policy does not partition a cache explicitly. All the processes
can use the whole cache space while they are executing. However, the LRU replacement
policy still results in a certain amount of each process' data being left in the cache when
the process returns from a context switch. Therefore, the LRU replacement policy effectively
partitions the cache, and the partition can be represented by dedicated and shared
areas. The size of a dedicated area for a process is the amount of the process' data
left in the cache at the beginning of its time quantum, which can be estimated
from the model in Section 2.3.3. The rest of the cache space is a shared area.
The LRU replacement policy tends to allocate larger cache space to a shared area
compared to the optimal partition. For every miss, the LRU policy increases the
cache space for an active process unless the least recently used cache block happens
to be the process' block. As a result, each process usually consumes more cache space
than the optimal case. The LRU policy also tries to keep data from the recent time
quanta, which usually results in small dedicated areas for a round-robin schedule.
4.2
Comparison between Normal LRU and LRU
with Optimal Partitioning
If a cache is small enough for given time quanta, the LRU policy does the right
cache partitioning. The entire cache is used as a shared area. Therefore, optimal
partitioning cannot appreciably improve the performance for this case.
Also, the
performance improvement by optimal partitioning is marginal if the cache is large
enough to hold working sets of all processes. Even though the LRU policy may result
in a non-optimal cache partitioning, the miss-rate difference between the normal LRU
policy and the LRU policy with optimal partitioning is small for very large caches
since most of the useful data is in the cache anyway.
Cache partitioning of the normal LRU replacement policy could be problematic
in the cases of mid-range cache sizes and time quanta. A resultant cache partition is
determined by the size and the location of the cache space that each process uses in its
time quantum. Considering one cycle of time quanta, cache space that is accessed by
only one process is a dedicated area, and space that is accessed by multiple processes
is a shared area. Unfortunately, the LRU replacement policy has problems in both
the amount of cache space it allocates to a process and the layout of the cache space.
4.2.1
The Problem of Allocation Size
The LRU replacement policy determines the amount of cache space for a process
mainly based on the number of misses that occurs in the process' time quantum. This
could result in non-optimal cache allocation that shows poor cache performance. The
problem is clearest for very short time quanta, such as one with only a few memory
references. In these cases, the entire cache can be considered as dedicated areas because
processes are not likely to consume a significant amount of additional cache space.
The LRU replacement policy allocates cache space proportional to the number of
misses each process experiences.
However, the optimal cache partition is achieved
by allocating cache space based on the actual advantage that the space provides
to a process. That is, additional cache blocks should be given to the process that
has the biggest reduction in miss-rate by having those blocks. Therefore, the LRU
replacement policy is problematic when a process with a higher miss-rate has a smaller
marginal gain than other processes.
A typical example is the case when there is a streaming application with an
application that requires large cache space. The LRU replacement policy tends to
allocate too much cache space to the streaming application even though it is better
to keep the other process's data. The problem is shown in Figure 4-1. The figure
shows the relative miss-rate improvement by the optimal partitioning compared to
the normal LRU replacement policy. The relative improvement is calculated by the
following equation:
$$\text{Relative improvement} = \frac{\text{miss-rate}_{\text{normal LRU}} - \text{miss-rate}_{\text{optimal partitioning}}}{\text{miss-rate}_{\text{normal LRU}}}. \qquad (4.1)$$

Time quanta of the processes are assumed to be the same length.
One process is a
streaming application that shows a miss-rate of 12.5% with one cache block. The
other is a random memory reference application whose miss-rate decreases linearly
with having more cache blocks and becomes zero for caches larger than 2048 blocks.
The miss-rate curves are shown in Figure 4-1 (a). Figure 4-1 (b) illustrates the relative
improvement by the optimal cache partitioning over the normal LRU replacement
policy for various cache sizes and time quanta. The darker area represents larger
relative improvement. Finally, the slices of Figure 4-1 (b) for some time quanta are
shown in Figure 4-1 (c).
Figure 4-1: Improvement by optimal partitioning for a random memory reference
application with a streaming application. (a) Miss-rate curves of processes. (b) (c)
Relative improvement.

As discussed before, the figure shows that the normal LRU replacement policy
performs very close to the LRU with optimal partitioning if a cache is relatively small
for a time quantum or large enough to hold everything. But proper partitioning
improves the miss-rate by up to 60% for mid-size caches with a small time quantum. The
problem is severe for caches where the random memory reference application shows
a lower miss-rate than the streaming application, which corresponds to caches larger
than about 1800 blocks. Longer time quanta usually mitigate the problem of non-optimal
allocation size since the initial amount of data becomes less important for
longer time quanta.
4.2.2
The Problem of Cache Layout
The layout of a given cache space for each process may also be a problem. The LRU
replacement policy allocates additional cache blocks to an active process by replacing
LRU blocks, and tries to keep the data from recently executed processes. Therefore,
two consecutive processes do not share any cache space as long as a cache is large
enough to hold both spaces, which usually results in more cache pollution and smaller
dedicated areas than optimal partitioning. Generally, the LRU replacement policy can
suffer from small dedicated areas even though it allocates the right amount of cache
space for each process. The problem is especially severe for round-robin scheduling.
Consider the case when 9 identical processes are running on a cache of size 1024
blocks with round-robin scheduling. Also, let us say that a process needs 128 cache
blocks while it is active. For the LRU replacement policy, a process suffers an empty
cache whenever it restarts from a context switch. However, smart partitioning can
keep 112 blocks for each process and share 16 blocks among all processes. The problem
is that only two processes share a given cache block for the LRU replacement policy,
whereas all nine processes share the same 16 blocks for the smart partition.
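A quick check of the arithmetic in this example (an illustrative calculation, not from the original text):

$$9 \times 112 + 16 = 1024 \ \text{blocks for the smart partition}, \qquad 8 \times 128 = 1024 \ \text{blocks touched by the other processes under LRU},$$

which is why, under pure LRU with round-robin scheduling, a process finds essentially none of its data left when its next time quantum begins.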
For short time quanta, this layout problem is marginal since recently used data for
all processes are kept in a cache. Very long time quanta are also unproblematic since it
is usually optimal to have a large shared area for long time quanta. But this problem
could be severe for mid-range time quanta, which are just long enough to pollute the
entire cache in one cycle of time quanta. In this case, the LRU policy results in no
dedicated area, whereas the optimal partition should have large dedicated areas.
Problematic cache sizes and time quanta when Process i suffers from this layout
problem can be roughly estimated as follows:
$$\sum_{\substack{j=1 \\ j \neq i}}^{N} m_j T_j \approx C, \qquad (4.2)$$

where N is the number of processes, m_j is the miss-rate of Process j, T_j is the
number of memory references in Process j's time quantum, and C is the number of
cache blocks.
This layout problem of the LRU replacement policy is illustrated in Figure 4-2.
The figure shows the relative improvement obtained by optimal partitioning over the
normal LRU replacement policy, which is calculated by Equation 4.1. Four processes,
vpr, vortex, gcc, and bzip2 from SPEC CPU2000 [14], share a cache with the same
length of time quantum and round-robin scheduling. The miss-rate curves of processes
are shown in Figure 4-2 (a). Figure 4-2 (b) illustrates the relative improvement for
various cache sizes and time quanta. Finally, Figure 4-2 (c) also shows the relative
improvement just for some cache sizes. These figures show that the LRU replacement
policy is problematic for a certain range of time quanta for a given cache size.
The improvement is marginal for small time quanta. However, for some mid-range
time quanta, proper partitioning improves the miss-rate by up to around 30%. For a
very long time quantum, again, there is no difference between the normal LRU and
optimal partitioning. As Equation 4.2 implies, the problematic time quantum for
LRU becomes longer as we increase the cache size.

Figure 4-2: Improvement by optimal partitioning for four SPEC benchmarks. (a)
Miss-rate curves of processes. (b) (c) Relative improvement.
4.2.3
Summary
Figure 4-3: (a) Problems of the normal LRU replacement policy for various cache
sizes and time quanta. (b) Usual cache size and time quanta for each level of memory
hierarchy.

The problematic time quanta and cache sizes can be summarized from the above
discussions as shown in Figure 4-3. The figure illustrates the problem of the LRU
replacement policy for various cache sizes (the number of cache blocks) and time
quanta (the number of memory references per time quantum). If a cache is very small
or very large for a given time quantum, the normal LRU replacement policy performs
almost as well as the LRU with optimal partitioning. For short time quanta, non-optimal
allocation size for each process is the main source of performance degradation
(Section 4.2.1). In this case, the problematic cache size depends on the characteristics
of processes, and could be any size. For longer time quanta, a non-optimal layout
problem may degrade the cache performance (Section 4.2.2).
In this problem, the
problematic cache size generally increases as a time quantum becomes longer. The
relationship can be formalized based on Equation 4.2. If we assume the same miss-rate
and time quantum for all processes, the equation can be simplified as
$$(N - 1) \cdot M \cdot T \approx C. \qquad (4.3)$$
Usually, the number of processes N is between 10 and 100, and the miss-rate M is
between 1% and 10%, so the product (N - 1) M is of order one and can be canceled.
Therefore, the layout could be a serious problem if T and C are of about the same
order of magnitude.
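The rule of thumb in Equations 4.2 and 4.3 can be turned into a small script for exploring which cache sizes and time quanta are likely to be problematic. This is an illustrative sketch only: the function name, the example numbers, and the 0.3-3 "problematic" band are assumptions for demonstration, not values from the thesis.

```python
# Sketch (illustrative): flag (cache size, time quantum) pairs where the layout
# problem of Equation 4.2 is expected, i.e. the data brought into the cache by the
# other processes during one scheduling cycle roughly fills the cache.

def layout_problem_ratio(miss_rates, quanta, process_idx, cache_blocks):
    """Return (sum over j != i of m_j * T_j) / C from Equation 4.2 for Process i.

    Values near 1 mark the problematic region; much smaller values mean the other
    processes cannot pollute the cache, and much larger values mean the cache is
    flushed anyway, so partitioning cannot preserve useful data across quanta.
    """
    polluted = sum(m * t for j, (m, t) in enumerate(zip(miss_rates, quanta))
                   if j != process_idx)
    return polluted / cache_blocks

if __name__ == "__main__":
    # Ballpark numbers in the spirit of the text: N = 10 processes, 5% miss-rate,
    # an 8192-block cache, and several time-quantum lengths (memory references).
    N, m, C = 10, 0.05, 8192
    for T in (1_000, 10_000, 100_000, 1_000_000):
        r = layout_problem_ratio([m] * N, [T] * N, 0, C)
        verdict = "problematic" if 0.3 < r < 3.0 else "ok"
        print(f"T = {T:>9}: ratio = {r:6.2f} -> {verdict}")
```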
4.3
Problems of the LRU Replacement Policy for
Each Level of Memory Hierarchy
Each level of memory hierarchy usually has a different cache configuration, that is,
cache size and block size are different for each level. Generally, both cache size
and block size increase as we go up the memory hierarchy [13]. Current generation
microprocessors usually have an L1 cache which ranges from 16 KB to 1 MB with 32
Byte cache blocks [2, 3, 15, 16, 6, 10, 21, 11]. L2 caches, often an order of magnitude
larger than L1 caches, are between 256 KB and 8 MB. L2 cache blocks are also a few
times larger than L1 cache blocks. Finally, main memory can be considered as a cache
with pages as cache blocks. Main memory, these days, usually ranges between 64 MB
and 4 GB, with a page size of 4 KB. As a result, L2 caches usually have an order of
magnitude more cache blocks than L1 caches, and main memory and L2 caches have
about the same number of cache blocks.
The number of memory references per time quantum also differs for each level
of memory hierarchy. Generally, a higher level cache sees fewer memory references
than lower level caches since a higher level of the memory hierarchy only sees memory
references that miss in the lower levels. For example, if a process misses on 10% of memory
references in an L1 cache, the number of memory references for an L2 cache is one tenth
of the number of memory references for the L1 cache. For conventional time-shared
systems, the time quantum is usually around 10 ms, assuming a process is not I/O
bound. For Compaq Tru64 Unix, the default time quantum is 10 ticks¹, and each tick
is roughly 1 ms [7]. Therefore, L1 caches usually service hundreds of thousands of memory
references or more per time quantum for processors with a 1 GHz clock. For
simultaneous multi-threading (SMT), processes are very tightly interleaved [31, 19, 9].
Therefore, there are only a few memory references per time quantum. Figure 4-3 (b)
summarizes the usual cache size and number of memory references per time quantum
for each level of memory hierarchy. A higher level cache is assumed to see one tenth
of the memory references that the next lower level cache sees.

¹As root, the default time quantum can be checked as follows: dbx -k vmunix; (dbx) p
round-robin-switch-rate.
It is very clear from Figure 4-3 what kind of partitioning problems each level of
memory hierarchy may suffer from, when it is managed by the simple LRU replacement
policy. For L1 caches with a conventional time-sharing mechanism, the number
of memory references per time quantum is likely to be an order of magnitude larger
than the number of cache blocks. Therefore, for L1 caches, partitioning is usually
not a problem although slow processors with a large L1 cache may suffer from the
non-optimal layout problem. However, L2 caches of conventional time-shared systems
are very likely to suffer from the problem of non-optimal layout. To obtain the best
performance for L2 caches, it could be crucial to partition a cache properly so that the
right amount of data is kept in the cache over the time quanta. For main memory, the
number of memory references per time quantum is often relatively small considering
the number of cache blocks it has. Therefore, allocating the right amount of space to
each process is important for main memory and SMT systems. Especially, streaming
applications should be carefully managed in this case. Finally, for SMT systems, all
the levels of memory hierarchy could be problematic since time quanta are short on
all levels.
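The back-of-the-envelope reasoning above can also be sketched as a small calculation. The numbers below (1 GHz clock, 10 ms quantum, a memory reference every few cycles, and a factor-of-ten filtering per level) are illustrative assumptions in the spirit of the text, not measured values.

```python
# Sketch (illustrative): estimate memory references per time quantum seen by each
# level of the hierarchy, assuming each level sees about one tenth of the
# references of the level below it.

CLOCK_HZ = 1e9           # assumed 1 GHz processor clock
QUANTUM_S = 10e-3        # assumed 10 ms OS time quantum
REFS_PER_CYCLE = 0.3     # assumed fraction of cycles issuing a memory reference
FILTER_PER_LEVEL = 0.1   # assumed filtering ratio between adjacent levels

def refs_per_quantum(level):
    """References per quantum reaching a level (0 = L1, 1 = L2, 2 = main memory)."""
    l1_refs = CLOCK_HZ * QUANTUM_S * REFS_PER_CYCLE
    return l1_refs * FILTER_PER_LEVEL ** level

if __name__ == "__main__":
    for name, level in (("L1", 0), ("L2", 1), ("Main memory", 2)):
        print(f"{name:12s}: ~{refs_per_quantum(level):,.0f} references per quantum")
```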
As technology improves, both cache size and clock frequency tend to increase,
which results in more cache blocks and more memory references per time quantum.
However, the speed of increase is different for each level of memory hierarchy. The
size of L1 caches tends to grow slowly compared to the clock frequency, since it is
crucial for L1 caches to keep up with the speed of the processor. On the other hand,
the size of main memory grows much faster. Therefore, for future microprocessors, it
is likely that proper cache partitioning will be more important than it is now for L2
caches and main memory, and less important for L1 caches.
4.4
Effect of Low Associativity
The previous sections are mainly based on the fully-associative cache model. However,
most of the discussions are generally valid also for set-associative caches. Discussions
are based on the assumption that the miss-rate of a process can be given as a function
of cache size, m(x), which is also a reasonable assumption for set-associative
caches. How the size of cache space x actually changes is, however, based on the
fully-associative model, and may differ for set-associative caches.
Generally, the size of cache space grows more slowly for set-associative caches than for
fully-associative caches. For set-associative caches, the replacement is determined by
the set mapping and the LRU information. Since the mapping could be arbitrary, it
is possible for a process to replace its own data even though the LRU cache block is
the other process' block. Therefore, it usually takes more time for a set-associative
cache to pollute the whole cache than for a fully-associative cache. As a result, the
non-optimal layout problem of the LRU replacement policy is generally less problematic
for set-associative caches.
Even though the general trend discussed above holds for all caches, it is hard to predict
the exact effect of context switching for caches with low associativity. As associativity
becomes lower, the replacement of cache blocks becomes more dependent on the set
mapping, which is arbitrary. If mapping happens to be the form that frequently
accessed sets of processes overlap, the problem of context switching could be very
severe. On the other hand, if frequently accessed sets of processes do not overlap at
all, context switches may not affect the cache performance at all.
4.5
More Benefits of Cache Partitioning
Previous sections focused only on cache miss-rates, and discussed how proper partitioning can improve a cache miss-rate. However, cache partitioning could have more
advantages in terms of overall system performance, and predictability.
Proper cache partitioning can improve the overall system performance by reducing
the burst of misses. Although cache miss-rates are generally used as a metric of cache
performance, high miss-rates do not necessarily mean lots of stalls for a processor.
Since modern processors usually exploit instruction-level parallelism, only certain misses,
or many outstanding misses at once, cause performance degradation. If a process starts an
execution with an empty cache, it is very likely that the process experiences a burst
of misses that causes a processor to stall. Proper cache partitioning can mitigate this
burst of misses by allocating dedicated space to the process. In some cases, it might
be better to remove a burst of misses at the beginning of time quantum even though
it results in a worse overall miss-rate.
The predictability of a system could also be improved by cache partitioning. For
some real-time applications, it is more important to know the worst-case performance of a certain
task. For example, if an application is playing an audio clip, it is crucial for the
application to produce a new packet every period. In this case, we should know the
worst case performance of the process, and schedule the application so that it can do
the job within a given period. By having a larger dedicated area for the process, cache
partitioning can improve the worst case performance and, therefore, predictability [5].
Chapter 5
Partitioning Experiments
This chapter discusses the implementation of dynamic cache partitioning based on
the analytical cache model. Using trace-driven simulations, we compare this model-based
partitioning with the normal LRU and another partitioning method based
on look-ahead information. The implementation in this chapter is based on fully-associative
caches. However, this implementation can also be applied to each set of a
set-associative cache.
5.1
Implementation of Model-Based Partitioning
This section describes the implementation of our model-based partitioning algorithm
in a simulation environment. Figure 5-1 gives an overview of our implementation.
First, each benchmark is compiled and simulated using a system simulator
to generate the memory reference trace of the benchmark. The traces are fed into a
single process cache simulator to record the memory reference pattern of each process.
Finally, a combination of benchmarks is simulated by a multi-process cache simulator with our cache partitioning module. Note that the recorded information can be
reused for any combination of benchmarks as long as the cache configuration is the
same.
Figure 5-1: The implementation of the analytical-model-based cache partitioning.
5.1.1
Recording Memory Reference Patterns
This phase of partitioning records the memory reference pattern of each process so
that the information can be used by the on-line cache partitioning module.
The
memory reference pattern of a process is assumed to be represented as the miss-rate
curve of the process. Therefore, we actually record the miss-rates of a process
with respect to the number of cache blocks. Since the miss-rate depends on the
configuration of a cache, the cache used for profiling should be configured the same as
the cache used for the multi-process cases.
Ideally, the miss-rate curve for each time quantum should be recorded; however,
that results in extremely large files for short time quanta. Moreover, the number of
memory references for a time quantum may not be known at the moment of profiling,
which makes it harder to record one miss-rate curve for each time quantum. Therefore,
we record the average miss-rate curve over a certain time period with time stamps.
Each miss-rate curve is mapped to a corresponding time quantum at run-time.
The time stamp of a miss-rate curve consists of both the start and stop times of the
period in which the miss-rate is recorded. Time is expressed as the number of memory
references from the beginning of a process. Therefore, the start time of a miss-rate
curve is the time of the first memory reference that contributes to the miss-rate curve.
Similarly, the stop time of a miss-rate curve is the time of the last memory reference
that contributes to the miss-rate curve. At run-time, the miss-rate curve to which the
majority of memory references in a time quantum contribute is mapped to that time
quantum.

Figure 5-2: The implementation of on-line cache partitioning.
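The recording and mapping scheme described in this section can be sketched as follows. The data layout and helper names are illustrative assumptions, not the actual simulator code.

```python
# Sketch (illustrative): average miss-rate curves are recorded over a fixed record
# cycle P, each tagged with start/stop time stamps; at run-time a quantum is mapped
# to the curve that covers the majority of its memory references.

from dataclasses import dataclass, field
from typing import List

@dataclass
class MissRateCurve:
    start: int               # time stamp (reference count) of the first contributing reference
    stop: int                # time stamp of the last contributing reference
    miss_rate: List[float]   # miss_rate[x] = average miss-rate with x cache blocks

@dataclass
class ReferenceRecord:
    record_cycle: int                        # P, in memory references
    curves: List[MissRateCurve] = field(default_factory=list)

    def curve_for_quantum(self, t_start: int, t_end: int) -> MissRateCurve:
        """Return the recorded curve that overlaps the quantum [t_start, t_end) most."""
        def overlap(curve: MissRateCurve) -> int:
            return max(0, min(curve.stop, t_end) - max(curve.start, t_start))
        return max(self.curves, key=overlap)
```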
5.1.2
Cache Partitioning
Four modules are related to the cache partitioning flow as shown in Figure 5-2. A
scheduler provides a partition module with scheduling information such as the set
of executing processes and when each process starts and finishes. The partition module
reads miss-rate information from recorded files and decides the cache partition for
each time quantum.
Finally, a replacement module effectively partitions a cache
based on the information from the partition module.
Cache Partitioning by Replacement Unit
We partition a cache based on the mechanism called column caching [4]. Column
caching is a mechanism to allow partitioning of a cache at cache column granularity,
where each column is one "way" or bank of the n-way set-associative cache.
A
standard cache considers all cache blocks in a set as candidates for replacement.
As a result, a process' data can occupy any cache block. Column caching, on the
other hand, restricts the replacement to a sub-set of cache blocks, which is essentially
partitioning the cache.
Column caching specifies replacement candidacy using a bit vector in which a bit
indicates if the corresponding column is a candidate for replacement. An LRU replacement unit is modified so that it replaces the LRU cache block from the candidates
specified by a bit vector. Each partitionable unit has a bit vector. Since lookup is
precisely the same as for a standard cache, column caching incurs no performance
penalty during lookup.
Our implementation uses the same mechanism as column caching at cache block
granularity. Each process has a bit vector with the same number of bits as the number
of cache blocks. Each bit in a bit vector specifies whether the corresponding cache
block can be replaced by the process or not. Although it might not be practical to
have cache-block granularity, we chose this design to verify the cache model. More
practical implementations of column caching can and will be studied.
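A minimal sketch of the replacement mechanism just described, i.e. LRU restricted by a per-process candidacy bit vector at cache-block granularity; the class and method names are illustrative assumptions, not the thesis implementation.

```python
# Sketch (illustrative): an LRU replacement unit whose victims are restricted by a
# bit vector, in the spirit of column caching applied at cache-block granularity.

class RestrictedLRUCache:
    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.tags = [None] * num_blocks     # tags[b] = address currently held in block b
        self.last_used = [0] * num_blocks   # LRU time stamp per block
        self.clock = 0

    def access(self, address, candidates):
        """Look up 'address'; on a miss, replace the LRU block among 'candidates'.

        'candidates[b]' is True if the active process may replace block b. Lookup
        is unrestricted, exactly as in a standard cache, so hits behave normally.
        Returns True on a hit, False on a miss.
        """
        self.clock += 1
        for b, tag in enumerate(self.tags):
            if tag == address:              # hit: only the LRU information changes
                self.last_used[b] = self.clock
                return True
        allowed = [b for b in range(self.num_blocks) if candidates[b]]
        victim = min(allowed, key=lambda b: self.last_used[b])
        self.tags[victim] = address
        self.last_used[victim] = self.clock
        return False
```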
Determination of a Partition
The partition module manages the bit vectors, and therefore effectively controls the
partition of a cache. The module obtains scheduling information from the scheduler,
and decides a new partition at the end of each process' time quantum. Based on the
decision, bit vectors are changed to actually partition a cache.
To decide a partition, the partition module maintains several data structures.
Based on the information from the scheduler, the module knows the number of processes (N) and the length of each process' time quantum (T_i). For partitioning, the
module also keeps the size of each process' dedicated area (D_i) and the miss-rate
curve of each process (m_i). As discussed in Section 3.2.1, we assume that the size of a
process' dedicated area is roughly the same at the start and at the end of its time
quantum, and we therefore keep only one D_i for each process. The size of a shared area is
calculated from the sizes of the dedicated areas at any given moment.
If a new process is added to a scheduler, the scheduler notifies the partition module
of the increase in the number of processes, and the partition module initializes the
data structures for the process. First, the number of processes (N) is increased, and
the proper length of time quantum is assigned to T_N. Second, since the new process
has not started executing yet, the size of the dedicated area for the process (D_N) is set
to zero. Finally, the partition module opens the record file for the new process and
loads m_N(x).
After Process i finishes its time quantum, the partition module updates the information
about Process i, such as m_i and T_i, for the next round. The size of Process
i's dedicated area, D_i, is also updated based on the new miss-rate curve and time
quantum. The partitioning decision, based on the discussion in Section 3.2, is as
follows. Let S represent the size of the shared area in Process i's time quantum; the
new value of D_i is the integer from 0 to D_i + S that minimizes the following expression:

$$\sum_{p=1}^{N} \mathrm{miss}\!\left(m_p(x),\, T_p,\, D_p,\, C - \sum_{\substack{q=1 \\ q \neq p}}^{N} D_q\right), \qquad (5.1)$$

where miss(m(x), T, D, C) is given in Equation 3.2.
Once the new D_i is computed, bit vectors are also updated according to the new
size of the dedicated area. The D_i MRU cache blocks are left in the dedicated area
for Process i, and the rest of the cache blocks are moved to the next process' area. Also,
the moved cache blocks are set to be the LRU blocks so that those blocks are replaced
before the dedicated area of the next process.
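The partitioning decision of Equation 5.1 amounts to a one-dimensional search over the dedicated-area size of the process that just finished its quantum. The sketch below illustrates that search; the miss() estimator comes from Equation 3.2 (not reproduced here) and is passed in as an argument, and all names are illustrative assumptions.

```python
# Sketch (illustrative): re-optimize D_i at the end of Process i's time quantum by
# minimizing the total expected misses of Equation 5.1 over D_i in 0 .. D_i + S.

def update_dedicated_area(i, D, m, T, cache_blocks, miss):
    """Return a new list of dedicated-area sizes with D[i] re-optimized.

    D: current dedicated sizes, m: miss-rate curves, T: time quanta (references),
    miss(m_p, T_p, D_p, avail_p): expected misses of Process p in one quantum.
    """
    shared = cache_blocks - sum(D)              # current shared area S
    best_d, best_cost = D[i], float("inf")
    for d in range(0, D[i] + shared + 1):       # candidate dedicated sizes for Process i
        trial = list(D)
        trial[i] = d
        cost = sum(
            miss(m[p], T[p], trial[p],
                 cache_blocks - sum(trial[q] for q in range(len(D)) if q != p))
            for p in range(len(D))
        )
        if cost < best_cost:
            best_d, best_cost = d, cost
    new_D = list(D)
    new_D[i] = best_d
    return new_D
```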
5.2
Cache Partitioning Based on Look-Ahead Information
In this section, we discuss a cache partitioning algorithm when we have the information of when a cache block will be accessed in the future. Since this partitioning
algorithm is based on complete knowledge of program behavior, it could be thought
of as the best we can do by cache partitioning. We will evaluate our model based
partitioning by comparing it with this ideal partitioning result.
The optimal replacement can be achieved by replacing the cache block that is going
to be accessed furthest in the future. This replacement policy is optimal in the sense
that it results in the minimum number of misses possible for a given cache. For
multi-process cases, the replacement policy also results in the optimal cache partitioning
among processes. Therefore, simulation results with this ideal replacement policy
show the best cache miss-rate for a given cache. We call this the ideal replacement
policy.
The ideal replacement policy can be far better than what we can actually do
with just cache partitioning. Our cache partitioning is based on the LRU replacement policy within a single process, while the ideal replacement policy exploits its
knowledge of the future even within a process. As a result, the performance of the
ideal replacement policy is impossible to achieve just by proper cache partitioning
among processes as long as the normal LRU replacement policy is used within a time
quantum.
We propose the modified LRU replacement policy as a limit of cache partitioning
with the normal LRU replacement policy. The modified LRU replacement policy
considers the LRU cache block of the active process and all the cache blocks of dormant
processes as candidates for replacement, and replaces the block that is going to be
accessed furthest in the future from the candidates. This replacement policy is exactly
the same as the normal LRU replacement policy when there is only one process. For
multi-process cases, however, the policy tries to keep data of dormant processes if
they are more likely to be accessed in the near future than the LRU block of an
active process.
The modified LRU replacement policy keeps the dormant processes' cache blocks that
are going to be accessed in the nearest future, whereas our partitioning can only keep
the MRU cache blocks. Therefore, the performance of the modified LRU replacement
policy may still not be a realistic limit of our partitioning with the LRU replacement
policy. In particular, the modified LRU replacement policy performs almost as well as
the ideal replacement policy for short time quanta. For long time quanta, however,
the modified LRU replacement policy can be thought of as our limit. For long time
quanta, the cache blocks kept in the cache are more likely to be accessed within a time
quantum even though they are not accessed at the beginning of the time quantum.
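The modified LRU policy described above can be captured in a few lines as a trace-driven oracle; the data structures and names below are illustrative assumptions, not the simulator's actual code.

```python
# Sketch (illustrative): the modified LRU replacement policy. On a miss, the
# candidates are the LRU block of the active process plus every block owned by a
# dormant process; among these, the block whose next use lies furthest in the
# future is evicted. Future-use times come from look-ahead into the trace.

def choose_victim(blocks, active_pid, next_use):
    """blocks: list of dicts {"addr", "owner", "last_used"};
    next_use(addr): position of the next reference to addr (float("inf") if none)."""
    active_blocks = [b for b in blocks if b["owner"] == active_pid]
    candidates = [b for b in blocks if b["owner"] != active_pid]
    if active_blocks:
        # the active process contributes only its single LRU block as a candidate
        candidates.append(min(active_blocks, key=lambda b: b["last_used"]))
    # evict the candidate that is accessed furthest in the future
    return max(candidates, key=lambda b: next_use(b["addr"]))
```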
5.3
Experimental Results
The case of eight processes sharing a 32-KB fully-associative cache is simulated to
evaluate the model-based partitioning. Seven benchmarks (bzip2, gcc, swim, mesa,
vortex, vpr, twolf) are from SPEC CPU2000 [14], and one (the image understanding
(iu) program) is from the data intensive systems benchmark suite [23]. The overall
miss-rate with partitioning is compared to the miss-rate only using the normal LRU
replacement policy. The result of look-ahead based partitioning serves as the best
miss-rate possible for the given benchmarks.
The simulations are carried out for fifty million memory references for each time
quantum length. Processes are scheduled in a round-robin fashion with a fixed number
of memory references per time quantum. Also, the number of memory references
per time quantum is assumed to be the same for all eight processes. Finally,
two record cycles (P), ten million and a million memory references, are used for the
model-based partitioning. The record cycle represents how often the miss-rate curve
is recorded for off-line profiling. Therefore, a shorter record cycle implies more
detailed information about a process' memory reference pattern.
The characteristics of the benchmarks are illustrated in Figure 5-3. Figure 5-3 (a)
shows the change of a miss-rate over time. The x-axis represents simulation time.
The y-axis represents the average miss-rate over one million memory references at a
given time. As shown in the figure, bzip2, gcc, swim and iu show abrupt changes in
their miss-rates, whereas the other benchmarks have very uniform miss-rate characteristics
over time. Figure 5-3 (b) illustrates the miss-rate as a function of the cache size. For
a 32-KB fully-associative cache, the benchmarks show miss-rates between 1% and 5%.

Figure 5-3: The characteristics of the benchmarks. (a) (b) The change of a miss-rate
over time. (c) (d) The miss-rate as a function of the cache size.

Figure 5-4: The results of the model-based cache partitioning for various time quanta
when eight processes (bzip2, gcc, swim, mesa, vortex, vpr, twolf, iu) are sharing a
cache (32 KB, fully associative).
The results of cache partitioning for various time quanta are shown in Figure 5-4.
The miss-rates are averaged over 50 million memory references. As discussed in the
cache model, the normal LRU replacement policy is problematic for a certain range of
time quanta. In this case, the overall miss-rate increases dramatically for time quanta
between one thousand and ten thousand memory references.
For this problematic
region, the model-based partitioning improves the cache miss-rate by lowering it from
4.5% to 3.5%, which is over a 20% improvement. For short time quanta, the relative
improvement is about 5%. For very long time quanta, the model-based partitioning
shows the exact same result as the normal LRU replacement policy. In general, it is
shown in the figure that the model-based partitioning always performs at least as well
as the normal LRU replacement policy. Also, the partitioning with a
short record cycle performs better than the partitioning with a long record cycle.
From the figure, we can also compare the performance of the model-based partitioning with the modified LRU replacement policy. For long time quanta, our model-based partitioning performs as well as the modified LRU replacement policy, which
means that our model-based partitioning closely achieves the best cache partition.
However, there is a noticeable gap between the model-based partitioning and the
modified LRU replacement policy for short time quanta. This is partly because our
model only has information averaged over time quanta, whereas the modified LRU
replacement policy knows the exact memory reference pattern for each time quantum.
The modified LRU replacement policy performs much better than the model-based
partitioning for short time quanta also because the modified LRU becomes the ideal
replacement policy if the time quantum is short.
In our example of a 32-KB cache with eight processes, the problematic time quanta
are on the order of a thousand memory references, which is very short for modern
microprocessors. As a result, only systems with very fast context switching, such
as simultaneous multi-threading machines, can be improved for this cache size and
workload. However, as shown in Chapter 4, longer time quanta become problematic
if a cache is larger. Therefore, conventional time-shared systems with very high clock
frequency can also be improved by the same technique if a cache is large.
Figure 5-5: The results of the model-based cache partitioning over time when eight
processes (bzip2, gcc, swim, mesa, vortex, vpr, twolf, iu) are sharing a cache (32 KB,
fully associative) with three different time quanta: (a) ten, (b) five thousand, and (c)
one million memory references. Notice the roughly 20% relative improvement in (b).

Figure 5-5 shows miss-rates from the same simulations as above. In this figure,
however, the change of the miss-rate over time is shown rather than an average miss-rate over
the entire simulation. It is clear from the figure how the short record cycle helps
partitioning. In Figure 5-5 (a), the model-based partitioning with the long record
cycle (P = 10^7) performs worse than LRU at the beginning of a simulation, even
though it outperforms the normal LRU replacement policy overall. This is because the
model-based partitioning has only one average miss-rate curve for a process. As shown
in Figure 5-3, some benchmarks such as bzip2 and gcc have a very different miss-rate
at the beginning. Therefore, the average miss-rate curves for those benchmarks do
not work at the beginning of the simulation, which results in worse performance
than the normal LRU replacement policy. The model-based partitioning with the
short record cycle (P = 10^6), on the other hand, always outperforms the normal LRU
replacement policy. In this case, the model has correct miss-rate curves for all the
time quanta, and partitions the cache properly even at the beginning of processes.
Figure 5-5 (b) also shows that the short record cycle improves performance at the
beginning. Finally, the model-based partitioning cannot improve miss-rates if the
time quantum is very long as shown in Figure 5-5 (c).
In conclusion, the simulation experiments show that the model-based partitioning
works very well for fully-associative caches. For short or mid-range time quanta, cache
performance can be improved using the model-based partitioning, and partitioning
is especially helpful for mid-range time quanta when the normal LRU replacement
policy is problematic. The model-based partitioning for set-associative caches is a
subject of ongoing work.
Chapter 6
Conclusion
An analytical cache model to estimate overall miss-rate when multiple processes are
sharing a cache has been presented (Chapter 2). The model obtains the information
about each process from its miss-rate curve, and combines it with parameters that
define the cache configuration and schedule of processes. Interference among processes
under the LRU replacement policy is quickly estimated for any cache size and any time
quantum, and the estimated miss-rate is very accurate. More importantly, the model
provides not only the overall miss-rate but also a very good understanding of the effect
of context switching. For example, the model clearly shows that the LRU replacement
policy is problematic for mid-range time quanta because the policy replaces the blocks
of the least recently executed process that are more likely to be accessed in the near
future.
The analytical model has been applied to the cache partitioning problem to show
that the model can be used for both understanding the problem and developing
new algorithms to solve the problem. First, we have defined cache partitioning and
discussed the optimal partition based on the model. From the study, we can conclude
that a certain amount of cache blocks should be dedicated to each process and the
rest should be shared among all processes. The optimal amount of dedicated blocks
depends on cache size, time quanta, and the characteristics of processes (Chapter 3).
We have also compared the standard LRU replacement policy with the optimal
partition (Chapter 4). The study shows that the LRU replacement policy has two
main problems: (i) allocation size and (ii) cache layout. The LRU replacement policy
allocates too much cache space to streaming processes, which can be a severe problem
for short time quanta. The LRU replacement policy also tends to replace the wrong
process' data, which often results in unnecessary conflicts among processes.
This
layout problem can be severe for mid-range time quanta. Therefore, L2 caches and
DRAMs could suffer from context switches and can potentially be improved by proper
partitioning techniques.
Finally, the model-based partitioning has been implemented and verified by simulations (Chapter 5). Miss-rate curves are recorded off-line and partitioning is performed on-line according to the combination of processes that are executing. The
simulation result for the case when eight processes share a 32-KB fully-associative
cache shows that the model-based partitioning can improve the overall miss-rate by up
to 20%. Even though we have used an off-line profiling method to obtain miss-rate
curves, it should not be hard to approximate the miss-rate curve on-line using miss-rate
monitoring techniques. Therefore, a complete on-line cache partitioning method
can be developed based on the model.
Only the cache partitioning problem has been studied in this thesis. However, as
shown by the study of cache partitioning, our model can be applied to any cache
optimization problem that is related to the problem of context switching. For example, it
can be used to determine the best combination of processes that can be run on each
processor of a multi-processor system. Also, the model is useful in identifying areas
in which further research in improving cache performance would be fruitful since it
can easily provide the maximum improvement we can expect.
Appendix A
The Exact Calculation of E[x(t)]
In our cache model, we approximated E[x(t)], the expectation of the amount of data
in the cache after t memory references, using Jensen's inequality (Section 2.3.2).
Even though it is computationally more expensive than the approximation, it is also
possible to calculate the exact value of E[x(t)].
The expectation is the sum of every possible value multiplied by the probability
for the value. Therefore, to calculate the exact value of E[x(t)], we should know the
probability for x(t) to be k for every integer k from 0 to C, where C is the cache size.
To calculate the probabilities, define P_{x=k}(t) as the probability for x(t) to be k. We
also define a vector of the probabilities as follows:

$$P_x(t) = \left(P_{x=0}(t),\ P_{x=1}(t),\ \ldots,\ P_{x=C}(t)\right). \qquad (A.1)$$
From the facts that the amount of data in the cache increases by one only if a
memory reference misses (Equation 2.4) and that the probability of a miss with k cache blocks
in the cache is m(k), P_{x=k}(t + 1) can be written as

$$P_{x=k}(t + 1) = P_{x=k}(t)\,(1 - m(k)) + P_{x=k-1}(t)\, m(k - 1). \qquad (A.2)$$
Using matrix multiplication, the probabilities for all k can be written in one equation as follows:
$$P_x(t + 1) = P_x(t) \cdot
\begin{pmatrix}
1 - m(0) & m(0) & 0 & \cdots & 0 \\
0 & 1 - m(1) & m(1) & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & 1 - m(C-1) & m(C-1) \\
0 & \cdots & 0 & 0 & 1
\end{pmatrix}. \qquad (A.3)$$
If we assume that the cache is empty at time 0, then P_{x=0}(0) = 1 and all other
probabilities P_{x=k}(0) = 0 for k ≠ 0. Therefore, we can calculate P_x(t):
$$P_x(t) = P_x(0) \cdot A^t, \qquad (A.4)$$

where P_x(0) and A are given by the following equations:

$$P_x(0) = (1, 0, \ldots, 0), \qquad (A.5)$$

$$A =
\begin{pmatrix}
1 - m(0) & m(0) & 0 & \cdots & 0 \\
0 & 1 - m(1) & m(1) & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & 1 - m(C-1) & m(C-1) \\
0 & \cdots & 0 & 0 & 1
\end{pmatrix}. \qquad (A.6)$$
Since we can calculate the probability of every possible value of x(t), it is straightforward
to calculate the expectation E[x(t)]:

$$E[x(t)] = \sum_{k=0}^{C} k \cdot P_{x=k}(t) = (0, 1, 2, \ldots, C) \cdot \left(P_x(0) \cdot A^t\right)^{T}. \qquad (A.7)$$
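The exact computation above is straightforward to implement with a matrix power; the following sketch (illustrative, with assumed function names) mirrors Equations A.4 through A.7.

```python
# Sketch (illustrative): exact E[x(t)] via the (C+1) x (C+1) transition matrix of
# Equation A.6, starting from the empty-cache distribution of Equation A.5.

import numpy as np

def expected_data_in_cache(miss_rate, cache_blocks, t):
    """miss_rate(k): miss-rate with k blocks in the cache; returns exact E[x(t)]."""
    C = cache_blocks
    A = np.zeros((C + 1, C + 1))
    for k in range(C):
        A[k, k] = 1.0 - miss_rate(k)        # stay at k blocks on a hit
        A[k, k + 1] = miss_rate(k)          # grow to k + 1 blocks on a miss
    A[C, C] = 1.0                           # a full cache cannot grow further
    p0 = np.zeros(C + 1)
    p0[0] = 1.0                             # cache assumed empty at time 0
    pt = p0 @ np.linalg.matrix_power(A, t)  # P_x(t) = P_x(0) * A^t  (Equation A.4)
    return float(pt @ np.arange(C + 1))     # Equation A.7

if __name__ == "__main__":
    # Example with the exponential miss-rate curve used in Chapter 3, m(x) = e^{-0.01x}.
    print(expected_data_in_cache(lambda k: np.exp(-0.01 * k), cache_blocks=512, t=10_000))
```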
Bibliography
[1] A. Agarwal, M. Horowitz, and J. Hennessy. An analytical cache model. ACM
Transactions on Computer Systems, 7(2), May 1989.
[2] AMD. AMD Athlon Processor Technical Brief, Nov. 1999.
[3] AMD, http://www.amd.com/K6/k6docs/pdf/21918.pdf. AMD K6-II Processor
Datasheet, Nov. 1999.
[4] D. T. Chiou. Extending the Reach of Microprocessors: Column and Curious
Caching. PhD thesis, Massachusetts Institute of Technology, 1999.
[5] D. T. Chiou, P. Jain, L. Rudolph, and S. Devadas. Application-specific memory
management for embedded systems using software-controlled caches. In the
37th conference on Design automation, 2000.
[6] Compaq. Compaq alphastation family.
[7] Compaq. Tru64 Unix V5.1 Documentation: Guide to Realtime Programming,
Aug. 2000.
[8] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, John
& Sons, Incorporated, Mar. 1991.
[9] S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, and D. M. Tullsen.
Simultaneous multithreading: A platform for next-generation processors. IEEE
Micro, 17(5), 1997.
[10] C. Freeburn. The Hewlett Packard PA-RISC 8500 processor. Technical report,
Hewlett Packard Laboratories, Oct. 1998.
[11] D. Greenley, J. Bauman, D. C. and D. Chen, R. Eltejaein, P. Ferolito, P. Fu,
R. Garner, D. Greenhill, H. Grewal, K. Holdbrook, B. Kim, L. Kohn, H. Kwan,
M. Levitt, G. Maturana, D. Mrazek, C. Narasimhaiah, K. Normoyle, N. Parveen,
P. Patel, A. Prabhu, M. Tremblay, M. Wong, L. Yang, K. Yarlagadda, R. Yu,
R. Yung, and G. Zyner. UltraSPARC: The next generation superscalar 64-bit
SPARC. In Compcon '95, 1995.
[12] J. S. Harper, D. J. Kerbyson, and G. R. Nudd. Analytical modeling of set-associative cache behavior. IEEE Transactions on Computers, 48(10), Oct. 1999.
[13] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 1996.
[14] J. L. Henning. SPEC CPU2000: Measuring CPU performance in the new millennium. IEEE Computer, July 2000.
[15] Intel, ftp://download.intel.com/design/IA-64/Downloads/24547401.pdf. Itanium Processor Microarchitecture Reference, Aug. 2000.
[16] Intel, ftp://download.intel.com/design/Pentium4/manuals/24547002.pdf. IA-32
Intel Architecture Software Developer's Manual, 2001.
[17] D. B. Kirk. Process dependent static cache partitioning for real-time systems.
In Real-Time Systems Symposium, 1988.
[18] H. Kwak, B. Lee, A. R. Hurson, S.-H. Yoon, and W.-J. Hahn. Effects of multithreading on cache performance. IEEE Transactions on Computers, 48(2), Feb.
1999.
[19] J. L. Lo, J. S. Emer, D. M. Tullsen, H. M. Levy, R. L. Stamm, and S. J.
Eggers. Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, 15, 1997.
[20] P. Magnusson and B. Werner. Efficient memory simulation in SimICS. In 28th
Annual Simulation Symposium, 1995.
[21] MIPS Technologies, Inc. MIPS R10000 Microprocessor User's Manual, 1996.
[22] J. C. Mogul and A. Borg. The effect of context switches on cache performance.
In the fourth international conference on Architectural support for programming
languages and operating systems, 1991.
[23] J. Muñoz. Data-Intensive Systems Benchmark Suite Analysis and Specification.
http://www.aaec.com/projectweb/dis, June 1999.
[24] M. Rosenblum, S. A. Herrod, E. Witchel, and A. Gupta. Complete computer system simulation: The SimOS approach. IEEE Parallel & Distributed Technology,
1995.
[25] M. S. Squillante and E. D. Lazowska. Using processor-cache affinity information
in shared-memory multiprocessor scheduling. IEEE Transactions on Parallel
and Distributed Systems, 4(2), Feb. 1993.
[26] H. S. Stone, J. Turek, and J. L. Wolf. Optimal partitioning of cache memory.
IEEE Transactions on Computers, 41(9), Sept. 1992.
[27] G. E. Suh, L. Rudolph, and S. Devadas. Set-associative cache models for time-shared systems. Technical Report CSG Memo 432, Massachusetts Institute of
Technology, 2001.
[28] D. Thiebaut and H. S. Stone. Footprints in the cache. ACM Transactions on
Computer Systems, 5(4), Nov. 1987.
[29] D. Thiebaut, H. S. Stone, and J. L. Wolf. Improving disk cache hit-ratios through
cache partitioning. IEEE Transactions on Computers, 41(6), June 1992.
[30] J. Torrellas, A. Tucker, and A. Gupta. Benefits of cache-affinity scheduling in
shared-memory multiprocessors: A summary. In the 1993 ACM SIGMETRICS
conference on Measurement and modeling of computer systems, 1993.
[31] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous multithreading:
Maximizing on-chip parallelism. In 22nd Annual International Symposium on
Computer Architecture, 1995.
[32] R. A. Uhlig and T. N. Mudge. Trace-driven memory simulation: A survey. ACM
Computing Surveys, 29(2), June 1997.
[33] M. Zagha, B. Larson, S. Turner, and M. Itzkowitz. Performance analysis using
the MIPS R10000. In Supercomputing '96, 1996.