Analytical Cache Models with Applications to Cache Partitioning in Time-Shared Systems

by Gookwon Edward Suh

B.S., Seoul National University (1999)

Submitted to the Department of Electrical Engineering and Computer Science on February 5, 2001, in partial fulfillment of the requirements for the degree of Master of Science at the Massachusetts Institute of Technology.

(c) Massachusetts Institute of Technology. All rights reserved.

Abstract

This thesis proposes an analytical cache model for time-shared systems, which estimates the overall cache miss-rate from the isolated miss-rate curve of each process when multiple processes share a cache. Unlike previous models, our model works for any cache size and any time quantum. Trace-driven simulations demonstrate that the estimated miss-rate is very accurate for fully-associative caches. Since the model provides a fast and accurate way to estimate the effect of context switching, it is useful both for understanding the effect of context switching on caches and for optimizing cache performance in time-shared systems.

The model is applied to the cache partitioning problem. First, the problems of the LRU replacement policy are studied based on the model. The study shows that proper cache partitioning can enable substantial performance improvements for short or mid-range time quanta. Second, a model-based partitioning mechanism is implemented and evaluated through simulations. In our example, model-based partitioning improves the cache miss-rate by up to 20% over the normal LRU replacement policy.

Thesis Supervisor: Srinivas Devadas, Professor of Computer Science
Thesis Supervisor: Larry Rudolph, Principal Research Scientist

Acknowledgments

It has been my pleasure to work under the expert mentorship of Doctor Larry Rudolph and Professor Srinivas Devadas. Larry worked with me on a daily basis, discussing and advising on every aspect of my research. His encouragement and advice were essential to finishing this thesis. Srini challenged me to organize my ideas and focus on real problems. His guidance was a tremendous help in finishing this thesis. I would like to thank both my advisors for their support and advice throughout my master's work.

The work described in this thesis would not have been possible without the help of my fellow graduate students in the Computational Structures Group. Dan Rosenband has been a great officemate, helping me get used to life at LCS, both professionally and personally. Derek Chiou was always generous with his deep technical knowledge, and helped me choose the thesis topic. My research started from his Ph.D. work.
Enoch Peserico provided very helpful comments on my cache models with his knowledge of algorithms and theory. Peter Portante taught me about real operating systems. Beyond my colleagues in CSG, I was fortunate to have great friends such as Todd Mills and Ali Tariq on the same floor. We shared junky lunches, worked late nights, and hung out together. Todd also had a tough time reading through all my thesis drafts and fixing my broken English writing. He provided very detailed comments and suggestions. I would like to thank them for their friendship and support.

My family always supported me during these years. Their prayer, encouragement, and love for me are undoubtedly the greatest source of my ambition, inspiration, dedication, and motivation. They have given me far more than words can express.

Most of all, I would like to thank God for His constant guidance and grace throughout my life. His words always encourage me: "See, I am doing a new thing! Now it springs up; do you not perceive it? I am making a way in the desert and streams in the wasteland" (Isaiah 43:19).

Contents

1 Introduction
  1.1 Motivation
  1.2 Previous Work
  1.3 The Contribution of This Research
  1.4 Organization of This Thesis

2 Analytical Cache Model
  2.1 Overview
  2.2 Assumptions
  2.3 Fully-Associative Cache Model
    2.3.1 Transient Cache Behavior
    2.3.2 The Amount of Data in a Cache Starting with an Empty Cache
    2.3.3 The Amount of Data in a Cache for the General Case
    2.3.4 Overall Miss-rate
  2.4 Experiments
    2.4.1 Cache Flushing Cases
    2.4.2 General Cases

3 Cache Partitioning Based on the Model
  3.1 Specifying a Partition
  3.2 Optimal Partitioning Based on the Model
    3.2.1 Estimation of Optimal Partitioning
  3.3 Tradeoffs in Cache Partitioning

4 Cache Partitioning Using LRU Replacement
  4.1 Cache Partitioning by the LRU Policy
  4.2 Comparison between Normal LRU and LRU with Optimal Partitioning
    4.2.1 The Problem of Allocation Size
    4.2.2 The Problem of Cache Layout
    4.2.3 Summary
  4.3 Problems of the LRU Replacement Policy for Each Level of Memory Hierarchy
  4.4 Effect of Low Associativity
  4.5 More Benefits of Cache Partitioning

5 Partitioning Experiments
  5.1 Implementation of Model-Based Partitioning
    5.1.1 Recording Memory Reference Patterns
    5.1.2 Cache Partitioning
  5.2 Cache Partitioning Based on Look-Ahead Information
  5.3 Experimental Results
6 Conclusion

A The Exact Calculation of E[x(t)]

Chapter 1

Introduction

The cache is an important component of modern memory systems, and its performance is often a crucial factor in determining the overall performance of the system. Processor cycle times have been reduced dramatically, but cycle times for memory access remain high. As a result, the penalty for accessing main memory has been increasing, and memory access latency has become the bottleneck of modern processor performance. The cache is a small and fast memory that keeps frequently accessed data near the processor, and can therefore reduce the number of main memory accesses [13]. Caches are managed by bringing in a new data block on demand and evicting a block based on a replacement policy. The most common replacement policy is the least recently used (LRU) policy, which evicts the block that has not been accessed for the longest time. The LRU replacement policy is very simple and at the same time tends to be efficient for most practical workloads, especially if only one process is using the cache.

When multiple running processes share a cache, they experience additional cache misses due to conflicts among processes. In the past, the effect of context switches was negligible because both caches and workloads were small. For small caches and workloads, reloading useful data into the cache takes only a small amount of time. As a result, the number of misses caused by context switches was relatively small compared to the total number of misses over a usual time quantum.

However, things have changed, and context switches can severely degrade cache performance in modern microprocessors. First, caches are much larger than before. Level 1 (L1) caches range up to a few MB [10], and L2 caches are up to several MB [6, 21]. Second, workloads have also become larger. Multimedia processes such as video or audio clips often consume hundreds of MB. Even many SPEC CPU2000 benchmarks now have a memory footprint larger than 100 MB [14]. Finally, context switches tend to cause more cache pollution than before due to the increased number of processes and the presence of streaming processes. As a result, it can be crucial for modern microprocessors to minimize inter-process conflicts by proper cache partitioning [29, 17] or scheduling [25, 30].

A method of evaluating cache performance is essential both to predict the miss-rate and to optimize cache performance. Traditionally, simulations have been used almost exclusively for cache performance evaluation [32, 24, 20]. Although simulations provide accurate results, simulation time is often too long. Moreover, simulations do not provide any intuitive understanding of the problem that could be used to improve cache performance. To predict the cache miss-rate faster, hardware monitoring can also be used [33]. However, hardware monitoring is limited to the particular cache configuration that is implemented. As a result, both simulations and hardware monitoring can only be used to evaluate the effect of context switches [22, 18].

To provide both performance prediction and the intuition to improve it, analytical cache models have been developed. One approach is to analyze a particular form of source code, such as nested loops [12]. Although this method is both accurate and intuitive, it is limited to particular types of algorithms. Another way to analytically model caches is to extract parameters of processes and combine them with other parameters defining the cache [1, 28].
These methods can predict the general trend of cache behavior. For the problem of context switching, however, they focus only on very long time quanta. Moreover, their input parameters are difficult to extract and almost impossible to obtain on-line.

This thesis presents an analytical cache model for context switches, which can estimate the overall miss-rate accurately for any cache size and any time quantum. The characteristics of each process are given by its miss-rate as a function of cache size when the process runs in isolation, which can easily be obtained either on-line or off-line. The time quantum of each process and the cache size are also given as inputs to the model. With this information, the model estimates the overall miss-rate of a given cache running an arbitrary combination of processes. The model provides good estimates for any cache size and any time quantum, and is easily applied to real problems since the input miss-rate curves are both intuitive and easy to obtain in practice. Therefore, we believe that the model is useful for any study related to the effect of context switches on cache memory.

The problem of cache partitioning among time-shared processes serves as an example of the model's applications. Optimal cache partitioning is studied based on the model, which shows the way to determine the best cache partition and reveals the problems with the LRU replacement policy. From this study, a model-based partitioning mechanism is implemented. The mechanism obtains the miss-rate characteristic of each process by off-line profiling, and performs on-line cache allocation according to the combination of processes that are executing at any given time.

1.1 Motivation

Figure 1-1 (a) shows trace-driven simulation results when six processes share the same cache. In the figure, the x-axis represents the number of memory references per time quantum, and the y-axis represents the cache miss-rate expressed as a percentage. The number of memory references per time quantum is assumed to be the same for all processes. The L1 caches are 256-KB 8-way associative, separate for the instruction stream and the data stream. The L2 cache is a 2-MB 16-way associative unified cache. Two copies each of three benchmarks are simulated: the image understanding program from the data-intensive systems benchmark suite [23], and vpr and twolf from the SPEC CPU2000 benchmark suite [14].

Figure 1-1: (a) Cache miss-rate for different lengths of time quanta, 256-KB 8-way L1, 2-MB 16-way L2 (L1 Inst is almost 0%). (b) Comparison of cache miss-rates with and without partitioning.

In the figure, the miss-rate of the L1 instruction cache is very close to 0% and the miss-rate of the L1 data cache is around 5%. However, the L2 cache miss-rate shows that the impact of context switches can be very severe even for realistic time quanta. For a time quantum of a million memory references, the L2 cache miss-rate is about 80%, but for longer time quanta the miss-rate is almost 0%. This means that the benchmarks would get miss-rates close to 0% if they used the cache alone, and therefore context switches cause most of the L2 cache misses. Even for the longer time quanta, the inter-process conflict misses are a significant fraction of the total L2 misses.
Therefore, this example shows that context switches can severely degrade cache performance, even for systems with a very high clock speed.

Figure 1-1 (b) compares L2 miss-rates with and without cache partitioning for the same cache configuration and the same benchmark set as before. The normal LRU replacement policy is used for the case without cache partitioning. For the cache partitioning case, the next reference time of each block is recorded from the trace, and a new block replaces another process' block only if the new block is accessed again before that block. The result shows that for a certain range of time quanta, such as one million memory references, partitioning can significantly reduce the miss-rate.

1.2 Previous Work

Several early investigations of the effects of context switches use analytical models. Thiebaut and Stone [28] modeled the amount of additional misses caused by context switches for set-associative caches. Agarwal, Horowitz and Hennessy [1] also included the effect of conflicts between processes in their analytical cache model and showed that inter-process conflicts are noticeable for a mid-range of cache sizes that are large enough to have a considerable number of conflicts but not large enough to hold all the working sets. However, these models work only for long enough time quanta, and require information that is hard to collect on-line.

Mogul and Borg [22] studied the effect of context switches through trace-driven simulations. Using a timesharing system simulator, their research shows that system calls, page faults, and the scheduler are the main sources of context switches. They also evaluate the effect of context switches on cycles per instruction (CPI) as well as on the cache miss-rate. Depending on cache parameters, the cost of a context switch appears to be thousands of cycles, or tens to hundreds of microseconds, in their simulations.

Stone, Turek and Wolf [26] investigated the optimal allocation of cache memory among two or more competing processes that minimizes the overall miss-rate of a cache. Their study focuses on the partitioning of instruction and data streams, which can be thought of as multitasking with a very short time quantum. Their model for this case shows that the optimal allocation occurs at a point where the miss-rate derivatives of the competing processes are equal. The LRU replacement policy appears to produce cache allocations very close to optimal for their examples. They also describe a new replacement policy for longer time quanta that only increases cache allocation based on the time remaining in the current time quantum and the marginal reduction in miss-rate due to an increase in cache allocation. However, their policy simply assumes that the probability for an evicted block to be accessed in the next time quantum is a constant; this assumption is neither validated, nor is it described how the probability would be obtained.

Thiebaut, Stone and Wolf applied their partitioning work [26] to improve disk cache hit-ratios [29]. The model for tightly interleaved streams is extended to be applicable to more than two processes. They also describe the problems in applying the model in practice, such as approximating the miss-rate derivative, non-monotonic miss-rate derivatives, and updating the partition. Trace-driven simulations for 32-MB disk caches show that the partitioning improves the relative hit-ratios in the range of 1% to 2% over the LRU policy.
1.3 The Contribution of This Research

Our analytical model differs from previous efforts, which tended to focus on specific cases of context switching. The new model works for any time quantum, whereas the previous models focus only on long time quanta. Therefore, the model can be used to study multi-tasking related problems at any level of the cache hierarchy. Since a lower-level cache only sees memory references that miss in the higher level, the effective time quantum decreases as we go up the memory hierarchy. Moreover, the model is based on a miss-rate curve, which is much easier to obtain than the footprints or the number of unique cache blocks that previous models require; this makes the model very helpful for developing solutions to cache optimization problems.

Our partitioning work based on the analytical cache model also has advantages over previous work. First, our partitioning works for any level of the memory hierarchy and for any range of time quanta, unlike, for instance, Thiebaut, Stone and Wolf's partitioning algorithm, which can only be applied to disk caches. Further, the model provides a more complete understanding of optimal partitioning for different cache sizes and time quanta.

1.4 Organization of This Thesis

The rest of this thesis is organized as follows. In Chapter 2, we derive an analytical cache model for time-shared systems. Chapter 3 discusses cache partitioning based on the model. This optimal partitioning is compared to the standard LRU replacement policy, and the problems of the LRU replacement policy are discussed in Chapter 4. In Chapter 5, we implement a model-based cache partitioning algorithm and evaluate the implementation by simulations. Finally, Chapter 6 concludes the thesis.

Chapter 2

Analytical Cache Model

2.1 Overview

The analytical cache model estimates the cache miss-rate for a multi-process situation when the cache size, the length of each time quantum, and a miss-rate curve for each process as a function of cache size are known. The cache size is given by the number of cache blocks, and the time quantum is given by the number of memory references. Both are assumed to be constants (see Figure 2-1 (a)).

2.2 Assumptions

The derivation of the model is based on several assumptions. First, the memory reference pattern of each process is assumed to be represented by a miss-rate curve that is a function of the cache size, and this miss-rate curve does not change over time. Although real applications do have dynamically changing memory reference patterns, the model's results show that the average miss-rate works very well. For abrupt changes in the reference pattern, multiple miss-rate curves can be used to estimate an overall miss-rate.

Second, we assume that there is no shared address space among processes. Otherwise, it is very difficult to estimate the amount of cache pollution since a process can use data brought into the cache by other processes. Although processes may share a small amount of shared library code, this assumption holds for common cases where each process has its own virtual address space and the shared memory space is negligible compared to the entire memory space used by the process.

Figure 2-1: (a) The overview of the analytical cache model. (b) Round-robin schedule.
Finally, we assume that processes are scheduled in a round-robin fashion with a fixed time quantum for each process, as shown in Figure 2-1 (b). Also, we assume that the least recently used (LRU) replacement policy is used. Note that although round-robin scheduling and the LRU policy are assumed for the model, models for other scheduling methods and replacement policies can easily be derived in a similar manner.

2.3 Fully-Associative Cache Model

This section derives a model for fully-associative caches. Although most real caches are set-associative, a model for fully-associative caches is very useful for understanding the effect of context switches because the model is simple. Moreover, since a set-associative cache is a group of fully-associative caches, a set-associative cache model can be derived based on the fully-associative model [27].

2.3.1 Transient Cache Behavior

Once a cache is filled with valid data, a process can be considered to be in a steady state, and, by our assumption, the miss-rate of the process does not change. The initial burst of cache misses before the steady state is reached will be referred to as the transient behavior. For special situations where a cache is dedicated to a single process for its entire execution, the transient misses are not important because the number of misses in the transient state is negligible compared to the number of misses over the entire execution, for any reasonably long execution. For multi-process cases, a process experiences transient misses whenever it restarts after a context switch. Therefore, the effect of transient misses can be substantial, causing noticeable performance degradation. Since we already know the steady-state behavior from the given miss-rate curves, we can estimate the effect of context switching once we know the transient behavior.

We make use of the following notation:

t: the number of memory references from the beginning of a time quantum.
x(t): the number of cache blocks belonging to the process after t memory references.
m(x): the miss-rate of the process with cache size x.

Note that our time t starts at the beginning of a time quantum, not at the beginning of execution. Since all time quanta for a process are identical by our assumption, we consider only one time quantum for each process.

Figure 2-2 (a) shows a snapshot of a cache at time t_0, where time t is the number of memory references from the beginning of the current time quantum. Although the cache size is C, only part of the cache is filled with the current process' data at that time. Therefore, the effective cache size at time t_0 can be thought of as the amount of the current process' data, x(t_0). The probability of a cache miss on the next memory reference is given by

\[ P_{miss}(t_0) = m(x(t_0)) \tag{2.1} \]

where m(x) is the steady-state miss-rate of the current process when the cache size is x. Once we have P_{miss}(t_0), it is easy to estimate the miss-rate over the time quantum.

Figure 2-2: (a) The probability of a miss at time t_0. (b) The number of misses obtained from the P_{miss}(t) curve.
Since one memory reference can cause at most one miss, the number of misses for the process over a time quantum can be expressed as a simple integral, as shown in Figure 2-2 (b):

\[ \mathit{misses} = \int_0^T P_{miss}(t)\,dt = \int_0^T m(x(t))\,dt \tag{2.2} \]

where T is the number of memory references in a time quantum. The miss-rate of the process is simply the number of misses divided by the number of memory references:

\[ \textit{miss-rate} = \frac{1}{T}\int_0^T P_{miss}(t)\,dt = \frac{1}{T}\int_0^T m(x(t))\,dt. \tag{2.3} \]

2.3.2 The Amount of Data in a Cache Starting with an Empty Cache

Now we need to estimate x(t), the amount of data in the cache as a function of time. We use the case where a process starts executing with an empty cache to estimate cache performance when the cache gets flushed at every context switch; virtual address caches without process IDs are good examples of such a case. We will show how to estimate x(t) for the general case as well.

Consider x^\infty(t), the amount of the current process' data at time t for an infinite-size cache, and assume that the process starts with an empty cache at time 0. There are two possibilities for x^\infty(t+1). If the (t+1)-th memory reference results in a cache miss, a new cache block is brought into the cache, and the amount of the process' data increases by one. Otherwise, the amount of data remains the same. Therefore, the amount of the process' data in the cache at time t+1 is given by

\[ x^\infty(t+1) = \begin{cases} x^\infty(t) + 1 & \text{if the } (t+1)\text{-th reference misses} \\ x^\infty(t) & \text{otherwise.} \end{cases} \tag{2.4} \]

Since the probability for the (t+1)-th memory reference to miss is m(x^\infty(t)) from Equation 2.1, the expected value of x^\infty(t+1) can be written as

\[ E[x^\infty(t+1)] = E[x^\infty(t)\cdot(1 - m(x^\infty(t))) + (x^\infty(t)+1)\cdot m(x^\infty(t))] = E[x^\infty(t)] + E[m(x^\infty(t))]. \tag{2.5} \]

Assuming that m(x) is convex (if a replacement policy is smart enough, the marginal gain of having one more cache block monotonically decreases as the cache size increases), we can use Jensen's inequality [8] and rewrite the equation as a function of E[x^\infty(t)]:

\[ E[x^\infty(t+1)] \ge E[x^\infty(t)] + m(E[x^\infty(t)]). \tag{2.6} \]

Usually, a miss-rate changes slowly, so over a short interval such as from x to x+1, m(x) can be approximated as a straight line. Since equality holds in Jensen's inequality if the function is a straight line, we can approximate the amount of data at time t+1 as

\[ E[x^\infty(t+1)] \simeq E[x^\infty(t)] + m(E[x^\infty(t)]). \tag{2.7} \]

We can calculate the expectation of x^\infty(t) more accurately by calculating the probability of every possible value at time t (see Appendix A). However, calculating a set of probabilities is computationally expensive, and our experiments show that the approximation closely matches simulation results.

If we approximate the amount of data x^\infty(t) by its expected value E[x^\infty(t)], x^\infty(t) can be expressed with a difference equation:

\[ x^\infty(t+1) - x^\infty(t) = m(x^\infty(t)), \tag{2.8} \]

which can easily be calculated in a recursive manner. Treating t as a continuous variable, we can rewrite the discrete Equation 2.8 in a continuous form:

\[ \frac{dx^\infty}{dt} = m(x^\infty). \tag{2.9} \]

Separating variables, the differential equation becomes

\[ t = \int_{x^\infty(0)}^{x^\infty(t)} \frac{dx'}{m(x')}. \tag{2.10} \]

We define a function M(x) as an integral of 1/m(x), which means that dM(x)/dx = 1/m(x); then x^\infty(t) can be written as a function of t:

\[ x^\infty(t) = M^{-1}(t + M(x^\infty(0))) \tag{2.11} \]

where M^{-1}(x) represents the inverse function of M(x). Finally, for a finite-size cache, the amount of data in the cache is limited by the cache size C. Therefore, x^0(t), the amount of a process' data when it starts from an empty cache, is written as

\[ x^0(t) = \mathrm{MIN}[x^\infty(t), C] = \mathrm{MIN}[M^{-1}(t + M(0)), C]. \tag{2.12} \]
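To make the recursion concrete, the following short Python sketch (our own illustration, not code from the thesis) evaluates the growth curve with Equation 2.8, caps it at the cache size as in Equation 2.12, and integrates the miss probability of Equation 2.3 for a process that starts its quantum with an empty cache. The function names and the synthetic curve m(x) = e^{-0.01x} are assumptions for illustration only.

```python
import math

def x_empty_start(m, T, C):
    """Amount of the process' own data after t = 0..T references, starting from an
    empty cache: x(t+1) = x(t) + m(x(t)) (Equation 2.8), capped at the cache size C
    (Equation 2.12)."""
    x = [0.0] * (T + 1)
    for t in range(T):
        x[t + 1] = min(x[t] + m(x[t]), C)
    return x

def miss_rate_with_flushing(m, T, C):
    """Miss-rate over one quantum of T references when the cache is flushed at every
    context switch (Equations 2.3 and 2.18)."""
    x = x_empty_start(m, T, C)
    return sum(m(x[t]) for t in range(T)) / T

# Illustrative example only: a synthetic convex miss-rate curve.
m = lambda x: math.exp(-0.01 * x)
print(miss_rate_with_flushing(m, T=10_000, C=512))
```

The recursion replaces the closed form M^{-1}(t + M(0)); for slowly varying m(x) the two agree closely, which is exactly the approximation argued for above.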
2.3.3 The Amount of Data in a Cache for the General Case

Section 2.3.1 showed that the miss-rate of a process can be estimated if the amount of the process' data as a function of time, x(t), is given, and the previous section estimated x(t) when a process starts with an empty cache. In this section, the amount of a process' data at time t is estimated for the general case, when the cache is not flushed at a context switch. Since we now deal with multiple processes, a subscript i is used to represent Process i; for example, x_i(t) represents the amount of Process i's data at time t.

The estimation of x_i(t) is based on round-robin scheduling (see Figure 2-1 (b)) and the LRU replacement policy. Process i runs for a fixed-length time quantum T_i, and during each time quantum the running process is allowed to use the entire cache. For simplicity, processes are assumed to be of infinite length so that there is no change in the schedule, and the initial startup transient from an empty cache is ignored since it is negligible compared to the steady state.

To estimate the amount of a process' data at a given time, imagine the snapshot of an infinite-size cache after executing Process i for time t, as shown in Figure 2-3. We assume that time is 0 at the beginning of the process' time quantum. In the figure, blocks on the left side hold recently used data and blocks on the right side hold old data. P_{j,k} represents the data of Process j, and the subscript k specifies the most recent time quantum in which the data were referenced.

Figure 2-3: The snapshot of a cache after running Process i for time t.

From the figure, we can obtain x_i(t) once we know the size of all the P_{j,k} blocks. The size of each block can be estimated using the x_j^0(t) curve from Equation 2.12, which is the amount of Process j's data when the process starts with an empty cache. Since x_j^0(t) can also be thought of as the expected amount of data referenced from time 0 to time t, x_j^0(T_j) is the expected amount of data referenced over one time quantum. Similarly, the amount of data referenced over the k most recent time quanta can be estimated as x_j^0(k · T_j). As a result, the expected size of block P_{j,k} can be written as

\[ P_{j,k} = \begin{cases} x_i^0(t + (k-1)\cdot T_i) - x_i^0(t + (k-2)\cdot T_i) & \text{if } j = i \text{ (the executing process)} \\ x_j^0(k\cdot T_j) - x_j^0((k-1)\cdot T_j) & \text{otherwise} \end{cases} \tag{2.13} \]

where we assume that x_j^0(t) = 0 if t < 0.

x_i(t) is the sum of the P_{i,k} blocks that are inside the cache of size C in Figure 2-3. Define l_i(t) as the maximum integer value that satisfies the inequality

\[ x_i^0(t + (l_i(t)-1)\cdot T_i) + \sum_{j=1, j\neq i}^{N} x_j^0(l_i(t)\cdot T_j) \le C, \tag{2.14} \]

where N is the number of processes; l_i(t) + 1 then represents how many P_{i,k} blocks are in the cache. From l_i(t) and Figure 2-3, the expected value of x_i(t) is

\[ x_i(t) = \begin{cases} x_i^0(t + l_i(t)\cdot T_i) & \text{if } x_i^0(t + l_i(t)\cdot T_i) + \sum_{j=1, j\neq i}^{N} x_j^0(l_i(t)\cdot T_j) \le C \\ C - \sum_{j=1, j\neq i}^{N} x_j^0(l_i(t)\cdot T_j) & \text{otherwise.} \end{cases} \tag{2.15} \]

Figure 2-4: The relation between x_i^0(t) and x_i(t).

Figure 2-4 illustrates the relation between x_i^0(t) and x_i(t); in the figure, l_i(t) is assumed to be 2. Unlike the cache flushing case, a process can start with some of its data left in the cache.
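The case analysis of Equations 2.14 and 2.15 is easy to misread in prose, so here is a small Python sketch of it (our own illustration under the stated assumptions, not code from the thesis). It assumes each x_j^0 curve is available as a callable that returns 0 for negative arguments and finds l_i(t) by simple search.

```python
def x_general(i, t, x0_curves, T, C, max_rounds=1000):
    """Expected amount of Process i's data after t references into its quantum when
    the cache is not flushed (Equations 2.14 and 2.15).
    x0_curves[j](s) is Process j's empty-cache curve from Equation 2.12."""
    others = [j for j in range(len(T)) if j != i]
    # l is the number of past quanta of Process i whose blocks are still cached (Eq. 2.14).
    l = 0
    while l < max_rounds and \
            x0_curves[i](t + l * T[i]) + sum(x0_curves[j]((l + 1) * T[j]) for j in others) <= C:
        l += 1
    other_data = sum(x0_curves[j](l * T[j]) for j in others)
    # Equation 2.15: either block P_{i,l+1} fits entirely, or it is cut off at the cache size.
    return min(x0_curves[i](t + l * T[i]), C - other_data)
```

Evaluating x_general(i, 0, ...) gives x_i(0), the amount of data Process i finds at the start of its quantum, which is exactly the quantity the overall miss-rate calculation of Section 2.3.4 needs.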
The amount of initial data x_i(0) is given by Equation 2.15. If the least recently used (LRU) data in the cache does not belong to Process i, x_i(t) increases in the same way as x_i^0(t). However, if the LRU data belongs to Process i, x_i(t) does not increase on cache misses, since one of Process i's own blocks gets replaced.

Define t_{start}(j,k) as the time when the k-th MRU block of Process j (P_{j,k}) becomes the LRU part of the cache, and t_{end}(j,k) as the time when P_{j,k} gets completely replaced from the cache (see Figure 2-3). t_{start}(j,k) and t_{end}(j,k) specify the flat segments in Figure 2-4 and can be estimated from the following equations, based on Equation 2.13:

\[ x_j^0(t_{start}(j,k) + (k-1)\cdot T_j) + \sum_{p=1, p \neq j}^{N} x_p^0((k-1)\cdot T_p) = C, \tag{2.16} \]

\[ x_j^0(t_{end}(j,k) + (k-2)\cdot T_j) + \sum_{p=1, p \neq j}^{N} x_p^0((k-1)\cdot T_p) = C. \tag{2.17} \]

t_{start}(j, l_j(t)+1) is taken to be zero if the equation above results in a negative value, which means that block P_{j, l_j(t)+1} is already in the LRU part of the cache at the beginning of a time quantum. Also, t_{end}(j,k) is only defined for k from 2 to l_j(t)+1.

2.3.4 Overall Miss-rate

Section 2.3.1 explains how to estimate the miss-rate of each process when the amount of the process' data as a function of time, x_i(t), is given, and the previous sections showed how to estimate x_i(t). This section presents the overall miss-rate calculation.

When a cache uses virtual address tags and gets flushed at every context switch, each process starts a time quantum with an empty cache. In this case, the miss-rate of a process can be estimated from the results of Sections 2.3.1 and 2.3.2. From Equations 2.3 and 2.12, the miss-rate for Process i can be written as

\[ \textit{miss-rate}_i = \frac{1}{T_i} \int_0^{T_i} m_i(\mathrm{MIN}[M_i^{-1}(t + M_i(0)), C])\,dt. \tag{2.18} \]

If a cache uses physical address tags, or has process IDs with virtual address tags, it does not have to be flushed at a context switch. In this case, the amount of data x_i(t) is estimated as in Section 2.3.3, and the miss-rate for Process i can be written as

\[ \textit{miss-rate}_i = \frac{1}{T_i} \int_0^{T_i} m_i(x_i(t))\,dt \tag{2.19} \]

where x_i(t) is given by Equation 2.15. For the actual calculation of the miss-rate, t_{start}(j,k) and t_{end}(j,k) from Equations 2.16 and 2.17 can be used. Since t_{start}(j,k) and t_{end}(j,k) specify the flat segments in Figure 2-4, the miss-rate of Process i can be rewritten as

\[ \textit{miss-rate}_i = \frac{1}{T_i}\Big\{ \int_0^{T_i'} m_i(\mathrm{MIN}[M_i^{-1}(t + M_i(x_i(0))), C])\,dt + \sum_{k=d_i}^{l_i(0)+1} m_i\big(x_i^0(t_{start}(i,k) + (k-1)\cdot T_i)\big)\cdot\big(\mathrm{MIN}[t_{end}(i,k), T_i] - t_{start}(i,k)\big) \Big\} \tag{2.20} \]

where d_i is the minimum integer value that satisfies t_{start}(i, d_i) < T_i, and T_i' is the time during which Process i's data actually grows:

\[ T_i' = T_i - \sum_{k=d_i}^{l_i(0)+1} \big(\mathrm{MIN}[t_{end}(i,k), T_i] - t_{start}(i,k)\big). \tag{2.21} \]

As shown above, calculating the miss-rate can be complicated if we do not flush the cache at a context switch. If we assume that the executing process' data left in the cache are all in the most recently used part of the cache, we can reuse the equation for the amount of data starting from an empty cache, and the calculation is much simplified:

\[ \textit{miss-rate}_i = \frac{1}{T_i} \int_0^{T_i} m_i(\mathrm{MIN}[M_i^{-1}(t + M_i(x_i(0))), C])\,dt \tag{2.22} \]

where x_i(0) is estimated from Equation 2.15. The effect of this approximation is evaluated in the experiment section (cf. Section 2.4).

Once we calculate the miss-rate of each process, the overall miss-rate is straightforwardly calculated from those miss-rates:

\[ \text{Overall miss-rate} = \frac{\sum_{i=1}^{N} \textit{miss-rate}_i \cdot T_i}{\sum_{i=1}^{N} T_i}. \tag{2.23} \]
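As a quick illustration of the simplified calculation, the sketch below (again our own, with hypothetical names) grows each process' data from its initial amount x_i(0) exactly as in the empty-cache case (Equation 2.22) and then combines the per-process miss-rates with the weights of Equation 2.23. The x_init values would come from the general-case estimate of Section 2.3.3; passing zeros reproduces the cache-flushing case.

```python
def quantum_miss_rate(m, T, C, x_init):
    """Approximate miss-rate of one process over a quantum of T references
    (Equation 2.22): start with x_init blocks of its own data and grow as if from
    an empty cache, capped at the cache size C."""
    x, misses = x_init, 0.0
    for _ in range(T):
        p = m(x)                # miss probability with x blocks of own data cached
        misses += p
        x = min(x + p, C)       # a miss brings in one block, up to the cache size
    return misses / T

def overall_miss_rate(m_curves, T, C, x_init):
    """Combine the per-process miss-rates into the overall miss-rate (Equation 2.23)."""
    rates = [quantum_miss_rate(m_curves[i], T[i], C, x_init[i]) for i in range(len(T))]
    return sum(r * t for r, t in zip(rates, T)) / sum(T)
```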
2.4 Experiments

In this section, we validate the fully-associative cache model by comparing estimated miss-rates with simulation results. A few different combinations of benchmarks are modeled and simulated for various time quanta. First, we simulate cases where the cache gets flushed at every context switch, and compare the results with the model's estimates. Cases without cache flushing are also tested; for those, both the complete model (Equation 2.20) and the approximation (Equation 2.22) are used to estimate the overall miss-rate. Based on the simulation results, the error of the approximation is discussed.

2.4.1 Cache Flushing Cases

The results of the cache model and the simulations for cases when a process starts its time quantum with an empty cache are shown in Figure 2-5. Four benchmarks from SPEC CPU2000 [14] are tested: vpr, vortex, gcc and bzip2. The cache is a 32-KB fully-associative cache with a block size of 32 Bytes. The miss-rate of each process is plotted as a function of the length of the time quantum, and the simulation results at several time quanta are also shown. The figure shows a good agreement between the model's estimates and the simulation results.

As inputs to the cache model, the average miss-rate of each process was obtained from simulations. Each process was simulated for 25 million memory references, and the miss-rates of the process for various cache sizes were recorded. The simulation results were likewise obtained by simulating the benchmarks for 25 million memory references while flushing the cache every T memory references. As the results show, the average miss-rate works very well.

Figure 2-5: The result of the cache model for cache flushing cases. (a) vpr. (b) vortex. (c) gcc. (d) bzip2.

2.4.2 General Cases

Figure 2-6 shows the result of the cache model when two processes share a cache. The two benchmarks are vpr and vortex from SPEC CPU2000, and the cache is a 32-KB fully-associative cache with 32-B blocks. The overall miss-rates are shown in Figure 2-6 (a). As shown in the figure, the miss-rate estimated by the model agrees well with the results of the simulations.

Figure 2-6: The result of the cache model when two processes (vpr, vortex) are sharing a cache (32 KB, fully-associative). (a) The overall miss-rate. (b) The initial amount of data x_i(0).

The figure also shows the interesting fact that a certain range of time quanta can be very problematic for cache performance. For short time quanta, the overall miss-rate is relatively small. For very long time quanta, context switches do not matter since a process spends most of its time in the steady state. However, medium time quanta can severely degrade cache miss-rates, as shown in the figure. This problem occurs when a time quantum is long enough to pollute the cache but not long enough to compensate for the misses caused by context switches. The problem becomes clear in Figure 2-6 (b).
The figure shows the amount of data left in the cache at the beginning of a time quantum. Comparing Figure 2-6 (a) and (b), we can see that the problem occurs when the initial amount of data rapidly decreases.

The error caused by our approximation method (Equation 2.22) can also be seen in Figure 2-6. In the approximation, we assume that the data left in the cache at the beginning of a time quantum are all in the MRU region of the cache. In reality, however, the data left in the cache could be in the LRU positions and get replaced before other processes' blocks, even though the current process' data are likely to be accessed in the time quantum. As a result, the approximated miss-rate is lower than the simulation result when the initial amount of data is not zero.

Figure 2-7: The overall miss-rate when four processes (vpr, vortex, gcc, bzip2) are sharing a cache (32 KB, fully-associative).

A four-process case is also modeled in Figure 2-7. Two more benchmarks from SPEC CPU2000 [14], gcc and bzip2, are added to vpr and vortex, and the same cache configuration is used as in the two-process case. The figure again shows a very close agreement between the miss-rate estimated by the cache model and the miss-rate from simulations. The problematic time quanta and the effect of the approximation have changed, however. Since there are more processes polluting the cache than in the two-process case, a process faces an empty cache after shorter time quanta, and the problematic time quanta therefore become shorter. On the other hand, the effect of the approximation is less harmful in this case, because the error in one process' miss-rate becomes less important as we have more processes.

Chapter 3

Cache Partitioning Based on the Model

In this chapter, we discuss optimal cache partitioning for multiple processes running on one shared cache. The discussion is mainly based on the analytical cache model presented in Chapter 2, and the presented partition is optimal in the sense that it minimizes the overall miss-rate of the cache.

The discussion in this chapter is based on assumptions similar to those of the analytical model. First, the memory reference characteristic of each process is assumed to be given by a miss-rate curve that is a function of cache size. To incorporate the effect of the dynamically changing behavior of a process, we assume that the memory reference pattern of a process in each time quantum is specified by a separate miss-rate curve. Therefore, each process has a set of miss-rate curves, one for each time quantum. Within a time quantum, we assume that one miss-rate curve is sufficient to represent a process.

Second, caches are assumed to be fully-associative. Although most real caches are set-associative, it is much easier to understand the tradeoffs of cache partitioning using a simple fully-associative model rather than a complicated set-associative model. Also, since a set-associative cache is a group of fully-associative caches, the optimal partition could be obtained by applying a fully-associative partition to each set.

Figure 3-1: An example of a cache partition with N processes sharing a cache.
Finally, processes are assumed to be scheduled in a round-robin fashion. In this chapter, we use the term "the length of a time quantum" mainly to represent the number of memory references in a time quantum rather than a real time period.

3.1 Specifying a Partition

To study cache partitioning, we need a method to express the amount of data in the cache at a certain time, as shown in Figure 3-1. The figure illustrates the execution of N processes with round-robin scheduling; we define the term round as one cycle in which all N processes are executed. A round starts with Process 1 and ends with Process N. Using a round number (r) and a process number (p), TQ(r,p) represents Process p's time quantum in the r-th round. Finally, we define X_p^r as the number of cache blocks belonging to Process p at the beginning of TQ(r,p).

We specify a cache partition by dedicated areas and a shared area. The dedicated area of Process i is cache space that only Process i uses over different time quanta. If we define D_i^r as the size of Process i's dedicated area between TQ(r-1,i) and TQ(r,i), then D_i^r equals X_i^r. For static partitions, when memory reference patterns do not change over time, we can ignore the round number and express Process i's dedicated area as D_i.

The rest of the cache is called a shared area. A process brings its data into the shared area during its own time quantum; after the time quantum, the process' blocks in this area are evicted by other processes. Therefore, a shared area can be thought of as additional cache space a process can use while it is active. S^{(r,i)}, the size of the shared area in TQ(r,i), can be written as

\[ S^{(r,i)} = C - \sum_{p=1}^{i} D_p^{r+1} - \sum_{p=i+1}^{N} D_p^{r} \tag{3.1} \]

where C is the cache size. The size of the shared area is written simply as S for static partitioning.

3.2 Optimal Partitioning Based on the Model

A cache partition can be evaluated by the transient cache model presented in Section 2.3.1. To estimate the number of misses for a process in a time quantum, the model needs the cache size (C), the initial amount of data (x_i(0)), the number of memory references in a time quantum (T_i), and the miss-rate curve of the process (m_i(x)). A cache partition effectively specifies the cache size C and the initial amount of data x_i(0) for each process. Therefore, we can estimate the total number of misses for a particular cache partition from the model.

From the definition of a dedicated area, Process i starts its time quantum TQ(r,i) with D_i^r cache blocks in the cache. In the cache model, the cache size C stands for the maximum number of blocks a certain process can use in its time quantum; therefore, the effective cache size for Process i in time quantum TQ(r,i) is D_i^r + S^{(r,i)}. For each time quantum TQ(r,i), a miss-rate curve m_i^r(x) and the number of memory references T_i^r are assumed to be given as inputs. From Equations 2.3 and 2.12, we can write the number of misses over time quantum TQ(r,i) as follows:
\[ \mathit{miss}_{TQ(r,i)} = \mathit{miss}\big(m_i^r(x),\, T_i^r,\, D_i^r,\, D_i^r + S^{(r,i)}\big) = \int_0^{T_i^r} m_i^r(x_i(t))\,dt = \int_0^{T_i^r} m_i^r\big(\mathrm{MIN}[(M_i^r)^{-1}(t + M_i^r(D_i^r)),\, D_i^r + S^{(r,i)}]\big)\,dt \tag{3.2} \]

where dM_i^r(x)/dx = 1/m_i^r(x) as in Equation 2.11. In this equation, miss(m(x), T, D, C) represents the number of misses for a time quantum given a miss-rate curve m(x), the length of the time quantum T, the initial amount of data D, and the cache size C. If we assume that Process i terminates after R_i rounds of time quanta, the total number of misses can be written as

\[ \text{total misses} = \sum_{p=1}^{N} \sum_{r=1}^{R_p} \mathit{miss}_{TQ(r,p)}. \tag{3.3} \]

In practical cases, when processes execute for a long enough time, the number of time quanta R_i can be considered infinite.

Note that this equation is based on two assumptions. First, the replacement policy assumed by the equation is slightly different from the LRU replacement policy: our augmented LRU replacement policy replaces the dormant processes' data in the shared area before starting to replace the executing process' data. Once the dormant processes' data are all replaced, our replacement policy is the same as the normal LRU replacement policy. Second, the equation assumes that the dedicated area of a process keeps the most recently used blocks of the process.

Since S^{(r,i)} is determined by the D_i^r, the total number of misses (Equation 3.3) can be thought of as a function of the D_i^r. The optimal partition is achieved by the set of integers D_i^r that minimizes Equation 3.3. For a cache without prefetching, D_i^1 is always zero since the cache is empty when a process starts its execution for the first time. Also, D_i^r cannot be larger than the amount of Process i's data at the end of its previous time quantum.

3.2.1 Estimation of Optimal Partitioning

Although we can evaluate a given cache partition, it is not an easy task to find the partition that minimizes the total number of misses. Since the size of a dedicated area can be different for each round, an infinite number of combinations are possible for a partition. Moreover, the optimal size of the dedicated areas for a round depends on the size of the dedicated areas for the next round. For example, consider the optimal value of D_2^r: it affects the number of misses in the r-th round (miss_{TQ(r,2)}), which also depends on the size of Process 1's dedicated area in the (r+1)-th round. As a result, determining the optimal partition requires the comparison of an infinite number of combinations, which is intractable.

This problem can be solved by breaking the dependency between different rounds. By assuming that the size of the dedicated areas changes slowly, we can approximate the size of the dedicated areas in the next round by the size of the dedicated areas in the current round, D_i^{r+1} ≈ D_i^r, which enables us to separate the partition of the current round from the partition of the next round. The number of misses over one round that is affected by D_i^r can then be written as follows, from Equation 3.2:

\[ \mathit{miss}_D(D_i^r) = \sum_{p=1}^{i} \mathit{miss}\big(m_p^r,\, T_p^r,\, D_p^r,\, D_p^r + S^{(r,p)}\big) + \sum_{p=i+1}^{N} \mathit{miss}\big(m_p^{r-1},\, T_p^{r-1},\, D_p^{r-1},\, D_p^{r-1} + S^{(r-1,p)}\big). \]

Since we assume that D_i^{r+1} ≈ D_i^r and the size of the dedicated areas for the first round is given (D_i^1 = 0), we can determine the optimal values of D_i^r round by round.
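One possible reading of this procedure is the brute-force sketch below (our own simplification, not the thesis' implementation): under the static assumption D_i^{r+1} = D_i^r, it evaluates every candidate split of the cache into dedicated areas plus one shared area, using the transient model of Equation 3.2 for each quantum, and keeps the split with the fewest misses in a round. The step size only exists to keep the search small; a real implementation would exploit the round-by-round structure described above.

```python
from itertools import product

def quantum_misses(m, T, D, C_eff):
    """Misses in one quantum (Equation 3.2): the process starts with D blocks of its
    own data and may grow up to the effective cache size C_eff = D + shared area."""
    x, misses = float(D), 0.0
    for _ in range(T):
        p = m(x)
        misses += p
        x = min(x + p, C_eff)
    return misses

def best_static_partition(m_curves, T, C, step=32):
    """Exhaustive search over static dedicated-area sizes (multiples of `step`);
    whatever is not dedicated becomes the shared area used by every process.
    Intended only for small illustrative examples."""
    N = len(m_curves)
    best_misses, best_alloc = float("inf"), None
    for D in product(range(0, C + 1, step), repeat=N):
        if sum(D) > C:
            continue
        shared = C - sum(D)
        total = sum(quantum_misses(m_curves[i], T[i], D[i], D[i] + shared) for i in range(N))
        if total < best_misses:
            best_misses, best_alloc = total, D
    return best_alloc, C - sum(best_alloc)
```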
3.3 Tradeoffs in Cache Partitioning

In the previous section, we discussed a mathematical way to determine the optimal partition based on the analytical cache model. However, that discussion is mainly based on the equations and does not provide an intuitive understanding of cache partitioning. This section discusses the same optimal partitioning while focusing on the tradeoffs in cache partitioning rather than on a method to determine the optimal partition. For simplicity, the discussion in this section assumes one miss-rate curve per process and therefore static cache partitioning.

Cache partitioning can be thought of as allocating cache blocks among dedicated and shared areas. Each cache block can be used either as a dedicated area for one process or as a shared area for all processes. Dedicated cache blocks can reduce inter-process conflicts and capacity misses for a single process, while shared cache blocks can reduce capacity misses for multiple processes. Unfortunately, there are only a fixed number of cache blocks. As a result, there is a tradeoff between (i) reducing inter-process conflicts and capacity misses for one process and (ii) reducing capacity misses for multiple processes.

A dedicated area can improve the cache miss-rate of a process by reducing misses caused by either conflicts among processes or lack of cache space. Specifically, if data left in the dedicated area from past time quanta are accessed, we avoid inter-process conflict misses that would have occurred without the dedicated area. A dedicated area also provides more cache space for the process, and may reduce capacity misses within a time quantum. As a result, a dedicated area can improve both the transient and the steady-state behavior of the process. However, the advantage of a dedicated area is limited to one process, and a shared area has the same benefit of reducing capacity misses.

Figure 3-2: The advantage of having a dedicated area.

The advantage of having a dedicated area is shown in Figure 3-2, based on the analytical cache model. The dashed line in the figure represents the probability of a miss at time t for a process without a dedicated area; time 0 represents the beginning of a time quantum. Without a dedicated area, the process starts a time quantum with an empty cache, so the miss probability starts at 1 and decreases as the cache is filled with more data. Once the process fills the entire shared area, the miss probability becomes a constant m(S), where S is the size of the shared area. Since time corresponds to the number of memory references, the area below the curve from time 0 to T represents the number of misses expected for the process without a dedicated area.

Now consider a case with a dedicated area D. Since the amount of data at the beginning of a time quantum is D, the miss probability starts at m(D) and flattens at m(D+S), as shown by the solid line in the figure. The area below the solid line between time 0 and T shows the number of misses expected in this case. If m(D) is smaller than 1, the dedicated area improves the transient miss-rate of the process. If m(D+S) is smaller than m(S), the dedicated area improves the steady-state miss-rate of the process. Even though a dedicated area can improve both transient and steady-state miss-rates, the improvement of the transient miss-rate is the unique advantage of a dedicated area.

A larger shared area can reduce the number of capacity misses by allowing a process to keep more data blocks while it is active. A process starts a time quantum with the data in its dedicated area, and brings new data into the cache when it misses. The amount of the active process' data increases until both the dedicated area for the process and the shared area are full of its data; after that, the active process recycles the cache space it has. As a result, the size of the shared area together with the size of a process' dedicated area determines the maximum cache space the process can use.

Figure 3-3: The advantage of having a larger shared area for a process.
A larger shared area can increase cache space for all processes, and can therefore improve the steady-state miss-rate of all processes. Figure 3-3 illustrates the advantage of having a larger shared area for a process. The dashed line represents the probability of a miss as a function of time with a shared area S_1; time 0 represents the beginning of a time quantum. The process' expected number of misses is the area below the curve from time 0 to T, where T is the length of the time quantum. The miss probability starts from a certain value and becomes constant once both the dedicated and the shared area are full. A cache with a larger shared area S_2 can store more data, and the steady-state miss probability therefore becomes m(D+S_2), as represented by the solid line in the figure. A larger shared area reduces the number of misses if m(D+S_2) is smaller than m(D+S_1). The improvement can be significant if the time quantum is long and the process therefore stays in a steady state for a long time. On the other hand, a larger shared area may not improve the miss-rate at all if the process does not reach a steady state. Since a shared area is used by all processes, the advantage of having a larger shared area is the sum of the advantages for all the processes.

Figure 3-4: Optimal cache partitioning for various cache sizes.

For small caches, the advantage of increasing the shared area usually outweighs the advantage of increasing a dedicated area. If a cache is small, reloading cache blocks takes only a short period compared to the length of a time quantum, and therefore steady-state cache behavior dominates. At the same time, the working set of a process is likely to be larger than the cache, which means that more cache space can actually improve the steady-state miss-rate of the process. As a result, the optimal partitioning is to allocate the whole cache as a shared area, as shown in Figure 3-4. The figure shows optimal cache partitioning for various cache sizes when two identical processes run with a time quantum of ten thousand memory references. The miss-rate curve of the processes, m(x), is assumed to be e^{-0.01x}.

Assuming the same time quanta and processes, a shared area becomes less important for larger caches. As the size of the shared area increases, the steady-state miss-rate becomes less important since a process spends more and more time refilling the shared area. Also, the marginal gain of having a larger cache usually decreases as a process has larger usable cache space (a shared area plus a dedicated area). Once the processes are unlikely to be helped by having more shared cache blocks, it is better to use those blocks as dedicated ones. For very large caches that can hold the working sets of all the processes, it is optimal to allocate the whole cache to dedicated areas so that no process experiences any inter-process conflict.

Figure 3-5: Optimal cache partitioning for various time quanta. (a) The amount of Process 1's dedicated area. (b) The amount of shared area.

The length of the time quanta also affects optimal partitioning. Generally, longer time quanta make a shared area more important. As the time quantum of a process becomes longer, the steady-state miss-rate becomes more important since the process is likely to spend more time in a steady state for a given cache space. As a result, the process needs more cache space while it is active, unless it already has enough space to keep all of its data. An example of optimal partitioning for various time quanta is shown in Figure 3-5. The figure shows the optimal amount of Process 1's dedicated area and of the shared area when two identical processes run on a cache of 512 blocks. The time quanta of the two processes, T_1 and T_2, measured in memory references, are shown on the x-axis and y-axis respectively. The miss-rate curve of the processes, m(x), is assumed to be e^{-0.01x}. As discussed above, the figure shows that the shared area should be larger for longer time quanta. If both time quanta are short, the majority of cache blocks are used as dedicated ones since processes are not likely to grow within a time quantum. In the case of long time quanta, the whole cache is used as a shared area.
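The qualitative trend in Figures 3-4 and 3-5 can be probed with the search sketch from Section 3.2.1 (this assumes `best_static_partition` from that sketch is in scope). The exponential curve and the 512-block cache below mirror the setup described in the text, but whatever numbers this prints are only illustrative probes of the trend, not the thesis' results.

```python
import math

m = lambda x: math.exp(-0.01 * x)        # the synthetic curve used in this chapter
for T in (1_000, 10_000, 100_000):       # short, medium, and long time quanta
    alloc, shared = best_static_partition([m, m], [T, T], C=512, step=64)
    print(f"T={T}: dedicated={alloc}, shared={shared}")
```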
Finally, the characteristics of the processes affect optimal partitioning. Generally, processes with large working sets need a large shared area, and vice versa. For example, if the processes are streaming applications, having a large shared area may not improve performance at all even though the time quanta are long. But if the processes have a random memory access pattern with a large working set, having a larger cache usually improves performance, and it may make sense to allocate a larger shared area.

Figure 3-6: Optimal cache partitioning for various miss-rate curves. (a) The amount of Process 1's dedicated area. (b) The amount of shared area.

Figure 3-6 shows optimal partitioning between two processes with various miss-rate curves when the cache size is 512 blocks and the length of a time quantum is 10000 memory references. The miss-rate curve of Process i, m_i(x), is assumed to be e^{-b_i x}. If a process has a large b_i, its miss-rate curve drops rapidly and the size of its working set is small. If a process has a small b_i, its miss-rate curve drops slowly and the size of its working set is large.

Chapter 4

Cache Partitioning Using LRU Replacement

The least recently used (LRU) replacement policy is the most popular cache management scheme used in modern microprocessors. In this chapter, we evaluate the LRU replacement policy in terms of cache partitioning. Cache partitioning by the LRU replacement policy and its problems are discussed. Based on the cache model, we also discuss how and when the LRU replacement policy can be improved by proper cache partitioning.

4.1 Cache Partitioning by the LRU Policy

The LRU replacement policy does not partition a cache explicitly; all processes can use the whole cache space while they are executing. However, the LRU replacement policy still leaves a certain amount of a process' data in the cache when the process returns from a context switch. Therefore, the LRU replacement policy effectively partitions the cache, and the partition can be represented by dedicated and shared areas. The size of a dedicated area for a process is the amount of the process' data left in the cache at the beginning of its time quantum, which can be estimated from the model in Section 2.3.3. The rest of the cache space is a shared area.
As a result, each process usually consumes more cache space than in the optimal case. The LRU policy also tries to keep data from the most recent time quanta, which usually results in small dedicated areas under a round-robin schedule.

4.2 Comparison between Normal LRU and LRU with Optimal Partitioning

If a cache is small enough for the given time quanta, the LRU policy does the right cache partitioning: the entire cache is used as a shared area. Therefore, optimal partitioning cannot appreciably improve the performance in this case. Also, the performance improvement from optimal partitioning is marginal if the cache is large enough to hold the working sets of all processes. Even though the LRU policy may result in a non-optimal cache partition, the miss-rate difference between the normal LRU policy and the LRU policy with optimal partitioning is small for very large caches since most of the useful data is in the cache anyway.

Cache partitioning by the normal LRU replacement policy could be problematic for mid-range cache sizes and time quanta. The resulting cache partition is determined by the size and the location of the cache space that each process uses in its time quantum. Considering one cycle of time quanta, cache space that is accessed by only one process is a dedicated area, and space that is accessed by multiple processes is a shared area. Unfortunately, the LRU replacement policy has problems in both the amount of cache space it allocates to a process and the layout of that cache space.

4.2.1 The Problem of Allocation Size

The LRU replacement policy determines the amount of cache space for a process mainly based on the number of misses that occur in the process' time quantum. This can result in a non-optimal cache allocation with poor cache performance. The problem is clearest for very short time quanta, such as those with only a few memory references. In these cases, the entire cache can be considered as dedicated areas because processes are not likely to consume a significant amount of additional cache space.

The LRU replacement policy allocates cache space proportional to the number of misses each process experiences. However, the optimal cache partition is achieved by allocating cache space based on the actual advantage that the space provides to a process. That is, additional cache blocks should be given to the process that has the biggest reduction in miss-rate from having those blocks. Therefore, the LRU replacement policy is problematic when a process with a higher miss-rate has a smaller marginal gain than other processes. A typical example is a streaming application running together with an application that requires a large amount of cache space. The LRU replacement policy tends to allocate too much cache space to the streaming application even though it is better to keep the other process' data.

The problem is shown in Figure 4-1. The figure shows the relative miss-rate improvement of optimal partitioning compared to the normal LRU replacement policy. The relative improvement is calculated by the following equation:

$$\text{Relative improvement} = \frac{\text{miss-rate}_{\text{normal LRU}} - \text{miss-rate}_{\text{optimal partitioning}}}{\text{miss-rate}_{\text{normal LRU}}}. \qquad (4.1)$$

Time quanta of the processes are assumed to be the same length. One process is a streaming application that shows a miss-rate of 12.5% with one cache block. The other is a random memory reference application whose miss-rate decreases linearly with more cache blocks and becomes zero for caches larger than 2048 blocks.
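A minimal sketch of the marginal-gain rule described above, applied to the two hypothetical miss-rate curves of this example (the function names and the cache size are our own choices for illustration). It treats the whole cache as dedicated space, which corresponds to the short-time-quantum case, and hands out blocks one at a time to whichever process would save the most misses per time quantum:

```python
def greedy_allocation(curves, quanta, C):
    """Allocate C blocks one at a time, each to the process whose expected misses
    per time quantum drop the most from one extra block (the marginal-gain rule).
    Shared areas and transient refill are ignored."""
    alloc = [1] * len(curves)                      # every process starts with one block
    for _ in range(C - sum(alloc)):
        gains = [quanta[i] * (curves[i](alloc[i]) - curves[i](alloc[i] + 1))
                 for i in range(len(curves))]
        alloc[gains.index(max(gains))] += 1
    return alloc

stream = lambda x: 0.125                           # streaming: 12.5% regardless of space
random_app = lambda x: max(0.0, 1.0 - x / 2048.0)  # linear, zero beyond 2048 blocks

print(greedy_allocation([stream, random_app], [10_000, 10_000], 2048))
# The random memory reference application receives nearly the whole cache; the
# streaming application keeps one block, since extra blocks never reduce its miss-rate.
```

The LRU policy, by contrast, hands space to whichever process misses most often, and the streaming application misses constantly.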
The miss-rate curves are shown in Figure 4-1 (a). Figure 4-1 (b) illustrates the relative improvement of optimal cache partitioning over the normal LRU replacement policy for various cache sizes and time quanta. The darker area represents larger relative improvement. Finally, slices of Figure 4-1 (b) for some time quanta are shown in Figure 4-1 (c).

Figure 4-1: Improvement by optimal partitioning for a random memory reference application with a streaming application. (a) Miss-rate curves of processes. (b) (c) Relative improvement.

As discussed before, the figure shows that the normal LRU replacement policy performs very close to the LRU with optimal partitioning if a cache is relatively small for a time quantum or is extremely large, holding everything. But proper partitioning improves the miss-rate by up to 60% for mid-size caches with a small time quantum. The problem is severe for caches where the random memory reference application shows a lower miss-rate than the streaming application, which corresponds to caches larger than about 1800 blocks. Longer time quanta usually mitigate the problem of non-optimal allocation size since the initial amount of data becomes less important for longer time quanta.

4.2.2 The Problem of Cache Layout

The layout of a given cache space for each process may also be a problem. The LRU replacement policy allocates additional cache blocks to an active process by replacing LRU blocks, and tries to keep the data of recently executed processes. Therefore, two consecutive processes do not share any cache space as long as the cache is large enough to hold both of their spaces, which usually results in more cache pollution and smaller dedicated areas than optimal partitioning. Generally, the LRU replacement policy can suffer from small dedicated areas even when it allocates the right amount of cache space to each process.

The problem is especially severe for round-robin scheduling. Consider the case when nine identical processes are running on a cache of 1024 blocks with round-robin scheduling. Also, let us say that a process needs 128 cache blocks while it is active. With the LRU replacement policy, a process finds an empty cache whenever it resumes after a context switch. However, smart partitioning can keep 112 blocks for each process and share 16 blocks among all processes. The problem is that only two processes share a given cache block under the LRU replacement policy, whereas all nine processes share the same 16 blocks under the smart partition.

For short time quanta, this layout problem is marginal since recently used data for all processes are kept in the cache. Very long time quanta are also unproblematic since it is usually optimal to have a large shared area for long time quanta. But this problem can be severe for mid-range time quanta, which are just long enough to pollute the entire cache in one cycle of time quanta. In this case, the LRU policy results in no dedicated area, whereas the optimal partition should have large dedicated areas.
Problematic cache sizes and time quanta for which Process i suffers from this layout problem can be roughly estimated as follows:

$$\sum_{j \neq i} m_j \cdot T_j \approx C, \qquad (4.2)$$

where the sum runs over all processes other than i, N is the number of processes, m_j is the miss-rate of Process j, T_j is the number of memory references in Process j's time quantum, and C is the number of cache blocks.

This layout problem of the LRU replacement policy is illustrated in Figure 4-2. The figure shows the relative improvement obtained by optimal partitioning over the normal LRU replacement policy, calculated by Equation 4.1. Four processes, vpr, vortex, gcc, and bzip2 from SPEC CPU2000 [14], share a cache with the same length of time quantum and round-robin scheduling. The miss-rate curves of the processes are shown in Figure 4-2 (a). Figure 4-2 (b) illustrates the relative improvement for various cache sizes and time quanta. Finally, Figure 4-2 (c) shows the relative improvement for selected cache sizes.

Figure 4-2: Improvement by optimal partitioning for four SPEC benchmarks. (a) Miss-rate curves of processes. (b) (c) Relative improvement.

These figures show that the LRU replacement policy is problematic for a certain range of time quanta for a given cache size. The improvement is marginal for small time quanta. However, for some mid-range time quanta, proper partitioning improves the miss-rate by up to around 30%. For very long time quanta, again, there is no difference between the normal LRU and optimal partitioning. As Equation 4.2 implies, the problematic time quantum for LRU becomes longer as the cache size increases.

4.2.3 Summary

The problematic time quanta and cache sizes can be summarized from the above discussions as shown in Figure 4-3. The figure illustrates the problems of the LRU replacement policy for various cache sizes (the number of cache blocks) and time quanta (the number of memory references per time quantum). If a cache is very small or very large for a given time quantum, the normal LRU replacement policy performs almost as well as the LRU with optimal partitioning. For short time quanta, non-optimal allocation size for each process is the main source of performance degradation (Section 4.2.1). In this case, the problematic cache size depends on the characteristics of the processes, and could be any size. For longer time quanta, the non-optimal layout problem may degrade the cache performance (Section 4.2.2). For this problem, the problematic cache size generally increases as the time quantum becomes longer.

Figure 4-3: (a) Problems of the normal LRU replacement policy for various cache sizes and time quanta. (b) Usual cache size and time quanta for each level of memory hierarchy.

The relationship can be formalized based on Equation 4.2. If we assume the same miss-rate and time quantum for all processes, the equation simplifies to

$$(N - 1) \cdot M \cdot T \approx C. \qquad (4.3)$$

Usually, the number of processes N is between 10 and 100, and the miss-rate M is between 1% and 10%, so the product (N - 1) * M is on the order of one and N and M roughly cancel. Therefore, the layout problem can be serious when T and C are of about the same order of magnitude.
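Equation 4.2 is easy to turn into a rough test. The sketch below (an illustration of ours) sums, for each process, the other processes' expected misses over one round of time quanta and flags the case where that sum falls within roughly an order of magnitude of the cache size; the factor-of-ten window is an arbitrary choice:

```python
def layout_problem_region(miss_rates, quanta, C):
    """For each process i, evaluate Equation 4.2: do the other processes' misses
    over one round of time quanta roughly suffice to evict the whole cache?"""
    flags = []
    for i in range(len(miss_rates)):
        evicted = sum(m * T for j, (m, T) in enumerate(zip(miss_rates, quanta)) if j != i)
        flags.append(0.1 * C <= evicted <= 10 * C)   # within ~an order of magnitude of C
    return flags

# Four processes with 2% miss-rates and 50,000-reference quanta on a 4096-block cache
print(layout_problem_region([0.02] * 4, [50_000] * 4, 4096))   # -> [True, True, True, True]
```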
4.3 Problems of the LRU Replacement Policy for Each Level of Memory Hierarchy

Each level of the memory hierarchy usually has a different cache configuration; that is, the cache size and block size are different for each level. Generally, both cache size and block size increase as we go up the memory hierarchy [13]. Current generation microprocessors usually have an L1 cache that ranges from 16 KB to 1 MB with 32-Byte cache blocks [2, 3, 15, 16, 6, 10, 21, 11]. L2 caches, often an order of magnitude larger than L1 caches, are between 256 KB and 8 MB, and L2 cache blocks are also a few times larger than L1 cache blocks. Finally, main memory can be considered as a cache with pages as cache blocks. Main memory, these days, usually ranges between 64 MB and 4 GB, with a page size of 4 KB. As a result, L2 caches usually have an order of magnitude more cache blocks than L1 caches, and main memory and L2 caches have about the same number of cache blocks.

The number of memory references per time quantum also differs for each level of the memory hierarchy. Generally, a higher level cache sees fewer memory references than lower level caches, since a higher level of the memory hierarchy only sees memory references that miss in the lower levels. For example, if a process misses on 10% of its memory references in an L1 cache, the number of memory references seen by the L2 cache is one tenth of the number seen by the L1 cache.

For conventional time-shared systems, the time quantum is usually around 10 ms, assuming a process is not I/O bound. For Compaq Tru64 Unix, the default time quantum is 10 ticks (as root, it can be checked with dbx -k vmunix; (dbx) p round-robin-switch-rate), and each tick is roughly 1 ms [7]. Therefore, L1 caches usually service hundreds of thousands or more memory references per time quantum for processors with a 1 GHz clock frequency. For simultaneous multi-threading (SMT), processes are very tightly interleaved [31, 19, 9], so there are only a few memory references per time quantum.

Figure 4-3 (b) summarizes the usual cache size and number of memory references per time quantum for each level of the memory hierarchy. A higher level cache is assumed to see one tenth of the memory references that the level below it sees.

It is very clear from Figure 4-3 what kind of partitioning problems each level of the memory hierarchy may suffer from when it is managed by the simple LRU replacement policy. For L1 caches with a conventional time-sharing mechanism, the number of memory references per time quantum is likely to be an order of magnitude larger than the number of cache blocks. Therefore, for L1 caches, partitioning is usually not a problem, although slow processors with a large L1 cache may suffer from the non-optimal layout problem. However, L2 caches of conventional time-shared systems are very likely to suffer from the problem of non-optimal layout. To obtain the best performance for L2 caches, it could be crucial to partition the cache properly so that the right amount of data is kept in the cache over the time quanta. For main memory, the number of memory references per time quantum is often relatively small considering the number of cache blocks it has. Therefore, allocating the right amount of space to each process is important for main memory and SMT systems. In particular, streaming applications should be carefully managed in this case. Finally, for SMT systems, all the levels of the memory hierarchy could be problematic since time quanta are short at all levels.
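The reasoning behind Figure 4-3 (b) can be written out as a small back-of-the-envelope calculation. Every number below is an illustrative assumption of ours (a 1 GHz processor, a 10 ms time quantum, roughly one memory reference every three cycles, and each level filtering out 90% of the references it receives), not a measurement from the thesis:

```python
# Rough references-per-quantum versus blocks at each level of the memory hierarchy.
l1_refs = 10_000_000 // 3                          # ~3.3 M L1 accesses per 10 ms quantum
levels = {
    "L1  (64 KB, 32 B blocks)":   (64 * 1024 // 32,          l1_refs),
    "L2  (2 MB, 128 B blocks)":   (2 * 1024 ** 2 // 128,     l1_refs // 10),
    "Mem (256 MB, 4 KB pages)":   (256 * 1024 ** 2 // 4096,  l1_refs // 100),
}
for name, (blocks, refs) in levels.items():
    print(f"{name}: {blocks:>6} blocks, {refs:>8} refs/quantum, ratio {refs / blocks:7.1f}")
```

With these assumptions the L1 sees far more references per quantum than it has blocks, the L2 sees about the same order of magnitude (the layout-problem region of Equation 4.2), and main memory sees fewer references than it has pages (the allocation-size region).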
As technology improves, both cache sizes and clock frequencies tend to increase, which results in more cache blocks and more memory references per time quantum. However, the speed of this growth differs for each level of the memory hierarchy. The size of L1 caches tends to grow slowly compared to the clock frequency, since it is crucial for L1 caches to keep up with the speed of the processor. On the other hand, the size of main memory grows much faster. Therefore, for future microprocessors, proper cache partitioning is likely to be more important than it is now for L2 caches and main memory, and less important for L1 caches.

4.4 Effect of Low Associativity

The previous sections are mainly based on the fully-associative cache model. However, most of the discussion is also generally valid for set-associative caches. The discussion is based on the assumption that the miss-rate of a process can be given as a function of cache size, m(x), which is also a reasonable assumption for set-associative caches. How the size of the cache space x actually changes over time is based on the fully-associative model, however, and may differ for set-associative caches.

Generally, the size of a process' cache space grows more slowly in set-associative caches than in fully-associative caches. For set-associative caches, the replacement is determined by the set mapping and the LRU information. Since the mapping can be arbitrary, it is possible for a process to replace its own data even when the globally least recently used cache block belongs to another process. Therefore, it usually takes longer for a process to pollute the whole of a set-associative cache than a fully-associative one. As a result, the non-optimal layout problem of the LRU replacement policy is generally less severe for set-associative caches.

Even though the general trend discussed here is true for all caches, it is hard to predict the exact effect of context switching for caches with low associativity. As associativity becomes lower, the replacement of cache blocks becomes more dependent on the set mapping, which is arbitrary. If the mapping happens to be such that the processes' frequently accessed sets overlap, the problem of context switching could be very severe. On the other hand, if the frequently accessed sets of the processes do not overlap at all, context switches may not affect the cache performance at all.

4.5 More Benefits of Cache Partitioning

The previous sections focused only on cache miss-rates, and discussed how proper partitioning can improve the miss-rate. However, cache partitioning can have further advantages in terms of overall system performance and predictability.

Proper cache partitioning can improve the overall system performance by reducing bursts of misses. Although cache miss-rates are generally used as a metric of cache performance, high miss-rates do not necessarily mean many stalls for a processor. Since modern processors usually exploit instruction-level parallelism, only some misses, or many outstanding misses at once, cause performance degradation. If a process starts executing with an empty cache, it is very likely that the process experiences a burst of misses that causes the processor to stall. Proper cache partitioning can mitigate this burst of misses by allocating dedicated space to the process.
In some cases, it might even be better to remove a burst of misses at the beginning of a time quantum even though doing so results in a worse overall miss-rate.

The predictability of a system can also be improved by cache partitioning. For some real-time applications, being able to predict the performance of a certain task matters more than its average performance. For example, if an application is playing an audio clip, it is crucial for the application to produce a new packet every period. In this case, we should know the worst-case performance of the process, and schedule the application so that it can do the job within a given period. By having a larger dedicated area for the process, cache partitioning can improve the worst-case performance and, therefore, predictability [5].

Chapter 5
Partitioning Experiments

This chapter discusses the implementation of dynamic cache partitioning based on the analytical cache model. Using trace-driven simulations, we compare this model-based partitioning with the normal LRU policy and with another partitioning method based on look-ahead information. The implementation in this chapter is based on fully-associative caches. However, it can also be applied to each set of a set-associative cache.

5.1 Implementation of Model-Based Partitioning

This section describes the implementation of our model-based partitioning algorithm in a simulation environment. Figure 5-1 illustrates the overview of our implementation. First, each benchmark is compiled and simulated using a system simulator to generate the memory reference trace of the benchmark. The traces are fed into a single-process cache simulator to record the memory reference pattern of each process. Finally, a combination of benchmarks is simulated by a multi-process cache simulator with our cache partitioning module. Note that the recorded information can be reused for any combination of benchmarks as long as the cache configuration is the same.

Figure 5-1: The implementation of the analytical-model-based cache partitioning.

5.1.1 Recording Memory Reference Patterns

This phase of partitioning records the memory reference pattern of each process so that the information can be used by the on-line cache partitioning module. The memory reference pattern of a process is assumed to be represented by the miss-rate curve of the process. Therefore, we actually record the miss-rates of a process as a function of the number of cache blocks. Since the miss-rate depends on the cache configuration, the recording cache should be configured identically to the cache used for the multi-process cases.

Ideally, the miss-rate curve for each time quantum should be recorded; however, that results in extremely large record files for short time quanta. Moreover, the number of memory references in a time quantum may not be known at the moment of profiling, which makes it harder to record one miss-rate curve per time quantum. Therefore, we record the average miss-rate curve over a certain time period along with time stamps, and each miss-rate curve is mapped to a corresponding time quantum at run-time. The time stamp of a miss-rate curve consists of both the start and stop times of the period in which the miss-rate is recorded.
Time is expressed as the number of memory references from the beginning of a process. Therefore, the start time of a miss-rate curve is the time of the first memory reference that contributes to the curve, and the stop time is the time of the last memory reference that contributes to it. At run-time, a time quantum is mapped to the recorded miss-rate curve to which the majority of the quantum's memory references contributed.

Figure 5-2: The implementation of on-line cache partitioning.

5.1.2 Cache Partitioning

Four modules are involved in the cache partitioning flow, as shown in Figure 5-2. The scheduler provides the partition module with scheduling information, such as the set of executing processes and when each process starts and finishes. The partition module reads miss-rate information from the record files and decides the cache partition for each time quantum. Finally, the replacement unit effectively partitions the cache based on the information from the partition module.

Cache Partitioning by Replacement Unit

We partition a cache using a mechanism called column caching [4]. Column caching allows partitioning of a cache at cache-column granularity, where each column is one "way" or bank of an n-way set-associative cache. A standard cache considers all cache blocks in a set as candidates for replacement; as a result, a process' data can occupy any cache block. Column caching, on the other hand, restricts replacement to a subset of the cache blocks, which essentially partitions the cache.

Column caching specifies replacement candidacy using a bit vector in which each bit indicates whether the corresponding column is a candidate for replacement. An LRU replacement unit is modified so that it replaces the LRU cache block among the candidates specified by the bit vector. Each partitionable unit has a bit vector. Since lookup is precisely the same as for a standard cache, column caching incurs no performance penalty during lookup.

Our implementation uses the same mechanism as column caching, but at cache-block granularity. Each process has a bit vector with the same number of bits as the number of cache blocks, and each bit specifies whether the corresponding cache block can be replaced by the process. Although it might not be practical to manage a cache at block granularity, we chose this design to verify the cache model. More practical implementations of column caching can and will be studied.

Determination of a Partition

The partition module manages the bit vectors, and therefore effectively controls the partition of the cache. The module obtains scheduling information from the scheduler and decides a new partition at the end of each process' time quantum. Based on this decision, the bit vectors are changed to actually partition the cache.

To decide a partition, the partition module maintains several data structures. Based on the information from the scheduler, the module knows the number of processes (N) and the length of each process' time quantum (T_i). For partitioning, the module also keeps the size of each process' dedicated area (D_i) and the miss-rate curve of each process (m_i). We assume that D''_i ≈ D_i, as discussed in Section 3.2.1, and we therefore keep only one D_i for each process. The size of the shared area is calculated from the sizes of the dedicated areas at any given moment.
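The bit-vector mechanism and the bookkeeping described above can be pictured with a short sketch (ours, not the thesis' simulator): a fully-associative cache in which each process owns a bit vector of replacement candidates, and a miss evicts the least recently used block among those candidates only, while lookup remains unrestricted.

```python
class PartitionedCache:
    """Fully-associative cache with column-caching-style replacement restriction
    at cache-block granularity."""

    def __init__(self, n_blocks):
        self.tags = [None] * n_blocks        # (pid, address) held by each block
        self.last_used = [0] * n_blocks      # timestamp of each block's last access
        self.clock = 0

    def access(self, pid, addr, candidates):
        """candidates: this process' bit vector (one bool per cache block)."""
        self.clock += 1
        for b, tag in enumerate(self.tags):  # lookup may match any block
            if tag == (pid, addr):
                self.last_used[b] = self.clock
                return True                  # hit
        # Miss: replace the LRU block among the blocks this process may use.
        victim = min((b for b in range(len(self.tags)) if candidates[b]),
                     key=lambda b: self.last_used[b])
        self.tags[victim] = (pid, addr)
        self.last_used[victim] = self.clock
        return False                         # miss
```

Setting a process' candidate bits over its dedicated blocks plus the shared blocks realizes whatever partition the partition module has chosen.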
If a new process is added to the scheduler, the scheduler notifies the partition module of the increase in the number of processes, and the partition module initializes the data structures for the process. First, the number of processes (N) is increased, and the proper time quantum length is assigned to T_N. Second, since the new process has not started executing yet, the size of its dedicated area (D_N) is set to zero. Finally, the partition module opens the record file for the new process and loads m_N(x).

After Process i finishes its time quantum, the partition module updates the information about Process i, such as m_i and T_i, for the next round. The size of Process i's dedicated area, D_i, is also updated based on the new miss-rate curve and time quantum. The partitioning decision, based on the discussion in Section 3.2, is as follows. Let S represent the size of the shared area during Process i's time quantum; the new value of D_i is the integer from 0 to D_i + S that minimizes

$$\sum_{p=1}^{N} \text{miss}\!\left(m_p(x),\, T_p,\, D_p,\, C - \sum_{q=1,\, q \neq p}^{N} D_q\right), \qquad (5.1)$$

where miss(m(x), T, D, C) is given in Equation 3.2. Once the new D_i is computed, the bit vectors are updated according to the new size of the dedicated area. The D_i most recently used (MRU) cache blocks are left in the dedicated area for Process i, and the rest of the process' cache blocks are moved to the next process' area. The moved cache blocks are also marked as the LRU blocks so that they are replaced before the next process' dedicated area.

5.2 Cache Partitioning Based on Look-Ahead Information

In this section, we discuss a cache partitioning algorithm for the case in which we know when each cache block will be accessed in the future. Since this partitioning algorithm is based on complete knowledge of program behavior, it can be thought of as the best we can do by cache partitioning. We will evaluate our model-based partitioning by comparing it with this ideal partitioning result.

Optimal replacement can be achieved by replacing the cache block that is going to be accessed furthest in the future. This replacement policy is optimal in the sense that it results in the minimum number of misses possible for a given cache. For multi-process cases, the replacement policy also results in the optimal cache partitioning among processes. Therefore, simulation results with this policy show the best cache miss-rate for a given cache. We call this the ideal replacement policy.

The ideal replacement policy can be far better than what we can actually achieve with cache partitioning alone. Our cache partitioning is based on the LRU replacement policy within a single process, while the ideal replacement policy exploits its knowledge of the future even within a process. As a result, the performance of the ideal replacement policy is impossible to achieve just by proper cache partitioning among processes as long as the normal LRU replacement policy is used within a time quantum.

We therefore propose the modified LRU replacement policy as a limit of cache partitioning with the normal LRU replacement policy. The modified LRU replacement policy considers the active process' LRU cache block and all the cache blocks of dormant processes as candidates for replacement, and replaces the candidate that is going to be accessed furthest in the future.
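A small sketch of this victim selection (the data layout is our own, and the next-access times would come from the trace):

```python
def modified_lru_victim(blocks, active_pid, next_use):
    """blocks: list of (pid, addr, last_used) for every occupied cache block.
    next_use[(pid, addr)]: time of that block's next access (float('inf') if none).
    Candidates are every dormant process' block plus the active process' own LRU
    block; the victim is the candidate whose next access lies furthest in the future."""
    own = [i for i, (pid, _, _) in enumerate(blocks) if pid == active_pid]
    candidates = [i for i, (pid, _, _) in enumerate(blocks) if pid != active_pid]
    if own:
        candidates.append(min(own, key=lambda i: blocks[i][2]))
    return max(candidates,
               key=lambda i: next_use.get((blocks[i][0], blocks[i][1]), float("inf")))
```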
This replacement policy is exactly the same as the normal LRU replacement policy when there is only one process. For multi-process cases, however, the policy tries to keep the data of dormant processes if that data is more likely to be accessed in the near future than the LRU block of the active process.

The modified LRU replacement policy keeps the dormant processes' cache blocks that are going to be accessed soonest, whereas our partitioning can only keep the MRU cache blocks. Therefore, the performance of the modified LRU replacement policy may still not be a realistic limit for our partitioning with the LRU replacement policy. In particular, the modified LRU replacement policy performs almost as well as the ideal replacement policy for short time quanta. For long time quanta, however, the modified LRU replacement policy can be thought of as our limit, since the cache blocks kept in the cache are then likely to be accessed within a time quantum even if they are not accessed at its very beginning.

5.3 Experimental Results

The case of eight processes sharing a 32-KB fully-associative cache is simulated to evaluate the model-based partitioning. Seven benchmarks (bzip2, gcc, swim, mesa, vortex, vpr, twolf) are from SPEC CPU2000 [14], and one (the image understanding (iu) program) is from the data-intensive systems benchmark suite [23]. The overall miss-rate with partitioning is compared to the miss-rate using only the normal LRU replacement policy. The result of the look-ahead based partitioning serves as the best miss-rate achievable for the given benchmarks.

The simulations are carried out for fifty million memory references for each time quantum length. Processes are scheduled in a round-robin fashion with a fixed number of memory references per time quantum, and the number of memory references per time quantum is assumed to be the same for all eight processes. Finally, two record cycles (P), ten million and one million memory references, are used for the model-based partitioning. The record cycle represents how often the miss-rate curve is recorded during off-line profiling; a shorter record cycle therefore implies more detailed information about a process' memory reference pattern.

Figure 5-3: The characteristics of the benchmarks. (a) (b) The change of the miss-rate over time. (c) (d) The miss-rate as a function of the cache size.

Figure 5-4: The results of the model-based cache partitioning for various time quanta when eight processes (bzip2, gcc, swim, mesa, vortex, vpr, twolf, iu) are sharing a cache (32 KB, fully associative).

The characteristics of the benchmarks are illustrated in Figure 5-3. Figure 5-3 (a) shows the change of the miss-rate over time. The x-axis represents simulation time, and the y-axis represents the average miss-rate over one million memory references at a given time.
As shown in the figure, bzip2, gcc, swim, and iu show abrupt changes in their miss-rates, whereas the other benchmarks have very uniform miss-rate characteristics over time. Figure 5-3 (b) illustrates the miss-rate as a function of the cache size. For a 32-KB fully-associative cache, the benchmarks show miss-rates between 1% and 5%.

The results of cache partitioning for various time quanta are shown in Figure 5-4. The miss-rates are averaged over 50 million memory references. As discussed in the cache model, the normal LRU replacement policy is problematic for a certain range of time quanta. In this case, the overall miss-rate increases dramatically for time quanta between one thousand and ten thousand memory references. For this problematic region, the model-based partitioning lowers the cache miss-rate from 4.5% to 3.5%, which is over a 20% improvement. For short time quanta, the relative improvement is about 5%. For very long time quanta, the model-based partitioning shows exactly the same result as the normal LRU replacement policy. In general, the figure shows that the model-based partitioning always performs at least as well as, and often better than, the normal LRU replacement policy. Also, partitioning with a short record cycle performs better than partitioning with a long record cycle.

From the figure, we can also compare the performance of the model-based partitioning with the modified LRU replacement policy. For long time quanta, our model-based partitioning performs as well as the modified LRU replacement policy, which means that our model-based partitioning closely achieves the best cache partition. However, there is a noticeable gap between the model-based partitioning and the modified LRU replacement policy for short time quanta. This is partly because our model only has information averaged over time quanta, whereas the modified LRU replacement policy knows the exact memory reference pattern for each time quantum. The modified LRU replacement policy also performs much better than the model-based partitioning for short time quanta because the modified LRU effectively becomes the ideal replacement policy when the time quantum is short.

In our example of a 32-KB cache with eight processes, the problematic time quanta are on the order of a thousand memory references, which is very short for modern microprocessors. As a result, only systems with very fast context switching, such as simultaneous multi-threading machines, can be improved for this cache size and workload. However, as shown in Chapter 4, longer time quanta become problematic if a cache is larger. Therefore, conventional time-shared systems with a very high clock frequency can also be improved by the same technique if the cache is large.

Figure 5-5 shows miss-rates from the same simulations as above. In this figure, however, the miss-rate over time is shown rather than the average over the entire simulation. It is clear from the figure how the short record cycle helps partitioning. In Figure 5-5 (a), the model-based partitioning with the long record cycle (P = 10^7) performs worse than LRU at the beginning of the simulation, even though it outperforms the normal LRU replacement policy overall. This is because the model-based partitioning has only one average miss-rate curve per process.
As shown in Figure 5-3, some benchmarks such as bzip2 and gcc have a very different miss-rate at the beginning. Therefore, the average miss-rate curve for those benchmarks does not work at the beginning of the simulation, which results in worse performance than the normal LRU replacement policy. The model-based partitioning with the short record cycle (P = 10^6), on the other hand, always outperforms the normal LRU replacement policy. In this case, the model has correct miss-rate curves for all the time quanta, and partitions the cache properly even at the beginning of the processes' execution.

Figure 5-5: The results of the model-based cache partitioning over time when eight processes (bzip2, gcc, swim, mesa, vortex, vpr, twolf, iu) are sharing a cache (32 KB, fully associative) with three different time quanta: (a) ten, (b) five thousand, and (c) one million memory references. Notice the 20% relative improvement in (b).

Figure 5-5 (b) also shows that the short record cycle improves performance at the beginning. Finally, the model-based partitioning cannot improve miss-rates if the time quantum is very long, as shown in Figure 5-5 (c).

In conclusion, the simulation experiments show that the model-based partitioning works very well for fully-associative caches. For short or mid-range time quanta, cache performance can be improved using the model-based partitioning, and partitioning is especially helpful for mid-range time quanta, where the normal LRU replacement policy is problematic. The model-based partitioning for set-associative caches is a subject of ongoing work.

Chapter 6
Conclusion

An analytical cache model to estimate the overall miss-rate when multiple processes share a cache has been presented (Chapter 2). The model obtains the information about each process from its miss-rate curve, and combines it with parameters that define the cache configuration and the schedule of the processes. Interference among processes under the LRU replacement policy is quickly estimated for any cache size and any time quantum, and the estimated miss-rate is very accurate. More importantly, the model provides not only the overall miss-rate but also a very good understanding of the effect of context switching. For example, the model clearly shows that the LRU replacement policy is problematic for mid-range time quanta because the policy replaces the blocks of the least recently executed process, which are the ones more likely to be accessed in the near future.

The analytical model has been applied to the cache partitioning problem to show that the model can be used both for understanding the problem and for developing new algorithms to solve it. First, we defined cache partitioning and discussed the optimal partition based on the model. From this study, we can conclude that a certain number of cache blocks should be dedicated to each process and the rest should be shared among all processes.
The optimal number of dedicated blocks depends on the cache size, the time quanta, and the characteristics of the processes (Chapter 3).

We have also compared the standard LRU replacement policy with the optimal partition (Chapter 4). The study shows that the LRU replacement policy has two main problems: (i) allocation size and (ii) cache layout. The LRU replacement policy allocates too much cache space to streaming processes, which can be a severe problem for short time quanta. The LRU replacement policy also tends to replace the wrong process' data, which often results in unnecessary conflicts among processes. This layout problem can be severe for mid-range time quanta. Therefore, L2 caches and DRAM could suffer from context switches and can potentially be improved by proper partitioning techniques.

Finally, the model-based partitioning has been implemented and verified by simulations (Chapter 5). Miss-rate curves are recorded off-line and partitioning is performed on-line according to the combination of processes that are executing. The simulation result for the case when eight processes share a 32-KB fully-associative cache shows that the model-based partitioning can improve the overall miss-rate by up to 20%. Even though we have used an off-line profiling method to obtain miss-rate curves, it should not be hard to approximate the miss-rate curve on-line using miss-rate monitoring techniques. Therefore, a complete on-line cache partitioning method can be developed based on the model.

Only the cache partitioning problem has been studied in this thesis. However, as shown by the study of cache partitioning, our model can be applied to any cache optimization problem that is related to context switching. For example, it can be used to determine the best combination of processes to run on each processor of a multi-processor system. Also, the model is useful in identifying areas in which further research on improving cache performance would be fruitful, since it can easily provide the maximum improvement we can expect.

Appendix A
The Exact Calculation of E[x(t)]

In our cache model, we approximated E[x(t)], the expectation of the amount of data in the cache after t memory references, using Jensen's inequality (Section 2.3.2). Even though it is computationally more expensive than the approximation, it is also possible to calculate the exact value of E[x(t)].

The expectation is the sum of every possible value multiplied by the probability of that value. Therefore, to calculate the exact value of E[x(t)], we need to know the probability of x(t) being k for every integer k from 0 to C, where C is the cache size. To calculate the probabilities, define P_{x=k}(t) as the probability of x(t) being k. We also define a vector of these probabilities:

$$P_x(t) = (P_{x=0}(t),\, P_{x=1}(t),\, \ldots,\, P_{x=C}(t)). \qquad (A.1)$$

From the facts that the amount of data in the cache increases by one only if a memory reference misses (Equation 2.4) and that the probability of a miss with k cache blocks in the cache is m(k), P_{x=k}(t+1) can be written as

$$P_{x=k}(t+1) = P_{x=k}(t) \cdot (1 - m(k)) + P_{x=k-1}(t) \cdot m(k-1). \qquad (A.2)$$

Using matrix multiplication, the probabilities for all k can be written in one equation as follows:

$$P_x(t+1) = P_x(t) \cdot
\begin{pmatrix}
1-m(0) & m(0)   & 0      & \cdots & 0 \\
0      & 1-m(1) & m(1)   & \cdots & 0 \\
\vdots &        & \ddots & \ddots & \vdots \\
0      & \cdots & 0      & 1-m(C-1) & m(C-1) \\
0      & \cdots & 0      & 0      & 1
\end{pmatrix}. \qquad (A.3)$$

If we assume that the cache is empty at time 0, then P_{x=0}(0) = 1 and all other probabilities P_{x=k}(0), k ≠ 0, are zero. Therefore, we can calculate P_x(t):

$$P_x(t) = P_x(0) \cdot A^t, \qquad (A.4)$$

where P_x(0) and A are given by the following equations:

$$P_x(0) = (1, 0, \ldots, 0), \qquad (A.5)$$

$$A = \begin{pmatrix}
1-m(0) & m(0)   & 0      & \cdots & 0 \\
0      & 1-m(1) & m(1)   & \cdots & 0 \\
\vdots &        & \ddots & \ddots & \vdots \\
0      & \cdots & 0      & 1-m(C-1) & m(C-1) \\
0      & \cdots & 0      & 0      & 1
\end{pmatrix}. \qquad (A.6)$$

Since we can calculate the probability of every possible value of x(t), it is straightforward to calculate the expectation E[x(t)]:

$$E[x(t)] = \sum_{k=0}^{C} k \cdot P_{x=k}(t) = (0, 1, 2, \ldots, C) \cdot \left(P_x(0) \cdot A^t\right)^{T}. \qquad (A.7)$$
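The calculation is straightforward to reproduce; the sketch below (ours, using numpy, with the e^{-0.01x} curve of Chapter 3 only as an illustrative choice of m) builds the matrix A, propagates the distribution for t references starting from an empty cache, and evaluates Equation A.7. For comparison, the approximation of Section 2.3.2 would simply iterate x <- x + m(x).

```python
import numpy as np

def exact_expected_fill(m, C, t):
    """Exact E[x(t)] following Equations A.3-A.7."""
    A = np.zeros((C + 1, C + 1))
    for k in range(C):
        A[k, k] = 1.0 - m(k)       # hit: the amount of data stays at k blocks
        A[k, k + 1] = m(k)         # miss: one more block is brought in
    A[C, C] = 1.0                  # a full cache stays full
    P = np.zeros(C + 1)
    P[0] = 1.0                     # the cache is empty at time 0
    P = P @ np.linalg.matrix_power(A, t)       # P_x(t) = P_x(0) * A^t
    return float(np.arange(C + 1) @ P)         # sum over k of k * P_{x=k}(t)

print(exact_expected_fill(lambda k: np.exp(-0.01 * k), C=64, t=1000))
```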
Bibliography

[1] A. Agarwal, M. Horowitz, and J. Hennessy. An analytical cache model. ACM Transactions on Computer Systems, 7(2), May 1989.
[2] AMD. AMD Athlon Processor Technical Brief, Nov. 1999.
[3] AMD, http://www.amd.com/K6/k6docs/pdf/21918.pdf. AMD K6-II Processor Datasheet, Nov. 1999.
[4] D. T. Chiou. Extending the Reach of Microprocessors: Column and Curious Caching. PhD thesis, Massachusetts Institute of Technology, 1999.
[5] D. T. Chiou, P. Jain, L. Rudolph, and S. Devadas. Application-specific memory management for embedded systems using software-controlled caches. In the 37th Conference on Design Automation, 2000.
[6] Compaq. Compaq AlphaStation family.
[7] Compaq. Tru64 Unix V5.1 Documentation: Guide to Realtime Programming, Aug. 2000.
[8] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, Mar. 1991.
[9] S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, and D. M. Tullsen. Simultaneous multithreading: A platform for next-generation processors. IEEE Micro, 17(5), 1997.
[10] C. Freeburn. The Hewlett Packard PA-RISC 8500 processor. Technical report, Hewlett Packard Laboratories, Oct. 1998.
[11] D. Greenley, J. Bauman, D. C. and D. Chen, R. Eltejaein, P. Ferolito, P. Fu, R. Garner, D. Greenhill, H. Grewal, K. Holdbrook, B. Kim, L. Kohn, H. Kwan, M. Levitt, G. Maturana, D. Mrazek, C. Narasimhaiah, K. Normoyle, N. Parveen, P. Patel, A. Prabhu, M. Tremblay, M. Wong, L. Yang, K. Yarlagadda, R. Yu, R. Yung, and G. Zyner. UltraSPARC: The next generation superscalar 64-bit SPARC. In Compcon '95, 1995.
[12] J. S. Harper, D. J. Kerbyson, and G. R. Nudd. Analytical modeling of set-associative cache behavior. IEEE Transactions on Computers, 48(10), Oct. 1999.
[13] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 1996.
[14] J. L. Henning. SPEC CPU2000: Measuring CPU performance in the new millennium. IEEE Computer, July 2000.
[15] Intel, ftp://download.intel.com/design/IA-64/Downloads/24547401.pdf. Itanium Processor Microarchitecture Reference, Aug. 2000.
[16] Intel, ftp://download.intel.com/design/Pentium4/manuals/24547002.pdf. IA-32 Intel Architecture Software Developer's Manual, 2001.
[17] D. B. Kirk. Process dependent static cache partitioning for real-time systems. In Real-Time Systems Symposium, 1988.
[18] H. Kwak, B. Lee, A. R. Hurson, S.-H. Yoon, and W.-J. Hahn. Effects of multithreading on cache performance. IEEE Transactions on Computers, 48(2), Feb. 1999.
[19] J. L. Lo, J. S. Emer, H. M. Levy, R. L. Stamm, D. M. Tullsen, and S. J. Eggers. Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, 15, 1997.
[20] P. Magnusson and B. Werner. Efficient memory simulation in SimICS. In 28th Annual Simulation Symposium, 1995.
[21] MIPS Technologies, Inc. MIPS R10000 Microprocessor User's Manual, 1996.
[22] J. C. Mogul and A. Borg. The effect of context switches on cache performance. In the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 1991.
[23] J. Muñoz. Data-Intensive Systems Benchmark Suite Analysis and Specification.
http://www.aaec.com/projectweb/dis, June 1999.
[24] M. Rosenblum, S. A. Herrod, E. Witchel, and A. Gupta. Complete computer system simulation: The SimOS approach. IEEE Parallel & Distributed Technology, 1995.
[25] M. S. Squillante and E. D. Lazowska. Using processor-cache affinity information in shared-memory multiprocessor scheduling. IEEE Transactions on Parallel and Distributed Systems, 4(2), Feb. 1993.
[26] H. S. Stone, J. Turek, and J. L. Wolf. Optimal partitioning of cache memory. IEEE Transactions on Computers, 41(9), Sept. 1992.
[27] G. E. Suh, L. Rudolph, and S. Devadas. Set-associative cache models for time-shared systems. Technical Report CSG Memo 432, Massachusetts Institute of Technology, 2001.
[28] D. Thiébaut and H. S. Stone. Footprints in the cache. ACM Transactions on Computer Systems, 5(4), Nov. 1987.
[29] D. Thiébaut, H. S. Stone, and J. L. Wolf. Improving disk cache hit-ratios through cache partitioning. IEEE Transactions on Computers, 41(6), June 1992.
[30] J. Torrellas, A. Tucker, and A. Gupta. Benefits of cache-affinity scheduling in shared-memory multiprocessors: A summary. In the 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1993.
[31] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In 22nd Annual International Symposium on Computer Architecture, 1995.
[32] R. A. Uhlig and T. N. Mudge. Trace-driven memory simulation: A survey. ACM Computing Surveys, 29(2), June 1997.
[33] M. Zagha, B. Larson, S. Turner, and M. Itzkowitz. Performance analysis using the MIPS R10000. In Supercomputing '96, 1996.