>> J.D. Douceur: Good morning. It's my pleasure to introduce Y.C. Tay, who joins us from the National University of Singapore. He's here in town visiting us today, and also presumably in town for SIGMETRICS, which is next week. And if any of you are interested in performance analysis, and if you're not you probably wouldn't be here, and you have not registered for SIGMETRICS, we still have plenty of registration available. It's not too late. We'd love to have you at the conference. You can talk to lots of smart people, including Tay and others. He's been involved in SIGMETRICS even longer than I have. Today he's going to tell us why the last 40 years of cache miss analysis should be completely thrown out and replaced with something new. So, Tay. >> Y.C. Tay: I want to thank J.D. for hosting the talk despite all the work that he has to do for SIGMETRICS, and I appreciate your taking the time to be here. Everybody's busy nowadays, so thank you for spending an hour to listen to me talk about something that actually started 10 years ago when I was in Building 26. You can tell it's been that long from the long list of collaborators I have here. So how does this work? What I'm going to do is propose a research program, and here I use "program" in the old-fashioned sense of the word, meaning that I identify what I believe is an important problem, and a hard problem, and I propose a particular way of attacking it. And if this particular way of attacking the problem works, then the problem will be solved and people won't have to worry about it anymore. So it's kind of an ambitious program, as all programs are. And the problem I'm going to attack is not at any particular level of the memory hierarchy; I'm saying that this way of attacking the problem should work for any level of the memory hierarchy. So a bit audacious there. So, cache miss equations: as J.D. mentioned, people have been doing this forever. What new idea do I have here? The new idea that I have is something that I call top down. What people have been doing in the past is what I call bottom up. So I need to distinguish between the two of them; that's what I'm going to do first. Why is this particular way of doing it as powerful as I think it is? Because it's universal. Again, I have to explain that. Once I explain what I mean by those two, then I can explain what the research program is about, and then I'll show you some applications. So, the cache miss equation: everybody knows what this is. It just expresses the probability of a cache miss as a function of the cache size. Caches are everywhere in the memory hierarchy, and I don't restrict myself to any particular level. I could be talking about the processor cache, or the RAM itself, or the database buffer, or the browser cache, et cetera. Size: memory is cheap, but I say that size is still a problem. You may have a humongous amount of memory, but the reality is that any particular application may be restricted to only a segment of it, so you still have to worry about the memory allocation issue. The same goes for memory virtualization. Or in consumer electronics, there's only a small amount of real estate, and you can't just buy more memory. And at the other end of it, which Microsoft should know about, if you have a big setup with thousands of machines spinning away, the data center kind of thing, then all of that memory spinning away is going to be using up a lot of power.
So size is still an issue. Although I write the cache miss equation as the probability of a cache miss as a function of the cache size, it is actually a function of two other things: the reference pattern and the management policy. And in fact it is a very intricate interaction of these two, and that's where all the difficulty comes in. The reference pattern itself could depend on the hardware variation: whether you're talking about the data cache or the instruction cache, they have completely different reference patterns. The reference pattern could depend on the data layout, for example how you design the schema. It depends on your application mix; for example, if you have concurrency control, then that concurrency control work is going to affect the timing of the reference pattern. It depends on what kind of compiler options you use, whether there's prefetching. I'm told that the biggest machines out there are actually bought by Wal-Mart and Citibank, and for these guys the data is changing every second, so that affects your reference pattern. And, of course, the whole thing depends on how you configure it, whether you have two levels of caches, whether the cache is [inaudible], et cetera. The upshot is that this whole mess is just mathematically intractable. You couldn't write enough equations to describe the system, and if you could, you couldn't solve those equations. So what do people do? They do what I call simplified bottom-up analysis. First, you assume that it's pure LRU, that there's no prefetching, et cetera, so you get an idealized policy. Then you throw away locality and assume that the references are independent, that there's only one application, that there's only one process, et cetera, and you get your reference pattern. Now that you have simplified everything, you can analyze the interaction between the two of them and write an equation for the cache miss. And then you compare these two pictures and you wonder: how accurate are you? So this kind of bottom-up analysis requires an expert. In any kind of performance analysis you need to weigh what kind of approximations and what kind of assumptions you're going to make, and so forth; you need some expertise in order for the two sides to match reasonably. And if I build a model for your system and then he says, no, I want a model too, but my hardware is a little different, my software is a little different, then what am I going to do? I'm going to do another one from scratch, or somehow tweak this particular model and give it to him. If I build a model for him and tomorrow he wants to upgrade the hardware or software, what am I going to do? Tweak the model again? So this is just not scalable. You can't be doing this. And things get more complicated because nowadays machines are very complicated, and you want a machine to automatically configure itself when you turn on the power, and you want it to adjust itself when the workload changes. So the whole situation is hopeless: bottom-up analysis is not going to work going forward. So what do I propose to do? I say let's give this up and do a top-down analysis instead. What do I mean by a top-down analysis? I say let's just find one equation that will work for everything, for all time. Sounds good. What would it look like? It would look something like this. Instead of just a function of the cache size, I'm going to give you a handful of parameters. How does that solve the problem?
Well, if I build a model for him and then he wants a model too, then in order to port the model to him I just have to tweak the values of the parameters. And if tomorrow he changes the hardware or the software, I just change the values of the parameters. And if you want autonomic configuration, you turn it on and the system will automatically calibrate the parameters. And if there's a workload change, then the parameters will again be calibrated on the fly. I don't have to build a new model; all the while it's just one equation. So this is what I mean by universal: this one equation is supposed to work for all workloads, all levels of the memory hierarchy, for all time. Okay. So, one equation. What would this equation look like? Let me write it in an alternative form: instead of the probability of a cache miss, the number of cache misses. You can go back and forth between these two. So, the number of cache misses as a function of the cache size: M is the cache size, and there are four parameters here. And the equation looks something like this. This is my proposal for the equation. Right? It may not be the right one, but I've got to show you a particular example so that you see what I mean. So this is my proposal for the equation. The cache size is right there, and then I have these four parameters. And what do those four parameters mean? Well, they are the minimal ones I need. If you plot the number of misses against the memory allocation, or the memory size, then of course, as the memory size increases, the number of misses decreases. Right? And at some point you hit the cold misses: no matter what you do, you have to page things in from disk at least once. So these are the cold misses, and N star is the number of cold misses. And then if you reduce the memory, at some point the workload just won't run anymore. So there's a vertical asymptote here, which is the minimum amount of memory you need; that's the M naught there. Now, how is this equation different from previous equations? Previous equations look just like this, except the curve keeps going down forever and ever, and that's just not true. At some point you will hit the minimum, and beyond that point, no matter how much you increase the memory size, it's not going to buy you anything. So M star is this kind of ideal memory size: if you're any smaller than this, you're going to take page faults; if you're any bigger than this, you're just wasting memory. Previous equations don't have this particular M star. The only parameter left is the N naught. So what is this N naught? N naught is like everything else and the kitchen sink thrown inside. If you change your prefetching policy, you add prefetching or subtract prefetching, it's in there. If you do dynamic allocation, it's in there. For example, if you do prefetching then the contribution to N naught would be a negative number; if you do dynamic allocation then it would be a positive number. N naught is the sum of all of this. Geometrically, what N naught does is control the shape of the curve. Okay. So let me illustrate this. Here we do not a single process but multiprogramming: take Linux and seven benchmarks, a very compute-intensive workload, and run all seven of them together. Measure the number of pages read from the disk against the RAM size. This is not simulation; you actually measure it. So these are the data points, and you do a fit of the equation to this. Remember there are four parameters, so you calibrate these four parameters.
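A minimal sketch of that kind of calibration, in Python with NumPy and SciPy. The exact formula is on the slide rather than in the transcript, so the functional form below is only an illustrative stand-in with the behavior described (a vertical asymptote at M naught, a cold-miss floor of N star reached around M star, and N naught controlling the shape in between), and the measurements are made up:

    # Sketch: calibrating a four-parameter miss equation from (RAM size, misses) data.
    # The functional form and the data are hypothetical stand-ins, not the talk's.
    import numpy as np
    from scipy.optimize import curve_fit

    def miss_model(m, n_star, m0, m_star, n0):
        # Illustrative n(m): blows up as m approaches m0, flattens to n_star past m_star.
        m = np.asarray(m, dtype=float)
        return n_star + n0 * np.maximum(m_star - m, 0.0) / np.maximum(m - m0, 1e-9)

    ram_mb = np.array([64, 96, 128, 192, 256, 384, 512], dtype=float)   # RAM sizes (MB)
    misses = np.array([9e5, 4e5, 2e5, 8e4, 3e4, 1.2e4, 1.0e4])          # pages read

    p0 = [1e4, 32.0, 400.0, 1e5]                    # rough guesses for N*, M0, M*, N0
    lo, hi = [0, 0, 0, 0], [1e7, 60.0, 1e4, 1e8]    # keep M0 below the smallest data point
    params, _ = curve_fit(miss_model, ram_mb, misses, p0=p0, bounds=(lo, hi), maxfev=20000)
    n_star, m0, m_star, n0 = params
    print(f"N*={n_star:.0f}  M0={m0:.0f} MB  M*={m_star:.0f} MB  N0={n0:.0f}")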
You do kind of a regression fit, and you see it fits the data pretty well. Real people out there don't use Linux, yeah, so we also tried a Windows workload. For this one we did Windows 2000. And here you see how old this is, right? This is Business Winstone; I don't know if you guys know this old benchmark. What they did was actually trace some secretary or whatever using Windows 2000: cut and paste, open up two applications, click on icons, et cetera, and log all of that. We replayed the log, changed the RAM size, measured the number of pages read from disk, and you see here the cold misses are already way up there. So this is a very IO-intensive workload. And again the equation does a pretty good fit. You compare the shape of the curve here and the shape of the curve previously and you see that the shape is drastically different; that's the N naught effect. Okay. So I claim that the equation is universal, right? Or that some version of the equation should be universal. This means if I change the replacement policy it must still work. So this is what we did: we took Linux and patched it to do FIFO replacement and random replacement. So you get three sets of data points, and the equation fits all three sets of data points. Of course, you get three different M naught values and three different M star values, but the cold misses are the same no matter what replacement policy you use. So this is nontrivial. Getting one equation that will work for any replacement policy is difficult. And then I said, if I build a model for him and tomorrow he wants to change the hardware or the software, the model should still work. Okay? >>: Can you go back one slide? >> Y.C. Tay: Okay. >>: Oh, I see. The sample points are where you're fitting -- >> Y.C. Tay: Yeah. >>: And so, okay, so how do you explain the variation between the sample points and your curves? >> Y.C. Tay: The sample points and the curves? You mean like this distance? >>: Yeah. >> Y.C. Tay: The equation is not perfect, so you can't exactly fit the data points. And part of it is also nondeterministic junk coming in, because we cannot completely shut down all the other processes on the machine. What was this? >>: So do you have like some straw man equations? >> Y.C. Tay: Straw man, sorry? >>: I guess my question is, how do you know how good your equation is? >> Y.C. Tay: How do I know how good the equation is? >>: You say it's not perfect, right? >> Y.C. Tay: Yes. >>: So there would be a thing that would be perfect. >> Y.C. Tay: I don't think there's any equation that can be perfect. >>: Okay. And then there's -- >>: Like a quadratic equation. >>: Yes. >>: How much worse is it at fitting the data points than a -- >> Y.C. Tay: Okay. So, J.D. told me to repeat the questions in case somebody's recording this. The question is: how do I know this is better than, let's say, the power law? The standard way to model cache misses is to use a power-law equation. And we did compare with the power law, and the only comparison that we did was to take the R-squared value. So this is the kind of regression where you can measure the coefficient of determination. The coefficient of determination for these datasets is on the order of more than 95 percent; with the power law, you can get below 90 percent. So just by comparing the coefficient of determination, that's how we did it. We cannot claim that this equation will work for everything.
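A sketch of that R-squared comparison, using the same illustrative stand-in form and hypothetical data as above (not the talk's measurements): fit both candidates, then compare coefficients of determination.

    # Sketch: compare R^2 for a four-parameter miss equation versus a power law.
    import numpy as np
    from scipy.optimize import curve_fit

    def four_param(m, n_star, m0, m_star, n0):
        return n_star + n0 * np.maximum(m_star - m, 0.0) / np.maximum(m - m0, 1e-9)

    def power_law(m, a, b):
        return a * np.power(m, -b)

    def r_squared(y, y_hat):
        ss_res = np.sum((y - y_hat) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1.0 - ss_res / ss_tot

    ram_mb = np.array([64, 96, 128, 192, 256, 384, 512], dtype=float)
    misses = np.array([9e5, 4e5, 2e5, 8e4, 3e4, 1.2e4, 1.0e4])

    p4, _ = curve_fit(four_param, ram_mb, misses, p0=[1e4, 32, 400, 1e5],
                      bounds=([0, 0, 0, 0], [1e7, 60, 1e4, 1e8]), maxfev=20000)
    pp, _ = curve_fit(power_law, ram_mb, misses, p0=[1e9, 2.0], maxfev=20000)

    print("four-parameter R^2:", round(r_squared(misses, four_param(ram_mb, *p4)), 3))
    print("power law      R^2:", round(r_squared(misses, power_law(ram_mb, *pp)), 3))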
We tried it on -- so for certain benchmarks, the data points drop so drastically, so quickly, that we could not match them. And also for certain synthetic benchmarks we couldn't do it. So there are workloads out there that we cannot match. >>: Do you have an intuition about why that is? In those cases you just mentioned, what aspect of the system are you not capturing in your equation? >> Y.C. Tay: So the question is, when it doesn't match, why doesn't it match? For some of them it has to do with locking or something. For example, at one point we ran two compilations of the same file, and the data points were just all over the place, and we couldn't match it. We didn't look further into why that happened; it's just one of those things we couldn't match. And we stayed with the ones that we could match. It's like the Belady anomaly: in the Belady anomaly, even if you increase the memory size, if you run FIFO, your cache misses could go up. So there are always benchmarks, workloads, for which things work in a weird way, and we were not going to spend time on workloads that work in a weird way. We wanted to focus on naturally occurring workloads, like compilation and garbage-collected applications and so forth. Certainly, if there's a real workload that doesn't match the equation and it's a really important workload, then we should work on it. So to answer your question: no, we don't know why sometimes it works and sometimes it doesn't. Okay. So as I was saying, if I build a model for him and tomorrow he wants to change the hardware or software, then I should be able to just tweak the parameters and it should still work. So the way to illustrate this was: we run the same Linux version, and then we change the application from GCC 2.91 to 2.95, and you get a different set of data points, and the equation matches the different set of data points. And then we keep the same application but change the kernel. If you remember, between 2.2 and 2.4 there was a drastic change in the Linux kernel, in the memory manager, and the data points were much lower than previously. And we were still able to match that. So this illustrates that if you change the software, it will still work. >>: When you change it, all the points are on one side of the line -- >> Y.C. Tay: Yeah -- >>: It seems like you could tweak the parameters a little bit and push the line up to fit those data points better. >> Y.C. Tay: No, there were other points. >>: Oh, okay. >> Y.C. Tay: This one looked really bad, because it was just a simulation of Linux. Okay. So I'm claiming that this equation will work for everything, so I'd better give you some kind of intuition for why that might be so. So this is the equation, and let me give you the intuition for it. To get that intuition, I just need to introduce three variables. One is the average time -- so assume that this is RAM, and I have pages coming into RAM, right? And every time a page comes into RAM it stays there for a while before it gets thrown out. So let me measure the average time a page stays in RAM; that's T RAM. And every time it gets thrown out to disk, it sits there for a while before it gets called in again. That average time will be T disk. And R is just the number of times that the page comes into RAM.
So we do that, and then I have something that I grandly call the reference-plus-replacement invariant. As I mentioned, a cache miss is an interaction of the references and the replacement; you have to take care of both before you know what the cache misses are going to be. What the invariant says is that this expression here is approximately equal to 1. Once you have that, you just have to add Little's Law, standard and robust in queueing theory, and you have the equation. So the only thing you have to worry about is why this invariant is true. What the invariant is actually saying is this. Remember, the pages come into RAM and get thrown out, and so forth. When a page is in RAM it's black; when it gets thrown out to disk it's white. When I take the average of the black parts I have the RAM time, T RAM. If I take the average of the white parts I get T disk. It comes into RAM three times, so R is equal to three. So what the invariant is saying is that it is likely, when you have plenty of memory around, that the R value is small: you get paged in only a small number of times, and every time you come in you stay there for a long time. So R is small, this value is small, and this value here is large. That's one possibility. The other possibility is that you're under memory pressure. If you're under memory pressure, then you get called in many times, R is large, and every time you come in you can only stay there for a short time. That will also satisfy the invariant. What is unlikely is that you come in only a small number of times and every time you come in you get immediately thrown out; that is unlikely. And it's also unlikely that you come in many times and every time you come in you stay for a long time; that is unlikely too. So this is basically what the invariant is saying. And as I mentioned, you add Little's Law to it and then you're there; that's where you get the equation. Okay. So the equation is supposed to work for everything, so let's apply it to the database buffer. Here we took data from this particular IBM paper. They measured the references on a real machine, an industrial-strength kind of machine, running real workloads. And then they reran the trace on different buffer pool sizes and measured the misses. So these are real workloads, and the equation works for these real commercial workloads. They actually have 12 traces, and we fit all but maybe one of them. And then we tried it on a web cache. Here what we did was take a trace from one of the peer-to-peer websites in China [inaudible]. But from the experimental point of view, the fit is pretty good. >>: So is the goal of this equation to sort of describe stuff after the fact, or to actually predict things before you change the disk or you change the memory or something like that? In other words, is the main goal of the equation to sort of help you explain, okay, here's what my cache looks like, here's sort of an intuition for why that's happening, or is it supposed to be something where I swap out my disk for a new one and all of a sudden I can sort of predict what the subsequent cache behavior is? >> Y.C. Tay: No. At this point you can't do that kind of prediction, because you need the parameters, and the parameters need to be calibrated. When you [inaudible] the disk for a new disk, I don't know what the new parameters are for the new disk. But you can do a prediction in a different sense, which I'll come to.
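The invariant expression itself is on the slide rather than in the transcript, but the three quantities it is built from are defined above, so here is a toy check in Python that measures them: a plain LRU cache over a synthetic workload (all sizes and numbers are made up), reporting per-page R, T_RAM and T_disk so you can see the two likely regimes the speaker describes, many short visits under memory pressure versus a few long visits when memory is plentiful.

    # Toy LRU simulation: measure, per page, R (times brought in), T_RAM (mean residency)
    # and T_disk (mean absence).  Synthetic workload; nothing here is from the talk's data.
    import random
    from collections import OrderedDict, defaultdict

    def simulate(cache_size, n_pages=1000, hot_pages=50, n_refs=200_000, seed=0):
        rnd = random.Random(seed)
        # Half the references go to a small hot set, half are spread over all pages.
        refs = [rnd.randrange(hot_pages) if rnd.random() < 0.5 else rnd.randrange(n_pages)
                for _ in range(n_refs)]
        cache = OrderedDict()                      # page -> time it was brought in
        leave = {}                                 # last eviction time per page
        res, gone, entries = defaultdict(list), defaultdict(list), defaultdict(int)
        for t, p in enumerate(refs):
            if p in cache:
                cache.move_to_end(p)               # hit: refresh LRU position
                continue
            if p in leave:
                gone[p].append(t - leave[p])       # absence interval (a T_disk sample)
            entries[p] += 1                        # one more fault-in (counts toward R)
            cache[p] = t
            if len(cache) > cache_size:
                victim, t_in = cache.popitem(last=False)
                res[victim].append(t - t_in)       # residency interval (a T_RAM sample)
                leave[victim] = t
        rows = sorted((entries[p], sum(res[p]) / len(res[p]), sum(gone[p]) / len(gone[p]))
                      for p in entries if res[p] and gone[p])
        return rows[len(rows) // 2]                # roughly the median page

    for size in (100, 700):                        # memory pressure vs. plenty of memory
        r, t_ram, t_disk = simulate(size)
        print(f"cache={size}: R={r}, T_RAM~{t_ram:.0f}, T_disk~{t_disk:.0f}")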
So at this point in the presentation I'm just worried about fitting the data. Okay. So the goal is to have just one equation to fit everything, but I don't think that's going to be the case, right? So this is where the research proposal comes in. I say everybody who does research on cache misses should just dump whatever they are doing and focus their expertise on deriving one parameterized miss equation for each cache type, and do it in a top-down way. But let's postpone that point for the time being. It seems unlikely that that can actually be achieved, but I've given you some evidence indicating that for RAM and the database buffer and the proxy, this one equation works, actually. Why did I say it's unlikely that one equation will work for everything? That's why you have to restrict it, scale it down, and say: for each cache type, let's come up with one equation. And the reason, right off, is that it doesn't seem like this will work for set-associative caches. Simple reason: this particular equation is one-dimensional, there's only one M value, but set-associative caches are two-dimensional. So it doesn't seem like the same equation would work for set-associative caches. However, if you change the dimensions of the set-associative cache, say you go from direct-mapped to set-associative, to me I can interpret that as changing the replacement policy. And if you change the replacement policy, the equation is supposed to still work. So it actually might still work. And it did, except that I have to change it: I just have to add a log there. At this point I don't know why it works if I just add the log, but that's what we observed. So let me show you some data here. >>: With respect to the set size? >> Y.C. Tay: We tried it for different set sizes, different line sizes. >>: So like a two-way associative and a one-way associative, they both behave the same; the log works for both? >> Y.C. Tay: Yes. >>: That's amazing. >> Y.C. Tay: With set-associative caches, I think by the time you get to eight-way it's more or less stable. >>: I agree. Eight-way should be kind of like a fully associative cache. But I'm amazed that it's similar to, say, two-way associative. >> Y.C. Tay: Yeah. It even works with one-way. >>: That's -- wow. >> Y.C. Tay: So we did an eight-way associative L2 cache. This one is the data cache with a SPEC workload; these all use SPEC workloads. So this is the data. And it has this waterfall behavior, which the equation can't capture. But what the equation can do is roughly match this downward trend. It's not doing very well here, but for the rest of it it's reasonable. And we tried it on an instruction cache as well, and it works. So this is where the universality part of it comes in. >>: Again, I think someone already asked this, but it seems like the real question is, if I plug those points into Excel and ask it to draw a curve through them, I'll get something that looks a lot like that. So it seems like there needs to be some way to quantify why this is better than whatever Excel or [inaudible] gives me by just saying fit a curve to these points. >> Y.C. Tay: So if I use Excel and plug in these data points, Excel will give me an equation and a fit, and then I will evaluate, compare the two fits and see which has a better fit. Right? How else can you compare -- >>: I agree. I guess I just -- do you have evaluations of that type that you could show us? >>: So you could do that.
You could actually fit the curve in Excel, let's say, and fit it with your equation, and pick some point in space that you haven't evaluated yet, and see what the two curves say the miss rate should be and see what that delta is. >> Y.C. Tay: So, comparing this with what Excel might do, for example: Excel will not give you an M star value. Excel will come out with probably some kind of power curve, polynomial, whatever. If it uses a polynomial it will go really bad when you scale up M, for example; it's going to go to infinity or whatever. What Excel or MATLAB would do is fit the particular segment: if your data is within this range, it will fit very well within this range, but when I do some kind of extrapolation, I think MATLAB and Excel will do poorly. >>: In other words, I'll ask this: can your equation successfully predict at what point there are diminishing returns on memory? Your equation has this behavior where it suddenly drops off. Can it predict where that point is, if you happen to actually collect data past that point? >> Y.C. Tay: I'll show you the prediction. You had a question? A point? >>: Basically self- -- >> Y.C. Tay: Okay. All right. I've explained what I mean by parameterized. I've explained what I mean by cache type. The only thing I haven't explained is the top-down modeling part of it. What do I mean by top-down modeling? Okay. So let's look at the garbage-collected heap. The difficulty with garbage-collected applications is that the garbage collector, or whatever is responsible, changes the heap size: if you change the memory allocation, then the heap size may also change, if you have automatic heap sizing somewhere in there. And if you change the heap size, then the reference pattern is going to change. And this is really difficult for bottom-up analysis. In bottom-up analysis, you start off with a particular model of the reference pattern, and then you assume that this reference model doesn't change when you change the memory size. And that's just not so in the case of garbage-collected applications. So how does the equation work for garbage-collected applications like this one? Here I introduce another variable, which is the heap size, and the workload is a generational mark-sweep collector with one of the, now more or less standard, DaCapo benchmarks, PMD. For a heap size of 14 megabytes, these are the data points and this is the fit. It fits pretty well here, not so well here. But to some extent it's a question of what you want to capture. If this is what concerns you, I'm sure we can tweak the fit to do better here. >>: Isn't that feature just a different shape of the equation, with different parameters? >> Y.C. Tay: It is. For the purpose of this talk I'm interested in a general fit of the rest of the curve. So for 100 megabytes, again, it's a reasonable fit. And you can complain again about this part of it, but that's not the emphasis of this particular fit I'm demonstrating here. So notice, if I change the heap size, the shape of the curve is very different, but the equation can still capture that. So it seems like the same equation works for this, except that that is the equation, and the critical variable, the heap size, is nowhere in the equation. So that's not good.
So what we did was look at the parameters that we got, look at the M star value, for example, and plot the M star value against the heap size. And we see that it's actually more or less on a straight line. So that says that M star is actually linear in the heap size. And this already gives you some hint on heap sizing. If I give you a particular memory allocation, what should the heap size be? Intuitively, the bigger the heap, the better, because the bigger the heap, the less garbage collection you need to do. But you don't want the heap to be so big that it starts paging and you start taking page faults; page faults are very expensive. So for a given memory allocation, you should just increase your heap size until the M star value equals your memory allocation. At that point you don't increase anymore; if you increase any further, then you're going to start getting page faults. So that gives you one rule for doing heap sizing. Then we look at the N naught value. The N naught value, when plotted against the heap size, decreases and then stabilizes. Now why would it do something like this? Remember that N naught is this magic with the kitchen sink inside; if you do dynamic allocation, the contribution to N naught is positive. So think about it, right? When your heap size decreases, what happens? You do more and more garbage collection. And when you do more and more garbage collection, you're allocating stuff, putting stuff back on the free list, allocating stuff, taking stuff off the free list; you do more and more of that, and that's why the N naught value goes up as your heap size decreases. So this particular curve gives you another rule about heap sizing: you want to increase your heap size, but at some point your heap is big enough to accommodate the footprint, and there's no point increasing the heap size further than that, because if you do, then your garbage collector is going to walk over idle memory. So this gives you one rule about what your maximum heap size should be. So you can already see how we can formulate a heap sizing policy using these two parameters. But anyway, now that I've got these two relationships, I can plug them into that equation, and now this is my new equation for garbage-collected applications: we have the heap size explicitly in the equation. And this is what I mean by top down. I start off with an equation I know works, and then if I'm interested in some particular variable that's actually not explicitly in the equation, I take the parameters and refine them. When I refine them and put them back in, I have the variable explicitly in the equation. And this might not be the end of it. You might be interested in how different garbage collectors affect the equation. In that case you analyze how these four parameters depend on the garbage collector, and thereby refine the equation some more. This is what I mean by top down.
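A minimal sketch of the first heap-sizing rule above, assuming the fitted M*(heap) relationship is roughly linear as in the plot; the slope, intercept, and footprint cap below are hypothetical numbers, not the talk's:

    # Sketch: pick a heap size for a given memory allocation using the fitted, roughly
    # linear relationship M*(heap) = a*heap + b.  All numbers are hypothetical.
    A, B = 1.6, 12.0              # assumed slope/intercept of the fitted M*(h) line, in MB
    FOOTPRINT_HEAP_MB = 90.0      # assumed heap size beyond which N0 stops improving

    def heap_size_for(allocation_mb: float) -> float:
        """Grow the heap until M*(heap) reaches the allocation, then cap at the footprint."""
        h = (allocation_mb - B) / A       # solve a*h + b = allocation for h
        return max(0.0, min(h, FOOTPRINT_HEAP_MB))

    for alloc in (60, 100, 180):
        print(f"allocation {alloc} MB -> heap of about {heap_size_for(alloc):.0f} MB")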
So let me zip through the applications now. First: how to power down idle memory. Let me just use this example. M star is here; any memory beyond M star is just wasting energy, so you can power that down. So here is an example where I have three sets of data. You fit the three sets of data with the equation, and once you do the fit, you have the M star value. With the M star value you can now power down the extra amount of memory. And, by the way, that was referring to the earlier question: this is an example of where you can actually predict where M star is without any data points on that part of the curve. >>: So is there validation that that was a correct prediction? >> Y.C. Tay: Validation that that was a correct prediction? Not here, but I think in the paper itself you can see how accurate or inaccurate these values are. Not in this particular slide. Okay. Memory partitioning. Let's say you're going to partition memory into three pieces. So you have three sets of data points, you fit them, and once you fit them you have these three equations. With these three equations you can completely map out the surface. So this is buffer pool one, this is buffer pool two, and the third one is just the difference from the total buffer pool size. With the equation I can completely map the surface, and once you know the surface completely, you can do any kind of optimization that you want: pick any optimization criterion you are interested in and do your optimization. TPC workload. As I mentioned, you want the equation to help you do dynamic adjustment of your system. So here's an example. What we did was change the number of terminals, 50, 40, 30, et cetera, simulating people coming in and going out. So the workload is dynamically changing, and we want to hold the miss rate constant at 0.2. If you ignore this part of it, you can see the miss rate goes in a cycle. Now, the question is how you track that. One way to do this is gradient descent, the standard way to do it. If you do it by gradient descent, without using the equation that I propose, then what you get is something like this, which can swing violently, because gradient descent is very sensitive to fluctuations. So instead of using raw gradient descent, what you do is take the data points and smooth them out. How do you smooth out the data points? You use the equation to fit the data points first, and then you apply gradient descent to the equation. That smooths out the fluctuations and releases you from the tyranny of the randomness. If you do that, then the convergence is faster. But if you're going to take the data points and fit them with the equation, then why bother doing gradient descent? Just use the equation to predict what the memory size should be to get the 0.2. And if you do that, then you have an even faster convergence.
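A sketch of that last step, assuming the equation has already been calibrated: solve it directly for the memory size that hits the target miss ratio instead of stepping toward it with gradient descent. The functional form and the parameter values below are illustrative stand-ins, not the talk's.

    # Sketch: invert a calibrated miss-ratio equation to find the memory for a target.
    from scipy.optimize import brentq

    P_STAR, M0, M_STAR, P0 = 0.05, 32.0, 400.0, 0.9   # hypothetical calibrated values

    def miss_ratio(m):
        # Illustrative p(m): high near M0, falling to the cold-miss ratio P_STAR by M_STAR.
        return P_STAR + P0 * max(M_STAR - m, 0.0) / max(m - M0, 1e-9)

    TARGET = 0.2
    # p(m) decreases on (M0, M_STAR), so a root finder on p(m) - TARGET does the job.
    m_needed = brentq(lambda m: miss_ratio(m) - TARGET, M0 + 1e-6, M_STAR)
    print(f"memory for miss ratio {TARGET}: about {m_needed:.0f} MB")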
So that's how you use the equation to do this dynamic adjustment. Now, how do you free up memory when you have these dynamic adjustments? Here's an example with memory reclamation, and you want to do it in a fair way. Imagine you have memory partitioned across three workloads. Initially it's unfair: one of the workloads, buffer pool two, always has a smaller miss ratio than the other two. The gap between the three of them is big. So what I want to do is reduce the gap, and I don't want the same guy to win all the time. That's my idea of fairness. >>: Sorry, just slightly out of scope, but why is it fair if everybody has the same miss rate? It could be that some inherently have a high miss rate, so if you push it out -- >> Y.C. Tay: No, this is a ratio. >>: What is P star? >> Y.C. Tay: P star is the cold-miss ratio. What I want to do is free up this amount of memory. I want to take some memory from each one of them, and I want to do it in a fair way. Once you have the equation, you can just do the calculus with the equation, and what you get out of it is a prorating, but the prorating is not obvious: you have to look at the M star value, the ideal allocation, and the current allocation. If you do that, then after the reclamation the gaps are smaller and no one of them is always the winner. Database buffer tuner. Database tuning is complicated nowadays, and the three major vendors have put out software to help people do this. In all three pieces of software, if you look into it, somewhere inside there's actually a simulator, and what the simulator does is predict the number of IOs for a bigger or smaller buffer, or that's what it tries to do. Doing simulation is good; it has all the engineering details. But simulation is messy: if something changes, you have to go in and change the simulator. So it's kind of a pain. That's one issue. The other issue is predicting for a smaller buffer. In order to do this kind of simulation you need to have a trace. Now, suppose your system is such that you cannot see inside the buffer; it's encased in a black box, and you can only see what comes out of the black box, so you can only see the misses, you cannot see the hits. If you cannot see the hits, then when I reduce the buffer size, some hits become misses, and you never saw them. So how do you actually do the simulation? >>: It seems like it's a little unfair to say that their problem is that they have to do simulations, because you effectively have to do a simulation, too, to recover what these parameters are. In other words, the equation itself doesn't have any -- because of this N naught kitchen sink, the equation doesn't give you any prediction as to what will happen, even if you make a tiny change in the software. For example, I don't know how sensitive you are to workload, because you showed us 25 slides ago that just changing the version of the kernel or changing the version of the compiler dramatically changes the shape of the curve. And the equation is not going to give you any visibility into how the curve will change if you make any of those kinds of changes. So it seems like effectively what you're arguing for is also doing a simulation in order to recover the parameters. You'll have to simulate either way. It's just that if you simulate the way they do, they give you the data points and just don't fit a curve to them; you're saying you should run a simulation and also fit a curve to the points. >> Y.C. Tay: That's right. That's a fact that -- okay. So forget about what I said about doing the simulation. Let's just go to -- there's still this other issue about what happens if you want a smaller buffer. If you do a simulation with a smaller buffer, there's nothing you can do if it's within a black box. And I've been told that situation actually arises: maybe the one that owns the machine doesn't want you to have access to the traces, to the reference pattern. Now, for us there's no problem, right? If you want a smaller buffer, then you just extrapolate backwards. I don't need to know the reference pattern. Okay. Gee, I'm on time. Yeah.
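A sketch of that backwards extrapolation, again with the illustrative stand-in form and made-up observations: fit on the buffer sizes you were able to observe, then evaluate the fitted equation below that range to estimate what a smaller buffer would have missed.

    # Sketch: predict misses for a smaller buffer by evaluating a fitted miss equation
    # below the observed range.  Functional form and data are hypothetical stand-ins.
    import numpy as np
    from scipy.optimize import curve_fit

    def miss_model(m, n_star, m0, m_star, n0):
        return n_star + n0 * np.maximum(m_star - m, 0.0) / np.maximum(m - m0, 1e-9)

    buf_mb = np.array([256, 320, 384, 448, 512], dtype=float)   # observed buffer sizes (MB)
    misses = np.array([9e4, 5e4, 2.5e4, 1.6e4, 1.2e4])          # observed miss counts

    params, _ = curve_fit(miss_model, buf_mb, misses, p0=[1e4, 64, 600, 1e5],
                          bounds=([0, 0, 0, 0], [1e6, 150.0, 2000.0, 1e7]), maxfev=20000)
    for m in (160, 192, 224):                                   # sizes we never ran
        print(f"predicted misses at {m} MB: about {miss_model(m, *params):.0f}")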
So this is what I've been telling you: bottom-up analysis is just not scalable, and I believe the way forward is to go for a universal equation, done in a top-down way, a universal equation that will work for everything. It sounds too ambitious, but I think I've shown you some data to hopefully convince you that it's actually feasible. And the point I want to emphasize is this: I think this is the way to do it. Give up bottom-up analysis. So I'll be happy to answer more questions. >>: We had a bunch of questions during the talk, but any other questions? What I was curious about is a related-work question. In the beginning you said that in bottom-up analysis they don't include any locality in the models. >> Y.C. Tay: Yes. >>: Is that really the case with the existing work? >> Y.C. Tay: Well, when you assume that the references are independent, you've given up locality. What is locality? Temporal locality is correlation between two references that are close in time; spatial locality is correlation between two references that are close in space. This is just correlation. The minute you say independent references, you have given up your -- >>: And the existing models all use independent references? Wow. That's amazing. >> Y.C. Tay: There are some people who model their workload as a graph, for example. Then they do keep some of the locality in that model. >>: There's the model, the stack distance model. Basically, you assume that you have the working set and you look at the stack distance, basically the distance between two references to the same page, and people have been using this to somehow -- it's a very simple model, it abstracts it all away. But still, it captures some notion of temporal locality, right? >> Y.C. Tay: Yeah, yeah. For that they probably have to assume LRU, though. >>: Yes, that's assuming LRU. >> Y.C. Tay: So you have to simplify this or you have to simplify that; otherwise it's impossible to analyze the interaction between the two of them. >>: I think you had a comment on one of the last slides about the [inaudible] stuff. The simulation that they do is actually white-box simulation. >> Y.C. Tay: Yeah, yeah. >>: They actually put that simulator inside the database engine, and it sits right there in the actual cache itself. So it doesn't have the data that's cached, but it has some of the stubs, so you can see whether something is a hit or a miss. So they know about both. >> Y.C. Tay: Yeah, so this particular one, yeah, the SQL Server software, Resource Advisor. >>: That's the black box. >> Y.C. Tay: They use the trace to do the simulation. >>: [inaudible]. >> Y.C. Tay: No. When I say black box, I mean: suppose you are the vendor and you are doing performance analysis for some company, and the company doesn't let you look at the reference pattern. Then what do you do? >>: But what I'm saying is, the Oracle and DB2 stuff that's here is actually in the database itself. So it's not like it's an external vendor that does the analysis. >> Y.C. Tay: Then I must have misunderstood, because this point was brought up to me by Maryanna. So I must have misunderstood. >>: This is internal to the database engine. >> Y.C. Tay: Okay. So probably -- >>: Inside the engine. >> Y.C. Tay: I'll clarify that. More questions? >> J.D. Douceur: Let's thank our speaker. [applause]