>> J.D. Douceur: Good morning. It's my pleasure to introduce Y.C. Tay, who joins us from the
National University of Singapore. He's here in town visiting us today. And also presumably in
town for SIGMETRICS, which is next week. And if any of you are interested in performance analysis -- if you're not, you probably wouldn't be here -- and you have not registered for SIGMETRICS, we still have plenty of registration available. It's not too late. We'd love to have you at the conference.
You can talk to lots of smart people including Tay and others.
He's been involved in SIGMETRICS even longer than I have. Today he's going to tell us why the
last 40 years of cache miss analysis should be completely thrown out and replaced with
something new. So Tay.
>> Y.C. Tay: I want to thank J.D. for hosting the talk despite all the work that he has to do for SIGMETRICS, and I appreciate your taking the time to be here.
Everybody's busy nowadays, so thank you for spending an hour listening to me talk about something that actually started 10 years ago when I was in Building 26.
You can tell it's been that long from the long list of collaborators I have here. So how does this
work? What I'm going to do is propose a research program, and here I use "program" in the old-fashioned sense of the word, meaning that I identify what I believe is an important and hard problem, and I propose a particular way of attacking the problem.
And if this particular way of attacking the problem works, then the problem will be solved and people won't have to worry about it anymore.
So it's kind of an ambitious program, as all programs are. And the problem I'm going to attack is not at any particular level of the memory hierarchy; I'm saying that this way of attacking the problem would work for any level of the memory hierarchy.
So a bit audacious there. So, cache miss equations -- as J.D. mentioned, people have been deriving these forever. What new idea do I have here? The new idea that I have is something that I call top down.
What people have been doing in the past is what I call bottom up. So I need to distinguish between the two of them; that's what I'm going to do first.
Why is this particular way of doing it as powerful as I think it is? Because it's universal. Again I
have to explain that. Once I explain what I mean by those two, then I can explain what the
research program is about. And then I show you some applications.
So cache miss equation, everybody knows what this is. This just expresses the probability of a
cache miss as a function of a cache size.
So cache is everywhere in the memory hierarchy. And I don't restrict myself to any particular
level. I could be talking about processor cache or the RAM itself or database buffer or browser
cache, et cetera.
Size: memory is cheap, but I say that size is still a problem. You may have a humongous amount of memory, but the reality is that any particular application may be restricted to only a segment of it.
So you still have to worry about the memory allocation issue; you have to deal with that in memory virtualization.
Or in consumer electronics, there's only a small amount of real estate, and you can't just buy more memory. And then at the other end of it, which Microsoft should know about, if you have a big setup with thousands of machines spinning away, a data center kind of thing, then all that memory spinning away is going to be using up a lot of power.
So size is still an issue. So although I write the cache miss equation as the probability of a cache miss as a function of cache size, it is actually a function of two other things: the reference pattern and the management policy. And in fact it is a very intricate interaction of these two, and that's where all the difficulty comes in.
Reference pattern itself could depend on the hardware variation, whether you're talking about the
data cache or instruction cache. They have completely different reference patterns.
The reference pattern could depend on the data layout. For example, how you design the schema.
Depends on your application mix. For example, if you have concurrency control, then that
concurrency control work is going to affect the timing of the reference pattern.
Depends on what kind of compiler options you use, whether there's prefetching. I'm told that the
biggest machines out there are actually bought by Wal-Mart and Citibank, and for these guys the
data is changing every second.
So that affects your reference pattern. And, of course, the whole thing depends on how you configure it, whether you have two levels of cache, whether the cache is [inaudible], et cetera.
The upshot is that this whole mess is just mathematically intractable. You couldn't write enough
equations to describe the system, and if you could, you couldn't solve those equations.
So what do people do? They do what I call simplified bottom up analysis. First, you assume that it's pure LRU, that there's no prefetching, et cetera.
So you get an idealized policy. And then you throw away locality and assume that the references are independent, and that there's only one application and only one process, et cetera, and then you get your idealized reference pattern.
Now that you have simplified everything you can analyze the interaction between the two of them
and you can now write an equation for the cache miss.
And you compare these two pictures and you wonder how accurate are you? So this kind of
bottom-up analysis requires an expert. In any kind of performance analysis you need to weigh
what kind of approximations and what kind of assumptions you're going to make and so forth.
You need some expertise in order for the two sides to mesh reasonably.
And if I build a model for your system and then he says, no, I want a model too, but my hardware is a little different, my software is a little different, then what am I going to do? I'm going to do another one from scratch, or somehow tweak this particular model and give it to him.
If I build a model for him and tomorrow he wants to upgrade the hardware or software, what am I going to do? I'm going to tweak the model again? So this is just not scalable.
You can't be doing this. And things get more complicated because nowadays machines are very
complicated and you want it to automatically configure itself when you turn on the power and you
want it to adjust itself when the workload changes.
So the whole situation is hopeless. Bottom up analysis is not going to work going forward. So
what do I propose to do? I say let's give this up and let's do a top down analysis instead. What
do I mean by a top down analysis? I say let's just find one equation that will work for everything
for all time.
Sounds good. What would it look like? It would look something like this. Instead of just a function of the cache size, I'm going to give you a handful of parameters.
How does that solve the problem? Well, if I build a model for him and then he wants a model too, then in order to port the model to him I just have to tweak the values of the parameters. And if tomorrow he changes the hardware or the software, I just change the values of the parameters.
And if you want autonomic configuration, you turn it on and the system will automatically calibrate
the parameters. And if there's a workload change, then the parameters will again be calibrated
on the fly.
I don't have to build a new model. All the while it's just one equation. So this is what I mean by
universal. This one equation is supposed to work for all workloads all levels of the memory
hierarchy for all time.
Okay. So one equation. What would this equation look like? Let me write it in an alternative form: instead of the probability of a cache miss, the number of cache misses. You can go back and forth between these two. So the number of cache misses as a function of the cache size: M is the cache size, and there are four parameters here. And the equation looks something like this.
This is my proposal for the equation. Right? It may not be the right one, but I've got to show you a particular example so that you see what I mean, right?
So this is my proposal for the equation. And the cache size is right there. And then I have these
four parameters. And what do those four parameters mean? Well, they are like minimal ones I
need. If you plot the number of misses against the memory allocation or the memory size, then,
of course, as the memory size increases, the number of misses decreases.
Right? And at some point you hit the cold misses. No matter what you do, you have to fetch something from disk. So these are the cold misses. So N star is the number of cold misses.
And then if you reduce the memory, at some point the workload just won't run anymore. So there's a vertical asymptote here, which is the minimum amount of memory you need. That's the M naught there. Now, how is this equation different from previous equations? Previous equations look just like these two, except that they keep going down forever and ever. And that's just not true.
At some point you will hit the minimum. And beyond that point, no matter how you increase the memory size, it's not going to buy you anything. So M star is this kind of ideal memory size. If you were any smaller than this, you're going to take page faults. If you're any bigger than this you're just wasting memory. Whereas previous equations don't have this particular M star.
The only parameter left is the N naught. So what is this N naught? N naught is like everything else and the kitchen sink inside. If you change your prefetching policy, you add prefetching or subtract prefetching, it's in there. You do dynamic allocation, it's in there. For example, if you do prefetching then N naught would be a negative number. If you do dynamic allocation then it will be a positive number. N naught is the sum of all of this.
Geometrically, what N naught does is control the shape of the curve. Okay. So let me illustrate this.
So here we do not a single process but multiprogramming: take Linux and seven benchmarks. This is a very compute-intensive workload.
You run all seven of them together. Measure the number of pages read from the disk against the
RAM size. This is not simulation; you actually measure it.
And so these are the data points, and this is -- you do a fit of the equation to this. Remember you have four parameters, so you calibrate these four parameters. You do kind of a regression fit and you see it fits the data pretty well.
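A minimal sketch of what that calibration step could look like; the functional form is a hypothetical placeholder (the actual equation on the slide is not reproduced in the transcript), and the data points are synthetic:

# Sketch of the calibration step: fit a four-parameter miss equation to
# measured (memory size, miss count) points by least squares.  The model
# below is an illustrative placeholder, NOT the equation on the slide,
# and the data are synthetic.
import numpy as np
from scipy.optimize import curve_fit

def miss_model(M, M0, Mstar, N0, Nstar):
    # Vertical asymptote at M0 (the workload cannot run below this) and
    # a cold-miss floor Nstar once the allocation reaches Mstar.
    return Nstar + np.maximum(Mstar - M, 0.0) * N0 / (M - M0)

rng = np.random.default_rng(0)
M_data = np.linspace(64, 512, 12)                         # MB, synthetic
N_data = miss_model(M_data, 32, 400, 2e3, 3e4) * rng.normal(1, 0.05, 12)

params, _ = curve_fit(miss_model, M_data, N_data,
                      p0=[20, 300, 1e3, 1e4],
                      bounds=([1, 100, 0, 0], [60, 600, 1e5, 1e6]))
print(dict(zip(["M0", "M*", "N0", "N*"], params.round(1))))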
Real people out there don't use Linux, yeah, so we still tried a Windows workload. So this one we did on Windows 2000. And here you see how old this is, right? With Business Winstone -- I don't know if you guys know this, it's an old benchmark. What it did was they actually traced some secretary or whatever using Windows 2000: cut and paste, open up two applications, click on icons, et cetera, and then they logged all of that.
We replayed the log, changed the RAM size, measured the number of pages read from disk, and you see here the cold misses are already way up there.
So this is a very I/O-intensive workload. And again the equation does a pretty good fit. You compare the shape of the curve here and the shape of the curve previously and you see that the shape is drastically different, and that's the N naught effect.
Okay. So I claim that the equation is universal, right? Or some version of the equation should be universal. This means if I change the replacement policy it must still work. So this is what we did. We took Linux and then we patched it to do FIFO replacement and random replacement. So you get three sets of data points, and the equation fits all three sets of data points.
Of course, you get three different M naught values, three different M star values, but the cold misses are the same no matter what replacement policy you use.
So this is nontrivial. Getting one equation that will work for any replacement policy, this is difficult.
And then I said if I build a model for him and tomorrow he wants to change the hardware, the
software, the models should still work. Okay?
>>: Can you go back one slide?
>> Y.C. Tay: Okay.
>>: Oh, I see. The sample points are where you're fitting --
>> Y.C. Tay: Yeah.
>>: And so, okay so how do you explain the variation between the sample points and your
curves?
>> Y.C. Tay: The sample point and the curves? You mean like this distance?
>>: Yeah.
>> Y.C. Tay: The equation is not perfect. So you can't exactly fit the center points, the data
points. And part of it is also nondeterministic junk coming in, because we cannot completely shut
down all the processes in the machine.
What was this?
>>: So do you have like some straw man equations?
>> Y.C. Tay: Straw man, sorry?
>>: I guess my question is, how do you know how good your equation is.
>> Y.C. Tay: How do I know how good the equation is?
>>: You say it's not perfect, right?
>> Y.C. Tay: Yes.
>>: So there would be a thing that would be perfect.
>> Y.C. Tay: I don't think there's any equation that can be perfect.
>>: Okay. And then there's --
>>: Like a quadratic equation.
>>: Yes.
>>: How much worse is it at fitting the data points than a --
>> Y.C. Tay: Okay. So -- J.D. told me to repeat the questions in case somebody's recording this. So the question is, how do I know this is not better than the power law, let's say. The standard way to model cache misses is to use a power-law equation.
And we did compare with the power law, and the only way -- the only comparison that we did was
we took the R squared value. So here is the kind of regression. So you can measure the
coefficient of determination. So the coefficient of determination for these datasets is of the order of more than 95 percent.
With the power law, you can get below 90 percent. So just by comparing the coefficient of determination -- that's how we did it.
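One way to make that comparison concrete (a sketch; y_obs, y_fit_eq and y_fit_power are assumed to come from whatever fitting step produced the two candidate models):

import numpy as np

def r_squared(y_obs, y_pred):
    # Coefficient of determination: 1 - SS_residual / SS_total.
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Compare the parameterized miss equation against a fitted power law on
# the same measured miss counts, e.g.:
# print(r_squared(y_obs, y_fit_eq), r_squared(y_obs, y_fit_power))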
We cannot claim that this equation will work for everything. We tried it on -- for certain benchmarks, the data points drop so drastically, so quickly, that we could not match it.
And also for certain synthetic benchmarks we couldn't do it. So there are benchmarks out there, there are workloads, that we cannot match.
>>: You have an intuition about why that is, so in those cases you just mentioned what aspect of
the system you're not capturing in your equation?
>> Y.C. Tay: For certain -- for some of them -- so the question is, when it doesn't match, why doesn't it match?
For some of them it has to do with locking or something. So, for example, at one point we ran two compilations of the same file. And the data points were just all over the place. And we couldn't match it.
We didn't look further into why that happened. It's just one of those things we couldn't match. And we show fitted curves on those ones that we could match.
It's like Belady's anomaly. With Belady's anomaly, even if you increase the memory size, if you run FIFO, your cache misses could go up. So there are always benchmarks, workloads, for which things work in a weird way, and we were not going to spend time on workloads that work in a weird way.
We wanted to just focus on naturally occurring workloads, like compilation and garbage collected
applications and so forth.
Certainly if there's a real workload that doesn't match the equation and it's a really important
workload, then we should work on it.
So to answer your question, no, we don't know why sometimes it works, sometimes it doesn't
work. Okay. So if you -- so as I was saying, if I build a model for him and tomorrow he wants to
change the hardware or software, then I should be able to just tweak the parameters and it
should still work.
So here the way to illustrate this was, we ran the same Linux version, and then we changed the application from GCC 2.91 to 2.95, and you get a different set of data points and the equation matches the different set of data points.
And then we keep the same application but change the kernel. So if you remember, between 2.2 and 2.4 there was a drastic change in the Linux kernel, in the memory manager, and the data points were lower, much lower than previously. And we were still able to match that.
So that illustrates that if you change the software, it will still work.
>>: When you change -- all the points are on one side of the line --
>> Y.C. Tay: Yeah --
>>: Seems like you could tweak the parameters a little bit and push the line up to fit those data points better.
>> Y.C. Tay: No, there were other points.
>>: Oh, okay.
>> Y.C. Tay: This one looked really bad, because it was just a simulation of Linux.
Okay. So to give you -- so I'm claiming that this equation will work for everything. So I better give
you some kind of intuition for why that might be so. So this is the equation. And let me give you
the intuition for that.
So to get intuition for that, I just need to introduce three parameters, three variables. One is the
average time -- so assuming that this is RAM and I have pages coming into RAM, right? And
every time a page comes into RAM it stays there for a while before it gets thrown out.
So let me measure the average time a page stays in RAM. So that's T RAM. And every time it gets thrown out to disk, it sits there for a while before it gets called in again. That average time will be T disk. And R is just the number of times that a page comes into RAM.
So we do that, and then I have something that I call -- grandly call -- the references-plus-replacement invariant. As I mentioned, cache misses are an interaction of the references and the replacement. You have to take care of both before you know what the cache misses are going to be.
What the invariant says is that this expression here is approximately equal to 1. Once you have that, then you just have to add Little's Law, standard and robust in queueing theory, and you have the equation.
So the only thing you have to worry about is why is this invariant true. So what this invariant is
actually saying is this: remember the pages come into RAM and they get thrown out and so forth. When a page comes into RAM it's black. When it gets thrown out to disk it's white. When I take the average of the black parts I have the RAM time, T RAM. If I take the average of the white parts I get T disk. It comes into RAM three times, so R is equal to three.
So what the invariant is saying is that it is possible, it's likely, when you have plenty of memory around, that the R value is small. You get paged in only a small number of times. And every time you come in you stay there for a long time. So R is small, and this part is -- this value is small and this value here is large.
That's one possibility. The other possibility is you're under memory pressure. If you're under
memory pressure then you get called in many times. R is large. R is large. And every time you
come in you can only stay there for a short time. That will also fit in the invariant.
What is unlikely is that you come in only a small number of times. Every time you come in you
get immediately thrown out. This is unlikely. And it's also unlikely that you come in many times
and every time you come in you stay for a long time.
This is unlikely. So this is basically what the invariant is saying. And as I mentioned, you add Little's Law to it and then you're there. That's how you get the equation.
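The slide's exact expressions are not in the transcript; a hedged reconstruction consistent with the description above, with lambda denoting the rate at which pages are brought into RAM, would be:

% hedged reconstruction, not the slide's exact formulas
r \cdot \frac{T_{\mathrm{RAM}}}{T_{\mathrm{RAM}} + T_{\mathrm{disk}}} \approx 1
\qquad \text{(references-plus-replacement invariant)}

M \approx \lambda \, T_{\mathrm{RAM}}
\qquad \text{(Little's Law for the pages resident in RAM)}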
Okay. So the equation is supposed to work for everything. So let's apply it to the database buffer. Here we took data from this particular paper, this IBM paper.
They measured the references on a real machine. This is industrial strength kind of machine,
running real workloads. And then they reran the trace on different buffer pool size, and measured
the misses.
So these are real workloads and the equation works for this real commercial workloads.
They actually have 12 traces. And we fit all but maybe one of them. And then we tried it on a web cache. Here what we did was take a trace from one of the peer-to-peer websites in China -- [inaudible] -- but from the experimental point of view, the fit is pretty good.
>>: So is the goal of this equation to sort of describe stuff after the fact or to actually predict
things before you change the disk or you change the memory or something like that? In other
words, is the main goal of the equation to sort of help you explain, okay, here's what my cache
looks like, here's sort of an intuition for why that's happening or is it supposed to be something
where I swap out my disk for a new one and all of a sudden I can sort of predict what the
subsequent cache flow is?
>> Y.C. Tay: No. At this point you can't do that kind of prediction, because you need
parameters. And the parameters need to be calibrated. When you [inaudible] the disk for a new
disk I don't know what the new parameters are for the new disk.
So at this point you know -- but you can do a prediction in a different sense, which I'll come to.
So at this point in the presentation I'm just worried about fitting the data.
Okay. So the goal is to just have one equation to fit everything. But I don't think that's going to
be the case. Right? So this is where the research proposal comes in. I say everybody who does research on cache misses should just dump whatever they are doing and focus their expertise on just deriving one parameterized miss equation for each cache type, and do it in a top down way. But let's postpone that point for the time being.
It seems unlikely that that can actually be achieved. But I've given you some evidence that indicates that for RAM and the database buffer and the proxy, this one equation works, actually. Why did I say it's unlikely that one equation will work for everything? That's why you have to restrict, scale down, and say for each cache type let's come up with one equation.
And the reason, right off, is that it doesn't seem like this will work for set-associative caches. Simple reason.
This particular equation is one-dimensional. There's only one M value. So it's one-dimensional. But set-associative caches are two-dimensional. So it doesn't seem like the same equation would work for a set-associative cache. However, if you change the dimension of the set-associative cache -- let's say you change from direct-mapped to set-associative -- to me I can interpret that as changing the replacement policy. So if you change the replacement policy, the equation is supposed to still work.
So it actually might still work. And it did, except that I have to add a log there. At this point I don't know why adding a log makes it work, but that's what we observed.
So let's show you some data here.
>>: With respect to the set size?
>> Y.C. Tay: We tried it for different set sizes, different line sizes.
>>: So in, like, a two-way associative, a one-way associative, they both behave the same -- the log works for both?
>> Y.C. Tay: Yes.
>>: That's amazing.
>> Y.C. Tay: With set-associative caches, I think by the time you get to eight-way it's more or less stable.
>>: I agree. Eight-way should be kind of like a fully associative cache. But I'm amazed that it's similar to, say, two-way associative.
>> Y.C. Tay: Yeah. Even works with one-way.
>>: That's wow.
>> Y.C. Tay: So we did an eight-way associative L2 cache. This one is the data cache with a SPEC workload. They all use SPEC workloads there. So this is the data. And it has this waterfall behavior, which the equation can't capture. But what the equation can do is roughly match this downward trend. It's not doing very well here, but for the rest of it it's reasonable.
And we tried it on instruction cache as well and it works. So this is where the universality part of it
comes in.
>>: Again, I think someone already asked this but seems like the real question is if I plug those
points into Excel and ask it to draw a curve into them I'll get something that looks a lot like that.
So it seems like there needs to be some way to quantify why this is better than whatever Excel or
[inaudible] gives me by just saying fit a curve to these points.
>> Y.C. Tay: So if I use Excel and plug in these data points, Excel will give me an equation and a fit, and then I will evaluate and compare the two fits and see which has a better fit. Right? How else can you compare --
>>: I agree. I guess I just -- do you have evaluations of that type that you could show us?
>>: So you could do that. You could actually fit the curve in Excel, let's say, and fit it with your equation, and pick some point in the space that you haven't evaluated yet and see what the two curves say the miss rate should be, and see what that delta is.
>> Y.C. Tay: So one particular -- so comparing this with what Excel might do, for example: Excel will not give you an N star value. Excel will come out with probably some kind of power curve, polynomial, whatever.
If it uses a polynomial it will go really bad when you scale up M, for example. It's going to go to infinity or whatever. What Excel or MATLAB would do is fit the particular segment of the -- if your data is within this range, it will fit very well within this range, but when I do some kind of extrapolation, I think MATLAB and Excel will do poorly.
>>: In other words, I'll ask this: can your equation successfully predict at what point there are diminishing returns on memory? Your equation has this behavior where it suddenly drops off. Can it predict where that point is based on -- if you haven't actually collected data past that point, can it predict that?
>> Y.C. Tay: I'll show you that prediction. You had a question? A point?
>>: Basically self- --
>> Y.C. Tay: Okay. All right. I've explained what I mean by parameterized. I've explained what I mean by cache type. The only thing I haven't explained is the top down modeling part of it. What do I mean by top down modeling? Okay. So let's look at the garbage-collected heap.
The difficulty with garbage-collected applications is that the garbage collector, or whatever is responsible, changes the heap size: if you change the memory allocation, then the heap size may also change, if you have automatic heap sizing somewhere in there.
So if you change the heap size, then the reference pattern is going to change. And this is really difficult for bottom up analysis.
In bottom up analysis, you start off with a particular model of the reference pattern. And then you assume that this particular reference model doesn't change when you change the memory size.
And that's just not so in the case of garbage collected applications. So how does the equation
work for garbage collected applications like this one?
So here I introduce another variable, which is the heap size, and the workload is a generational version of mark-sweep with one of the -- now it's more or less standard -- DaCapo benchmarks, PMD.
And for a heap size of 14 megabytes, these are the data points and this is the fit. It fits pretty well here, not so well here. But to some extent it's a question of what you want to capture. If this is what concerns you, I'm sure we can tweak the fit to do a better fit here.
>>: Isn't that feature the different shapes of the equation, just different standard?
>> Y.C. Tay: It is. So for the purpose of this talk I'm interested in a general fit of the rest of the
curve. So for 100 megabytes, again, it's a reasonable fit. And you can complain again about this
part of it. But that's not the emphasis for this particular fit I'm demonstrating here.
So notice if I change the heap size, the shape of the curve is very different, but the equation can
deliver on that part of it. So it seems like the same equation works for this.
Except that this is the equation. The critical parameter, or the critical variable, is the heap size, and the heap size is nowhere in the equation. So that's not good.
So what we did was we looked at the parameters that we got -- look at the M star value, for example -- and we plotted the M star value against the heap size. And we see that it's actually more or less on a straight line.
So that says that M star is actually linear in the heap size. And this already gives you some hint on heap sizing. If I give you a particular memory allocation, what should the heap size be? Intuitively, the bigger the heap, the better, because the bigger the heap, the less garbage collection you need to do.
So the bigger the heap, the better. But you don't want the heap to be so big that it starts paging and you start taking page faults. Page faults are very expensive.
So for a given memory size, a given memory allocation, you should just increase your heap size until the M star value equals your memory allocation. At that point you don't increase anymore. If you increase any further, then you're going to start getting page faults. So that gives you one rule for doing heap sizing.
We also looked at the N naught value. The N naught value, when plotted against the heap size, decreases and then stabilizes. Now, why would it do something like this? So remember that N naught is this parameter with the kitchen sink inside. If you do dynamic allocation, the N naught value is positive.
So think about it, right? When your heap size decreases, what happens? You do more and more garbage collection.
So when you do more and more garbage collection, you're allocating stuff, putting stuff back on the free lists, allocating stuff, taking stuff off the free lists. You do more and more of that, and that's why the N naught value goes up as your heap size decreases.
So this particular curve gives you another rule about heap sizing. You want to increase your heap size, but at some point your heap is big enough to accommodate the footprint, and there's no point increasing the heap size further than that, because if you do, then your garbage collector is going to walk over some idle memory.
So this gives you a rule about what your maximum heap size should be. So with these two curves you can already see how we can formulate a heap sizing policy using these two parameters.
But, anyway, now that I've got these two equations, I can plug them into that equation, and now this is my new equation for garbage-collected applications. Now we have the heap size explicitly in the equation.
And this is what I mean by top down. I start off with an equation I know works, and then, if I'm interested in some particular variable that's not explicitly in the equation, I take the parameters and I refine them. When I refine them and put them back in, I have the variable explicitly in the equation.
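A minimal sketch of that refinement step, building on the placeholder model from the earlier fitting sketch; the sub-models for M*(h) and N0(h), and their coefficient names, are illustrative assumptions:

def refined_miss_model(M, h, M0, Nstar, coef):
    # Top down refinement sketch: the parameters that were observed to vary
    # with heap size h are replaced by fitted sub-models, so h now appears
    # explicitly in the miss equation.
    a, b, c, d = coef
    Mstar = a * h + b            # M*(h): roughly linear in the heap size
    N0 = c + d / h               # N0(h): decreases, then levels off
    return miss_model(M, M0, Mstar, N0, Nstar)   # placeholder model from above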
And, you know, this might not be it. You might be interested in how different variations, different garbage collectors, affect the equation. In that case you analyze the garbage collection -- how these four parameters depend on the garbage collector -- and thereby refine the equation some more.
This is what I mean by top down. So let me zip through the applications now. So: how to power down idle memory. Let me just use this example. So M star is here. Any memory beyond M star is just wasting energy; you can power that down. So here is an example where I have three sets of data, and you fit the three sets of data with the equation, and once you do the fit, you have the M star value. With the M star value, now you can power down the extra amount of memory.
So, by the way, that was referring to the earlier question -- this is an example of where you can actually do the prediction of where M star is without any data point on this part of the curve.
>>: So there's validation that that was a correct prediction?
>> Y.C. Tay: Validation that that was a correct prediction? This one is not here. But I think in the
paper itself you can see how accurate or inaccurate these values are. Not in this particular slide.
Okay. Memory partitioning. So let's say you're going to partition memory into three pieces. So you have three sets of data points. You fit them. Once you fit them, you have these three equations. You can take these three equations and completely map it out. So this is buffer pool one, this is buffer pool two. And the third one is just the difference from the total buffer pool size.
And then with the equations I can completely map the surface. And once you know the surface completely, you can do any kind of optimization that you want. Pick any optimization criterion that you are interested in and do your optimization.
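As a sketch of that optimization step (the parameter values are invented, and miss_model is the placeholder from the earlier fitting sketch): once each buffer pool has its own fitted equation, choosing the partition becomes a small constrained minimization.

from scipy.optimize import minimize

pool_params = [                      # fitted (M0, M*, N0, N*) per buffer pool
    (20, 300, 2e3, 1e4),
    (10, 150, 1e3, 5e3),
    (30, 500, 3e3, 2e4),
]
TOTAL = 600.0                        # total memory to be partitioned (MB)

def total_misses(split):
    m1, m2 = split
    m3 = TOTAL - m1 - m2             # third pool gets the remainder
    return sum(miss_model(m, *p) for m, p in zip((m1, m2, m3), pool_params))

result = minimize(total_misses, x0=[200.0, 100.0],
                  bounds=[(25, 300), (15, 250)])
print(result.x, TOTAL - result.x.sum())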
TPC workload. So as I mentioned, you want the equation to help you do dynamic adjustment of
your system. So here's an example. So what we did was change the number of terminals. 50,
40, 30, et cetera. Simulating people coming in and going out.
So the workload is dynamically changing, and we want to hold the miss rate constant at 0.2. So if you ignore this part of it, you can see the miss rate goes in a cycle. Now, the question is whether you can fit those data points.
So one way to do this is gradient descent, the standard way to do it. If you do it by gradient descent, without using the equation that I propose, then what you get is something like this, which can swing violently, because gradient descent is very sensitive to fluctuations.
So instead of using raw gradient descent, what you do is take the data points and smooth out the data points. How do you smooth out data points? You use the equation to fit the data points first. After you use the equation to fit the data points, then you apply gradient descent with the equation. That removes the fluctuations and releases you from the tyranny of the randomness.
If you do that, then the convergence is faster. But if you're going to take the data points and then fit them with the equation, then why bother doing gradient descent? Just use the equation to predict what the memory size should be to get the 0.2. If you do that, then you have even faster convergence.
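A minimal sketch of that last idea, reusing the placeholder miss_model from the earlier fitting sketch; the parameter values and the bracket [lo, hi] are illustrative:

from scipy.optimize import brentq

def size_for_target(target_misses, params, lo, hi):
    # After smoothing the noisy measurements by fitting the equation, solve
    # directly for the memory size whose predicted misses hit the target,
    # instead of stepping toward it with gradient descent.  The model is
    # non-increasing in M, so a bracketed root search works if the target
    # lies between miss_model(hi) and miss_model(lo).
    return brentq(lambda M: miss_model(M, *params) - target_misses, lo, hi)

# e.g. memory needed to bring misses down to a target level:
# M_needed = size_for_target(5e4, (20, 300, 2e3, 1e4), lo=25, hi=295)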
So how do you use the equation to do this kind of dynamic adjustment? How do you free up memory when you have these dynamic adjustments? Here's an example with memory reclamation, and you're going to do it in a fair way. Imagine you have memory partitioned into three workloads. Initially it is unfair.
One of the workloads, buffer pool two, always has a smaller miss ratio than the other two. The gap between the three of them is big. So what I want to do is reduce the gap, and I don't want the same guy to win all the time. So that's my idea of fairness.
>>: Sorry. Just slightly out of scope. But why is it fair if everybody has the same miss rate? It
could be that some inherently have a high miss rate so if you push it out.
>> Y.C. Tay: No, this is a ratio.
>>: What is P star?
>> Y.C. Tay: P star is the cold-miss rate.
>>: Sorry?
>> Y.C. Tay: The cold-miss rate.
What I want to do is free up this amount of memory. So I want to take some memory from each
one of them. And I want to do it in a fair way. So once you have the equation, you can just do your calculus on the equation, and you see that what you get out of it is some prorating, but the prorating is not obvious. You have to look at the M star value, the ideal allocation, and the current allocation.
If you do that, then after the reclamation the gaps are smaller, and not one of them is always the winner.
Database buffer tuner. Database tuning is complicated nowadays, and the three major vendors have put out software to help people do this. And in all three of these pieces of software, if you look into it, somewhere inside there's actually a simulator. And what the simulator does is predict the number of I/Os for a bigger or smaller buffer, or that's what it tries to do.
Doing simulation is good. It has all the engineering details. But simulation is messy -- if the system changes, you have to go in and change the simulator. So it's kind of a pain. That's one issue.
The other issue is predicting for a smaller buffer. Now, in order to do this kind of simulation you need to have a trace.
Now, suppose your system is such that you cannot see inside the buffer -- the buffer is encased in a black box, and you can only see what comes out of the black box. You can only see the misses; you cannot see the hits. If you cannot see the hits, then when I reduce the buffer size, some hits become misses and you never saw them. So how do you actually do the simulation?
>>: Seems like it's a little unfair to say that their problem is that they have to do simulations, because you effectively have to do a simulation, too, to recover what these parameters are. In other words, the equation itself doesn't have any -- because of this N naught kitchen sink, rather, the equation doesn't give you any prediction as to what will happen, even if you make a tiny change
in the software. For example, I don't know how sensitive you are to workload, because you
showed us 25 slides ago that just changing the version of the kernel or changing the version of
the compiler dramatically changes the shape of the curve.
And the equation is not going to give you any visibility into how the curve will change if you make
any of these kind of changes. So it seems like effectively what you're arguing for is also doing a
simulation in order to recover the parameters. You'll have to simulate either way. It's just if you
simulate the way they say they give you the data points, they just don't fit a curve to it. You're
saying you should run a simulation and also fit a curve to them.
>> Y.C. Tay: That's right. That's a fact that -- okay, so forget about what I said about doing the simulation. There's still this other issue about what happens if you have a smaller buffer. If you do a simulation with a smaller buffer, there's nothing you can do if it's within a black box.
And I've been told that situation actually arises. Maybe the one that owns the machine doesn't want you to have access to the traces, to the reference pattern.
Now, for us there's no problem, right? If you want a smaller buffer, then you just extrapolate backwards. I don't need to know the reference pattern. Okay. Gee, I'm on time. Yeah. So this is what I've been telling you about.
Bottom up analysis is just not scalable. And I believe that the way forward is to go for a universal equation, to do it in a top down way -- a universal equation that will work for everything.
And it sounds like it's too ambitious, but I think I've shown you some data to hopefully convince
you that it's actually feasible. And the point I want to emphasize is this -- I think this is the way to
do it. Give up bottom up analysis.
So I'll be happy to answer more questions.
>>: We had a bunch of questions during the talk, but any other questions? What I was curious
about, a related work question. In the beginning you said the bottom up analysis they don't
include any locality in the models.
>> Y.C. Tay: Yes.
>>: That's really the case with the existing?
>> Y.C. Tay: Well, when you assume that the references are independent, you've given up locality. What is locality? Temporal locality is correlation between two references that are close in time. Spatial locality is correlation between two references that are close in space. It's just correlation. The minute you say independent references, you have given up your --
>>: And the existing models all use independent references? Wow. That's amazing.
>> Y.C. Tay: There's some people who model their workload as a graph, for example. Then they
keep some of the, they do keep some of the locality in that model.
>>: There's the model, the stack distance model. Basically, say -- I don't know -- you assume that you have the working set, and you assume kind of how big the stack would be, what's the distance between basically two references to the same page. And I mean people have been using this to somehow -- it's a very simple model, it abstracts it all. But still it captures some notion of temporal locality, right?
>> Y.C. Tay: Yeah, yeah. For that they probably have to assume LRU, though.
>>: Yes, that's assuming LRU.
>> Y.C. Tay: So you have to simplify this or you have to simplify that otherwise it's impossible to
analyze the interaction between the two of them.
>>: I think you had a comment on one of the last slides about the [inaudible] stuff. So the simulation that they do is actually white-box simulation.
>> Y.C. Tay: Yeah, yeah.
>>: They actually put that simulator inside the database engine and it sits right there in the actual cache itself. So it doesn't have the data that's cached, but it has some of the stubs. You can see whether this is a hit or a miss. So they know about both.
>> Y.C. Tay: Yeah, so this particular one, yeah, the SQL Server, the software, Resource Advisor.
>>: That's the black box.
>> Y.C. Tay: They use the trace to do the simulation.
>>: [inaudible].
>> Y.C. Tay: No. When I say black box, I mean what you -- suppose you are the vendor and you
are doing performance analysis for some company and the company doesn't let you look at the
reference pattern, then what do you do?
>>: But what I'm saying is the Oracle and DB2 stuff that's here, it's actually in the database itself.
So it's not like it's external vendor that does the analysis.
>> Y.C. Tay: Then I must have misunderstood. Because this point was brought up to me by
Maryanna. So I must have misunderstood.
>>: This is internal to the database engine.
>> Y.C. Tay: Okay. So probably --
>>: Inside the engine.
>> Y.C. Tay: Clarify that. More questions?
>> JD Douceur: Let's thank our speaker.
[applause]