>> Vic: So it's my pleasure to welcome Shivnath Babu, who is an assistant professor at Duke University. Shivnath got his Ph.D. from Stanford University in 2005, and since then he's been at Duke. He's already been the recipient of three IBM faculty awards and the NSF CAREER award. Also, just at this last SIGMOD, his work on the iTuned system, which he'll talk a little bit about today, won the best demo award. And so with that, Shivnath, we'd like to hear from you today on experiment-driven system management. >> Shivnath Babu: Thanks, Vic. So experiment-driven system management is a technology that we have been working on for quite a while now, around five years. Let me first introduce some of the types of problems that we are trying to target with it. One very common problem that shows up in the database management context is the problem of tuning problem queries. Imagine you have a complex query, maybe from a report-generation workload, and there are some requirements on it. If the query is actually not running fine, a user or a DBA might have to tune it and improve its performance by some factor. It's commonly called a SQL tuning task. So what ends up happening in this context, when you have a specific query that you have to tune, is that the usual course of action for a DBA is to collect some monitoring data. That monitoring data might include, among other things, looking at the plan: what the different operators are, and the number of records the different operators are returning. We actually do a lot of work with PostgreSQL, and if you're familiar with PostgreSQL, you know that if you do an EXPLAIN ANALYZE for a query, it shows information, which can be visualized like this, that gives you the plan and the operators. It shows the estimated cardinality -- the number of records the optimizer thought that operator would return -- and the actual cardinality. After looking at something like this, you might realize: oh, look, this estimate says only [indiscernible] records are going to come, but actually 700 records come, and there might be a hypothesis that replacing that index nested-loop join with a hash join would actually improve performance. Or sometimes, looking at these things, you might decide that changing the statistics or the indexes would improve performance, and you might gather statistics and whatnot. But the main point here is that to achieve such a task -- to really go from a problem in the application to actually getting a fix and putting it in the system -- a lot of runs might happen: runs of plans like this to observe, to collect more information, to validate some of your hypotheses, and finally, once you have found a fix, to validate that your fixed plan actually works before you can put it in the production system. Okay. Another domain: parameters. Many database systems, actually most database systems, have these tuning parameters. Sometimes they are called configuration parameters, sometimes they are called server parameters. These parameters control buffer pool sizes, I/O optimization, the block sizes, how frequently to checkpoint, parallelism, the number of concurrent queries you can run, and lots and lots of parameters that tweak the optimizer's cost model.
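Going back to the EXPLAIN ANALYZE data mentioned a moment ago: here is a minimal sketch of pulling the estimated and actual cardinalities for every operator of a plan out of PostgreSQL. It assumes a connection through psycopg2; the connection string and query are placeholders.

```python
# Minimal sketch: compare the optimizer's estimated row counts against the actual
# row counts for every operator in a plan, using EXPLAIN (ANALYZE, FORMAT JSON).
import json
import psycopg2

def estimated_vs_actual(conn, sql):
    with conn.cursor() as cur:
        cur.execute("EXPLAIN (ANALYZE, FORMAT JSON) " + sql)
        raw = cur.fetchone()[0]
        plan = (json.loads(raw) if isinstance(raw, str) else raw)[0]["Plan"]
    rows = []

    def walk(node):
        rows.append((node["Node Type"], node["Plan Rows"], node["Actual Rows"]))
        for child in node.get("Plans", []):
            walk(child)

    walk(plan)
    return rows  # [(operator, estimated cardinality, actual cardinality), ...]

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=tpch")        # placeholder connection string
    query = "SELECT * FROM r JOIN s USING (k)"    # placeholder query
    for op, est, act in estimated_vs_actual(conn, query):
        gap = max(est, act) / max(1, min(est, act))
        flag = "  <-- large misestimate" if gap > 10 else ""
        print(f"{op:25s} est={est:8d} actual={act:8d}{flag}")
```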
With a system like PostgreSQL -- again, we have done a lot of work with this -- we have found that around 15 to 25 parameters, depending on whether it's a read-only/OLAP setting versus an OLTP setting, can actually have a significant impact on performance. And I think for most database systems, not just PostgreSQL, it can be a very frustrating experience to tune these parameters for performance, and there are no good holistic parameter tuning tools available, for good reason. So here I'm trying to summarize a problem that vexed us for a long time; this is what got us into this whole parameter tuning game. What I'm showing here is a response surface, with two parameter axes. This parameter here, whose value is in megabytes, is the main buffer pool setting in PostgreSQL, shared_buffers. And this one, effective_cache_size, is an advisory parameter: you're telling the PostgreSQL system, look, there is such and such amount of operating system buffer cache. You can actually lie; nobody is really going to check that. So we varied these two parameters -- in fact, we varied many more -- and what is plotted on the vertical axis is the performance for a workload [indiscernible]. This is part of a fairly typical database setting, if you can call it that: a TPC-H database, around one gigabyte, with a one-to-four ratio, and in fact this whole database is running inside a Xen virtual machine. What got us here is that we had been running the system with the parameter settings advised by the community, which say: set your shared_buffers to around one-fourth of the total memory size, and then, giving some memory to the operating system, set this other value to the remaining amount. That puts you right at a point over here where performance is really bad. And this was I/O; the system was I/O bound, and we could see that. Okay. So the natural thing you would think of is to increase the buffer pool size, and that will improve performance. We did that, and performance became worse. At that point we were not seeing this surface; we were just tweaking things and observing performance. That got us to generate the entire response surface, and you can see, once the surface is generated, there are some regions where performance is good and some regions where it's bad; you can see the trends. I'm not going into the details of everything that happened, but what I want to tell you is that the moment you see this, you know that you don't even have to know what these parameters mean; you can set them to values in the good region and get good performance. But real users don't get to deal with problems like this; there's no such map for them. Actually, I just want to make one point: when you're changing parameters, a lot of things can change. The query might get a different plan. Even within the plan, the operator implementations might change. The impact of the buffer pool settings can change. A lot of things can happen, and I'm not going into how we actually diagnosed the cause of this interesting behavior. So how do real users deal with this? They do exactly the sort of thing that we were trying to do: they pick one parameter at a time and try to tweak the system to good performance.
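Here is a sketch of how a response surface like the one above can be mapped out: vary shared_buffers and effective_cache_size over a grid and time the same workload at each point. The apply_settings and run_workload functions are placeholders for site-specific restart and benchmark logic, so this is only a skeleton of the experiment loop.

```python
# A sketch of mapping out a response surface over two PostgreSQL parameters:
# run the same workload at every (shared_buffers, effective_cache_size) grid point
# and record elapsed time. apply_settings() and run_workload() are placeholders.
import itertools
import time

SHARED_BUFFERS = ["64MB", "128MB", "256MB", "512MB", "1GB"]        # hypothetical grid values
EFFECTIVE_CACHE_SIZE = ["128MB", "256MB", "512MB", "1GB", "2GB"]   # hypothetical grid values

def apply_settings(shared_buffers, effective_cache_size):
    # Placeholder: write the values into postgresql.conf and restart the server.
    raise NotImplementedError

def run_workload():
    # Placeholder: run the benchmark queries once.
    raise NotImplementedError

def map_response_surface():
    surface = {}
    for sb, ecs in itertools.product(SHARED_BUFFERS, EFFECTIVE_CACHE_SIZE):
        apply_settings(sb, ecs)            # one experiment = one grid point
        start = time.time()
        run_workload()
        surface[(sb, ecs)] = time.time() - start
    return surface
```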
And this one-parameter-at-a-time approach can be horrible, if you think about it, when there are interactions between the parameters, where the impact of changing one parameter -- say the impact of changing effective_cache_size -- is different for different settings of shared_buffers. Right? Tuning one parameter at a time can actually lead you to really poor performance. So these are two examples of how system management or database management happens in the real world, and this sort of loop is happening: I realize, based on the monitoring information that I have collected about a system, that performance is bad and I have to tune it -- it could be any general management task; I'm going to give you some interesting examples -- and then I figure out how to get more data. I plan some experiments that I have to do; in the context of parameter tuning, that would be running the workload for a specific database configuration. Right? I plan the experiments, I conduct them, and that brings in information -- there are different types of information you can collect -- I process it, and maybe, at the end, I have enough information to achieve the task. Sometimes I don't, and I iterate. This process is very, very common, and the sad news is that it is very manual, very ad hoc. Right? Even things like: where do I run these experiments -- on the production system, or on my test system? What if that is a different setting? So there are a lot of places where people really struggle, and the goal of our work is to automate this process to whatever level it can be automated. So I'm going to give you some examples of the different domains where we have put this to use and the tools we have built. I'm going to spend most of my time today talking about that first one, SQL tuning -- this is the problem I introduced at the beginning -- and a couple of tools that we have built there. The second one is the one I usually used to talk about; in fact, I was at Stanford three months back and I talked about this iTuned tool. I'm not going to bore you by talking about the same thing again. iTuned is one tool that we built, the one that Vic mentioned; we presented a demo of this iTuned system at SIGMOD, and we are doing some more work on it. Next, the same sort of tuning problems appear in the Hadoop MapReduce setting. This is one simple response surface we generated for it; a Hadoop job today has around 190 configuration parameters, and just like in a database system, around 10 to 15 of them can be really crucial for performance -- there are interactions and whatnot. So that's something we're studying and building a tool for. One of the problems that we attacked at the very beginning, and a very common problem that we see, is this whole notion of benchmarking and capacity planning. This data here is for a storage server running the NFS file system, where we are varying the number of NFS [indiscernible] and the actual number of disks in the storage system. You can see that some performance aspect is being measured against these dimensions; this dimension has an effect, and this dimension doesn't.
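That monitor-plan-conduct-process loop can be sketched generically. Everything below is a hypothetical skeleton -- the planner, conductor, and analyzer are placeholders -- and only the control flow mirrors the process just described.

```python
# A hypothetical skeleton of the experiment-driven loop: plan experiments based on
# what has been observed so far, conduct them, process the results, and iterate
# until the goal is met or the budget runs out.
from dataclasses import dataclass, field

@dataclass
class Session:
    goal_met: bool = False
    budget: int = 20                       # e.g., number of experiments allowed
    observations: list = field(default_factory=list)

def plan_experiments(session):
    # Placeholder: choose the next configurations / subplans / tests worth running.
    return []

def conduct(experiment):
    # Placeholder: run one experiment on a designated resource (standby, cloud,
    # test box) and return its measurements.
    return {}

def analyze(session):
    # Placeholder: update models, cardinalities, or surfaces from the observations
    # and decide whether the tuning goal has been met.
    pass

def experiment_driven_loop(session):
    while not session.goal_met and session.budget > 0:
        batch = plan_experiments(session)
        if not batch:                      # nothing left worth trying
            break
        for experiment in batch:
            session.observations.append(conduct(experiment))
            session.budget -= 1
        analyze(session)
    return session
```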
In benchmarking, people are often interested in capturing either some map of how performance will look or which parameters are important. Right? You don't want to spend a lot of time doing this; in fact, to report TPC-H performance numbers, you might spend a lot of time doing something like that. What we realized is that you don't really need the exact surface -- it can take days or even months to generate it. A quick approximate surface can often be very useful for finding the important parameters. So we did some work which we call cutting corners: the idea is that instead of generating the full surface, you can cut corners and get an approximate one that's often good enough, and take one-tenth or an even smaller fraction of the time. Another problem that we have attacked is what I'm calling interaction-aware scheduling. We all know that query optimizers generate plans looking at one query at a time. The moment multiple query plans are running together, who knows what's going to happen? Maybe they actually help each other out; maybe they beat up on the same resources and performance becomes poor. A scheduler that is aware of these interactions can do a really good job, but how does it know about the different interactions that can arise when queries are running in a mix? For that, you need models. Who's going to give you these models? The only way you can actually do that is to take different mixes, run some of them on your system, observe, and build the models; no model built in some other context will just work. Right? So we have attacked that, again, by running mixes, collecting information, and carefully selecting which mixes to run, because the space of possible mixes is huge, so planning the experiments really matters. Another problem we are focusing on now is that data in a database system or in a file system can easily get corrupted today, because people run things on commodity hardware, and that hardware can actually be very flaky. There can be [indiscernible] bugs, network interface bugs; administrators can also make mistakes. So for a lot of reasons, data can get corrupted, and we have this very interesting example -- I'm not going to tell you where it is from. There was a typical application which crashed and couldn't be brought up for a really long time, because what had happened was that corruption had occurred: somewhere, the data for that application had become corrupted. Backups and everything were in place, but the backups had also become corrupted, because nobody was really checking. Right? So the application crashed, they tried to bring it up from the backup, the backup was also corrupt, and literally a lot of data was lost because of things like this. Again, how do we attack something like this? There are data integrity checkers, there are checksums, and the like -- things that you can actually run to detect corruption and give advance warning to administrators.
So that's again an interesting domain we're actively working on, and the same loop applies: you're running tests, you're getting some information about the test results, and then figuring out, for example, that if this snapshot is corrupted then maybe the database could also be corrupted; you bring the results in and use that information. So after building a whole bunch of these things, what we realized is that there's actually a lot of commonality. All of these are separate tools, but there's a lot of commonality, for instance at the level of scheduling experiments, or the whole harness for running experiments, or in the planning algorithms. So we are trying to raise the level of abstraction, up to the level of a language. I don't know how much time I will have, but I'll try to at least give you some of the main ideas. The way I usually do these talks is, after having given the overall vision, I pick one tool that we have built and use that as an end-to-end example to illustrate the different aspects of planning, running the harness, and everything. What I've chosen to do today is this Xplus SQL-tuning-aware optimizer, and at the end, depending on the time I have, I'll talk more about some of our current focus. So this is the same problem I introduced in the beginning: you have a problem query that the admin would like to tune; he needs to improve its performance by some factor. What happens today is that a lot of this is done manually. Wouldn't it be great if the administrator could instead pose the question back to the database system: look, I have this query I have to tune, and I'm not happy with the current plan; I need to improve its performance -- can you generate a better plan for me? This is actually one small part of overall SQL tuning. Sometimes by rewriting the query, sometimes by adding more indexes, or sometimes by tuning these parameters, you can improve performance. You can fix the problem in those ways too, but all of them are invasive in different ways: adding an index can affect the performance of other queries, and it can affect the workload's update ratios and things like that. So we focused on this smaller piece of the problem, and what we were able to show is that if you cannot find such a plan -- with some caveats -- it means that there's no other route than some invasive tuning. And what we have built is a real optimizer. The regular optimizer's goal is: given a query, generate a good plan, with all that entails. Xplus, in addition, supports what we call tuning sessions. In a tuning session, the inputs given to our system are, of course, the query whose performance is not good. If you have the current plan whose performance is not good, you can give that as well; we actually assume it is available in our current prototype. And then there are tuning objectives: we support different tuning objectives, like improving performance by some factor, or: I need a better plan within an hour -- find the best plan you can find in an hour. So we support multiple tuning objectives. And the way Xplus actually works is by carefully choosing some plans, or some subplans of this query, to run.
It runs them, collects information, and iterates; it's basically applying that same experiment-driven loop. Okay. And because it's running experiments, our tools always have this third input: constraints. Tuning is going to impose some overhead somewhere. Right? The admin or the user would like to specify under what constraints that overhead can be incurred. I'll give you an example. One really simple example: if you have a database system with a multiprogramming level of ten, you can use one of those slots to run the tuning subplans. That's just one very, very simple example. Somehow you want to limit the overhead of tuning; this type of tuning is not free, and it's going to pay some overhead somewhere. Maybe it's in dollar cost, in the cloud -- maybe the policy is: look, I have $10 and an hour to spend; you can get resources from the cloud on EC2 and do the tuning there. Right? So there are different types of constraints supported, and a harness that we have built, which I won't go into here. Just to give a quick example so that we are all on the same page, assume this is a subplan that the system chooses to run for the given query: a hash join of two tables. Even before the subplan is run, we have the estimated cardinalities as estimated by the optimizer; these, of course, can be off. At the end of the run, we collect two more pieces of information. One is the actual cardinality -- how many records were actually produced at each point in the plan -- and then something we call the estimated-actual cardinality. The 670 here is: if the query optimizer had known that the scans of tables R and S were going to produce 1,050 and 850 records, respectively, this would have been its estimate, not this one. And if you see that the 670 is pretty close to 700, what that means is that -- remember, for the join you also have to estimate the join selectivity, not just the input cardinalities -- the join selectivity estimate was more or less accurate. So essentially, as of now, our optimizer uses these three pieces of information. We have much more, because when we run this we can observe how much CPU was used, whether I/O was the bottleneck, and so on. That is all information we have, but we have not yet found a way to plug it back into the optimizer in a simple manner. Okay. So how does Xplus actually work once these inputs are given? It starts running these plans, and as it runs them and collects information, it progressively finds better and better plans. What we are showing here is the cost of the best plan found so far, which keeps going down. This is not the cost of the best plan as estimated by the optimizer at each point; that can actually become worse, because the optimizer is mixing accurate cardinalities with estimated cardinalities, and sometimes -- I shouldn't say often -- sometimes that is not a good thing. Right? The user can actually stop this at any point and say, I'm happy with the plan that was found. Or run it to a so-called completion point, at which we are able to provide a sort of strong optimality guarantee. Okay. I'll tell you under what conditions we reach this point.
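Going back to the estimated-actual cardinality for the hash join example: it can be illustrated with a small calculation, under the simplifying assumption that the join estimate is the estimated selectivity times the product of the input sizes (a real optimizer's formula differs). The estimated inputs below are hypothetical, chosen only so that the numbers line up with the example's 1,050 and 850 actual input rows and the roughly 670-versus-700 comparison.

```python
# A small illustration of the estimated-actual cardinality, under a simplifying
# assumption: treat the optimizer's join estimate as (estimated selectivity) times
# the product of the input sizes, then re-apply that selectivity to the observed
# input sizes. The estimated inputs below are hypothetical.
def estimated_actual_join_cardinality(est_left, est_right, est_out,
                                      actual_left, actual_right):
    est_selectivity = est_out / (est_left * est_right)   # optimizer's selectivity guess
    return est_selectivity * actual_left * actual_right  # plug in observed input sizes

est_actual = estimated_actual_join_cardinality(
    est_left=400, est_right=500, est_out=150,            # hypothetical optimizer estimates
    actual_left=1050, actual_right=850)                  # observed input cardinalities
print(round(est_actual))   # ~669: close to the measured 700, so the join selectivity
                           # estimate was fine; the error lies below the join, in its inputs.
```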
This completion point can be reached fairly efficiently, and at that point we can say: look, the best plan I have found so far is indeed the optimal plan for the current database configuration -- and that includes the current physical design and indexes, the current server configuration parameters, and the current hardware resources that have been allocated to the database. We are not changing any of that; we are not playing any games with that. And each and every plan has been costed by the optimizer using accurate cardinality values, not estimates. So you have to be willing to assume -- and at least in DB2 they have some research verifying this, and to some extent we have done this for PostgreSQL too -- that if accurate cardinalities are given, the cost model does a relatively good job, in a relative sense, of finding the best plan or ordering the plans in the right manner, at least the good plans. This is an assumption we definitely want to remove eventually, because sometimes the cost model can be inaccurate, or the parameters that get plugged into the cost model can be off, but that has not been such a big problem so far. The main reason we fix the current configuration is, again, as I said, that we want to be noninvasive, in the sense that we want to suggest a fix that will work with the current setting. There are other pieces of work that deal with changing the setting -- in fact, all the parameter tuning work. Getting the plan right is really the first and most important thing for performance; parameter tuning and all of those things are secondary to some extent. So there has been a whole bunch of interesting work here, and some of the main challenges in this context are which subplans to run, where to actually run them, and the impact of tuning on the production workload. There has been interesting work in this domain, including work done right here at MSR. We have actually separated our work from some of the earlier work on query execution feedback, for some practical reasons. One of the main ways in which we separate our work is that we are very focused on a single query: the DBA or user has given us, this is the query I want you to tune. And that is the sort of context we see in the PostgreSQL community -- often 95 percent of the queries work fine, and there are some of these killer queries; we focus on those. And we have very consciously chosen to separate these so-called tuning sessions from regular optimization, for more or less the same reasons. An important reason is -- I was having this discussion with Rene, who was telling me that even if you bring in this query execution feedback and put it back into the optimizer, the optimizer will now be mixing some accurate cardinalities with the rest still estimated, and there's no guarantee that the resulting plan will actually be better than the previous plan. There are a lot of funny things that can happen with plan errors. So we have chosen to clearly separate these things out: tuning sessions are there, and you use them when you know how to use them and when you really want them. And another thing -- this is some feedback we got when we presented the very first version of this work, more than a year back, at the DBTest workshop in 2009.
From the commercial database perspective, the more changes you propose to what's in there, the less likely it is going to be adopted. So, since we are working on the optimizer, we wanted to make zero changes to the query execution engine. And interestingly, I'm also going to tell you how we get away with making zero changes to the database's own query optimizer; it might sound weird, but there's a way in which we actually do that. So this whole problem, and some of the ideas that we use, especially running plans and using feedback, are not really new. What I'm going to spend a lot of time focusing on is this big challenge of which plans, out of a large space of plans, to actually run to reach a tuning goal. And the single main idea in Xplus is how we treat the overall physical plan space. So imagine this to be the space of physical plans, where P1 through P8 are different physical plans. We try to group these plans together into what I call plan neighborhoods, and the plans in each neighborhood are related in some specific manner; there is some structure there that we want to exploit. The key to this is the notion of the cardinality set of a physical operator, a plan, or a neighborhood. Okay. So let me explain what a cardinality set is. Consider a simple example: a hash join operator that works on two tables, with a filter on one of those tables. The way the query optimizer estimates the cost of an operator like this is that it has some sort of cost formula that takes inputs, and the main inputs are expressions, and especially the cardinality values for those expressions. Let me give you an example. For this hash join in PostgreSQL, to estimate the cost of the hash join the optimizer needs the sizes of its inputs. So in this case the expressions are sigma(R) and S. Right? For these expressions it needs the cardinality values, shown here in blue. It plugs those in along with some other parameters -- like the buffer size, the block size, and so on -- to estimate the cost. Okay. So we define the cardinality set of a physical operator as the set of relational algebra expressions -- expressions, not values; this is a distinction I want to emphasize -- whose cardinalities are needed to cost that operator. For a plan, you lay out the plan, and for each operator there will be a set of expressions; you take the union of all of these expressions and you get the cardinality set of the physical plan. Again, expressions, not values. And once you can associate a set of expressions with a plan, you can say: look, I'm going to cluster this set of plans based on equality of cardinality sets. Okay. I'm going to use that simple definition here. In fact, we have a more complex definition of plan neighborhoods, because we want to minimize the number of neighborhoods; we want each neighborhood to be as large as possible.
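Here is a minimal sketch of the cardinality-set idea and the simple equal-cardinality-set grouping into neighborhoods. The operator structures are toy stand-ins for real plan operators, and expressions are represented as plain strings.

```python
# A minimal sketch of cardinality sets and the simple "equal cardinality sets"
# grouping into neighborhoods. Operators are toy stand-ins, and expressions are
# plain strings such as "sigma(R)" or "join(R,S)" rather than real algebra objects.
from collections import defaultdict

class Operator:
    def __init__(self, name, expressions_needed, children=()):
        self.name = name                                   # e.g. "HashJoin"
        self.expressions_needed = set(expressions_needed)  # expressions whose cardinalities
                                                           # are needed to cost this operator
        self.children = list(children)

def cardinality_set(plan_root):
    """Union of the expressions needed to cost every operator in the plan."""
    exprs = set(plan_root.expressions_needed)
    for child in plan_root.children:
        exprs |= cardinality_set(child)
    return frozenset(exprs)

def group_into_neighborhoods(plans):
    """Cluster plans whose cardinality sets are identical."""
    neighborhoods = defaultdict(list)
    for plan in plans:
        neighborhoods[cardinality_set(plan)].append(plan)
    return neighborhoods

# Toy example: a hash join and a merge join over the same inputs (one with the
# join order flipped) share a cardinality set, so they land in one neighborhood.
scan_r = Operator("SeqScan(sigma(R))", {"sigma(R)"})
scan_s = Operator("SeqScan(S)", {"S"})
p1 = Operator("HashJoin",  {"sigma(R)", "S"}, [scan_r, scan_s])
p2 = Operator("MergeJoin", {"sigma(R)", "S"}, [scan_s, scan_r])
print(len(group_into_neighborhoods([p1, p2])))             # -> 1
```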
But to keep things simple, for this talk our definition is: a neighborhood contains plans, all of which have the same cardinality set -- yes, [indiscernible]? >>: [Indiscernible]. >> Shivnath Babu: All of them, yeah. The objective of this is really to structure the plan space; I still have to talk about how the optimizer makes progress toward a tuning goal, and that's where all these other things come in. Okay. Any other questions? So the property here is: if I have accurate cardinality values to cost one plan in the neighborhood, I can cost all the other plans accurately. That is the relationship, in this simple model. And naturally, our notion of the cardinality set of a neighborhood -- which is trivial here -- is the union of the cardinality sets of all plans in that neighborhood, which again is trivial here because all plans have the same cardinality set. A simple example: two plans that look a little different but in reality have the same cardinality set, at least in PostgreSQL. This plan is a three-way join with some filters and a couple of hash joins. This one has exactly the same tables and the same filters, but one hash join has been replaced by a merge join, and in the other hash join the input order has been flipped. These sorts of changes don't really affect the cardinality set, so a single cardinality set actually corresponds to a whole bunch of different plans. >>: [Indiscernible]. >> Shivnath Babu: And commutation. But it's really a function of the optimizer's cost model. Yeah, that's exactly right. So basically, within the same neighborhood in PostgreSQL you essentially have all possible replacements of operators, including the indexes and access methods they use. That defines the neighborhood. In fact, we use transformations to generate the plans within a neighborhood. >>: [Indiscernible]. >> Shivnath Babu: How does this compare? We are really talking at the level of physical plans, not really the logical one. If you're talking at the level of an expression, then -- yeah, I think that's basically one way to categorize it: a memo group, as I understand it, is really a logical subexpression. Okay. Now, once we have these neighborhoods, we can start to talk more about -- yes? >>: [Indiscernible] the question is [indiscernible] class of equivalence that you're thinking of, but a memo group does not. That was the [indiscernible] -- >> Shivnath Babu: The memo group is actually representing all possible physical plans for that logical expression. >>: But those are just different plans, but [indiscernible] except through this physical [indiscernible], you [indiscernible] expression which is affecting the cardinality [indiscernible] with that property. >> Shivnath Babu: Mm-hmm. >>: [Indiscernible] and that is a stronger notion than a typical memo group [indiscernible]. >> Shivnath Babu: Okay. Well, said that way, essentially the grouping is all plans which can be costed using the same set of cardinalities. >>: What is the difference [indiscernible]? Isn't it just captured by the [indiscernible]? >> Shivnath Babu: No. It could be that -- just to give you an example, assume that instead of the merge join there was an index nested-loop join, right?
To cost an index nested-loop join, at least in PostgreSQL, you need not know the actual number of records on the index probing side; you only need to know, on average, how many records -- >>: [Indiscernible]. >> Shivnath Babu: -- [indiscernible], yeah. So between the physical and the logical, there are -- >>: [Indiscernible]. >> Shivnath Babu: Yes. Yeah. Okay. So why all of these neighborhoods, right? The main reason is that the way Xplus makes its progress as it tries to tune a query can be characterized in terms of how it goes about what we call covering neighborhoods. A neighborhood is effectively covered when accurate cardinality values are available for all expressions in its cardinality set. Okay. So if a neighborhood is covered, then all plans in it can be costed accurately by the cost model. You can think of Xplus's progress like this -- this is the picture I showed you earlier. Imagine the physical plan space has these four neighborhoods. As Xplus runs, the process might end up covering this one first and then this one, and when it runs a plan, multiple neighborhoods might get covered together; at the end, you get the optimality guarantee once all neighborhoods are covered. In the process of achieving this, there are some nice efficiency guarantees that Xplus can make. The first one is that it runs at most one plan to get the accurate cardinality values for a neighborhood. This might seem trivial, because the way I defined a neighborhood was all the plans which have the same cardinality set, but there is a subtle distinction which I'll get into in a moment: in one case we are saying these are the things you need to cost a plan, and in the other case we are saying these are the things we can measure when a plan runs. These are sometimes different, and I'll get to it. Right? But we have ways to ensure that at most one plan is run, and the interesting thing is that it's often not a full plan; it's actually a subplan. Another thing is that the moment a plan is run in a neighborhood and the neighborhood gets covered -- accurate cardinality values have been measured for some expressions -- we have a data structure that enables us to find the minimal set of other neighborhoods whose plans have to be recosted, because you now have an accurate cardinality value for an expression where you previously only had an estimate. Okay, that is just a data structure issue. Then, all the neighborhoods can be covered: the typical thing would be to run one plan per neighborhood to cover all neighborhoods, but because cardinality sets intersect, and by exploiting some other properties, you don't have to run a plan for each neighborhood. In fact you end up running very few plans. Just to give some numbers: for query 9 in TPC-H there are around 36 neighborhoods, and only 7 or 8 subplans need to be run to cover all of them. So you can exploit all of these properties of the setting to really cut down on the number of plans that have to be run. Yes? >>: [Indiscernible].
If I execute these plans, I get the cardinalities from [indiscernible]. >> Shivnath Babu: Definitely. The way I would put it is: that's as if my tuning goal were to measure all cardinalities -- to get to this completion point as quickly as possible. That's not always exactly the thing you would want to do, but maybe the DBA knows ahead of time: I want to figure out how much performance improvement I can get, so that I can decide whether I can avoid all the invasive things. That would be a -- >>: [Indiscernible]. >> Shivnath Babu: Yeah. But that's how it would be. It could be applied that way, yeah. >>: [Indiscernible]. >> Shivnath Babu: And sometimes after running experiments we have [indiscernible] trying to [indiscernible] results, if you want to get to that point. Okay. So essentially, once the space is covered, we get this property. So now, let me take a step back and ask how this works at some level. What are the challenges? What are the sorts of problems that had to be solved? How do you enumerate all these neighborhoods? How do you enumerate plans within a neighborhood? Which plan in a neighborhood do you actually run as a subplan? In terms of progress: at any point in time I have some information and I'm figuring out what to do next -- which neighborhood to go after next, and once I pick a neighborhood, which plan or subplan to run within that neighborhood to get the required information. So I'm going to go through each of these questions in some level of detail. It took one and a half years, maybe even closer to two years, and one complete rewrite to actually get to this point, because a lot of the high-level ideas might seem easy and probably even straightforward, but actually getting them to work, really implementing them inside an optimizer, can take a lot of time. So I'm going to try to stay away from the details, but I'd be really happy to talk about those offline. Okay? So the first problem: how do you enumerate these neighborhoods, and the plans within a neighborhood? Our first attempt tried to work within PostgreSQL, which has a bottom-up, dynamic-programming optimizer -- enumerating the same old plans and trying to impose some clustering on them -- and that just doesn't work well. A transformation-based approach is the better thing in such a setting, because you can carefully select and generate only the plans you can afford to consider. So what we have done is actually build a sort of transformation-based optimizer, if you want to call it that. The key thing is writing these transformations -- these are physical plan transformations -- and being able to characterize them into one of two categories: this transformation applied to a physical plan will generate a new plan in a different neighborhood, and that transformation applied to a plan will generate a plan in the same neighborhood. Transformations that replace physical operators, with some corrections for physical properties, fall into the latter category; changing the join order and those sorts of things fall into the former, and pushing expensive predicates up or down -- which we have not really done, since PostgreSQL doesn't have that -- would also fall into the former.
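The covering bookkeeping described above can be sketched as follows: a neighborhood counts as covered once accurate cardinalities exist for every expression in its cardinality set, and each new measurement identifies the neighborhoods whose plans need recosting. The structures here are simplified stand-ins, not Xplus's own data structures.

```python
# A sketch of the covering bookkeeping: a neighborhood is covered once accurate
# cardinalities exist for every expression in its cardinality set, and recording a
# new measurement reports which neighborhoods reference a newly measured expression
# and therefore need their plans recosted.
class CoverageTracker:
    def __init__(self, neighborhoods):
        # neighborhoods: name -> frozenset of expressions (the cardinality set)
        self.neighborhoods = neighborhoods
        self.accurate = {}                 # expression -> measured cardinality

    def record_measurements(self, measured):
        """measured: expression -> actual cardinality observed from a subplan run."""
        new_exprs = set(measured) - set(self.accurate)
        self.accurate.update(measured)
        # Only neighborhoods touching a newly measured expression need recosting.
        return [n for n, cset in self.neighborhoods.items() if cset & new_exprs]

    def covered(self):
        return [n for n, cset in self.neighborhoods.items()
                if cset <= set(self.accurate)]

tracker = CoverageTracker({
    "N1": frozenset({"sigma(R)", "S", "join(R,S)"}),
    "N2": frozenset({"sigma(R)", "T", "join(R,T)"}),
})
print(tracker.record_measurements({"sigma(R)": 1050, "S": 850}))  # ['N1', 'N2'] need recosting
print(tracker.covered())                                          # [] -- neither fully covered yet
```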
So the other key question is: how do I decide, at any point in time, based on the neighborhoods that have been covered so far, which neighborhood to cover next? Okay. Let me illustrate this decision with an example. Suppose I have covered that green neighborhood N1, and I have these three neighborhoods and I'm trying to decide which one to go after next. And let's say that we have found, based on the current set of available cardinalities -- some estimated and some actual -- that P2, P3, and P4 are the respective least-cost physical plans for each of these neighborhoods, as costed by the optimizer. Again, these costs are uncertain, because the optimizer is mixing accurate and estimated cardinalities. And suppose that, when you look at the level of the cardinality sets, in these two neighborhoods there are two cardinality values missing, and in this one there are three cardinality values missing. So this poses a dilemma: should we go after N2 next, because that is where our current cheapest plan, so to speak, exists -- even though that estimate could be totally wrong -- or after N4, because going after N4 is going to bring in more accurate cardinality values, converting more uncertain values to certain ones? Right? This is the exploration-exploitation problem manifesting itself in SQL tuning. In the exploration-exploitation problem, you have some variables for whose values you only have estimates as of now, and you're trying to make a decision based on that. Extreme exploitation would be saying: I know there is uncertainty, but I'll assume these estimates are all accurate and go with my decision -- in this case, going after the neighborhood with the current cheapest plan. Exploration would be acknowledging the uncertainty and acting to resolve it. It actually took us a long time to converge on some sort of architecture here, and our current best solution uses this experts-based policy, actually a well-known technique in the machine-learning area. Yeah? >>: [Indiscernible]. >> Shivnath Babu: Why don't you -- >>: [Indiscernible]. >> Shivnath Babu: Yeah. >>: [Indiscernible]. >> Shivnath Babu: So that can be done. Right? I think this is where -- when we presented the first version of this work -- regardless, there are some standard techniques that people use when they're going after this sort of tuning. They look at join selectivity values, or the values that come out, and based on that they figure out, oh, this actually seems like a better thing to try next. So there are, in some sense, right -- >>: [Indiscernible]. >> Shivnath Babu: Yeah. >>: [Indiscernible]. >> Shivnath Babu: I'm not actually going to come in and say, look, this first policy we have is the best. Right?
But from a practical standpoint, we probably have to try multiple things, and our first attempt, at least in my mind, was to try to capture the benefit directly, because in the whole previous line of work on iTuned and everything, that's exactly what we had done: trying to quantify benefit based on the uncertainty, coming up with a closed form, and going after that. >>: [Indiscernible]. >> Shivnath Babu: In essence, when you see the solution, something like this is what we have. So this notion of experts -- let me show you this. Two experts you can think of are exactly what I showed you on the previous slide: the pure exploiter always goes after just that neighborhood where the least-cost plan lies, and the pure explorer goes after just that neighborhood where the maximum number of uncertain cardinality values are. You can make them fancier: you can try to get some measure of the uncertainty and weight by it and whatnot, which actually gets pretty hairy; you can take these and implement other versions of these experts. The architecture, partly for that reason, is very extensible. Okay? And there's going to be a selection policy that arbitrates among these experts. So the challenge for, say, SQL Server would be to figure out what a good set of experts would be. And sometimes it could be that, as an optimizer developer, you see cases coming to you for a solution -- a specific type of scenario where the optimizer keeps making a mistake. There's nothing preventing you from writing a new expert that can recognize that scenario and suggest a good strategy for that setting. We converged on this architecture primarily because at that DBTest workshop we were talking to real optimizer people from Sybase and elsewhere, and they sort of led us this way. But I'm not going to say that this is the approach you should take; what I want to sell more is this overall idea of experiment-driven management being applied to solving the problem. And of course you can have other experts which have some mix of exploitation and exploration. Right? Again, I don't have the time to get into details, but these two experts look at the join selectivities, for instance. Remember the estimated-actual cardinality I showed you some time back: looking at a particular operator, after having obtained the accurate cardinality values for its children, you can ask the question, if the optimizer had known the cardinality values of its children, what would its estimate of the cardinality of that operator be? Right? So this lets you take the error into account: you can compare the estimated-actual cardinality with the actual cardinality and get an idea of which joins the optimizer thought were less selective than they actually are, or the other way around. These experts take that information into account to suggest a potentially better join order. And then there's a selection policy: the experts each recommend neighborhoods, and the selection policy decides which expert to go with at any point in time. There are again standard things you can apply; a simple sketch of this experts architecture follows.
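Here is the promised sketch of the experts architecture, under simplifying assumptions: a pure exploiter that targets the neighborhood holding the cheapest current plan, a pure explorer that targets the one with the most unmeasured cardinalities, and a pluggable selection policy (plain round-robin here) arbitrating between them. The Neighborhood fields and numbers are hypothetical.

```python
# A sketch of the experts-based choice of which neighborhood to cover next.
# The Neighborhood structure and values are hypothetical stand-ins.
from dataclasses import dataclass
import itertools

@dataclass
class Neighborhood:
    name: str
    best_plan_cost: float        # cost of its cheapest plan under current cardinalities
    unknown_cardinalities: int   # expressions in its cardinality set still unmeasured
    covered: bool = False

def pure_exploiter(uncovered):
    return min(uncovered, key=lambda n: n.best_plan_cost)

def pure_explorer(uncovered):
    return max(uncovered, key=lambda n: n.unknown_cardinalities)

# Note: the default cycle is created once and shared across calls, which is what
# makes successive calls alternate between the two experts here.
def choose_next(neighborhoods, policy=itertools.cycle([pure_exploiter, pure_explorer])):
    uncovered = [n for n in neighborhoods if not n.covered]
    if not uncovered:
        return None
    expert = next(policy)        # round-robin selection policy
    return expert(uncovered)

nbrs = [Neighborhood("N2", 120.0, 2), Neighborhood("N3", 300.0, 2),
        Neighborhood("N4", 450.0, 3)]
print(choose_next(nbrs).name)    # exploiter's pick: N2 (cheapest current plan)
print(choose_next(nbrs).name)    # explorer's pick: N4 (most unknown cardinalities)
```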
A fancier one would be a reward-based selection policy, which keeps a history of how good the experts were in the past and tries to pick an expert that has been working better. Okay. That brings us to the other problem: once I have decided to go after a particular neighborhood, which plan do I pick in that neighborhood? The way we have cast this problem is: I want to find a subplan to run in that neighborhood which will bring in the unknown cardinality values as efficiently as possible. So imagine this to be a plan in the neighborhood, and look at the set of expressions for which you already have accurate cardinality values; let's say that the only ones missing are S and the join of R and S. In this case, of course, you don't have to run that other part of the full plan. Right? So there's essentially a subplan identification step which will bring in the values you need. And the other issue that I mentioned: imagine the cheapest plan in this neighborhood actually uses an index nested-loop join. An index nested-loop join is not going to give you the cardinality of S, because it's not going to scan the inner input; it's just going to probe it. So there are some tricks you have to play, depending on the capabilities of the execution engine -- and we have decided not to make any changes to the actual execution engine, and to do everything from the outside. So what we do is go through each plan in the neighborhood and find the smallest plan, with the required modifications -- you can consider cutting it down to a subplan as a modification too -- that will bring in the cardinality values we need. That's roughly the problem being solved here. Okay. So that brings me to the last part of Xplus, which is the architecture and the implementation. I'm running out of time -- I see I have ten more minutes, until 2:30. Is that -- does somebody have -- >>: [Indiscernible]. >> Shivnath Babu: Okay. So the architecture looks pretty much like what I introduced: something for the experts and the selection policy, a component that keeps track of all of these cardinality values, a component that, once the neighborhood is chosen, picks the plan with the modifications and whatnot, and the plan costing and enumeration. And this part, once a plan has been picked, actually runs that subplan and collects the information. Right? This is tied to the other infrastructure we've built, where you can run experiments on a cloud, or on a test system, or on the standby system; it's a separate entity. Essentially, think of this part as figuring out which plans to run, and this other part as actually running them, scheduling them where possible and giving the information back. The controller is the thing that drives everything, and we have implemented a number of different controllers. So this is the architecture. We have implemented it for PostgreSQL, and the hope is to release the code. Unfortunately, we did it for the 8.3.4 version of PostgreSQL, and they released version 9 recently.
And they've actually changed things -- there was this interface that we had implemented through -- well, I think it will become clear when I show you something else. We wanted to make it usable for PostgreSQL; that is the main system we've been focusing on because it's open source. But I would definitely like feedback on whether something like this would be interesting or useful in the case of SQL Server. One important piece of feedback we got posed a challenge to us: the folks at DBTest told us, look, if you're going to propose this architecture for an optimizer, nobody is going to take it; it looks horribly complex. Right? But if you arrange to sell it as a tuning tool which lives alongside the current optimizer -- maybe you start your life as a tuning tool and then slowly get incorporated into the optimizer -- that we are willing to take. So that actually shaped the architecture. What we have now is essentially this optimizer living as its own application: a tuning tool that we can use, with a specific, well-defined interface. Essentially, the tool needs the database to give it plans and to cost those plans, maybe with some accurate cardinality values the tool supplies, and also to execute a given plan. Implementing this in Postgres required some changes to the planner interface, and in version 9 they have sort of taken that part out. So we now have to implement this interface anew for version 9, and we have not done that yet. Yes, Lee? >>: Was this [indiscernible]. >> Shivnath Babu: We have done that. Essentially, if you think about it, in the regular optimizer that functionality already exists: taking cardinality values and estimating costs for plans, right? All we have done is take this internal API and push it out so you can call it from a client. That's why I said zero changes to the PostgreSQL optimizer, right? Even though we implement a new optimizer. Two other things we have focused on a lot: one is extensibility -- the system is extensible at the level of the controller, and we have implemented multiple controllers, including the experts controller I described, running both in a serial mode and, since experiments can run at any point in time, in a parallel mode as well. There are two main pieces of related work: the IBM LEO work and the Oracle Automatic Tuning Optimizer work. The reason we didn't implement LEO-style feedback is that we are not willing at this point to make changes to the execution engine. But if a system already provides that, I think we can actually combine the two and do much better, because you get more information when you run a subplan or a plan. The other thing is efficiency, because in this whole experiment-driven line of work the reason our papers used to get shot down is that reviewers would say, look, it takes a long time. Mentally, it seems to me that a lot of people have this expectation: I want tuning, and I want the recommendation to come very fast. That's not how things happen in practice, because administrators spend weeks tuning these systems. But at the same time, we took that as a challenge to always implement these so-called efficiency features in our tools: parallelism, subplans, and a host of other things, right?
We have a big laundry list where we have spent a large engineering effort to get good efficiency. So how do we go about evaluating all of this? Well, evaluating tuning turns out not to be an easy task, so we have developed a benchmark. What we did is talk to a whole bunch of people, including people at DBTest, and with our colleagues put together a set of tuning scenarios. The tuning scenarios are defined based on data and on some potential root causes that can arise in practice. There could be a query-level issue: there's a user-defined predicate in there whose selectivity is very hard to estimate by default. There are data-level issues: skew in the data, correlations in the data, statistics that may be stale or missing. And there are physical-design issues, where maybe the optimizer is not picking an index, or is picking a bad one. For each of these, there's also an objective. Right? One objective might be finding the optimal plan -- say the administrator actually wants us to find the best plan because they're trying to figure out, for the current setting, what is the best performance they can achieve without invasive tuning. Another might be: just improve performance by 5X. And there are different aspects of evaluation; let's see some results. Here is what happens for one TPC-H query under one of our tuning scenarios: the plan the PostgreSQL optimizer finds runs in 257 seconds. Running Xplus, Xplus manages to find a plan that runs in 21 seconds. Okay. That's a factor-of-12 improvement, which is not really surprising: Xplus is running subplans and actually seeing things. The main thing that takes the most time for us is running those subplans, and here are some other numbers: for this query it runs only six subplans -- no full plans, only six subplans -- to tune it, and the overall tuning time is 131 seconds. That looks like a big number, but the best way to look at it is the ratio of this number to that one: how many times would that bad plan have to run, compared with how much time it took to actually find the better plan? So that's about half a run of the bad plan. Similarly, now focusing at the workload level: imagine these eight queries are the really bad queries in your workload, and they're taking around 97 minutes overall to run. Xplus, after tuning, is able to cut the workload running time down to 32 minutes, and the tuning takes 24 minutes -- just around one-fourth of the time it takes to run that workload. The other sort of tuning objective is where the administrator wants a specific performance improvement; in this case we're looking for a 5X improvement. Xplus, in the same scenario that we saw earlier, in half the time it takes to run that original plan, is able to produce a 12X speedup. So if I'm looking for a 5X speedup, this definitely produces a satisfactory plan. Okay. We also compare against IBM's LEO and ATO, even though they're not really focused on a single query. The problem with LEO -- and I think this is a general problem that can arise with plan-level feedback -- is that what you can measure is basically determined by what plan you're running. So imagine you run a plan and you get some accurate cardinality values.
You throw them back into the optimizer, and it still picks another plan in that same neighborhood, so it will run that again, and your monitoring opportunities -- at least in a case like this -- come out as zero. So you can actually get stuck. LEO gets stuck here and found a plan with a 3.2X speedup. Right? And ATO actually found one with a 4.2X speedup. ATO is more exhaustive: it runs plans for each single table and counts each [indiscernible]; it's not really scalable at some level. Okay? Another important point, and why we have shown this in red, is that LEO found a plan with a 3.2X speedup, which doesn't meet my requirement here, but I don't know whether there's an even better plan. Right? Xplus can also fail in such cases, because there might not be a plan with a 5X speedup, but in that case -- subject to the assumption that the optimizer's cost model is accurate given accurate cardinalities -- at least you know that you now probably need to invest in invasive tuning if you really want that speedup. There are similar results at the workload level; I'm just going to skip them in the interest of time. >>: [Indiscernible] are you exploiting the [indiscernible]? >> Shivnath Babu: No, not now. But that would be something -- the answer is no, but here is one aspect which might be useful, and it addresses a question that Lee brought up: I've now introduced a whole bunch of different policies inside Xplus, right? You can have these four experts, maybe many more experts or fewer experts -- which policy is actually best? Based on our experiments, in a tuning scenario you can think of two performance metrics. One is what we call time to convergence: I'm looking for a performance improvement, and how much time does it take to give me that improvement. The other is time to completion: give me the optimality guarantee, and think of that as completion. The short answer -- because I do want to use five minutes to talk about some of our recent focus -- is that you do exploitation when you're interested in quick improvements. But if you are interested in getting to that final completion point, exploitation is a bad idea: it literally ends up making small, small, small bits of progress; each run brings in some values that suggest a slightly different plan, and it takes a long time to cover the space. So in those settings you want something with more exploration, and if I know which objective I'm looking for, I can tell you a good policy for it; there are some results which actually show that. Any questions so far? Given that Vic has given me five more minutes, I want to press on to just illustrate some of the things we are thinking about. That part is all good -- we are building a whole bunch of tools, iTuned and this other tool and other things. Okay. So there are three areas that we are focusing on a lot. First of all, we are trying to apply this to more -- I won't say more and more, but some different systems applications, which, beyond databases, we really haven't focused on much. I've been talking to system administrators mostly, not purely database administrators, although they do maintain databases -- at Duke, and a little bit with some data management teams at IBM.
Any questions so far? Given that Vic has given me five more minutes, I want to press on [indiscernible] to illustrate some of the things we are thinking about. The [indiscernible] is all good; we are building a whole bunch of tools, iTuned and these other tools. Okay. So there are three areas we're focusing on a lot. First of all, we're trying to apply this to some different systems and applications, which I guess [indiscernible] DB we really haven't focused on that much. I've been talking to system administrators mostly, not purely database administrators, although they do maintain databases -- both at Duke and, a little bit, with some data-import teams at IBM. And when I actually talk to them, performance is not really the first thing on their mind, for whatever reason. Right? One thing that is really solidly on their mind is corruption: their data getting corrupted. There are many reasons for that, especially with the drive towards running things more on commodity hardware. Commodity hardware is flaky; it can run Hadoop and these other things and give you some numbers, but you really have to be able to trust your data on it. There can be bugs, the hardware is pretty flaky, there are [indiscernible] bugs that can happen, and system administrators might make a mistake that causes data to get corrupted. So we're trying to apply the same methodology to that context: automatically learning which data checkers to run -- [indiscernible] checker, a file-system checker, a database checker and so on -- getting information back from them, and figuring out what other checks to run. Can I take snapshots to run these checks on? That has actually become very easy and very efficient, because storage [indiscernible] has put a whole bunch of focus on collecting snapshots. We're not going to spend too much time on that unless there are questions. The two other things I wanted to focus on: where do you run these experiments -- the so-called harness for running experiments? Suppose this were a production database system, with some clients running, and we have to tune their system; maybe it's multi-core, multi-disk and whatnot. But often administrators will simply not let you do anything there. Run experiments on production? Never. Right? So that got us thinking about where a good place to run these experiments would be. One [indiscernible] is that you can run them on a standby platform. [Indiscernible] standby platforms are kept around because, if the production system -- the primary -- were to die, this guy has to take over. The hardware configuration, at least in some of the scenarios we've seen, not all, is very similar to production, because if the primary were actually gone, this machine has to carry the load. I've seen scenarios where there is a 3x investment in exactly the production hardware, in [indiscernible] and different [indiscernible]; it always looks pretty much the same as production. Most importantly, the data is kept pretty much up to date, and that matters because for experiments we don't want to run on [indiscernible] copies of the data, especially for parameter tuning and for query-level tuning. So this might be a resource, and it actually turns out to be heavily underutilized. Or consider the cloud: imagine your database might some day actually be running on the cloud. Getting these experiments going on the cloud -- taking a snapshot of your [indiscernible], cloning it, and spinning up ten different EC2 nodes to run the experiments -- is actually something which is very easy to do; just a few lines of EC2 API scripting can do that.
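(As a rough illustration of "a few lines of EC2 API scripting": the sketch below is not from the talk, it uses today's boto3 SDK, and the AMI ID, snapshot ID, instance type, and tags are all hypothetical placeholders.)

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Hypothetical IDs: an AMI with the database software installed, and an EBS
# snapshot of the production data that each experiment instance will clone.
DB_AMI = "ami-0123456789abcdef0"
DATA_SNAPSHOT = "snap-0123456789abcdef0"

# Spin up ten identical instances to run tuning experiments in parallel.
resp = ec2.run_instances(
    ImageId=DB_AMI,
    InstanceType="m5.xlarge",
    MinCount=10,
    MaxCount=10,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "purpose", "Value": "tuning-experiment"}],
    }],
)

# Give each instance its own volume cloned from the data snapshot.
for inst in resp["Instances"]:
    vol = ec2.create_volume(
        AvailabilityZone=inst["Placement"]["AvailabilityZone"],
        SnapshotId=DATA_SNAPSHOT,
    )
    # Attaching the volume requires the instance to be running; in practice a
    # waiter such as ec2.get_waiter("instance_running") would handle that.
    print(inst["InstanceId"], vol["VolumeId"])
```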
So what we have done is put together, in some sense -- I like to think of it as both the interface and the runtime system -- something in which the DBA or the user [indiscernible]; there is a language for specifying policies. They can designate resources on which experiments may be run, and, for each of these resources, the conditions under which it can be used for experiments. One of the first things we did was implement a policy like this in the context of a standby system being used for a production DBMS. The policy says: look, I can let you run experiments on my standby under the condition that [indiscernible] utilization on the standby stays below [indiscernible] value. And one important thing you have to guarantee is that, if the primary were to fail, the experiments we're running shouldn't increase the recovery time by more than [indiscernible] value. What I have on this slide is a quick animation of that aspect. I'm already looking at the clock and it's 2:35; I just have this slide and another slide. Should I quickly [indiscernible] go to the summary, or can I take two more minutes? >>: You can finish this. >> Shivnath Babu: Okay. So essentially this is the architecture we have implemented, using some technology on [indiscernible], although virtualization is really what we need. Imagine you have a production system; you're doing log shipping, and the standby is sitting there. What the standby is doing is getting the logs, applying them, and being ready to take over when the primary dies. Right? So what we can do, when our system kicks in and identifies an experiment -- maybe it's a plan, maybe it's a different configuration -- is [indiscernible] do the experiment, in which case we use the zones feature [indiscernible] to do this. You take what we call the home, which is the real use of the system, and cut it down; essentially, as few as 5 to 10 percent of the resources on that machine are enough to take the logs and apply them. That keeps going, so we don't stop the standby from doing its actual function, and we carve the remaining resources into what we call the garage -- the garage container. This is very [indiscernible]. Most importantly, we use the ZFS file system, which gives us [indiscernible] the ability to do copy-on-write, so we don't create an entire copy of the data. This guy runs the experiments, that guy keeps up to date, and only the blocks that change get updated; literally they're sharing all the data. Of course, one thing we have to be very careful about is that this might put a large I/O overhead on the standby that can affect what you see here. Okay. So there are some [indiscernible] issues that have to be taken into account, and that happens when we get the results; there is [indiscernible] monitoring -- [indiscernible] really has low-cost monitoring -- and it in fact strips this noise out.
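(A minimal sketch of how such a standby policy and the copy-on-write "garage" data might be set up, assuming ZFS and entirely hypothetical thresholds, dataset names, and monitoring hooks; this is not the implementation described in the talk.)

```python
import subprocess

# Hypothetical policy thresholds an administrator might specify.
MAX_STANDBY_CPU_UTIL = 0.60   # only experiment while the standby is mostly idle
HOME_CPU_SHARE = 0.10         # keep ~10% of resources for log apply (the "home")

def standby_cpu_util():
    # Placeholder: read current utilization from whatever monitoring is in place.
    return 0.25

def provision_garage(dataset="tank/pgdata", clone="tank/garage_pgdata"):
    """Create a copy-on-write ZFS clone of the standby's data for the experiment
    container, so the experiment and the log-apply process share unchanged blocks."""
    subprocess.run(["zfs", "snapshot", f"{dataset}@experiment"], check=True)
    subprocess.run(["zfs", "clone", f"{dataset}@experiment", clone], check=True)
    return clone

if standby_cpu_util() < MAX_STANDBY_CPU_UTIL:
    clone = provision_garage()
    print(f"garage data ready at {clone}; home keeps {HOME_CPU_SHARE:.0%} of CPU")
else:
    print("standby too busy; defer the experiment")
```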
And once we have this harness on which we can run experiments, the next thought is: why not have a declarative language in which the admin or the user can specify the experiments they want to run -- the types of experiments -- or, more importantly, their objectives? We are putting this together and implementing the planning part. Remember, I showed you the planning, the conducting, and the iteration; that part will be generated by our system automatically. We call this DartX. And we're looking at two [indiscernible] for this. The first is the one I kind of showed you, the benchmarking and tuning aspect, really focusing on mapping out the response [indiscernible] entirely, or finding a good region. The other is the data-corruption domain. Very likely, probably in the next couple of months, we'll have something to announce there, where this whole corruption testing -- the types of tests, and running them automatically on the cloud -- becomes an actual tool [indiscernible]. So that's it. As a quick summary: the need for experiments, and for automating [indiscernible], has always been there. I think now the infrastructure, especially the cloud, is ready to kind of carve out [indiscernible], and that has made some of this possible. We have built a number of tools around this paradigm, and there are interesting [indiscernible] questions to think about -- should, for instance, [indiscernible] the system support [indiscernible] as a first-class citizen, and then [indiscernible] coming and doing this. So thank you. [Applause.] Any other questions? >>: [Indiscernible] cardinality as the [indiscernible]. [Indiscernible] problem, different [indiscernible] are going to [indiscernible] cardinality create similar [indiscernible] if that's their chronic problem, right? If you were to combine the two, is there anything [indiscernible] that you can do to basically [indiscernible] cardinality [indiscernible] problem manifesting through the expressions? >> Shivnath Babu: So my answer is -- for us, this world is sort of black and white between estimated cardinalities and accurate cardinalities. We don't go [indiscernible] deeper, to the level of asking why the optimizer came up with an incorrect cardinality estimate. Was it because of correlation? Was it because of the [indiscernible] assumption? Can we do some sampling to actually get a value? Essentially -- [indiscernible] actually, yes, they have done a whole bunch of work on using sampling, and they're coming up with [indiscernible] and things like that. I've done some work on this myself. So how am I feeling? Great -- great research. But building a tool around that has, for whatever reason, been very hard. So why not just keep it very simple: [indiscernible] black and white, [indiscernible] accurate. I don't even try to say maybe it's [indiscernible] pretty accurate, maybe it was based on a sample of something, right? So that is something we have not gone into. Maybe it would be interesting to look at, but in some sense, that aspect of finding out why the cardinality values were wrong [indiscernible] and incorporating that somehow into the [indiscernible] -- the physical plan space, and what information we can take from a plan once we have run it and give to other plans in terms of estimating their cost better -- that's something which, frankly, I haven't thought about. >>: [Indiscernible] assumption, specifically [indiscernible] as it affects cardinality, you're right. [Indiscernible] used in your [indiscernible], you deduct [indiscernible] discern that properly. >> Shivnath Babu: Do you mean independence, in the sense that we're summing up the cost across different operators rather than the operators running together, and things like that? >>: Right. That's what -- [indiscernible] that would have a different [indiscernible] cardinality. And how would you combine them? That's the question. Do you have any [indiscernible] on that? >> Shivnath Babu: Not really, but maybe next time I'll have something better in mind. >>: Well [indiscernible] -- >>: [Indiscernible]. Any other [indiscernible]? Thanks. [Applause.]