>> Phil Bernstein: Phil Bernstein. It's a pleasure to be introducing Alvin Cheung today. Alvin is nearly done with his Ph.D. at M.I.T., working on problems related to databases, program analysis, and program synthesis. At least part of the work he's describing here was the subject of a paper that got the best paper award at the Conference on Innovative Data Systems Research, CIDR, in January this year. And he's the winner of multiple competitive fellowships: NSF, Intel, and the National Defense Science and Engineering Graduate Fellowship, or something like that, which is actually a very competitive program run by the DOD. I think there's a total of under 200 fellowships awarded in all science and engineering fields. And Alvin will be spending the day talking to a number of you. If you're not on the schedule and you want to be, let me know. Maybe we can double you up with one of the other meetings, or you can talk to him briefly at the end of the day. With that, let me give the microphone to Alvin. >> Alvin Cheung: Great, thanks. So I'm Alvin. I guess by now I'm an nth-year grad student at M.I.T. It's basically a number that I try not to remember. I guess the proper term is they call it a tenured grad student. >>: [indiscernible]. >> Alvin Cheung: I hope the count would basically stop counting. So, thanks for the invitation today. Originally, I was thinking of giving a 30-minute talk on some of the work that we just presented at PLDI this past week, but since Phil was kind enough to give me an hour slot, I decided to change the topic a little bit to talk about a broader piece of work involving multiple people I have the pleasure of working with. So the general setting is that we're basically trying to use various program analysis techniques to speed up applications that use persistent data storage, and this is joint work with a bunch of folks at M.I.T.
and those at Cornell. So to begin, let's first review a little bit about how these applications tend to be written. In the original case, we have two servers, one hosting a database, in this case maybe a SQL database, and another one hosting the application server. These two tend to be different physical machines, although they tend to reside in the same machine room. These applications tend to be written using, first of all, a high-level imperative programming language, a general purpose one, for instance Java or Python. And embedded within these applications will be SQL queries that get executed when the application runs. So at some point in time, people figured out that one way to speed things up is perhaps to carve out part of the application logic into what is known as stored procedures. I'm sure all of you [indiscernible] are familiar with that concept. So there's something wrong with this picture, unfortunately. First of all, the developer needs to commit to either SQL or imperative code at development time, not to mention that he also has to learn what PL/SQL and all that stuff is about. And as you can imagine, getting that wrong can have a drastic performance impact. On the other hand, they also need to worry about how or where the application is going to be executed. As I've shown you, putting things onto the database server as stored procedures means that whatever is in stored procedure representation runs on the database, and everything else runs on the app server. Making these two choices tends to be difficult, at least during development time, because things can change, both in terms of the application logic and also the state of the database, and these decisions tend to be difficult.
So in this respect, we try to use program analysis to help us solve these problems. With that in mind, we started a new project called StatusQuo, funky name, I guess, which is a programming framework with two very simple tenets in mind. First of all, we want to allow developers to express application logic in whatever way they are interested in or comfortable with. And in that regard, we think that it should be the compiler's or the runtime's job to figure out the most efficient implementation of a piece of application logic. So notice that we're not coming up with some new programming paradigm or new programming language and asking people to please convert all your legacy applications into that. On the contrary, we're basically saying you should be able to write whatever you will. Be lazy, just do whatever you want, and then we will basically do all the heavy lifting for you. As a first step to achieve this mission, we have developed two key technologies. First, we have a tool that will automatically infer SQL queries from imperative code. And secondly, we also have a piece of technology that will automatically migrate computation between the two servers for optimal performance. And I'm going to talk a little bit about these two pieces in this talk today. So first, let's look at the first piece of technology here, where we're trying to convert things from imperative code into queries. How does this work? Well, what I've shown you here is a code snippet from a real-world open source application that uses some sort of an ORM, or object-relational mapping library, for persistent data manipulation.
So in this case, what we have here is that this person uses a third-party library, which basically abstracts out all these things about users and roles, and he's basically calling these functions to return a list of users and roles from some place, and he then goes on, in a nested fashion, to figure out the ones that he's interested in, finally returning them at the end, right. Unfortunately, what he didn't know about is the fact that these two functions at the beginning, when executed, are actually issuing SQL queries against a database in order to get back the list of results that he ended up looping across. So what happens during execution is that when the outer loop is encountered, the third-party library, in this case the ORM library, is going to issue a select star query, because that's what this particular code snippet is trying to do. And likewise when the inner loop is encountered. Now, we database people, when we look at this, we basically say, well, this looks very much like a relational join, right? So in order to speed things up, one thing that you might want to do is just convert this entire code snippet into a very simple join query that basically folds this predicate into the query. The benefit is obvious, right? Because when we execute these two select star queries, when the database gets large, the application can slow down substantially, whereas if we write it as a simple query, for instance, in this case we ended up with orders of magnitude performance difference. So our goal is to basically do this conversion from here to here automatically. At this point, you should probably be calling me out and saying there's just no way you can possibly do this. I mean, look, this thing is in a declarative SQL language. This thing is in imperative Java.
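To make the before and after concrete, here is a minimal sketch in Python of both versions. The talk's actual example is Java with an ORM library; the users/roles schema, column names, and function names below are invented for illustration:

```python
import sqlite3

# Hypothetical schema standing in for the ORM-managed tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, role_id INTEGER, name TEXT);
    CREATE TABLE roles (id INTEGER, title TEXT);
    INSERT INTO users VALUES (1, 10, 'ann'), (2, 20, 'bob'), (3, 30, 'cat');
    INSERT INTO roles VALUES (10, 'admin'), (20, 'editor');
""")

# Imperative version: two "select star" queries plus a nested loop in the app.
def get_users_with_roles_imperative():
    users = conn.execute("SELECT * FROM users").fetchall()
    roles = conn.execute("SELECT * FROM roles").fetchall()
    results = []
    for u in users:           # outer loop over one result set
        for r in roles:       # inner loop over the other
            if u[1] == r[0]:  # the join predicate, hidden in app code
                results.append((u[2], r[1]))
    return results

# Declarative version: the predicate folded into a single join query.
def get_users_with_roles_query():
    return conn.execute(
        "SELECT u.name, r.title FROM users u JOIN roles r ON u.role_id = r.id"
    ).fetchall()

print(sorted(get_users_with_roles_imperative()) ==
      sorted(get_users_with_roles_query()))  # prints True
```

The two functions compute the same result, but the second one lets the database do the join, which is where the orders-of-magnitude difference comes from on large tables.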
The difference between these two is pretty dramatic, right, so how can we actually pull this off? Well, just to recap, what we're trying to do here, formally speaking, is to find a variable, in this case results. We call it an output variable, and our goal is to basically rewrite it into a SQL expression. And we are going to do this using a tool that we built called query by synthesis, or QBS for short. So this does three things. First of all, we try to identify the potential code fragments that we can possibly convert into SQL queries. Secondly, for all these output variables, we try to find SQL expressions that we can convert them into. And finally, being good programming languages people, we should try to at least show that whatever SQL expressions we end up finding better preserve the original semantics of the program. And if that works, then we'll go ahead and convert the code, right? So what do I mean by trying to prove that these things preserve semantics? That gets into a little bit of background in terms of program verification. But before I do that, can I get a show of hands, how many of you are really familiar with Hoare logic style program verification? Okay. >>: [indiscernible]. >> Alvin Cheung: Yeah, except that in this case, from the developer's point of view, he's not sure what's really going on in here, right. He doesn't know that that's what the actual code ends up doing in the underlying implementation of the library. So this is for modularity reasons, right. The guy who wrote this code might not be the person who developed these libraries. And the person who wrote these libraries cannot possibly imagine how the result set that he ends up returning will get used.
So you can imagine these guys could be implemented using some [indiscernible], for instance. But then we still need to somehow be able to fold all these things into the query logic itself in order to get the performance. Otherwise, all we end up doing is just fetching everything from the database and doing the nested loop. So that's not so efficient, as you can imagine. So just a little bit of background on program verification. In this case, we use Hoare logic, which basically comes in this form, right. This is a standard way of presenting it. We have some sort of command whose effect we want to prove in terms of state transformation. We have a precondition that we know is true before this command is executed, and then we have some post condition that we're trying to enforce after the command is executed, in terms of changing the program state. So as an example, imagine we have a simple assignment statement, and we want to show that x is less than ten afterwards. We can go backwards and figure out what must be true before this statement is executed in order to allow us to say that x is less than ten. In this case, it better be the case that a is less than ten. A more complicated example: what if we have an if-else statement? Let's say we're again trying to assert the fact that x is less than ten afterwards. What needs to be true before this whole statement gets executed? In this case, it better be the case that if b is true, meaning that we get into the if clause, then a is less than ten, and that if b is not true, meaning that we go to the else branch, then c is less than ten. So there's a very routine, automatic process of coming up with what is known as a verification condition, given a post condition that we're trying to enforce. Okay.
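This backwards computation of preconditions can be sketched executably. Below is a toy model, not the actual verification-condition generator: program states are Python dicts, conditions are predicates on states, and the variable names mirror the slide's examples:

```python
# A program state is a dict mapping variable names to values.
# A condition (pre/post) is a predicate on states.

def wp_assign(var, expr, post):
    # wp(var := expr, Q) is Q with expr substituted for var:
    # evaluate expr in the current state, bind it to var, then check Q.
    return lambda s: post({**s, var: expr(s)})

def wp_if(cond, wp_then, wp_else):
    # wp(if b then S1 else S2, Q) =
    #   (b implies wp(S1, Q)) and (not b implies wp(S2, Q))
    return lambda s: (wp_then(s) if cond(s) else wp_else(s))

# The talk's first example: after "x = a", we want x < 10.
post = lambda s: s["x"] < 10
pre = wp_assign("x", lambda s: s["a"], post)
print(pre({"a": 5}))   # True:  a < 10 holds, so x < 10 will hold
print(pre({"a": 12}))  # False: a < 10 fails

# The if-else example: if b then x = a else x = c, post x < 10.
pre_if = wp_if(lambda s: s["b"],
               wp_assign("x", lambda s: s["a"], post),
               wp_assign("x", lambda s: s["c"], post))
print(pre_if({"b": True, "a": 3, "c": 99}))   # True: the if branch runs
print(pre_if({"b": False, "a": 3, "c": 99}))  # False: c < 10 fails
```

The point of the sketch is just the shape of the rule: preconditions are computed mechanically, backwards from the postcondition.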
So for loops, it's a little bit tricky. Let's say in this case we want to show that sum is equal to ten at the end. We all know, of course, this is true. But what is the kind of verification condition that we need in order to establish that fact from the beginning of the program? So we invoke this magic thing called a loop invariant, which, as we all know, is something that is true during the execution of the loop. In this case, suppose someone has come up with the invariant that sum is always less than or equal to ten during the loop. With that in mind, we need to establish three facts about this program. First, we need to show that this invariant is true when the loop is first entered. In this case, it is, because we know that sum was set to zero, and that, of course, implies it's less than or equal to ten. Second, we need to show that this invariant is preserved, meaning that if we actually go into the loop and go through one iteration, the invariant still remains true. In this case, if sum is less than ten and the invariant holds, it better imply the invariant back after the iteration. This is also trivial, so it checks out. And finally, we want to show that if we ever get out of this loop, then the post condition that we're trying to prove can be established given the invariant. So in this case, we want to show that if sum is greater than or equal to ten, meaning that we break out of the loop, and given the loop invariant, we have sum equal to ten, which is the post condition that we're trying to establish. You can see that this also checks out, because the only possibility is that sum is equal to ten, okay?
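The three proof obligations for the sum example can be spelled out as code. This is a bounded check over a range of integer states, a sanity check rather than a real proof, and the helper name is mine:

```python
# The talk's loop: sum = 0; while sum < 10: sum += 1; show sum == 10.
# Candidate invariant I(sum): sum <= 10.

inv  = lambda s: s <= 10
post = lambda s: s == 10

def obligations_hold():
    # 1. Initiation: the invariant holds on loop entry (sum = 0).
    if not inv(0):
        return False
    for s in range(-5, 20):
        # 2. Preservation: if I(sum) and the guard (sum < 10) hold,
        #    then I holds again after one iteration (sum + 1).
        if inv(s) and s < 10 and not inv(s + 1):
            return False
        # 3. Exit: if I(sum) holds and the guard fails (sum >= 10),
        #    the postcondition sum == 10 must follow.
        if inv(s) and s >= 10 and not post(s):
            return False
    return True

print(obligations_hold())  # True
```

Note how the exit check is where the invariant earns its keep: sum <= 10 combined with sum >= 10 leaves sum == 10 as the only possibility, exactly as in the talk.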
So this is just a very, very short introduction to program verification, but what does that have to do with us, right? If you look back at this example here, what we're trying to establish, as I was saying earlier in terms of these output variables, is the fact that this results variable better be equal to some SQL expression. So this SQL expression is precisely the post condition that we're looking for, except that, as I was showing you earlier, in terms of loops, what we need in order to establish this post condition is some invariants for these loops. In this case, we have two loops, so it [indiscernible] two different invariants. And given all that, we can automatically generate all these verification conditions, like the ones I have just shown you in the earlier slide. Yes, question. >>: Seems like you should also make some [indiscernible] that, like, no other module is [indiscernible] in principle like [indiscernible] just somehow [indiscernible]. >> Alvin Cheung: So in this case, for many of these third-party libraries, they can mark things as transactional, which is what happens in this case. I mean, this whole thing is basically sitting in a transaction. >>: [indiscernible]. >> Alvin Cheung: Yeah, so getting back to this, what we end up doing is now you can see that we can automatically compute these verification conditions that help us establish the fact that results equals some post condition. Except that there's one problem, right. I haven't really told you what all these invariants and post conditions are. These are the things that we're trying to look for. At this point, we don't really know. But as good computer scientists, what do we do, right? We just throw abstraction at the problem and somehow it will go away, right? What happens in this case is we'll just say at this point that these guys are some function calls.
We don't know what the bodies of these function calls are, but we just say, okay, here are some invariant functions that are functions of all the variables currently in scope, and then we'll just leave it at that for now and see what happens later on, right. Yes, question? >>: So how do you find this output variable? I think that's a pretty tough problem, because as soon as you have something like [indiscernible] I imagine that this is nearly impossible. >> Alvin Cheung: So we actually run a bunch of analyses to figure that out, and I have a slide that talks about that, so hold on to your question. I don't want to [indiscernible]. So that's a very good point. How do we actually figure out what are the things that we can possibly translate, right? We'll get to that in a second. Okay. So I was saying that we basically treat these guys as some function calls that we don't know how to express, right? But before I talk about how we actually generate the bodies of these functions, we first need to understand that all these things are actually written in some language that is used to express these kinds of logical expressions, right, because you see there are some [indiscernible] things, implication things. So we need to have some sort of language for us to represent all of these verification conditions. And notice that this is not like Java. It is not like the original program. This is really a logical language. In the previous slides, I showed you the simple examples, which involve things like integers and simple variables, and we can just use a standard logical language for that purpose. But in this case, it involves nasty stuff like queries and [indiscernible] and all that. So we actually need to talk a little bit about what is the language for us to express these kinds of things, right.
So as I was saying, whatever language we end up using, it better allow us to express all these verification conditions for the imperative code, which is the source language in our case, right. And more importantly, we need to be able to handle all the query operations that happen in the code, because the code that we're analyzing has queries embedded within it, right. So whatever language we use, we better be able to express all of the relational operations. And in particular, the order of records actually matters. That might come as a little bit of a surprise. But if you recall the nested loop example that I showed you earlier, there's actually an ordering of all the tuples that come out of the nested loop, and the ordering is basically based on what the database decides to return. But unfortunately, SQL itself does not enforce a particular ordering, unless we put in an order by clause, right. So in order to preserve the semantics of the original program, we somehow need to capture the fact that the stuff that comes out actually has an ordering embedded within it. So whatever language we use to represent these verification conditions, it better be the case that it preserves the ordering. Finally, we also want any post condition expressions that we come up with to be translatable to the query language that we have in mind; in this case, the target is SQL. And it better be the case that we're able to come up with those function bodies that I left as a mystery. Otherwise, we can't establish the facts and do the conversions, right? So what we ended up doing is using something called ordered relations, which is very similar to relational algebra with one caveat: rather than modeling the relations as bags of tuples, we actually model them as ordered lists. So what do I mean, right?
So we basically mean the following. Here are some examples. We say that in this world, a query result set is an ordered list, and a list is constructed either by assigning a query result to a program variable, by appending an expression to a list, or by concatenating multiple lists together. So these are basically standard list operations that most people are familiar with. But besides that, we also included a bunch of database operations that can operate on these lists. For instance, we can do joins and we can do selections on these lists, which return other lists. And this top thing is basically the familiar limit clause that we have in SQL. In this case, it basically returns the top i elements from the list, where i is the result of evaluating whatever this expression e is. >>: So [indiscernible] is concatenation? >> Alvin Cheung: Yes. So I'm kind of overloading this a little bit here. These are actually concatenating lists, versus this is just appending something. As for expressions, they're what you would expect. For instance, we can get an element off a list, we can do some operation on two elements of a list, and along with that, we also define a bunch of aggregate operations that return a scalar result. Okay. So now we have a language where we can express all these. Yes? >>: Why is the order important? Because if you take the original SQL application and run the same application twice, [indiscernible]. >> Alvin Cheung: Because the ordering of the tuples can be arbitrary from the point of view of the result set being fetched from the database. >>: [indiscernible] but even if you run the application twice, [indiscernible] because SQL doesn't give you any [indiscernible] so the [indiscernible] application is fine, but [indiscernible]. >> Alvin Cheung: That's because we cannot really enforce that, right.
In fact, we have seen applications where they do rely on the ordering of the stuff that is returned. >>: Can't you use [indiscernible]? >> Alvin Cheung: In that case, for instance, take that loop example, right. We know that the underlying database can return the tuples in whatever ordering it wants. But then we know that whatever first tuple gets returned will be joined, in an inner loop fashion, with everything else, right. So there's some implicit guarantee as to what the result set will look like if anything comes out of it. Do you see what I'm saying? Because they have explicitly written a nested loop, the first tuple fetched from the outer loop, whatever it is, will be joined against everything in the inner loop, followed by the second tuple, and so on, so forth. So there is some implicit ordering. Now, we don't know that that's really what the author had in mind when he wrote that piece of code. >>: [indiscernible]. >> Alvin Cheung: That's why we ended up doing this. As soon as we [indiscernible], we have to preserve the order. But I agree, for a simple selection, it really doesn't matter. Okay. So now I've left you with this mystery of how we actually generate these post conditions and invariants. And for that purpose, we pull out this magic box called program synthesis. You can basically think of this as a search engine over the space of all possible SQL expressions, right. So how it works is the following. Given a program synthesizer, which is a piece of code or an off-the-shelf tool, we give it a symbolic description of the search space, the search space in this case being the space of all possible SQL expressions that we can generate, and we also give as input a set of constraints that we want the synthesizer to respect. So guess what, right?
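As an aside, before getting into what the synthesizer does with those constraints: the ordered-relations operators described a moment ago can be modeled in a few lines. This is a toy Python model of the theory (the function names are mine), where order-preservation falls out of representing relations as lists rather than bags:

```python
# A tiny model of "ordered relations": relations are Python lists
# (so order is part of the value), and every operator preserves it.

def select(rel, pred):
    # selection: keep tuples satisfying pred, in their original order
    return [t for t in rel if pred(t)]

def join(r, s, pred):
    # join with nested-loop semantics, which pins down the output order
    # exactly as the imperative nested loop in the talk would produce it
    return [(t, u) for t in r for u in s if pred(t, u)]

def top(rel, i):
    # the SQL LIMIT clause: the first i elements of the list
    return rel[:i]

r = [1, 2, 3, 4]
s = [3, 4, 5]
print(select(r, lambda x: x > 2))       # [3, 4]
print(join(r, s, lambda x, y: x == y))  # [(3, 3), (4, 4)]
print(top(r, 2))                        # [1, 2]
```

Because every operator is defined over lists, equations in this theory can talk about the implicit ordering that the original nested-loop code guarantees, which a bag semantics could not.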
So this set of constraints, in this case, is exactly the verification conditions that we have automatically come up with. Given these two things, what the synthesizer will do is some sort of search procedure, an intelligent one, using symbolic manipulation and what is known as counterexample-driven search. What does that mean? It means we try to come up with a candidate expression that can possibly fulfill the requirements we want, meaning that it's able to satisfy all the verification conditions, and we try it out in terms of proving whether it's correct. If that works out, then we're good. If not, then the synthesizer incorporates that as a counterexample, meaning, well, you better not return this same expression again, right. So that's, at a high level, how this counterexample-driven search works. Hopefully, at the end, we'll get an expression that satisfies all the constraints. In this case, going back to this example, we basically want the synthesizer to come up with expressions for both the post condition and also all these invariants, right. And just to give you a sense of how this works, let's say we're talking about the post condition. We give the synthesizer, as a description of the search space, a bunch of expressions, something like this. For instance, we might try to infer that results equals some selection of the users list, some top limit expression, or some more complicated thing, and so on, so forth. And, of course, we don't really write all these out explicitly, because that would take forever, so we actually have a symbolic encoding of this search space, and because of time, I'll not talk about this. But if you're interested, please come talk to me afterwards. So finally, now you understand how we actually do this kind of transformation, right.
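A toy illustration of that counterexample-driven search: here the candidate space is a handful of hand-written "queries" over a list, and the constraint is agreement with a reference spec. The candidate names and the use of concrete test inputs in place of real logical verification conditions are simplifications of mine; the actual QBS tool checks symbolic constraints, not examples:

```python
# Candidates: a tiny explicit search space of query-like expressions.
CANDIDATES = [
    ("top 2",       lambda xs: xs[:2]),
    ("select > 0",  lambda xs: [x for x in xs if x > 0]),
    ("select even", lambda xs: [x for x in xs if x % 2 == 0]),
]

def synthesize(spec, all_inputs):
    counterexamples = []
    for name, cand in CANDIDATES:
        # a candidate must first survive every stored counterexample...
        if any(cand(xs) != spec(xs) for xs in counterexamples):
            continue
        # ...then we "verify" it; a failing input becomes a counterexample
        bad = next((xs for xs in all_inputs if cand(xs) != spec(xs)), None)
        if bad is None:
            return name          # candidate satisfies all constraints
        counterexamples.append(bad)
    return None                  # search space exhausted

spec = lambda xs: [x for x in xs if x > 0]  # the "postcondition" to match
inputs = [[1, -2, 3], [-1, -2], [5, 6, -7]]
print(synthesize(spec, inputs))  # select > 0
```

The key mechanic is the same as in the talk: a rejected candidate contributes a counterexample that prunes the rest of the search, so the same wrong answer is never proposed twice.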
So it's an iterative loop where we try to find these expressions, and if it all works out, then we go ahead and transform the program. But, as was pointed out, we need to talk about how we actually identify these code fragments to begin with. So very simply, we can just... yes, question? >>: It seems to me that you should also somehow model [indiscernible]. >> Alvin Cheung: Yeah, exactly. So I skipped a lot of details about how we actually efficiently encode the space of things and all that. But yeah, that's actually one of the things that we looked at. Okay? So, for one thing, we probably want to start where the code is actually fetching stuff from the database, right, because we don't want to deal with all the parts of the program that do not involve any database operations. So that's the starting point of where we look for these kinds of code fragments, and then we run pointer analysis to figure out where things escape, in terms of being passed on to things where we don't know what's happening. Given these two pieces of information, that's actually sufficient for us to determine where the persistent data, meaning the stuff from the database, actually flows to in the program. And that, in turn, allows us to delimit the start and the end of the code fragments that we try to analyze. So these are actually very standard program analysis techniques; we just have to do a lot of engineering to get there. But to summarize this part of the talk, this is the way the whole tool chain works, right. It takes in the source code of the application, unmodified, and we then try to find all these places that use query results from the database.
And then we'll try to find all these code fragments using the mechanism that I just talked about, and we end up being able to identify a bunch of code fragments within the original application source that we can try to analyze and convert. Given that, it goes through verification condition computation to generate the kind of logical expressions that we try to enforce, and then we do the synthesis step to try to come up with SQL expressions. If that all checks out, then we go ahead and convert the code and put it back into the original application source code. And this thing, as I've shown you in the earlier example, is inter-procedural, meaning that we do look into different method [indiscernible] and all that stuff. So that's the entire tool chain, and now let's talk a little bit about whether it actually works, right. We set up the experiments in the following manner. First of all, we understand that there are no standard benchmarks available for these kinds of transformations. I mean, there are some ORM libraries, and there are a lot of open source applications, but there's no standard benchmark available. So we did our best in terms of finding two large-scale open source web applications for this purpose. These applications were written in Java using [indiscernible] as the ORM library. We want to measure two things to test out our tool. First, we want to know how many code fragments in the application source code we're actually able to identify and convert. And secondly, we also want to measure the benefit that we get in terms of execution run time. Do we actually get any speedup by converting things into SQL, right? So for the first part, here is one application that we looked at.
And I've broken this down into the types of different operations that we were able to identify in the code snippets, and also the number that we were able to convert as a result. You can see that we were able to convert quite a large portion of them, and they do come as a variety of different SQL operations, joins and projections and whatnot. And here's another one. Similar story. So in terms of performance, what actually happens? We measured the effect of conversion on execution run time. On the right here, we have taken one code snippet that we were able to transform. In this case, we transformed it into a simple selection query. The original application just fetches everything from the database, then does filtering in a loop and returns a subset of the result. So in this case, we try it out with ten percent selectivity, and then we scale up the size of the database and measure the time that it takes to load the entire page. Not just running the queries, right. This is fetching stuff from the database and also rendering everything, getting the whole thing from the web server and displaying it on the screen. So this is the original application, the time that it takes, and as I mentioned, it basically fetches everything from the database, right? And here is what happens when we convert it into a single selection query. You can see this is basically an order of magnitude difference, right? It's pretty obvious, because we have basically fetched a subset of the things, as opposed to the original application that fetches everything and then does the filtering inside the application itself. So now we also try with 50 percent selectivity.
So again, this is the original application, and in this case it didn't scale as well, obviously, because the application was fetching more stuff than at 10 percent selectivity. So we wouldn't expect the same kind of performance benefit. But you can still see quite a substantial difference in terms of the time. Now, the more interesting case is actually the nested loop example that I showed you earlier, right. This is the exact code snippet that I just showed you. So here is how the original application performs. I mean, not so hot. As you can see, this is basically scaling up as N squared, because it's doing a nested loop, right. And guess what? We actually ended up scaling linearly, because we were able to convert from a nested loop join into a hash join, because, it turns out, the database already has indexes on these tables. So just by converting it into a SQL query, we magically get this kind of speedup, which is kind of good. And finally, same thing for aggregation, right. In this case, this guy is trying to do a count. So it fetches everything and then basically sums up the number of tuples returned. And this is the original performance. And again, we have another order of magnitude difference, just by converting things into an aggregation and executing that inside the database, right. So just very briefly, I also want to touch on the things that we actually failed to convert, right, because as I've shown you earlier, we were not able to convert everything. So why is that the case? First of all, there are a bunch of cases where they use some sort of custom logic to do a comparison. For instance, they wrote their own comparison operator that compares some string value from the database and then sorts the results that way.
So in that case, since at this point, at least in this initial prototype, we don't handle stored procedures, those kinds of expressions cannot be expressed [indiscernible] SQL. So therefore, we failed to convert that kind of thing. And another set of things uses database schema information. These are the more interesting cases. For instance, in this case, this guy is issuing a query that fetches things from the database, and it turns out that ID is the primary key, right. Then he decides to sort things, and then he does this kind of loop. So look at this carefully. This is really just getting the first ten elements from the table. But unfortunately, we're not able to convert this into the SQL representation, because we didn't know that ID is the primary key and therefore that this is really the equivalent query in this case, right. So, of course, if we incorporate that information into our prototype, we should be able to do better. But that's basically [indiscernible]. Yes? >>: Did you find that the [indiscernible]. >> Alvin Cheung: At least in all the experiments that we've run, we have not encountered that case. And that's because the things that we convert are things like selections or projections or joins. So unless the database implementation actually ends up slower than what the application source code does otherwise. Yeah, that's a good question, actually. Okay. So we just talked about automatically converting things from Java to SQL. But as I mentioned in the first part of the talk, we are still left with the problem of how do we actually execute the application itself, right.
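The first-ten-elements case described above can be sketched as follows. Everything here is hypothetical (the real prototype operates on actual application code, not integer lists), but it shows why the rewrite is only sound given the schema fact that the sort key is a primary key:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class LimitRewrite {
    // Original pattern: fetch everything, sort by id in the application,
    // then loop and stop after N rows.
    static List<Integer> firstNInApp(List<Integer> ids, int n) {
        List<Integer> sorted = new ArrayList<>(ids);
        Collections.sort(sorted);            // ORDER BY id, done in the app
        List<Integer> result = new ArrayList<>();
        for (int id : sorted) {
            if (result.size() >= n) break;   // really just "the first N"
            result.add(id);
        }
        return result;
    }

    // The equivalent query, but only because id is the primary key,
    // i.e. unique and indexed. Without that schema fact the rewrite
    // is not known to be equivalent, which is why the prototype bails out.
    static String toLimitQuery(String table, int n) {
        return "SELECT * FROM " + table + " ORDER BY id LIMIT " + n;
    }
}
```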
So, because as I'm saying, some part of it we can convert into SQL [indiscernible] stored procedures, but how do we make the decision as to when we actually do the conversion, right. Again, the high level goal here is to be flexible in terms of how to execute a piece of code, right. Okay. So for this part of the talk, yes? >>: If you had nested loops in the application program and there's imperative logic in between the nested loops, are there cases where that's going to cause trouble in terms of reverse engineering a query, just because of the expressiveness of the query, if statements and procedure calls and whatnot? >> Alvin Cheung: Yeah, that's a good question. So the classic example that we have seen, for instance, is people putting log statements inside a loop. So somehow you want to print out something like a progress report: I have processed 50 percent. So those kinds of things, unfortunately, you cannot reverse engineer, because we cannot express the print statement. >>: But you're also not going to bring back all the data in the outer loop. >> Alvin Cheung: Exactly. The fortunate thing is we can, indeed, detect those cases and bail out of them. So we are sound in that. Yes. >>: Can you partially convert? So, for instance, you have application logic where you need to return a list based on a table, and you need to filter out the ones where the numbers in a certain field are prime or not. And so you can't really propagate that logic, I'm sure, to SQL. But then can you kind of separate that logic into another filter within the application language, or does it kind of [indiscernible] on things like that? >> Alvin Cheung: Yeah, so for now we've decided to bail out on that particular case as well.
But yeah, I think that's an interesting point, where we could actually try to, but then that also gets to the question of how much we want to convert, even if we can. I mean, we don't have to convert everything. And that's actually part of the second half of the talk. That's a very good lead-in, actually. Okay. So now we want to talk about where we want to execute a piece of application logic, right? So as a running example, I'm going to use this code fragment, which you are all familiar with from TPC-C. So in this case, this guy is trying to do some sort of new order transaction. As you can see here, what happens is that, initially, we execute some sort of query, then we compute the sum of the total amount, and then we do an update into the database, and so on and so forth. So notice that when we actually execute this piece of code, what's going to happen is that the first statement is going to execute on the database, right, because this is a query, okay. But then the second line is going to execute on the app server, because this is a piece of imperative application logic, and so on and so forth for the rest of the program. Notice also, because of this, that we actually get some network communication between two statements, because either the database has to relay the results back to the application server, or the application server has to tell the database, okay, please execute this query, right? And in this particular example, there turn out to be quite a lot of these network round trips between the two servers, and these again turn out to be not so good in terms of performance, right. They slow things down. So in this case, our goal is to try to eliminate some of these round trips between the servers. Of course, we cannot eliminate everything, because after all we do need to communicate with the database in order for the application to proceed, right.
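A toy way to see why statement placement matters: model each statement by the server it runs on and count the switches, since each switch is one network round trip. This model is an illustration only, not part of the Pyxis system itself:

```java
import java.util.List;

public class RoundTripCounter {
    enum Site { APP, DB }

    // Each switch between consecutive statements on different servers
    // costs one network round trip (control, and possibly data, must
    // cross the network at that point).
    static int roundTrips(List<Site> statementSites) {
        int trips = 0;
        for (int i = 1; i < statementSites.size(); i++) {
            if (statementSites.get(i) != statementSites.get(i - 1)) {
                trips++;
            }
        }
        return trips;
    }
}
```

For an interleaving like query(DB), sum(APP), update(DB), query(DB), print(APP) the naive placement pays a round trip at every boundary, while grouping the consecutive DB statements into a stored procedure collapses most of them, which is exactly the optimization discussed next.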
So one standard piece of wisdom in our community in terms of speeding things up is to group things into stored procedures, okay. Push application logic to the server, in addition to queries. So in this case, what we might want to do, for instance, is to notice that this thing prints something onto the screen, so it has to be on the application server itself. But then for everything else, we can try to group it together into a stored procedure and push that over to the database server, and likewise for this, right. Now, just doing this, as our experiments show, actually decreases latency by three times, 3x. So it's obvious that's the case, because we are reducing the number of round trips, right? But notice, as I have pointed out, in this case, this line has to execute on the application server. So what does that mean? That means we need to keep track of the ordering, because in this case, this is actually covered by an if statement. So this line is only executed if the if statement turns out to be true. So in programming languages parlance, that means we have a control dependency. We have a control dependency of these two lines on this particular if condition. These two lines are only executed depending on the outcome of the if statement. So if we want to do this stored procedure approach, we had better, at the end of the stored procedure, return whether this condition is true or not, because otherwise the application does not know whether it should go ahead and print stuff onto the screen or call another stored procedure or database query, for instance, right. And in addition to that, in order to make this work, we also need to keep track of all these data dependencies. Why? Notice that in this case this guy defines a variable that gets used later on, if we decide to go back to the application server.
So that means at the end of this stored procedure, we also need to return the value of this variable and forward it back to the application server, if we decide to go back to the application server, right. Okay. So now you're thinking, well, we just need to keep track of all these things when we write the stored procedures, just be a little bit careful, and we should be fine. Go ahead and deploy this on the database as stored procedures. But then guess what, right? There's also this problem of multi-tenancy, meaning that the same database server can be shared across different applications. And in particular, if it is under heavy utilization, then pushing all this application logic to the database is not actually that beneficial after all. So maybe, after all, we should go back to the original version, where we just incur all these extra network round trips, and that might actually turn out better, right? So imagine doing this kind of thing for millions of lines of code, right? A manual effort is just not going to cut it. So to resolve this problem, what we ended up doing is building a new system called Pyxis. And it does two things in particular. One is that it automatically stored-procedurizes, if you will, database applications and pushes part of the application logic to the database, like what I was showing you earlier. And to target this problem of multi-tenancy, we also adaptively control the amount of application logic that we push, based on the current load on the database server, right. And we are able to do all of this without any programmer intervention, meaning we're able to automatically partition the program, split things up, and then execute different things in different places. So at this point, you're probably thinking, how is this possible? How do we actually do this? Well, let me show you how this works.
So imagine this is the original application, which has some SQL and imperative logic embedded, intertwined. What we do in Pyxis is we first collect some profile data about the application. And given this profile data, we have an automatic partitioner that takes in the profile data and tries to partition the application into two parts. One part is executed on the app server, and another part is executed on the database server as stored procedures. Now, during execution, after deployment, these two partitioned halves of the application are going to periodically jump back and forth in terms of the application control, and we call that a control transfer. And to facilitate this adaptive property, we have a monitoring component on the database server that keeps track of the current load on the database. Now, for instance, in a case where it turns out to be under heavy utilization, our deployment tool will automatically switch to another partitioning. For instance, in this case, we will decide to go back to the original partitioning, where we go back and forth between the two servers instead. So how do we do this kind of source code partitioning, right? We take the original application source code and then we process it from beginning to end. Now, as I mentioned earlier, we first do a profiling of the application, and we also collect some capabilities, in terms of how many cores, the specifications of the two servers that we are targeting, okay. So, for instance, given this piece of code I've shown you earlier from TPC-C, what we would do is collect the number of times that each instruction is executed [indiscernible], and we use that as part of the profile information, because this is going to guide us in doing the program partitioning.
Now, the next step is to do the partitioning, and the way that it works is the following. We take the original program and we create what is known as a program dependence graph. For those of you who might not be familiar with this, this is basically a simple graph where every node corresponds to a program statement in the original program, and it also keeps track of control flow edges and data flow edges. So what do I mean? By control flow edges, we mean that if there's a control dependence between statements, then we insert an edge. So in this case, for instance, we know that this is the order of the program. And then, since this is an if statement, we have control dependencies with these two statements, okay. So we create this graph and we add edges in this manner. Now, each node, as it turns out, also has a weight, which will help us decide on the partitioning, and for the weight, at this point we're basically just using the counts that we collected during profiling of the initial application. Now, we also insert a bunch of data flow edges, describing where data flows from one statement to another. So in this case, for instance, notice that there's this count variable that's defined in the first statement and used in the second one. So we insert a data flow edge that says there's a data dependency between these two statements, and the thing that flows along it is this particular variable. And so on, so forth, for the rest of the program, right. So we just create all these data flow edges. So this is basically part of the process of creating this program dependence graph, okay. We use this graph for partitioning purposes, and in particular we generate a linear program to solve this problem.
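A minimal sketch of the graph being described, with invented statement names and profile counts; the real partitioner builds this structure from the Java source plus the profile data:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ProgramDependenceGraph {
    // One node per program statement; weight = profiled execution count.
    final Map<String, Integer> nodeWeights = new LinkedHashMap<>();
    // Edges stored as "from->to" strings; data edges also record the variable.
    final List<String> controlEdges = new ArrayList<>();
    final List<String> dataEdges = new ArrayList<>();

    void addNode(String stmt, int executionCount) {
        nodeWeights.put(stmt, executionCount);
    }
    void addControlEdge(String from, String to) {
        controlEdges.add(from + "->" + to);
    }
    void addDataEdge(String def, String use, String variable) {
        dataEdges.add(def + "->" + use + " [" + variable + "]");
    }

    // Toy version of a fragment like the one above (all names made up):
    //   s1: count = runQuery(...)      s2: if (count > 0)
    //   s3:   total = compute(...)     s4:   runUpdate(total)
    static ProgramDependenceGraph example() {
        ProgramDependenceGraph pdg = new ProgramDependenceGraph();
        pdg.addNode("s1", 100); pdg.addNode("s2", 100);
        pdg.addNode("s3", 60);  pdg.addNode("s4", 60);
        pdg.addControlEdge("s2", "s3");        // s3 runs only if the test passes
        pdg.addControlEdge("s2", "s4");
        pdg.addDataEdge("s1", "s2", "count");  // count: defined in s1, used in s2
        pdg.addDataEdge("s3", "s4", "total");
        return pdg;
    }
}
```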
In this case, we're trying to minimize this expression here, which is basically the sum of the edge weights, since each of them represents the number of bytes that needs to be transferred should we decide to cut the program at that particular edge. So each edge, both data and control, has an indicator variable, e_i, that is set to one if we decide to cut the program at that point. And this goal is subject to a number of constraints, where we say that e_i is equal to one if we decide to cut the program at that point, and zero otherwise. On top of these, we also have an extra final constraint that describes the load on the database, if you will, in terms of the amount of CPU, memory, and I/O resources that are available, and we want to say that whatever partitioning we come up with had better respect this particular budget constraint, because we don't want to push too much stuff to the database, right. So you can imagine, we can solve this linear program for different values of the budget, which correspond to different amounts of resources available on the database, and that gives us different partitionings of the same program. So you can think of this as generating various versions of the same program, except that they all differ in how much application logic we decide to push from one server to the other. Solving this program gives us a solution where we assign each node, in this case a program statement, to either the application server or the database itself. And the output of this program is represented using something called Pyxil, which sounds like the name of a drug, and I apologize for that. It actually stands for the Pyxis intermediate language. So here's an example. It looks very much like Java. In fact, we can actually program in this language.
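The general shape of that linear program can be written down as follows. This is a reconstruction from the description in the talk, not the exact formulation from the Pyxis paper: let \(x_n \in \{0,1\}\) indicate that statement \(n\) is placed on the database, let \(w_{uv}\) be the bytes transferred if edge \((u,v)\) is cut, \(c_n\) the profiled cost of running \(n\) on the database, and \(B\) the resource budget.

```latex
\begin{aligned}
\text{minimize}\quad & \sum_{(u,v) \in E} w_{uv}\, e_{uv} \\
\text{subject to}\quad & e_{uv} \ge x_u - x_v, \qquad e_{uv} \ge x_v - x_u
  && \forall (u,v) \in E \\
& \sum_{n \in N} c_n\, x_n \le B \\
& x_n \in \{0,1\}, \qquad e_{uv} \ge 0
\end{aligned}
```

The two inequalities force \(e_{uv} = 1\) exactly when the endpoints land on different servers, so the objective totals the bytes crossing the cut; re-solving for several values of \(B\) yields the family of pregenerated partitionings mentioned above.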
The only difference is that every statement is now annotated with either D, meaning that it is intended to be executed on the database, or A, which means it's intended to be executed on the app server. So as I've shown you here, this is just one example. We can obviously generate different partitionings, given the different budget constraints that we have for the database server. Now we still have the problem of how do we actually compile this down to a real program that we can execute. So that's what the compilation and runtime components in Pyxis are all about. So again, we want to somehow compile this program in this funky language into Java, and again, one half of the program is going to be executed on the app server, the other half on the database. And I just want to point out that we didn't change anything in terms of the runtime. In this case, it's a Java program, so we just run it on any normal JVM. We don't require any special tricks or any sort of specialized runtime system in order to deploy this. Meaning that we basically take the program in this Pyxil representation, we compile it to Java, and you can basically run it anywhere, right? So the compilation procedure is pretty much straightforward, except for two issues. One is how do we actually implement this kind of control transfer between the two servers, and the other is how do we sync up the heap, because, as I pointed out, we have all these data dependencies. We somehow need to be able to sync up the two runtimes, if you will. So again, back to this example, right. How do we do control transfer? Well, we simply try to group together statements that have the same annotation. In this case, they're all annotated with database. So we know that these guys need to be executed on the database. And so basically what we do here is we take these statements, compile them into what we call a code fragment, and execute them on the database server itself.
And then we basically keep doing this until we encounter something that we know has to be executed on the app server. At that point, the runtime system on the two servers will automatically go and tell the other guy, okay, I need you to execute this next code fragment, which corresponds to, for instance, code fragment number 110, and so on, so forth, for the rest of the program. So that's easy, practically speaking. Now, for the heap, what do we do? During execution, we actually keep track of all the state changes that each of these code fragments incurs. So, for instance, in the code example that I've shown you, when we first try to execute this particular code fragment on the database server, the runtime will automatically figure out that we need to forward the values of these two variables over from the app server to the DB server, and that's because they are used inside this particular code fragment. So our runtime system automatically figures all that out and also keeps track of anything that gets changed. For instance, if the runtime actually executes this code fragment on the database server, you will notice that we actually define the variable credit. So if the if condition turns out to be true, meaning that we do go back to the app server, we will forward the value that is needed in order to print out the statement onto the screen. And if we decide to stay on, I mean, on the database server, then we don't need to forward anything, because there's no change to the heap. It's basically the same heap. So we do all that inside the runtime, which is basically just another Java program that sits on top of the JVM. There's nothing special about this. And so on, so forth, so we keep executing that way.
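The forwarding decision described here can be sketched as a simple set computation: ship only the variables the departing side has modified that the next fragment will actually read. The method and variable names are invented; the Pyxis runtime derives these read/write sets from analysis of the code fragments, so this is just the core idea:

```java
import java.util.HashSet;
import java.util.Set;

public class HeapSync {
    // On a control transfer, only values that this side has (re)defined
    // AND that the next code fragment reads need to cross the network.
    static Set<String> toForward(Set<String> modifiedSoFar, Set<String> readByNext) {
        Set<String> forward = new HashSet<>(modifiedSoFar);
        forward.retainAll(readByNext);   // intersection: modified AND read next
        return forward;
    }
}
```

In the credit example above, if the database-side fragment modified {credit, total} and the app-side print statement reads {credit, name}, only credit is forwarded; if control stays on the same server, the intersection with an empty modified set is empty and nothing is shipped.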
And again, we have a monitoring component which will basically tune to different loads on the database, right. So if the load on the database turns out to be too much, we'll simply switch to another partitioning that we pregenerated when we solved the linear program, and then we'll just execute as before. Nothing really changes; you can think of this as just running another version of the same program. So now let's look at how we actually do in terms of experimental results, right. In terms of experimental setup, we have a TPC-C implementation in Java. In this case, we have 20 terminals, each running simultaneously, each running new order transactions. Most [indiscernible] of all of them. [indiscernible] setting, and in this case, the database server has 16 cores total. And along with the one that we automatically generated, we also compared against two other implementations. The first one is the original application, if you will, where everything is executed using JDBC. So it basically incurs all these network round trips. And the other one is where we manually push as much as we can over to the database server. So that's what we call our manual implementation. So with all the cores available, in the beginning, where we did this experiment, we measure the standard latency versus throughput curve that we are all familiar with. So the JDBC implementation has kind of a high latency, as you would expect, and it's also not able to sustain very high throughput numbers, because of all these network round trips that are incurred, right. Compare that to the manual implementation, which you can see here has much better latency and is also able to sustain [indiscernible], because we push most application logic to the database, and the database has a lot of cores available to run all these things.
Now, for us, we automatically detect that, in this case, there are plentiful resources available on the database server. So we ended up choosing a partitioning that is very similar to the manual implementation, or the stored-procedurized implementation. We try to push as much stuff as we can over to the database, and this makes sense, right? So basically, we're able to do this automatically, without the programmer going through the whole manual process of creating all these stored procedures and whatnot. So we basically get a 3x latency reduction, and we also get a substantial improvement in throughput. Now, let's take a look at the reverse, right. So in this case, we restrict the number of cores that are available on the database server, and we do the same experiment. So again, the JDBC implementation does kind of all right, with stable latency across the board. Now, with the manual implementation, the interesting thing in this case is that it's not able to sustain high throughput numbers, because at some point it saturated the entire database server. There are no more cores available, so it just ended up choking. It's not able to handle all these extra transactions that are coming in. Now, in our case, we are actually able to detect that there are not enough resources available, so we automatically choose a partitioning that is similar to the JDBC implementation. So that again makes sense, because we basically want to use the JDBC implementation if there are not enough resources available on the database server. So finally, in this experiment, we did the dynamic switching thing, where the X axis is now time. So we fix a particular throughput and we measure the average latency across the board. The only difference is that at some point in the experiment, we decided to restrict the number of cores that are available to the database server.
So initially, it has access to all 16 cores. But at some point into the experiment, we decided to cut some of them off, okay. So JDBC performed as is, and in this case, the manual implementation originally did very well, because it had all the cores available. And at some point after this kind of shutdown, the latency just skyrocketed, because, once again, similar to the case before, there are no more cores available. Now, for our implementation, the interesting thing is we are able to automatically switch between these partitionings. So you notice that in the beginning, we choose a partitioning that is very similar to the manual implementation, where we push things to the database. And at some point after we shut off most of the cores, we decided to switch automatically over to the JDBC implementation. So just for the record here, we also show the percentage of transactions that are served using the JDBC implementation. So you can see that initially, none of them were using that particular version, because there are plenty of resources available. But at some point, we gradually converge to this other mode of operation, if you will, where we execute most things using the JDBC implementation, right. So in summary, in this case, we're basically able to show that we automatically switch to the most efficient implementation based on the current server load on the database. Yes? >>: For the last few minutes, which are 100 percent, why is it not the same as [indiscernible]. >> Alvin Cheung: Oh, that's just because of the way that we measure latency. It's an average, a running average. Okay. So that brings us to the conclusion of this talk, where I've basically shown you this umbrella project where we try to use different program analysis techniques to speed things up, the goal of which is to help application developers write these kinds of database applications.
In particular, we've shown you two things. One is the ability to convert things from imperative code into SQL, and the other is a piece of work that is able to do automatic code partitioning and distribution of the application across the two servers. And we have a website if you're interested in learning more, and at this point I would love to take questions. Thanks. >>: For the second part of the talk, how do object relational mapping systems play into that? Because you're actually generating SQL to run on the server with the [indiscernible]. >> Alvin Cheung: So in the traditional [indiscernible] setting, most of the [indiscernible] logic would be on the app server. So we haven't done the integration yet, but I think it would be very interesting to start pushing part of the ORM code over as stored procedures onto the database and see how that would work out. I think that's a very interesting follow-up work. Because >>: Especially for the updates, because they tend to, you're running a cache and then they have to somehow take a diff of the updated cache and somehow package it up into updates that get sent down to the server. It can be quite [indiscernible]. >> Alvin Cheung: Right. So on one hand, the good news is we already have the machinery for all of that, but the interesting thing is perhaps how we formulate the linear program to also take all that stuff into account. So I think that would be an interesting point for follow-up work. >>: So it seems the first part of the talk went to [indiscernible]. So what you're doing is you're taking imperative code and you're turning it into a SQL query, right. And then the database afterwards turns the SQL query into imperative code. So can you also think of a way where you can actually take a normal application that does not involve communication with the database and try to optimize it based on some [indiscernible].
>> Alvin Cheung: So I guess we probably don't want to push things that the database is not good at doing into the database itself, right? >>: No, I'm just saying forget about the database. You're analyzing a program which does any sort of processing of data, data which does not sit on a database but sits in memory somewhere, and you're just analyzing this program, creating a relational algebra, and using standard [indiscernible] techniques to optimize [indiscernible]. Can you think of a way of using your work to do this kind of transformation? >> Alvin Cheung: You mean outside of the database context? >>: Yeah. >> Alvin Cheung: Exactly. So we're looking at similar techniques for MapReduce programs. So you can think of it as, it would be nice if we could come up with a way to convert arbitrary pieces of code into [indiscernible] and leverage the query engine's abilities. So what database systems are good at doing is this kind of relational optimization, if you will, right. They're very good at choosing the best query plan and so on, so forth. So from that respect, I don't think we want to go in the direction of bypassing all that machinery. I think it's good that databases are good at doing that. So what we're trying to do here is leverage what they're good at doing, right. >>: That's what I'm saying. Use the existing database technology to create a SQL query, and let the database give you more optimized code for processing the same data. But not actually communicating with the database; doing the same thing inside the program that you're running. >> Alvin Cheung: So you mean actually doing the reverse, which is to push some of the database logic back to the app server? >>: Yeah. >> Alvin Cheung: Oh, yeah, I think that would be interesting. Although I think in some sense, having the computation close to the data is a good choice, right.
>>: I'm talking about situations where the data is actually in the application. >> Alvin Cheung: Oh, yeah, I see. Sure, sure. That would actually be interesting. I think that gets into some of the work we're trying to do for MapReduce, for instance, because in that case, they do heavily use memory, at least for temporary storage. So I think the interesting thing in that case would be to model how much data we can expect to be sitting in the cache and be able to optimize that way, as opposed to right now, where we assume that the cache does not exist, or in other words, we assume that most of the data is sitting in the database and is not cached on the app server. But I think that's a good point. Yes? >>: So to specify the question, if you have expressions [indiscernible]. So can you use [indiscernible]. >> Alvin Cheung: Yeah, I think that's a good point. So, in fact, I think the reverse question is also interesting, which is, if we can push things from app servers to the databases, why not also lift some of these things back from the database over to the app server, for cache locality, for instance. >>: So I have an example in mind. For example, I have an application written, and what it is doing is a select star on two tables, and then within the application logic it's trying to do something with those data and [indiscernible]. Now, if we want to convert this to a SQL query, every time you have to write a new query depending on the parameter, right? But in the previous case, it was doing [indiscernible] A and B, which is more efficient in terms of using the database cache. So probably those [indiscernible] cache and it will be very fast, and the further processing is being done in memory, so it will be very efficient. But in your approach, we'd be writing a new SQL query depending on [indiscernible], and it won't be cached in the database.
So in that case, the performance might very well >> Alvin Cheung: Yeah, that's a good case, actually. Thanks, yeah, absolutely. I think that would be an interesting [indiscernible]. It's just that at some point, the linear program starts to get really unruly in terms of being able to solve it. So it's all about how many tricks we can pull in order to formulate a very small program that is easy to solve. Yeah. >> HOST: Thanks again.