>> Phil Bernstein: Phil Bernstein. It's a pleasure to be
introducing Alvin Cheung today. Alvin is nearly done with Ph.D.
at M.I.T., and working on problems related to both databases and
program analysis and program synthesis. At least part of the
work he's describing here was the subject of a paper that got the
best paper award at the conference on innovative database
research CIDR in January this year. And he's the winner of
multiple fellowships, competitive fellowships, NSF, Intel, and
National Defense Science and Engineering, which is actually a
very competitive program run by the DOD. I think there's a total
of under 200 fellowships
that are awarded in all science and engineering fields.
And Alvin will be spending the day talking to a number of you.
If you're not on the schedule and you want to be, let me know.
Maybe we can double you up with one of the other meetings or talk
to him briefly at the end of the day. With that, let me give the
microphone to Alvin.
>> Alvin Cheung: Great, thanks. So I'm Alvin. I guess by now
I'm an nth year grad student at M.I.T. It's basically a number
that I try not to remember. I guess the proper term is what they
call a tenured grad student.
>>: [indiscernible].
>> Alvin Cheung: I hope the count would basically just stop. So
yeah, so thanks for the invitation today. So
actually, originally, I was thinking of giving like a 30 minute
talk on some of the work that we just presented at PLDI this past
week, but since Phil was kind enough
to give me an hour slot, I decided to change the topic a little
bit to talk about a broader piece of work that we've been working
on with multiple people I have the pleasure of being involved
with.
So the general setting is that we're basically trying to, you
know, use various program analysis techniques to speed up
applications that use persistent data storage, and this is joint
work with a bunch of folks at M.I.T. and at Cornell.
So to begin, let's just first review a little bit about like how
these applications tend to be written. So in the original case,
we have two servers, one hosting a database, in this case maybe
SQL, and another one that is hosting the application server. So
these two machines tend to be different physical machines,
although they tend to reside in the same machine room, for that
matter.
So these applications tend to be written using, first of all, a
high level imperative programming language, a general purpose
one, for instance, Java or Python. And embedded within these
applications will be SQL queries that get executed when the
application runs.
So at some point in time, people figured out that one way to speed
things up is perhaps to carve out part of the application logic
into what is known as stored procedures. I'm sure all of you, in
all [indiscernible] are familiar with that concept. So there's
something wrong with this picture, unfortunately. First of all,
the developer needs to commit to either SQL or imperative code
during development time, not to mention that he also has to
learn what PL/SQL and all that stuff is all about. And as you
can imagine, getting that wrong can have a drastic performance
impact.
And on the other hand, they also need to worry about how or where
the application is going to be executed. You know, as I've shown
you, putting things onto the database server as stored procedures
means at least part of the application, that is, the part in
stored procedure representation, will run on the database and
everything else will be on the app server.
So making these two choices tends to be a little bit difficult,
at least during development time, because, you know, things can
change, both in terms of the application logic and also the state
of the database, and these decisions tend to be difficult. So in
this respect, we try to use program analysis to help us solve
these problems.
So with that in mind, we started a new project called StatusQuo,
funky name, I guess, which is a programming framework with two
very simple tenets in mind. So first of all, we want to be able
to allow the developers to express application logic in whatever
way they are interested in or comfortable with. And with that
regard, we think that it should be the compiler or the runtime's
job to figure out the most efficient implementation of a piece of
application logic.
So notice that we're not coming up with some new programming
paradigm or new programming language and asking people to please
convert all your legacy applications into that. On the contrary,
we're basically saying you should be able to write whatever you
want. Be lazy, just do whatever you want, and then we will
basically do all the heavy lifting for you.
So to kind of, as a first step to achieve this mission, we have
developed two key technologies. So first, we have a tool that
will automatically infer SQL queries from imperative
code. And secondly, we also have a piece of technology that will
automatically migrate computation between the two servers for
optimal performance. And I'm going to talk a little bit about
these two pieces in this talk today.
So first of all, let's just look at the first piece of technology
here, which is we're trying to convert things from imperative
code into queries. How does this work? Well, so here, what I've
shown you here is a piece of code snippet from a real world open
source application that uses some sort of an ORM, or object
relational mapping library for persistent data manipulation.
So in this case, what we have here is that this person, you know,
uses a third party library, which basically abstracts out all
these things about users and roles, and he's basically calling
these functions to return a list of users and roles from some
place. He then goes on, in a nested fashion, to figure out the
ones that he's interested in and finally returns them at the end,
right.
So unfortunately, what he didn't know about is the fact that
these two functions, in the beginning, when executed, they're
actually issuing SQL queries against a database in order to get
back the list of results that he ended up looping across. And so
what happens, during execution, is that when the outer loop is
encountered, the third party library; in this case, the ORM
library is going to issue a select star query, because that's
what, you know, this particular code snippet is trying to do.
And likewise, when the inner loop is encountered, now these
database people, when we look at this, we basically say well this
looks very much like a relational join, right? So in order to
speed things up, one thing that you might want to do is just
convert this entire code snippet into a very simple join query
that basically folds this predicate into the query.
So the benefit is obvious, right? Because when we execute these
two select star queries, when the database gets large, the
application can slow down substantially, whereas if we try to
write it as a simple query, for instance, in this case we ended
up having orders of magnitude performance difference. So our
goal in this case is to basically try to do this conversion from
here to here automatically.
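So just to make the pattern concrete, here is a minimal sketch of what's going on, not the actual application code; the table and column names (users, roles, role_id) are made up for illustration, and SQLite stands in for the real database:

```python
# Sketch of the ORM anti-pattern described in the talk: two "select *"
# queries plus a nested loop in the application, versus one join query
# that folds the predicate into SQL. Schema is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT, role_id INTEGER);
    CREATE TABLE roles (id INTEGER, title TEXT);
    INSERT INTO users VALUES (1, 'alice', 10), (2, 'bob', 20);
    INSERT INTO roles VALUES (10, 'admin'), (30, 'guest');
""")

# What the ORM effectively does: fetch both tables in full.
users = conn.execute("SELECT id, name, role_id FROM users").fetchall()
roles = conn.execute("SELECT id, title FROM roles").fetchall()

results = []
for u in users:                 # outer loop: first query's results
    for r in roles:             # inner loop: O(|users| * |roles|) work
        if u[2] == r[0]:        # the join predicate, hidden in app code
            results.append((u[1], r[1]))

# The single join query the tool would like to produce instead.
joined = conn.execute("""
    SELECT u.name, r.title FROM users u JOIN roles r ON u.role_id = r.id
""").fetchall()

# Compare as multisets: SQL itself promises no row order, which is
# exactly the ordering subtlety discussed later in the talk.
assert sorted(results) == sorted(joined)
```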
So at this point, you should probably be, you know, calling me
out and saying there's just no way you can possibly do this. I
mean, look, this thing is in a declarative SQL language. This
thing is in imperative Java. The difference between these two is
pretty dramatic, right, so how can we
actually pull this off?
Well, so, just to recap, what we're trying to do here, formally
speaking, is to find a variable, in this case results. We call
it an output variable, and our goal is to basically try to
rewrite it into a SQL expression. And we are going to
basically do this using a tool that we just built called query by
synthesis or QBS for short.
So, you know, this tool does three things. So first of all, we
try to identify the potential code fragments that we can possibly
convert into SQL queries. And secondly, for all these what I was
referring to as output variables, we try to find SQL expressions
that we can convert them into. And finally, being, you know,
good programming language people, we should try to at least show
that whatever SQL expressions we end up finding preserve the
original semantics of the program. And if that works, then we'll
go ahead and convert the code, right?
So what do I mean by trying to prove, you know, that these things
preserve semantics? This gets into a little bit of background in
terms of program verification. But before I do that, can I get a
show of hands, like how many of you are really familiar with
Hoare logic style program verification? Okay.
>>: [indiscernible].
>> Alvin Cheung: Yeah, except that in this case, from the
producer's point of view, he's not sure what's really going on in
here, right. So he doesn't know what the actual code ends up
doing, or the underlying implementation of the library. So this
is for modularity reasons, right. So the guy who wrote this code
might not be the person who developed these libraries. And, you
know, the person who wrote these libraries cannot possibly
imagine how the result set that he ends up returning will get
used. So you can imagine these guys can be implemented using
some [indiscernible], for instance. But then, in that respect,
we still need to somehow be able to fold all these things into
the query itself in order to get the performance. Otherwise, all
we end up doing is just, you know, fetching everything from where
the data is and doing the nested loop. So that's not so
efficient, as you can imagine.
So just a little bit of background in program verification. In
this case, we use Hoare logic, which basically comes in this
form, right. So this is a standard way of presenting this. So
we have some sort of command that we want to try to prove in
terms of state transformation. So we have a precondition that we
know is true before this command was executed, and then we have
some post condition that we're trying to enforce after the
command is executed in terms of changing program state.
So as an example, imagine we have a statement that is, you know,
a simple assignment, and we want to show that X is less than ten
afterwards. So we can go backwards in terms of figuring out what
must be true before this statement is executed in order to allow
us to say that X is less than ten. So, of course, in this case,
whatever we assign to X better be less than ten.
So, a more complicated example: what if we have an if else
statement, in this case. Let's say we're also trying to enforce,
we are trying to assert, the fact that X is also less than ten
afterwards. So what needs to be true before this whole statement
gets executed? Well, in this case, it better be the case that if
B is true, meaning that we do get into the if clause, A better be
less than ten. Or, if B is not true, meaning that we'll go to
the else branch, C, in this case, better be less than ten. So
there's a very routine, automatic process of coming up with what
is known as a verification condition, given a post condition that
we're trying to enforce.
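The backward reasoning in these two examples can be mechanized. Here is a toy weakest precondition calculator in the same spirit; since the exact statements on the slides are not in the transcript, the assignment x := a + 1 and the branch bodies below are assumptions:

```python
# Toy weakest-precondition calculator. Predicates are Python functions
# over a state dict; a real verifier uses symbolic formulas instead.

def wp_assign(var, expr, post):
    # wp(var := expr, Q) = Q with var replaced by expr
    return lambda s: post({**s, var: expr(s)})

def wp_if(cond, wp_then, wp_else):
    # wp(if b then S1 else S2, Q): if b holds we need wp(S1, Q),
    # otherwise we need wp(S2, Q)
    return lambda s: wp_then(s) if cond(s) else wp_else(s)

post = lambda s: s["x"] < 10          # post condition: x < 10

# Assumed example 1: x := a + 1, so the precondition is a + 1 < 10.
pre = wp_assign("x", lambda s: s["a"] + 1, post)
assert pre({"a": 5, "x": 99})         # 5 + 1 < 10, old x irrelevant
assert not pre({"a": 9, "x": 0})      # 9 + 1 is not < 10

# Assumed example 2: if b then x := a else x := c.
pre2 = wp_if(lambda s: s["b"],
             wp_assign("x", lambda s: s["a"], post),
             wp_assign("x", lambda s: s["c"], post))
assert pre2({"b": True, "a": 3, "c": 100, "x": 0})
assert not pre2({"b": False, "a": 3, "c": 100, "x": 0})
```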
Okay. So for loops, it's a little bit tricky. So let's say in
this case we want to show that somehow, at the end, sum is equal
to ten. We all know, of course, this is true. But what is the
kind of verification condition that we need in order to establish
that fact from the beginning of the program?
So we invoke this magic thing called a loop invariant, which, of
course, as we all know, means something that is true during the
execution of the loop. So in this case, suppose someone has come
up with, you know, the invariant that sum is always less than or
equal to ten during the loop. So with that in mind, we
need to be able to establish three facts about this program.
So first of all, we need to be able to show that this invariant
is true when the loop was first entered. In this case, it is,
because we know that sum was set to zero, and that, of course,
implies it's less than or equal to ten. So we also need to show
that this invariant is preserved, meaning that if we actually go
into the loop and we go through one iteration, the invariant
better be preserved, meaning that it still remains true. So in
this case, if sum is less than ten and the invariant holds, one
iteration better imply the invariant back. In this case, this is
also trivial. So it checks.
And finally, we want to show that if we ever get out of this
loop, then the post condition, whatever it is that we're trying
to prove, we can establish given the invariant. So in this case,
we want to basically show that if sum is greater than or equal to
ten, meaning that we break out of the loop, then given the loop
invariant, we have sum is equal to ten, which is the post
condition that we're trying to establish. So in this case, you
can also see that this kind of also checks out, because the only
possibility is that sum is equal to ten, okay? So this is just a
very tiny, very short introduction to program verification. So
what does that have to do with us, right?
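The three obligations can also be checked mechanically. A small sketch, assuming the loop is sum = 0; while sum < 10: sum += 1 (the slide itself is not in the transcript, so the loop body is an assumption):

```python
# The three Hoare proof obligations for the sum loop, checked by brute
# force over a small range of integer states.

inv  = lambda s: s <= 10              # candidate loop invariant
post = lambda s: s == 10              # post condition to establish

# 1. Initiation: the invariant holds on loop entry (sum == 0).
assert inv(0)

# 2. Preservation: from any state satisfying the invariant and the
#    guard (sum < 10), one iteration re-establishes the invariant.
assert all(inv(s + 1) for s in range(-50, 50) if inv(s) and s < 10)

# 3. Exit: invariant plus negated guard (sum >= 10) implies the post
#    condition; the only possibility left is sum == 10.
assert all(post(s) for s in range(-50, 50) if inv(s) and not s < 10)
```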
So if you look back into this example here, what we're trying to
establish, as I was talking about earlier, in terms of these
output variables, is the fact that this results variable better
be equal
to some SQL expression. So this SQL expression is precisely the
post condition that we're looking for except that, as I was
showing you earlier, in terms of loops what we need to do in
order to establish this post condition is the fact that we need
some invariants for these loops. In this case, we have two loops
so it [indiscernible] two different invariants.
So, given all that, we can automatically generate all these
verification conditions, like the ones I have just shown you in
the earlier slide. Yes, question.
>>: Seems like you should also make some [indiscernible] that,
like, no other module is [indiscernible] in principle like
[indiscernible] just somehow [indiscernible].
>> Alvin Cheung: So in this case, so for many of these third
party libraries, they can mark things as transactional, which is
what happens in this case. I mean, this whole thing is basically
sitting in a transaction.
>>: [indiscernible].
>> Alvin Cheung: Yeah, so getting back to this, so basically
what we ended up doing is now you can basically see that we can
automatically compute these sort of verification conditions that
help us establish the fact that results equals some post
condition. Except that there's one problem in this case, right.
So I haven't really told you what are all these invariants and
what are all these post conditions. These are things that we're
trying to look for, right. At this point, we don't really know.
But as good computer scientists, what do we do, right? We just
throw abstractions at a problem and somehow it will go away,
right? What happens in this case is we'll just say at this point
that these guys are some function calls. We don't know what the
body of these function calls are, but we just say okay, here are
some invariant function that is a function of everything, all
the variables that are currently in scope, and then we'll just
leave it
out for now and see what happens later on, right.
Yes, question?
>>: So how do you find this output variable? I think that's a
pretty tough problem, because as soon as you have something like
[indiscernible] I imagine that this is nearly impossible.
>> Alvin Cheung: So we actually run a bunch of analysis to
figure out that, and I have a slide to talk about that, so hold
on to your question. I don't want to [indiscernible]. So that's
a very good point. So how do we actually figure out, like, what
are the things that we can possibly translate, right. So we'll
get to that in a second.
Okay. So I was talking about, okay so we basically treat these
guys as some function calls that we don't know how to express,
right? But before I talk about like how do we actually generate
the bodies of these functions, we first need to understand that
all these things are actually written in some language that is
used to express these kinds of logical expressions, right,
because you
see there's some [indiscernible] things, implication things. So
we need to be able to have some sort of language for us to
represent all of these verification conditions.
And notice that this is not like Java. It is not like the
original program. This is really a logical language. And for
the simple examples I've shown you in the previous slides, which
involve things like, you know, integers and simple variables, we
can just use a standard, you know, logical language for that
purpose.
But in this case, it involves like nasty stuff like queries and
[indiscernible] and all that. So we actually need to talk a
little bit about what is the language for us to express these
kind of things, right.
So as I was saying, whatever language we end up using, it better
be able to express all these verification conditions for the
imperative code, which is the source language in our case, right.
And more importantly, we
need to be able to handle all these query operations that happen
in the code. Because the code that we're analyzing has queries
embedded within the code itself, right. So whatever language
that we use, we better be able to express all of the relational
operations. And in particular, the order of records actually
matters.
So that might come a little bit as a surprise. But if you recall
like the nested loop example that I've shown you earlier, there's
actually an ordering of all the tuples that come out of the nested
loop, and the ordering is basically based on what the database
decides to return.
But unfortunately, SQL itself does not enforce a particular
ordering, unless we put in an order by clause, right. So in order
to preserve the semantics of the original program, we somehow
need to capture the fact that the stuff that is coming out
actually has an ordering embedded within it. So whatever
language that we use to represent these verification conditions, it
better be the case that they preserve the ordering.
So finally, we also want any post condition expressions that we
actually come up with is able to translate to the query language
that we have in mind. In this case, the target is SQL.
And finally, it better be the case that, you know, the function
body that I told you, kind of like a mystery, it better be able
to come up what they are. Otherwise, we can establish the fact
and do the kind of conversions, right?
So what we ended up doing is using something called ordered
relations, which is very similar to relational algebra with one
caveat. That the relations themselves, rather than modeling them
as bags of tuples, we actually modelled them as an ordered list.
So what do I mean, right? We basically mean the following. Here
are some examples. We say that in this world, a query result set
is an ordered list, and a list can be constructed by either,
like, assigning a query result to some program variable, or we
can construct one by concatenating multiple lists together, or
concatenating a list with an expression. So these are basically
standard list operations that, you know, most people are familiar
with. But besides that,
we also included a bunch of database operations that can operate
on these lists. For instance, we say that we can do joins, we
can do selections on these lists, which return other lists. And
this top thing is basically the familiar, you know, limit clause
that we have in SQL. So in this case, it basically returns the
top I elements from this list, where I is the result of
evaluating whatever this expression E is.
>>: So [indiscernible] is concatenation?
>> Alvin Cheung: Yes. So I'm kind of overloading this a little
bit here. These are actually concatenating lists, versus this is
just appending something.
So for expressions, you know, it's what you would expect. So,
for instance, we can get something off a list. We can, you know,
do some operation on two elements of a list, and then along with
that, we also define a bunch of aggregate operations that return
a scalar result.
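A rough model of these ordered relations, with the relations as plain Python lists; the operator names here are chosen for illustration and differ from the formal syntax on the slides:

```python
# Ordered relations: relations modeled as ordered lists rather than
# bags, with list versions of the query operators, so that the order
# the imperative code produces is captured in the semantics.

def select(rel, pred):
    # selection keeps the input ordering
    return [t for t in rel if pred(t)]

def join(left, right, pred):
    # left-major nested-loop order: every left tuple paired with each
    # matching right tuple -- exactly the order a nested loop produces
    return [(l, r) for l in left for r in right if pred(l, r)]

def top(rel, i):
    # the "top" operator: first i elements, like SQL's LIMIT
    return rel[:i]

users = [(1, "alice"), (2, "bob"), (3, "carol")]
roles = [(1, "admin"), (3, "guest")]

assert join(users, roles, lambda u, r: u[0] == r[0]) == \
    [((1, "alice"), (1, "admin")), ((3, "carol"), (3, "guest"))]
assert top(select(users, lambda u: u[0] > 1), 1) == [(2, "bob")]
```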
Okay. So now we have a language where we can express all these
things. Yes?
>>: Why is order important? Because if you take the original
application and run the same application twice, [indiscernible].
>> Alvin Cheung: Because the ordering of the tuples can be
arbitrary from the point of the result set being fetched from the
database.
>>: [indiscernible] but even if you add application twice,
[indiscernible] because the SQL doesn't give you any idea
[indiscernible] so the [indiscernible] application is fine, but
[indiscernible].
>> Alvin Cheung: That's because we cannot really enforce that,
right. In fact, we have seen applications that do rely on the
ordering of the stuff that is returned.
>>: Can't you use [indiscernible]?
>> Alvin Cheung: Take, for instance, that nested loop example,
right. So we know that the underlying database can
return the ordering of the tuples however they want. But then we
know that whatever the first tuple that gets returned will then
be joined in an inner loop fashion with everything else, right.
So there's some implicit guarantee as to what the result set
would look like if anything comes out of it. Do you see what I'm
saying? So because they have explicitly written a nested loop,
right, so then the first tuple that gets fetched from the outer
loop, whatever it is, will be joined across with everything else
in the inner loop, right, followed by the second tuple and so on,
so forth. So there is some implicit ordering there. You know,
we don't know whether that's really what the author had in mind
when he wrote that piece of code.
>>: [indiscernible].
>> Alvin Cheung: That's why we ended up doing this. As soon as
we have become [indiscernible], we have to preserve the order.
But I agree, I mean, for a simple selection, it really doesn't
matter. Okay. So now I've basically left you with this kind of
mystery of how we actually generate these post conditions and
invariants. And for that purpose, we pull out this magic box
called program synthesis. So we can basically think of this as a
search engine over the space of all possible SQL expressions,
right.
So here's how it works. Given a program synthesizer, which is a
piece of, you know, code or an off the shelf tool, we give it a
symbolic description of the search space, the search space in
this case being the space of all possible SQL expressions that we
can generate, and we also give as input a set of constraints that
we want the synthesizer to respect.
So guess what, right? So this set of constraints, in this case,
is exactly the verification conditions that we have automatically
come up with. So given these two things, what the synthesizer
will do is some sort of a search procedure, an intelligent one,
using symbolic manipulation and what is known as counterexample
driven search.
What does that mean? It means we try to come up with an
expression, a candidate expression, that can possibly fulfill the
requirements that we want, meaning that it's able to satisfy all
the verification conditions, and we try it out in terms of
proving whether that's correct. And if that works out, then
we're good. If not, then it incorporates that as a
counterexample, meaning that, well, you better not return this
same expression again, right. So that's kind of, at a high
level, how this counterexample driven search works.
So hopefully, at the end, we'll get an expression that satisfies
all the constraints. Or in this case, coming back to this
example, we basically want the synthesizer to come up with
expressions for both the post condition and also all these
invariants, right. And just to give you a sense of how this
works, in this case, let's say we're talking about the post
condition: we give the synthesizer, you know, as a description of
the search space, a bunch of expressions, meaning something like
this. So, for instance, we might try to infer that results
equals some selection of the users list, some top limit
expression, or some more complicated thing, and so on, so forth.
And, of course, we don't really write all these out explicitly,
because that would take forever, so we actually have a symbolic
encoding of this search space and because of time, I'll not talk
about this. But if you're interested, please come talk to me
afterwards.
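The search loop just described can be sketched as a toy version; real QBS searches a symbolic encoding of SQL expressions rather than the explicit candidate list used here, and the "verifier" below is just a check against sample inputs:

```python
# Toy counterexample-driven search: try a candidate, validate it, and
# if it fails, remember the failing input so cheap checks rule out
# later guesses before the expensive validation step.

def cegis(candidates, spec, inputs):
    counterexamples = []
    for cand in candidates:
        # cheap check against everything that ruled out earlier guesses
        if any(cand(x) != spec(x) for x in counterexamples):
            continue
        # full validation; in a real system this is a verifier call
        bad = next((x for x in inputs if cand(x) != spec(x)), None)
        if bad is None:
            return cand
        counterexamples.append(bad)   # why this guess failed
    return None

spec = lambda xs: [x for x in xs if x > 0]       # behavior we seek
candidates = [
    lambda xs: xs,                               # identity: wrong
    lambda xs: xs[:1],                           # top 1: wrong
    lambda xs: [x for x in xs if x > 0],         # selection: correct
]
inputs = [[1, -2, 3], [-1], [], [5, 5]]

found = cegis(candidates, spec, inputs)
assert found is not None and found([2, -2]) == [2]
```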
So finally, now you understand how we actually do this kind of
transformation, right. So it's an iterative loop where we try
to, you know, find these expressions, and if it all works out,
then we go ahead and transform the program.
But, you know, as was pointed out, we need to talk about how we
actually identify these code fragments to begin with. So very
simply, we can just
yes, question?
>>: It seems to me that you should also somehow model
[indiscernible].
>> Alvin Cheung: Yeah, so yeah, exactly. So I skipped a lot of
details about how we actually efficiently encode the space of
things and all that. But yeah, that's actually one of the things
that we looked at. Okay? So, well, for one thing, we probably
want to start at, you know, where the code is actually fetching
stuff from the database as a starting point, right. Because we
don't want to deal with all the parts of the program that do not
involve any database operations, okay.
So that's the starting point of where we want to look at these
kind of code fragments, and then we run pointer analysis to
figure out where things escape in terms of, you know, passing on
to things where we don't know what's happening. So given these
two pieces of information, we can actually determine where this
persistent data, meaning the stuff from the database, actually
flows to in the program. And that, in turn, allows us to delimit
the start and the end of these code fragments that we try to
analyze.
So these are actually very standard program analysis techniques
that basically, you know, we just have to do a lot of engineering
to actually get to. But to summarize this part of the talk, so
this is the way that this whole tool chain works, right. So it
takes in the source code of the application, unmodified, and we
then try to find all the places that use query results from the
database.
And then we'll try to, you know, find all these code fragments
using the mechanism that I just talked to you about, so we end
up being able to identify a bunch of code fragments within the
original application source that we can try to run these kinds
of analyses and conversions on. And given that, the tool will go
through some sort of verification condition computation to
generate the kind of logical expressions that we try to enforce,
and then we will do the synthesis step to try to come up with
SQL expressions.
And if that all checks out, then we'll go ahead and convert the
code and put it back into the original application source code.
And this thing, as I've shown you in the earlier example, is
interprocedural, meaning that we do look into different method
[indiscernible] and all that stuff. So this is the entire tool
chain, and now let's talk a little bit about whether it actually
works, right.
So we set up the experiments in the following manner. First of
all, we understand that there are no standard benchmarks
available for these kinds of transformations. I mean, there are
some ORM libraries, and there are a lot of open source
applications, but there's no standard benchmark available. So we
did our best in terms of finding two large scale open source web
applications for this purpose.
In this case, these applications were written in Java using
[indiscernible] as the ORM library. So we want to measure two
things to test out our tool. First of all, we want to know how
many code fragments we're actually able to identify and convert
out of the application source code. And secondly, we also want
to measure the benefit that we get in terms of execution run
time. Do we actually get any, you know, speed up from converting
things into SQL, right?
So for the first part, so here is one application that we looked
at. And I've basically broken this thing down into the types of
different operations that we're able to identify in terms of the
code snippet and also the number that we're able to convert as a
result. So you can see that we were able to convert quite a
large portion of them. And they do come in a variety of
different SQL operations, joins and projections and whatnot.
And here's another one. Similar story. So in terms of
performance, what actually happens? We set out to measure the
effect of conversion in terms of execution run time. So on the
right here, we have taken one code snippet that we were able to
transform. In this case, we transformed it into a simple
selection query. The original application just fetches
everything from the database, then does filtering in a loop
fashion, and then returns a subset of the result.
So in this case, we try it out, you know, with ten percent
selectivity, and then we scale up the size of the database and
measure the time that it takes to load up the entire page, not
just running the queries, right. So this is fetching
stuff from the database and also rendering everything, getting
the whole thing from the web server and displaying it on to the
screen.
So this is the original application, the time that it takes,
and, as I have mentioned, it basically fetches everything from
the database, right? And here is what happens when we convert
it into a single selection query. So you can see this is
basically an order of magnitude difference, right? It's pretty
obvious, because we have basically fetched a subset of the
things, as opposed to, you know, the original application that
fetches everything and then does, you know, the filtering inside
the application itself.
So now we also tried with 50 percent selectivity. So again, this
is the original application, and in this case, you know, it
didn't scale as well, obviously, because the application was
fetching more stuff than at 10 percent selectivity. So we
wouldn't expect, you know, the same kind of performance benefit.
But you can still see quite a substantial difference in terms of
the time.
Now, the more interesting case is actually the nested loop
example that I showed you earlier, right. So this is the exact
code snippet that I just showed you. So here is how the original
application does. I mean, not so hot. As you can see, this is
basically scaling up as N squared, because it's doing a nested
loop, right. And guess what? We actually ended up scaling
linearly, because we were able to convert from a nested loop join
into a hash join, because, you know, it turns out that the
database already has indexes for these tables. So, you know,
just by converting into a SQL query, we magically get this kind
of speed up, which is kind of good.
And finally, same thing or aggregation, right. So in this case,
this guy is trying to do a count. So it fetches everything and
then he basically sums up the number of tuples that he returned.
And, you know, this is the original performance. And again, we
have, you know, another order of magnitude difference just by
converting things into an aggregation and executing that inside a
database, right.
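The same conversion for the count case can be sketched like this, again with an invented schema; the point is that only a single number, rather than every tuple, comes back from the database:

```python
import sqlite3

# Invented toy table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, 1.5 * i) for i in range(1000)])

# Original pattern: fetch every tuple, then count in the application.
rows = conn.execute("SELECT id, price FROM items").fetchall()
app_count = 0
for _ in rows:          # all 1000 tuples cross the interface first
    app_count += 1

# Converted pattern: push the aggregation into the database, so only
# one number comes back.
(db_count,) = conn.execute("SELECT COUNT(*) FROM items").fetchone()

print(app_count, db_count)  # 1000 1000
```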
So just very briefly, I also want to touch on what are the things
that we actually failed to convert, right, because as I've shown
you earlier, we were not able to convert everything. So why is
that the case, right? So first of all, there's like a bunch of
cases where they use some sort of custom logic to do a
comparison. So, for instance, they wrote their own, you know,
their own comparison operator that, you know, compares some
string value from the database and then sorts the results that
way.
So in that case, since at this point, at least in this initial
prototype, we don't handle stored procedures, those kinds of
expressions were not expressible in [indiscernible] SQL. So
therefore, we failed to convert that kind of thing.
And another set of things uses database schema information.
So these are the kind of more interesting cases, for instance.
In this case, you know, this guy is basically issuing a query
that fetches things from the database and it turns out that ID is
the primary key, right. Then he decides to sort things and then
he decides to, you know, do this kind of loop. So if you look at
this carefully, this is really just getting the first ten
elements from the table.
But unfortunately, we're not able to convert this into the SQL
representation, because we didn't know that, you know, ID is the
primary key, and therefore this is really the equivalent query in
this case, right.
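Here is a sketch of the pattern being described, with an invented schema. The rewritten query is only equivalent because id is unique, which is exactly the schema fact the prototype was missing:

```python
import sqlite3

# Invented toy table; id is the primary key, hence unique.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, "row%d" % i) for i in range(200)])

# Application pattern: fetch everything, sort by id, loop to take ten.
rows = conn.execute("SELECT id, payload FROM t").fetchall()
rows.sort(key=lambda r: r[0])
first_ten_app = rows[:10]

# Equivalent query -- but ONLY because id is the primary key, so the
# sort order is total. Without that schema fact the rewrite is
# unsound, which is why the prototype bailed out here.
first_ten_db = conn.execute(
    "SELECT id, payload FROM t ORDER BY id LIMIT 10").fetchall()

print(first_ten_app == first_ten_db)  # True
```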
So, of course, if we incorporate that information into our
prototype, we should be able to do better. But that's basically
[indiscernible]. Yes?
>>: Did you find that the [indiscernible].
>> Alvin Cheung: At least in all the experiments that we've run,
we have not encountered that case. And that's because, like, you
know, things that we convert are things like selections or
projections or joins. So unless, you know, unless the database
implementation actually ends up slower than, you know, what the
application source code would do otherwise. Yeah, that's a good
question, actually.
Okay. So now we've just talked about this thing of converting
things -- automatically converting things from Java to SQL.
But as I mentioned to you in the first part of the talk, we are
still left with the problem of how do we actually execute the
application itself, right. Because, as I'm saying, some part of
it, we can convert into, you know, SQL [indiscernible] stored
procedures, but how do you make the decision as to when we
actually do the conversion, right.
Again, the high level goal here is to be able to be flexible in
terms of how to execute a piece of code, right. Okay. So for
this part of the talk -- yes?
>>: If you had nested loops in the application program and
there's imperative logic in between the nested, are there cases
where that's going to cause trouble in terms of reverse
engineering a query just because of the expressiveness of the
query, if statements and procedure calls and whatnot?
>> Alvin Cheung: Yeah, that's a good question. So the classic
example that we have seen is, for instance, basically people
putting log statements inside a loop. So somehow you want to
print out something like, you know, a progress report: I have
fetched, processed 50 percent. So for those kinds of things,
unfortunately, we cannot reverse engineer it, because we cannot
push, like, the print statement into SQL.
>>: But you're also not going to bring back all the data in the
outer loop.
>> Alvin Cheung: Exactly. The fortunate thing is we can,
indeed, detect those cases and bail out of it. So we are sound
in that. Yes.
>>: Can you partially convert? So, for instance, you have that
application logic where something like you need to return a list
based on a table, and you need to filter out the ones where the
numbers in a certain field are prime or not. And so you can
certainly propagate that logic, I'm sure, to SQL. But then can
you kind of separate that logic through another filter within the
application language, or does it kind of [indiscernible] on
things like that?
>> Alvin Cheung: Yeah, so for now we've decided to bail out on
that particular case as well. But yeah, I think that's an
interesting point, where we can actually try to -- but then that
also gets to the question of, like, how much we want to convert,
even if we can. I mean, we don't have to convert everything.
And that's actually part of the second half of the
talk. That's a very good lead in, actually.
Okay. So now we want to talk about where we want to execute a
piece of application logic, right? So as a running example, I'm
going to use this code fragment, which is -- we're all familiar
with TPC C. So in this case, this guy is trying to basically do
some sort of a new order transaction. So as you can see here, what
sort of a new order transaction. So as you can see here, what
happens is that, you know, initially, we execute some sort of
query, and then we, you know, compute sums of total amount and
then we do an update into the database, so on and so forth.
So notice that when we actually execute this piece of code,
what's going to happen is that the first statement is going to
execute it on the database, right, because this is a query, okay.
So but then the second line is going to execute on the app
server, because this is a piece of imperative application logic,
and so on so forth for the rest of the program.
So notice also because of this that we actually get some network
communication between two statements, because either the database
has to relay the results back to the application server, or the
application server has to tell the database, okay, please execute
this query, right? And in this particular example, there turn
out to be quite a lot of these network communications, network
round trips between the two servers, and these again turn out to
be not so good in terms of performance, right. Slows things down.
So in this case, our goal is basically to try to eliminate some
of these round trips between the servers. Of course, we
cannot eliminate everything, because after all we do need to
communicate with the database in order for the application to
proceed, right.
So one piece of standard wisdom in our community in terms of
speeding things up is to basically group things into stored
procedures, okay: push application logic to the server, in
addition to queries. So in this case, what we might want to do,
for instance,
is to notice that in this case, you know, this thing prints out
something on to the screen so it has to be on the application
server itself. But then for everything else, we can try to group
that together into a stored procedure and push that over to the
database server and likewise for this, right.
Now just doing this, as our experiment shows, actually decreases
latency by three times, 3x. So it's obvious that's the case,
because we are reducing the number of round trips, right? But
notice, as I have pointed out, in this case, this line has to
execute on the application server. So what does that mean? So
that means we need to keep track of the ordering, because in this
case, this is actually covered by an if statement. So this line
is only executed if the if statement turns out to be true.
So in programming language terms, that means we have a control
dependency. We have a control dependency of these two lines on
this particular if condition. These two lines are only executed
depending on the outcome of the if statement. So
if we want to do this, you know, stored procedure approach, we'd
better, at the end of this stored procedure, return whether this
is true or not, because otherwise the application does not know
whether it should go ahead and print stuff on to the screen or
call another stored procedure or database query, for instance,
right.
And in addition to that, in order to make this work, we also need
to keep track of all these data dependencies. But why? Notice
that in this case this guy defines a variable that gets used
later on if we decide to go back to the application server.
So that means at the end of this stored procedure, we also need
to return the value of this variable and forward it back to the
application server if we decide to go back to the application
server, right.
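One way to picture the bookkeeping just described is a hand-written wrapper for the pushed-down fragment: it has to return both the outcome of the if condition (the control dependency) and any variable defined on the server but used later on the app side (the data dependency). This is a toy simulation in Python, not actual Pyxis code; all the names and numbers are invented:

```python
def server_fragment(order_count):
    """Stands in for the part of the transaction pushed to the DB.

    Returns (needs_app_side, credit): the boolean carries the control
    dependency (did the if condition hold?); `credit` carries the data
    dependency, since it is defined here but may be used on the app
    server afterwards. Everything here is invented for illustration.
    """
    total = sum(10.0 for _ in range(order_count))  # stand-in for query+sum
    credit = "BC" if total > 100.0 else "GC"
    needs_app_side = total > 100.0
    return needs_app_side, credit

def app_side(order_count):
    branch_taken, credit = server_fragment(order_count)
    if branch_taken:
        # Only reachable when the server reports the condition held;
        # without the returned flag the app could not know this.
        return "printing notice for credit=%s" % credit
    return "no app-side work"

print(app_side(20))  # branch taken: total is 200
print(app_side(5))   # branch not taken: total is 50
```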
So okay. So now you're thinking, well, we just need to keep
track of all these things when we write the stored procedures,
just be a little bit careful and we should be fine. Go ahead and
deploy this on the database as stored procedures. But then guess
what, right? There's also this problem of multi tenancy, meaning
that the same database server can be shared across different
applications. And in particular, if the database is actually
under heavy utilization, then pushing all this application logic
to the database is not actually that beneficial after all.
So maybe after all, we should go back to the original version,
where we just incur all this extra network round trips and that
might actually turn out better, right? So imagine doing this
kind of thing for millions of lines of code, right? It's just
not going to be, you know, a manual effort is just not going to
cut it.
So to resolve this problem, what we ended up doing is building a
new system called Pyxis. And it does two things in particular.
One is that it automatically stored-procedurizes, if you will,
database applications and pushes part of the application logic
to the database, like what I was showing you earlier. And
to target this problem of, like, multi tenancy, we also
adaptively control the amount of application logic that we want
to push based on the current load on the database server, right.
And we are able to do all of this without any programmer
intervention, meaning we're able to, you know, automatically
somehow partition the program, split things up, and then deploy
and execute different parts differently.
So at this point, you're probably thinking how is this, like,
possible. How do we actually do this. Well, let me show you how
this works. So imagine this is the original application which
has basically some SQL and, you know, imperative logic embedded,
intertwined. So what we do in Pyxis is we first create -- we
first collect some sort of profile data about the application.
And given this profile data, we have an automatic partitioner
that takes in the profile data and tries to partition the
application into two parts.
One part is executed on the app server and another part is
executed on the database server as stored procedures. Now,
during execution, after deployment, these two partitions of the
application are going to periodically pass application control
back and forth, and we call that a control transfer. And to
facilitate this adaptive property, we actually have a monitoring
component on the database server that keeps track of the current
load on the database.
Now, for instance, in a case where the database turns out to be,
you know, under heavy utilization, our deployment tool will
automatically switch back to another partitioning. For instance,
in this case, we will decide to do the original partitioning,
where we go back and forth between the two servers instead.
So how do we do this kind of source code partitioning, right? So
we take the original application source code and then we just,
you know, process it from beginning to end.
Now, as I mentioned to you earlier, we first do a profiling of
the application, and we also collect some capabilities in terms
of how many cores. We need the specification of the two servers
that we are targeting, okay. So, for instance, in this case,
given like this piece of code I've shown you earlier from TPC C,
what we would do is we would collect the number of times that
each instruction is executed [indiscernible], and we use that as
part of the profile information, because this is going to guide
us in terms of doing the program partitioning.
Now, the next step is to basically do the partitioning and the
way that it works is the following. So we basically take the
original program and we create what is known as a program
dependence graph. So for those of you who might not be familiar
with this, so this is basically a simple graph where every node
corresponds to a program statement in the original program, and
it also keeps track of control flow edges and also data flow
edges. So what do I mean? By control flow edges, we mean, like,
you know, if there's control dependence between two statements,
then we basically insert an edge. So in this case, for instance,
we know that this is the order of the program. And then since
this is an if statement, we have control dependencies with these
two statements, okay. So we create this
graph and we basically add edges in this manner.
Now, each node, as it turns out, also has a weight which will
help us decide on the partitioning, and the weight, at this point
we're basically just using the counts that we have collected
during profiling of the initial application.
So now we also insert a bunch of data flow edges in terms of
talking like, you know where data flow from one statement to
another. So in this case, for instance, notice that there's this
count variable that's defined in the first statement and used in
the second one. So we insert a data flow edge that says there's
a data dependency between these two statements, and the thing
that we're talking about here is this particular variable.
And so on, so forth, for the rest of the program, right. So we
just create all these data flow edges. So this is like basically
part of the process of creating this program dependence graph,
okay.
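As a sketch of what such a graph looks like in data-structure terms, here is a toy program dependence graph with both kinds of edges; the statements, weights, and edges are all invented for illustration:

```python
# A tiny program dependence graph in the spirit described above.
# Node ids, statements, profiled weights, and edges are invented.
nodes = {
    1: ("count = runQuery(...)", 1),
    2: ("total = count * price", 1),
    3: ("if total > limit:", 1),
    4: ("log(total)", 1),        # control-dependent on node 3
    5: ("runUpdate(total)", 1),  # control-dependent on node 3
}

# (src, dst, kind, label): control edges follow the branch structure,
# data edges name the variable flowing between the two statements.
edges = [
    (1, 2, "data", "count"),
    (2, 3, "data", "total"),
    (3, 4, "control", None),
    (3, 5, "control", None),
    (2, 5, "data", "total"),
]

data_deps = [(s, d, v) for (s, d, k, v) in edges if k == "data"]
control_deps = [(s, d) for (s, d, k, v) in edges if k == "control"]
print(len(data_deps), len(control_deps))  # 3 2
```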
So we use this graph for partitioning purposes, and in particular
we generate a linear program to solve this problem. Where in
this case, we're trying to minimize this expression here, which
is basically the sum of the edge weights, since each of them
represents basically the number of bytes that needs to be
transferred should we decide to cut the program in any particular
way.
So in this case, each edge, both data and control, has an
indicator variable, EI, that is set to one if we decide to cut
the program at that point. And this goal is basically subject
to a number of constraints, where we say that, you know, EI is
equal to one if we decide to cut the program at that point, and
zero otherwise.
On top of all of these, we also have an extra final constraint
that talks about what the load on the database is, if you will,
in terms of the amount of CPU, memory, and IO resources that are
available, and we want to say that whatever partitioning we're
able to come up with had better respect this particular budget
constraint, because we don't want to push too much stuff to the
database, right.
So you can imagine, we can basically solve this linear program
for different values of the budget, which correspond to different
amounts of resources available on the database, and that
basically gives us different partitionings of the same program.
So you can think of this as we just
generate various versions of the same program, except that they
all differ in terms of how much application logic we decide to
push from one server to another.
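The formulation just described can be sketched with a brute-force stand-in for the linear program solver: minimize the bytes carried by cut edges, subject to a database-side resource budget, with some statements pinned to one server (queries to the database, screen output to the app server). All the node costs, edge weights, and pins here are invented; the real system hands this to an LP solver rather than enumerating assignments:

```python
from itertools import product

# Invented toy PDG: node -> CPU cost if placed on the database (D),
# edge -> (src, dst, bytes transferred if the edge is cut).
node_db_cost = {1: 1, 2: 2, 3: 1, 4: 3, 5: 1}
edges = [(1, 2, 8), (2, 3, 4), (3, 4, 4), (3, 5, 4), (2, 5, 8)]

# Some statements are not free to move: queries must run on the
# database, and the statement printing to the screen must stay on
# the app server (A). These pins are invented for the sketch.
pinned = {1: "D", 5: "D", 4: "A"}
free = [n for n in node_db_cost if n not in pinned]

def best_partition(budget):
    """Brute-force analogue of the linear program: minimize the bytes
    carried by cut edges, subject to a database CPU budget."""
    best = None
    for assign in product("AD", repeat=len(free)):
        placement = dict(pinned)
        placement.update(zip(free, assign))
        db_load = sum(c for n, c in node_db_cost.items()
                      if placement[n] == "D")
        if db_load > budget:
            continue  # violates the resource constraint
        cut = sum(w for s, d, w in edges
                  if placement[s] != placement[d])
        if best is None or cut < best[0]:
            best = (cut, placement)
    return best

# A generous budget pushes the free statements to the database; a
# tight budget forces them back to the app server, cutting more edges.
print(best_partition(10)[0])  # 4
print(best_partition(3)[0])   # 20
```

Solving for a range of budgets, as above, is what yields the family of pregenerated partitionings the runtime can later switch between.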
So solving this program, basically, gives us a solution where we
assign each node, in this case a program statement, to either the
application server or the database itself. And the output of
this program is represented using something called Pyxil, which
sounds like the name of a drug, and I apologize for that.
It actually stands for the Pyxis intermediate language. So here's
an example. It looks very much just like Java. In fact, like,
we can actually program in this language. The only difference is
that every statement is now annotated with either D, meaning that
it is intended to be executed on the database, or A, which means
it's intended to be executed on the app server. So as I've shown
you here, this is just one example. We can obviously generate
different types of partitioning, given the different budget
constraints that we have for the database server.
Now we still have the problem, I mean, how do we actually compile
this down to a real program that we can execute. So that's what
this compilation and runtime component in Pyxis is all about.
So again, you know, we want to somehow compile this program in
this funky language into Java, and again one half of the program
is going to be executed on the app server, another half on the
database. And I just want to point out that we didn't change
anything, in the sense that, in this case, it's a Java program,
so we just run it on any normal JVM. We don't require any special
tricks or any sort of specialized run time system in order to
deploy this.
Meaning that we basically take the program in this Pyxil
representation, and we compile it to Java, and you can basically
run it anywhere, right?
So the compilation procedure is pretty much straightforward
except for two issues. One is how do we actually do these kind
of control transfer implementation between the two servers, and
also how do we synch up the Heap, because as I pointed out, we
have all these data dependencies. We somehow need to be able to
synch up the two runtimes, if you will. So again, back to this
example, right.
So how do we do control transfer? Well, we simply try to group
together statements that have the same annotation. In this case,
they're all annotated with database. So we know that these guys
need to be executed on a database. And so basically what we do
here is we take these statements, compile them to what we call a
code fragment and execute them on the database server itself.
And then we basically keep doing this until we encounter
something that we know has to be executed on the app server. At
that point, we automatically -- the runtime system on the two
servers will basically go and tell the other guy, okay, I need
you to execute this next code fragment, which corresponds to,
you know, for instance, code fragment number 110.
So on, so forth, for the rest of the program. So that's easy,
practically speaking. Now, for the Heap, what do we do? So
during execution, we actually keep track of all the things that
the -- all of these code fragments, I guess -- we actually keep
track of all the state changes that each of these code fragments
incurs. So, for instance, in the code example that I've shown
you, we'll basically note that when we first try to execute this
particular code fragment on the database server, the runtime will
automatically figure out that we need to forward the
[indiscernible] of these two variables over from the app server
to the DB server, and that's because they are used inside
this particular code fragment. So our run time system
automatically figures all that out and also keeps track of
anything that gets changed.
For instance, if we decide -- if the runtime actually executes
this code fragment on the database server, you will notice that
we actually define the variable credit. So if the if condition
turns out to be true, meaning that we do go back to the app
server, we will forward the value that is needed in order to
print out the statement on to the screen. And if we decide to
stay on the app server -- I mean, the database server -- then we
don't need to forward anything, because there's no change to the
Heap. I mean, it's basically the same Heap.
So we do all that inside the runtime, which is basically just
another Java program that sits on top of the JVM
on the system server. There's nothing special about this.
And so on, so forth, so we keep executing that way. And again,
we have a monitoring component which will basically tune to
different loads on the database, right. So if the load on the
database turns out to be too much, we'll simply switch to another
partitioning that we have pregenerated when we solve the linear
program, and then we'll just execute as before. I mean, nothing
really changes. You can basically think of this as just running
another version of the same program.
So now let's look at how we actually do in terms of experimental
results, right. So in terms of experiment setup, we have a TPC C
implementation in Java. In this case, we have 20 terminals, each
running simultaneously, each running new order transactions.
Most [indiscernible] of all of them. Asked [indiscernible]
setting, and in this case, the database server has 16 cores
total. And along with the one that we automatically generated,
we also compared it against two other implementations.
The first one is the original application, if you will, where
everything is executed using JDBC. So, you know, it basically
incurs all these network round trips. And another one where we try
to manually push everything as much as we can over to the
database server. So that's what we call our manual
implementation.
So with all the cores that are available in the beginning, when
we did this experiment, we measured the amount of -- this is the
standard latency versus throughput curve that we are all
familiar with. So the JDBC implementation has kind of a high
latency, as you would expect, and it's also not able to sustain
very high throughput numbers, because of all these network round
trips that have been incurred, right. Compare the manual
implementation, which as you can see here has much better latency
and is also able to sustain [indiscernible], because we push most
application logic to the database and the database has a lot of
cores available to run all these things.
Now, for us, so we automatically detect that, you know, in this
case, there are plentiful resources that are available on the
database server. So we ended up choosing a partition that is
very similar to the manual implementation or the stored
procedurized implementation. We try to push as much stuff as we
can over to the database, and this makes sense, right? So
basically, we're able to do this automatically, without the
programmer, you know, doing all this manual process of creating
all these stored procedures and whatnot. So we basically get a
3x latency reduction and we also get substantial improvement in
throughput.
Now, let's take a look at the reverse, right. So in this case,
we restrict the number of cores that are available on the
database server, and we do the same experiment. So again, the
JDBC implementation, you know, does, you know, kind of all right
and stable latency across the board. Now, with the manual
implementation, the interesting thing in this case is that it's
not able to sustain high throughput numbers, because at some
point it saturated the entire database server. I mean, there are
no more cores available, so it just ended up choking. It's not
able to handle all these extra transactions that are coming in.
Now, in our case, we actually are able to detect that there are
not enough resources available, so we choose a partitioning
automatically that is similar to the JDBC implementation. So
that's again, you know, makes sense because we basically want to
use the JDBC implementation if there are not enough resources
available on the database server.
So finally, in this experiment, we did the dynamic switching
thing, where in this case, the X axis is now time. So we fix a
particular throughput and we measure the average latency across
the board. The only difference is that at some
point in the experiment, we decided to restrict the number of cores
that are available to the database server. So initially, it has
access to all 16 cores. But at some point into the experiment,
we decided to cut some of them off, okay.
So JDBC performed as is, and then in this case, the manual
implementation originally did very well, because it has all the
cores available. And at some point after this kind of shut down,
the latency just skyrocketed, because of the fact that once
again, a similar case as before, there's no more cores available.
Now, for our implementation, the interesting thing is we are
able to automatically switch between these partitions. So you
notice that in the beginning, we choose a partition that is very
similar to the manual implementation, where we push things to the
database. And at some point after we shut off all those -- most
of the cores, we decided to switch automatically over to the JDBC
implementation.
So just for the record here, we also show the percentage of
transactions that are served using the JDBC implementation. So
you can see that initially, none of them were using that
particular version, because there are plenty of resources that
are available. But over time, we gradually converge to this
other mode of operation, if you will, where we execute most of
the things using this JDBC implementation, right.
So in summary, in this case, we're basically able to show that we
automatically switched to the most efficient implementation based
on what is the current server load on the database. Yes?
>>: For the last few minutes, which are 100 percent, why is it
not the same as [indiscernible].
>> Alvin Cheung: Oh, that's just because of measuring the way
that we measure latency. It's an average, a running average.
Okay. So that comes to the conclusion of this talk, where I
basically have shown you this umbrella project where we tried to
use different program analysis, you know, techniques to try to
speed things up. The goal is to help application
developers write these kind of database applications. In
particular, we've shown you two things. One is the ability to
convert things from imperative code into SQL and another piece of
work that is able to do automatic code partitioning and also
distribution of the application across the two servers. And we
have a website if you're interested in learning more, and at this
point I would love to take questions. Thanks.
>>: For the second part of the talk, how do object relational
mapping systems play into that? Because you're actually
generating SQL to run the server with the [indiscernible].
>> Alvin Cheung: So in the traditional [indiscernible] setting,
most of the [indiscernible] logic would be on the app server. So
we haven't done the integration yet, but I think it would be very
interesting where we start pushing part of the ORM code over as
stored procedures on to the database and see how they would work
out. I think that's a very interesting follow up work. Because --
>>: Especially for the updates, because they tend to -- you're
running a cache and then they have to somehow take a diff of the
updated cache -- cache updates -- and somehow package them up
into updates that get sent down to the server. It can be quite
[indiscernible].
>> Alvin Cheung: Right. So on one aspect, the good news is we
already have the machinery for all of that -- blah, blah, blah --
but the interesting thing is perhaps how we formulate the linear
program to also take all that stuff into account. So I think
that would be an interesting point for follow up work.
>>: So it seems the first part of the talk went to
[indiscernible]. So what you're doing is you're taking an
imperative code and you're turning it into a SQL query, right.
And then the database afterwards turns the SQL query into
imperative code. So can you also think of a way where you can
actually take a normal application that does not involve the
communication with the database and try to optimize it based on
some [indiscernible].
>> Alvin Cheung: So I guess we probably don't want to push,
like, things that the database is not good at doing into the
database itself, right?
>>: No, I'm just saying forget about the database. Just, you
know, you're analyzing a program which does any sort of
processing of data which does not sit on a database but sits in
memory somewhere and you're just analyzing this program, creating
a relational algebra and using standard [indiscernible]
techniques to optimize [indiscernible]. Can you think of a way
of using your work to do this kind of transformation?
>> Alvin Cheung: You mean for out of context?
>>: Yeah.
>> Alvin Cheung: Exactly. So we're looking at similar
techniques for map reduce programs. So you can think of it as,
you know, it would be nice if we could come up with a way to
convert arbitrary pieces of code into [indiscernible] and
leverage the query engine's ability.
So what database systems are good at doing is basically this
kind of relational optimization, if you will, right. So they're
very good at choosing, like, what is the best query plan and so
on, so forth. So in that respect, I don't think we want to
go in that direction as to bypass all that machinery. I think
it's good that databases are good at doing that. So what we're
trying to do here is leverage what they're good at doing, you
know, right.
>>: That's what I'm saying. Use the existing database
technology to create a SQL query and give -- let the database
give you more optimized code for processing the same data. But
not actually communicating with the database, but doing the same
thing inside the program that you're running.
>> Alvin Cheung: So you mean like actually do the reverse, which
is we try to push some of the database logic back to the app
server?
>>: Yeah.
>> Alvin Cheung: Oh, yeah, I think that would be interesting.
Although I think in some sense, having the computation close to
the data is a good choice, right.
>>: I'm talking about situations where the data is actually in
the application.
>> Alvin Cheung: Oh, yeah, I see. Sure, sure. That would
actually be interesting. I think that would get into some of
the work we're trying to do for map reduce, for instance.
Because in this case, they do heavily use memory for, at least
for temporary storage. So I think the interesting thing in that
case would be to model what is, you know, how much data that we
can imagine will be sitting in the cache and being able to
optimize that way, as opposed to right now, where we assume that
the cache does not exist, or in other words, we assume that, you
know, most of the data is sitting in the database and is not
cached on the app server. But I think that's a good point. Yes?
>>: So specify the question, if you have expressions
[indiscernible]. So can you use [indiscernible].
>> Alvin Cheung: Yeah, so I think yeah, that's a good point.
So, in fact, I think the reverse question is also interesting.
Which is, if we can push things from app servers to the
databases, why not also lift some of these things back from the
database over to the app server, you know, for cache locality,
for instance.
>>: So I have an example in mind. So for example, I have an
application written, and what it is doing is it's selecting two
tables, select star on two tables, and then within the
application logic it's trying to do something with those data and
[indiscernible]. Now, if we want to convert this to a SQL query,
every time you have to write a new query depending on the
parameter, right? But in the previous case, it was doing
[indiscernible] A and B, which is more efficient in terms of
using the database cache. So probably those [indiscernible]
cache and it will be very fast and the further process is being
done in the memory so it will be very efficient.
But in your approach, we'd be writing a new SQL query depending
on [indiscernible] and it won't be cached in a database. So in
that case, the performance might very well --
>> Alvin Cheung: Yeah that's a good case, actually. Thanks,
yeah, absolutely. I think that would be an interesting
[indiscernible]. It's just that, like, at some point, the linear
program starts to get really unruly in terms of being able to
solve it. So it's all about how many tricks we can pull in order
to be able to formulate a very small program that is easy to
solve.
Yeah.
>> HOST: Thanks again.