>> Rustan Leino: All right. Good morning everyone. I'm Rustan Leino. It's my
pleasure to introduce William Cook, who is our speaker this morning. William is
a systems professor at the University of Texas at Austin. And he's done a lot of
work with languages and going into databases and other cool stuff. He's a
hacker, as we know. And he spent about ten years in industry and start-ups after
his Ph.D. and before returning to academia. And today he's going to tell us about
what are called batches.
>> William Cook: All right. Thanks Rustan. I have been working on this stuff for
many, many years and I finally have a good story to tell. So I'm pleased to be
able to be here and tell you about it. I want to give a quick advertisement for
some other current work. I'm working on four different areas right now: a new
system called Enso for programming with DSLs, support for hybrid partial
evaluation to be able to compile those DSLs and interpreters efficiently, a
structured concurrency language called Orc, and batches. So these
last two are about distributed programming and the first two are about
domain-specific languages. So I'm going to talk about batches today. What I
want to do is suggest that there are really three different kinds of remoteness that
we deal with commonly, at least three different kinds. There are more. But if we
take these three different kinds, they're very similar in some ways, and we can see if
there's some way to bring them together: the idea that computation, databases,
or services are remote.
If we start with RPCs and look at where a lot of our thinking about this came
from, at least historically, it's the notion of RPC: we're taking a procedure and
calling it on some data and we want to allow that procedure to be remote. After
20 or 30 years of development we had CORBA and DCOM and RMI and this
notion of distributed objects, but it's based on the idea of taking a procedure and
making it remote, so when you call the procedure the arguments get bundled up,
a message is created, it's sent over to the remote server, it's executed, and results
are returned.
And it seems like a reasonable idea. It's trying to hide the remoteness,
make the remoteness completely transparent.
And it's been somewhat problematic. I'd actually go further and say that the
entire thing has been fundamentally a failure. CORBA, DCOM, all of these
things didn't work very well. And it's one of the largest sort of engineering
failures in the programming language community. It just has not been adopted
widely.
And there's a lot of work that's been done to figure out ways to work around the
problems with it. What are the problems? Well, it has big issues with latency,
that every time you make a remote call it's a round trip to the server. So the
more remote calls you make, the more latency you get. And it tends to require
stateful servers, that is, the servers have to maintain knowledge about the client
state. It tends to be very platform-specific because we're serializing objects back
and forth. And so there have been some ideas to solve some of these issues,
where you redesign your interfaces to make the communication have lower
latency, by combining different operations together to create a more of a facade
and doing bulk data transfer using data transfer objects.
And the problem is that these solutions are fairly ugly and they're sensitive, in
that if the client behavior changes, you have to rewrite your server. So it creates
all kinds of bad dependencies in your architecture. So the benefits of saying,
well, we can use existing languages and we can have a nice elegant model
where remoteness is invisible, the problems with that approach kind of
overwhelm any benefits we get. So what I want to do is start again and try to find
something that unifies these three things and provide some of the benefits of
each of these different ideas. In particular, Web services are very
cross-platform. RPC is easy to use. SQL has a lot of efficiency and
transactional behavior. And so what I'm going to do is take a clue from an
ancient paper.
This was actually the paper that coined the phrase impedance mismatch, or at
least applied it to software and databases. And what David Maier said was that
whatever our programming model is, it has to allow complex data-intensive
operations to be picked out of programs for execution by the storage manager.
What's this idea of picking stuff out? You'll see that one way of looking at it is that
Microsoft LINQ does that. But I have a different way to do it.
So here's what I'm going to do. I'm going to start over and take a different
example as my starting point. RPCs took as their starting point a single
procedure call with some data, and I'm going to take two procedure calls. And
that little change is going to change everything.
It just depends on what you take as your starting metaphor. So now let's just
consider what happens if we have remote object R and we want to make two
calls to it. Get the name and get the size and print out the answer of both of
those.
So there's actually an interleaving of the remote computation and the local
computation. What I want to do is I want to do this fast so it does it in one round
trip instead of two. I want to have it be stateless. I want it to be platform
independent so there's no assumptions of -- well, this case nothing is being
serialized but I could also have had a serialization involved there.
And keep the nice clean programming model. So anybody can solve this
problem, right? Give me the answer. You've already got the clue in the title of
the talk. Right? Batches. I sometimes present this in a tutorial mode where I
write this example on the board and have students or other professor types solve
it. It's really interesting. Because just giving this hint even with no context at all,
people will get the right answer if you tell them let's do two calls at once.
So what we do is we need to change the language, though. That's the part we
usually don't get to do. In the past all of these attempts to solve this problem
have always assumed that we have a programming language that's designed for
a sequential machine and now we want to make it work remotely but we can't
change the language. All we can do is write stubs and code generators and
write libraries. But you can't solve the problem unless you change the language.
And so my change, the proposal, is to add a new kind of control flow statement.
Like the for loop, the if, the while, now we have batch. And what batch does is it
takes a focus object that comes from some service. So R is an interface to
a virtual remote object. We're not going to say that that remote object exists
on the client. It doesn't necessarily even exist on the server, but the client
can refer to its properties and call methods on it like get name and get size, and
the semantics of this is to run the name and the size together. So the program is
going to get partitioned into the remote and local part. It's going to send the
remote part to the server as a script.
So if you think about what RPC does, it sends a single procedure call to the
server and the server interprets it. It's like a one line script that means call this
procedure. Well, why can't we have a script that calls two procedures?
It's really easy to do. And so this actually creates a remote facade on the fly. It
allows you to call a double procedure in one call by composing two of the actual
interfaces. And the server will execute the two calls, and it will make a pair of
results and send that back to the client. And that creates a data transfer object
on the fly. Whatever calls you make, it will make up enough of a pair or triple or
tuple to get the results back. And then it reruns the remaining client code and
prints out the first item of the pair compared to the second item in the pair.
The only data that gets transferred is the script and the results, as a tuple or a
pair in this case. Pretty straightforward, right? Well, here's what it looks like. If
you look at what the compiler generates, it's going to generate a
script constructor that will construct the script. It's got the two statements in it.
What I'm going to do is generalize this notion of a pair a little bit to be a named
tuple, a record. And so this says capture the output of the first call and call it A.
Capture the output of the second call and call it B. Then execute the script
getting a forest or collection of results back and then print out the output of
getting A and getting B out of that result.
And that can be typed in this case. Question?
>>: Was it a forest or a list?
>> William Cook: It's going to be a forest because of this example. Thank you.
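The generated code for the two-call example might look roughly like the following Python sketch. It is only an illustration of the idea described above, not the speaker's implementation: the `RemoteFile` and `Server` classes, the script format (a list of output-name and method-name pairs), and the sample values are all assumptions made up for this example.

```python
# Hypothetical stand-in for the remote object R; in a real system it
# would exist only on the server.
class RemoteFile:
    def get_name(self):
        return "report.txt"

    def get_size(self):
        return 2048


class Server:
    """Interprets a tiny script of named method calls and counts round trips."""

    def __init__(self, root):
        self.root = root
        self.round_trips = 0

    def execute(self, script):
        # script: list of (output_name, method_name) pairs sent in one message
        self.round_trips += 1
        return {out: getattr(self.root, method)() for out, method in script}


server = Server(RemoteFile())

# Roughly what a compiler could emit for a batch that prints the name and size:
script = [("A", "get_name"), ("B", "get_size")]  # remote part, sent as data
result = server.execute(script)                  # the single round trip
line = f"{result['A']} {result['B']}"            # local part, run on the record
print(line)
```

The record returned by `execute` plays the role of the data transfer object created on the fly, and the script plays the role of the remote facade.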
So now we want to do more operations on the server. We want to talk to a mail
server. This is again a mailer -- mail is a virtual handle to a mail service that has
a bunch of different operations on it. And one of them is to list the current messages.
And so we can iterate over the mail messages. All the messages are over on the
server. Iterate over the messages. Check if the size is greater than some limit.
The limit is the local variable. Now we're getting more mixing of local and
remote.
We need to check if the size of the remote message is greater than the limit. If so,
print out the subject and delete it. So we don't have to do just read operations. We
can do imperative updates as well. Otherwise print out the name and the date.
Now, by putting this inside batch, I want it all to run in one round trip.
So our scripting language has to get more sophisticated. So instead of just
sending a pair of method calls, we need to actually send the whole loop to the
server. The server will run the loop and compare the size against the limit, which it
needs to have received from the client. And then it needs to capture the subject. Can't
print it out because printing is a local operation, but it can capture the subject. It
can delete the message and otherwise it can capture the subject and the date.
In this sense the subject is always captured. The date is conditionally captured
and the delete is conditional. The server can do all that. And it can return back a
list of subjects and dates to the client to then run the loop again on the client and
print everything out.
But the client also needs to know this Boolean about whether or not the message
was deleted. So what is going on here is we can recreate the control flow on the
client to be the same as the control flow that occurred on the server.
So here's the code. The slashes on this slide mean that this is generated
code. No programmer would ever write this. This is what the compiler
generates. So it creates a script in some way, an AST of a script, and it captures
the subject always. It captures the Boolean of whether the size is greater than
this input named X. And it deletes the message if so. Otherwise it outputs the
date. So the output, the result here is going to be a table of, with a column
named A which always has a subject. A column named B which has a Boolean
and a column named C that has the date sometimes. And otherwise it's empty.
And here's the overall code. I couldn't fit it all on one slide. The idea here is after
creating the script it can create an input forest and put the limit in and call it X.
So we're sending a structure of inputs executing the script and getting a structure
of outputs out. It runs the script, gets this result, and then it can run the loop:
iterate over all the messages, get the Boolean named B, get the subject, and get
the date if necessary, and print them all out.
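The split of the mail loop into a server pass and a client replay can be sketched as follows. This is a simplified illustration under stated assumptions: the `Message` class, the function names, and the use of a plain function call in place of the network round trip are all hypothetical, and the server logic is hard-coded rather than interpreted from a script.

```python
from dataclasses import dataclass


@dataclass
class Message:
    subject: str
    size: int
    date: str


def run_on_server(mailbox, inputs):
    """Server side: one pass over the messages, capturing outputs instead of printing."""
    rows, kept = [], []
    for msg in mailbox:
        too_big = msg.size > inputs["X"]        # X is the client's limit
        row = {"A": msg.subject, "B": too_big, "C": None}
        if not too_big:
            row["C"] = msg.date                 # date is captured only conditionally
            kept.append(msg)                    # oversized messages are deleted
        rows.append(row)
    mailbox[:] = kept
    return rows


def client(mailbox, limit):
    """Client side: send the limit, then replay the same control flow on the table."""
    rows = run_on_server(mailbox, {"X": limit})  # stands in for the one round trip
    printed = []
    for row in rows:
        if row["B"]:
            printed.append(f"deleted: {row['A']}")
        else:
            printed.append(f"{row['A']} {row['C']}")
    return printed


mailbox = [Message("big", 5000, "2010-01-01"), Message("small", 10, "2010-01-02")]
output = client(mailbox, 1000)
```

The Boolean column B is what lets the client take the same branch on each row that the server took.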
And the funny thing about this code is that this code is very familiar. This pattern.
It's the ODBC, OLE DB, JDBC, LINQ pattern actually. It's the same pattern. And I
call this the -- I deleted the name I think. I'll call it the batch script pattern. It's a
pattern where you are doing meta programming on the client. It's sending a
script and decoding some results. And my argument is that whenever you write
this pattern as a programmer, your programming language sucks. Because this
pattern is not a pattern that you should have written. This is compiled code. You
should never have to say this.
And so we need to generalize it. And what you meant to say was this. And the
separation of the thing into the script and the decoder and the results and the
buffering and all that stuff should have happened automatically.
>>: Am I limited from having local side effects in the batch?
>> William Cook: You are allowed to have local side effects. You're allowed to have
remote side effects.
>>: How do you avoid double --
>> William Cook: But you cannot have tangling of effects. Let me show you
what -- you're just feeding me the cues perfectly. It's not this slide; you jumped
ahead to the slide after this. But this is the answer to your previous question: Why
is it a forest? It's a forest because if you have nested loops, you get nested
tables, nested result sets. So this says for each one of the items print the name,
and for each of the item's parts print the name of that.
The output will be a list of items with their names and then for each item a list of
their parts and their names. So this is the master-detail, N+1 query problem
that's quite common as well: iterating over multiple nested subcollections. We
want to do that in one round trip as well.
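A nested result of this kind could be represented as in the Python sketch below. The sample data, the column names A and B, and the `parts` key for the nested table are assumptions chosen for illustration only.

```python
# Hypothetical server-side data: items, each with a collection of parts.
items = [
    {"name": "bike", "parts": [{"name": "wheel"}, {"name": "frame"}]},
    {"name": "lamp", "parts": [{"name": "bulb"}]},
]


def run_on_server(items):
    """One pass on the server: a row per item, each carrying a sub-table of its parts."""
    return [
        {"A": item["name"], "parts": [{"B": part["name"]} for part in item["parts"]]}
        for item in items
    ]


def replay_on_client(forest):
    """The client re-runs both loops over the returned forest, with no further calls."""
    lines = []
    for row in forest:
        lines.append(row["A"])
        for sub in row["parts"]:
            lines.append("  " + sub["B"])
    return lines


forest = run_on_server(items)
lines = replay_on_client(forest)
```

Each level of loop nesting adds one level of nesting to the result, which is why the result is a forest rather than a flat list.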
Okay. And then this idea of one round trip and the tangling of effects, so the
guarantee is that batches require one round trip to the server, which means that
if you go and try and do a batch where you get the subject of a remote message
and then try and ask the user a question about whether or not it should be
deleted and then delete it, it's a syntax error. It's a semantic error, in the
definition of the batch. It's a programmer error. You can't write a batch that
tangles the state.
You can modify local state. You can modify remote state. It's a lot like loop
splitting in a compiler where you're doing aggressive out-of-order execution. The
assumption here is that all the effects that happen on the server can happen in
the right order and they can be updates. But those effects are independent of
any effects that happen on the client. That's the idea of remoteness: that
there's no implicit channel of communication between the server and the client.
The only communication is going to be through this script and the results. Now,
you can obviously set up cases where that's not true, but it's a very common
assumption. And if the effects of the server and the effects of the client don't
interfere with each other then it's okay to reorder the operations with respect to
each other. And that's what we do.
We take all the remote operations and collect them into a batch and send it over
and have them all be done at once. And then the client -- so it turns out the client
can do some local operations in the batch before sending the remote stuff. Then
it does all the remote things. Then it can do more local updates after. But it
needs to be -- in the effect system that's behind this, you need to be able to
stratify it into a bunch of local effects and remote effects and local effects and
you can't have any kind of intermediate or tangled effects, I guess, is the way to
say it. For example, you can't try to go to the remote server twice, which is what this
example requires.
>>: [inaudible].
>> William Cook: Yeah. And it's an approximation. It just rejects programs that
look bad. So it's a conservative approximation. And the point is that the
programmer needs to be able to read it and see whether it makes sense. So you
need a set of rules that are very simple. And so it does a very simple kind of
effect analysis.
And if it doesn't look good, it rejects it. And the programmer can rewrite the
program to take some of the effects out and store them in local variables, to
simplify it in case, if it's a valid program which the effect system rejects. There
are ways to kind of rewrite it to make it clearer.
>>: If you just eliminated batch, the batch keyword, so what would happen?
>> William Cook: Well, so the problem is that you'd get lots of remote local
tangling. The issue here is that I believe --
>>: Purely performance optimization or is it actually semantically?
>> William Cook: Oh I see. It's purely a performance. Yes, this is just about
performance. The intent is for it to have no semantic significance at all. Yes.
>>: I mean, in the end you could say [inaudible] just around every little remote
call.
>> William Cook: That will not work. No, that will not always work. So in theory
it should have no semantic significance, but it actually does. And let me tell you
why. Because it rejects -- so the system, the underlying system -- what I was
saying it has no semantic significance assuming that the objects were local. If
you really did have a remote system, then here's the problem. The current batch
model that I'm proposing does not allow remote proxies.
So you cannot have a local pointer to a remote object other than the root. So
when you iterate over dot messages in a normal RMI/CORBA style, message
would be a local variable that's a pointer to a remote object. That's what causes
the statefulness of the server. What actually happens with batches is that this
message object never gets instantiated on the client. It only is instantiated -- it's
a local pointer on the server that gets created during the batch execution.
Once the batch is done running, that message can be freed. Right? So if you
look at the client code here, there are no references to remote objects at all.
They've all been removed. They're all in here. Here's a reference to a remote
object. But it's in the code that's being sent to execute locally on the server.
So, yeah, so we reject proxies and serialization completely. So you have to put
enough work in the batch to be able to access all the remote objects and
complete a transaction with them. And then get the answers back that you want.
And the answers have to be strings and integers and dates, because that's the
only thing that can be transmitted remotely. So what about serialization?
Everybody -- a lot of people like serialization. It's kind of fun and nice. Here's
our mail server again. And we want to just send a list of recipients, a set of
recipients to the server, to send our message.
What I said so far is we don't have serialization. So you're not allowed to do that.
Well, it's okay, though, because we can call factory methods on the server. We
can actually construct a remote set, a set on the server and populate it with
values. All right? So the server interface can provide any interface it wants and
so what we have to do is we have to tell the server, make a set. And then iterate
over the local state and populate it, copy the local names into the remote set.
Then we can send the message.
Now, the key point here is that this can be done in one round trip, because what
happens is that the system detects this is local code that needs to run before the
batch executes. Okay? What it does is it runs this loop, pulls out all the names
from the local set and puts them into this input record.
Then it sends the input record to the server. Then the server can create the
server side set, insert all the names into it, and then it's got the data that it needs.
So what it means is you just have to extract all the data out from the local one
and then write it into the server. So it's serialization by the public interfaces. One
of the nice things about it is it's just as fast as normal serialization, really, and it's
not representation-dependent. You can serialize, deserialize a Java hash table
and write it into a Python hash table, and it's completely cross-platform,
cross-language. And you don't have the problem of copying internal
representations around.
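Serialization through public interfaces, as described here, can be sketched in Python as follows. The `MailServer` class, its `execute` method, and the shape of the input record are hypothetical, and the server script is hard-coded for brevity.

```python
class MailServer:
    """Hypothetical server; execute() stands in for running one batch script."""

    def __init__(self):
        self.sent = []
        self.round_trips = 0

    def execute(self, inputs):
        self.round_trips += 1
        recipients = set()               # factory call: the set is built on the server
        for name in inputs["names"]:     # populated through the set's public interface
            recipients.add(name)
        self.sent.append((inputs["body"], recipients))
        return {"count": len(recipients)}


def send_message(server, local_names, body):
    # Pre-local phase: read the local collection through its own interface
    # and copy the primitive values into the input record.
    inputs = {"names": sorted(local_names), "body": body}
    # Remote phase: one round trip carrying only strings.
    return server.execute(inputs)["count"]


server = MailServer()
count = send_message(server, {"ann", "bob"}, "hello")
```

Only strings cross the boundary, so the client's set and the server's set never need to share an internal representation.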
So that's kind of ugly if you had to write that over and over again, this little
serialization routine. So what we need now is to be able to modularize
the parts of a batch. In order to do that, we just need to be able to write
functions where, instead of executing locally, the procedure is also partitioned.
It's a modular batch component, essentially.
What this does is I can write a set send helper function that's marked as a batch
procedure. So that means that if you use it inside a batch it gets partitioned. It's
as if it were inlined, essentially, and gets partitioning applied to it.
Normally local functions are always executed exclusively locally. These are ones
that participate in the batch. So now I just have to say recipient, server recipients
equals send of the local names, and the send function will get split into a part that
collects up the names locally and a part that creates the set and writes them into
the set remotely. So it's just really the same code. It's just been made to be
reusable. If you really wanted to, you could make this be an implicit coercion that
got invoked when a local thing got copied to a remote thing. So we could
actually make it look like serialization, but it's using a custom serializer that uses
the custom function not the internal representation.
Okay. Exceptions are fun. They happen a lot, especially in the remote world. And
so we can get exceptions on the server. We can get exceptions on the client.
So exceptions on the server happen when the server is executing the script and
suddenly something goes wrong and it gets an error. One thing that I told you
was that the control flow on the server and the control flow on the client are
essentially synchronized. They're duplicated. What can happen is we can have
it set up so that when the server gets an exception at a certain point, it just
terminates the batch, sends the results that have been collected so far back to
the client. The client can rerun the same -- excuse me, the same computation,
the same control flow. When it gets to the point where the server threw an
exception the exception can get reraised and the client can handle it locally. It
doesn't allow the script on the server to recover from errors. My current proposal
is that server scripts do not support error correction. We could add that to the
scripting language, but right now it's not supported.
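The server-exception behavior described above can be sketched like this in Python. The division loop, the `RemoteError` class, and the result layout with an `error` field are assumptions for illustration; the talk does not specify how the failure point is encoded.

```python
class RemoteError(Exception):
    """Hypothetical client-side exception standing in for the server's failure."""


def run_on_server(values):
    """Runs the loop until something fails; returns the rows gathered so far."""
    rows = []
    for v in values:
        try:
            rows.append({"A": 100 // v})
        except ZeroDivisionError as exc:
            # Terminate the batch and ship back the partial results.
            return {"rows": rows, "error": str(exc)}
    return {"rows": rows, "error": None}


def replay_on_client(result):
    """Replays the same control flow; re-raises where the server stopped."""
    seen = []
    try:
        for row in result["rows"]:
            seen.append(row["A"])
        if result["error"] is not None:
            raise RemoteError(result["error"])
    except RemoteError:
        seen.append("handled")           # the client handles the error locally
    return seen


failed = replay_on_client(run_on_server([5, 0, 2]))
succeeded = replay_on_client(run_on_server([5, 2]))
```

In the failing run the third value is never processed on either side, which matches the idea that the client reaches the exception at the same point in the control flow as the server did.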
Client exceptions are a little more tricky, because of this out-of-order execution.
You can have a control flow where, if you thought of it as going
through ten iterations of a loop and the client gets an error on the fifth iteration, it
seems like not a problem, but the problem is that the server has already done all
ten iterations. So the client can fail to process results that happened in the
future, essentially, from its point of view. And so the client has to be a little
careful in terms of how it handles its own exceptions.
I don't have any other answer for this other than you just need to be aware of the
way batches work at this point.
>>: If it were local effects before sending off to the server, do you have the
same --
>> William Cook: No, because what it does -- I didn't mention it explicitly. The
requirement is that all effects in a place happen in the correct order. So all local
effects happen in the correct order as if you ran the code sequentially. So in the
case that I was showing you here, where it does some local code before the
batch, those effects have to happen before any of the effects of the local
code that runs after the batch. So it doesn't allow reordering of -- the effects on
the local system do not get reordered with respect to each other.
So the argument is all effects occur in the right order at their home location. So I
think that was your question.
>>: So if the client bursts through --
>> William Cook: If the client got an exception before -- during the pre-local, then
the batch would never even be invoked.
>>: I see.
>> William Cook: And the client needs to know that. But it can tell where that is
by looking at the static analysis that essentially partitions the
code. It just analyzes it and assigns one of three different categories to it: it's either
pre-local, remote or post-local. And doing that analysis is pretty straightforward,
and it tends to follow -- at least the pre-local and the post-local have to be
sequentially ordered in the code.
Okay. So batches are not necessarily transactional, but you could make a
transactional server if you wanted to because it's sending the complete set of
operations to be performed. The batch, you could actually make a transactional
RPC server that somehow serialized the RPC calls on the server, if it wanted to.
And obviously you could use it for databases as well. But the server doesn't
have to be transactional. It can also just completely arbitrarily interleave the
execution of multiple batches if it wants to.
So it doesn't -- so batches don't eliminate errors and they don't eliminate
communication errors or failures. There's a lot of problems that happen when
you have remoteness. At least it reduces the potential for certain kinds of
intermediate errors, intermediate states, because the batch is sent as one script
to the server. It's either all going to be delivered or none. You're not going to get
communication errors in the middle of a server communication.
And it does lead to fewer round trips. So there's a little bit less opportunity for
error. But it doesn't solve this problem. The other thing it does is it requires the
programmer to explicitly put the batch in so they know that remoteness is
happening. My goal is not to hide the remoteness completely. It's to hide it almost
completely but still make the programmer aware that it's happening and put a
boundary on it. So you were just asking, what if you don't put batch anywhere?
The batch says this is the boundary of both efficiency and to some degree failure,
and so it makes explicit what's going on there.
>>: Is there a way to express -- you said basically if the server fails, at a certain
iteration, and the client will, too. Is there any way to say if the server ever fails
don't do any iterations, it gives you more of a --
>> William Cook: That's a really good idea. The server could choose to
completely abort the script and sort of roll back all the things that it did. So you
could make a smart server that would throw an error and the client would just get
an error immediately.
>>: But that means that all clients would have that same semantics.
>> William Cook: But if you wanted to have the client choose to do that, I think
that -- I mean, the client could obviously make a call that requested what kind of
failure behavior it wanted dynamically. But unless the server has the capability to
sort of cancel all its partial operations, it's kind of hard to force a server to do that.
I think that it's actually a really good mode of execution, and I think that one of
the nice things about this is it lets the servers be smarter. It gives them
opportunities.
>>: [inaudible] using for client exceptions in the sense that the server actually
aborts the whole thing if the client --
>> William Cook: You can even do a two-phase commit if you wanted. You
could make a two-phase commit. The server is going to do some operations
hand back a ticket that says I've done these but I need to wait for you to
complete your local code and then get a commit. You could actually implement
that pretty easily and that again would be just on a regular RPC server, not on
your fancy database, if you wanted it. So this is the thing we keep mentioning,
that the order of execution is preserved. The local operations and the remote
operations. And so this is a case that's actually an error, because what's going
on is we have remote calls, update A and update B, but they have two arbitrary local
procedures that need to be called to produce the inputs. And then we're going to
print out the results.
Now, if you try to execute this in a batch in one round trip it requires both of these
local calls to be made first before we send the batch, because they're inputs to
the remote calls. That is going to force this local get B to happen before the
print, which is an error.
And so this would be an invalid program. And the reason is that, well, get B could
look at the length of the print stream and it could be different if you did it before or
after the print. So the idea I was getting at is that the local code has to execute
in the natural order that it would have executed without batches. So this is
actually a syntax error. And you can fix this by assigning these get As and get
Bs to local variables, to make it clear what the order of the effects
should be.
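One way to picture a conservative check of this kind is the Python sketch below. It is a guess at the flavor of the analysis, not the actual algorithm: the statement encoding (name, place, names used) and the rule that any local statement appearing after a post-local one is itself post-local are simplifying assumptions made for this example.

```python
def partition(statements):
    """Classify statements as pre-local, remote, or post-local.

    statements: list of (name, place, uses) in source order, where place is
    "local" or "remote" and uses lists the names of earlier statements.
    Raises ValueError when one round trip cannot preserve local order.
    """
    category = {}
    seen_post_local = False
    for name, place, uses in statements:
        used = {category[u] for u in uses}
        if place == "remote":
            if "post-local" in used:
                raise ValueError(f"{name}: remote code depends on post-local code")
            category[name] = "remote"
        elif seen_post_local or used & {"remote", "post-local"}:
            # Local code after a post-local statement must stay after it.
            category[name] = "post-local"
            seen_post_local = True
        else:
            category[name] = "pre-local"
    return category


# The rejected program: get B runs after the first print in source order.
invalid = [
    ("a", "local", []), ("ra", "remote", ["a"]), ("print_a", "local", ["ra"]),
    ("b", "local", []), ("rb", "remote", ["b"]), ("print_b", "local", ["rb"]),
]
# The fix: assign both local inputs to variables before any remote call.
fixed = [
    ("a", "local", []), ("b", "local", []),
    ("ra", "remote", ["a"]), ("print_a", "local", ["ra"]),
    ("rb", "remote", ["b"]), ("print_b", "local", ["rb"]),
]
```

Calling `partition(invalid)` raises `ValueError`, while `partition(fixed)` classifies both input computations as pre-local.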
Okay. So summary. We have a new statement and it has a new compilation
semantics. Kind of control flow where it does out of order execution. It partitions
the code into local and remote. It manages the communication. It potentially
works in any language. I haven't really proven this yet. I'm still in the process of
implementing other clients. It's easy to write a new server. You just write a script
interpreter, and that's like PL 101, a week-long project for an undergraduate to
write another batch server. Writing a batch client is a little more complicated
because you have to change the interpreter or compiler of the client language
and do this partitioning. There are libraries I have that will do the
actual hard part of the partitioning, but it's still some code to make that work.
>>: [inaudible].
>> William Cook: I have it working in Java, yeah. And the communication is
optimized as well. And it's essentially what we're doing here is we're borrowing
the SQL Server -- data execution model. Right? Send primitive values back and
forth. Send scripts over for execution. Let's just do it for other things besides
database access.
Okay. So here's the actual batch script language. This is equivalent to SQL.
This is my imperative remote scripting language. And what it lets you do is
constants, variables, conditionals, for loops, potentially aggregated for loops.
So you can do a sum. And it's like a monoid comprehension. You can create
local variables on the server. You can do assignments, access fields, call
methods, do primitive operations, use inputs and capture
outputs, and have first-class functions because they're useful sometimes for doing
certain kinds of server-style predicates. And a fixed set of operators and a fixed
set of data types. We just say this is the scripting language.
And I don't have a concrete syntax for it yet. I have an abstract syntax. I'm
thinking of using JavaScript as the concrete syntax. The nice thing is it doesn't
let you call constructors or let you do any of the dangerous things, it just lets you
do all the things you could have done if you were talking to an RPC server
anyway. Doesn't let you do while loops either. So there's no unbounded
computation.
So there's a nice sort of bound on that. Although, it does have for loops.
>>: [inaudible] something, right? So it's like it enumerates.
>> William Cook: Yes in a collection. So if the server wanted to provide an
infinite stream, then it could allow sort of unbounded computations, but the point
is that the collections are bounded by what is assumed to be a finite collection.
The iterations are bounded by a finite collection.
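A server for a small subset of such a scripting language could be sketched as a tree-walking interpreter, as below. The tuple-based node encoding, the node names, and the sample mailbox are all invented for this illustration; the talk only gives an abstract syntax informally, and this sketch omits assignments, method calls, aggregation, and functions.

```python
import operator
from types import SimpleNamespace

OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq, "+": operator.add}


def evaluate(node, env, inputs, out):
    """Evaluate one script node; captured outputs are written into `out`."""
    kind = node[0]
    if kind == "const":
        return node[1]
    if kind == "input":
        return inputs[node[1]]
    if kind == "var":
        return env[node[1]]
    if kind == "field":
        return getattr(evaluate(node[1], env, inputs, out), node[2])
    if kind == "op":
        left = evaluate(node[2], env, inputs, out)
        right = evaluate(node[3], env, inputs, out)
        return OPS[node[1]](left, right)
    if kind == "out":
        out[node[1]] = evaluate(node[2], env, inputs, out)
        return None
    if kind == "if":
        branch = node[2] if evaluate(node[1], env, inputs, out) else node[3]
        return evaluate(branch, env, inputs, out)
    if kind == "seq":
        for child in node[1:]:
            evaluate(child, env, inputs, out)
        return None
    if kind == "for":
        # Loops only range over collections, so iteration is bounded by their size.
        _, var, collection, body, out_name = node
        rows = []
        for element in evaluate(collection, env, inputs, out):
            row = {}
            evaluate(body, {**env, var: element}, inputs, row)
            rows.append(row)
        out[out_name] = rows
        return None
    raise ValueError(f"unknown node {kind!r}")


mail = SimpleNamespace(messages=[
    SimpleNamespace(subject="big", size=5000),
    SimpleNamespace(subject="small", size=10),
])
script = ("for", "m", ("field", ("var", "root"), "messages"),
          ("seq",
           ("out", "A", ("field", ("var", "m"), "subject")),
           ("out", "B", ("op", ">", ("field", ("var", "m"), "size"), ("input", "X")))),
          "rows")
result = {}
evaluate(script, {"root": mail}, {"X": 1000}, result)
```

Because a nested `for` writes its rows into the enclosing row, nested loops produce the nested result structure without any extra machinery.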
So let's take this on the road and talk about, you've already seen it looks a lot like
SQL. So let's just do it. So here's the batch script pattern executing JDBC.
Looks just like the generated code of a batch statement. And so instead we just
write this. And this is what we'd like to write. And then we just put batch around
it. And it partitions out the remote stuff. Turns it into this batch script, which
turns out can be deterministically compiled into efficient SQL. Turns out for any
batch you write, including any nested for loops, it always generates a constant
number of SQL queries. That's a property that LINQ does not have, for example.
The only other system I know of that does that is Ferry at this point. So that solves
the --
>>: Like in your batch script language, isn't strict enough it will always.
>> William Cook: No, you can write arbitrary code in here pretty much. Arbitrary
nested loops and conditionals. It generates a batch script in a straightforward way,
but the interesting property is that the batch script can always be compiled into a
constant number of SQL queries.
So you get a performance guarantee.
>>: But these loops inside the batch scripts are always looping over things on the
server side.
>> William Cook: Yes.
>>: And so, for example, if you had a data structure on the client side and you
wanted to loop over that, you really wouldn't do that in the script, basically.
>> William Cook: Well, I did do it with the mail sending messages so I could loop
over client things to get them to send to the server. I could also loop over a client
data -- you can actually do arbitrary client computation in here. You can do
exception handling and loops and stuff.
>>: Worry about the preorder?
>> William Cook: You have to worry about the ordering, yes. It captures a
pattern of doing collection of operations on the server and getting the results
back and doing something with them. There are use cases that require multiple
round trips and those would require multiple batches.
But, yeah, it is tricky, actually. When you have four nested loops here, it turns
out generating four SQL queries to generate that is nontrivial. I'm not going to
explain how it's done. But basically what it amounts to is that each loop gets its
own SQL query, and the SQL select statement generated captures all of the iterations of
that loop for all of its containing loops. I don't know, that's probably not clear.
But if you want an explanation we can talk about it after.
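The "one query per loop" property can be pictured with a small Java sketch. This is not the compilation algorithm, which the talk does not describe; it only shows the shape of the result for a chain of nested loops. The class LoopQuerySketch, the table names, and the assumption that every table has an `id` column are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: emits one SELECT per loop level, where the query for an
// inner loop joins against all enclosing loops so it covers every iteration.
final class LoopQuerySketch {
    // One loop level: the table it iterates and how it joins to its parent.
    static final class Loop {
        final String table;
        final String alias;
        final String joinCondition; // null for the outermost loop

        Loop(String table, String alias, String joinCondition) {
            this.table = table;
            this.alias = alias;
            this.joinCondition = joinCondition;
        }
    }

    // Returns exactly nest.size() queries, regardless of how many rows exist.
    static List<String> queriesFor(List<Loop> nest) {
        List<String> queries = new ArrayList<>();
        StringBuilder from = new StringBuilder();
        StringBuilder keys = new StringBuilder();
        for (Loop loop : nest) {
            if (from.length() == 0) {
                from.append(loop.table).append(' ').append(loop.alias);
            } else {
                from.append(" JOIN ").append(loop.table).append(' ').append(loop.alias)
                    .append(" ON ").append(loop.joinCondition);
            }
            if (keys.length() > 0) {
                keys.append(", ");
            }
            // Assumes each table has an "id" column used to regroup rows by iteration.
            keys.append(loop.alias).append(".id");
            queries.add("SELECT " + keys + " FROM " + from + " ORDER BY " + keys);
        }
        return queries;
    }

    public static void main(String[] args) {
        List<Loop> nest = Arrays.asList(
            new Loop("customers", "c", null),
            new Loop("orders", "o", "o.customer_id = c.id"));
        for (String query : queriesFor(nest)) {
            System.out.println(query);
        }
    }
}
```

For two nested loops this produces two queries; the inner query returns the orders for all customers at once, and the enclosing keys let the client regroup the rows per outer iteration instead of issuing one query per customer.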
Okay. So if you do normal object relational mapping you can define the structure
of a data model using a bunch of interfaces or classes. And then I want to do a
quick comparison with LINQ here. So in LINQ, it handles the writing of the query
very nicely, but it still is the batch execution pattern. We're still creating an
artificial data structure manually and then decoding it in the results. So it's
actually you have to do the part of making that intermediate data structure, the
data transfer objects, you have to do that manually still. And that's the part
where batches excel and so you end up with a sort of cross dependency. If you
add another -- if you want to add another component here, you have to add it in
here and make changes in two different places.
Another one that's interesting is dynamic queries. So in a dynamic query, the
predicate or the condition depends on local data. Sometimes there's a test
and sometimes there isn't. This is very common on Web pages where you enter
search criteria and, depending on which criteria you type in, it makes a
different select statement, a different where clause. So here's a case where
there's a title and a test string, and we want to see if the title is equal to the
test string. If the test string is defined we want to do the test. If the test string is
empty we don't want to do the test. So in LINQ we can create a query, a virtual
collection. We can test the local state and modify the query to add a where test,
so we're now doing sort of virtual collection manipulation here. And then we
can just iterate over that virtual collection, which may or may not have the test in it,
and get the results and iterate over them. The fun thing is that in batches you
just write the code and test the local state and it will figure out that this is local.
So it can be done before the batch is executed, and because it's a short-circuit
or, it will not include the second part in the batch if the first part is true.
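A rough Java sketch of the effect described here: the local half of the condition is evaluated on the client before anything is sent, so the script that is shipped either contains the title test or does not. The script text is invented for illustration (the talk says there is no concrete syntax yet), and DynamicPredicateSketch, buildScript and quote are hypothetical names.

```java
// Illustration only: partitions "test.isEmpty() || item.title == test" by
// evaluating the local part on the client and emitting the remote part as text.
final class DynamicPredicateSketch {
    // Builds the (made-up) remote script for the given local test string.
    static String buildScript(String test) {
        // Local part of the short-circuit "or": decided before the batch runs.
        if (test.isEmpty()) {
            // First operand is true, so the remote comparison is left out entirely.
            return "for (item in items) output(item.title)";
        }
        // First operand is false, so only the remote comparison remains.
        return "for (item in items) if (item.title == " + quote(test)
            + ") output(item.title)";
    }

    // Quotes a local value so it can be embedded in the script as a literal.
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(buildScript(""));
        System.out.println(buildScript("Dune"));
    }
}
```

The programmer writes a single condition; which of the two scripts is produced depends only on client-side state, so the server-side translator never sees the optional test as optional.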
So kind of fun. So, yeah, so we have a compiler that translates the batch scripts
into SQL. There are at least two different modes in which you can do
it. You can write a program and have the client convert it into SQL or send the
batch script to the server and have the server convert it into SQL.
>>: Maybe the previous slide it's much harder to translate to SQL query again on
the server, right? Wanted to translate to SQL queries.
>> William Cook: The way this works, the batch system creates the batch script
that either has the test in it or not. So the script -- the batch script that actually
gets created is either --
>>: Always in there or not.
>> William Cook: It's either in there or it's not. So whichever.
>>: Not entangled.
>> William Cook: Not entangled. Yeah. From the point of view of the query translator it
doesn't see the difference. It's handled by the batch system. This is another
problem with LINQ: if you write lambdas, the assumption is that lambdas are sent
to the server. But the problem is lambdas can contain arbitrary local
computation. So they can fail. So what batch does is by saying yes, we're going
to mix local and remote computation and partition it, we can actually handle that
case more explicitly and deal with one of those fundamental problems. Notice
that in batches we use the same trick as LINQ does to do aggregation. It still uses
LINQ-style aggregation, just doesn't use it for the standard select and project.
Aggregation stuff like sums.
So it really does support all aspects of SQL including updates, inserts. You can
do a bulk insert. You can do a for loop. Iterate over one database table and
insert into another database table and that will generate the bulk insert query that
no object relational mapper I know of will generate at this point without lots of
fancy hand waving or complex kind of programming.
And this constant number of queries guarantee. So what it gives you is a
fine-grained object oriented programming model with efficient SQL execution. By
fine-grained, this is the problem with batches, is that we've been told whenever
you're talking to a remote server you need to talk to complex large-scale
operations, right, to reduce the latency. What batches do is say no it's fine just
use fine-grained operations we'll collect them up for you and do them all in a
group. And then this is one of the things that Web services is about. And again
there's sort of been this long, very rapid experimentation with Web services, and
there's Web services that emulate RPCs so they're great. They have all the
pitfalls of XML and all the pitfalls of RPC. When people say Web services suck
that's what they're talking about, it's like the worst possible combination of things.
But there's another kind of Web service, which is the document-oriented Web
service, which is to me really the intuition behind Web services: that you send a
bulk description of the purchase order to be processed.
And it processes it and sends it back. And you could send it in an e-mail
message if you wanted to. This is an example of the Amazon Web service
interface of two years ago. It allows you to do an item lookup with a list of item IDs and
a list of properties you want to look up and it will give you back a table of results.
This is an encoding of a kind of SQL encoding here but it's really encoding of a
batch. Custom scripting language, which has implicit iteration in here and implicit
projections down here. And every single API is a different scripting language.
It's just crazy. You know? And the code to write the clients is just incredible. It's
explicit meta programming. You have to construct the request object and then
abstract syntax and write an interpreter, every single API call has a different
interpreter. This is the batch execution pattern again. Here we finally get
down to outputting the rank and image. With batches you just say get me the
item. Get me these two properties, get me the item these two properties and it
sorts it out. We put a batch front end on the Amazon web services thing and
were able to write a lot of the APIs, didn't do it for everything, but we did a really
good experiment.
And we also did this for OSGi, this really huge API for managing packages in
Eclipse and they're struggling with how do we make this remote. There's one
proposal and another proposal and we're going to do it this way and change the
API. We just said no the API is good, slap batches on the front and you can talk
to that thing efficiently.
>>: That would be JavaScript then, this batch.
>> William Cook: You have to send JavaScript with a batch keyword, yep. And
that's what I'm in process of doing.
>>: So why is it necessary to have the keyword, other than -- because you can
imagine doing the syntax checking as sort of a --
>> William Cook: I believe it's important to have the programmers declare their
intent and provide a scope. Because you can have a lot of different remote
operations and they might need to be grouped into batches according to some
very subtle semantic dependencies, and when the programmer says this is a
batch, they're saying something about the scope of that. It's like transaction
boundaries in a way.
>>: I was going to say you could have begin transaction, end transaction.
>> William Cook: Some kind of declaration. It's just a way of doing it.
>>: Do it in a library as opposed to --
>> William Cook: No, you can't. It really has to partition the code. It has to do
things to the code that you can't do with a library. You just -- maybe you could do
it with expression trees. You could do it if you're an imposter, sure, yes.
>>: The language [inaudible].
>> William Cook: Maybe. With fully reflective access, in Smalltalk you could
certainly do it, yes, as a library.
>>: But on the surface it seems like kind of it's a high bar to get the JavaScript
standard to include this. On the other hand, if you could eliminate the code that
you just showed us for Amazon and rewrite it with a little bit less syntax --
>> William Cook: I think this is a fundamental idea. It's not just like this is a
hack. I really think this does have the merit to be considered as a real feature.
And you just cannot solve the problem without it. Opens up a lot of things. The same way
that LINQ opened lots of new ideas, you could have remoteness where the
remoteness was your GPU using batches.
It creates a new kind of composition in modularity. And so it really is a new kind
of control flow. Okay. So I have a thing called Jaba, batch Java, that's
100 percent compatible with current Java. I didn't have to make any changes to
the syntax or static semantics. And the trick that I used was rather than
introducing a batch statement, it turns out the batch statement in all the examples
I gave is sort of syntactically exactly analogous to the foreach statement. So I just
use the foreach statement.
And the trick is that the normal rule is that the invariant or requirement of the
foreach statement is that this thing has to implement IEnumerable or something
like that or else it completely fails. I say if it implements IBatchService, then the
for loop is "for r in this service object, do this computation." So it's really not --
it's not even semantically that much of a hack, although it is a little bit of a pun,
you can assign an interpretation of "for" where it says "for: do this stuff remotely,"
where the server is a remote connection.
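The pun on the for statement can be sketched in plain Java. This is an assumption-laden illustration, not the Jaba API: BatchService, Mailbox and LocalService are hypothetical names, and the sketch assumes the service type is iterable so that the loop type-checks under an ordinary compiler. A batch-aware compiler would treat the loop body as the batch; a standard compiler simply runs it locally over a single root object.

```java
import java.util.Collections;
import java.util.Iterator;

// Illustration only: a for loop over a "service" object doubles as a batch block.
final class ForAsBatchSketch {
    // Hypothetical root interface exposed by the service.
    interface Mailbox { int messageCount(); }

    // Hypothetical marker type. Assumption: it extends Iterable so that the
    // loop below is legal standard Java with unchanged syntax and typing.
    interface BatchService<T> extends Iterable<T> { }

    // A purely local stand-in that yields exactly one root object.
    static final class LocalService implements BatchService<Mailbox> {
        public Iterator<Mailbox> iterator() {
            Mailbox root = () -> 3;
            return Collections.singletonList(root).iterator();
        }
    }

    // Reads as "for root in this service, do this computation".
    static int countMessages(BatchService<Mailbox> service) {
        int count = 0;
        for (Mailbox root : service) {
            count = root.messageCount();
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMessages(new LocalService()));
    }
}
```

The loop executes its body once per root object, which is why reusing the for statement keeps existing tooling working while still marking a scope for the batch.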
Now if you want to you can put the batch keyword in. I didn't want to break my
tooling.
>>: You modify AA Java?
>> William Cook: I have modified the [inaudible] compiler. And I'm currently
modifying JavaCC, or javac. And I'm going to propose this as a potential
enhancement for Java 8 or 9 and Oracle has said they're interested.
So that's about the status at this point. So there's a lot of interesting
opportunities here to -- well, the first one is to write this for your favorite language
and I would love to help anybody out who wants to do that.
I'm working on JavaScript right now. Interesting opportunities for partially
evaluating the batch. The way it works is actually it's configurable in the sense
that the client -- the mechanism of the batch system in the language does the
partitioning. So that's built in. But what it does with the batch script is up to a
handler. The handler gets to decide how to send it to the remote server or
execute it locally or do security checks on it or partially evaluate it or whatever.
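One way to picture the handler idea is as a small interface with composable implementations. The following Java sketch is hypothetical throughout: BatchHandler, LocalHandler and CheckingHandler are invented names, scripts are plain strings, and "execution" is a placeholder that echoes the script.

```java
// Illustration only: the language partitions the code, and a pluggable handler
// decides what happens to the resulting batch script.
final class BatchHandlerSketch {
    // Receives the script produced by partitioning and returns a result.
    interface BatchHandler {
        String handle(String script);
    }

    // Stand-in for executing the script locally (here it just echoes it).
    static final class LocalHandler implements BatchHandler {
        public String handle(String script) {
            return "executed:" + script;
        }
    }

    // Wraps another handler and rejects scripts that fail a security check.
    static final class CheckingHandler implements BatchHandler {
        private final BatchHandler next;

        CheckingHandler(BatchHandler next) {
            this.next = next;
        }

        public String handle(String script) {
            // Toy check: refuse anything that looks like an unbounded loop.
            if (script.contains("while")) {
                throw new IllegalArgumentException("unbounded loops are not allowed");
            }
            return next.handle(script);
        }
    }

    public static void main(String[] args) {
        BatchHandler handler = new CheckingHandler(new LocalHandler());
        System.out.println(handler.handle("for (item in items) output(item.title)"));
    }
}
```

A handler that sends the script over a network, or one that translates it to SQL, would plug into the same slot without changing the client code that wrote the batch.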
So partially evaluating that batch handler would allow you to actually translate the
batch script into SQL in the compiler. So what got written would be completely
straight JDBC with hard coded SQL as the output of the compiler. And then
there's also questions about multiple levels of communication, can we talk to
multiple servers in a batch, can we talk to a server that talked to a server? Could
we have -- this is the one I'm most interested in is you talk to a server in a batch
but then the server wants to make some virtual calls back to the client, because
there are often callbacks that the server wants to make to tell the client information.
So there really needs to be a return batch that the client then executes and
maybe sends results back to the server. So you can do this really nice batch
handshake idea. But I haven't worked that out yet. As I mentioned doing it for
GPUs would be fun. There's a lot of stuff about asynchrony. A lot of people say
latency it's a problem we need to fix it by doing asynchrony. Asynchrony is the
answer to everything. Well, asynchrony doesn't work if you have if statements,
because you gotta get the -- you have to block. It's like the same problem with
instruction pipelining, right? You get the branch you have to wait until you know
the answer to know where to go.
So batches don't have any problem with if statements because it just sends them
over. However, it really is interesting to think about asynchronous execution of
the batch script with the client, because their control flow is synchronized you can
actually stream the results from the server and have the client start executing
before the batch was actually done on the server.
And that's what SQL Server does, too. So lots of related work. What's
fascinating is no one has ever come up with this before. Because it's so obvious.
And so simple. And it works so well. I mean, it just kills RPC. It's so much
better. You can argue maybe it's not better for SQL. You can argue that. SQL is
pretty complicated. But in terms of RPC, it's just better. And I've never seen
anything that is worse about it. If you want remote proxies you can do it. You
can put those in. There's nothing that prohibits you there.
>>: Can you contrast it with the work like J-Orchestra where you do automatic
partitioning between the client server?
>> William Cook: I don't know if you notice the list of people on this, one is Eli
Tilevich. He's working on rewriting J-Orchestra to use this. The thing
that killed J-Orchestra was the inability to have an efficient communication
model.
We can actually -- you keep saying do I need the batch statement. You could
infer it. You could put them in.
>>: At some level, purely performance then it's a question of whether or not how
many round trips you have.
>> William Cook: If you had an app that ran locally and you wanted to partition it
into local and remote and add the batch statements at the same time, then that
would be a cool thing to do and Eli is thinking about that.
So there's lots of other things that are related. I really like the fact that it's the
inverse of deforestation. Deforestation is when you take an intermediate data
structure and you get rid of it, where batches create an intermediate data
structure in order to optimize the communication.
So that's one of the reasons why I called it a forest. Okay. So we have papers
on this. We've unified RPC with the ease of programming. That was the first
paper. Got a pretty good description of the basic concept. We showed how it
could be made cross-platform with Web services. We have done some
theoretical work on SQL, early work that was the precursor to this. POPL and
more recently there was a nice paper at Database and Programming Languages
Conference. It's a symposium or workshop, I can't remember, but it was a nice
meeting, and we show how to do the SQL compilation and translation.
The actual algorithm is not in the paper because it had really small page limit.
But if you want to see the algorithm, I can give it to you. All right. So in
conclusion, we have a new kind of control flow statement called batches. And it
partitions programs into local and remote. And it unifies lots of different notions
of remoteness and potentially more. It gives you efficiency, a clean programming
model. No queries, no proxies, no requirement to be stateful. You don't have
this distributed garbage collection problem you have with CORBA and RMI, and
it's language and transport neutral. You can use XML, you can use ASN.1. You
can use JSON for the communication. The whole fetish about which
communication protocol we're going to use is just irrelevant, I think.
The big negative as you mentioned is you have to change your language. But
I'm going to argue you cannot solve this problem without changing the language
at some level. So that's it. Thank you.
[applause].
>> Rustan Leino: Further questions?
>>: You mentioned you have some syntactic rules for sorting out if it's a legal
batch statement.
>> William Cook: Yeah.
>>: Do you find frequently or do you find at all that there are cases that you
would like it to figure out are actually legal and it doesn't -- that is, where you
would like a more detailed analysis of the batch statement?
>> William Cook: We haven't found any of those yet. I think that the -- I
understand this question is important, because it means does the sort of
technical approximation really correspond naturally to the human desire intuition
about what should work. It seems like it does. That those seem to be fairly
consistent. I can't say that I've implemented everything that -- I haven't used it
enough to really push it to the absolute limit. When you release a technology like
this people tend to do that. They push it to the limit. So it will be really
interesting to see what happens. But my hunch is it's not going to be too bad.
There might need to be some tweaks but I'm not really worried about that.
>>: Linked to that, the interesting thing is that in LINQ there's no check for side
effects, right? It just goes wrong and you entangle like LINQ queries with side
effects in your lambdas. So it's interesting that you actually check for it. But I
guess so your check would be as soon as you have a method call, assume that
there might be an effect?
>> William Cook: Yes.
>>: And if it's like maybe a property access or something, it will --
>> William Cook: You could declare. So there are ways to kind of approximate
it. But right now if you do a property accessor it's going to assume there's an
effect and that forces you to introduce a local variable that captures that value.
The assumption is that values do not have effects. So that's the thing. You can
always capture the value and then that value can be used freely.
>>: And it will be --
>> William Cook: And it will be in the pre part, yes.
>>: So one of the things about RPC is that it gives -- one, it's easy to use but also
gives this illusion of sort of the cost being similar. And what you're doing, you're
actually creating a higher cost for these batches, right? A single round trip is
going to take a certain amount of time but a round trip where you send over
batch, have to do the computation, bring back the whole thing, takes longer, so
you're introducing longer latencies.
>> William Cook: I would argue in the case of a single method call it's going to
be the same. The batch doesn't do anything more for a single call than a normal
RPC. I mean of a single call. If there's one remote call.
>>: Okay. Right. Okay. Yeah.
>> William Cook: So latency is not -- the other thing is bandwidth is much more
available than latency. Latency is the speed of light round trip time. So if you
can reduce the number of round trips you can reduce the latency. You may use
a little more bandwidth, but bandwidth is free. So bigger batches with fewer
round trips will always win. That's my argument in the way that --
>>: But you break down RPC these calls take a lot longer to begin with. So
there's latency that's been introduced --
>> William Cook: Yes, rather than a local call.
>>: But then you're adding more latency.
>> William Cook: I don't think we're really adding much more latency. I really
don't. Maybe you're talking about a few instructions more or something, but it's
not -- it's not -- relative to the cost of the network, it's not really significant. The
thing that it does do is that because you have to say batch, at least it's a clue that
remoteness is happening.
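The round-trip argument in this exchange can be checked with back-of-the-envelope arithmetic. The Java sketch below uses a deliberately simple cost model (round trips times round-trip time, plus payload transfer time) and made-up numbers; RoundTripCostSketch and totalMillis are hypothetical names, and real networks have other costs this ignores.

```java
// Illustration only: total time = roundTrips * rtt + payload / bandwidth.
final class RoundTripCostSketch {
    // Returns the estimated total time in milliseconds.
    static double totalMillis(int roundTrips, double rttMillis,
                              long payloadBytes, double megabitsPerSecond) {
        // 1 Mbit/s moves 1000 bits per millisecond.
        double transferMillis = payloadBytes * 8.0 / (megabitsPerSecond * 1000.0);
        return roundTrips * rttMillis + transferMillis;
    }

    public static void main(String[] args) {
        // Assumed numbers: 50 ms round trip, 100 Mbit/s link.
        // 100 separate calls moving 10,000 bytes in total.
        double fineGrained = totalMillis(100, 50.0, 10_000L, 100.0);
        // One batch that moves twice as many bytes in a single round trip.
        double batched = totalMillis(1, 50.0, 20_000L, 100.0);
        System.out.println("fine-grained RPC: " + fineGrained + " ms");
        System.out.println("single batch:     " + batched + " ms");
    }
}
```

Under these assumed numbers the round-trip term dominates, so doubling the payload barely matters while cutting 100 round trips to one removes almost all of the delay.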
>>: Right.
>> William Cook: So it doesn't hide it completely, that's the thing. All right. Well,
good. Thank you very much.
[applause]