>> Rustan Leino: All right. Good morning everyone. I'm Rustan Leino. It's my pleasure to introduce William Cook, who is our speaker this morning. William is a systems professor at the University of Texas at Austin. And he's done a lot of work with languages and going into databases and other cool stuff. He's a hacker, as we know. And he spent about ten years in industry and start-ups after his Ph.D. and before returning to academia. And today he's going to tell us about what are called batches.
>> William Cook: All right. Thanks, Rustan. I have been working on this stuff for many, many years and I finally have a good story to tell. So I'm pleased to be able to be here and tell you about it. I want to give a quick advertisement for some other current work. I'm working on four different areas right now: a new system called Ensō for programming with DSLs, support for hybrid partial evaluation to be able to compile those DSLs and interpreters efficiently, a structured concurrency language called Orc, and batches. So these last two are about distributed programming and the first two are about domain-specific languages. So I'm going to talk about batches today. What I want to do is suggest that there are really three different kinds of remoteness that we deal with commonly, at least three different kinds. There's more. But if we take these three different kinds, they're very similar in some ways, and we can see if there's some way to bring them together: the idea that computation, databases, or services are remote. If we start with RPCs and look at where a lot of our thinking about this came from, at least historically, it is the notion of RPC: we're taking a procedure and calling it on some data and we want to allow that procedure to be remote.
After 20 or 30 years of development we had CORBA and DCOM and RMI and this notion of distributed objects, but it's based on the idea of taking a procedure and making it remote, so when you call the procedure the arguments get bundled up. A message is created. It's sent over to the remote server, it's executed and results are returned. And it seems like a reasonable idea. It's trying to hide the remoteness, make the remoteness completely transparent. And it's been somewhat problematic. I'd actually go further and say that the entire thing has been fundamentally a failure. CORBA, DCOM, all of these things didn't work very well. And it's one of the largest sort of engineering failures in the programming language community. It just has not been adopted widely. And there's a lot of work that's been done to figure out ways to work around the problems with it. What are the problems? Well, it has big issues with latency, in that every time you make a remote call it's a round trip to the server. So the more remote calls you make, the more latency you get. And it tends to require stateful servers, that is, the servers have to maintain knowledge about the client state. It tends to be very platform-specific because we're serializing objects back and forth. And so there have been some ideas to solve some of these issues, where you redesign your interfaces to make the communication have lower latency, by combining different operations together to create more of a facade and doing bulk data transfer using data transfer objects. And the problem is that these solutions are fairly ugly and they're sensitive, in that if the client behavior changes, you have to rewrite your server. So it creates all kinds of bad dependencies in your architecture. So the benefits are that we can use existing languages and we can have a nice elegant model where remoteness is invisible, but the problems with that approach kind of overwhelm any benefits we get.
So what I want to do is start again and try to find something that unifies these three things and provides some of the benefits of each of these different ideas. In particular, Web services are very cross-platform. RPC is easy to use. SQL has a lot of efficiency and transactional behavior. And so what I'm going to do is take a clue from an ancient paper. This was actually the paper that coined the phrase impedance mismatch, or at least applied it to software and databases. And what David Maier said was that whatever our programming model is, it has to allow complex data-intensive operations to be picked out of programs for execution by the storage manager. What's this idea of picking stuff out? And you'll see one way of looking at it is that Microsoft LINQ does that. But I have a different way to do it. So here's what I'm going to do. I'm going to start over and take a different example as my starting point. RPCs took as their starting point a single procedure call with some data, and I'm going to take two procedure calls. And that little change is going to change everything. It just depends on what you take as your starting metaphor. So now let's just consider what happens if we have remote object R and we want to make two calls to it. Get the name and get the size and print out the answer of both of those. So there's actually an interleaving of the remote computation and the local computation. What I want to do is I want to do this fast, so it does it in one round trip instead of two. I want to have it be stateless. I want it to be platform independent so there's no assumptions of -- well, in this case nothing is being serialized, but I could also have had a serialization involved there. And keep the nice clean programming model. So anybody can solve this problem, right? Give me the answer. You've already got the clue in the title of the talk. Right? Batches.
I sometimes present this in a tutorial mode where I write this example on the board and have students or other professor types solve it. It's really interesting. Because just giving this hint, even with no context at all, people will get the right answer if you tell them let's do two calls at once. So what we do is we need to change the language, though. That's the part we usually don't get to do. In the past all of these attempts to solve this problem have always assumed that we have a programming language that's designed for a sequential machine and now we want to make it work remotely, but we can't change the language. All we can do is write stubs and do code generators and write libraries, but you can't solve the problem unless you change the language. And so my change, the proposal, is to add a new kind of control flow statement: like the for loop, the if, the while, now we have batch. And what batch does is it takes a focus object that comes from some service. So item R is an interface to a virtual remote object. We're not going to say that that remote object exists on the client. It doesn't necessarily even exist on the server, but the client can refer to its properties and call methods on it like get name and get size, and the semantics of this is to run the name and the size together. So the program is going to get partitioned into the remote and local part. It's going to send the remote part to the server as a script. So if you think about what RPC does, it sends a single procedure call to the server and the server interprets it. It's like a one-line script that means call this procedure. Well, why can't we have a script that calls two procedures? It's really easy to do. And so this actually creates a remote facade on the fly. It allows you to call a double procedure in one call by composing two of the actual interfaces. And the server will execute the two calls, and it will make a pair of results and send that back to the client.
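As a rough illustration of the two-call batch just described, here is a minimal Python sketch. It is not the speaker's implementation: the `RemoteFile` and `Server` classes, the tuple-of-method-names script format, and the `execute` method are all hypothetical stand-ins chosen only to show both calls travelling in one request and coming back as one pair.

```python
from dataclasses import dataclass


@dataclass
class RemoteFile:
    """Hypothetical stand-in for the object that lives on the server."""
    name: str
    size: int

    def get_name(self):
        return self.name

    def get_size(self):
        return self.size


class Server:
    """Hypothetical server that interprets a script of method names."""

    def __init__(self, root):
        self.root = root
        self.round_trips = 0

    def execute(self, script):
        # One request in, one tuple of results out.
        self.round_trips += 1
        return tuple(getattr(self.root, method)() for method in script)


server = Server(RemoteFile("report.txt", 2048))

# Conceptually the programmer writes something like:
#   batch (r : server) { print(r.get_name(), r.get_size()) }
# A batch compiler could split that into a script (remote part)
# and the code that consumes the results (local part).
script = ("get_name", "get_size")      # remote part, sent as data
name, size = server.execute(script)    # the single round trip
message = f"{name} {size}"             # local part runs on the results
print(message)
```

The script acts as a facade created on the fly and the returned tuple acts as a data transfer object created on the fly; neither has to be declared in the server's interface.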
And that creates a data transfer object on the fly. Whatever calls you make, it will make up enough of a pair or triple or tuple to get the results back. And then it reruns the remaining client code and prints out the first item of the pair compared to the second item in the pair. The only data that gets transferred is the script and the results, as a tuple or a pair in this case. Pretty straightforward, right? Well, here's what it looks like. If you look at what the compiler generates, it's going to generate a script constructor that will construct the script. It's got the two statements in it. What I'm going to do is generalize this notion of a pair a little bit to be a named tuple, a record. And so this says capture the output of the first call and call it A. Capture the output of the second call and call it B. Then execute the script, getting a forest or collection of results back, and then print out the output of getting A and getting B out of that result. And that can be typed in this case. Question?
>>: Was it a forest or a list?
>> William Cook: It's going to be a forest because of this example. Thank you. So now we want to do more operations on the server. We want to talk to a mail server. This is again a mailer -- mail is a virtual handle to a mail service that has a bunch of different operations on it. And one of them is to list the current messages. And so we can iterate over the mail messages. All the messages are on the server. Iterate over the messages. Check if the size is greater than some limit. The limit is a local variable. Now we're getting more mixing of local and remote. We need to check if the size of the remote message is greater than the limit, and if so print out the subject and delete it. So we don't just have read operations. We can do imperative updates as well. Otherwise print out the name and the date. Now, by putting this inside batch, I want it all to run in one round trip.
So our scripting language has to get more sophisticated. So instead of just sending a pair of method calls, we need to actually send the whole loop to the server. The server will run the loop and compare the size, so it needs to have the limit that came from the client. And then it needs to capture the subject. It can't print it out, because printing is a local operation, but it can capture the subject. It can delete the message, and otherwise it can capture the subject and the date. In this sense the subject is always captured. The date is conditionally captured and the delete is conditional. The server can do all that. And it can return back a list of subjects and dates to the client, to then run the loop again on the client and print everything out. But the client also needs to know this Boolean about whether or not the message was deleted. So what is going on here is we can recreate the control flow on the client to be the same as the control flow that occurred on the server. So here's the code. And the slashes on this slide mean that this is generated code. No programmer would ever write this. This is what the compiler generates. So it creates a script in some way, an AST of a script, and it captures the subject always. It captures the Boolean of whether the size is greater than this input named X. And it deletes the message if so. Otherwise it outputs the date. So the output, the result here, is going to be a table with a column named A which always has a subject, a column named B which has a Boolean, and a column named C that has the date sometimes. And otherwise it's empty. And here's the overall code. I couldn't fit it all on one slide. The idea here is after creating the script it can create an input forest and put the limit in and call it X. So we're sending a structure of inputs, executing the script, and getting a structure of outputs out. It runs the script. Gets this result, and then it can run the loop, iterate over all messages, get the Boolean named B.
Get the subject, and get the date if necessary, and print them all out. And the funny thing about this code is that this code is very familiar. This pattern. It's the ODBC, OLE DB, JDBC, LINQ pattern actually. It's the same pattern. And I call this the -- I deleted the name I think. I'll call it the batch script pattern. It's a pattern where you are doing metaprogramming on the client. It's sending a script and decoding some results. And my argument is that whenever you write this pattern as a programmer, your programming language sucks. Because this pattern is not a pattern that you should have written. This is compiled code. You should never have to say this. And so we need to generalize it. And what you meant to say was this. And the separation of the thing into the script and the decoder and the results and the buffering and all that stuff should have happened automatically.
>>: Am I limited from having local side effects in the batch?
>> William Cook: You are allowed to have local side effects. You're allowed to have remote side effects.
>>: How do you avoid double --
>> William Cook: But you cannot have tangling of effects. Let me show you what -- you're just feeding me the cues perfectly. It's not this slide, you jumped to the next slide after this. But this is the answer to your previous one: Why is it a forest? It's a forest because if you have nested loops, you get nested tables, nested result sets. So this says for each one of the items print the name, and for each of the item's parts print the name of that. The output will be a list of items with their names and then for each item a list of their parts and their names. So this is the master-detail, N plus 1 query problem that's quite common as well. Iterating over multiple nested subcollections. We want to do that in one round trip as well. Okay.
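The mail example from a little earlier in the talk, with its A, B and C result columns, can be sketched roughly as follows in Python. This is an illustration under assumptions, not the generated code from the slide: `Message`, `MailServer`, `run_script` and `client` are hypothetical names, and the server-side loop is hard-coded here instead of being interpreted from a script AST.

```python
from dataclasses import dataclass


@dataclass
class Message:
    subject: str
    size: int
    date: str


class MailServer:
    """Hypothetical server; run_script stands in for interpreting the script."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.round_trips = 0

    def run_script(self, inputs):
        # Remote part of the loop: capture outputs and delete, never print.
        self.round_trips += 1
        rows, kept = [], []
        for msg in self.messages:
            # A is always captured; B records which branch was taken.
            row = {"A": msg.subject, "B": msg.size > inputs["X"]}
            if not row["B"]:
                row["C"] = msg.date    # C is captured only on this branch
                kept.append(msg)       # messages with B true are deleted
            rows.append(row)
        self.messages = kept
        return rows


def client(server, limit):
    """Local part: send the inputs, then replay the same control flow."""
    lines = []
    for row in server.run_script({"X": limit}):
        if row["B"]:
            lines.append(f"deleted: {row['A']}")
        else:
            lines.append(f"{row['A']} {row['C']}")
    return lines


server = MailServer([
    Message("big attachment", 5000, "2011-01-02"),
    Message("short note", 10, "2011-01-03"),
])
lines = client(server, limit=1000)
print("\n".join(lines))
```

The Boolean column B is what lets the client take the same branch the server took for each row, so all printing stays local while all deletion stays remote.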
And then this idea of one round trip and the tangling of effects. So the guarantee is that batches require one round trip to the server, which means that if you go and try and do a batch where you get the subject of a remote message and then try and ask the user a question about whether or not it should be deleted and then delete it, it's a syntax error. It's a semantic error, in the definition of the batch. It's a programmer error. You can't write a batch that tangles the state. You can modify local state. You can modify remote state. It's a lot like loop splitting in a compiler where you're doing aggressive out-of-order execution. The assumption here is that all the effects that happen on the server can happen in the right order and they can be updates. But those effects are independent of any effects that happen on the client. That's the idea of remoteness. That there's no implicit channel of communication between the server and the client. The only communication is going to be through this script and the results. Now, you can obviously set up cases where that's not true, but it's a very common assumption. And if the effects of the server and the effects of the client don't interfere with each other, then it's okay to reorder the operations with respect to each other. And that's what we do. We take all the remote operations and collect them into a batch and send it over and have them all be done at once. And then the client -- so it turns out the client can do some local operations in the batch before sending the remote stuff. Then it does all the remote things. Then it can do more local updates after. But it needs to be -- in the effect system that's behind this, you need to be able to stratify it into a bunch of local effects and remote effects and local effects, and you can't have any kind of intermediate or tangled effects, I guess, is the way to say it. For example, trying to go to the remote server twice, which is what this example requires.
>>: [inaudible].
>> William Cook: Yeah. And it's an approximation. It just rejects programs that look bad. So it's a conservative approximation. And the point is that the programmer needs to be able to read it and see whether it makes sense. So you need a set of rules that are very simple. And so it does a very simple kind of effect analysis. And if it doesn't look good, it rejects it. And the programmer can rewrite the program to take some of the effects out and store them in local variables, to simplify it, in case it's a valid program which the effect system rejects. There are ways to kind of rewrite it to make it clearer.
>>: If you just eliminated batch, the batch keyword, so what would happen?
>> William Cook: Well, so the problem is that you'd get lots of remote local tangling. The issue here is that I believe --
>>: Purely performance optimization or is it actually semantically?
>> William Cook: Oh I see. It's purely a performance. Yes, this is just about performance. The intent is for it to have no semantic significance at all. Yes.
>>: I mean, in the end you could say [inaudible] just around every little remote call.
>> William Cook: That will not work. No, that will not always work. So in theory it should have no semantic significance, but it actually does. And let me tell you why. Because it rejects -- so the system, the underlying system -- what I was saying is it has no semantic significance assuming that the objects were local. If you really did have a remote system, then here's the problem. The current batch model that I'm proposing does not allow remote proxies. So you cannot have a local pointer to a remote object other than the root. So when you iterate over dot messages in a normal RMI/CORBA style, message would be a local variable that's a pointer to a remote object. That's what causes the statefulness of the server. What actually happens with batches is that this message object never gets instantiated on the client.
It only is instantiated -- it's a local pointer on the server that gets created during the batch execution. Once the batch is done running, that message can be freed. Right? So if you look at the client code here, there are no references to remote objects at all. They've all been removed. They're all in here. Here's a reference to a remote object. But it's in the code that's being sent to execute locally on the server. So, yeah, so we reject proxies and serialization completely. So you have to put enough work in the batch to be able to access all the remote objects and complete a transaction with them. And then get the answers back that you want. And the answers have to be strings and integers and dates, because that's the only thing that can be transmitted remotely. So what about serialization? Everybody -- a lot of people like serialization. It's kind of fun and nice. Here's our mail server again. And we want to just send a list of recipients, a set of recipients to the server, to send our message. What I said so far is we don't have serialization. So you're not allowed to do that. Well, it's okay, though, because we can call factory methods on the server. We can actually construct a remote set, a set on the server and populate it with values. All right? So the server interface can provide any interface it wants and so what we have to do is we have to tell the server, make a set. And then iterate over the local state and populate it, copy the local names into the remote set. Then we can send the message. Now, the key point here is that this can be done in one round trip, because what happens is that the system detects this is local code that needs to run before the batch executes. Okay? What it does is it runs this loop, pulls out all the names from the local set and puts them into this input record. Then it sends the input record to the server. Then the server can create the server side set, insert all the names into it, and then it's got the data that it needs. 
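The "serialization by public interfaces" idea described above can be sketched in a few lines of Python. This is only an illustration under assumptions: `MailServer`, `execute`, `send_message` and the shape of the input record are hypothetical, and the remote part is hard-coded where a real batch server would interpret a script.

```python
class MailServer:
    """Hypothetical server; execute stands in for running the remote script."""

    def __init__(self):
        self.sent = []
        self.round_trips = 0

    def execute(self, inputs):
        self.round_trips += 1
        recipients = set()              # factory call: the set is built on the server
        for name in inputs["names"]:    # populated from primitive input values
            recipients.add(name)
        self.sent.append((inputs["subject"], recipients))
        return len(recipients)


def send_message(server, local_recipients, subject):
    # Pre-local phase: read the local collection through its public
    # interface (iteration) and copy plain strings into the input record.
    inputs = {"names": sorted(local_recipients), "subject": subject}
    # Remote phase: one round trip carrying only strings.
    return server.execute(inputs)


server = MailServer()
local_names = frozenset({"ann@example.com", "bob@example.com"})
count = send_message(server, local_names, "status update")
print(count)
```

The client-side collection here is a `frozenset` and the server-side one is a `set`; only the strings cross the wire, so neither side depends on the other's internal representation.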
So what it means is you just have to extract all the data out from the local one and then write it into the server. So it's serialization by the public interfaces. One of the nice things about it is it's just as fast as normal serialization, really, and it's not representation-dependent. You can serialize, deserialize a Java hash table and write it into a Python hash table, and it's completely cross-platform, cross-language. And you don't have the problem of copying internal representations around. So that's kind of ugly if you had to write that over and over again. This little serialization routine. So what we need now is we need to be able to modularize the parts of a batch. In order to do that, we just need to be able to write functions where, instead of executing locally, the procedure is also partitioned. It's a modular batch component, essentially. What this does is I can write a set send helper function that's marked as a batch procedure. So that means that if you use it inside a batch it gets partitioned. It's as if it were inlined, essentially, and gets partitioning applied to it. Normally local functions are always executed exclusively locally. These are ones that participate in the batch. So now I just have to say server recipients equals send of the local names, and the send function will get split into a part that collects up the names locally and a part that creates the set and writes them into the set remotely. So it's just really the same code. It's just been made to be reusable. If you really wanted to, you could make this be an implicit coercion that got invoked when a local thing got copied to a remote thing. So we could actually make it look like serialization, but it's using a custom serializer that uses the custom function, not the internal representation. Okay. Exceptions are fun. They happen a lot, especially in the remote world. And so we can get exceptions on the server. We can get exceptions on the client.
So exceptions on the server happen when the server is executing the script and suddenly something goes wrong and it gets an error. One thing that I told you was that the control flow on the server and the control flow on the client are essentially synchronized. They're duplicated. What can happen is we can have it set up so that when the server gets an exception at a certain point, it just terminates the batch, sends the results that have been collected so far back to the client. The client can rerun the same -- excuse me, the same computation, the same control flow. When it gets to the point where the server threw an exception, the exception can get reraised and the client can handle it locally. It doesn't allow the script on the server to recover from errors. My current proposal is that server scripts do not support error correction. We could add that to the scripting language, but right now it's not supported. Client exceptions are a little more tricky, because of this out-of-order execution you can have a control flow where the client normally, if you thought of it as going through five iterations of a loop, if the client gets an error on the fifth iteration, it seems like not a problem, but the problem is that the server has already done all ten iterations. So the client can fail to process results that happened in the future, essentially, from its point of view. And so the client has to be a little careful in terms of how it handles its own exceptions. I don't have any other answer for this other than you just need to be aware of the way batches work at this point.
>>: If it were local effects before sending off to the server, do you have the same --
>> William Cook: No, because what it does -- I didn't mention it explicitly. The requirement is that all effects in a place happen in the correct order. So all local effects happen in the correct order as if you ran the code sequentially.
So in the case that I was showing you here, where it does some local code before the batch, those effects have to happen before any of the after effects of the local code. So it doesn't allow reordering of -- the effects on the local system do not get reordered with respect to each other. So the argument is all effects occur in the right order at their home location. So I think that was your question.
>>: So if the client bursts through --
>> William Cook: If the client got an exception before -- during the pre-local, then the batch would never even be invoked.
>>: I see.
>> William Cook: And the client needs to know that. But it can tell where that is by, again, there's -- in looking at the static analysis that essentially partitions the code, it just analyzes it and assigns three different categories to it: it's either pre-local, remote or post-local. And doing that analysis is pretty straightforward, and it tends to follow -- at least the pre-local and the post-local have to be sequentially ordered in the code. Okay. So batches are not necessarily transactional, but you could make a transactional server if you wanted to, because it's sending the complete set of operations to be performed. The batch, you could actually make a transactional RPC server that somehow serialized the RPC calls on the server, if it wanted to. And obviously you could use it for databases as well. But the server doesn't have to be transactional. It can also just completely arbitrarily interleave the execution of multiple batches if it wants to. So it doesn't -- so batches don't eliminate errors and they don't eliminate communication errors or failures. There's a lot of problems that happen when you have remoteness. At least it reduces the potential for certain kinds of intermediate errors, intermediate states, because the batch is sent as one script to the server. It's either all going to be delivered or none. You're not going to get communication errors in the middle of a server communication.
And it does lead to fewer round trips. So there's a little bit less opportunity for error. But it doesn't solve this problem. The other thing it does is it requires the programmer to explicitly put the batch in, so they know that remoteness is happening. My goal is not to hide the remoteness completely. It's to hide it almost completely but still make the programmer aware that it's happening and put a boundary on it. So you were asking, what if you don't put batch anywhere? The batch says this is the boundary of both efficiency and to some degree failure, and so it makes it declared what's going on there.
>>: Is there a way to express -- you said basically if the server fails, at a certain iteration, and the client will, too. Is there any way to say if the server ever fails don't do any iterations, it gives you more of a --
>> William Cook: That's a really good idea. The server could choose to completely abort the script and sort of roll back all the things that it did. So you could make a smart server that would throw an error and the client would just get an error immediately.
>>: But that means that all clients would have that same semantics.
>> William Cook: But if you wanted to have the client choose to do that, I think that -- I mean, the client could obviously make a call that requested what kind of failure behavior it wanted dynamically. But unless the server has the capability to sort of cancel all its partial operations, it's kind of hard to force a server to do that. I think that it's actually a really good mode of execution, and I think that one of the nice things about this is it lets the servers be smarter. It gives them opportunities.
>>: [inaudible] using for client exceptions in the sense that the server actually aborts the whole thing if the client --
>> William Cook: You can even do a two-phase commit if you wanted. You could make a two-phase commit.
The server is going to do some operations, hand back a ticket that says I've done these but I need to wait for you to complete your local code and then get a commit. You could actually implement that pretty easily, and that again would be just on a regular RPC server, not on your fancy database, if you wanted it. So this is the thing we keep mentioning, that the order of execution is preserved. The local operations and the remote operations. And so this is a case that's actually an error, because what's going on is we have a remote call update A and update B, but it has two arbitrary local procedures that need to be called to produce the inputs. And then we're going to print out the results. Now, if you try to execute this in a batch in one round trip, it requires both of these local calls to be made first before we send the batch, because they're inputs to the remote call. That is going to force this local get B to happen before the print, which is an error. And so this would be an invalid program. And the reason is that, well, get B could look at the length of the print stream and it could be different if you did it before or after the print. So the idea I was getting at is the local code has to execute in the natural order that it would have executed without batches. So this is actually a syntax error. And you can fix this by assigning these get As and get Bs to local variables, to make it clear what the order of the effects should be. Okay. So summary. We have a new statement and it has a new compilation semantics. A kind of control flow where it does out-of-order execution. It partitions the code into local and remote. It manages the communication. It potentially works in any language. I haven't really proven this yet. I'm still in the process of implementing other clients. It's easy to write a new server. You just write a script interpreter, and that's like PL 101, a week-long project for an undergraduate to write another batch server.
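The ordering rule behind the update A / update B example can be illustrated with a small Python sketch. This is a deliberately simplified approximation and not the effect analysis used in the actual system: statements are modeled as hypothetical records that say where they run, which variables they read, and which variable they write, and `stmt`, `partition` and `TangledEffects` are invented names.

```python
class TangledEffects(Exception):
    """Raised when local code cannot be split into pre-local and post-local."""


def stmt(name, place, reads=(), writes=None):
    return {"name": name, "place": place, "reads": set(reads), "writes": writes}


def partition(statements):
    """Split statements (in program order) into pre-local, remote, post-local."""
    remote = [s for s in statements if s["place"] == "remote"]
    remote_outputs = {s["writes"] for s in remote}
    remote_inputs = set().union(*(s["reads"] for s in remote)) if remote else set()
    phases = {"pre": [], "remote": [], "post": []}
    seen_post = False
    for s in statements:
        if s["place"] == "remote":
            phases["remote"].append(s["name"])
            continue
        needs_remote = bool(s["reads"] & remote_outputs)  # must run after the batch
        feeds_remote = s["writes"] in remote_inputs       # must run before the batch
        if feeds_remote and (needs_remote or seen_post):
            # Running it early would reorder local effects or need a second trip.
            raise TangledEffects(s["name"])
        if needs_remote or seen_post:
            seen_post = True
            phases["post"].append(s["name"])
        else:
            phases["pre"].append(s["name"])
    return phases


tangled = [
    stmt("a = getA()", "local", writes="a"),
    stmt("ra = r.updateA(a)", "remote", reads={"a"}, writes="ra"),
    stmt("print(ra)", "local", reads={"ra"}),
    stmt("b = getB()", "local", writes="b"),
    stmt("rb = r.updateB(b)", "remote", reads={"b"}, writes="rb"),
    stmt("print(rb)", "local", reads={"rb"}),
]
# Same statements with both local inputs computed first.
fixed = [tangled[0], tangled[3], tangled[1], tangled[4], tangled[2], tangled[5]]

try:
    partition(tangled)
    rejected = None
except TangledEffects as err:
    rejected = str(err)

phases = partition(fixed)
print(rejected, phases)
```

In the tangled version `getB()` would have to run before the first print in order to feed the batch, which changes the local order of effects, so the sketch rejects it; hoisting both inputs into local variables first gives a clean pre-local, remote, post-local split.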
Writing a batch client is a little more complicated because you have to change the interpreter or compiler of the client language and do this partitioning. There are libraries I have that will do the partitioning, the actual hard part, but it's still some code to make that work.
>>: [inaudible].
>> William Cook: I have it working in Java, yeah. And the communication is optimized as well. And essentially what we're doing here is we're borrowing the SQL Server -- data execution model. Right? Send primitive values back and forth. Send scripts over for execution. Let's just do it for other things besides database access. Okay. So here's the actual batch script language. This is equivalent to SQL. This is my imperative remote scripting language. And what it lets you do is constants, variables, conditionals, for loops, potentially aggregated for loops. So you can do a sum. And it's like a monoid comprehension. You can create local variables on the server. You can do assignments, access fields, call methods, do primitive operations, use inputs and capture outputs, and have first-class functions, because they're useful sometimes for doing certain kinds of server-style predicates. And a fixed set of operators and a fixed set of data types. We just say this is the scripting language. And I don't have a concrete syntax for it yet. I have an abstract syntax. I'm thinking of using JavaScript as the concrete syntax. The nice thing is it doesn't let you call constructors or let you do any of the dangerous things, it just lets you do all the things you could have done if you were talking to an RPC server anyway. It doesn't let you do while loops either. So there's no unbounded computation. So there's a nice sort of bound on that. Although it does have for loops.
>>: [inaudible] something, right? So it's like it enumerates.
>> William Cook: Yes, in a collection.
So if the server wanted to provide an infinite stream, then it could allow sort of unbounded computations, but the point is that the collections are bounded by what is assumed to be a finite collection. The iterations are bounded by a finite collection. So let's take this on the road and talk about, you've already seen it looks a lot like SQL. So let's just do it. So here's the batch script pattern executing JDBC. Looks just like the generated code of a batch statement. And so instead we just write this. And this is what we'd like to write. And then we just put batch around it. And it partitions out the remote stuff. Turns it into this batch script, which it turns out can be deterministically compiled into efficient SQL. It turns out for any batch you write, including any four nested loops, it always generates a constant number of SQL queries. That's a property that LINQ does not have, for example. The only other system that does that that I know is Ferry at this point. So that solves the --
>>: Like in your batch script language, isn't strict enough it will always.
>> William Cook: No, you can write arbitrary code in here pretty much. Arbitrary nested loops and conditionals. But it generates a batch script, straightforward, but the interesting property is that the batch script can always be compiled into a constant number of SQL queries. So you get a performance guarantee.
>>: But these loops inside the batch scripts are always looping over things on the server side.
>> William Cook: Yes.
>>: And so, for example, if you had a data structure on the client side and you wanted to loop over that, you really wouldn't do that in the script, basically.
>> William Cook: Well, I did do it with the mail sending messages, so I could loop over client things to get them to send to the server. I could also loop over a client data -- you can actually do arbitrary client computation in here. You can do exception handling and loops and stuff.
>>: Worry about the preorder?
>> William Cook: You have to worry about the ordering, yes. It captures a pattern of doing a collection of operations on the server, getting the results back, and doing something with them. There are use cases that require multiple round trips, and those would require multiple batches. But, yeah, it is tricky, actually. When you have four nested loops here, it turns out that generating four SQL queries for that is nontrivial. I'm not going to explain how it's done. But basically what it amounts to is that each loop gets its own SQL query, and the select statement it generates captures all of the iterations of that loop for all of its containing loops. I don't know, that's probably not clear. But if you want an explanation we can talk about it after. Okay. So if you do normal object-relational mapping you can define the structure of a data model using a bunch of interfaces or classes. And then I want to do a quick comparison with LINQ here. LINQ handles the writing of the query very nicely, but it is still the batch execution pattern. We're still creating an artificial data structure manually and then decoding it in the results. So you still have to do the part of making that intermediate data structure, the data transfer objects, manually. That's the part where batches excel. And with LINQ you end up with a sort of cross dependency: if you want to add another component here, you have to add it in here as well, and make changes in two different places. Another one that's interesting is dynamic queries. In a dynamic query, the predicate or the condition depends on local data. Sometimes there's a test and sometimes there isn't. This is very common on Web pages where you enter search criteria and, depending on which criteria you type in, it makes a different select statement. A different where clause.
So here's a case where there's a title and a test string, and we want to see if the title is equal to the test string. If the test string is defined we want to do the test. If the test string is empty we don't want to do the test. In LINQ we can create a query, a virtual collection. We can test the local state and then modify the query to add a where test, so we're now doing sort of virtual collection manipulation here. And then we can just iterate over that virtual collection, which may or may not have the test in it, and get the results and iterate over them. The fun thing is that in batches you just write the code and test the local state, and it will figure out that this is local. So it can be done before the batch is executed, and because it's a short-circuit or, it will not include the second part in the batch if the first part is true. So kind of fun. So, yeah, we have a compiler that translates the batch scripts into SQL. There are at least two different modes you can do it in. You can write a program and have the client convert it into SQL, or send the batch script to the server and have the server convert it into SQL. >>: Maybe on the previous slide, it's much harder to translate to a SQL query again on the server, right? If you wanted to translate to SQL queries. >> William Cook: The way this works, the batch system creates a batch script that either has the test in it or not. So the batch script that actually gets created is either -- >>: Always in there or not. >> William Cook: It's either in there or it's not. So whichever. >>: Not entangled. >> William Cook: Not entangled. Yeah. From the point of view of the query translator, it doesn't see the difference. It's handled by the batch system. This is another problem with LINQ: if you write lambdas, the assumption is that lambdas are sent to the server. But the problem is lambdas can contain arbitrary local computation. So they can fail.
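The dynamic-query case above can be sketched in a few lines of Python. This is a hedged illustration, not the batch compiler from the talk: the function names, the tuple form of the remote predicate, and the `albums` table with its `title` column are all hypothetical. The sketch shows the local half of `test == "" or title == test` being decided on the client before anything is sent, so the generated query either contains the where test or does not.

```python
def partition_condition(test_string):
    """Client-side partitioning of: test_string == "" or album.title == test_string.

    The left operand only reads local state, so it is evaluated right away.
    Because `or` short-circuits, a true left operand means the remote
    comparison never becomes part of the batch script.
    """
    if test_string == "":
        return None  # no remote test at all
    return ("prim", "==", ("field", ("var", "album"), "title"), ("const", test_string))


def to_sql(test_string):
    """Translate the (possibly absent) remote predicate into a parameterized query."""
    remote_test = partition_condition(test_string)
    sql = "SELECT title FROM albums"
    params = []
    if remote_test is not None:
        _, _, _, (_, value) = remote_test  # pull the constant out of the predicate
        sql += " WHERE title = ?"
        params.append(value)
    return sql, params


print(to_sql(""))
print(to_sql("Abbey Road"))
```

The query translator only ever sees one of two finished scripts, which is the "not entangled" point made in the exchange above.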
So what batches do is say yes, we're going to mix local and remote computation and partition it, and that way we can handle that case more explicitly and deal with one of those fundamental problems. Notice that in batches we use the same trick as LINQ does to do aggregation. It still uses LINQ-style aggregation for things like sums; it just doesn't use it for the standard select and project. So it really does support all aspects of SQL, including updates and inserts. You can do a bulk insert. You can do a for loop, iterate over one database table and insert into another database table, and that will generate the bulk insert query that no object-relational mapper I know of will generate at this point without lots of fancy hand waving or complex kinds of programming. And there's this constant-number-of-queries guarantee. So what it gives you is a fine-grained object-oriented programming model with efficient SQL execution. By fine-grained, and this is the thing with batches, I mean that we've been told that whenever you're talking to a remote server you need to use complex, large-scale operations, right, to reduce the latency. What batches do is say no, it's fine, just use fine-grained operations; we'll collect them up for you and do them all in a group. And then this is one of the things that Web services is about. Again there's been this long, very rapid experimentation with Web services. There are Web services that emulate RPCs, so they're great: they have all the pitfalls of XML and all the pitfalls of RPC. When people say Web services suck, that's what they're talking about; it's like the worst possible combination of things. But there's another kind of Web service, the document-oriented Web service, which to me is really the intuition behind Web services: you send a bulk description of the purchase order to be processed, and it processes it and sends it back. And you could send it in an e-mail message if you wanted to.
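The bulk insert mentioned earlier in this passage (a for loop over one table that inserts into another) can be illustrated with Python's standard `sqlite3` module. This is only a sketch of the two shapes of SQL involved, with hypothetical table names; it does not show how the batch compiler derives the second form. The row-by-row version issues one statement per row, while the `INSERT ... SELECT` version does the same work in a single statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, total INTEGER);
    CREATE TABLE big_orders_loop (id INTEGER, total INTEGER);
    CREATE TABLE big_orders_bulk (id INTEGER, total INTEGER);
    INSERT INTO orders VALUES (1, 50), (2, 500), (3, 900);
""")

# Fine-grained style executed naively: one query to read, then one INSERT per row.
rows = conn.execute("SELECT id, total FROM orders WHERE total > 100").fetchall()
for order_id, total in rows:
    conn.execute("INSERT INTO big_orders_loop VALUES (?, ?)", (order_id, total))

# The same loop expressed as a single bulk statement.
conn.execute(
    "INSERT INTO big_orders_bulk SELECT id, total FROM orders WHERE total > 100"
)

loop_rows = conn.execute("SELECT id, total FROM big_orders_loop ORDER BY id").fetchall()
bulk_rows = conn.execute("SELECT id, total FROM big_orders_bulk ORDER BY id").fetchall()
print(loop_rows == bulk_rows)
```

Against a remote database the first form costs a round trip per row, which is the cost the single-statement form avoids.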
This is an example of the Amazon Web service interface of two years ago. It allows you to do an item lookup: you give it a list of item IDs and a list of properties you want to look up, and it will give you back a table of results. This is kind of a SQL encoding here, but it's really an encoding of a batch. It's a custom scripting language, which has implicit iteration in here and implicit projections down here. And every single API is a different scripting language. It's just crazy. You know? And the code to write the clients is just incredible. It's explicit metaprogramming. You have to construct the request object, which is abstract syntax, and then write an interpreter; every single API call has a different interpreter. This is the batch execution pattern again. Here we finally get down to outputting the rank and image. With batches you just say get me the item, get me these two properties, and it sorts it out. We put a batch front end on the Amazon Web services thing and were able to rewrite a lot of the APIs. We didn't do it for everything, but it was a really good experiment. And we also did this for OSGi, this really huge API for managing packages in Eclipse, where they're struggling with how to make it remote. There's one proposal and another proposal, and we're going to do it this way and change the API. We just said no, the API is good, slap batches on the front and you can talk to that thing efficiently. >>: That would be JavaScript then, this batch. >> William Cook: You have to extend JavaScript with a batch keyword, yep. And that's what I'm in the process of doing. >>: So why is it necessary to have the keyword, other than -- because you can imagine doing the syntax checking as sort of a -- >> William Cook: I believe it's important to have the programmers declare their intent and provide a scope.
That matters because you can have a lot of different remote operations and they might need to be grouped into batches according to some very subtle semantic dependencies, and when the programmer says this is a batch, they're saying something about the scope of that. It's like transaction boundaries in a way. >>: I was going to say you could have begin transaction, end transaction. >> William Cook: Some kind of declaration. It's just a way of doing it. >>: Do it in a library as opposed to -- >> William Cook: No, you can't. It really has to partition the code. It has to do things to the code that you can't do with a library. You just -- maybe you could do it with expression trees. You could do it if you're an imposter, sure, yes. >>: The language [inaudible]. >> William Cook: Maybe. With fully reflective access, in Smalltalk you could certainly do it, yes, as a library. >>: But on the surface it seems like it's a high bar to get the JavaScript standard to include this. On the other hand, if you could eliminate the code that you just showed us for Amazon and rewrite it with a little bit less syntax -- >> William Cook: I think this is a fundamental idea. It's not just a hack. I really think this does have the merit to be considered as a real feature. And you just cannot solve the problem. It opens up a lot of things. The same way that LINQ opened lots of new ideas, you could have remoteness where the remote thing was your GPU, using batches. It creates a new kind of composition and modularity. And so it really is a new kind of control flow. Okay. So I have a thing called Jaba, batch Java, that's 100 percent compatible with current Java. I didn't have to make any changes to the syntax or static semantics. The trick that I used was that rather than introducing a batch statement, it turns out the batch statement in all the examples I gave is syntactically exactly analogous to the for-each statement. So I just use the for-each statement.
And the trick is that the normal rule, the requirement of the for-each statement, is that this thing has to implement IEnumerable or something like that, or else it completely fails. I say that if it implements a batch service interface, then the for loop means: for r in this service object, do this computation. So it's not even semantically that much of a hack, although it is a little bit of a pun. You can assign an interpretation of "for" where it says for this service, do this stuff remotely, where the service is a remote connection. Now if you want to you can put the batch keyword in. I didn't want to break my tooling. >>: You modified a Java compiler? >> William Cook: I have modified the [inaudible] compiler. And I'm currently modifying javac. I'm going to propose this as a potential enhancement for Java 8 or 9, and Oracle has said they're interested. So that's about the status at this point. So there are a lot of interesting opportunities here. Well, the first one is to write this for your favorite language, and I would love to help anybody out who wants to do that. I'm working on JavaScript right now. There are interesting opportunities for partially evaluating the batch. The way it works is actually configurable, in the sense that the mechanism of the batch system is that the language does the partitioning. So that's built in. But what it does with the batch script is up to a handler. The handler gets to decide how to send it to the remote server, or execute it locally, or do security checks on it, or partially evaluate it, or whatever. So partially evaluating that batch handler would allow you to actually translate the batch script into SQL in the compiler. What got generated would be completely straight JDBC with hard-coded SQL as the output of the compiler. And then there are also questions about multiple levels of communication: can we talk to multiple servers in a batch, can we talk to a server that talks to a server?
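The handler idea described above, where partitioning is built in but what happens to the resulting script is pluggable, can be sketched as follows. Everything here is assumed for illustration: the script format (a list of `(label, method, input_name)` triples), the `SERVICE` table, and the three handler functions are hypothetical and are not the Jaba API. The sketch shows one script being executed locally, sent through a simulated JSON wire format, or wrapped in a security check, without changing the script itself.

```python
import json

# Hypothetical server-side methods that a script is allowed to call.
SERVICE = {"upper": str.upper, "length": len}


def execute(script, inputs):
    """Run every operation in the script; return the outputs keyed by label."""
    return {label: SERVICE[method](inputs[arg]) for label, method, arg in script}


def local_handler(script, inputs):
    """Handler that simply executes the batch in-process."""
    return execute(script, inputs)


def wire_handler(script, inputs):
    """Handler that simulates one round trip using JSON as the transport."""
    request = json.dumps({"script": script, "inputs": inputs})   # client -> server
    decoded = json.loads(request)
    response = json.dumps(execute(decoded["script"], decoded["inputs"]))
    return json.loads(response)                                  # server -> client


def checked(handler, allowed):
    """Wrap a handler with a security check on the methods a script may use."""
    def wrapper(script, inputs):
        for _, method, _ in script:
            if method not in allowed:
                raise PermissionError(f"method not allowed: {method}")
        return handler(script, inputs)
    return wrapper


script = [("shout", "upper", "name"), ("n", "length", "name")]
inputs = {"name": "batch"}

print(local_handler(script, inputs))
print(wire_handler(script, inputs))
```

A handler that translated the script to SQL, or one that partially evaluated it at compile time, would slot into the same position as `local_handler` or `wire_handler`.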
Could we have -- this is the one I'm most interested in -- a case where you talk to a server in a batch but then the server wants to make some virtual calls back to the client? There are often callbacks that the server wants to make to tell the client information. So there really needs to be a return batch that the client then executes, maybe sending results back to the server. So you can do this really nice batch handshake idea. But I haven't worked that out yet. As I mentioned, doing it for GPUs would be fun. There's a lot of stuff about asynchrony. A lot of people say latency is a problem and we need to fix it by doing asynchrony. Asynchrony is the answer to everything. Well, asynchrony doesn't work if you have if statements, because you have to block. It's the same problem as with instruction pipelining, right? You get to the branch and you have to wait until you know the answer to know where to go. Batches don't have any problem with if statements, because they just send them over. However, it really is interesting to think about asynchronous execution of the batch script with the client. Because their control flow is synchronized, you can actually stream the results from the server and have the client start executing before the batch is actually done on the server. And that's what SQL Server does, too. So, lots of related work. What's fascinating is no one has ever come up with this before. Because it's so obvious. And so simple. And it works so well. I mean, it just kills RPC. It's so much better. You can argue maybe it's not better for SQL. You can argue that. SQL is pretty complicated. But in terms of RPC, it's just better. And I've never seen anything that is worse about it. If you want remote proxies you can do it. You can put those in. There's nothing that prohibits you there. >>: Can you contrast it with work like J-Orchestra, where you do automatic partitioning between the client and server?
>> William Cook: I don't know if you noticed the list of people on this; one is Eli Tilevich. He's working on rewriting J-Orchestra to use this. The thing that killed J-Orchestra was the inability to have an efficient communication model. We can actually -- you keep asking, do I need the batch statement. You could infer it. You could put them in. >>: At some level, if it's purely performance, then it's a question of how many round trips you have. >> William Cook: If you had an app that ran locally and you wanted to partition it into local and remote and add the batch statements at the same time, then that would be a cool thing to do, and Eli is thinking about that. So there are lots of other things that are related. I really like the fact that it's the inverse of deforestation. Deforestation is when you take an intermediate data structure and you get rid of it, whereas batches create an intermediate data structure in order to optimize the communication. So that's one of the reasons why I called it a forest. Okay. So we have papers on this. We've unified RPC with the ease of programming. That was the first paper. It's got a pretty good description of the basic concept. We showed how it could be made cross-platform with Web services. We have done some theoretical work on SQL, early work that was the precursor to this. POPL, and more recently there was a nice paper at the Database and Programming Languages conference. It's a symposium or workshop, I can't remember, but it was a nice meeting, and we show how to do the SQL compilation and translation. The actual algorithm is not in the paper because it had a really small page limit. But if you want to see the algorithm, I can give it to you. All right. So in conclusion, we have a new kind of control flow statement called batches. It partitions programs into local and remote. And it unifies lots of different notions of remoteness, and potentially more. It gives you efficiency, a clean programming model.
No queries, no proxies, no requirement to be stateful. You don't have the distributed garbage collection problem you have with CORBA and RMI, and it's language and transport neutral. You can use XML, you can use ASN.1, you can use JSON for the communication. The whole fetish about which communication protocol we're going to use is just irrelevant, I think. The big negative, as you mentioned, is you have to change your language. But I'm going to argue you cannot solve this problem without changing the language at some level. So that's it. Thank you. [applause]. >> Rustan Leino: Further questions? >>: You mentioned you have some syntactic rules for sorting out if it's a legal batch statement. >> William Cook: Yeah. >>: Do you find frequently, or do you find at all, that there are cases that you would like it to figure out are actually legal and it doesn't -- that is, where you would like a more detailed analysis of the batch statement? >> William Cook: We haven't found any of those yet. I understand this question is important, because it asks whether the technical approximation really corresponds naturally to the human intuition about what should work. It seems like it does. Those seem to be fairly consistent. I can't say that I've implemented everything -- I haven't used it enough to really push it to the absolute limit. When you release a technology like this people tend to do that. They push it to the limit. So it will be really interesting to see what happens. But my hunch is it's not going to be too bad. There might need to be some tweaks, but I'm not really worried about that. >>: Linked to that, the interesting thing is that in LINQ there's no check for side effects, right? It just goes wrong when you entangle LINQ queries with side effects in your lambdas. So it's interesting that you actually check for it. But I guess your check would be, as soon as you have a method call, assume that there might be an effect?
>> William Cook: Yes. >>: And if it's like maybe a property access or something, it will -- >> William Cook: You could declare it. So there are ways to kind of approximate it. But right now if you do a property access it's going to assume there's an effect, and that forces you to introduce a local variable that captures that value. The assumption is that values do not have effects. So that's the thing. You can always capture the value and then that value can be used freely. >>: And it will be -- >> William Cook: And it will be in the pre part, yes. >>: So one of the things about RPC is that, one, it's easy to use, but it also gives this illusion of the cost being similar. And what you're doing, you're actually creating a higher cost for these batches, right? A single round trip is going to take a certain amount of time, but a round trip where you send over a batch, have to do the computation, and bring back the whole thing takes longer, so you're introducing longer latencies. >> William Cook: I would argue that in the case of a single method call it's going to be the same. The batch doesn't do anything more for a single call than a normal RPC. I mean for a single call, if there's one remote call. >>: Okay. Right. Okay. Yeah. >> William Cook: The other thing is bandwidth is much more available than latency. Latency is the speed-of-light round trip time. So if you can reduce the number of round trips you can reduce the latency. You may use a little more bandwidth, but bandwidth is free. So bigger batches with fewer round trips will always win. That's my argument in the way that -- >>: But with RPC these calls take a lot longer to begin with. So there's latency that's been introduced -- >> William Cook: Yes, compared to a local call. >>: But then you're adding more latency. >> William Cook: I don't think we're really adding much more latency. I really don't.
Maybe you're talking about a few instructions more or something, but it's not -- it's not -- relative to the cost of the network, it's not really significant. The thing that it does do is that because you have to say batch, at least it's a clue that remoteness is happening. >>: Right. >> William Cook: So it doesn't hide it completely, that's the thing. All right. Well, good. Thank you very much. [applause]
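The latency argument in the closing exchange can be put into a small back-of-the-envelope model. This is a simplification written for this transcript, not a measurement from the talk: it assumes every remote call costs one full round trip under RPC, that a batch costs exactly one round trip regardless of how many operations it contains, and it ignores bandwidth and batch-processing overhead. The numbers (50 ms round trip, 1 ms of server work per operation) are made up.

```python
def rpc_time_ms(calls, rtt_ms, work_ms):
    """Each remote call pays its own round trip plus the server work."""
    return calls * (rtt_ms + work_ms)


def batch_time_ms(calls, rtt_ms, work_ms):
    """All operations travel together: one round trip plus the total server work."""
    return rtt_ms + calls * work_ms


for calls in (1, 5, 20):
    print(calls, rpc_time_ms(calls, 50, 1), batch_time_ms(calls, 50, 1))
```

Under these assumptions the two models agree for a single call, which matches the speaker's point, and the gap grows linearly with the number of calls in the batch.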