
>> Jim Larus: It's my pleasure to introduce Joe Hellerstein from Berkeley. Joe is a
database guy who has seen the light and decided to go and apply database systems to sort
of everything. I believe his first stop on the excursion was sensor networks, and he's now
moved on to parallel distributed computing.
And I saw this work a little while ago and thought that this was a very new, cool idea for
how to program very large-scale systems. And fortunately Joe and a bunch of his
students are up here today to talk about it. Most of you have probably seen the meeting
after this. They're also around this afternoon, so if people want to talk to them, there'll be
some time after 2 p.m. where you can get together with them. And just probably the
easiest thing to do is to just send me an e-mail if you want to get together with them, and
we'll work something out.
So it's a pleasure to introduce Joe, and welcome.
>> Joe Hellerstein: Thanks. Thanks. Yeah, so why don't I just jump right in, give you
guys a sense of what we've been thinking about over the last couple years. And because
the intention today was to bring up the gang and brainstorm, there's a lot of -- well, not a
lot, but at least some stuff at the end that we don't know yet, that we haven't finished
thinking about, so conversation-starter stuff.
So this is joint work with a whole team of folks including a collaboration with folks at
Yahoo!, and Rusty Sears moved from the top list to the bottom list in the last year. So
that's been a good collaboration for us and a lot of fun.
So what I'll try to talk about today is kind of our take on what cloud programming is
about. We can, if the audience thinks it's worthwhile, do a little quick tutorial on
Datalog, because that is the foundation for what we're doing.
My assumption is we want to do that, but stop me maybe, Jim, if you think that I'm --
>> Jim Larus: No, why don't you do it.
>> Joe Hellerstein: We'll do it. Okay, yeah. And that's going to cut into our time to talk
about the real stuff, but it's probably worth it.
And then our project, BOOM, I'll give you a look at what we've done to kind of prep for
it in a way, sort of we did an excursion or an experiment over the last year in that realm,
and then a flavor of where we think we're going with the language for BOOM, which
is Bloom, and then some research directions.
So I shouldn't have to talk too much about cloud at Microsoft, but our take on it is that
we're interested in it as a programming platform. And it's an unusual programming
platform. Every new platform brings different features and challenges. We know many
of the features we're seeing in the cloud, the shared environment, dynamic sets of
machines doing computations that evolve over time for services, and, you know, a lot of
the applications people are talking about are very data intensive and some of them also
session-centric, involving users on clients.
And of course for a platform to really be interesting, it needs programmers. And when
you look at something, like a new platform like the iPhone, you know, what's fun is there
are all these people thinking up crazy things to do with the iPhone that people at Apple
probably never thought of.
And so when you think about the cloud, you know, you ask yourself, well, if there's
going to be an app for that in the cloud, what are they and who's going to build them.
Right? I mean, we can name some of them now, but it's not a long list.
So how are we going to get the people who aren't in this room to innovate in this space,
right?
And this leads to the doom and gloom, which is in the title of the talk. You know, cloud
programming is really hard because it's parallel and it's distributed. And, you know,
there's this -- there was a blog post by Dave Patterson where he quoted John Hennessy,
and the two of them have this very dark view of the future of computing that they were
using to raise money from the federal government.
There's two ways to make someone interested in what you're doing. You can say my
stuff is really great, or you can say you're really screwed if you don't fund me. And
they went for option 2.
But Hennessy had this quote: When we start talking about parallelism and ease of use of
truly parallel computers, we're talking about a problem that's as hard as any that computer
science has faced. I would be panicked if I were in industry.
Fortunately I'm running a large university with a big deficit, so my life is easy.
But really it is. It's a challenge in this space because I think when the programming
model is so hard it stymies innovation.
So, you know, the programmers programming in the cloud today are either expected to
knock up single-node software and just replicate it. And that's fine. But that's not
innovative use. That's like running -- I don't know. That's like running a telephone app
on your iPhone. It's like, well, yeah, sure, but that's not going to sell lots of new iPhones.
Or, if you want to write something new, you have to orchestrate concurrent communication
and computation, tolerate delay and failure, in this elastic, minimally managed
environment. And there's just very few people who can be expected to program like that
in the traditional languages like C# and Java and so on.
So this is a roadblock for the folks we really want: the creative people who think about
what the users want, not about computers -- the kind of creative software developers who
aren't necessarily great developers but who have great ideas.
And think about all those iPhone apps, right? Those are not necessarily Ph.D.s in
computer science writing those things. But that's what makes it interesting.
And, frankly, you know what? Building stuff in the cloud is hard for you guys too, right?
Even if you have a Ph.D. in computer science or you're the world's best programmer, this
is a lot harder than writing single-threaded code.
And so our take on this is basically to look at inverting kind of the control structure or the
way we think about programs where instead of the kind of [inaudible] model where
there's a processor manipulating memory, let's turn it upside down and say that
computing is subservient to thinking about state.
And this is where, you know, Jim was sort of making fun of me a little bit, but I think
the database point of view may be helpful: you say, well, computation is secondary.
What you should start thinking about first is the information, the state of your system.
So that's state like session state, the state of the system at every node, protocol state,
permission state, and then of course like the data, such as it is, is in there too.
But everything about your computation really is about deriving, updating, and
communicating state. And that's a perfectly reasonable way to think about computation,
and it's a way that, you know, we're going to argue is easier to parallelize, easier to deal
with stuff like concurrency, because you're actually not thinking about that as your first
problem, you're thinking about that in service of managing the state.
So the strategy we're arguing could be a winning strategy is this data-centric design,
where you take all the state of your program that you would have thought of as data
structures or messages, that kind of thing, and think about it instead as real data,
first-class data. So everything that you're going to do in your program is going to be like
data in a database, because it matters.
All right. So you start with all your system's data as first-class data, and your basic data types
are going to be collections and streams. Right? Very much like you would have if you
had data you cared about. You wouldn't have one-off little things in your database that are
their own data structure. Right? You try to generalize, you try to come up with thematic
designs that are reusable and schematize things. And that's what we're going to say is the
right way to think about all these structures in your program.
And then computation, again, is modeling that data carefully and then reacting to changes
to it and evolving the state as you go.
And part of our argument is that this is a language-independent piece of what we're
doing. This is not going to be about Bloom per se, but many of these lessons, if you
stepped back, you could exercise them in a disciplined way, in a language like C#.
So just thinking about data first is a -- it's a design pattern.
And then the second part of our -- kind of our conceit is that you can use a high-level
language once you've done this, like a logic language, a declarative language, where you
write things as specs. So you're basically saying specifications for correctness,
specifications for handling events, safety of the system and safety of your transitions.
And that you write that in a very high-level language where you're really just specifying
outcomes. You get all the traditional benefits that database people like about data
independence where because your spec doesn't include the implementation, the
representation and the placement of data in the cloud is left underneath the programmer's
level of thought. It's left to some kind of system optimization.
And then I believe we're getting -- and we're not there yet, but we're getting a lens on
parallelism based on logic that is quite a bit simpler, I think is the way to say it, than
thinking about threads. And I'll try to give you a whiff of that today, although I think
really this is something I can only do kind of very sloppily right now, so it takes a while
for me to get a convincing argument. We're getting there, though.
Anyway, the point with both of these, whether you want to take the kind of
language-independent design pattern or you want to also buy into declarative
programming, the idea is to take things that seem to be hard problems and turn them into
dead-simple problems.
So we're not actually going to try to do anything fancy; we're going to try to make things
you might have thought were fancy and just point out that they're easy.
So, you know, concurrent programming sounds really hard. But data parallelism, like
MapReduce, is really easy. You know, parallel computing people like to talk about
embarrassing parallelism, so let's go find all the embarrassing parallelism we possibly
can. That's kind of at some level part of the agenda, is to make things that you might
have thought hard dead simple.
And, you know, on a fancier level, I think -- and this is getting to the understanding of
parallelism, when should you be coordinating in your programs. And one of the lessons
that's coming out from thinking about it is you should only be coordinating when you're
doing something nonmonotonic in your logic.
And a sort of simple thing to put in your pocket to take away as a slogan is you only
parallelize for counting. The only thing you should ever need to do -- I'm sorry. The
only thing you don't parallelize, the only thing you coordinate for, is to get a correct
count. Everything else that you program should just go data parallel.
And so really what you do, look at your program to figure out when do I need to know
how many of something there is, and everything else can be easy.
>> [inaudible] complexity in distributed systems was this nonmonotonicity.
>> Joe Hellerstein: Yes.
>> So how -- are you going to be able to hide it? How can you hide it?
>> Joe Hellerstein: No. No, I'm going to point it out very, very clearly, and everything
else you won't even worry about it being hard. So, like I said, the whole goal is to take
the easy stuff and say all that stuff is easy, stop thinking about it.
But when you're writing concurrent Java programs, it's very hard to see what's the
nonmonotonicity in the interactions between 25 threads. But we're going to see it very
clearly. We're going to say here's a program where all it's doing is accumulating
information, accumulating information, and then eventually it needs the count.
MapReduce. There's a whole bunch of map, and then there's a reduce. And everything
should be that easy to some degree. There's really no other reason to coordinate.
So let me go on, and hopefully we can talk about this more later.
So it's not like I'm making this up off the top of my head. This goes way back. There's
decades of theory of logic programming and dataflow. What's interesting, I think,
phenomenologically, is that there's a bunch of folks like me running around doing this
independently. It's been kind of cropping up all over the place. It's a sign of something
happening.
So a lot of people have been doing declarative and dataflow languages for stuff. So
obviously query processing and data analysis goes on. In addition to SQL we now have
MapReduce. That's been very interesting to watch. And my group here in yellow, we
started doing networking, both sensor networks and peer-to-peer networks. Distributed
computing is what we're kind of doing now. We've done distributed statistical machine
learning.
At Cornell they've been looking at multiplayer games, 3-tier services at Carnegie Mellon,
robotics at Hopkins, natural language processing at Stanford, compiler analysis, and
security. There's just a lot of smart people, and we haven't been coordinating almost at
all, even still, sadly, who've just been kind of going to the same toolkit to solve
programming problems in a whole bunch of domains.
>> [inaudible] focused on one?
>> Joe Hellerstein: Just basic general, like, Bayesian inference. Yep. And so I wrote a
blog post about this stuff about a year ago. I was invited to by the CRA and we tried to
start putting together a bibliography, at least a limited one, and we'd love contributions to
it if folks are interested.
Okay. So that was a motivation. So we're going to go back to school for a little bit,
okay, and I'll just teach you a little Datalog.
Okay. So I had to learn this stuff in grad school, and I hated it. And I still kind of hate it.
I'll try to make it as painless as I can.
So here's the basic idea. Here's the world through the eyes of a real database geek.
There's data, and then there's logic, which is the stuff that you know based on the data
and some rules. And that's everything.
Here's what logic looks like. If I know Q, then I know P. So that's the way to read that,
if I know Q, then I know P. That's like an implication arrow.
And if you want, you can think of this as SQL views. So it's an expression with a name,
P, but it's not stored, it's just an expression that you can compute from the data. So it's a
named query that's been stored as a query, not as the answer to the query.
Okay. And that's all of computing. Really, at some level, if you enriched your logic
enough, it's Turing complete. The only thing is, like, until recently the only people who
cared about this were Europeans.
Yes.
>> Where is time?
>> Joe Hellerstein: Time is going to turn out to be really, really important for us. So it's
interesting that you asked that, because it took us like many years to bother asking
ourselves that question. So thank you for asking it.
It will come up later. We can take care of enough time for our purposes in plain old
Datalog actually. Although, we'll put sugar on top of it. And I'll show you that in our
semantic stuff. Great question. Good.
So far there's no time. And, in fact, in Datalog there's no time, because Datalog isn't about
programming really. And so that's -- I'm starting with this pure logic thing that doesn't
think about time. It doesn't even think about state update. All there is is data and
what you can know from that data, and the world is static. And so that's kind of what
Datalog is, that's the environment.
So here's the classic Datalog example you're forced to learn in [inaudible] in Wisconsin
in 1992. Or in Jeff Ullman's class at Stanford. Not -- yeah.
So the canonical example is you have a family tree where it's stored as pairs of
parent-child tuples and you want to compute something about who's an ancestor of
whom. So over here is the family tree of IBM mainframes. This is upside-down from
the usual family tree.
And there's a rule that says if X is a parent of Y, then X is an ancestor of Y. So this is
going to be an inductive statement of what an ancestor is. Base case, if you're a parent,
you're an ancestor. Inductive case, if X is a parent of Y, okay, and Y is recursively an
ancestor of Z, oops, that's -- that's my little joke -- Y is the ancestor of Z, then you know
that X is an ancestor of Z. So that's the way to think about this.
Another way to read this is take the parent table, join it to the ancestor table, where the
second attribute of parent equals the first attribute of ancestor, and do that iteratively, or
if you like, recursively, until you can't compute any more ancestors.
This is just an SQL expression. It's a recursive SQL expression that joins these two
tables' second attribute equals first attribute.
Okay. All right. And then you can run queries over this stored relation. These two
things are [inaudible] together. And this is -- think of this as a stored query expression.
And then you can run queries over that query expression like you can run queries over a
SQL view. So you can say find me all the ancestors of some person S.
And convention -- so syntax is horrible in this stuff. Convention in Datalog is capitals
are variables and lower case things are constants. Just because.
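[For reference, the program on the slide reads roughly like this in Datalog syntax; the constant s is just an illustrative stand-in for "some person S":]

    % Base case: a parent is an ancestor.
    ancestor(X, Y) :- parent(X, Y).
    % Inductive case: a parent of an ancestor is an ancestor.
    ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).
    % Query: all ancestors of s.
    ?- ancestor(X, s).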
All right. So this is horrible. It's old stuff. It's no fun. Some notes. This join expression
by variable repetition is called unification, in case somebody wants to use that word.
Variables in capitals. This is called the head, this is called the body. The body implies
the head. The body can have multiple things that you're joining together.
And it's set semantics in Datalog, so there's no duplicates in any of these relations and you
don't generate duplicates and that's how you know that this will sort of terminate. You
find all the distinct ancestors that can be derived from this finite dataset.
So the Internet changes everything, right? This is not -- this is old stuff. But, in fact, you
know, really, if you look at what goes on on the Internet, it looks a whole lot like
ancestors and descendants, it's just links and paths.
So routers keep pairs of links. I am connected to you. And they want to find paths. Who
can I get to from me. So it's exactly the same program, I've just changed the names. It's
exactly the same thing. So this is a path finding Datalog query. It finds all paths from S
to anybody. It's the all paths query, which you don't really want to do, but it's a good
start. It's all paths. This is crazy in the Internet, right?
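[Roughly, the same program with the names changed:]

    % A link is a path.
    path(X, Y) :- link(X, Y).
    % A link followed by a path is a path.
    path(X, Z) :- link(X, Y), path(Y, Z).
    % Query: everything reachable from some source s.
    ?- path(s, X).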
What you really want -- well, before I go -- before we move on, let me focus on kind of
why Datalog is nice as opposed to, say, an imperative programming language. This is
just an expression of the output, very much like SQL. And what's pretty is that -- how
should you interpret this.
This is just logic. So this is just syntax. If you ever had to take a logic class, they'll beat
you over the head with this forever. This has no meaning, it's just a bunch of characters.
It means whatever you want it to mean. But let's have a convention about what it means
that makes some sense.
So the convention is going to be find a substitution for these variables that is consistent
with this expression, all right, find a substitution that is the smallest possible one so that it
doesn't contain any extraneous stuff. So that's a model for this program. It's a set of
assignments of the variables, such that if you deleted any of them it would become
inconsistent. All right. That's called a minimal model.
So you want to find assignments to the variables such that you can't take any of them away.
So if you want, given a fixed database of links, it's the smallest sensible output
database of paths. You could make up paths that aren't supported by the links, but that
wouldn't be minimal.
Okay. And there's this nice lemma if you want. I don't know, maybe it was a theorem
once upon a time. Datalog programs have a unique minimal model. So modulo
renaming, okay, there's exactly one minimal model for this program. That's really nice.
That means that I can take this syntax and there's no ambiguity about this natural
interpretation of it, which is the minimal interpretation. So this is well defined in a very
strong sense.
And another thing that's really nice for us as computer scientists is there's a very simple
recursive join strategy: keep joining links and paths, links and paths, links and paths,
sticking things in paths, until you can't compute anything else. That generates the
minimal model. Cool.
Okay. So now what do we have? We have an agreed-upon logical interpretation of the
program and we have a very natural algorithm that computes it. And it's the algorithm
you'd expect to run, too, so it all feels right.
So that's really nice. So there's no implementation implicit in the program, but the natural
implementation gives you the right answer. Very cool.
>> A question. For this program, if you ignore that path-X-S-?, it could be the empty
model, right?
>> Joe Hellerstein: Not if you had link populated with data. But, yes, I agree with you,
the empty model satisfies this.
So typically what you'll do -- and this is more annoying acronym terminology -- is have
some kind of stored database to start with as part of this; you would actually write it
down. You would say link from Joe to Bob, link from Bob to Sally. It would be part of
the program. That's called the extensional database, and you're not allowed to fuss with
that.
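[Concretely, the extensional database is just facts written down alongside the rules, e.g.:]

    % Extensional database (EDB): stored facts you're not allowed to fuss with.
    link(joe, bob).
    link(bob, sally).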
>> Your implementation is a purely SLG resolution?
>> Joe Hellerstein: We're doing bottom-up --
>> You're doing tabular and not [inaudible].
>> Joe Hellerstein: Correct. Yeah. It's bottom-up execution, actually dataflow. It's
really doing joins.
>> A traditional Datalog is semantic --
>> Joe Hellerstein: Traditional Datalog.
>> Not [inaudible].
>> Joe Hellerstein: That's right.
>> Okay.
>> Joe Hellerstein: Bottom-up [inaudible].
>> Okay.
>> Joe Hellerstein: And it's strictly bottom-up. And using magic sets and all this stuff
that makes bottom-up efficient.
Okay. So what did we do so far? We found all paths in the Internet, which is insane.
What we might want to do is find paths that are shortest paths. That would be sane.
And that's, in fact, what we do on the Internet. Let's write that down.
First let's change our schema slightly to have costs on our links. That's what C will be.
And then we can rewrite our program to propagate the costs. Okay? So if we have a link
from X to Y of cost C, then we're going to have a path from X to Y of cost C.
The other thing we want to do is actually construct these paths so that we can route along
them. So what we're going to do is we're going to keep next hops. If we have a link from
X to Y, then we have a path from X to Y, and if you're sitting at X, the way to get to Y is
that the next hop is Y.
The recursive rule is more interesting. Here's our costs. If you have a link from X to Y
of cost C and a path from Y to Z of cost D -- let's say that we're measuring something like
latency -- then the cost from X to Z is C plus D.
If we were measuring capacity, we might have taken the min of C and D, sort of just an
example.
And then the next hop thing, let's be careful. The next hop, if we know that there's a path
from Y to Z recursively, where you should go from Y to N and then onwards to Z, and
there's a link from X to Y, then there's a path from X to Z where the next place you go
from X is Y. And then of course from Y you would go to N and recursively you would
unroll the path.
Okay. So that's how you propagate next hops in this thing. I'll just make one note, which
is I already cheated and I'm adding stuff to Datalog. I added plus. Plus is a relation. It's
a ternary relation, two inputs, one output, that's infinitely big. It's all triples, X, Y, Z,
such that X plus Y equals Z. And we have infinite relations, so that actually extends the
expressive power of the language. So just -- I don't want to cheat. This is actually not
simple Datalog anymore; this is Datalog with function symbols.
>> [inaudible] form of that?
>> Joe Hellerstein: No. You don't want to get the whole plus relation stored. It's a drag.
Yeah. Although, memory is getting cheaper.
>> [inaudible] really easy to overlay. It's also invertible in the direction, so some
functions work, some don't.
>> Joe Hellerstein: Okay. But what we really want is best paths. So let's take the same
program. Okay. So that's the program we have so far, which compute all paths, their
costs and their next hops, and now for each pair source destination, look at all the paths
from that source to that destination, what's the cheapest cost among all those paths. And
that's what this expression says. For a source destination pair the cheapest cost is the
minimum of all the Cs that you see on this side that match.
This is like a reduce where the key is XZ, or, if you like, in SQL it's a group-by query where
the grouping columns are [inaudible].
And I've just extended the language again with aggregation. And that's a big deal. And
that's the thing where I said, you know, the only thing that matters is counting. Really
what I mean is the only thing that matters is anything that's aggregated. So that's a big
change.
>> [inaudible] in the angle brackets.
>> Joe Hellerstein: It's in the angled brackets. And that's a syntax we made up.
And then finally that just gave us the minimum cost. Now we need to find the path of the
minimum cost, the arg min. So we take the mincost path, we join it with the path and we
find that path from X to Z whose cost is the minimum cost, and its next hop.
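[Putting the whole program on the page, it reads roughly as follows; the attribute order path(Src, Dest, NextHop, Cost) is a guess at the slide's schema, and the angle brackets mark the aggregate:]

    % Base case: a link is a one-hop path whose next hop is Y itself.
    path(X, Y, Y, C) :- link(X, Y, C).
    % Inductive case: a link to Y plus a path from Y onward to Z; the cost
    % adds up, and from X the next hop is Y.
    path(X, Z, Y, C+D) :- link(X, Y, C), path(Y, Z, N, D).
    % For each source/destination pair, the cheapest cost over all paths.
    mincost(X, Z, min<C>) :- path(X, Z, N, C).
    % The arg min: the path whose cost equals that minimum, with its next hop.
    bestpath(X, Z, N, C) :- path(X, Z, N, C), mincost(X, Z, C).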
Now we've got shortest paths. All right. And, you know, five lines, pretty good. And
you could ask --
>> [inaudible]
[laughter]
>> Joe Hellerstein: Okay, okay, not fair, not fair. I was trying to gently introduce you to
the full language. I did not in fact make up this language to suit this example. But that's
a very fair critique of the presentation. As a teacher, I've done a bad job. Yeah. I don't
recommend we keep doing this anymore. We're going to stop. Good. I'm suitably
chastened.
All right. So we just extend the Datalog with aggregation. And the funny thing about
mins, right, is you can't know what they are until you've computed the full input. You
can't compute the min until you know all the path costs.
And that's why you need coordination. Whenever you want to count how many things do
I have, you have to have all the things. And the most basic thing of that is do I have
nothing. And so usually people talk about this in terms of negation, do I have nothing.
And you won't know that until you've tried computing everything.
So counting is a generalization of negation; it's just count equals zero. Yes.
>> [inaudible] recursive algorithm it's not going to terminate, because even if you're
asking [inaudible].
>> Joe Hellerstein: No, because there's no duplicates. So this particular program will
basically run these rules. And, again, there's a model-theoretic semantics and a natural
implementation --
>> [inaudible] longer and longer paths [inaudible].
>> Joe Hellerstein: Oh. If you have -- this is sort of loops in the Internet. It will, in fact,
go around cycles. This is very good. So it's a typical Internet thing. And, in fact, you
should annotate the program to not go around cycles, or -- and my student who's not in
the room has been thinking about this -- you could actually infer from this minimum, you
could propagate a constraint down to here that if you find a path with same source
destination that's bigger than something you already have, you could stop. So you can
push that constraint actually into here.
>> [inaudible]
>> Joe Hellerstein: No, no. No, no. That's just a Datalog-to-Datalog rewrite. It's not
stepping out of Datalog at all. You just put a less-than test in here so that you don't add
something to path if its C plus D is bigger than something that's already in path.
>> [inaudible]
>> Joe Hellerstein: Yeah, sure, sure. We have max and min and average and all these
things.
>> [inaudible]
>> Joe Hellerstein: Right. Or a min path length if you have negated costs. Same story.
And, you know, any algorithm then has a problem because there's no finite answer.
Sometimes you can test for that. There's conservative tests for safety of these answers,
and sometimes you can't. Because it's kind of a halting problem sort of thing in the limit.
So running around loops forever and checking for that should make you nervous, right?
Yeah. These are very natural and good questions that I'm glossing over a little bit
because time's short.
I do actually have slides for some of this that we can go over offline, or we can just
[inaudible].
Okay. I want to move on a little bit to talk about building real software with this. So so
far what have we had? We had a logic for path finding in the Internet, shortest paths.
That was all very cute, but like there's some database in the sky that we were computing
this on, which is not the way the Internet works.
I've heard -- I've read research papers where they said routing should be a service, you
know, something like Google would do Internet routing, that it's crazy that Internet
routing is this distributed computation.
But that's not the way the world works. So can we implement the things the way the
world works in logic? Can we basically build protocols from this specification? And the
answer is yes, really, really easily, using the first thing you might think of.
So all we're going to do is we're going to do what databases and MapReduce and
everybody else does, we're going to say logically there's a database in the sky. Physically
we'll take the rows in that database, or, if you like, you'll take the key value pairs, and
we'll partition them by key. We'll horizontally partition this table.
And how will we do that? Well, we'll make the programmer do it for the moment. So
we'll make them put a little @ next to the field that's the key, the field they want to
partition on. And we'll make sure it's of type address in some network addressing
scheme. So think of it as an Internet IP address for now if you like.
And then we'll place data, just the way you would in a parallel database, at the node that's
in the location specifier.
>> [inaudible]
>> Joe Hellerstein: That's relations, but relations -- a single relation [inaudible] relation
to the primary key concatenated with the fields that aren't in the primary key, which are the value.
I'm trying to be ecumenical by saying key-value pairs.
So let's do it here. The link table, remember this was source destination. We'll partition
it by source. So if you have a tuple that says to get from X to Y it costs you C and there's
a link there, it's stored at X. And that indeed is how network routing tables are
stored. You have a routing table that tells you who your neighbors are and who you can
send to.
Okay. And if you take the union of all those tables in the Internet routers, you get the
link relation.
All right. And paths are going to be partitioned the same. Everybody wants to know the
paths outbound from them. And here's where the fun begins. Links are partitioned by
the first attribute. Paths are partitioned by the first attribute. The join, if you like, the
unification, is on the second attribute of link, which is not the partitioning attribute, and the first
attribute of path.
And, by the way, I want the output partitioned by the first attribute because it's path
again. So my problem is I need to compare things. Here's an example of data. Node A,
node B, node C, node D on a network, a linear network. Here's the link table. It's been
partitioned by the first attribute. Here's the path table, same thing, partitioned by the first
attribute, so the As, the Bs, the Cs, the Ds. I need to put together this link and this path.
Right? And they're on different computers. Communication will have to happen to
achieve this specification.
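[In the Overlog-ish syntax, with @ marking the location specifier, the recursive rule is roughly the following; note that link in the body lives at @X while path lives at @Y:]

    % The body joins tuples stored at two different nodes (@X and @Y), so
    % communication has to happen to satisfy this specification.
    path(@X, Z, Y, C+D) :- link(@X, Y, C), path(@Y, Z, N, D).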
Well, how are we going to do it? And then the output's got to go back, right? The
output's got to go back. So how are we going to do it? We're going to do it in the way
that I always thought was kind of disgusting in graduate school with Datalog which is
we'll just rewrite the program. So we stay in Datalog but we're going to rewrite the
Datalog to an equivalent program. And it's kind of a little bit operational, frankly, but it's
still logic.
So we're going to say take -- introduce a new predicate, link_d, which is the link table,
but now partitioned by the second attribute. So it's just a copy, okay, but it's partitioned by
the second attribute. So it's going to look like -- you know, this tuple that we wanted to
look at is now there in the link_d relation.
And that's the whole link_d relation. And now our join is on a single node. This is
partitioned by Y, this is partitioned by Y, life is good. Now we can do a local evaluation
of the body of that rule, and then to propagate the head back is another communication.
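[A reconstruction of that rewrite, not the exact slide:]

    % link_d is a copy of link, repartitioned by the second attribute.
    link_d(@Y, X, C) :- link(@X, Y, C).
    % Now the body is entirely local to node Y; shipping the derived path
    % tuple back to @X is the second communication step.
    path(@X, Z, Y, C+D) :- link_d(@Y, X, C), path(@Y, Z, N, D).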
So this is actually the distance vector routing protocol that's used in Internet routers. It
does exactly this. But the way it's described is not in terms of where shall the data be.
See, if you look at this program, all it says is there is data laid out the following way, and
there should be data laid out this other way. It's pretty declarative.
When you read about distance vector in a networking textbook, what it will say is first
every node sends advertisements to its neighbors about its links, and the neighbors
respond to the advertisements in a certain way, and this is done iteratively. Just much
more operational. This really just says the data shall land in the following way. Uh-huh.
>> Do you advocate doing this sort of rewrite [inaudible] automatically or --
>> Joe Hellerstein: Yes. So one of the things we did in this early work on declarative
networking was say you could have rewritten this differently; you could have rewritten
this so we repartitioned path instead of repartitioning link. Would have been the same
program. And of course we didn't write either of those things. This is done by an
optimizer. This is what we wrote. We wrote something that didn't tell you how to do it
at all.
The optimizer was the one that rewrote it. You could rewrite it another way, and the fun
thing is, you know, I said this is distance vector used in the Internet, right, just replaying
history here. So exciting, right? If you rewrite it the other way, it's dynamic source
routing which is used in wireless networks. Cool. Right?
So a single specification, two implementations, one of which is better tuned to fairly
stable, fairly low-variance links, one of which is better tuned to very unstable,
high-variance links.
In the networking literature, that's not the way they think about it. They think about it as,
well, there's the wired protocol and there's the wireless protocol, and I can publish a
paper with a third protocol, and if I'm real clever, I'll come up with a hybrid protocol.
Right?
But every protocol is a research paper and implementation. Here what we're saying is no,
you know, there's physical properties of the platform you're running on, and you optimize
for those physical properties. And, by the way, the Internet is changing really fast. So if
you're writing one protocol for every combination of devices on the Internet, you're doing
something wrong. So that was kind of part of our message at that time.
>> [inaudible] want to do a hybrid version?
>> Joe Hellerstein: Cardinality is a bit of it, but a lot more of it in the Internet context is
variance of various things, like communication costs. It's not actually raw numbers; it's
the predictability of things in many cases.
>> [inaudible] cases that the protocols deal with like, I don't know, count to infinity
[inaudible] --
>> Joe Hellerstein: Yeah. So that stuff actually isn't so bad. But when you talk about
like the corner cases in like really looking at BGP and the way they do distance vector, I
mean, like any standard, they've just added tons of stuff. Can you write all that stuff
down? Yeah, but then it gets ugly because it's just ugly. The spec is ugly. And I can't do
better than the high-level spec at some point.
But the algorithmic pieces of the spec, like not counting to infinity, those parts usually
come out fine. And, you know, broad strokes, it's fine.
All right. And I'll try to justify that later in the talk with having built something real.
These are toys. Quite frankly, what we were doing a couple years ago was interesting
toys. And we were learning and that was the point.
So we had -- the language I showed you is something we call Overlog. It's a variant of
Datalog. We added aggregation and function symbols along the way just to -- you know,
as we needed them. But that's actually very standard in the Datalog literature. That's a
classic thing to do. We horizontally partitioned the tables, which, again, we're talking
about data, we never talk about messages. There's no send and receive in this language.
There's just data and its layout.
And then we added this thing called event tables which are tables that don't persist for
very long. So the data in those databases kind of evaporates quickly. And those can do
things like what is the state of the clock, so there's a tuple that tells you the current time
and it might arrive at a given time to sort of trigger things. What's a message coming off
the network, something that happened in your Java code maybe that's wrapping this stuff.
All right. So events can be passed in as data that's sort of transient.
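[For flavor, event tables get used in rules just like stored tables. A hedged sketch in the P2/Overlog style; the periodic built-in and the predicate names here are illustrative:]

    % Every 10 seconds at node X, a transient periodic event fires; react by
    % deriving a transient ping event at each neighbor Y.
    ping(@Y, X, E) :- periodic(@X, E, 10), link(@X, Y, C).
    % When a ping from Z shows up at node Y, answer with a pong back at Z.
    pong(@Z, Y, E) :- ping(@Y, Z, E).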
And then the execution model is these iterated single-machine fixpoints. So think about
this in a single node. We take a bunch of these events that were coming out of the
environment. We turn them into data, we stick them in tables. Freeze the world. On our
local machine, compute a Datalog fixpoint of all the persistent state we already had and
this transient state we just inserted.
But in the middle in phase 2 here, it just looks like a static database. So all these events
have been turned into tuples, we have a static database, do Datalog, compute the outputs,
and then do something with them. So generate more events, generate up-calls to Java,
generate network messages, et cetera.
Okay. So that was what Overlog was. There are problems with this. And for a long time
we didn't even have it this clean. We were really confused about when does, say, state
manipulation happen, when do inserts and deletes happen. How do you do atomicity.
Like all this stuff was causing us enormous headaches. And later in the talk I'll show you
how to model that. We understand this now.
In this model, which is pretty good, it's sort of an event handler with a Datalog engine in
the middle. The database update happens atomically here before you start again. So the
Datalog program you run in the middle of the loop at least is well defined. It's just plain
old Datalog in the middle. And you go atomically from fixpoint to fixpoint on every
given node in the system.
>> How does that happen? Because, you know, like the update, imagine you take the
Internet example, you update to some part of the [inaudible] would be happening in one
part of the world and then you [inaudible] fixpoint here. So how can you ensure that the
fixpoint happen atomically -- happens atomically?
>> Joe Hellerstein: So, to be very, very clear, this fixpoint is local. So all the facts you
have in your local partition are run to a fixpoint, which generates messages. The
language I showed you in the example I showed you conveniently looked global. Right?
And I said, oh, you partitioned the table. But it's really a logical global database.
It took us a while to just decide that that was a lie. We don't try to propagate that lie
anymore. But it's a lie. So you're absolutely right.
How do people really program that stuff? They think about it. You know, they don't lie
to themselves in that fashion. And I will stop doing that from now on in the talk. So
that's going to be one of my conclusions from the experience of programming Datalog is
that lie was a bad lie and we stopped using it. Yeah.
>> So I'm not -- in that issue [inaudible] Java program is going to run. But it seems to me
that [inaudible] because events might come and then you [inaudible] fixpoint [inaudible]
fixpoint [inaudible] across these timesteps.
>> Joe Hellerstein: Absolutely you're right.
>> [inaudible] all kinds of things.
>> Joe Hellerstein: Race is maybe not the term I would use, but the point is exactly well
taken. So, you know, look, when we're in phase 2, I sort of conveniently said whatever I
dequeued in phase 1 has been chosen. Phase 2 is all well defined. And from phase 2
phase 3 is well defined. And then crazy stuff happens. And then I dequeue it in phase 1.
So, yeah, races can happen before I start phase 1 again in terms of what [inaudible].
>> [inaudible] persists.
>> Joe Hellerstein: This program does.
>> From the left-hand side to right-hand side?
>> Joe Hellerstein: This database is state.
>> So once phase 2 finishes, I can just throw away the [inaudible] and then I have to
restart it from only the state of the database. Is that how we should think about?
>> Joe Hellerstein: Maybe the Java program thing is confusing. Think of it as there's
many clients off in the network, and network packets. These are going to be
asynchronous, in some sense, Java calls. So there should be no difference between
saying Java and saying network. Does that help?
>> But all the state is in the Datalog.
>> Joe Hellerstein: All the state is here. And then if you have state in the outside world
because you're a separate process, that's your business. I mean, you may. That's sort of
outside the model of this language.
But I think that the thing that is important to note is that the order of the events in this
queue and the choice of what to dequeue is not modeled in any way at all. And in that
sense there's definitely races here.
>> Yeah. The way I think about this is that the outer phases are asynchronous. But the
inner phase is synchronous. There's time applied to the --
>> Joe Hellerstein: Yes.
>> -- to the thing.
>> Joe Hellerstein: Yes.
>> And so it's subject to races as synchronously clocked flip-flops are. There aren't any
races. What? You know, there are no races in there.
>> Joe Hellerstein: Yeah, right. Right. It's all the things come in, you do your thing,
things come in. But the problem is the things are coming in in crazy order. You have no
control over that.
This is not bad. This is kind of like event programming with a really powerful
event-handling language. It's really not bad. And if it weren't for the horrible syntax,
actually, it might even be, you know, kind of attractive.
But it's not what we want. So I'm going to show you something better in a little bit.
What's happening is you guys are jumping to conclusions that it took us a bunch of years of
building software to believe. So, you know, four years ago we would have said no, come
on, it's not that bad, no. But now I agree with you. But give me a minute to get there.
I want to argue that this language is pretty good. And I'm going to show you some
anecdotes to argue for that. And with language design, you resort to anecdote a bunch, it
seems to me, not knowing anything about how you do language design, doing it as an
amateur.
So here's a snapshot. It's a screen shot, sorry, of a research paper in the sensornet
literature. This was I think the best paper in SenSys the year it was published. It's the
Trickle Protocol from Phil Levis. It's a code propagation protocol. So it's flooding.
You're trying to flood code to all the nodes in your sensor network.
You want to make sure every node has the latest version. So everybody's going to gossip
about what version do they have and give you a better version if they have something
newer. Except that there's contention in the radio space. You want to back off if
someone else is communicating. So you don't want to be gossiping if other people want
the channel.
And this is this pseudocode for the program from the paper. And there's an
implementation in the nesC event language, which is many hundreds of lines of very
opaque-looking embedded C.
This is an implementation in a variant of Overlog that runs on Berkeley Motes. So this is
David Chu's thesis work, or a little snippet of it, really. And this is the line-for-line
translation. And a lot of David's code here has comments that repeat what you see over
here. So this was a standalone version of this slide, so it's redundant. Almost line for
line.
So the way that someone who does sensornet thinks is actually pretty well captured by
this programming model. Which actually in retrospect isn't surprising because it is event
programming. And if you're a network person, that's a fairly natural way to write
protocols.
>> So I'm just going to comment as an aside from the networking perspective: it seems like
the -- how all the messages should be transferred has not been explicitly specified, and yet a
lot of protocols worry about things like backoff, retransmission --
>> Joe Hellerstein: That's right. This is backoff, right, because it's trying to avoid
collisions. So all this stuff about timers and doubling your timers if something happens,
it's all here.
>> So all that is expressed [inaudible] TCP-like logic for transfers and --
>> Joe Hellerstein: Sure. Yeah. TCP window setting is just a very simple stimulus
response pattern. Right? I wait for something that doesn't happen, I change a timer.
That stuff's not hard to say. The spec in Van Jacobson's paper is not very long.
>> Except there has been a lot of changes since --
>> Joe Hellerstein: Changes are good. And that's about making the network work well.
The programming model I think is -- the actual specification is still pretty simple.
You know, I would love actually offline -- I care less about network protocols than I used
to. But offline you can tell me how wrong I am in saying that. Because I'm sure I'm
missing stuff.
All right. So we're trying to move up the food chain from protocols because we would
run into people like you and they would tell us that we didn't know what we're talking
about. So, okay, fine, maybe not real network protocols, let's do peer-to-peer protocols,
okay, or sensornet protocols. So now we do peer-to-peer protocols. But let's pick at least
an interesting one.
So the next thing we did was we said I bet we could build a real distributed hash table,
content-based routing protocol out of this language. And we built Chord. This is
[inaudible] four years ago.
And we did a high-function implementation with all the stuff in the paper. So all the
things they said, you know, it's not really chord unless you add this and you add this,
because otherwise it won't perform. So we added all that stuff. And the whole thing was
still pretty compact.
And in fact this is it. Okay. So why is "Where's Waldo" on this slide? Waldo's right
there. He's in the middle of the bull's-eye, okay, is the point. So we took something that
you would have thought was hard and we just made it really, really easy, is the point
here.
So this is in the spirit of saying, you know, Chord shouldn't be that complicated. The
paper is not complicated. The routing layer of the C++ implementation from MIT is
10,000 lines of C++. For something that's well described in a ten-page research paper,
and most of the paper is not presenting the protocol, it's not presenting the spec, it's
talking about why it's interesting, it's presenting performance results. So, you know, I
would argue our spec, our implementation is pretty close to the spec. And when you read
it, that's true.
And I just want to make one other point, because it's easy to get into arguments about,
well, the code's dense, I don't really think it's any easier to read the C, you know, look, it's
one piece of paper, okay, and you can print it out and you can sit down, and I know the
syntax is weird. You can get it all in your head all at the same time. You don't have to
context switch. 10,000 lines of C++ you cannot print out, go to the coffee shop and just
read and understand. It just can't be done.
So we like to talk about orders of magnitude code size difference, because I think
anything else is kind of irrelevant. But when you're two orders of magnitude smaller,
now it's -- you're obviously saying things at a different level. That's the point I'd like to
carry away. Obviously you can't read this code. Yeah.
>> [inaudible]
>> Joe Hellerstein: Performance? Performance is fine, man.
[laughter]
>> Joe Hellerstein: So, in fact, you know, there were performance graphs in the paper.
Performance of a wide-area distributed hash table is not a very demanding exercise. The
performance actually was more or less fine. It wasn't great. We didn't try to make it
great. We didn't bother optimizing it. We got the same performance characteristics that
the MIT guys got.
The figure of merit, actually, is churn handling. So as nodes come and go from the
system, how long does it take for people to agree on the routes. And there, you know,
they would get like 99 percent agreement with some distribution of people leaving the
network, and we'd get like 97 percent. And at that point we're like, eh, close enough, let's
move on to the next paper.
So I think it really was fine, but actually I'm not on strong ground saying that, and so I
didn't even put it up.
>> I don't see a single primary key anywhere.
>> Joe Hellerstein: Oh, they're here. They're up there. And they're a terrible thing.
Primary keys are a terrible thing.
>> Why?
>> Joe Hellerstein: Because they mean update. If you have a primary key and you want
to add another tuple for the same primary key, that's an update. And that's
nonmonotonic. And that's what's happening between those loops and the state. We're
going to get all that right with my copious spare time when I talk about our new
language, Dedalus.
>> So suppose adding the key were really an event and it had a time associated with it.
>> Joe Hellerstein: Good. You're way ahead of us. Good.
>> [inaudible]
>> Joe Hellerstein: No, no, no, no. That's exactly what we're going to do. Precisely
what we're going to do. That's great.
>> Good idea.
>> Joe Hellerstein: It's been done before, I'm sure.
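[To give a taste of the flavor: in the later Dedalus work, which is not shown on these slides, state persists only because a rule carries it forward to the next timestep, and overwriting a key is just a deletion plus an insertion pinned to points in time. The syntax below is approximate:]

    % A table p persists by re-deriving itself at the next timestep,
    % unless a deletion event for that tuple shows up in the meantime.
    p(K, V)@next :- p(K, V), notin p_del(K, V).
    % Updating key K: delete the old tuple, insert the new value at next.
    p_del(K, Vold)  :- p(K, Vold), p_update(K, Vnew).
    p(K, Vnew)@next :- p_update(K, Vnew).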
Okay. So I've now brought you up to the present and we have 15 minutes left I think. Is
that right?
>> Jim Larus: We have a little more.
>> Joe Hellerstein: Okay, okay. I'll go fast.
At some level this is just taking that stuff and trying to make it realer in a different
application domain.
So the BOOM project is really to try to take this idea of orders of magnitude smaller code
and apply it to building orders of magnitude bigger systems, namely cloud systems. So
can we really write big-scale systems in this kind of code style that makes things much,
much simpler, qualitatively simpler. Orders of magnitude less code means qualitatively
we're saying things differently, we're saying it simpler.
And we're going to build on this experience, but we want to address a broader application
domain, because really fighting with networking experts about TCP is -- I lose. It's just
maddening. And I don't care enough. And most people don't, actually. So it's not a very
big pool to swim in.
So we're trying to actually -- and this we haven't gotten to yet, but we'd like to really
address real developers and not ourselves. We would like to get out of the academic
language design and into real language design.
So that's -- I got these new grad students to come in a couple years ago. And Neil and
Peter are sitting here. And I said, you know, we should design the next language. This
Overlog thing, nobody can program it. We're going to design a real language.
And they kind of walked in and said, you know, first of all, we just got here and we don't
even believe that your previous language is any good. And, second of all, even if we take
it as a given that you're our advisor and we have to believe you, we couldn't know how to
start designing the next language because we haven't built anything in the previous
language, so would we please, like, go build something and see if this is going to work.
Which was really good advice. It's really nice to have smart graduate students tell you to
not do what you think you want to do and just do something different.
So their idea -- this was with the advice of Tyson Condie, who had been
collaborating with the Yahoo! guys on Pig -- was let's just go build something big that's
cloudy in Overlog before we design the next language. In particular, let's build Hadoop.
And you know what? Let's really build Hadoop. Let's not cheat, so that when someone asks
us is it fast enough, we don't have to say we can't say. We're going to really build Hadoop.
So the goals are: convince ourselves that this makes some sense, and inform the design of the
better language -- certainly for cloud; maybe some day we'll look at this in the
multicore space. We'll see. And, you know, they're smart guys and they need to be
famous, so we better do some hard stuff along the way.
Turns out Hadoop is really easy to build. And you can build it -- anybody seen
BashReduce? It's MapReduce implemented in Bash. A trivial implementation of
MapReduce is really easy. So we wanted to do some stuff that was hard along the way.
And the metrics couldn't be just lines of code anymore because people were complaining
that that was a stupid metric. We really wanted to show that the code would be better, it
would be more flexible, more malleable, and it would perform.
All right. So here is the game: we would, first of all, rebuild Hadoop in Overlog. And
then we would extend it and sort of prototype where Hadoop's going to be five to ten
years from now, add availability, because, embarrassingly enough, both Google's
MapReduce and Hadoop have a single node as the master. It's a single point of failure.
And it's a scalability limit because it's a single node. And when -- in the GFS paper they
basically say, look, if you can't fit your metadata in one computer, buy a bigger computer.
It's just, like, so un-Googley, it's scary.
So we wanted to sort of say, well, they're not going to stay there forever, let's see how we
would extend those features. And then we wanted to do some monitoring along the way.
So this took these two guys and a couple other folks about, realistically, three months,
plus we wrote a new interpreter for Overlog and we were trying to be honest. So let's call
it nine months. But it was mostly like three to four months. So it was very easy.
So here's what we did. Pete and Neil rebuilt HDFS from scratch and they built what they
called BOOM-FS. The Overlog is about this big. And then they wrapped it up in a Java API
so it would be API-compliant with HDFS.
So Yahoo!'s Hadoop runs on BOOM-FS just fine. So it really is API-compliant. And
that was 1,400 lines of skin in Java over these 470 lines of Overlog.
And then the -- Tyson Condie, my senior student, he was interested in the interesting bits
of the MapReduce scheduler. Because most of MapReduce is really boring. The
interesting bits are like straggler handling and the scheduler. That's where people are
kind of innovating.
So what he did is he did a brain transplant. He scooped out the brains of Hadoop, which
was roughly about 6,000 lines of code -- everything else is plumbing; swear to God -- and
he replaced it with about 400 lines of Overlog.
So that's the orders of magnitude thing, roughly speaking, you know, 20,000 to 400,
6,000 to 400. This one's a little more impressive, but this one had to really -- brain
transplant is harder than [inaudible], right?
>> Is the checkpoint and restart capability of MapReduce in the Java or Overlog?
>> Joe Hellerstein: It's in the Overlog. Yeah. Yeah. I mean, you know, the -- actually, I
shouldn't say that. Putting things into HDFS is in Java. Only the scheduler is in Overlog.
HDFS is in Overlog, though. So the decision to put something at the end of the map
phase on a local disk, the decision to put something at the end of the reduce phase into
HDFS, that's in Java. But that's not interesting.
>> Yeah, but there's the failure handling question where you lose a node and then you've
got to go find a copy --
>> Joe Hellerstein: That's all here. The decisions about how to schedule the stragglers
when someone hasn't answered in a while, that's all here. All the interesting
decision-making and policy is in Overlog.
So just somebody wanted performance. So here's a super high-level performance graph.
This is all pairs of components. So over here we've got Hadoop on HDFS. Over here
we've got BOOM on BFS. And then we've got Hadoop running on our filesystem, our
MapReduce running on their filesystem.
And what you're seeing is cumulative distribution functions: the percentage of mappers
finishing in blue over time, the percentage of reducers finishing in red over time. You can
see reduces start after maps finish. They all look the same. The high-level takeaway here
is this is running a particularly boring word count example, I think, from the paper, the
usual crap. It's the same.
The reason it's the same is because it's not hard. This stuff is not hard. It's all spec.
There's no efficiency issues.
>> [inaudible]
>> Joe Hellerstein: Less, I think. But something like that. I think we may have tuned
that. I shouldn't claim. It's about the same. We are much more CPU intensive. I will
say that. We measured that and we were pretty open about that.
All right. But the fun stuff is now. So suppose we want availability. HDFS is a single
point of failure. If you go look at their issue trackers, there's, like, oh, you know, we
have a single point of failure. That's bad. Maybe we should do -- what are the best practices,
first of all, in the issue tracker? Best practice: put your metadata on NFS. That's the
Hadoop community best practice for protecting the metadata of your terabytes of files.
And we had lots of stories from friends of, oh, yeah, you know, actually we did lose
terabytes of stuff because we lost our metadata. That's just -- it's embarrassing. I mean,
Hadoop is a very immature piece of software. Much as it's a really exciting and enabling
piece of software for a big community, it's pretty immature.
So there is an issue tracker proposal for warm-standby, you know, you trickle logs to
somewhere else and you can at least recover as of some time in the past. We wanted to
do hot-standby and process pairs with a Paxos consensus. Because these guys wanted to
do feats of derring-do, right? Well, we weren't going to do no -- stinking
warm-standby.
And also we'd always wanted to do Paxos. Some folks at Harvard did Paxos in the
previous system back when we were doing Chord. They did a toy version. It would only
pass one decree, but it was still -- it was like, wow, you can do Paxos, okay, let's really do
Paxos.
So Pete did the Paxos implementation. At the end of the day, it was about 50 rules,
which is about as complicated as Chord. Okay. Paxos is pretty complicated; that makes
sense. And it's -- we're talking, you know, real multiPaxos, so it can issue many
decrees, it does leader election, it sort of -- it's a real service.
And, again, you know, so I showed you David Chu's slide. This is single-round Paxos.
This is Leslie Lamport Paxos. So it can issue one decree and then it stops. This turns out
to be really pretty.
So this is Lamport prose, and this is Pete Alvaro's Overlog. And it maps up -- maps really,
really nicely. Because Leslie Lamport thinks in invariants and our language is an
invariant language.
Unfortunately when you start doing leader election and you start doing multiple rounds of
things, most of the papers end up looking much more state-machiney, because most
people are not Leslie Lamport. And when -- Pete can tell you more about this, but when
he had to turn this into real like multiPaxos, things got less pretty.
But what is pretty -- Pete put this together for a workshop paper -- is kind of what he
found, looking back at what he had written in Paxos, the design patterns that are inside
there.
So at the bottom here, here's plain old Datalog: Selection, join, and recursion. Here's
what Overlog adds: messaging, aggregation, state update, timers. So the language
we provide is down here.
Above that he found he had sort of built some patterns. He did both two-phase commit
and Paxos, just to kind of compare them. So the pattern multicast was a standard primitive
in both of them. Counting was a standard primitive in both of them. And then choosing
amongst a group of people, which is a lot like counting.
Roll call. Who's around. Voting. Who's around and agrees. Dequeue and semaphores.
And these little components added up in various combinations to Paxos and two-phase
commit.
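To make that concrete, here is a minimal Python sketch of the counting and voting primitives described above: roll call, voting, and a quorum check built out of counting. This is an illustration only, not the team's Overlog, and every name in it is invented.

    def roll_call(responses):
        """Roll call: the set of peers we have heard from at all."""
        return {peer for (peer, _vote) in responses}

    def voters_for(responses, value):
        """Voting: the peers who are around and agree on this value."""
        return {peer for (peer, vote) in responses if vote == value}

    def quorum(responses, value, members):
        """Counting: a value is chosen once a majority of members vote for it."""
        return len(voters_for(responses, value)) > len(members) // 2

    # Example: a five-node group in which three peers accept ballot 7.
    members = {"a", "b", "c", "d", "e"}
    responses = [("a", 7), ("b", 7), ("d", 7), ("e", 9)]
    assert quorum(responses, 7, members)
    assert not quorum(responses, 9, members)

The point of the sketch is just that the same counting primitive shows up whether you are assembling two-phase commit or Paxos; only the rules layered on top of it differ.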
So I think one of the contributions by writing a really clean Paxos was to look at the
structure of kind of what are the primitives in a consensus protocol. And you can ask
something like, well, we cut this diagram here and gave you a language [inaudible] here,
but maybe if you wanted a domain-specific language for distributed systems, maybe the
cut's here or maybe the cut's here.
And this is the language-independent point. This is what are your libraries that you
should build. Should you just ship a Paxos library and be done?
So these are -- you know, this kind of gets you to what's a good DSL for distributed
systems. And I thought this was one of the things you could take away from our work
that wasn't sort of tied up in buying the whole farm, right? So I wanted to bring this up.
There's stuff happening because of the work we're doing where I think we're adding
clarity to the discussion.
Okay. So that was availability. So we added multimaster availability to the Hadoop
filesystem and Hadoop. The next thing we want to do is scale out.
All right. Scale out is great. So, you know, Google says if your master can't scale, buy a
bigger box. And that's what Hadoop does, because Hadoop does whatever Google tells
them to do. And our friends at Yahoo! are like, you know, this is actually a problem.
We've got really big Hadoop clusters at Yahoo! and we have problems with scaling our
metadata.
So how can we scale out the master to multiple machines? Well, really easy. So this is
the design that Neil and Pete had for the metadata in HDFS. There's a file relation.
There's a derived relation that's logical, which is the paths that you can derive recursively
from parent-child relationships. Right? Fully qualified paths are just recursion over
parent-child relationships. And then there's chunks that make up the file.
Partition these tables like a relational database, and then you have a scaled-out master
node. Yeah. That's right. So it took one day to do that. And the guy who did it was
actually the guy -- he's like our operating systems guy. He doesn't even -- at the time I
would say he did not understand Overlog because he kept doing weird things with it.
We're like no, man, don't do that. And it still only took him a day.
That's crazy that this only takes a day. If you tried to do this in the Hadoop code base in
the Java, it would be a nightmare. It would just be a nightmare. Because the assumptions
about this stuff being centralized are not in one place and the data layout is not in one
place.
And this is just a pure win for data-centric design. If you [inaudible] I have a collection
of stuff and I'm going to store that collection of stuff, then you can parcel the stuff out.
Right?
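As a rough illustration of that data-centric design point, here is a toy Python version of the metadata idea: a base parent/child relation, fully qualified paths derived by recursion over it, and the base table hash-partitioned across master shards. The schema and names are invented for the example; the real system expresses this in Overlog.

    # Base relation: (entry_id, parent_id, name). None marks the root directory.
    files = [
        (1, None, ""),
        (2, 1, "usr"),
        (3, 2, "local"),
        (4, 3, "bin"),
    ]
    by_id = {i: (parent, name) for (i, parent, name) in files}

    def fqpath(entry_id):
        """Derived relation: fully qualified path, by recursion over parent links."""
        parent, name = by_id[entry_id]
        if parent is None:
            return "/"
        return fqpath(parent).rstrip("/") + "/" + name

    assert fqpath(4) == "/usr/local/bin"

    # Scaling out the master is "just" horizontal partitioning of the base tables.
    NUM_MASTERS = 4
    def home_master(entry_id):
        return entry_id % NUM_MASTERS   # which master shard owns this row

Because nothing in the derivation assumes the base table lives on one node, deciding where each row lives is a separate, local decision.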
All right. So I'm going to skip this one because time is short and I want to move on. I'm
also going to skip bragging mostly. Except to say that, you know, we did build
something real and we got the benefits we hoped to get this time still. Orders of
magnitude smaller. And performance was really good enough in some measurable way.
Apples-to-apples comparison. But there's lots more we could do.
The thing is, this is not an end goal. We actually don't care about bragging about this
because we didn't want to build Hadoop. The world already has Hadoop, and, frankly,
Hadoop's kind of dorky.
So what are the lessons, right? So lesson 1 --
>> Change the talk when you give [inaudible] --
>> Joe Hellerstein: I'm going to give it at Yahoo! I think.
[laughter]
>> Joe Hellerstein: This whole thing about the fascinating ecosystem that's growing up
around this phenomenon that is Hadoop, yeah, which I agree, I don't talk about the
software artifact as much. Yeah. Now, Dryad, now there's a system. It's beautiful.
[laughter]
>> Good start.
>> Joe Hellerstein: That's a good start. Okay. So I try to partition the lessons of this.
We're working hard to make this something that you can take home in your pocket, and,
frankly, we still don't really have it nailed, as Jim has pointed out to us in shepherding our
paper.
But part of what we're doing here, like the scale out of HDFS, is really just about thinking
about the state of your system as a collection data type. You know, it could have been
MapReduce. It could have been Python comprehensions. It didn't have to be logic. Could
have, frankly, been a C list library with the appropriate programming discipline to not
assume that it's all in one place.
You could do that in any language. And I think, you know, this is just good design
principle for building distributed systems. Take your collections of data and make them
collections and put collection interfaces on them. Use LINQ. This is sort of just LINQ at some
level. I love that. That's great. And I'm not going to try to explain that.
And then part of it is the declarative stuff. So we did have recursion in the filesystem to
form paths, and we got to play with stuff like when to materialize the paths and when to
leave them unmaterialized as optimizations that were separate from the design. So
should we actually form the paths and store them or should we compute them on the fly,
that's all underneath the covers the way it would be in a database.
In other parts of the system we did things like dynamic programming, which falls out really
nicely in a logic language, and anything involving graphs and transitive closures, like
network paths, falls out nicely in a logic language.
And then as I said, you know, Lamport-style invariant specification is natural in a logic
language, because that's what you're doing, you're doing invariant specification. That's
what logic is. And so certain parts of distributed systems are a good fit. And actually the
tricky bits are a good fit.
And then finally -- and I didn't get to talk about this -- we did have this sort of
aspect-oriented ability to kind of go and express invariants over all the state, like make
sure that the number of messages sent is what we would have expected from Paxos. We
didn't have to look at any APIs to do that. We just expressed that as extra invariants on
the data. And so you get this kind of cross-cutting ability to put in debugging
specifications.
And because it's Datalog, the spec of the invariant is the check. You just write down the
rule. The engine enforces the rule for you or flags you when it's not true. So there's no
translation of your specs for debugging into some kind of implementation. There's no
tweaking of the code.
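As a hedged sketch of what "the spec is the check" looks like, here is such a cross-cutting assertion written as a plain query over a message log, rather than as instrumentation behind an API. The table shape, the round structure, and the names are invented for illustration; in the real system this would be one more Overlog or Dedalus rule.

    # Invented message log: (round, sender, receiver, kind).
    sent = [
        (1, "leader", "a", "prepare"),
        (1, "leader", "b", "prepare"),
        (1, "leader", "c", "prepare"),
    ]
    acceptors = {"a", "b", "c"}

    def prepare_fanout_ok(sent, acceptors, rnd):
        """Invariant: in a given round, exactly one prepare reaches each acceptor."""
        targets = [rcvr for (r, _s, rcvr, kind) in sent
                   if r == rnd and kind == "prepare"]
        return sorted(targets) == sorted(acceptors)

    assert prepare_fanout_ok(sent, acceptors, 1)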
>> But there seems like there's another aspect of it, which you haven't talked about too
much, which is a major point of leverage, which is essentially the dealing with race
conditions and disambiguation of race conditions by using timestamps. That's a big part
it too.
>> Joe Hellerstein: There are a lot of limitations. And that one falls in here.
All right. So, first of all, the syntax is hard to write. That's fine. But, worse, it's hard to
read. So it's not just that people don't like it, it's that I don't like reading my own code,
which I don't write that much of. These guys don't like reading their own code. It's just
it's very terse syntax. It's not a pleasant syntax.
More to the point, and this is what was brought up before, this idea that there's state all
over the world and that you express invariants over distributed state is a lie. And so we
stopped using it. These guys are better developers than I am, I think, and they just didn't
use that feature of Overlog in their code. They didn't use distributed rules in essence.
They work in special cases, like totally monotonic things, like paths in the Internet.
Routing protocols actually are mostly monotonic, so you can get away with that kind of
nonsense. They don't work for stuff like Paxos.
However, if you have Paxos, you can do it on top of that, because then you have a
consistent global view.
So it has its place, but it's a lie in general, and so we mostly didn't use it. And you don't
want a language that promotes a lie, because you'll write bad code and you won't
understand why it doesn't work.
Here's a really interesting one. And obviously I'm going to stop the talk now and we're
not going to get much further, but this one's come up kind of recently. Neil pointed this
out. Distributed invariants are a very natural thing to specify. The way that they're
specified in most systems is actually through protocol, which is not declarative.
So I'll give you an example. Everybody knows from reading the Google filesystem paper
that there should be three replicas of every block. So isn't an invariant of GFS that there
are three replicas of every block? No. Sometimes there's two replicas and then you do
something.
When there's not three copies of a block, then a heartbeat doesn't come to the master and
the master thinks maybe there aren't two anymore and it initiates some copies and
lalalalala. What is the invariant that they're going after with this? Could you even write
down what their goal is with this protocol? It's something about there should almost
always be at least one copy of the block, I think. Or the mean time to a block getting lost is way
lower than something. But it's not like they wrote that down.
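To make the contrast concrete, here is a toy sketch of the nominal "N replicas of every block" statement written directly as a checkable query over a replica table, which is exactly what the protocol-style systems do not write down. The schema is invented for illustration.

    # Invented replica table: (block_id, datanode).
    replicas = [
        ("blk1", "dn1"), ("blk1", "dn2"), ("blk1", "dn3"),
        ("blk2", "dn1"), ("blk2", "dn4"),
    ]

    def under_replicated(replicas, target=3):
        """Blocks that violate the nominal 'target copies of every block' invariant."""
        counts = {}
        for blk, _dn in replicas:
            counts[blk] = counts.get(blk, 0) + 1
        return {blk for blk, n in counts.items() if n < target}

    # A repair action could key off this set; GFS and HDFS instead encode the
    # goal indirectly in heartbeats and re-replication machinery.
    assert under_replicated(replicas) == {"blk2"}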
So even writing down these specs for what do you want out of a distributed system is
not -- there's no art yet for that, really. And even if you wrote them down then, we
couldn't enforce them because we can't even check them. So suppose that the invariant
was indeed 99 percent, you know, mean time to failure is whatever for a block. Like how
do you test that. How do you translate that spec into implementation. I don't know.
So we're actually interested in this. And we think we have some ideas for sort of really
software synthesis, I guess, from specification.
Got to start with testing and then go on to enforcement. But the goal is to really be able
to do distributed invariants. Right now we could really only do single-node invariants.
And I think for distributed systems design, this is generally important and hard and not
well treated.
Okay. As has been pointed out -- Jonathan pointed out -- state update in Overlog is outside
logic. It's illogical. And a bunch of people have been starting to pay enough attention to
us that they've pointed this out in public. So they've written papers saying, hey, you
know what, P2, the system you built for Overlog, like it does some crazy things that's not
what you say in your paper. And what you say in your paper is crazy too, but it's just
crazy in a different way. And this is all true. So this is a problem. Okay. And it was -- it bit these guys when they were coding as well.
Another thing we really haven't dealt with is, so, is it better than Erlang? Why? You
know, I don't know. I actually don't really have a good answer. I have ideas for answers.
And that's one reason I really want to talk to you guys about [inaudible], is I want to
understand maybe by hearing each other's pitches for what's good, we can get more
clarity on the pieces you like.
All right. So I think with my very brief remaining time --
>> Is "good" the right question or "necessary" the right question?
>> Joe Hellerstein: Elaborate, please.
>> Well, as soon as I have a monad, which is LINQ for Pascal or LINQ from any of the .NET
languages, dot dot dot, I can make this happen. I mean, I can embed Datalog inside of
any of these and make sure that all the state is contained in the declarative part in all the
rest of that. And once I have that, now you're talking about programming practices. But
better -- they're all [inaudible] they could actually all interact with each other.
So I think it's more of what's necessary. You know, C++ doesn't give me what I need by
itself to actually pull this off. Now, we can enforce it. I can give you libraries you have
to live inside of and all the rest of that. The language itself doesn't help by itself.
>> Joe Hellerstein: I have a real hard time with that line of thought because, of course,
anything Turing-complete is fine. As necessary as -- it's like I understand it from a
complexity point of view, but everything after that is programming practice. I mean, so
it's -- I think we would just agree it's all about good. And your definition of necessary is
really a strong version of good.
>> Sure. Okay.
>> Joe Hellerstein: Yeah. And --
>> You might want to say which one of these makes it more automatic to do the right
thing.
>> Joe Hellerstein: Yeah. Yeah, maybe. Or how easy is it to shoot yourself in the foot.
Something. And we haven't really looked at it from that angle. That would be really
useful, actually, just example of design patterns you can easily say in these languages that
are bad. Yeah. Good. Thank you. That's great.
All right. In my very brief amount of time, which is negative -- what would you guys
like? One thing we could do is we could say that was interesting, let's take that break,
and if you've had enough interesting for today you can go, and then a few of us can stick
around for another 20 minutes. Does that sound good?
>> Jim Larus: Yeah, that's probably [inaudible]
>> Joe Hellerstein: Okay. I will not be offended if you take off. We're mostly going to
talk about language ideas at this point. So if this was the part of the talk you had wanted
to hear, I apologize.
[laughter]
>> Jim Larus: It's all on videotape [inaudible].
>> Joe Hellerstein: Oh, okay. Then you can suffer through it again is what you're
saying. Okay.
>> Jim Larus: [inaudible]
>> Joe Hellerstein: There you go. Okay. That's good. Okay. So the next thing I want
to talk about is the foundation logic for Bloom, for our new language which is called
Dedalus, which if you're not catching the reference, you can talk to Peter afterwards
because he was an English major and I wasn't in college.
So Dedalus is the foundational language for Bloom, as I said. It's the Datalog in time and
space. And the first observation that sort of came out and somebody -- right away one of
you guys pointed this out was, time has got to be really important. And in fact space
doesn't matter. So distributed systems aren't about space, they're about time.
Basically it doesn't matter whether the thing is like in the next room or two rooms over or
if the thing is in Albuquerque or it's in New York or if it's Beijing. That only matters
from a kind of performance perspective. Once it's far away and you can't clock with it
and you can't do things atomically with it, it's distributed.
So distributed systems are not about space, they're about whether you can atomically do
stuff in time together. And that's the thing we should be reasoning about as
programmers: time.
And so in Dedalus, there's three things you can say. You can say something is true now,
something should be true instantaneously after now, or something's true asynchronously.
>> Eventually.
>> Joe Hellerstein: I don't like that word "eventually." Because -- yeah. People have a
lot of connotations for that one. So I prefer later. Or actually we just say async. Yeah.
The other thing -- so that was the first sort of conceit was that Overlog was all about
partitioning in space, and that was wrong. It should be about partitioning in time.
The other conceit I think in the language, which is liberating, is that there's no database.
Every fact is transient. Every fact is true only for one instantaneous timestep.
>> Sounds like it's streaming.
>> Joe Hellerstein: It's very much a streaming system. Yeah. Absolutely. But -- yeah.
But there's windows. Or there's no built-in windows.
So every local node has atomic timesteps. So on a given node a fact is true for a single
atomic timestep, and that's it. All right. And I think you pointed this out before: Put a
timestamp on every tuple, if you like. Same effect. And it's a local clock timestamp, a
local, logical clock timestamp.
And what's [inaudible] so what is persistence, what is storage. It's just induction in time.
And what this is going to give us is a model theoretic framework to talk about side effects
and update, which is really quite cool. So all this stuff that we didn't get before about like
what does it mean to do update and when do updates happen, that's all going to be logic
now. And it's going to have a unique minimal model and all these lovely things we had
for Datalog. So I'll show you this in just a sec.
And then delay and failure are just nondeterminism in assigning timestamps. So all the
things we worry about in distributed systems will be asynchronous specifications with
nondeterminism in the timestamps.
And there's actually been work -- there's nice work on capturing nondeterminism in
logics that we just lean on for this. And so we didn't have to invent anything.
So here's sort of a core Dedalus example. I know some event E, which has attributes A and
B, and a timestamp attribute. Every table in Dedalus has a timestamp attribute, the last
attribute.
If I know A,B is in E at time T, then I know A,B is in P at time S. And let's say T equals S.
If you write a rule like that, that's a plain old Datalog rule. It's deductive. It says if I
know E(A,B), I know P(A,B) at the same time. That's plain old Datalog.
If instead of T equals S you say successor, this is next. This means instantaneously in the
next timestep: if I know E at one timestep, I know Q at the very next timestep
atomically. With no intervening state of Q.
And then asynchrony just says if I have a time T, then of all possible times in the
universe, which, again, we will not store, all possible times in the universe, this is the
logic expression from our friend -- from our Italian friends at UCLA for expressing
nondeterminism keyed on A, B, and T: choose an arbitrary S. Right? That's just the way to
think about it. It's like for each A,B,T triple that satisfies these two rules, you pick an
arbitrary S that is in the time relation. This is just basically rand, if you like.
>> Presumably that's S greater than T.
>> Joe Hellerstein: Ah. We actually like to say it doesn't have to be. And there are
cases where it really doesn't. And this is all just logic, man.
[laughter]
>> Zero of time back to time.
>> Joe Hellerstein: It turns out it has to do with how much nonmonotonicity is in your
program. So if you have a totally monotonic program and all it's doing is accumulating
stuff, you can have it infer stuff that was true before. And it will catch up. And it --
Pete, what's the quote, the MIT quote?
>> Peter Alvaro: Time is a device that was invented to keep everything from happening
at once.
>> Joe Hellerstein: All right. But in most of our programs, because we do want to admit
nonmonotonic update, it makes sense if you do it in monotonic time and it doesn't make
sense if time has cycles. Yeah. So update gives you headaches if time doesn't go
forward.
>> If you put a range limit on S, do you get the [inaudible] algebra out of this?
>> Joe Hellerstein: I don't know the answer to that because I don't know what you're
talking about.
>> Well, the [inaudible] algebra is two times, not one.
>> Joe Hellerstein: Oh, your algebra.
>> I'm saying put a range. Put a range on S. That gives you the two times. You say it
has to have [inaudible] to be true.
>> See, but every tuple here only has one time. The tuple is not true for an interval --
>> No, no, no. The tuple has one, but you can put an interval on top of that and say I will
do the async random deferral during this period that that tuple now applies.
>> Joe Hellerstein: You could probably say something like this fact can only be true if it
arrives within a certain amount of time.
>> Within an interval.
>> Joe Hellerstein: That's fine. You could say that. Sure. I don't know how that maps
to your language. That would be hard work for me to -- I'd first have to really understand
what you're doing instead of vaguely understand what you're doing.
All right. Here's the sugared version. So you don't really have to write that. That's
horrible. If you want to write a Datalog rule, write a Datalog rule. Leave out the
timestamps. If you want to write a Datalog rule with next, say @next. And if you want
to write later, say @later or @async. So that's the sugared version of Datalog -- Dedalus.
Here's where it gets fun. Persistence. Persistence is induction. If P is positively true
of A,B, then P is positively true of A,B at the next timestep. So this is just infinite storage of
that fact. It's true now, and if it's true at X, it's true at X plus 1. Persistence through
induction.
You don't want to implement that. That's like RAM. That's like implementing DRAM in
your software, which is just a bad idea. You have to keep refreshing all your facts. But
it's a fine model. And, after all, this is just the model. The implementation is
independent.
Now let's suppose you want mutable state. You want to do deletion. Let's have a
convention. For every predicate P, which we'll call P positive for now, we will also have
a predicate called P negative by convention. And if you like, this is, you know, like Ruby
on Rails. You must name your tables the way I say so because I invented it.
But in fact we can put sugar on top of this, so you just say P and I say,
oh, you mean there's a P_pos and a P_neg, and I do that under the covers.
Now --
>> [inaudible]
>> Joe Hellerstein: Hold on. Let me do this because it's fun. And then I'll -- whatever
you want to do which is less fun. I'm sorry.
[laughter]
>> Joe Hellerstein: The arrogant bastard. So all right. If we know that A1, B was in
P_pos at some time, and it was not also in P_neg at the same time, then it is in
P_pos at the next time.
>> [inaudible]
>> Joe Hellerstein: It's a bug. What A? Oh, this is on video.
[laughter]
>> Joe Hellerstein: Go back in time. It's nonmonotonic. All right. There's no A, man.
But think about this for a sec. Let's look at this example. P(1,2) is true at timestep 101.
Very nice. It's in P_pos. Okay. P(1,3) is true at timestep 102. Fine. P_neg has
(1,2)@300. Does P_pos contain (1,2)@300? Yes. P_pos does not contain (1,2)@301.
Because what we're saying is if a fact is currently true and has been asked to be negated right
now, it will be gone in the next timestep. We break the induction.
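Here is a toy Python rendering of that example: persistence as induction from one timestep to the next, with P_neg breaking the induction. This is the model only, not an implementation; as the talk says, you would never literally refresh every fact on every tick.

    p_pos = {(1, 2, 101), (1, 3, 102)}   # facts asserted at their timestamps
    p_neg = {(1, 2, 300)}                # deletion requested at timestamp 300

    def step(state, p_neg, t):
        """Carry every fact true at t forward to t + 1 unless it is negated at t."""
        carried = {(a, b, t + 1)
                   for (a, b, tt) in state
                   if tt == t and (a, b, t) not in p_neg}
        return state | carried

    state = set(p_pos)
    for t in range(101, 305):
        state = step(state, p_neg, t)

    assert (1, 2, 300) in state       # still true at 300 ...
    assert (1, 2, 301) not in state   # ... but the induction is broken at 301
    assert (1, 3, 301) in state       # untouched facts keep persisting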
>> Do you want a comma there in the first line?
>> Joe Hellerstein: I put a period.
>> No, the first line.
>> Joe Hellerstein: No, that's just bad -- that's bad, you know, keynote wrap. This is the
arguments of P_pos.
>> Oh, it is -- it's the argument of [inaudible].
>> Joe Hellerstein: Yeah, yeah, yeah. Sorry.
>> Looked like a separate [inaudible].
>> Joe Hellerstein: Sorry. So but the point is the induction has been broken at timestep
N, and that appears because there's no induction to timestep [inaudible].
This is actually extremely cool. Because now what we've got is this is all just Datalog.
There's no updates. There's no nothing. This is just good old-fashioned Datalog being
used in a very stylized way. Which means we have a beautiful, lovely model-theoretic
treatment of state update and side effects. And all is -- all the tools we know about in
Datalog can be thrown now at these programs. Jonathan.
>> [inaudible]
>> Joe Hellerstein: We have windows.
>> These are windows.
>> Yep.
>> Joe Hellerstein: They're little bitty windows.
>> They're -- you can create a rule that says for each fact [inaudible] to P_pos you add a
P negative with the timestamp [inaudible] the original timestamp plus some constant.
>> Joe Hellerstein: You could absolutely implement windows. Yeah.
>> You also got rid of the Prolog retract. You don't need it.
>> Joe Hellerstein: Which I wouldn't have wanted in the --
>> No, no, that's the point, is this works correctly in a logic framework without having to
do --
>> Joe Hellerstein: Yeah. Actually, you know what would be -- if I cared about such
things, and, you know, what I care about seems to change over time, because I used to
think programming languages were really boring, if I cared about teaching the Prolog
community something I guess we would go back and do cuts and all this kind of stuff and
retracts.
>> Oh, with this?
>> Joe Hellerstein: With this.
>> Yes.
>> Joe Hellerstein: Right? Basically you implement --
>> But it doesn't fit this [inaudible].
>> Joe Hellerstein: You do. But you have to implement the Prolog interpreter in
Dedalus.
[laughter]
>> Joe Hellerstein: Which is awesome. So we implemented a query optimizer in
Overlog which did bottom-up search. And then somebody on the review said, well, you
say it's extensible, can you do top-down search? And we did: in a bottom-up language
we implemented a top-down search. So, yeah, it's very doable. It would be -- like I say, if
I cared, which I actually right now don't.
>> [inaudible] parallelism that you have in pure Prolog by getting rid of cut and get rid of
retract that you can now do the logic and this thing now parallelizes like crazy. It's good.
>> Joe Hellerstein: Thank you. Okay. Once we're in Datalog now we can start
playing -- writing theory papers and playing games. So we sent a paper off to PODS
where we took all the -- we took the two classical analyses that you do to Datalog
programs and we recast them in this temporal model so that we could get their benefit.
So one is stratification, which is finding out where your nonmonotonicity boundaries are
and organizing your program so that you step through them in a way that has a unique
minimal model and that has a natural implementation.
And we show that you can do that with this temporal extension. And it does exactly what
you just described. It tells you that there's a whole batch of stuff that because it doesn't
count essentially you can do it as parallel as you like.
And so it points out in these programs, even with state update, it tells you where your
races are at some level and it tells you to wait for them. It tells you to hold out and wait
until that race is done.
And somebody was talking about lock-free stuff. What's lock. What's a semaphore.
Counting semaphore. Counting, counting. Didn't I just say everything's about counting?
Yes, it is. What this is going to tell you is where your locks should be. And if you want,
they're counting. Sure. Yeah. It sounds really cosmic. I don't know if it's true. But it
sounds good.
And then also safety. When you have conservative checks that can tell you your program
will terminate, we can translate those to checks over time. Okay. And that's really useful
sometimes.
And the key is that time -- and this gets to your point about going backwards in time,
time is a source of monotonicity and we can use it in our analysis. If you know the time's
going to march forward, you can say stuff about what your program is doing. You can
actually take your program, you can unroll it into a plain Datalog program with lots and
lots of these strata laid out one timestamp at a time. And then you can collapse the strata
if you do something totally monotonic across two strata.
So time is the source of monotonicity because facts from the past cannot negate
themselves in the future.
And therefore many things that look -- actually the classic examples of nonmonotonic
reasoning from the database theory textbooks are fine in time. They look like flip-flop
programs. And they're meaningful. They're well specified. They don't terminate, but
they're well specified. You toggle between your beliefs. I win, I lose, I win, I lose. It's
true, it's false, it's true, it's false. It's all fine. It's all well specified. And so the static
checks are easy.
Wow. Yeah. Okay. Good. So a little more formalism. This is not yet written down and
we haven't figured out what to do with it. This is Pete's drawing of what Dedalus does. It
takes -- these concentric circles are like strata in a single Datalog fixpoint. So if you
want, it's just the single Datalog fixpoint.
And then, you know, in Overlog what we'd say is state update happens, and then you do
another fixpoint.
In Dedalus what we're saying is that these are next rules. So these are the facts you know
at the next timestep, which define a selection on the database at that timestep.
And then there's these asynchronous rules that go through the randomizer that pop in at
some later timestep. And, you know, there are these foreign agents sending us data from
across the network. Because, after all, we want to go back to doing distributed systems.
We can't control when the messages arrive. And they could be written in other
languages. They could be written in Java.
So these arrive at some random amount of time, some random times. If everything you
write is just nexts and you don't have any laters and you don't have any foreign things,
your entire program's outcome is specified in the first timestep by the base data and the
rules.
The only nondeterminism, the only thing you don't know logically from the very
beginning is these nondeterministic timestamp assignments, and of course the data and
the timestamps of arriving stuff. That's what you want to describe in terms of the
semantics of your program. All the semantics of the next stuff is here. It's given at the
beginning. The rest of the semantics are captured in the timestamps on these messages.
So call that a trace. It's just an assignment of timestamps. And now really what we want
to talk about with program correctness is what are equivalence classes of traces with
respect to certain properties.
So like Church-Rosser property would say can I prove that all traces -- see, ignore the red
stuff. But all assignments of delay give me the same answer. That's the Church-Rosser
property. It says all traces are equivalent. It's giving me the same output database.
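Purely as a conjectural sketch of that idea, here is a Python toy that enumerates delivery orders (traces) for a tiny program and checks whether they all produce the same final answer. A monotonic accumulate-everything program is confluent; a "first arrival wins" program is not. Everything here is invented for illustration.

    import itertools

    messages = ["a", "b", "c"]   # payloads arriving over the network

    def run_monotonic(order):
        """Accumulate everything: the answer ignores delivery order."""
        db = set()
        for m in order:
            db.add(m)
        return frozenset(db)

    def run_first_wins(order):
        """Nonmonotonic: the answer depends on which message arrives first."""
        return order[0]

    traces = list(itertools.permutations(messages))
    assert len({run_monotonic(t) for t in traces}) == 1   # all traces equivalent
    assert len({run_first_wins(t) for t in traces}) > 1   # traces diverge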
But you could have weaker forms of equivalence. You could have equivalence classes on
traces that preserve something else; like, I don't know, the numbers in the bank accounts
add up the same even though the timestamps on the issuance of the checks are different.
And so I think we have a nice tool to talk about things like eventual consistency. I think
you said something about eventually. Eventual consistency, loose consistency [inaudible]
in terms of trace equivalence. I think. We haven't done this. So this is pure conjecture.
Maybe all we have is a pretty way of saying that we don't understand this stuff and
nobody ever will. But I think we have a leg up.
And then the other thing to keep in mind is if the red guys are also the Dedalus programs,
kind of what we're capturing here is kind of things like Lamport clocks and causality and
stuff. But we're capturing it in a way where we can dig deeper into the program and look
at the dataflow analysis of the logic. So you can actually talk about causality with respect
to the meaning of the program, not just like, oh, somebody sent me a message at a certain
time and I couldn't possibly know what they were doing with that message, so I have to
be very conservative. We can actually integrate the assignment of clocks with the
semantics of the program.
And so I think we can -- I may be wrong and may be underrepresenting what's possible in
sort of classic TLA-type stuff, but I think we have a lever here as well.
>> So TLA actually deals with things like [inaudible] things over intervals, as does the
[inaudible] algebra, which we mentioned before. You can say this is true from now to
infinity. Every tuple has two times, not one, to make that go.
>> Joe Hellerstein: A start and an end.
>> Yeah. Yeah. Yeah. You get there from here with induction, but --
>> Joe Hellerstein: Yeah. It's not as pretty.
>> But you can do this with that by just saying T1, T1.
>> Joe Hellerstein: That's right. You can go either direction. You can [inaudible] one or
the other. Yeah. Good.
>> Either can simulate the other and pretty efficiently, I would think.
>> Joe Hellerstein: But TLA doesn't include all of nonmonotonic logic, right, so do that
part.
>> No, I wasn't trying to get there.
>> Joe Hellerstein: And so that's where I think you -- having these both in a single
language I'm hoping is going to help us reason better about stuff.
And an example of that is like, you know, this notion Acid 2.0, you say, well, if you have
a set of interfaces and they're all associative, commutative, and [inaudible], that's a good
way to program distributed systems.
Can we analyze these programs for those properties and then say, oh, we can relax.
Trace equivalence will now be easy to prove because I've proved associativity of
something. I think maybe we can. And we're starting to think about it. We're marching
down that path.
So best practices for building distributed systems are actually a constraint on
what you can say in the logic. And I think maybe we can test for those constraints.
All right. So much for Dedalus. It's really just a formalism, okay?
So here's what I think of for Bloom, and then I'll leave you with this. Because Bloom
doesn't exist. Bloom is a fiction. It's going to be some version of Dedalus that's
supposed to be great. But here's how I think about it because it's fun. It's what I call the
Bloom loop. So the Bloom loop is -- you know, it's an acrostic for bloom. So it's easy to
remember.
So the Bloom programmer, what do they do? They're going to write a timestep essentially.
They're going to say, given stuff that's come off the queue, what do I want to dequeue. So
you can have some control over the asynchrony. You can postpone things. They've
arrived, but I don't want to deal with them right now. So you have a -- basically a priority
queue with a priority function. And that will define your trace. That gives you control of
your trace.
And then you write your logic, which is your "now" stuff. That's L. And that's if you
wanted safety testing in distributed systems or it's assertions, it's invariants. It's
derivations of new stuff.
Then you write your operations, which are like state update side effects. That's really
what "next" is.
And then you need another O to make it spell bloom. So that's the orthography step. We
call that acronym enforcement.
And then there's messages, which is "later," and you'll send messages.
So you show this to a programmer and you say dequeue, test some stuff, make your
updates in batch, send messages. This is not crazy. This does not look like European
logic. This looks like an event loop programming design pattern. I -- my, you know,
ethnic roots are in Europe. I have a great respect for the continent.
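Here is a rough Python skeleton of that event-loop reading, with the phases labeled after the acrostic. The queue shape, the message format, and the handlers are all invented for illustration; it is not Bloom.

    import heapq, itertools

    _seq = itertools.count()
    inbox = []   # priority queue of (priority, seq, message)

    def enqueue(priority, msg):
        heapq.heappush(inbox, (priority, next(_seq), msg))

    def bloom_step(state, max_priority=10):
        # B -- batch/dequeue: pull what we choose to handle this timestep; items
        # whose priority value exceeds the threshold stay queued (postponed).
        batch = []
        while inbox and inbox[0][0] <= max_priority:
            batch.append(heapq.heappop(inbox)[2])
        # L -- logic ("now"): derivations, assertions, invariants.
        assert all(isinstance(m, dict) for m in batch)
        # O -- operations ("next"): batched state update.
        for m in batch:
            state.update(m)
        # O -- orthography: acronym enforcement only.
        # M -- messages ("later"): outbound sends.
        outbound = [{"ack": key} for m in batch for key in m]
        return state, outbound

    enqueue(1, {"x": 42})
    print(bloom_step({}))   # ({'x': 42}, [{'ack': 'x'}])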
Shortest paths. Here's our shortest paths program, and I did a Ruby-style version of this
for fun. So this is not Bloom. This is the Bloom version negative 1. And my guys aren't
particularly fond of this, but it's better than not showing anything.
So here's batch for our shortest paths program. Every time a new path tuple arrives or
every one second, do the following. All right.
Here's the logic. Here's the definition of the link table, key value. Here's the path table,
key value, defined by link.each for each link: yield a key-value pair for path.
For each pair of path and link that match on to and from, yield, you know, the from/to for
the combined path. This is not great. This unification syntax is a little confusing. But so
be it. But these are comprehensions instead of for loops. They could be for loops.
That would be fine.
And then shortest paths has kind of got a reduce in it, right, because you have to do min.
And, again, this is pretty close to Ruby. It's a little bit of Pythony stuff in here, but it's
kind of Rubyesque.
And then there's no ops in shortest paths because you never do any state update at all.
And then there's messages to be sent.
All right. But that's the kind of Bloom loop look at things.
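For flavor, here is a plain Python rendering of that shortest-paths logic (the talk sketched a Ruby-style version): seed paths from links, extend them by joining on the shared node with comprehensions, iterate to a fixpoint, and take the min as the reduce. The data is invented and the sketch assumes an acyclic link graph.

    link = {("a", "b", 1), ("b", "c", 2), ("a", "c", 5)}   # (frm, to, cost)

    # Seed paths from links, then extend by joining path.to with link.frm.
    path = set(link)
    while True:
        new = {(pf, lt, pc + lc)
               for (pf, pt, pc) in path
               for (lf, lt, lc) in link
               if pt == lf}
        if new <= path:
            break
        path |= new

    # The "reduce": minimum cost per (from, to) pair.
    shortest = {}
    for frm, to, cost in path:
        shortest[(frm, to)] = min(cost, shortest.get((frm, to), cost))

    assert shortest[("a", "c")] == 3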
When we think about like our entire Hadoop filesystem implementation, this doesn't
seem to help us much as a framework for thinking about how to factor that
implementation into modules or functions or objects or what. So it's actually -- it's cute,
but it's not clear that it's all you need for structuring a program.
Okay. Now I'm actually going to be on time if you give me a 90-minute talk.
So where are we going. Here's some stuff we're doing on the sort of systems side, and
then there's a slide of stuff we're doing on the languages side. And I realized as I did
these two slides that there's no way we could do all this stuff with the size of the team we
have. These guys haven't seen these slides, but they're going to kind of scare you.
[laughter]
>> Joe Hellerstein: What we desperately need, which we're putting on the calendar, is to
sit down and prioritize this list. Add things to it, brainstorm some more, and then we
know it. But here's stuff that we've been talking about.
Neil, who is an excellent and very experienced coder, is building a new runtime for
Dedalus or for Bloom called C4, which is a plastic explosive, so part of the Bloom
project. The target is to be able to do as many Paxos rounds per second as necessary
from our Paxos spec. Okay. So can we build [inaudible] for real at the performance -- the throughput we need. And to do that you need very low-latency message handling.
And most sort of database engines are tuned for bandwidth, they're tuned for throughput,
they're not tuned for low latency. So that's the challenge for C4. And that's
coming along.
Harry Otti [phonetic] came from Wisconsin and he does storage and recovery. And
we've been talking about this idea of failure as a service in the cloud. So the idea is if
you want to do testing of your software and how it's robust to failure, you want to be able
to do large-scale -- orchestrate large-scale experiments where you fail components in the
system. And orchestrating that seems like something very natural in a high-level
language.
So we've been trying to put his ideas for testing and software for storage together with
our notion of declarative programming with this idea of failure as a service. And this is a
very young idea. But it's exciting, because Harry Otti has done a lot of really good work
with declarative testing for correctness of storage systems.
We've been talking a lot about Paxos and virtual synchrony and other group
communication and consensus protocols, and I think we can make some progress here.
I'm arrogant enough to think we can waltz into that area now with this new logic and
make some progress. We'll see. We'll see. I don't know.
And then we want to build some stuff. So we built Hadoop. And the next thing we built
which didn't use Overlog at all was a streaming version of Hadoop called HOP, the
Hadoop Online Prototype. It does continuous queries and it does online aggregation, if
you know what that is, without changing the MapReduce fault tolerance model.
So it takes out the staging between maps and reduces and lets them stream while also doing the
backup and restart that's built into MapReduce. So that's kind of cute. Once you build that,
you realize, my God, scheduling this thing is really complicated. And I'm sure you see
this in Dryad. When you have a much more flexible programming framework,
scheduling on a parallel system is just -- there's so many design variables.
So we want to get back to our declarative scheduler to play with that space. So that's
going to come back to us.
And we want to build latency sensitive services, because really this batch processing stuff
didn't exercise performance and a bunch of other things.
I've been working for some years now, as I said, on declarative machine learning and we
haven't lost that yet. I'm still collaborating with Carlos Guestrin on that. We've been
talking about management and configuration but actually doing nothing. Bill did a little
bit of work on that, but it was just sort of a class project.
And then I'm funded to do security stuff, so I hope I will.
Here's the Bloom stuff. So where is Bloom going? We need to make the logic more
approachable, and possibly a LINQ-style thing with comprehensions, sort of
like the Ruby I showed you, is better than logic.
But more importantly, like as you think about building a lot of software, how do you
structure code when this is your programming paradigm. We've been working really
hard to figure out like what's a function or what's a call or what's a -- how does a
programmer think about like a handler. It's all flat. And it's all intertwined. So we've
been worrying about that.
And then Neil was asking me like, well, how does this relate to like stack ripping. I was
like, I don't know, what's a stack? What are you talking about? I'm really confused. So
how does this relate to events versus threads, I don't know. I'm confused.
So we need to understand this, because this is how programmers structure their code. So
that's really important.
I think I'm very optimistic that we'll be able to do neat static analysis stuff. Now that we
have Dedalus in hand. So this notion of traces for testing for confluence, causality, and
easy concurrency control, then possibly taking the Acid 2.0 ideas and understanding them
better.
This distributed declarative invariants I think is a very big deal. So we're focusing on
that.
We've done some stuff with debugging because you get message provenance or general
provenance in logic fairly easily, so you can answer things like why is this fact here, or
why did this update happen to this data structure, or why did I get this message. I can tell
you exactly, it's because this was true and this was true and this was true and this was
true.
So that's kind of interesting. And there was work like this in our earlier project, but we
haven't really exploited it enough.
And then finally if there's a theoretician in the room, this whole idea that --
>> They're long gone.
>> Joe Hellerstein: They're long gone. Damn it. Damn it. So I'm working with Christos
Papadimitriou on this. Basically if resources are really free, if you believe the Yahoo!
guys, and I think they're nuts, frankly, but if you believe the Yahoo! guys, you know, as
many machines as you want, no problem. You know, MapReduce, we're going to win a
benchmark, 10,000 machines. Could we have done it on 40? Yeah. But we do it on
10,000. Because why not? It's MapReduce. It scales forever.
If that's true, then the only thing expensive is coordination. Anything you can do in the
map phase is free.
Okay. Well, how many coordination boundaries are there in quicksort? Or in sorting.
To sort something. How many times should you coordinate. How many machines
should coordinate each time.
This is kind of an interesting notion of complexity. So what's the coordination surface.
How deep is the coordination, how many times do you have to have a barrier, and for
each barrier how many nodes are involved. It's kind of an interesting complexity model.
And then I think you can get randomized and approximation algorithms tucked into this
model by saying I'm going to step over a coordination boundary either speculatively and
then just move on, which would be kind of approximation, or I'm going to step over this
boundary and there'll be some cost to fix it if I'm broken, which is more like a
randomized algorithm.
So, anyway, we're playing with this. It's -- I don't know. It's interesting.
So some of the things we're thinking about -- now I'm two minutes over my really long
talk. But there's references for the record.
>> [inaudible]
>> Joe Hellerstein: It's fat, it's slow. Thank you for your patience.
>> Jim Larus: I suggest we give the speaker the big round of applause.
[applause]