
>> Jim Larus: It's my pleasure to introduce Joe Hellerstein from Berkeley. Joe is a
database guy who has seen the light and decided to go and apply database systems to sort
of everything. I believe his first stop on the excursion was sensor networks, and he's now
moved on to parallel distributed computing.
And I saw this work a little while ago and thought that this was a very new, cool idea for
how to program very large-scale systems. And fortunately Joe and a bunch of his
students are up here today to talk about it. Most of you have probably seen the meeting
after this. They're also around this afternoon, so if people want to talk to them, there'll be
some time after 2 p.m. where you can get together with them. And just probably the
easiest thing to do is to just send me an e-mail if you want to get together with them, and
we'll work something out.
So it's a pleasure to introduce Joe, and welcome.
>> Joe Hellerstein: Thanks. Thanks. Yeah, so why don't I just jump right in, give you
guys a sense of what we've been thinking about over the last couple years. And because
the intention today was to bring up the gang and brainstorm, there's a lot of -- well, not a
lot, but at least some stuff at the end that we don't know yet, that we haven't finished
thinking about, so conversation-starter stuff.
So this is joint work with a whole team of folks including a collaboration with folks at
Yahoo!, and Rusty Sears moved from the top list to the bottom list in the last year. So
that's been a good collaboration for us and a lot of fun.
So what I'll try to talk about today is kind of our take on what cloud programming is
about. We can, if the audience thinks it's worthwhile, do a little quick tutorial on
Datalog, because that is the foundation for what we're doing.
My assumption is we want to do that, but stop me maybe, Jim, if you think that I'm --
>> Jim Larus: No, why don't you do it.
>> Joe Hellerstein: We'll do it. Okay, yeah. And that's going to cut into our time to talk
about the real stuff, but it's probably worth it.
And then our project, BOOM, I'll give you a look at what we've done to kind of prep for
it in a way, sort of we did an excursion or an experiment over the last year in that realm,
and then a flavor of where we think we're going with the language for BOOM, which
is Bloom, and then some research directions.
So I shouldn't have to talk too much about cloud at Microsoft, but our take on it is that
we're interested in it as a programming platform. And it's an unusual programming
platform. Every new platform brings different features and challenges. We know many
of the features we're seeing in the cloud, the shared environment, dynamic sets of
machines doing computations that evolve over time for services, and, you know, a lot of
the applications people are talking about are very data intensive and some of them also
session-centric, involving users on clients.
And of course for a platform to really be interesting, it needs programmers. And when
you look at something, like a new platform like the iPhone, you know, what's fun is there
are all these people thinking up crazy things to do with the iPhone that people at Apple
probably never thought of.
And so when you think about the cloud, you know, you ask yourself, well, if there's
going to be an app for that in the cloud, what are they and who's going to build them.
Right? I mean, we can name some of them now, but it's not a long list.
So how are we going to get the people who aren't in this room to innovate in this space,
right?
And this leads to the doom and gloom, which is in the title of the talk. You know, cloud
programming is really hard because it's parallel and it's distributed. And, you know,
there's this -- there was a blog post by Dave Patterson where he quoted John Hennessy,
and the two of them have this very dark view of the future of computing that they were
using to raise money from the federal government.
There's two ways to make someone interested in what you're doing. You can say my
stuff is really great, or you can say you're really screwed if you don't fund me. And
they went for option 2.
But Hennessy had this quote: When we start talking about parallelism and ease of use of
truly parallel computers, we're talking about a problem that's as hard as any that computer
science has faced. I would be panicked if I were in industry.
Fortunately I'm running a large university with a big deficit, so my life is easy.
But really it is. It's a challenge in this space because I think when the programming
model is so hard it stymies innovation.
So, you know, the programmers programming in the cloud today are either expected to
knock up single-node software and just replicate it. And that's fine. But that's not
innovative use. That's like running -- I don't know. That's like running a telephone app
on your iPhone. It's like, well, yeah, sure, but that's not going to sell lots of new iPhones.
Or, if you want to write something new, you have to orchestrate concurrent communication
and computation, tolerate delay and failure, in this elastic, minimally managed
environment. And there's just very few people who can be expected to program like that
in the traditional languages like C# and Java and so on.
So this is a roadblock for the folks we really want: the creative people who think about
what the users want, not about computers -- the kind of creative software developers who
aren't necessarily great developers but who have great ideas.
And think about all those iPhone apps, right? Those are not necessarily Ph.D.s in
computer science writing those things. But that's what makes it interesting.
And, frankly, you know what? Building stuff in the cloud is hard for you guys too, right?
Even if you have a Ph.D. in computer science or you're the world's best programmer, this
is a lot harder than writing single-threaded code.
And so our take on this is basically to look at inverting kind of the control structure or the
way we think about programs where instead of the kind of [inaudible] model where
there's a processor manipulating memory, let's turn it upside down and say that
computing is subservient to thinking about state.
And this is where, you know, Jim was sort of making fun of me a little bit, but I think
the database point of view may be helpful: you say, well, computation is secondary.
What you should start thinking about first is the information, the state of your system.
So that's state like session state, the state of the system at every node, protocol state,
permission state, and then of course like the data, such as it is, is in there too.
But everything about your computation really is about deriving, updating, and
communicating state. And that's a perfectly reasonable way to think about computation,
and it's a way that, you know, we're going to argue is easier to parallelize, easier to deal
with stuff like concurrency, because you're actually not thinking about that as your first
problem, you're thinking about that in service of managing the state.
So the strategy we're arguing could be a winning strategy is this data-centric design,
where you take all the state of your program that you would have thought of as data
structures or messages, that kind of thing, and think about it instead as real data,
first-class data. So everything that you're going to do in your program is going to be like
data in a database, because it matters.
All right. So you start with all your system's data as first-class data, and your basic data types
are going to be collections and streams. Right? Very much like you would have if you
had data you cared about. You wouldn't have one-off little things in your database that are
their own data structure. Right? You try to generalize, you try to come up with thematic
designs that are reusable and schematize things. And that's what we're going to say is the
right way to think about all these structures in your program.
And then computation, again, is modeling that data carefully and then reacting to changes
to it and evolving the state as you go.
And part of our argument is that this is a language-independent piece of what we're
doing. This is not going to be about Bloom per se, but many of these lessons, if you
stepped back, you could exercise them in a disciplined way, in a language like C#.
So just thinking about data first is a -- it's a design pattern.
And then the second part of our -- kind of our conceit is that you can use a high-level
language once you've done this, like a logic language, a declarative language, where you
write things as specs. So you're basically saying specifications for correctness,
specifications for handling events, safety of the system and safety of your transitions.
And that you write that in a very high-level language where you're really just specifying
outcomes. You get all the traditional benefits that database people like about data
independence where because your spec doesn't include the implementation, the
representation and the placement of data in the cloud is left underneath the programmer's
level of thought. It's left to some kind of system optimization.
And then I believe we're getting -- and we're not there yet, but we're getting a lens on
parallelism based on logic that is quite a bit simpler, I think is the way to say it, than
thinking about threads. And I'll try to give you a whiff of that today, although I think
really this is something I can only do kind of very sloppily right now, so it takes a while
for me to get a convincing argument. We're getting there, though.
Anyway, the point with both of these, whether you want to take the kind of
language-independent design pattern or you want to also buy into declarative
programming, the idea is to take things that seem to be hard problems and turn them into
dead-simple problems.
So we're not actually going to try to do anything fancy; we're going to try to make things
you might have thought were fancy and just point out that they're easy.
So, you know, concurrent programming sounds really hard. But data parallelism, like
MapReduce, is really easy. You know, parallel computing people like to talk about
embarrassing parallelism, so let's go find all the embarrassing parallelism we possibly
can. That's kind of at some level part of the agenda, is to make things that you might
have thought hard dead simple.
And, you know, on a fancier level, I think -- and this is getting to the understanding of
parallelism, when should you be coordinating in your programs. And one of the lessons
that's coming out from thinking about it is you should only be coordinating when you're
doing something nonmonotonic in your logic.
And a sort of simple thing to put in your pocket to take away as a slogan is you only
parallelize for counting. The only thing you should ever need to do -- I'm sorry. The
only thing you don't parallelize, the only thing you coordinate for, is to get a correct
count. Everything else that you program should just go data parallel.
And so really what you do, look at your program to figure out when do I need to know
how many of something there is, and everything else can be easy.
>> [inaudible] complexity in distributed systems was this nonmonotonicity.
>> Joe Hellerstein: Yes.
>> So how -- are you going to be able to hide it? How can you hide it?
>> Joe Hellerstein: No. No, I'm going to point it out very, very clearly, and everything
else you won't even worry about it being hard. So, like I said, the whole goal is to take
the easy stuff and say all that stuff is easy, stop thinking about it.
But when you're writing concurrent Java programs, it's very hard to see what's the
nonmonotonicity in the interactions between 25 threads. But we're going to see it very
clearly. We're going to say here's a program where all it's doing is accumulating
information, accumulating information, and then eventually it needs the count.
MapReduce. There's a whole bunch of map, and then there's a reduce. And everything
should be that easy to some degree. There's really no other reason to coordinate.
So let me go on, and hopefully we can talk about this more later.
So it's not like I'm making this up off the top of my head. This goes way back. There's
decades of theory of logic programming and dataflow. What's interesting, I think,
phenomenologically, is that there's a bunch of folks like me running around doing this
independently. It's been kind of cropping up all over the place. It's a sign of something
happening.
So a lot of people have been doing declarative and dataflow languages for stuff. So
obviously query processing and data analysis goes on. In addition to SQL we now have
MapReduce. That's been very interesting to watch. And my group here in yellow, we
started doing networking, both sensor networks and peer-to-peer networks. Distributed
computing is what we're kind of doing now. We've done distributed statistical machine
learning.
At Cornell they've been looking at multiplayer games, 3-tier services at Carnegie Mellon,
robotics at Hopkins, natural language processing at Stanford, compiler analysis, and
security. There's just a lot of smart people, and we haven't been coordinating almost at
all, even still, sadly, who've just been kind of going to the same toolkit to solve
programming problems in a whole bunch of domains.
>> [inaudible] focused on one?
>> Joe Hellerstein: Just basic general, like, Bayesian inference. Yep. And so I wrote a
blog post about this stuff about a year ago. I was invited to by the CRA and we tried to
start putting together a bibliography, at least a limited one, and we'd love contributions to
it if folks are interested.
Okay. So that was a motivation. So we're going to go back to school for a little bit,
okay, and I'll just teach you a little Datalog.
Okay. So I had to learn this stuff in grad school, and I hated it. And I still kind of hate it.
I'll try to make it as painless as I can.
So here's the basic idea. Here's the world through the eyes of a real database geek.
There's data, and then there's logic, which is the stuff that you know based on the data
and some rules. And that's everything.
Here's what logic looks like. If I know Q, then I know P. So that's the way to read that,
if I know Q, then I know P. That's like an implication arrow.
And if you want, you can think of this as SQL views. So it's an expression with a name,
P, but it's not stored, it's just an expression that you can compute from the data. So it's a
named query that's been stored as a query, not as the answer to the query.
Okay. And that's all of computing. Really, at some level, if you enriched your logic
enough, it's Turing complete. The only thing is, like, until recently the only people who
cared about this were Europeans.
Yes.
>> Where is time?
>> Joe Hellerstein: Time is going to turn out to be really, really important for us. So it's
interesting that you asked that, because it took us like many years to bother asking
ourselves that question. So thank you for asking it.
It will come up later. We can take care of enough time for our purposes in plain old
Datalog actually. Although, we'll put sugar on top of it. And I'll show you that in our
semantic stuff. Great question. Good.
So far there's no time. And, in fact, in Datalog there's no time, because Datalog isn't about
programming really. And so that's -- I'm starting with this pure logic thing that doesn't
think about time. It doesn't even think about state update. All there is is data and
what you can know from that data, and the world is static. And so that's kind of what
Datalog is, that's the environment.
So here's the classic Datalog example you're forced to learn in [inaudible] in Wisconsin
in 1992. Or in Jeff Ullman's class at Stanford. Not -- yeah.
So the canonical example is you have a family tree where it's stored as pairs of
parent-child tuples and you want to compute something about who's an ancestor of
whom. So over here is the family tree of IBM mainframes. This is upside-down from
the usual family tree.
And there's a rule that says if X is a parent of Y, then X is an ancestor of Y. So this is
going to be an inductive statement of what an ancestor is. Base case, if you're a parent,
you're an ancestor. Inductive case, if X is a parent of Y, okay, and Y is recursively an
ancestor of Z, oops, that's -- that's my little joke -- Y is the ancestor of Z, then you know
that X is an ancestor of Z. So that's the way to think about this.
Another way to read this is take the parent table, join it to the ancestor table, where the
second attribute of parent equals the first attribute of ancestor, and do that iteratively, or
if you like, recursively, until you can't compute any more ancestors.
This is just an SQL expression. It's a recursive SQL expression that joins these two
tables' second attribute equals first attribute.
Okay. All right. And then you can run queries over this stored relation. These two
things are [inaudible] together. And this is -- think of this as a stored query expression.
And then you can run queries over that query expression like you can run queries over a
SQL view. So you can say find me all the ancestors of some person S.
And convention -- so syntax is horrible in this stuff. Convention in Datalog is capitals
are variables and lower case things are constants. Just because.
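[For reference, the program on the slide reads roughly like this in Datalog syntax; the constant s is just an illustrative stand-in for "some person S":]

    % Base case: a parent is an ancestor.
    ancestor(X, Y) :- parent(X, Y).
    % Inductive case: a parent of an ancestor is an ancestor.
    ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).
    % Query: all ancestors of s.
    ?- ancestor(X, s).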
All right. So this is horrible. It's old stuff. It's no fun. Some notes. This join expression
by variable repetition is called unification, in case somebody wants to use that word.
Variables in capitals. This is called the head, this is called the body. The body implies
the head. The body can have multiple things that you're joining together.
And it's set semantics in Datalog, so there's no duplicates in any of these relations and you
don't generate duplicates and that's how you know that this will sort of terminate. You
find all the distinct ancestors that can be derived from this finite dataset.
So the Internet changes everything, right? This is not -- this is old stuff. But, in fact, you
know, really, if you look at what goes on on the Internet, it looks a whole lot like
ancestors and descendants, it's just links and paths.
So routers keep pairs of links. I am connected to you. And they want to find paths. Who
can I get to from me. So it's exactly the same program, I've just changed the names. It's
exactly the same thing. So this is a path finding Datalog query. It finds all paths from S
to anybody. It's the all paths query, which you don't really want to do, but it's a good
start. It's all paths. This is crazy in the Internet, right?
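[Roughly, the same program with the names changed:]

    % A link is a path.
    path(X, Y) :- link(X, Y).
    % A link followed by a path is a path.
    path(X, Z) :- link(X, Y), path(Y, Z).
    % Query: everything reachable from some source s.
    ?- path(s, X).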
What you really want -- well, before I go -- before we move on, let me focus on kind of
why Datalog is nice as opposed to, say, an imperative programming language. This is
just an expression of the output, very much like SQL. And what's pretty is that -- how
should you interpret this.
This is just logic. So this is just syntax. If you ever had to take a logic class, they'll beat
you over the head with this forever. This has no meaning, it's just a bunch of characters.
It means whatever you want it to mean. But let's have a convention about what it means
that makes some sense.
So the convention is going to be find a substitution for these variables that is consistent
with this expression, all right, find a substitution that is the smallest possible one so that it
doesn't contain any extraneous stuff. So that's a model for this program. It's a set of
assignments of the variables, such that if you deleted any of them it would become
inconsistent. All right. That's called a minimal model.
So you want to find assignments to the variables such that you can't take any of them away.
So if you want, given a fixed database of links, it's the smallest sensible output
database of paths. You could make up paths that aren't supported by the links, but that
wouldn't be minimal.
Okay. And there's this nice lemma if you want. I don't know, maybe it was a theorem
once upon a time. Datalog programs have a unique minimal model. So modulo
renaming, okay, there's exactly one minimal model for this program. That's really nice.
That means that I can take this syntax and there's no ambiguity about this natural
interpretation of it, which is the minimal interpretation. So this is well defined in a very
strong sense.
And another thing that's really nice for us as computer scientists is there's a very simple
recursive join strategy: keep joining links and paths, links and paths, links and paths,
sticking things in paths, until you can't compute anything else. That generates the
minimal model. Cool.
Okay. So now what do we have? We have an agreed-upon logical interpretation of the
program and we have a very natural algorithm that computes it. And it's the algorithm
you'd expect to run, too, so it all feels right.
So that's really nice. So there's no implementation implicit in the program, but the natural
implementation gives you the right answer. Very cool.
>> A question. For this program, if you ignore that path-X-S-?, it could be the empty
model, right?
>> Joe Hellerstein: Not if you had link populated with data. But, yes, I agree with you,
the empty model satisfies this.
So typically what you'll do -- and this is more annoying acronym terminology -- is have
some kind of stored database to start with as part of this; you would actually write it
down. You would say link from Joe to Bob, link from Bob to Sally. It would be part of
the program. That's called the extensional database, and you're not allowed to fuss with
that.
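[Concretely, the extensional database is just facts written down alongside the rules, e.g.:]

    % Extensional database (EDB): stored facts you're not allowed to fuss with.
    link(joe, bob).
    link(bob, sally).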
>> Your implementation is a purely SLG resolution?
>> Joe Hellerstein: We're doing bottom-up --
>> You're doing tabular and not [inaudible].
>> Joe Hellerstein: Correct. Yeah. It's bottom-up execution, actually dataflow. It's
really doing joins.
>> A traditional Datalog is semantic --
>> Joe Hellerstein: Traditional Datalog.
>> Not [inaudible].
>> Joe Hellerstein: That's right.
>> Okay.
>> Joe Hellerstein: Bottom-up [inaudible].
>> Okay.
>> Joe Hellerstein: And it's strictly bottom-up. And using magic sets and all this stuff
that makes bottom-up efficient.
Okay. So what did we do so far? We found all paths in the Internet, which is insane.
What we might want to do is find paths that are shortest paths. That would be sane.
And that's, in fact, what we do on the Internet. Let's write that down.
First let's change our schema slightly to have costs on our links. That's what C will be.
And then we can rewrite our program to propagate the costs. Okay? So if we have a link
from X to Y of cost C, then we're going to have a path from X to Y of cost C.
The other thing we want to do is actually construct these paths so that we can route along
them. So what we're going to do is we're going to keep next hops. If we have a link from
X to Y, then we have a path from X to Y, and if you're sitting at X, the way to get to Y is
that the next hop is Y.
The recursive rule is more interesting. Here's our costs. If you have a link from X to Y
of cost C and a path from Y to Z of cost D -- let's say that we're measuring something like
latency -- then the cost from X to Z is C plus D.
If we were measuring capacity, we might have taken the min of C and D, sort of just an
example.
And then the next hop thing, let's be careful. The next hop, if we know that there's a path
from Y to Z recursively, where you should go from Y to N and then onwards to Z, and
there's a link from X to Y, then there's a path from X to Z where the next place you go
from X is Y. And then of course from Y you would go to N and recursively you would
unroll the path.
Okay. So that's how you propagate next hops in this thing. I'll just make one note, which
is I already cheated and I'm adding stuff to Datalog. I added plus. Plus is a relation. It's
a ternary relation, two inputs, one output, that's infinitely big. It's all triples, X, Y, Z,
such that X plus Y equals Z. And we have infinite relations, so that actually extends the
expressive power of the language. So just -- I don't want to cheat. This is actually not
simple Datalog anymore; this is Datalog with function symbols.
>> [inaudible] form of that?
>> Joe Hellerstein: No. You don't want to get the whole plus relation stored. It's a drag.
Yeah. Although, memory is getting cheaper.
>> [inaudible] really easy to overlay. It's also invertible in the direction, so some
functions work, some don't.
>> Joe Hellerstein: Okay. But what we really want is best paths. So let's take the same
program. Okay. So that's the program we have so far, which compute all paths, their
costs and their next hops, and now for each pair source destination, look at all the paths
from that source to that destination, what's the cheapest cost among all those paths. And
that's what this expression says. For a source destination pair the cheapest cost is the
minimum of all the Cs that you see on this side that match.
This is like a reduce where the key is XZ, or, if you like, in SQL it's a group-by query where
the grouping columns are [inaudible].
And I've just extended the language again with aggregation. And that's a big deal. And
that's the thing where I said, you know, the only thing that matters is counting. Really
what I mean is the only thing that matters is anything that's aggregated. So that's a big
change.
>> [inaudible] in the angle brackets.
>> Joe Hellerstein: It's in the angled brackets. And that's a syntax we made up.
And then finally that just gave us the minimum cost. Now we need to find the path of the
minimum cost, the arg min. So we take the mincost path, we join it with the path and we
find that path from X to Z whose cost is the minimum cost, and its next hop.
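[Putting the whole program on the page, it reads roughly as follows; the attribute order path(Src, Dest, NextHop, Cost) is a guess at the slide's schema, and the angle brackets mark the aggregate:]

    % Base case: a link is a one-hop path whose next hop is Y itself.
    path(X, Y, Y, C) :- link(X, Y, C).
    % Inductive case: a link to Y plus a path from Y onward to Z; the cost
    % adds up, and from X the next hop is Y.
    path(X, Z, Y, C+D) :- link(X, Y, C), path(Y, Z, N, D).
    % For each source/destination pair, the cheapest cost over all paths.
    mincost(X, Z, min<C>) :- path(X, Z, N, C).
    % The arg min: the path whose cost equals that minimum, with its next hop.
    bestpath(X, Z, N, C) :- path(X, Z, N, C), mincost(X, Z, C).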
Now we've got shortest paths. All right. And, you know, five lines, pretty good. And
you could ask --
>> [inaudible]
[laughter]
>> Joe Hellerstein: Okay, okay, not fair, not fair. I was trying to gently introduce you to
the full language. I did not in fact make up this language to suit this example. But that's
a very fair critique of the presentation. As a teacher, I've done a bad job. Yeah. I don't
recommend we keep doing this anymore. We're going to stop. Good. I'm suitably
chastened.
All right. So we just extend the Datalog with aggregation. And the funny thing about
mins, right, is you can't know what they are until you've computed the full input. You
can't compute the min until you know all the path costs.
And that's why you need coordination. Whenever you want to count how many things do
I have, you have to have all the things. And the most basic thing of that is do I have
nothing. And so usually people talk about this in terms of negation, do I have nothing.
And you won't know that until you've tried computing everything.
So counting is a generalization of negation; it's just count equals zero. Yes.
>> [inaudible] recursive algorithm it's not going to terminate, because even if you're
asking [inaudible].
>> Joe Hellerstein: No, because there's no duplicates. So this particular program will
basically run these rules. And, again, there's a model-theoretic semantics and a natural
implementation --
>> [inaudible] longer and longer paths [inaudible].
>> Joe Hellerstein: Oh. If you have -- this is sort of loops in the Internet. It will, in fact,
go around cycles. This is very good. So it's a typical Internet thing. And, in fact, you
should annotate the program to not go around cycles, or -- and my student who's not in
the room has been thinking about this -- you could actually infer from this minimum, you
could propagate a constraint down to here that if you find a path with same source
destination that's bigger than something you already have, you could stop. So you can
push that constraint actually into here.
>> [inaudible]
>> Joe Hellerstein: No, no. No, no. That's just a Datalog-to-Datalog rewrite. It's not
stepping out of Datalog at all. You just put a less-than test in here so that you don't add
something to path if its C plus D is bigger than something that's already in path.
>> [inaudible]
>> Joe Hellerstein: Yeah, sure, sure. We have max and min and average and all these
things.
>> [inaudible]
>> Joe Hellerstein: Right. Or a min path length if you have negated costs. Same story.
And, you know, any algorithm then has a problem because there's no finite answer.
Sometimes you can test for that. There's conservative tests for safety of these answers,
and sometimes you can't. Because it's kind of a halting problem sort of thing in the limit.
So running around loops forever and checking for that should make you nervous, right?
Yeah. These are very natural and good questions that I'm glossing over a little bit
because time's short.
I do actually have slides for some of this that we can go over offline, or we can just
[inaudible].
Okay. I want to move on a little bit to talk about building real software with this. So so
far what have we had? We had a logic for path finding in the Internet, shortest paths.
That was all very cute, but like there's some database in the sky that we were computing
this on, which is not the way the Internet works.
I've heard -- I've read research papers where they said routing should be a service, you
know, something like Google would do Internet routing, that it's crazy that Internet
routing is this distributed computation.
But that's not the way the world works. So can we implement the things the way the
world works in logic? Can we basically build protocols from this specification? And the
answer is yes, really, really easily, using the first thing you might think of.
So all we're going to do is we're going to do what databases and MapReduce and
everybody else does, we're going to say logically there's a database in the sky. Physically
we'll take the rows in that database, or, if you like, you'll take the key value pairs, and
we'll partition them by key. We'll horizontally partition this table.
And how will we do that? Well, we'll make the programmer do it for the moment. So
we'll make them put a little @ next to the field that's the key, the field they want to
partition on. And we'll make sure it's of type address in some network addressing
scheme. So think of it as an Internet IP address for now if you like.
And then we'll place data, just the way you would in a parallel database, at the node that's
in the location specifier.
>> [inaudible]
>> Joe Hellerstein: That's relations, but relations -- a single relation [inaudible] relation
to the primary key concatenated with the fields that aren't in the primary key, which are the value.
I'm trying to be ecumenical by saying key-value pairs.
So let's do it here. The link table, remember this was source destination. We'll partition
it by source. So if you have a tuple that says to get from X to Y it costs you C and there's
a link there, it's stored at X. And that indeed is how network routing tables are
stored. You have a routing table that tells you who your neighbors are and who you can
send to.
Okay. And if you take the union of all those tables in the Internet routers, you get the
link relation.
All right. And paths are going to be partitioned the same. Everybody wants to know the
paths outbound from them. And here's where the fun begins. Links are partitioned by
the first attribute. Paths are partitioned by the first attribute. The join, if you like, the
unification, is on the second attribute of link, which is not the partitioning attribute, and the first
attribute of path.
And, by the way, I want the output partitioned by the first attribute because it's path
again. So my problem is I need to compare things. Here's an example of data. Node A,
node B, node C, node D on a network, a linear network. Here's the link table. It's been
partitioned by the first attribute. Here's the path table, same thing, partitioned by the first
attribute, so the As, the Bs, the Cs, the Ds. I need to put together this link and this path.
Right? And they're on different computers. Communication will have to happen to
achieve this specification.
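[In the Overlog-ish syntax, with @ marking the location specifier, the recursive rule is roughly the following; note that link in the body lives at @X while path lives at @Y:]

    % The body joins tuples stored at two different nodes (@X and @Y), so
    % communication has to happen to satisfy this specification.
    path(@X, Z, Y, C+D) :- link(@X, Y, C), path(@Y, Z, N, D).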
Well, how are we going to do it? And then the output's got to go back, right? The
output's got to go back. So how are we going to do it? We're going to do it in the way
that I always thought was kind of disgusting in graduate school with Datalog which is
we'll just rewrite the program. So we stay in Datalog but we're going to rewrite the
Datalog to an equivalent program. And it's kind of a little bit operational, frankly, but it's
still logic.
So we're going to say take -- introduce a new predicate, link_d, which is the link table,
but now partitioned by the second attribute. So it's just a copy, okay, but it's partitioned by
the second attribute. So it's going to look like -- you know, this tuple that we wanted to
look at is now there in the link_d relation.
And that's the whole link_d relation. And now our join is on a single node. This is
partitioned by Y, this is partitioned by Y, life is good. Now we can do a local evaluation
of the body of that rule, and then to propagate the head back is another communication.
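[A reconstruction of that rewrite, not the exact slide:]

    % link_d is a copy of link, repartitioned by the second attribute.
    link_d(@Y, X, C) :- link(@X, Y, C).
    % Now the body is entirely local to node Y; shipping the derived path
    % tuple back to @X is the second communication step.
    path(@X, Z, Y, C+D) :- link_d(@Y, X, C), path(@Y, Z, N, D).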
So this is actually the distance vector routing protocol that's used in Internet routers. It
does exactly this. But the way it's described is not in terms of where shall the data be.
See, if you look at this program, all it says is there is data laid out the following way, and
there should be data laid out this other way. It's pretty declarative.
When you read about distance vector in a networking textbook, what it will say is first
every node sends advertisements to its neighbors about its links, and the neighbors
respond to the advertisements in a certain way, and this is done iteratively. Just much
more operational. This really just says the data shall land in the following way. Uh-huh.
>> Do you advocate doing this sort of rewrite [inaudible] automatically or --
>> Joe Hellerstein: Yes. So one of the things we did in this early work on declarative
networking was say you could have rewritten this differently; you could have rewritten
this so we repartitioned path instead of repartitioning link. Would have been the same
program. And of course we didn't write either of those things. This is done by an
optimizer. This is what we wrote. We wrote something that didn't tell you how to do it
at all.
The optimizer was the one that rewrote it. You could rewrite it another way, and the fun
thing is, you know, I said this is distance vector used in the Internet, right, just replaying
history here. So exciting, right? If you rewrite it the other way, it's dynamic source
routing which is used in wireless networks. Cool. Right?
So a single specification, two implementations, one of which is better tuned to fairly
stable, fairly low-variance links, one of which is better tuned to very unstable,
high-variance links.
In the networking literature, that's not the way they think about it. They think about it as,
well, there's the wired protocol and there's the wireless protocol, and I can publish a
paper with a third protocol, and if I'm real clever, I'll come up with a hybrid protocol.
Right?
But every protocol is a research paper and implementation. Here what we're saying is no,
you know, there's physical properties of the platform you're running on, and you optimize
for those physical properties. And, by the way, the Internet is changing really fast. So if
you're writing one protocol for every combination of devices on the Internet, you're doing
something wrong. So that was kind of part of our message at that time.
>> [inaudible] want to do a hybrid version?
>> Joe Hellerstein: Cardinality is a bit of it, but a lot more of it in the Internet context is
variance of various things, like communication costs. It's not actually raw numbers; it's
the predictability of things in many cases.
>> [inaudible] cases that the protocols deal with like, I don't know, count to infinity
[inaudible] --
>> Joe Hellerstein: Yeah. So that stuff actually isn't so bad. But when you talk about
like the corner cases in like really looking at BGP and the way they do distance vector, I
mean, like any standard, they've just added tons of stuff. Can you write all that stuff
down? Yeah, but then it gets ugly because it's just ugly. The spec is ugly. And I can't do
better than the high-level spec at some point.
But the algorithmic pieces of the spec, like not counting to infinity, those parts usually
come out fine. And, you know, broad strokes, it's fine.
All right. And I'll try to justify that later in the talk with having built something real.
These are toys. Quite frankly, what we were doing a couple years ago was interesting
toys. And we were learning and that was the point.
So we had -- the language I showed you is something we call Overlog. It's a variant of
Datalog. We added aggregation and function symbols along the way just to -- you know,
as we needed them. But that's actually very standard in the Datalog literature. That's a
classic thing to do. We horizontally partitioned the tables, which, again, we're talking
about data, we never talk about messages. There's no send and receive in this language.
There's just data and its layout.
And then we added this thing called event tables which are tables that don't persist for
very long. So the data in those databases kind of evaporates quickly. And those can do
things like what is the state of the clock, so there's a tuple that tells you the current time
and it might arrive at a given time to sort of trigger things. What's a message coming off
the network, something that happened in your Java code maybe that's wrapping this stuff.
All right. So events can be passed in as data that's sort of transient.
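[For flavor, event tables get used in rules just like stored tables. A hedged sketch in the P2/Overlog style; the periodic built-in and the predicate names here are illustrative:]

    % Every 10 seconds at node X, a transient periodic event fires; react by
    % deriving a transient ping event at each neighbor Y.
    ping(@Y, X, E) :- periodic(@X, E, 10), link(@X, Y, C).
    % When a ping from Z shows up at node Y, answer with a pong back at Z.
    pong(@Z, Y, E) :- ping(@Y, Z, E).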
And then the execution model is these iterated single-machine fixpoints. So think about
this in a single node. We take a bunch of these events that were coming out of the
environment. We turn them into data, we stick them in tables. Freeze the world. On our
local machine, compute a Datalog fixpoint of all the persistent state we already had and
this transient state we just inserted.
But in the middle in phase 2 here, it just looks like a static database. So all these events
have been turned into tuples, we have a static database, do Datalog, compute the outputs,
and then do something with them. So generate more events, generate up-calls to Java,
generate network messages, et cetera.
Okay. So that was what Overlog was. There are problems with this. And for a long time
we didn't even have it this clean. We were really confused about when does, say, state
manipulation happen, when do inserts and deletes happen. How do you do atomicity.
Like all this stuff was causing us enormous headaches. And later in the talk I'll show you
how to model that. We understand this now.
In this model, which is pretty good, it's sort of an event handler with a Datalog engine in
the middle. The database update happens atomically here before you start again. So the
Datalog program you run in the middle of the loop at least is well defined. It's just plain
old Datalog in the middle. And you go atomically from fixpoint to fixpoint on every
given node in the system.
>> How does that happen? Because, you know, like the update, imagine you take the
Internet example, you update to some part of the [inaudible] would be happening in one
part of the world and then you [inaudible] fixpoint here. So how can you ensure that the
fixpoint happen atomically -- happens atomically?
>> Joe Hellerstein: So, to be very, very clear, this fixpoint is local. So all the facts you
have in your local partition are run to a fixpoint, which generates messages. The
language I showed you in the example I showed you conveniently looked global. Right?
And I said, oh, you partitioned the table. But it's really a logical global database.
It took us a while to just decide that that was a lie. We don't try to propagate that lie
anymore. But it's a lie. So you're absolutely right.
How do people really program that stuff? They think about it. You know, they don't lie
to themselves in that fashion. And I will stop doing that from now on in the talk. So
that's going to be one of my conclusions from the experience of programming Datalog is
that lie was a bad lie and we stopped using it. Yeah.
>> So I'm not -- in that issue [inaudible] Java program is going to run. But it seems to me
that [inaudible] because events might come and then you [inaudible] fixpoint [inaudible]
fixpoint [inaudible] across these timesteps.
>> Joe Hellerstein: Absolutely you're right.
>> [inaudible] all kinds of things.
>> Joe Hellerstein: Race is maybe not the term I would use, but the point is exactly well
taken. So, you know, look, when we're in phase 2, I sort of conveniently said whatever I
dequeued in phase 1 has been chosen. Phase 2 is all well defined. And from phase 2
phase 3 is well defined. And then crazy stuff happens. And then I dequeue it in phase 1.
So, yeah, races can happen before I start phase 1 again in terms of what [inaudible].
>> [inaudible] persists.
>> Joe Hellerstein: This program does.
>> From the left-hand side to right-hand side?
>> Joe Hellerstein: This database is state.
>> So once phase 2 finishes, I can just throw away the [inaudible] and then I have to
restart it from only the state of the database. Is that how we should think about?
>> Joe Hellerstein: Maybe the Java program thing is confusing. Think of it as there's
many clients off in the network, and network packets. These are going to be
asynchronous, in some sense, Java calls. So there should be no difference between
saying Java and saying network. Does that help?
>> But all the state is in the Datalog.
>> Joe Hellerstein: All the state is here. And then if you have state in the outside world
because you're a separate process, that's your business. I mean, you may. That's sort of
outside the model of this language.
But I think that the thing that is important to note is that the order of the events in this
queue and the choice of what to dequeue is not modeled in any way at all. And in that
sense there's definitely races here.
>> Yeah. The way I think about this is that the outer phases are asynchronous. But the
inner phase is synchronous. There's time applied to the --
>> Joe Hellerstein: Yes.
>> -- to the thing.
>> Joe Hellerstein: Yes.
>> And so it's subject to races as synchronously clocked flip-flops are. There aren't any
races. What? You know, there are no races in there.
>> Joe Hellerstein: Yeah, right. Right. It's all the things come in, you do your thing,
things come in. But the problem is the things are coming in in crazy order. You have no
control over that.
This is not bad. This is kind of like event programming with a really powerful
event-handling language. It's really not bad. And if it weren't for the horrible syntax,
actually, it might even be, you know, kind of attractive.
But it's not what we want. So I'm going to show you something better in a little bit.
What's happening is you guys are jumping to conclusions that it took us a bunch of years of
building software to believe. So, you know, four years ago we would have said no, come
on, it's not that bad, no. But now I agree with you. But give me a minute to get there.
I want to argue that this language is pretty good. And I'm going to show you some
anecdotes to argue for that. And with language design, you resort to anecdote a bunch, it
seems to me, not knowing anything about how you do language design, doing it as an
amateur.
So here's a snapshot. It's a screen shot, sorry, of a research paper in the sensornet
literature. This was I think the best paper in SenSys the year it was published. It's the
Trickle Protocol from Phil Levis. It's a code propagation protocol. So it's flooding.
You're trying to flood code to all the nodes in your sensor network.
You want to make sure every node has the latest version. So everybody's going to gossip
about what version do they have and give you a better version if they have something
newer. Except that there's contention in the radio space. You want to back off if
someone else is communicating. So you don't want to be gossiping if other people want
the channel.
And this is this pseudocode for the program from the paper. And there's an
implementation in the nesC event language, which is many hundreds of lines of very
opaque-looking embedded C.
This is an implementation in a variant of Overlog that runs on Berkeley Motes. So this is
David Chu's thesis work, or a little snippet of it, really. And this is the line-for-line
translation. And a lot of David's code here has comments that repeat what you see over
here. So this was a standalone version of this slide, so it's redundant. Almost line for
line.
So the way that someone who does sensornet thinks is actually pretty well captured by
this programming model. Which actually in retrospect isn't surprising because it is event
programming. And if you're a network person, that's a fairly natural way to write
protocols.
>> So I'm just going to comment as an aside from the networking perspective: it seems like
the -- how all the messages should be transferred has not been explicitly specified, and yet a
lot of protocols worry about things like backoff, retransmission --
>> Joe Hellerstein: That's right. This is backoff, right, because it's trying to avoid
collisions. So all this stuff about timers and doubling your timers if something happens,
it's all here.
>> So all that is expressed [inaudible] TCP-like logic for transfers and --
>> Joe Hellerstein: Sure. Yeah. TCP window setting is just a very simple stimulus
response pattern. Right? I wait for something that doesn't happen, I change a timer.
That stuff's not hard to say. The spec in Van Jacobson's paper is not very long.
>> Except there has been a lot of changes since --
>> Joe Hellerstein: Changes are good. And that's about making the network work well.
The programming model I think is -- the actual specification is still pretty simple.
You know, I would love actually offline -- I care less about network protocols than I used
to. But offline you can tell me how wrong I am in saying that. Because I'm sure I'm
missing stuff.
All right. So we're trying to move up the food chain from protocols because we would
run into people like you and they would tell us that we didn't know what we're talking
about. So, okay, fine, maybe not real network protocols, let's do peer-to-peer protocols,
okay, or sensornet protocols. So now we do peer-to-peer protocols. But let's pick at least
an interesting one.
So the next thing we did was we said I bet we could build a real distributed hash table,
content-based routing protocol out of this language. And we built Chord. This is
[inaudible] four years ago.
And we did a high-function implementation with all the stuff in the paper. So all the
things they said, you know, it's not really chord unless you add this and you add this,
because otherwise it won't perform. So we added all that stuff. And the whole thing was
still pretty compact.
And in fact this is it. Okay. So why is "Where's Waldo" on this slide? Waldo's right
there. He's in the middle of the bull's-eye, okay, is the point. So we took something that
you would have thought was hard and we just made it really, really easy, is the point
here.
So this is in the spirit of saying, you know, Chord shouldn't be that complicated. The
paper is not complicated. The routing layer of the C++ implementation from MIT is
10,000 lines of C++. For something that's well described in a ten-page research paper,
and most of the paper is not presenting the protocol, it's not presenting the spec, it's
talking about why it's interesting, it's presenting performance results. So, you know, I
would argue our spec, our implementation is pretty close to the spec. And when you read
it, that's true.
And I just want to make one other point, because it's easy to get into arguments about,
well, the code's dense, I don't really think it's any easier to read the C, you know, look, it's
one piece of paper, okay, and you can print it out and you can sit down, and I know the
syntax is weird. You can get it all in your head all at the same time. You don't have to
context switch. 10,000 lines of C++ you cannot print out, go to the coffee shop and just
read and understand. It just can't be done.
So we like to talk about orders of magnitude code size difference, because I think
anything else is kind of irrelevant. But when you're two orders of magnitude smaller,
now it's -- you're obviously saying things at a different level. That's the point I'd like to
carry away. Obviously you can't read this code. Yeah.
>> [inaudible]
>> Joe Hellerstein: Performance? Performance is fine, man.
[laughter]
>> Joe Hellerstein: So, in fact, you know, there were performance graphs in the paper.
Performance of a wide-area distributed hash table is not a very demanding exercise. The
performance actually was more or less fine. It wasn't great. We didn't try to make it
great. We didn't bother optimizing it. We got the same performance characteristics that
the MIT guys got.
The figure of merit, actually, is churn handling. So as nodes come and go from the
system, how long does it take for people to agree on the routes. And there, you know,
they would get like 99 percent agreement with some distribution of people leaving the
network, and we'd get like 97 percent. And at that point we're like, eh, close enough, let's
move on to the next paper.
So I think it really was fine, but actually I'm not on strong ground saying that, and so I
didn't even put it up.
>> I don't see a single primary key anywhere.
>> Joe Hellerstein: Oh, they're here. They're up there. And they're a terrible thing.
Primary keys are a terrible thing.
>> Why?
>> Joe Hellerstein: Because they mean update. If you have a primary key and you want
to add another tuple for the same primary key, that's an update. And that's
nonmonotonic. And that's what's happening between those loops and the state. We're
going to get all that right with my copious spare time when I talk about our new
language, Dedalus.
>> So suppose adding the key were really an event and it had a time associated with it.
>> Joe Hellerstein: Good. You're way ahead of us. Good.
>> [inaudible]
>> Joe Hellerstein: No, no, no, no. That's exactly what we're going to do. Precisely
what we're going to do. That's great.
>> Good idea.
>> Joe Hellerstein: It's been done before, I'm sure.
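[To give a taste of the flavor: in the later Dedalus work, which is not shown on these slides, state persists only because a rule carries it forward to the next timestep, and overwriting a key is just a deletion plus an insertion pinned to points in time. The syntax below is approximate:]

    % A table p persists by re-deriving itself at the next timestep,
    % unless a deletion event for that tuple shows up in the meantime.
    p(K, V)@next :- p(K, V), notin p_del(K, V).
    % Updating key K: delete the old tuple, insert the new value at next.
    p_del(K, Vold)  :- p(K, Vold), p_update(K, Vnew).
    p(K, Vnew)@next :- p_update(K, Vnew).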
Okay. So I've now brought you up to the present and we have 15 minutes left I think. Is
that right?
>> Jim Larus: We have a little more.
>> Joe Hellerstein: Okay, okay. I'll go fast.
At some level this is just taking that stuff and trying to make it realer in a different
application domain.
So the BOOM project is really to try to take this idea of orders of magnitude smaller code
and apply it to building orders of magnitude bigger systems, namely cloud systems. So
can we really write big-scale systems in this kind of code style that makes things much,
much simpler, qualitatively simpler. Orders of magnitude less code means qualitatively
we're saying things differently, we're saying it simpler.
And we're going to build on this experience, but we want to address a broader application
domain, because really fighting with networking experts about TCP is -- I lose. It's just
maddening. And I don't care enough. And most people don't, actually. So it's not a very
big pool to swim in.
So we're trying to actually -- and this we haven't gotten to yet, but we'd like to really
address real developers and not ourselves. We would like to get out of the academic
language design and into real language design.
So that's -- I got these new grad students to come in a couple years ago. And Neil and
Peter are sitting here. And I said, you know, we should design the next language. This
Overlog thing, nobody can program it. We're going to design a real language.
And they kind of walked in and said, you know, first of all, we just got here and we don't
even believe that your previous language is any good. And, second of all, even if we take
it as a given that you're our advisor and we have to believe you, we couldn't know how to
start designing the next language because we haven't built anything in the previous
language, so would we please, like, go build something and see if this is going to work.
Which was really good advice. It's really nice to have smart graduate students tell you to
not do what you think you want to do and just do something different.
So their idea -- this was with the advice of Tyson Condie, who had been
collaborating with the Yahoo! guys on Pig -- was let's just go build something big that's
cloudy in Overlog before we design the next language. In particular, let's build Hadoop.
And you know what? Let's really build Hadoop. Let's not cheat, so that when someone asks
us is it fast enough, we don't have to say we can't say. We're going to really build Hadoop.
So the goals are: convince ourselves that this makes some sense, and inform the design of the
better language -- certainly for cloud; maybe some day we'll look at this in the
multicore space. We'll see. And, you know, they're smart guys and they need to be
famous, so we better do some hard stuff along the way.
Turns out Hadoop is really easy to build. And you can build it -- anybody seen
BashReduce? It's MapReduce implemented in Bash. A trivial implementation of
MapReduce is really easy. So we wanted to do some stuff that was hard along the way.
And the metrics couldn't be just lines of code anymore because people were complaining
that that was a stupid metric. We really wanted to show that the code would be better, it
would be more flexible, more malleable, and it would perform.
All right. So here is the game: we would, first of all, rebuild Hadoop in Overlog. And
then we would extend it and sort of prototype where Hadoop's going to be five to ten
years from now, add availability, because, embarrassingly enough, both Google's
MapReduce and Hadoop have a single node as the master. It's a single point of failure.
And it's a scalability limit because it's a single node. And when -- in the GFS paper they
basically say, look, if you can't fit your metadata in one computer, buy a bigger computer.
It's just, like, so un-Googley, it's scary.
So we wanted to sort of say, well, they're not going to stay there forever, let's see how we
would extend those features. And then we wanted to do some monitoring along the way.
So this took these two guys and a couple other folks about, realistically, three months,
plus we wrote a new interpreter for Overlog and we were trying to be honest. So let's call
it nine months. But it was mostly like three to four months. So it was very easy.
So here's what we did. Pete and Neil rebuilt HDFS from scratch and they built what they
called BOOM-FS. The Overlog is about this big. And then they wrapped it up in a Java API
so it would be API-compliant with HDFS.
So Yahoo!'s Hadoop runs on BOOM-FS just fine. So it really is API-compliant. And
that was 1,400 lines of skin in Java over these 470 lines of Overlog.
And then the -- Tyson Condie, my senior student, he was interested in the interesting bits
of the MapReduce scheduler. Because most of MapReduce is really boring. The
interesting bits are like straggler handling and the scheduler. That's where people are
kind of innovating.
So what he did is he did a brain transplant. He scooped out the brains of Hadoop, which
was roughly about 6,000 lines of code -- everything else is plumbing; swear to God -- and
he replaced it with about 400 lines of Overlog.
So that's the orders of magnitude thing, roughly speaking, you know, 20,000 to 400,
6,000 to 400. This one's a little more impressive, but this one had to really -- brain
transplant is harder than [inaudible], right?
>> Is the checkpoint and restart capability of MapReduce in the Java or Overlog?
>> Joe Hellerstein: It's in the Overlog. Yeah. Yeah. I mean, you know, the -- actually, I
shouldn't say that. Putting things into HDFS is in Java. Only the scheduler is in Overlog.
HDFS is in Overlog, though. So the decision to put something at the end of the map
phase on a local disk, the decision to put something at the end of the reduce phase into
HDFS, that's in Java. But that's not interesting.
>> Yeah, but there's the failure handling question where you lose a node and then you've
got to go find a copy --
>> Joe Hellerstein: That's all here. The decisions about how to schedule the stragglers
when someone hasn't answered in a while, that's all here. All the interesting
decision-making and policy is in Overlog.
So just somebody wanted performance. So here's a super high-level performance graph.
This is all pairs of components. So over here we've got Hadoop on HDFS. Over here
we've got BOOM on BFS. And then we've got Hadoop running on our filesystem, our
MapReduce running on their filesystem.
And what you're seeing is cumulative distribution functions: the percentage of mappers
finishing in blue over time, the percentage of reducers finishing in red over time. You can
see reduces start after maps finish. They all look the same. The high-level takeaway here
is this is running a particularly boring word count example, I think, from the paper, the
usual crap. It's the same.
The reason it's the same is because it's not hard. This stuff is not hard. It's all spec.
There's no efficiency issues.
>> [inaudible]
>> Joe Hellerstein: Less, I think. But something like that. I think we may have tuned
that. I shouldn't claim. It's about the same. We are much more CPU intensive. I will
say that. We measured that and we were pretty open about that.
All right. But the fun stuff is now. So suppose we want availability. HDFS is a single
point of failure. If you go look at their issue trackers, there's, like, oh, you know, we
have a single point of failure. That's bad. Maybe we should do -- what are the best practices,
first of all, in the issue tracker? Best practice: put your metadata on NFS. That's the
Hadoop community best practice for protecting the metadata of your terabytes of files.
And we had lots of stories from friends of, oh, yeah, you know, actually we did lose
terabytes of stuff because we lost our metadata. That's just -- it's embarrassing. I mean,
Hadoop is a very immature piece of software. Much as it's a really exciting and enabling
piece of software for a big community, it's pretty immature.
So there is an issue tracker proposal for warm-standby, you know, you trickle logs to
somewhere else and you can at least recover as of some time in the past. We wanted to
do hot-standby and process pairs with a Paxos consensus. Because these guys wanted to
do feats of derring-do, right? Well, we weren't going to do no -- stinking
warm-standby.
And also we'd always wanted to do Paxos. Some folks at Harvard did Paxos in the
previous system back when we were doing Chord. They did a toy version. It would only
pass one decree, but it was still -- it was like, wow, you can do Paxos, okay, let's really do
Paxos.
So Pete did the Paxos implementation. At the end of the day, it was about 50 rules,
which is about as complicated as Chord. Okay. Paxos is pretty complicated; that makes
sense. And it's -- we're talking, you know, real multiPaxos, so it can issue many
decrees, it does leader election, it sort of -- it's a real service.
And, again, you know, so I showed you David Chu's slide. This is single-round Paxos.
This is Leslie Lamport Paxos. So it can issue one decree and then it stops. This turns out
to be really pretty.
So this is Lamport prose, and this is Pete Alvaro's Overlog. And it maps up -- maps really,
really nicely. Because Leslie Lamport thinks in invariants and our language is an
invariant language.
Unfortunately when you start doing leader election and you start doing multiple rounds of
things, most of the papers end up looking much more state-machiney, because most
people are not Leslie Lamport. And when -- Pete can tell you more about this, but when
he had to turn this into real like multiPaxos, things got less pretty.
But what is pretty -- Pete put this together for a workshop paper -- is kind of what he
found, looking back at what he had written in Paxos, the design patterns that are inside
there.
So at the bottom here, here's plain old Datalog: Selection, join, and recursion. Here's
what Overlog adds: messaging, aggregation, state update, timers. So the language
we provide is down here.
Above that he found he had sort of built some patterns. He did both two-phase commit
and Paxos, just to kind of compare them. So the pattern multicast was a standard primitive
in both of them. Counting was a standard primitive in both of them. And then choosing
amongst a group of people, which is a lot like counting.
Roll call. Who's around. Voting. Who's around and agrees. Dequeue and semaphores.
And these little components added up in various combinations to Paxos and two-phase
commit.
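To make that concrete, here is a minimal Python sketch of the counting and voting primitives described above: roll call, voting, and a quorum check built out of counting. This is an illustration only, not the team's Overlog, and every name in it is invented.

    def roll_call(responses):
        """Roll call: the set of peers we have heard from at all."""
        return {peer for (peer, _vote) in responses}

    def voters_for(responses, value):
        """Voting: the peers who are around and agree on this value."""
        return {peer for (peer, vote) in responses if vote == value}

    def quorum(responses, value, members):
        """Counting: a value is chosen once a majority of members vote for it."""
        return len(voters_for(responses, value)) > len(members) // 2

    # Example: a five-node group in which three peers accept ballot 7.
    members = {"a", "b", "c", "d", "e"}
    responses = [("a", 7), ("b", 7), ("d", 7), ("e", 9)]
    assert quorum(responses, 7, members)
    assert not quorum(responses, 9, members)

The point of the sketch is just that the same counting primitive shows up whether you are assembling two-phase commit or Paxos; only the rules layered on top of it differ.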
So I think one of the contributions by writing a really clean Paxos was to look at the
structure of kind of what are the primitives in a consensus protocol. And you can ask
something like, well, we cut this diagram here and gave you a language [inaudible] here,
but maybe if you wanted a domain-specific language for distributed systems, maybe the
cut's here or maybe the cut's here.
And this is the language-independent point. This is what are your libraries that you
should build. Should you just ship a Paxos library and be done?
So these are -- you know, this kind of gets you to what's a good DSL for distributed
systems. And I thought this was one of the things you could take away from our work
that wasn't sort of tied up in buying the whole farm, right? So I wanted to bring this up.
There's stuff happening because of the work we're doing where I think we're adding
clarity to the discussion.
Okay. So that was availability. So we added multimaster availability to the Hadoop
filesystem and Hadoop. The next thing we want to do is scale out.
All right. Scale out is great. So, you know, Google says if your master can't scale, buy a
bigger box. And that's what Hadoop does, because Hadoop does whatever Google tells
them to do. And our friends at Yahoo! are like, you know, this is actually a problem.
We've got really big Hadoop clusters at Yahoo! and we have problems with scaling our
metadata.
So how can we scale out the master to multiple machines? Well, really easy. So this is
the design that Neil and Pete had for the metadata in HDFS. There's a file relation.
There's a derived relation that's logical, which is the paths that you can derive recursively
from parent-child relationships. Right? Fully qualified paths are just recursion over
parent-child relationships. And then there's chunks that make up the file.
Partition these tables like a relational database, and then you have a scaled-out master
node. Yeah. That's right. So it took one day to do that. And the guy who did it was
actually the guy -- he's like our operating systems guy. He doesn't even -- at the time I
would say he did not understand Overlog because he kept doing weird things with it.
We're like no, man, don't do that. And it still only took him a day.
That's crazy that this only takes a day. If you tried to do this in the Hadoop code base in
the Java, it would be a nightmare. It would just be a nightmare. Because the assumptions
about this stuff being centralized are not in one place and the data layout is not in one
place.
And this is just a pure win for data-centric design. If you [inaudible] I have a collection
of stuff and I'm going to store that collection of stuff, then you can parcel the stuff out.
Right?
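As a rough illustration of that data-centric design point, here is a toy Python version of the metadata idea: a base parent/child relation, fully qualified paths derived by recursion over it, and the base table hash-partitioned across master shards. The schema and names are invented for the example; the real system expresses this in Overlog.

    # Base relation: (entry_id, parent_id, name). None marks the root directory.
    files = [
        (1, None, ""),
        (2, 1, "usr"),
        (3, 2, "local"),
        (4, 3, "bin"),
    ]
    by_id = {i: (parent, name) for (i, parent, name) in files}

    def fqpath(entry_id):
        """Derived relation: fully qualified path, by recursion over parent links."""
        parent, name = by_id[entry_id]
        if parent is None:
            return "/"
        return fqpath(parent).rstrip("/") + "/" + name

    assert fqpath(4) == "/usr/local/bin"

    # Scaling out the master is "just" horizontal partitioning of the base tables.
    NUM_MASTERS = 4
    def home_master(entry_id):
        return entry_id % NUM_MASTERS   # which master shard owns this row

Because nothing in the derivation assumes the base table lives on one node, deciding where each row lives is a separate, local decision.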
All right. So I'm going to skip this one because time is short and I want to move on. I'm
also going to skip bragging mostly. Except to say that, you know, we did build
something real and we got the benefits we hoped to get this time still. Orders of
magnitude smaller. And performance was really good enough in some measurable way.
Apples-to-apples comparison. But there's lots more we could do.
The thing is, this is not an end goal. We actually don't care about bragging about this
because we didn't want to build Hadoop. The world already has Hadoop, and, frankly,
Hadoop's kind of dorky.
So what are the lessons, right? So lesson 1 --
>> Change the talk when you give [inaudible] --
>> Joe Hellerstein: I'm going to give it at Yahoo! I think.
[laughter]
>> Joe Hellerstein: This whole thing about the fascinating ecosystem that's growing up
around this phenomenon that is Hadoop, yeah, which I agree, I don't talk about the
software artifact as much. Yeah. Now, Dryad, now there's a system. It's beautiful.
[laughter]
>> Good start.
>> Joe Hellerstein: That's a good start. Okay. So I try to partition the lessons of this.
We're working hard to make this something that you can take home in your pocket, and,
frankly, we still don't really have it nailed, as Jim has pointed out to us in shepherding our
paper.
But part of what we're doing here, like the scale out of HDFS, is really just about thinking
about the state of your system as a collection data type. You know, it could have been
MapReduce. It could have been Python comprehensions. It didn't have to be logic. Could
have, frankly, been a C list library with the appropriate programming discipline to not
assume that it's all in one place.
You could do that in any language. And I think, you know, this is just good design
principle for building distributed systems. Take your collections of data and make them
collections and put collection interfaces on them. Use LINQ. This is sort of just LINQ at some
level. I love that. That's great. And I'm not going to try to explain that.
And then part of it is the declarative stuff. So we did have recursion in the filesystem to
form paths, and we got to play with stuff like when to materialize the paths and when to
leave them unmaterialized as optimizations that were separate from the design. So
should we actually form the paths and store them or should we compute them on the fly,
that's all underneath the covers the way it would be in a database.
In other parts of the system we did things like dynamic programming, which falls out really
nicely in a logic language, and anything involving graphs and transitive closures, like
network paths, falls out nicely in a logic language.
And then as I said, you know, Lamport-style invariant specification is natural in a logic
language, because that's what you're doing, you're doing invariant specification. That's
what logic is. And so certain parts of distributed systems are a good fit. And actually the
tricky bits are a good fit.
And then finally -- and I didn't get to talk about this -- we did have this sort of
aspect-oriented ability to kind of go and express invariants over all the state, like make
sure that the number of messages sent is what we would have expected from Paxos. We
didn't have to look at any APIs to do that. We just expressed that as extra invariants on
the data. And so you get this kind of cross-cutting ability to put in debugging
specifications.
And because it's Datalog, the spec of the invariant is the check. You just write down the
rule. The engine enforces the rule for you or flags you when it's not true. So there's no
translation of your specs for debugging into some kind of implementation. There's no
tweaking of the code.
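As a hedged sketch of what "the spec is the check" looks like, here is such a cross-cutting assertion written as a plain query over a message log, rather than as instrumentation behind an API. The table shape, the round structure, and the names are invented for illustration; in the real system this would be one more Overlog or Dedalus rule.

    # Invented message log: (round, sender, receiver, kind).
    sent = [
        (1, "leader", "a", "prepare"),
        (1, "leader", "b", "prepare"),
        (1, "leader", "c", "prepare"),
    ]
    acceptors = {"a", "b", "c"}

    def prepare_fanout_ok(sent, acceptors, rnd):
        """Invariant: in a given round, exactly one prepare reaches each acceptor."""
        targets = [rcvr for (r, _s, rcvr, kind) in sent
                   if r == rnd and kind == "prepare"]
        return sorted(targets) == sorted(acceptors)

    assert prepare_fanout_ok(sent, acceptors, 1)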
>> But there seems like there's another aspect of it, which you haven't talked about too
much, which is a major point of leverage, which is essentially the dealing with race
conditions and disambiguation of race conditions by using timestamps. That's a big part
it too.
>> Joe Hellerstein: There are a lot of limitations. And that one falls in here.
All right. So, first of all, the syntax is hard to write. That's fine. But, worse, it's hard to
read. So it's not just that people don't like it, it's that I don't like reading my own code,
which I don't write that much of. These guys don't like reading their own code. It's just
it's very terse syntax. It's not a pleasant syntax.
More to the point, and this is what was brought up before, this idea that there's state all
over the world and that you express invariants over distributed state is a lie. And so we
stopped using it. These guys are better developers than I am, I think, and they just didn't
use that feature of Overlog in their code. They didn't use distributed rules in essence.
They work in special cases, like totally monotonic things, like paths in the Internet.
Routing protocols actually are mostly monotonic, so you can get away with that kind of
nonsense. They don't work for stuff like Paxos.
However, if you have Paxos, you can do it on top of that, because then you have a
consistent global view.
So it has its place, but it's a lie in general, and so we mostly didn't use it. And you don't
want a language that promotes a lie, because you'll write bad code and you won't
understand why it doesn't work.
Here's a really interesting one. And obviously I'm going to stop the talk now and we're
not going to get much further, but this one's come up kind of recently. Neil pointed this
out. Distributed invariants are a very natural thing to specify. The way that they're
specified in most systems is actually through protocol, which is not declarative.
So I'll give you an example. Everybody knows from reading the Google filesystem paper
that there should be three replicas of every block. So isn't an invariant of GFS that there
are three replicas of every block? No. Sometimes there's two replicas and then you do
something.
When there's not three copies of a block, then a heartbeat doesn't come to the master and
the master thinks maybe there aren't two anymore and it initiates some copies and
lalalalala. What is the invariant that they're going after with this? Could you even write
down what their goal is with this protocol? It's something about there should almost
always be at least one copy of the block, I think. Or the mean time to a block getting lost is way
lower than something. But it's not like they wrote that down.
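To make the contrast concrete, here is a toy sketch of the nominal "N replicas of every block" statement written directly as a checkable query over a replica table, which is exactly what the protocol-style systems do not write down. The schema is invented for illustration.

    # Invented replica table: (block_id, datanode).
    replicas = [
        ("blk1", "dn1"), ("blk1", "dn2"), ("blk1", "dn3"),
        ("blk2", "dn1"), ("blk2", "dn4"),
    ]

    def under_replicated(replicas, target=3):
        """Blocks that violate the nominal 'target copies of every block' invariant."""
        counts = {}
        for blk, _dn in replicas:
            counts[blk] = counts.get(blk, 0) + 1
        return {blk for blk, n in counts.items() if n < target}

    # A repair action could key off this set; GFS and HDFS instead encode the
    # goal indirectly in heartbeats and re-replication machinery.
    assert under_replicated(replicas) == {"blk2"}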
So even writing down these specs for what do you want out of a distributed system is
not -- there's no art yet for that, really. And even if you wrote them down then, we
couldn't enforce them because we can't even check them. So suppose that the invariant
was indeed 99 percent, you know, mean time to failure is whatever for a block. Like how
do you test that. How do you translate that spec into implementation. I don't know.
So we're actually interested in this. And we think we have some ideas for sort of really
software synthesis, I guess, from specification.
Got to start with testing and then go on to enforcement. But the goal is to really be able
to do distributed invariants. Right now we could really only do single-node invariants.
And I think for distributed systems design, this is generally important and hard and not
well treated.
Okay. As has been pointed out -- Jonathan pointed out -- state update in Overlog is outside
logic. It's illogical. And a bunch of people have been starting to pay enough attention to
us that they've pointed this out in public. So they've written papers saying, hey, you
know what, P2, the system you built for Overlog, like it does some crazy things that's not
what you say in your paper. And what you say in your paper is crazy too, but it's just
crazy in a different way. And this is all true. So this is a problem. Okay. And it was -- it bit these guys when they were coding as well.
Another thing we really haven't dealt with is, so, is it better than Erlang? Why? You
know, I don't know. I actually don't really have a good answer. I have ideas for answers.
And that's one reason I really want to talk to you guys about [inaudible], is I want to
understand maybe by hearing each other's pitches for what's good, we can get more
clarity on the pieces you like.
All right. So I think with my very brief remaining time --
>> Is "good" the right question or "necessary" the right question?
>> Joe Hellerstein: Elaborate, please.
>> Well, as soon as I have a monad, which is LINQ for Pascal or LINQ from any of the .NET
languages, dot dot dot, I can make this happen. I mean, I can embed Datalog inside of
any of these and make sure that all the state is contained in the declarative part in all the
rest of that. And once I have that, now you're talking about programming practices. But
better -- they're all [inaudible] they could actually all interact with each other.
So I think it's more of what's necessary. You know, C++ doesn't give me what I need by
itself to actually pull this off. Now, we can enforce it. I can give you libraries you have
to live inside of and all the rest of that. The language itself doesn't help by itself.
>> Joe Hellerstein: I have a real hard time with that line of thought because, of course,
anything Turing-complete is fine. As necessary as -- it's like I understand it from a
complexity point of view, but everything after that is programming practice. I mean, so
it's -- I think we would just agree it's all about good. And your definition of necessary is
really a strong version of good.
>> Sure. Okay.
>> Joe Hellerstein: Yeah. And --
>> You might want to say which one of these makes it more automatic to do the right
thing.
>> Joe Hellerstein: Yeah. Yeah, maybe. Or how easy is it to shoot yourself in the foot.
Something. And we haven't really looked at it from that angle. That would be really
useful, actually, just example of design patterns you can easily say in these languages that
are bad. Yeah. Good. Thank you. That's great.
All right. In my very brief amount of time, which is negative -- what would you guys
like? One thing we could do is we could say that was interesting, let's take that break,
and if you've had enough interesting for today you can go, and then a few of us can stick
around for another 20 minutes. Does that sound good?
>> Jim Larus: Yeah, that's probably [inaudible]
>> Joe Hellerstein: Okay. I will not be offended if you take off. We're mostly going to
talk about language ideas at this point. So if this was the part of the talk you had wanted
to hear, I apologize.
[laughter]
>> Jim Larus: It's all on videotape [inaudible].
>> Joe Hellerstein: Oh, okay. Then you can suffer through it again is what you're
saying. Okay.
>> Jim Larus: [inaudible]
>> Joe Hellerstein: There you go. Okay. That's good. Okay. So the next thing I want
to talk about is the foundation logic for Bloom, for our new language which is called
Dedalus, which if you're not catching the reference, you can talk to Peter afterwards
because he was an English major and I wasn't in college.
So Dedalus is the foundational language for Bloom, as I said. It's the Datalog in time and
space. And the first observation that sort of came out and somebody -- right away one of
you guys pointed this out was, time has got to be really important. And in fact space
doesn't matter. So distributed systems aren't about space, they're about time.
Basically it doesn't matter whether the thing is like in the next room or two rooms over or
if the thing is in Albuquerque or it's in New York or if it's Beijing. That only matters
from a kind of performance perspective. Once it's far away and you can't clock with it
and you can't do things atomically with it, it's distributed.
So distributed systems are not about space, they're about whether you can atomically do
stuff in time together. And that's the thing we should be reasoning about as
programmers: time.
And so in Dedalus, there's three things you can say. You can say something is true now,
something should be true instantaneously after now, or something's true asynchronously.
>> Eventually.
>> Joe Hellerstein: I don't like that word "eventually." Because -- yeah. People have a
lot of connotations for that one. So I prefer later. Or actually we just say async. Yeah.
The other thing -- so that was the first sort of conceit was that Overlog was all about
partitioning in space, and that was wrong. It should be about partitioning in time.
The other conceit I think in the language, which is liberating, is that there's no database.
Every fact is transient. Every fact is true only for one instantaneous timestep.
>> Sounds like it's streaming.
>> Joe Hellerstein: It's very much a streaming system. Yeah. Absolutely. But -- yeah.
But there's windows. Or there's no built-in windows.
So every local node has atomic timesteps. So on a given node a fact is true for a single
atomic timestep, and that's it. All right. And I think you pointed this out before: Put a
timestamp on every tuple, if you like. Same effect. And it's a local clock timestamp, a
local, logical clock timestamp.
And what's [inaudible] so what is persistence, what is storage. It's just induction in time.
And what this is going to give us is a model theoretic framework to talk about side effects
and update, which is really quite cool. So all this stuff that we didn't get before about like
what does it mean to do update and when do updates happen, that's all going to be logic
now. And it's going to have a unique minimal model and all these lovely things we had
for Datalog. So I'll show you this in just a sec.
And then delay and failure are just nondeterminism in assigning timestamps. So all the
things we worry about in distributed systems will be asynchronous specifications with
nondeterminism in the timestamps.
And there's actually been work -- there's nice work on capturing nondeterminism in
logics that we just lean on for this. And so we didn't have to invent anything.
So here's sort of a core Dedalus example. I know some event E, which has attributes A and
B, and a timestamp attribute. Every table in Dedalus has a timestamp attribute, the last
attribute.
If I know A,B is in E at time T, then I know A,B is in P at time S. And let's say T equals S.
If you write a rule like that, that's a plain old Datalog rule. It's deductive. It says if I
know E(A,B), I know P(A,B) at the same time. That's plain old Datalog.
If instead of T equals S you say successor, this is next. This means instantaneously in the
next timestep: if I know E at one timestep, I know Q at the very next timestep
atomically. With no intervening state of Q.
And then asynchrony just says if I have a time T, then of all possible times in the
universe, which, again, we will not store, all possible times in the universe, this is the
logic expression from our friend -- from our Italian friends at UCLA for expressing
nondeterminism keyed on A, B, and T: choose an arbitrary S. Right? That's just the way to
think about it. It's like for each A,B,T triple that satisfies these two rules, you pick an
arbitrary S that is in the time relation. This is just basically rand, if you like.
>> Presumably that's S greater than T.
>> Joe Hellerstein: Ah. We actually like to say it doesn't have to be. And there are
cases where it really doesn't. And this is all just logic, man.
[laughter]
>> Zero of time back to time.
>> Joe Hellerstein: It turns out it has to do with how much nonmonotonicity is in your
program. So if you have a totally monotonic program and all it's doing is accumulating
stuff, you can have it infer stuff that was true before. And it will catch up. And it --
Pete, what's the quote, the MIT quote?
>> Peter Alvaro: Time is a device that was invented to keep everything from happening
at once.
>> Joe Hellerstein: All right. But in most of our programs, because we do want to admit
nonmonotonic update, it makes sense if you do it in monotonic time and it doesn't make
sense if time has cycles. Yeah. So update gives you headaches if time doesn't go
forward.
>> If you put a range limit on S, do you get the [inaudible] algebra out of this?
>> Joe Hellerstein: I don't know the answer to that because I don't know what you're
talking about.
>> Well, the [inaudible] algebra is two times, not one.
>> Joe Hellerstein: Oh, your algebra.
>> I'm saying put a range. Put a range on S. That gives you the two times. You say it
has to have [inaudible] to be true.
>> See, but every tuple here only has one time. The tuple is not true for an interval --
>> No, no, no. The tuple has one, but you can put an interval on top of that and say I will
do the async random deferral during this period that that tuple now applies.
>> Joe Hellerstein: You could probably say something like this fact can only be true if it
arrives within a certain amount of time.
>> Within an interval.
>> Joe Hellerstein: That's fine. You could say that. Sure. I don't know how that maps
to your language. That would be hard work for me to -- I'd first have to really understand
what you're doing instead of vaguely understand what you're doing.
All right. Here's the sugared version. So you don't really have to write that. That's
horrible. If you want to write a Datalog rule, write a Datalog rule. Leave out the
timestamps. If you want to write a Datalog rule with next, say @next. And if you want
to write later, say @later or @async. So that's the sugared version of Datalog -- Dedalus.
Here's where it gets fun. Persistence. Persistence is induction. If P is positively true
of A,B, then P is positively true of A,B at the next timestep. So this is just infinite storage of
that fact. It's true now, and if it's true at X, it's true at X plus 1. Persistence through
induction.
You don't want to implement that. That's like RAM. That's like implementing DRAM in
your software, which is just a bad idea. You have to keep refreshing all your facts. But
it's a fine model. And, after all, this is just the model. The implementation is
independent.
Now let's suppose you want mutable state. You want to do deletion. Let's have a
convention. For every predicate P, which we'll call P positive for now, we will also have
a predicate called P negative by convention. And if you like, this is, you know, like Ruby
on Rails. You must name your tables the way I say so because I invented it.
But in fact we can put sugar on top of this, so you just say P and I say,
oh, you mean there's a P_pos and a P_neg, and I do that under the covers.
Now --
>> [inaudible]
>> Joe Hellerstein: Hold on. Let me do this because it's fun. And then I'll -- whatever
you want to do which is less fun. I'm sorry.
[laughter]
>> Joe Hellerstein: The arrogant bastard. So all right. If we know that A1, B was in
P_pos at some time, and it was not also in P_neg at the same time, then it is in
P_pos at the next time.
>> [inaudible]
>> Joe Hellerstein: It's a bug. What A? Oh, this is on video.
[laughter]
>> Joe Hellerstein: Go back in time. It's nonmonotonic. All right. There's no A, man.
But think about this for a sec. Let's look at this example. P(1,2) is true at timestep 101.
Very nice. It's in P_pos. Okay. P(1,3) is true at timestep 102. Fine. P_neg has
(1,2)@300. Does P_pos contain (1,2)@300? Yes. P_pos does not contain (1,2)@301.
Because what we're saying is if a fact is currently true and has been asked to be negated right
now, it will be gone in the next timestep. We break the induction.
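Here is a toy Python rendering of that example: persistence as induction from one timestep to the next, with P_neg breaking the induction. This is the model only, not an implementation; as the talk says, you would never literally refresh every fact on every tick.

    p_pos = {(1, 2, 101), (1, 3, 102)}   # facts asserted at their timestamps
    p_neg = {(1, 2, 300)}                # deletion requested at timestamp 300

    def step(state, p_neg, t):
        """Carry every fact true at t forward to t + 1 unless it is negated at t."""
        carried = {(a, b, t + 1)
                   for (a, b, tt) in state
                   if tt == t and (a, b, t) not in p_neg}
        return state | carried

    state = set(p_pos)
    for t in range(101, 305):
        state = step(state, p_neg, t)

    assert (1, 2, 300) in state       # still true at 300 ...
    assert (1, 2, 301) not in state   # ... but the induction is broken at 301
    assert (1, 3, 301) in state       # untouched facts keep persisting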
>> Do you want a comma there in the first line?
>> Joe Hellerstein: I put a period.
>> No, the first line.
>> Joe Hellerstein: No, that's just bad -- that's bad, you know, keynote wrap. This is the
arguments of P_pos.
>> Oh, it is -- it's the argument of [inaudible].
>> Joe Hellerstein: Yeah, yeah, yeah. Sorry.
>> Looked like a separate [inaudible].
>> Joe Hellerstein: Sorry. So but the point is the induction has been broken at timestep
N, and that appears because there's no induction to timestep [inaudible].
This is actually extremely cool. Because now what we've got is this is all just Datalog.
There's no updates. There's no nothing. This is just good old-fashioned Datalog being
used in a very stylized way. Which means we have a beautiful, lovely model-theoretic
treatment of state update and side effects. And all is -- all the tools we know about in
Datalog can be thrown now at these programs. Jonathan.
>> [inaudible]
>> Joe Hellerstein: We have windows.
>> These are windows.
>> Yep.
>> Joe Hellerstein: They're little bitty windows.
>> They're -- you can create a rule that says for each fact [inaudible] to P_pos you add a
P negative with the timestamp [inaudible] the original timestamp plus some constant.
>> Joe Hellerstein: You could absolutely implement windows. Yeah.
>> You also got rid of the Prolog retract. You don't need it.
>> Joe Hellerstein: Which I wouldn't have wanted in the --
>> No, no, that's the point, is this works correctly in a logic framework without having to
do --
>> Joe Hellerstein: Yeah. Actually, you know what would be -- if I cared about such
things, and, you know, what I care about seems to change over time, because I used to
think programming languages were really boring, if I cared about teaching the Prolog
community something I guess we would go back and do cuts and all this kind of stuff and
retracts.
>> Oh, with this?
>> Joe Hellerstein: With this.
>> Yes.
>> Joe Hellerstein: Right? Basically you implement --
>> But it doesn't fit this [inaudible].
>> Joe Hellerstein: You do. But you have to implement the Prolog interpreter in
Dedalus.
[laughter]
>> Joe Hellerstein: Which is awesome. So we implemented a query optimizer in
Overlog which did bottom-up search. And then somebody on the review said, well, you
say it's extensible, can you do top-down search? And we did: in a bottom-up language
we implemented a top-down search. So, yeah, it's very doable. It would be -- like I say, if
I cared, which I actually right now don't.
>> [inaudible] parallelism that you have in pure Prolog by getting rid of cut and get rid of
retract that you can now do the logic and this thing now parallelizes like crazy. It's good.
>> Joe Hellerstein: Thank you. Okay. Once we're in Datalog now we can start
playing -- writing theory papers and playing games. So we sent a paper off to PODS
where we took all the -- we took the two classical analyses that you do to Datalog
programs and we recast them in this temporal model so that we could get their benefit.
So one is stratification, which is finding out where your nonmonotonicity boundaries are
and organizing your program so that you step through them in a way that has a unique
minimal model and that has a natural implementation.
And we show that you can do that with this temporal extension. And it does exactly what
you just described. It tells you that there's a whole batch of stuff that because it doesn't
count essentially you can do it as parallel as you like.
And so it points out in these programs, even with state update, it tells you where your
races are at some level and it tells you to wait for them. It tells you to hold out and wait
until that race is done.
And somebody was talking about lock-free stuff. What's lock. What's a semaphore.
Counting semaphore. Counting, counting. Didn't I just say everything's about counting?
Yes, it is. What this is going to tell you is where your locks should be. And if you want,
they're counting. Sure. Yeah. It sounds really cosmic. I don't know if it's true. But it
sounds good.
And then also safety. When you have conservative checks that can tell you your program
will terminate, we can translate those to checks over time. Okay. And that's really useful
sometimes.
And the key is that time -- and this gets to your point about going backwards in time,
time is a source of monotonicity and we can use it in our analysis. If you know the time's
going to march forward, you can say stuff about what your program is doing. You can
actually take your program, you can unroll it into a plain Datalog program with lots and
lots of these strata laid out one timestamp at a time. And then you can collapse the strata
if you do something totally monotonic across two strata.
So time is the source of monotonicity because facts from the past cannot negate
themselves in the future.
And therefore many things that look -- actually the classic examples of nonmonotonic
reasoning from the database theory textbooks are fine in time. They look like flip-flop
programs. And they're meaningful. They're well specified. They don't terminate, but
they're well specified. You toggle between your beliefs. I win, I lose, I win, I lose. It's
true, it's false, it's true, it's false. It's all fine. It's all well specified. And so the static
checks are easy.
Wow. Yeah. Okay. Good. So a little more formalism. This is not yet written down and
we haven't figured out what to do with it. This is Pete's drawing of what Dedalus does. It
takes -- these concentric circles are like strata in a single Datalog fixpoint. So if you
want, it's just the single Datalog fixpoint.
And then, you know, in Overlog what we'd say is state update happens, and then you do
another fixpoint.
In Dedalus what we're saying is that these are next rules. So these are the facts you know
at the next timestep, which define a selection on the database at that timestep.
And then there's these asynchronous rules that go through the randomizer that pop in at
some later timestep. And, you know, there are these foreign agents sending us data from
across the network. Because, after all, we want to go back to doing distributed systems.
We can't control when the messages arrive. And they could be written in other
languages. They could be written in Java.
So these arrive at some random amount of time, some random times. If everything you
write is just nexts and you don't have any laters and you don't have any foreign things,
your entire program's outcome is specified in the first timestep by the base data and the
rules.
The only nondeterminism, the only thing you don't know logically from the very
beginning is these nondeterministic timestamp assignments, and of course the data and
the timestamps of arriving stuff. That's what you want to describe in terms of the
semantics of your program. All the semantics of the next stuff is here. It's given at the
beginning. The rest of the semantics are captured in the timestamps on these messages.
So call that a trace. It's just an assignment of timestamps. And now really what we want
to talk about with program correctness is what are equivalence classes of traces with
respect to certain properties.
So like Church-Rosser property would say can I prove that all traces -- see, ignore the red
stuff. But all assignments of delay give me the same answer. That's the Church-Rosser
property. It says all traces are equivalent. It's giving me the same output database.
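Purely as a conjectural sketch of that idea, here is a Python toy that enumerates delivery orders (traces) for a tiny program and checks whether they all produce the same final answer. A monotonic accumulate-everything program is confluent; a "first arrival wins" program is not. Everything here is invented for illustration.

    import itertools

    messages = ["a", "b", "c"]   # payloads arriving over the network

    def run_monotonic(order):
        """Accumulate everything: the answer ignores delivery order."""
        db = set()
        for m in order:
            db.add(m)
        return frozenset(db)

    def run_first_wins(order):
        """Nonmonotonic: the answer depends on which message arrives first."""
        return order[0]

    traces = list(itertools.permutations(messages))
    assert len({run_monotonic(t) for t in traces}) == 1   # all traces equivalent
    assert len({run_first_wins(t) for t in traces}) > 1   # traces diverge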
But you could have weaker forms of equivalence. You could have equivalence classes on
traces that preserve something else; like, I don't know, the numbers in the bank accounts
add up the same even though the timestamps on the issuance of the checks are different.
And so I think we have a nice tool to talk about things like eventual consistency. I think
you said something about eventually. Eventual consistency, loose consistency [inaudible]
in terms of trace equivalence. I think. We haven't done this. So this is pure conjecture.
Maybe all we have is a pretty way of saying that we don't understand this stuff and
nobody ever will. But I think we have a leg up.
And then the other thing to keep in mind is if the red guys are also the Dedalus programs,
kind of what we're capturing here is kind of things like Lamport clocks and causality and
stuff. But we're capturing it in a way where we can dig deeper into the program and look
at the dataflow analysis of the logic. So you can actually talk about causality with respect
to the meaning of the program, not just like, oh, somebody sent me a message at a certain
time and I couldn't possibly know what they were doing with that message, so I have to
be very conservative. We can actually integrate the assignment of clocks with the
semantics of the program.
And so I think we can -- I may be wrong and may be underrepresenting what's possible in
sort of classic TLA-type stuff, but I think we have a lever here as well.
>> So TLA actually deals with things like [inaudible] things over intervals, as does the
[inaudible] algebra, which we mentioned before. You can say this is true from now to
infinity. Every tuple has two times, not one, to make that go.
>> Joe Hellerstein: A start and an end.
>> Yeah. Yeah. Yeah. You get there from here with induction, but --
>> Joe Hellerstein: Yeah. It's not as pretty.
>> But you can do this with that by just saying T1, T1.
>> Joe Hellerstein: That's right. You can go either direction. You can [inaudible] one or
the other. Yeah. Good.
>> Either can simulate the other and pretty efficiently, I would think.
>> Joe Hellerstein: But TLA doesn't include all of nonmonotonic logic, right, so do that
part.
>> No, I wasn't trying to get there.
>> Joe Hellerstein: And so that's where I think you -- having these both in a single
language I'm hoping is going to help us reason better about stuff.
And an example of that is like, you know, this notion Acid 2.0, you say, well, if you have
a set of interfaces and they're all associative, commutative, and [inaudible], that's a good
way to program distributed systems.
Can we analyze these programs for those properties and then say, oh, we can relax.
Trace equivalence will now be easy to prove because I've proved associativity of
something. I think maybe we can. And we're starting to think about it. We're marching
down that path.
So best practices for building distributed systems are actually a constraint on
what you can say in the logic. And I think maybe we can test for those constraints.
All right. So much for Dedalus. It's really just a formalism, okay?
So here's what I think of for Bloom, and then I'll leave you with this. Because Bloom
doesn't exist. Bloom is a fiction. It's going to be some version of Dedalus that's
supposed to be great. But here's how I think about it because it's fun. It's what I call the
Bloom loop. So the Bloom loop is -- you know, it's an acrostic for bloom. So it's easy to
remember.
So the Bloom programmer, what do they do? They're going to write a timestep essentially.
They're going to say, given stuff that's come off the queue, what do I want to dequeue. So
you can have some control over the asynchrony. You can postpone things. They've
arrived, but I don't want to deal with them right now. So you have a -- basically a priority
queue with a priority function. And that will define your trace. That gives you control of
your trace.
And then you write your logic, which is your "now" stuff. That's L. And that's if you
wanted safety testing in distributed systems or it's assertions, it's invariants. It's
derivations of new stuff.
Then you write your operations, which are like state update side effects. That's really
what "next" is.
And then you need another O to make it spell bloom. So that's the orthography step. We
call that acronym enforcement.
And then there's messages, which is "later," and you'll send messages.
So you show this to a programmer and you say dequeue, test some stuff, make your
updates in batch, send messages. This is not crazy. This does not look like European
logic. This looks like an event loop programming design pattern. I -- my, you know,
ethnic roots are in Europe. I have a great respect for the continent.
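Here is a rough Python skeleton of that event-loop reading, with the phases labeled after the acrostic. The queue shape, the message format, and the handlers are all invented for illustration; it is not Bloom.

    import heapq, itertools

    _seq = itertools.count()
    inbox = []   # priority queue of (priority, seq, message)

    def enqueue(priority, msg):
        heapq.heappush(inbox, (priority, next(_seq), msg))

    def bloom_step(state, max_priority=10):
        # B -- batch/dequeue: pull what we choose to handle this timestep; items
        # whose priority value exceeds the threshold stay queued (postponed).
        batch = []
        while inbox and inbox[0][0] <= max_priority:
            batch.append(heapq.heappop(inbox)[2])
        # L -- logic ("now"): derivations, assertions, invariants.
        assert all(isinstance(m, dict) for m in batch)
        # O -- operations ("next"): batched state update.
        for m in batch:
            state.update(m)
        # O -- orthography: acronym enforcement only.
        # M -- messages ("later"): outbound sends.
        outbound = [{"ack": key} for m in batch for key in m]
        return state, outbound

    enqueue(1, {"x": 42})
    print(bloom_step({}))   # ({'x': 42}, [{'ack': 'x'}])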
Shortest paths. Here's our shortest paths program, and I did a Ruby-style version of this
for fun. So this is not Bloom. This is the Bloom version negative 1. And my guys aren't
particularly fond of this, but it's better than not showing anything.
So here's batch for our shortest paths program. Every time a new path tuple arrives or
every one second, do the following. All right.
Here's the logic. Here's the definition of the link table, key value. Here's the path table,
key value, defined by link.each for each link: yield a key-value pair for path.
For each pair of path and link that match on to and from, yield, you know, the from/to for
the combined path. This is not great. This unification syntax is a little confusing. But so
be it. But these are comprehensions instead of for loops. They could be for loops.
That would be fine.
And then shortest paths has kind of got a reduce in it, right, because you have to do min.
And, again, this is pretty close to Ruby. It's a little bit of Pythony stuff in here, but it's
kind of Rubyesque.
And then there's no ops in shortest paths because you never do any state update at all.
And then there's messages to be sent.
All right. But that's the kind of Bloom loop look at things.
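For flavor, here is a plain Python rendering of that shortest-paths logic (the talk sketched a Ruby-style version): seed paths from links, extend them by joining on the shared node with comprehensions, iterate to a fixpoint, and take the min as the reduce. The data is invented and the sketch assumes an acyclic link graph.

    link = {("a", "b", 1), ("b", "c", 2), ("a", "c", 5)}   # (frm, to, cost)

    # Seed paths from links, then extend by joining path.to with link.frm.
    path = set(link)
    while True:
        new = {(pf, lt, pc + lc)
               for (pf, pt, pc) in path
               for (lf, lt, lc) in link
               if pt == lf}
        if new <= path:
            break
        path |= new

    # The "reduce": minimum cost per (from, to) pair.
    shortest = {}
    for frm, to, cost in path:
        shortest[(frm, to)] = min(cost, shortest.get((frm, to), cost))

    assert shortest[("a", "c")] == 3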
When we think about like our entire Hadoop filesystem implementation, this doesn't
seem to help us much as a framework for thinking about how to factor that
implementation into modules or functions or objects or what. So it's actually -- it's cute,
but it's not clear that it's all you need for structuring a program.
Okay. Now I'm actually going to be on time if you give me a 90-minute talk.
So where are we going. Here's some stuff we're doing on the sort of systems side, and
then there's a slide of stuff we're doing on the languages side. And I realized as I did
these two slides that there's no way we could do all this stuff with the size of the team we
have. These guys haven't seen these slides, but they're going to kind of scare you.
[laughter]
>> Joe Hellerstein: What we desperately need, which we're putting on the calendar, is to
sit down and prioritize this list. Add things to it, brainstorm some more, and then we
know it. But here's stuff that we've been talking about.
Neil, who is an excellent and very experienced coder, is building a new runtime for
Dedalus or for Bloom called C4, which is a plastic explosive, so part of the Bloom
project. The target is to be able to do as many Paxos rounds per second as necessary
from our Paxos spec. Okay. So can we build [inaudible] for real at the performance -- the throughput we need. And to do that you need very low-latency message handling.
And most sort of database engines are tuned for bandwidth, they're tuned for throughput,
they're not tuned for low latency. So that's the challenge for C4. And that's
coming along.
Harry Otti [phonetic] came from Wisconsin and he does storage and recovery. And
we've been talking about this idea of failure as a service in the cloud. So the idea is if
you want to do testing of your software and how it's robust to failure, you want to be able
to do large-scale -- orchestrate large-scale experiments where you fail components in the
system. And orchestrating that seems like something very natural in a high-level
language.
So we've been trying to put his ideas for testing and software for storage together with
our notion of declarative programming with this idea of failure as a service. And this is a
very young idea. But it's exciting, because Harry Otti has done a lot of really good work
with declarative testing for correctness of storage systems.
We've been talking a lot about Paxos and virtual synchrony and other group
communication and consensus protocols, and I think we can make some progress here.
I'm arrogant enough to think we can waltz into that area now with this new logic and
make some progress. We'll see. We'll see. I don't know.
And then we want to build some stuff. So we built Hadoop. And the next thing we built
which didn't use Overlog at all was a streaming version of Hadoop called HOP, the
Hadoop Online Prototype. It does continuous queries and it does online aggregation, if
you know what that is, without changing the MapReduce fault tolerance model.
So it takes out the staging between maps and reduces and lets them stream while also doing the
backup and restart that's built into MapReduce. So that's kind of cute. Once you build that,
you realize, my God, scheduling this thing is really complicated. And I'm sure you see
this in Dryad. When you have a much more flexible programming framework,
scheduling on a parallel system is just -- there's so many design variables.
So we want to get back to our declarative scheduler to play with that space. So that's
going to come back to us.
And we want to build latency sensitive services, because really this batch processing stuff
didn't exercise performance and a bunch of other things.
I've been working for some years now, as I said, on declarative machine learning and we
haven't lost that yet. I'm still collaborating with Carlos Guestrin on that. We've been
talking about management and configuration but actually doing nothing. Bill did a little
bit of work on that, but it was just sort of a class project.
And then I'm funded to do security stuff, so I hope I will.
Here's the Bloom stuff. So where is Bloom going? We need to make the logic more
approachable, and possibly a LINQ-style thing with comprehensions, sort of
like the Ruby I showed you, is better than logic.
But more importantly, like as you think about building a lot of software, how do you
structure code when this is your programming paradigm. We've been working really
hard to figure out like what's a function or what's a call or what's a -- how does a
programmer think about like a handler. It's all flat. And it's all intertwined. So we've
been worrying about that.
And then Neil was asking me like, well, how does this relate to like stack ripping. I was
like, I don't know, what's a stack? What are you talking about? I'm really confused. So
how does this relate to events versus threads, I don't know. I'm confused.
So we need to understand this, because this is how programmers structure their code. So
that's really important.
I think I'm very optimistic that we'll be able to do neat static analysis stuff. Now that we
have Dedalus in hand. So this notion of traces for testing for confluence, causality, and
easy concurrency control, then possibly taking the Acid 2.0 ideas and understanding them
better.
This distributed declarative invariants I think is a very big deal. So we're focusing on
that.
We've done some stuff with debugging because you get message provenance or general
provenance in logic fairly easily, so you can answer things like why is this fact here, or
why did this update happen to this data structure, or why did I get this message. I can tell
you exactly, it's because this was true and this was true and this was true and this was
true.
So that's kind of interesting. And there was work like this in our earlier project, but we
haven't really exploited it enough.
And then finally if there's a theoretician in the room, this whole idea that --
>> They're long gone.
>> Joe Hellerstein: They're long gone. Damn it. Damn it. So I'm working with Christos
Papadimitriou on this. Basically if resources are really free, if you believe the Yahoo!
guys, and I think they're nuts, frankly, but if you believe the Yahoo! guys, you know, as
many machines as you want, no problem. You know, MapReduce, we're going to win a
benchmark, 10,000 machines. Could we have done it on 40? Yeah. But we do it on
10,000. Because why not? It's MapReduce. It scales forever.
If that's true, then the only thing expensive is coordination. Anything you can do in the
map phase is free.
Okay. Well, how many coordination boundaries are there in quicksort? Or in sorting.
To sort something. How many times should you coordinate. How many machines
should coordinate each time.
This is kind of an interesting notion of complexity. So what's the coordination surface.
How deep is the coordination, how many times do you have to have a barrier, and for
each barrier how many nodes are involved. It's kind of an interesting complexity model.
And then I think you can get randomized and approximation algorithms tucked into this
model by saying I'm going to step over a coordination boundary either speculatively and
then just move on, which would be kind of approximation, or I'm going to step over this
boundary and there'll be some cost to fix it if I'm broken, which is more like a
randomized algorithm.
So, anyway, we're playing with this. It's -- I don't know. It's interesting.
So some of the things we're thinking about -- now I'm two minutes over my really long
talk. But there's references for the record.
>> [inaudible]
>> Joe Hellerstein: It's fat, it's slow. Thank you for your patience.
>> Jim Larus: I suggest we give the speaker the big round of applause.
[applause]