>> Sebastian Burckhardt: It's a pleasure to host Marc Shapiro. Marc Shapiro is visiting from
INRIA. Marc is working on eventual consistency and distributed systems, which is why our
research interests coincide.
Marc also worked for Microsoft before. He was leading the Distributed Systems Group in
Cambridge for six years. And I also found out today that he invented proxies a while ago. So
with that, I will let Marc take over and tell us everything about eventual consistency.
>> Marc Shapiro: Okay. Thank you. So I'm going to tell you about eventual consistency and a
subset of eventual consistency, which is strong eventual consistency which I'll explain in a
minute, and CRDTs, which is a particular approach. The whole idea is to have very scalable and
very fast replicated objects.
So imagine you have some data structure that you want to replicate in the cloud on large scales.
For instance, you have this graph that's supposed to represent some data of interest and you
want to be able to access it at large scale, so you want to have lots of replicas in different
machines spread over the network, possibly in different geographical locations.
So of course having lots of replicas is great for fault tolerance and it's great for latency, for read
latencies but as soon as you have updates of course you have a problem because if you
synchronized your updates then you're slow, and if you don't synchronize your updates then you
have conflicts.
So what we're trying to look for is some approach that will allow you to do this, to have replication
but also have very, very high performance for both reads and writes and will also be simple.
Okay. So the conclusion I'll come to at the end is we have a particular class of objects which are
called conflict-free objects. That's what CRDT stands for, conflict-free replicated data types, which
are objects where you can do updates without any synchronization and still have guarantees of
eventual consistency.
And so I'll show you how -- sort of a little bit of the theory behind eventual consistency and
CRDTs and I'll also try to address the question whether you can actually use these things in
practice.
So back to my, you know, data structure I have, you know, this graph. Maybe it represents say
the structure of the web. And well, since the structure of the web is changing over time, you want
to be able to also change the graph so you have, you know, operations to add and remove edges
and vertices, and all of this has to work very fast because you need to follow the changes of the
web, and it also has to scale very large because, you know, the web is so big.
So the contributions, the things I'm going to talk about in this talk. So it's, as I said, a model
called strong eventual consistency. And this is a sort of a solution to the very well known CAP
problem. I'll explain this in a minute. I'll give you some formal definitions. I'll give you some
sufficient conditions. So if you have a data type that fulfills these sufficient conditions, that's
enough. You don't need to do anything else. You automatically have all the nice properties.
I also prove that these two conditions are actually equivalent to each other. And they're two very
different ways of looking at the world. I'll also show you that this is incomparable to existing
consistency criteria.
So this is all sort of the theory. And then I'll also talk about the practice. So I'll give you some
examples of object types that fulfill these conditions.
So I'll start with a little bit of theory about consistency. So you probably all know about strong
consistency. Okay. This is sort of the ideal that you would like that every time you do an update
everybody knows about the update immediately and therefore, you know, your programs never
get confused because all your threads, all your processes, wherever they are in the Internet have
the same view of the world.
Okay. So strong consistency means basically that every time there's an update there's a total
order of your updates. Okay. So even if you have concurrent clients trying to do a particular
update, they'll be ordered in some order, and everybody's going to see the same order. Okay?
So basically there's a counter implemented somewhere in your network and everybody has to
follow the same counter. So everybody knows that the first update here is adding this, you know,
purple subgraph, right? And everybody knows that the second update is -- I think I deleted
something up here, right. And then the third update is going to be something else, right, so I'm
adding a new edge. And the fourth update I deleted some stuff on the left, et cetera. Okay.
But you can see that this is -- this is going to put a bottleneck on your system. Because basically
where you had a big parallel system, now you've made your big parallel system into a big
sequential system. Right.
So maybe this is good for fault tolerance, but it's really bad for performance.
The way you implement this bottleneck, the serialization bottleneck, is called consensus. So it's
a well known problem in distributed systems. It's sort of the basic problem in distributed systems.
How do you get multiple processes to decide together on a particular single outcome? It's been
very well studied, and it's very well known. Of course, this is a serialization, this is the bottleneck.
It doesn't scale very well. Consensus doesn't scale very well. There are fault-tolerant versions
and sort of peer-to-peer versions of consensus, but they still don't scale very well. And they can
never tolerate more than half of the processes being crashed.
So consensus allows you to implement this very nice model, okay, which is also called sequential
consistency or linearizability, et cetera. But as I said, it's really bad for performance.
So people in the cloud space have been working on something else. Actually this started quite a
long time ago with things like Usenet news and disconnected work with, you know, laptops. How
can you do updates that are uncoordinated and still converge towards something? So this is
called eventual consistency.
And basically the idea is you can apply different updates to different replicas in parallel, so
they'll diverge for a while, but eventually they'll converge.
So something like this is going to happen. So one of the replicas is going to add this purple, you
know, subgraph. The other replica just completed this subgraph down here. And then the top
replica -- sorry. So continuing to delete at the bottom. And now the update at this top replica is
going to propagate somehow to the bottom replica and lo and behold, we discover there's a
conflict, right? Because I tried to create an edge to this -- to this vertex and this vertex has been
deleted, and therefore this graph doesn't make any sense, okay?
So in eventual consistency, the idea is when you see one of these conflicts, you reconcile. So
there's some sort of a global decision about what the outcome is going to be. Obviously I can't
leave the graph like this because it's not a graph, it's not a proper graph. So I have to do
something. And there's basically in this case, there are two things I might do. I might say, well,
this, you know, adding this subgraph isn't correct because the -- there's no vertex here or I could
say, well, the delete was incorrect, right?
It doesn't really matter, as long as you make one of these decisions. But everybody has to make
the same decision, right? So again, this reconciliation requires a consensus in order to do it
properly. Okay. The win, what you've gained over the previous version, the strong consistency,
is that the consensus, which remains complicated and remains a bottleneck, is now in the
background, okay? So even if your system is -- if your system is partitioned, for instance, or if
more than half of your replicas are crashed, you can still make progress, okay? Because you're
not -- you don't have to do the consensus right away, you can do it later. Okay?
So this is -- this is what is known as eventual consistency. Now, I'll move to -- no, not yet. So
this is what is called eventual consistency. But you can see that this arbitration is still very
complicated, okay? And it's actually a lot more complicated to do eventual consistency in this
style than to do strong consistency. Yes?
>>: So say that consensus now there's no guarantee of when it is run, is it just sometimes --
>> Marc Shapiro: Yes.
>>: Because [inaudible].
>> Marc Shapiro: Yes. Yes. So in strong consistency, the consensus is in -- is the bottleneck.
It's in the critical path of your application. Here you've moved it off the critical path. So
performance-wise, it's better, but complexity-wise it's really very difficult. And getting the -- getting
the reconciliation right can be very complicated.
>>: Question?
>> Marc Shapiro: Yes.
>>: Does the reconciliation have to be the [inaudible] with these two [inaudible].
>> Marc Shapiro: It didn't matter. Because whatever you do eventually you do a consensus. So
you can diverge in arbitrary ways, right, and you can sort of converge pairwise in, you know -- I
could reconcile with you and then you could reconcile with Sebastian, et cetera, so that could be
non-deterministic. But eventually we have to come to a consensus on what the real reconciliation is
going to be. So anything you do before the consensus is tentative. And it is going to be thrown
away anyway. Yeah?
>>: [inaudible] aren't there sort of [inaudible] transformation [inaudible] try to reach agreement
with --
>> Marc Shapiro: Next slide. Next slide.
>>: Okay.
>> Marc Shapiro: Question? Okay. Okay. So I'm going to show something which is slightly
different, which is a subset of eventual consistency, which I will call strong eventual consistency.
And I would argue that what you really want is -- at least in cloud systems, is this. Okay? So
strong eventual consistency starts with the same idea. You could do your updates locally. And
you could propagate your updates to each other, right? So here I've done, you know, added the
purple subgraph there. I've deleted down there, I've deleted part of the subgraph down here.
And now I'm going to propagate.
Okay. And what's going to happen here is I'm going to say well, okay, this thing is illegal, fine. Just get
rid of it. Okay? So I'm not going to try to, you know, do any sort of consensus, I'm just going to
say I have a deterministic outcome for any conflict. So basically I'm getting rid of conflicts, right?
There are no conflicts anymore.
Any kind of update is allowed. And I just have some recipe to get out of bad situations. Okay.
Now, as long as everybody applies the same recipe, as long as all these pairs of concurrent
updates, all have some deterministic outcome, right, we're all going to converge to the same
thing. And that's what -- so that's going to be very fast --
>>: [inaudible].
>> Marc Shapiro: Sorry?
>>: Only the order --
>> Marc Shapiro: The order doesn't matter, right. If it's deterministic -- sorry, unique and
unaffected by what other people are doing on the side. Right. Okay. So the nice thing here is that,
first of all, I got rid of consensus. I got rid of the complexity of consensus and I've gotten rid of
the performance problems of the consensus, and I've also gotten rid of the bottleneck -- the reliability
bottleneck. Because consensus only allows up to half of my sites being crashed. This will allow
any number of sites being crashed. Okay?
So you've probably heard of the CAP problem. So the CAP problem is this theorem that was
stated by Eric Brewer, who says that in a system you cannot have at the same time
consistency, meaning strong consistency in this case, availability, and partition tolerance.
So if your system can fail, then you have to choose between either having consistency or
availability. Right? If, instead of strong consistency, you accept strong eventual consistency,
then you can have all three of these.
So now, if I look at the sort of formal definition, okay, eventual consistency is defined by these
three clauses. Eventual delivery says that any operation that is executed at some replica will
eventually be executed at all replicas, okay?
Termination says that all executions of your updates will eventually terminate.
And convergence says that all the replicas that have, by the first clause, received all the
updates will eventually reach the same state. Okay? So if your system stops sending updates,
everybody will eventually receive the updates so far and eventually will converge to the same
value, okay?
But this doesn't preclude diverging, rolling back, reconciling, diverging, rolling back, reconciling, et
cetera. So we don't want that. We want something stronger. And that's going to change the
convergence clause so instead of saying eventually they will reach the same state, I will say that
as soon as you've received the same set of updates, you have reached the same state. Okay. So
there will be no rollbacks. Every update is immediately persistent. Durable. Okay?
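As a rough formalization (my notation, following the clauses just given rather than quoting the slides), write $\mathrm{delivered}(i)$ for the set of updates replica $i$ has applied and $\mathrm{state}(i)$ for its current state:

    Eventual delivery:   $f \in \mathrm{delivered}(i) \Rightarrow \Diamond\, f \in \mathrm{delivered}(j)$, for all correct replicas $i, j$
    Convergence (EC):    $\Box\,(\mathrm{delivered}(i) = \mathrm{delivered}(j)) \Rightarrow \Diamond\,(\mathrm{state}(i) \equiv \mathrm{state}(j))$
    Strong convergence:  $\mathrm{delivered}(i) = \mathrm{delivered}(j) \Rightarrow \mathrm{state}(i) \equiv \mathrm{state}(j)$

Strong eventual consistency keeps eventual delivery and termination but replaces convergence with strong convergence: the "eventually" on the right-hand side is gone, which is exactly the no-rollback property.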
So this is going to be my model. And I think this is actually what people have in mind when they
talk about eventual consistency, for instance if you look at what's his name, the guy from Amazon, Werner --
>>: Werner Vogels.
>> Marc Shapiro: Werner Vogels, right. He wrote this paper in the CACM about eventual
consistency. If you read the paper, this is really what he has in mind. Okay.
So strong -- recap, strong eventual consistency says you can do local updates with no
synchronization and therefore you're extremely fast. Okay. Strong eventual consistency gives
you consistency, availability, and partition tolerance, right? And it doesn't involve any consensus. So it
seems like a good thing.
Now, how do you do it? And I'll get to that in a minute. I think I'll skip this slide. I just want to say
that if you're interested, we can talk about it later. But at a theoretical level, this is actually a new
-- a new consistency criterion which is not comparable to things that have been studied before.
So now that I have this idea that I wanted to do eventual consistency without conflicts, okay, if
this is possible, I should be able to have a library of data types that supports these updates
without conflicts. And so that's what I'm going to do in what comes up.
All right. Any -- yes?
>>: [inaudible].
>> Marc Shapiro: Okay. So you have to wait a few minutes and I'll -- you'll see my object model,
and you'll understand what I mean.
So when I talk about sufficient conditions, the question is: what are the conditions that guarantee
that your objects will converge, whatever the order of updates?
Okay. So I'm going to use these, you know, standard diagrams with time going from left to right. And
here I have three replicas -- of course there could be a lot more -- of some object with some state.
So the -- each replica has its own copy, its own view of the current state, right? And then we
have some client maybe on the outside who is making calls on methods of this shared replicated
object, all right?
So queries are easy. A query -- the client just sends the query to some -- to one of the replicas.
It could be any of the replicas. And gets back some state, right. So it's read only. It doesn't have
any side effects. And you can talk to -- the client could talk to this replica or it could talk to the
other replica. It doesn't make a difference. The thing is that the system -- you can't -- you can't
know in advance which replica is going to get an up -- a query, right. So there would be no
synchronization in the query path. So queries are easy.
Updates are a little more complicated. And so there's actually two different ways of describing
how updates happen. Well, the two approaches are state-based and operation-based. So
state-based means that the specification of my -- of my data type is going to specify what's in the
state of a particular replica,
and in order to synchronize I'm going to send the state from one replica to another replica and
somehow they're going to do their thing together based on this state, all right. And the
operation-based approach says that whenever you do an update, that's an operation. And you
send that operation to
all the replicas, okay?
So these are two very different styles of specifying distributed objects. And they've been used in
very different settings. So like state-based in the past has been used for like file systems and for
cloud systems and key-value stores. Whereas operation-based has been used in the past a
lot by people who do the theory and people doing collaborative editing and stuff like that.
Now I'll show you how they differ. So state-based replication, again, as I said, again you have a client
who is sending some update method to one of the replicas. But this replica now has to propagate
it to your other replicas, right?
Again, it could do an update on one replica or on another replica. But we're going to assume that
these are two different updates. And for instance, we've distinguished them here by saying that
they have two -- so this update here is going to apply to the local state, S1, and has some
parameter. And this update is going to apply to the local state at the second replica, so it's S2
with some other parameter.
And the update is going to happen locally at the place where you've sent it to, all right, which I'm
going to call the source. That's what this S here means. Right. So at the source, you're going to
do some computation and you're going to change the state. You're going to side effect, right. So
I'm calling the local state the payload. You're modifying the local payload.
And then every once in a while the -- each replica is going to send a copy of its state to the other
replicas. And the other replica on receiving this copy is going to apply a merge operation. So the
specification of my data type has to specify how I do queries, how I do updates, and how I do
merges. Okay?
>>: This is just conceptually the implementation is [inaudible].
>> Marc Shapiro: No. No.
>>: S1 can [inaudible].
>> Marc Shapiro: But that's the whole idea. So we'll come to the other kind in a minute. Right.
So the -- well, yeah, sure, you could send deltas, yeah, you could send deltas.
But people usually send operations. If they want to send just the delta, they just send the
operation.
Okay. So and then this -- this replica is now going to talk to the other replica and so that means
that eventually all -- and this replica of course is going to talk to the other replica, so you have all
the replicas every once in a while talking to each other either directly or indirectly, right?
So eventually the first clause, which is that all updates have to eventually be applied everywhere,
this is going to happen, right? The second clause says that updates -- updates terminate. Well, I
assume that's going to be true here, right?
And the third clause says that as soon as you've applied an update, it's there, and it's not going to
change. And you see that this is actually what's happening here. Right? So this satisfies --
hopefully this is going to satisfy eventual consistency if it converges. I haven't proven that yet.
But anyway, this is the state-based approach. Now, what is a condition -- what is a sufficient
condition for this to converge? So if you remember your math courses from a while back, the
concept of a semi-lattice might mean something to you. It's actually very simple. A semi-lattice is
simply a set with a partial order and an operation that can take any two values and give you an
upper bound on those two values. Okay?
So if the -- if your data -- if the data, if the payload forms a semi-lattice, that is if there is a partial
order of your -- of your values, okay, and you can always take an upper bound on two values,
right, so if your payload forms a semi-lattice, if your updates always go forward in that
semi-lattice, always give you a higher value in your partial order, and if merge computes this
upper bound, right, this is a very simple mathematical property, and this -- this ensures that it
converges. Okay?
I'll give you an exact one in a minute. But if you just think of, for instance, if your updates are just
-- if your payload is just an integer, right, and your updates just do some increment of the integer
of any value and your merge takes the max of the two values, you can see very easily that this is
going to converge, right? That's your typical semi-lattice.
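As a minimal sketch of that example (Python here and throughout; the class shape and names are mine, not from the talk), the payload is an integer, updates only move it up, and merge takes the max:

    class MaxInt:
        """State-based CRDT sketch: states are integers ordered by <=,
        updates move forward in that order, merge is the least upper bound."""

        def __init__(self):
            self.payload = 0               # the local state (the "payload")

        def value(self):                   # query: purely local, no sync
            return self.payload

        def increment(self, n=1):
            assert n >= 0                  # updates must go forward in the order
            self.payload += n

        def merge(self, other):
            # max is commutative, associative, and idempotent, so replicas
            # can exchange states in any order, any number of times
            self.payload = max(self.payload, other.payload)

    a, b = MaxInt(), MaxInt()
    a.increment(7)                         # concurrent updates at two replicas...
    b.increment(3)
    a.merge(b); b.merge(a)                 # ...exchanged in either order
    assert a.value() == b.value() == 7

Note that this converges, but the max forgets one of the two concurrent increments; counting them both is exactly what the vector counter later in the talk is for.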
>>: [inaudible] decrements?
>> Marc Shapiro: So if you have decrements, you have to make sure that your decrements still
increment in your -- in your partial order. So you have to define your partial order so that even
decrements go forward. And I'll show you an example in a minute.
>>: So [inaudible].
>> Marc Shapiro: [inaudible].
>>: [inaudible].
>> Marc Shapiro: The [inaudible] just takes the max of your timestamp and takes the value that's
associated with that. Yeah.
Okay. So that's in the state-based approach there's a sufficient condition for getting strong
eventual consistency. And it's a pretty simple condition.
So let's now look at the operation-based approach. So here we have two concurrent updates.
Right? And now, the system's going to -- in the operation-based approach, you don't send your
full state because, as Tom noticed, that can be huge. But instead you just change -- you just
send the update, right? And you expect all the replicas to redo, to replay that update, right?
So you need something a little bit stronger than in the previous case. In the previous case each
replica just talked to every other replica once in a while and eventually everybody got everything.
Okay.
Here you have to make sure that every time you do an update it is propagated to all the replicas,
okay? So you have some sort of broadcast operation that will send an update and make sure
that it is received everywhere, okay? So I said that you apply the update at the source, but then
you also apply it at all the other replicas, which I will call the downstream replicas, okay?
So here I just had to pass -- I just had to send the argument to the update and now, you know,
everybody of course has the code for the update, so you just need to apply the code at all the
sites. So of course I just propagated this update and now I'm going to have to propagate the
other update, right? And you can see that, well, what happens here is that this guy has applied A
followed by B. And this guy has applied B followed by A. Right?
Now, if you want to converge, what you want is that your updates are commutative operations,
okay? So there you get your sufficient condition in the operation-based approach: the sufficient
condition for convergence and strong eventual consistency is simply that all your concurrent
updates commute. Updates that are not concurrent, right, if I -- if I do an update followed by
another update, I want those to be replayed in the same order everywhere. But that's pretty easy
to do.
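A minimal sketch of the operation-based style, assuming a reliable broadcast that delivers every operation exactly once to every replica (the names are illustrative):

    class OpCounter:
        """Op-based CRDT sketch: the source executes an update, then the
        operation itself is broadcast and replayed downstream. Concurrent
        operations must commute; integer addition does."""

        def __init__(self):
            self.value = 0

        def add(self, k):              # executed at the source...
            self.value += k
            return ('add', k)          # ...and this is what gets broadcast

        def apply(self, op):           # executed at each downstream replica
            kind, k = op
            assert kind == 'add'
            self.value += k

    a, b = OpCounter(), OpCounter()
    op1, op2 = a.add(5), b.add(3)      # two concurrent updates
    a.apply(op2)                       # delivered in opposite orders:
    b.apply(op1)                       # a sees (5 then 3), b sees (3 then 5)
    assert a.value == b.value == 8     # same state, because the ops commute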
Okay? So I gave you two different models for replication, and for each of these models I gave
you a sufficient condition. Now, it turns out that there are actually -- I can prove that they are
actually equivalent, okay? So I won't go through that because it's pretty -- it's pretty lengthy and
not very interesting. But basically you can take any state-based approach -- state-based object
and emulate it in an operation-based model, right, and if one converges, the other converges, and
vice versa. Yeah?
>>: [inaudible] clearly you can operate by saying add one and they commute. But how do you
emulate a state-based? Max operation is totally different. How do you do elections by mail in a
state-based?
>> Marc Shapiro: You probably can't do elections.
>>: But I can operation-based. All the updates [inaudible].
>> Marc Shapiro: But you need to know when you've finished, and for that you need some sort of
synchronization. So, yes, that's something you want -- you want to know. This is eventually
consistent. But you never know when you've reached consistency, unless you add more
synchronization into the system.
So I'll skip this one. But if you want, we can go into them -- we can go into it. But it's interesting
to have these two views, okay? Because they seem to be very different, but, in fact, they're
actually equivalent. And it turns out when -- when you -- when you start working with this, you
realize that the state-based approach is actually very nice as a reasoning tool, okay? But, of
course, if your objects become very big, then it's going to be terribly inefficient. So you know now
that you can use the state-based reasoning and then convert to the operation-based, and you still
-- you're still going to win. Okay?
And the operation-based approach is going to be a lot more efficient, but it's a lot harder to
reason about.
Now, if you think of CRDTs -- so if you think of these -- these data types in the operation-based
approach, right, all your concurrent updates are going to be commutative. But if I take two
CRDTs that are independent, obviously they're still going to commute, right? So I can -- I can
take two of these object types, combine them, and I still have a strongly eventually consistent
object. So I can use this property -- so these object types are -- I'm going a little bit ahead of
myself, I'm calling them CRDTs, but -- so basically the combination of two CRDTs is a CRDT, but
I can also take a CRDT and cut it in half, right? And if the two halves are independent of each
other I still have a CRDT, right?
So this is the basis -- sort of a theoretical basis for sharding. Okay. If I can take a very large --
so, for instance, if I take my graph and I can arrange it so that, you know, I can -- I can subdivide
the edges and the vertices into, you know, partitions, I can do the updates independently in each
partition.
Of course, this only works if the partition is static, right? So if you have some sort of hash
function that will tell you okay, this vertex goes in this side and this vertex goes on that side and
everybody applies the same hash function, that will work. If you want to do something more
dynamic, where an element goes from one shard to another, then it's going to be more
complicated because you need some synchronization again to make sure that they agree on
what they're doing.
But the static -- static sharding is very easy. Okay. So that's the extent of my theory. And now
I'm going to go into some examples. If there are no more questions.
So what kind of objects can you use -- can you design using these principles? So you actually
have a pretty extensive portfolio of designs. We haven't implemented them entirely yet. But
basically the -- of course you already know about last-writer wins. But there's also the multi-value
sort of register that -- that's in Amazon, for instance. That's also a CRDT.
We have different kinds of sets. And I'll spend a little bit of time on one of these variants, which I
call the observed-remove set.
Once you have sets, you can do graphs. Once you have sets you can do -- sorry. You can do
maps. And once you have sets you can also do graphs. So I'll spend a little bit of time on that.
And I'll also spend a couple of minutes on the counter example, because that's -- the example of
a counter, sorry, because that's -- it's an interesting example.
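To make the simplest entry in that portfolio concrete, here is a minimal sketch of a state-based last-writer-wins register; the stamping scheme is one common choice, not necessarily the one on the slides. Every assignment gets a totally ordered stamp, and merge deterministically keeps the largest, which is what makes it commute:

    class LWWRegister:
        """Last-writer-wins register: merge keeps the write with the
        largest (logical time, replica id) stamp, so every replica picks
        the same winner whatever the order of merges."""

        def __init__(self, replica_id):
            self.replica_id = replica_id
            self.clock = 0                   # local logical clock
            self.stamp = (0, replica_id)     # replica id breaks ties
            self.value = None

        def assign(self, v):
            self.clock += 1
            self.stamp = (self.clock, self.replica_id)
            self.value = v

        def merge(self, other):
            self.clock = max(self.clock, other.clock)  # stay ahead of merged writes
            if other.stamp > self.stamp:
                self.stamp, self.value = other.stamp, other.value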
And actually, this whole line of research started with this data type. We were looking for a way to
do concurrent editing, right? So you have a sequence of characters, and you want to be able to
insert and delete characters concurrently without having any conflicts. So we have a -- an
interesting data type here which is called [inaudible], but I will not go into that, unless there are
questions. Yes?
>>: [inaudible] just to set and it's not in this list, what do I do?
>> Marc Shapiro: You have to tell me what your -- what your requirements are.
>>: If I [inaudible] condition, then what is my fallback?
>> Marc Shapiro: Your fallback is to use the usual strong consistency or consensus. I mean, if I
can design -- if you can, you know, bend your problem into this small corner, right, which is
everything has to commute -- all concurrent operations have to commute, then you can use this,
right?
But it's a small corner of the whole design space. There are lots of things that don't fit into that.
>>: So the commutativity is sufficient [inaudible].
>> Marc Shapiro: No. I don't think there would be, but I'd be glad to be proven wrong. Now, it's
a necessary condition as well.
>>: [inaudible].
>> Marc Shapiro: Yes.
>>: [inaudible].
>> Marc Shapiro: No.
>>: You can always define merge to be [inaudible] and then I'm sure [inaudible].
>> Marc Shapiro: But that's commutative.
>>: I think that's where the confusion comes in, what's the exact definition of commute, because
last-writer wins is -- you know, doesn't feel commutative, right, because there are different
updates, and one of them takes precedence but [inaudible] makes it commutative.
>> Marc Shapiro: As long as --
>>: [inaudible] as it sounds --
>> Marc Shapiro: Yes. Of course, since it's weak, there are a lot of things that you can't do,
right? I mean, if there's -- the universal -- it's proven that if you have consensus you can do
anything, right? Now, we don't have consensus, so there are obviously some things that we
cannot do, and there are lots of things we cannot do. Yeah?
>>: [inaudible] think about the commutativity requirement is if you had only two replicas and you
updated both of them at the same time and they sent their operations to each other, in order to
have consistency they would have to be commutative, because they would both be seeing the
exact same things but in the opposite order.
So effectively what you've proven is that if you -- if your data type allows two replicas to
be consistent, then it allows [inaudible], a nice way to think about it.
>> Marc Shapiro: Right. But you have to also say that this is true for any initial state as well.
>>: Yes.
>> Marc Shapiro: And for any argument to your updates. Yeah.
So let's spend some time on the example. So, first of all, let's look at a counter. You think that a
counter, you know -- for instance, if you take a counter that you can only increment, right?
Increment operation's obviously commutative, so it should be very simple. In fact, it's more
complicated than you would think. So here I'm looking at the -- at the state-based specification.
So if you have -- the problem is if you have multiple masters; that is, multiple replicas that can do
an update, right, and if I send this update to Sebastian, who sends it to you, and then I also send
it to Sun Mai [phonetic], who sends it to you, you have to be able to tell that those are the same
update. Whereas if, you know, Sun Mai does an update and sends it to you, that's a different
update. Right?
So in order to do that, you actually need to have a vector of counters, one per master, one per
replica that can do an update, right? So each one of us is going to choose an entry in this
vector and is only going to update his own entry. Right? So my entry is this one. And I'm only
going to update this one. Sun Mai is this one, he's only going to ever update that one. Right?
And so the value of this counter, of this increment-only counter -- so I can only increment these.
So that's going to ensure that I have a partial order, right, and that all my updates are going to go
forward in the partial order, right? And my partial order is not described here, but it's the usual
partial order between vectors, that is, A is less than or equal to B if all the elements of A are less
than or equal to the corresponding elements of B.
So the value is going to be the sum of these because I'm going to do updates here and you're
going to do updates here and therefore the total number is going to be the sum, right? So to
increment, I'll just increment my entry. And to merge -- so if I send, you know, 15 to Sun Mai,
being my value, and, you know, 20 to Sebastian, the truth is going to be 20, right? So I'm going
to take the max of each one of these entries in order to do the merge. Right? And this is going to
do the trick.
This is indeed a semi-lattice. My updates always go forward in my semi-lattice. And, you know,
the merge function of the -- does take the least upper bound in this semi-lattice. Okay. That's
going to work.
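A minimal sketch of this increment-only counter, assuming a fixed, known set of replicas so the vector can be a plain list:

    class GCounter:
        """Increment-only counter: one vector entry per replica, each
        replica only ever increments its own entry, merge is the
        element-wise max (the least upper bound of two vectors), and the
        value is the sum of all entries."""

        def __init__(self, replica_id, n_replicas):
            self.i = replica_id
            self.p = [0] * n_replicas

        def value(self):
            return sum(self.p)

        def increment(self, n=1):
            self.p[self.i] += n      # only my own entry, so no conflicts

        def merge(self, other):
            self.p = [max(x, y) for x, y in zip(self.p, other.p)]

Because each entry only grows and merge takes the max per entry, a state that arrives twice by different routes is absorbed harmlessly, which is exactly the same-update-versus-new-update problem the vector solves.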
But that only works for increments. So how can you do decrements? Well, you just take two of
these, right, and you take the difference between the two.
So to increment, I will increment this guy, and to decrement, I will decrement this guy. My entry in
this vector, right? And the value is going to be the value of this one minus the value of that one.
Okay. So it's more complicated than you ever thought, right? There are other ways of doing this,
but they're even worse.
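And the decrement version, as a sketch under the same assumptions: two grow-only vectors, P for increments and N for decrements, so every update still moves forward in the partial order even though the value can go down.

    class PNCounter:
        """Counter with decrements: value = sum(P) - sum(N), where both
        vectors only ever grow and are merged entry-wise with max."""

        def __init__(self, replica_id, n_replicas):
            self.i = replica_id
            self.p = [0] * n_replicas    # increments
            self.n = [0] * n_replicas    # decrements

        def value(self):
            return sum(self.p) - sum(self.n)

        def increment(self, k=1):
            self.p[self.i] += k

        def decrement(self, k=1):
            self.n[self.i] += k          # a decrement *adds* to N

        def merge(self, other):
            self.p = [max(x, y) for x, y in zip(self.p, other.p)]
            self.n = [max(x, y) for x, y in zip(self.n, other.n)]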
>>: So if I use the same [inaudible].
>> Marc Shapiro: It will not -- well, it will be a different semantics. It won't do what you want,
what you expect, right, because you're always going to be taking the max. So you'll never be
able to decrement, basically. Your decrements will be no [inaudible].
>>: [inaudible].
>> Marc Shapiro: Well, here I have a partial order. A is less than B if A.P is less than B.P and
A.N is less than B.N.
>>: Some of my decrements will get lost [inaudible].
>> Marc Shapiro: No, they will all get lost.
>>: [inaudible].
>> Marc Shapiro: They will all get lost because you're always taking the max. So every time you
do a decrement, it's going to be forgotten.
>>: [inaudible].
>>: [inaudible].
>> Marc Shapiro: So of course you assume that this all starts with zeros in both vectors.
>>: [inaudible].
>> Marc Shapiro: Sorry?
>>: Your cumulative delta [inaudible] timestamp.
>>: [inaudible].
>>: [inaudible] not as complicated. This seems [inaudible].
>> Marc Shapiro: So I could also do this with a set, right? I could have a set -- every time I want
to increment I add some new element to the set. For the add only version, right? And to merge I
would just take the union of my sets. That would also work. But that would grow indefinitely and
it's not very nice.
>>: [inaudible].
>> Marc Shapiro: Right. So if you think in the operation-based approach, right, all you need to
do is to send the increments and the decrements and, yes, that's definitely going to work. Okay.
But here I'm putting myself on purpose in the state-based approach.
Okay. So now I'll move on to something else, which is how do you do a set. So a set is
something that -- I've decided that a set is something that has these two operations, an
operation to add an element and an operation to remove an element, right? And the -- you know,
the invariants are, you know, whatever the initial state is after I've added E, E is going to be in my
set, right?
And whatever the initial state is, when I remove E, E is not in the set anymore. Okay? Seems
reasonable, right? Okay.
So the sequential semantics are well defined. Now, what happens if you have concurrency,
right? What happens if you have concurrently add E and remove E? What is the outcome going
to be? Okay.
Now, you could say, well, I want a linearizable system where there's a total order and, you know,
the last -- the system makes some decision and puts one of these last, right? Okay. That will
give you exactly the same semantics as the sequential semantics. But it involves a consensus.
So you don't want that, right? It's not SEC. Well, it is SEC, but it's too strong for what we want to
do, right?
So I cross this solution out. But there are still many other solutions. So you could say well, I will
mark this as an error state and let -- just let the user decide eventually. That is a perfectly valid
answer that will give you SEC.
Or you could say last-writer wins. I will put a timestamp on this and one of these will win, but
everybody will make the same one win. Or you could say the add always wins or you could say
the remove always wins, right?
And each one of these is a reasonable semantics and it really depends on what your application
is, right? It turns out that the one that is -- besides the linearizable one, the one that gives you the
most intuitive semantics is add wins, and it also is what I need for my applications. So I'm going
to spend a little time on how do you design this. Okay. And I'm going to call this the
observed-remove set.
So the idea -- so let's start -- I'll give you an example execution to understand how this works. So
we'll start with an initial state where all the replicas are empty. So the first replica is going to add
some element A. So in the interface -- it's a set, right? So if I add A 10 times, I still have only one
A, and if I do a remove, it's gone. Right? Even though I did 10 adds. That's the semantics I
want. Okay?
But internally I have to distinguish these different elements. These different adds, sorry.
Okay. So I'm going to internally add some unique identifier that is going to distinguish
different instances of A. So externally it's all A, but internally I make a difference. Okay. So here
I have A subscript alpha, which is my unique identifier for this particular instance. And this guy is
also going to add A. And it's distinguished by a different unique identifier.
Now we propagate the state down here and obviously the merge function is going to be to take
the union, right. So here I have, you know -- so far I have only received this state, so in my state
here now there's only A with, you know, the beta unique identifier. And now I receive -- the third
replica receives this state. And the merge is going to take the union -- in this state internally there
are two distinguished instances of A, but of course externally I'll just say there's only A there.
And now I do remove. And here we have concurrency between this remove and this add. Right?
But you can observe that this guy doing the remove, he can only see this A, right. He cannot see
this A. So it would make no sense to try to remove this one. So that's why I distinguish the two.
So now, this remove A is going to remove from the set -- actually it's not going to
remove, it's going to mark as being deleted all the elements that it could observe. All the As,
sorry, that it could observe. Right? In this case it's just A alpha.
And now I'm going to propagate this deletion down here and I'm going to merge the two states.
And now my state here is A has been added here and a different A has been removed here.
Right? So A is still in my set. So you see, I have concurrent remove and add, and the add wins.
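A minimal state-based sketch of this observed-remove set, with tombstones and a fresh unique tag per add (the representation is mine; the optimized version mentioned later drops the tombstones):

    import uuid

    class ORSet:
        """Observed-remove set: each add tags the element with a fresh
        unique identifier; remove tombstones only the (element, tag)
        pairs it has observed, so a concurrent add -- whose tag the
        remover never saw -- survives the merge. Add wins."""

        def __init__(self):
            self.adds = set()       # all (element, tag) pairs ever added
            self.tombs = set()      # tombstoned (element, tag) pairs

        def lookup(self, e):
            return any(x == e for (x, _) in self.adds - self.tombs)

        def add(self, e):
            self.adds.add((e, uuid.uuid4().hex))   # fresh tag per add

        def remove(self, e):
            # mark as deleted every instance of e observed locally
            self.tombs |= {(x, t) for (x, t) in self.adds if x == e}

        def merge(self, other):
            self.adds |= other.adds
            self.tombs |= other.tombs

Replaying the execution on the slide:

    r1, r2, r3 = ORSet(), ORSet(), ORSet()
    r1.add('A'); r2.add('A')        # two adds, two internal tags
    r3.merge(r1)
    r3.remove('A')                  # observes only r1's instance
    r3.merge(r2)
    assert r3.lookup('A')           # the concurrent add wins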
>>: [inaudible] pattern to this [inaudible] producing more memory space to get to an intuitive
merge and sort of deferring -- here you're deferring like deallocation of some [inaudible] because
you're hanging off of them longer -- longer than would be necessary in the strict sequential
setting.
>> Marc Shapiro: So the common -- what's common in both cases is that I have this, you know,
[inaudible] and I always have to go forward, right, so in the integer case I had to separate the
increments and the decrements, right? And here --
>>: [inaudible] separating the [inaudible].
>> Marc Shapiro: Right. And here I have to mark deleted items to keep them as deleted, right?
>>: As long as --
>> Marc Shapiro: Tombstones. I'll get to that in a second.
>>: Okay. Right.
>> Marc Shapiro: It turns out that you can optimize this and you can actually remove them right
away. Okay. But I will not get into that.
>>: Okay.
>> Marc Shapiro: Right. So now I want to show that this is going to converge. Okay? So now,
let's say that this state is going to be propagated, right, and now you're going to take the union
between this and this, and that's what you get. So you have the same state in both cases.
Right? And -- well, this guy so far has not seen the deletion yet. He's only seen the add A, beta,
right? But eventually he will see the same stuff and he will converge to the same state.
So there are different designs for sets, right. There's a design which is well known in the
literature, which goes back to the '80s, which we call the 2P set, where every element is unique
and you can add it and you can remove it, but once it's removed you can never add it again.
That's a much simpler specification, much simpler design, but it's much less intuitive because
you can never add a thing back.
Here's a design where you can add and remove an element and add it back and remove it and
you get very close to the sequential specification.
>>: Is this [inaudible] to implementing that set by using a multi-set and when you remove an
element you will remove as many elements as are [inaudible].
>> Marc Shapiro: No, that doesn't work.
>>: Okay.
>> Marc Shapiro: That doesn't work. There are people who tried to do that, so you associate a
counter, basically, right, with your element. And every time you add an element, you increment
the counter, every time you remove an element you decrement the counter. But then you get into
anomalies where your counter can become negative. And then when you add something, it's still
not there.
Yes?
>>: Actually if you use the counter that you've showed previously, that the -- that the eventually
consistent counter, you know, for each element, I think you would have exactly this.
>> Marc Shapiro: No. No. It's just as I said, the counter can go negative.
>>: How can it go negative --
>> Marc Shapiro: Because if I add A and Sun Mai and Sebastian both concurrently remove A,
right -- I set the counter to one and they're going to decrement it by one each, so it's going to go
to minus one.
>>: [inaudible] because they're deleting [inaudible].
>>: Question [inaudible] the bottom specification, in between -- anything that's in between
[inaudible] concurrently, but between them the node that does the add communicates with the
node that does the remove, because we're specifying perhaps one of those [inaudible], right?
>> Marc Shapiro: It's not concurrent, yes. So it would be add followed by remove, in which case
[inaudible].
>>: But then you're including the communication in the concurrency specification.
>> Marc Shapiro: Yes. But that's the definition of concurrent -- yeah, of concurrency.
Concurrency means there's no communication. There are two independent updates.
>>: [inaudible] specification you need to add [inaudible] add E. If alpha adds E or alpha removes
E, E is [inaudible] every process.
>> Marc Shapiro: No.
>>: So the [inaudible] specification --
>> Marc Shapiro: These internal identifiers are internal. They're not -- they don't
come through the interface, right? In the interface you just have elements. And you can add the
same element several times.
>>: [inaudible] this is a [inaudible].
>> Marc Shapiro: No, no, no. There's no public state of the [inaudible]. You can only use the --
the methods, right. So you can look up through the lookup interface, there's no way of
distinguishing two Es.
>>: [inaudible] stick to same [inaudible].
>> Marc Shapiro: Yeah.
>>: So that's what I'm saying with the -- E belongs to S [inaudible] stick to the same S.
>> Marc Shapiro: S is the global state.
>>: But how do you define -- how do you say that it sticks to the same S?
>>: [inaudible].
>> Marc Shapiro: So --
>>: What is the S is the question. Eventually --
>> Marc Shapiro: It's whatever your state is. So if I have -- if I have, you know, some replicas
here, right, and I have add A here, right, remove A here, and then eventually these guys
communicate with this guy, right, this guy's going to see there's a concurrent add and remove and
then for [inaudible] for any replica.
>>: So [inaudible] eventual consistency?
>> Marc Shapiro: Yes. Once you have been able to observe both the add and the remove, then
you're in the state where you've seen them both and therefore you're in this state. Right? Okay.
So by the way -- sorry? Yes?
>>: [inaudible] tombstones, which you can compile the history?
>> Marc Shapiro: So it turns out, in fact, in this particular case you can actually remove them --
let's see. So in the operation-based approach this is a lot easier. If I -- if I create an
element, right, and I send it to you -- okay. Underlying all this there's going to be something like
vector clocks, which will allow you to compare what things happened before other things, right?
So if I -- if you receive from me a version that contains A alpha, right, and then later you receive
another version that doesn't contain A, alpha, and you compare the vector clocks you can see the
second one comes after the first one, you know that A alpha has been deleted. Right?
So if you have vector clocks, then you don't need to maintain the tombstones. You can throw
them away right away.
So I think I've been talking for an hour or so. Do I need to stop or -- continue? I still have, I'm not
sure how many more slides. I have a few more. I can stop at any time. Just, you know, just stop
me.
>> Sebastian Burckhardt: We have the room. It's only depending on the audience.
>> Marc Shapiro: Yeah. Okay. So, as I said, there are some optimizations. You can get rid of
tombstones, you can make this more efficient by using the operation-based approach. Snapshots
I'm going to talk about in a minute, and sharding I'm also going to -- well, I already told you about
sharding. So I don't need to talk about it now.
And it's interesting to see that this actually solves -- if you guys have read the Dynamo paper,
they have this -- they have this example of the Amazon shopping cart and they say well, it's very
weird because, you know, we have this shopping cart which is, you know, wonderfully available,
and basically what they're trying to do is strong eventual consistency, right? But it's strange
because when you add a book to your shopping cart and then later remove it, sometimes it sticks,
it comes back, okay? And what happens is they -- the problem they have is that they haven't
used the right data type. They didn't use something that is properly designed as a set, they used
something which we call the multi-value register, where the only operation is assignment.
There's no add and remove, and that's why it doesn't work properly.
So if you take the time to actually design a set, you can do it properly.
So when you scale up, you need to be able to do sharding. So I explained that. But you also
need to be able to do consistent snapshots. For instance, if you're looking at different replicas or
different shards, you want to be able to make sure that the data that you're looking at is
consistent. So it's very easy to extend the OR-set specification to get consistent snapshots. I told
you that each element has an internal unique identifier. If you just use some clock for this unique
identifier, that gives you the capability to do consistent snapshots just by deciding in advance
what the clock value is that you're interested in. And you can tell what is in the set and not in
the set at a particular clock time. But then you cannot use the optimization to get rid of
tombstones. You have to keep them.
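A sketch of that snapshot idea, assuming the unique tags are timestamps and that tombstones also record when the removal happened (that second assumption is mine; it's what lets you answer queries about the past):

    def lookup_at(adds, tombs, e, t):
        """Was e in the OR-set at (logical) time t?
        adds:  set of (element, add_time) pairs
        tombs: set of (element, add_time, remove_time) tombstones"""
        for (x, added) in adds:
            if x == e and added <= t:
                removed = any(y == e and a == added and rm <= t
                              for (y, a, rm) in tombs)
                if not removed:     # this instance was live at time t
                    return True
        return False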
Okay. So I told you about sets. Now, if you can do a set, you should be able to do a graph, right?
A graph is just a pair of sets, right? So, you know, a graph is a set of vertices and a set of edges,
where an edge is, you know, a pair of vertices. And the sequential specification of a graph, you
know, is very simple. You know, you have -- so basically it's the sequential specification of sets,
plus this invariant, right? So it says that you can only add an edge if the two vertices are in the
vertex set and you can only remove a vertex if there was no edge between -- if there's no edge
that uses that vertex, right?
Now, so this adds a new problem. Okay? I already had the problem of concurrent adds and
removes within each set, which I solved with the OR-set. Now I have a new problem. What happens
if I have a concurrent removeVertex with an addEdge, right? So if I'm trying to remove a vertex, even if
I think locally that I'm allowed to do that, because there's no edge attached to it, there might be
somebody adding an edge. So what can I do? Again, I could try to use -- you know, to stick to
the sequential specification, but then I have to use linearizability or if I want to use, you know,
strong eventual consistency, I have to have some well defined unique outcome. And again,
there's a whole choice of what that outcome can be.
We will use -- for our purposes, which is, you know, modeling the web, it turns out that addEdge
-- addEdge wins is the specification we want. So I'll just spend a little bit of time to see, you know,
how can we -- well, actually I've already given away the result, right? The result is that you just
have to choose some unique outcome for that. Okay.
So the way we do this, we implement this as, again, a pair of sets. So there's an
observed-remove set of vertices and an observed-remove set of edges, right? And basically what we do
is we do not -- we do not try to enforce the invariant, right? So if you want to add an edge or you
want to remove a vertex, fine, just go ahead, right. And we'll check the invariant on lookup.
Right? So if you try to, you know, create an edge to something that doesn't exist, when you do
the lookup, we'll just say there's no edge in there. Right?
And that's consistent with the semantics of the web, right, where you can create a URL to
something that's not there yet. And that thing might appear later or it might never appear. Okay. So
that's why this is the right semantics for us.
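Putting it together, a sketch of the graph as a pair of the OR-sets sketched earlier, with the invariant checked only at lookup time (add_edge deliberately does not consult the vertex set):

    class ORGraph:
        """Graph as two observed-remove sets (the ORSet sketch above).
        Updates never check the invariant; lookup filters out edges
        whose endpoints are gone, matching the web-like semantics."""

        def __init__(self):
            self.vertices = ORSet()
            self.edges = ORSet()

        def add_vertex(self, v):
            self.vertices.add(v)

        def remove_vertex(self, v):
            self.vertices.remove(v)      # no check for dangling edges

        def add_edge(self, u, v):
            self.edges.add((u, v))       # no check that u and v exist

        def lookup_edge(self, u, v):
            # an edge is visible only while both its endpoints are
            return (self.edges.lookup((u, v))
                    and self.vertices.lookup(u)
                    and self.vertices.lookup(v))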
>>: [inaudible] eventually -- that invariant will eventually converge to something -- it may be false
for a while, but once it becomes true in the presence of no more updates, it won't be --
>> Marc Shapiro: You'll never know. You'll never know. It's just when you try to do the lookup it
will say no, there's no edge there.
>>: That's what I mean, about the lookups. Eventually if there's no updates, the lookups will be
consistent.
>> Marc Shapiro: Yes, but -- yes. No, but even at any point in time. That's not even eventually.
Just now. If you do the lookup now and you happen to observe that the, you know, target, the
head vertex has been deleted, you will also, by side effect, observe that the edge has been
deleted. So basically deleting an edge has a side effect on -- sorry. Deleting a vertex has a
side effect on the edges.
>>: Question. You stated earlier that [inaudible] remove wins?
>> Marc Shapiro: Remove wins with. Sorry.
>>: And then is there a reason to not enforce the invariant in local operations in the sense that
[inaudible] try to add an edge between two vertices that at the moment [inaudible].
>> Marc Shapiro: So there are two reasons for not enforcing the invariant. One
is because I'm trying to stick to the semantics of the web, where you can create a URL to
something that doesn't exist. That's fine. And the other reason is for sharding, okay? Because if
I want -- if I shard, I don't want to have to check with a remote shard whether my
invariant is true.
So this is basically the execution that we saw at the beginning. And, okay. This is what I was
going to tell you about sharding. So for sharding, each one of the sets is sharded using the
technique I described earlier. And as I just mentioned, you -- using this semantics is good for
sharding because I don't have to look up the remote shard when I -- when I create an edge or
I delete. Yes?
>>: [inaudible] replicas on the edge --
>> Marc Shapiro: So you might have -- you might have -- you might have this vertex in one shard
and this vertex in another shard.
>>: [inaudible].
>> Marc Shapiro: No. Only this guy does.
>>: [inaudible].
>> Marc Shapiro: When you do lookup -- when you do lookup, then you'll have to look up this
vertex.
>>: So lookup is now a distributed operation?
>> Marc Shapiro: Yes. So I move the sort of the burden of the work to lookup. But I can -- but I
can do consistent snapshots. Okay? So I will do a consistent snapshot and then do a lookup on
the consistent snapshot.
Okay. So that's basically what we've done. We've designed all this stuff. And we're only just
now implementing it. So this work is being funded by a grant from the French government and
also from a grant from Google. So basically we have three years to study all this -- you know, the
theory and the practice of CRDTs and strong eventual consistency.
And what we're thinking of doing is something like this. We're thinking of -- so if you look at the
semantics, for instance, of web search, all the stuff that's in there -- well, not all the stuff, but a lot
of the stuff that's in there is amenable to CRDTs. So basically if you, you know, crawl your
websites and you create a local copy of your website, basically, it's going to be a map from a
URL to some content, right. So that's just a map.
For the graph, I showed you how you can do a graph as a CRDT, and the list of words in your system is
just a map again. So a lot of these things could be -- could be implemented as CRDTs. So we
combine three things. One is CRDTs. The other is sharding, in order to scale. And the other
thing is using a dataflow approach in order to do your updates incrementally, right, and to just
push them down the dataflow.
So for instance if you -- if you crawl your websites you can say compare each page with its
previous version and then you can tell which links have been added and removed, which words
have been added and removed. And you just need to push those operations downstream. And
you would use that -- those operations to update, for instance, the graph and the words.
The advantage of doing this is that you can sort of adjust your quality of service according to what
kind of resources you have. So if you have lots of resources, you can push all your updates right
away. If you have -- if you don't care about your quality of service but you want to
save on resources, you can throttle the updates. That's just what I -- same thing as what I had
just mentioned.
This one is the same one. And if that's true, if what we're saying is true, then this should be able
to scale very well. You could have, you know, crawlers that are very local to your -- to the
website that you're looking at. And then you could just push updates to the rest of the world
asynchronously and you could have lots of copies of this. All these things here are CRDTs. And
so everything can be asynchronous. But there's some things that have to be -- probably have to
be centralized. So, for instance, computing the page rank is probably going to be a centralized
operation that has to work on a consistent snapshot of your data.
>>: [inaudible]. For example if I take the classic bank account, for example two accounts in two
counters, what are the guarantees [inaudible].
>> Marc Shapiro: Eventual consistency. Which is not what you want for that [inaudible].
>>: For both?
>> Marc Shapiro: Yes, for both. Say, you know, remove a hundred dollars from this account and
add it to that account. Eventually you will see the fact that the money has been moved. But you
might see states where the money is in both accounts or in neither of the accounts.
>>: [inaudible] account allow negative state in this case?
>> Marc Shapiro: Definitely. It's very hard to enforce invariants, strong invariants, with CRDTs
because enforcing an invariant means that you might be able to check your, say, precondition
locally, right? But then you have no guarantee that your precondition is going to be true at
another replica.
And for the same reasons, transactions are hard. Which is why we need to have consistent
snapshots instead. It's an alternative to transactions.
>>: Is there a [inaudible] sum up the framework?
>> Marc Shapiro: Yeah. There's a couple papers.
>>: Okay. Which one in particular?
>> Marc Shapiro: Well, we can look at [inaudible].
>>: Oh, okay. Great.
>> Marc Shapiro: Anyway, this [inaudible].
>>: [inaudible] consistent snapshots?
>> Marc Shapiro: So basically in the case of -- the interesting case is the set or the graphs.
So if you remember each element in this set has a -- has a unique identifier. You can use a
timestamp as that unique identifier, right? And then you can tell at any time -- and if you retain
tombstones, right, you can tell at any point -- you can choose a point in time and say well, this
element was in the set at that point in time [inaudible].
>>: [inaudible] as the sign deadlines [inaudible] synchronized.
>> Marc Shapiro: Exactly. When you -- when you do a consistent snapshot, you need to take a
-- a version vector, which is basically a collection of local timestamps. So for each one of these
local timestamps you can tell whether an element is in the set or not.
>>: [inaudible].
>>: [inaudible].
>> Marc Shapiro: Yes.
>>: [inaudible].
>> Marc Shapiro: Yes.
>>: [inaudible] it's not even just one timestamp --
>> Marc Shapiro: But that's because -- but that's because you want to be able to do consistent
snapshots. If you're only interested in the current state, then you can throw a lot of that stuff
away. Okay? So I'm done. Thanks for all the questions.
[applause].
>> Marc Shapiro: Oh, you want --
>>: Actually I found it.
>> Marc Shapiro: Okay.
>>: I found one. It's the [inaudible].