>> Amar Phanishayee: All right. So let’s start. It’s my pleasure to
introduce Iulian Moraru. Iulian is a five year grad student at CMU where he
works with the recently tenured David Andersen. Iulian’s interests are in
distributed systems, and early on in his grad career he also looked into data
structures for solving problems in the realm of machine learning and applied
them to a really scalable [indiscernible] at CMU. His PhD thesis was focused
on improving this glorious protocol that you all know,
Paxos, and Iulian is going to talk to us about that today. So please help me
welcome Iulian.
[clapping]
>> Iulian Moraru: Thank you. Thank you, Amar. Thank you for having me here.
So this talk is going to be about revisiting Paxos. And I am going to try to
convince you that it’s indeed as exciting as that sounds, just not in that
way.
So the general topic of my thesis work is fault tolerance in distributed
systems. And fault tolerance, as in most other fields, is all about
redundancy. The way we achieve redundancy in distributed systems is through
state replication. That means replicating the state of a process onto
multiple machines so that if some of these machines fail the remaining ones
will be able to continue handling client queries and commands just as the
failed ones would have.
And to keep the state of these processes in sync we made them behave like
state machines that change their internal state only as a result of executing
the commands that were posted by the clients of the system. And if we make
sure they execute the same commands in the same order then they will
transition through the exact same sequence of states.
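As a concrete illustration of that idea, here is a minimal sketch in Python
(purely illustrative; the key-value state machine and the command format are
my assumptions, not anything from the talk): three copies of a trivial state
machine apply the same commands in the same order and end up in identical
states.

class KVStateMachine:
    # A trivial state machine: its entire state is one dictionary.
    def __init__(self):
        self.state = {}

    def apply(self, cmd):
        # Each command is ("put", key, value); applying it mutates local state.
        op, key, value = cmd
        if op == "put":
            self.state[key] = value

commands = [("put", "x", 1), ("put", "y", 2), ("put", "x", 3)]
replicas = [KVStateMachine() for _ in range(3)]
for cmd in commands:                      # same commands, in the same order, everywhere
    for replica in replicas:
        replica.apply(cmd)
assert all(r.state == replicas[0].state for r in replicas)
print(replicas[0].state)                  # {'x': 3, 'y': 2}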
And this approach is called State Machine Replication or SMR for short and it
is important. For example in local area clusters state machine replication
is used for operations as diverse as replicating data, resource discovery,
distributed synchronization, and as we build larger and larger clusters
there is increasing pressure on these implementations of state machine
replication to have both high throughput and high availability. And these
are some real world systems that implement state machine replication for
these operations.
We also use state machine replication in the wide-area, because we have
clients on different continents accessing the same databases. So we want to
bring that data closer to clients and we also want to be able to tolerate
full datacenter outages.
And because in this setting distances are so large, any inefficiency, any
unnecessary message delays or extra round trips to commit, is going to have a
high impact on latency. So it is critical in this setting for our
implementations of state machine replication to have low latency. Spanner
and Megastore are well known systems that implement state machine replication
for general replication.
Now there are multiple ways to implement state machine replication, and they
range from simple primary-backup protocols to more complex protocols such as
Paxos and Byzantine fault tolerance. Now, all the systems that I presented on
the previous slides implement Paxos or protocols that are very similar to
Paxos, and the reason is that Paxos has a nice balance between
safety, which is to say what kind of failures it can tolerate, and
performance.
And what I mean by performance is that Paxos and Paxos-like protocols are
fast. They are at least as fast as a simple primary backup, and they do not
depend on external failure detectors, which makes them have very high
availability. And here is an example of what I mean by that. Let’s say we
have a primary backup system, if there is a network partition and we let the
clients continue to talk to both the primary and the backup their states will
diverge, which is exactly the opposite of what we want to achieve.
So, what we do in the setting is we have an external entity, this external
failure detector which decides, for example, that the primary has officially
failed and the backup is the new authoritative copy
of the data. And then the clients will learn from this external failure
detector that they have to talk only to the backup.
But, of course, it takes time between the network partition setting in and
the clients switching to talking to the backup, and that's a window
of time where the system is essentially unavailable. In Paxos we don't have
that problem because we use more resources. So here we have three replicas
instead of two to tolerate the failure, but any of these replicas appearing
to have failed, or any network partition setting in will not cause the
majority to stop working.
And in fact we don’t have to declare failure synchronously. If a replica
appears to have failed we just continue with the majority. For this
reason Paxos has high availability, essentially constant availability,
or in other words instant fail over. And this brings me to the overarching
goal of my thesis work.
So in my thesis work I want to improve state machine replication,
specifically Paxos-style state machine replication and by that I mean systems
that tolerate benign failures, so non-Byzantine failures. And by
Paxos-style I mean systems that use [indiscernible] consensus. I want to
improve it in multiple practical important dimensions, but at the same time
in a way that is well anchored in theory and here’s what I mean by that: so a
Paxos system has essentially two components. There is this nice elegant
core, which is this general algorithm, the Paxos algorithm.
And then there are a bunch of implementation considerations, right. How do
we choose to implement a certain feature? How do we optimize a certain
performance characteristic that’s important in our system? And I want to
improve state machine replication in a way that also expands this nice
algorithm core to include more of the practical implementation considerations
because I want the result to be applicable to a wide range of applications.
So in this talk I will present the two main components of my work so far.
They are Egalitarian Paxos which is a new [indiscernible] protocol based on
Paxos. It has lower latency, high throughput and higher performance
stability than previous state machine replication protocols and I will also
talk about Quorum read leases that address an orthogonal performance
characteristic of state machine replication. And that is, how do we read
really quickly from replicated state machines?
So I start with Egalitarian Paxos. Before I get to Egalitarian Paxos it’s
useful to go through a quick overview of Paxos. So Paxos is at its core an
agreement protocol used by a set of distributed processes to agree on one
thing. It tolerates F failures with a total of 2F plus 1 replicas and that’s
optimal because, remember, we don't depend on external failure detectors.
And Paxos tolerates only benign failures, non-Byzantine failures. Machines
can fail to reply for an indefinite amount of time, but they will not reply
in ways that do not conform to the protocol.
Also, communication is asynchronous. So for Paxos to be safe we don't make
any synchrony assumptions about communication; however, for it to be live
there has to exist a window of time where there is synchrony. So
how do we use Paxos, this protocol which at its core is a consensus protocol?
It lets us agree on one thing. How do we use it to agree on a sequence of
things? Right, that’s what we want in state machine replication.
And I am going to show you this as an example. Let’s say we have three
replicas and this is their replicated state; everyone has a copy of this.
Initially all these command slots are empty, right, they are ordered command
slots. Clients talk to the replicas to propose commands and these
replicas will contend for these preordered slots and they will do so by
running Paxos. As a result of running Paxos only one of these replicas will
win that slot. Everyone will know which replica that is; everyone will know
what command went into that slot. The replicas that lost that slot will
contend for a different slot and so on. They will [indiscernible] in Paxos.
Now, when a continuous sequence of slots has been filled every replica can
independently execute the same commands in the same order, thus ensuring that
their states will remain in synch.
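To make that slot mechanism concrete, here is a minimal sketch in Python (the
names and structure are my assumptions, not the talk's code): each replica
keeps a log of numbered slots, a separate Paxos instance decides each slot,
and a replica executes only the contiguous prefix of decided slots.

class ReplicaLog:
    def __init__(self):
        self.decided = {}          # slot number -> command chosen by that slot's Paxos instance
        self.executed_up_to = 0    # highest slot executed so far

    def decide(self, slot, cmd):
        self.decided[slot] = cmd   # outcome of running Paxos for this one slot

    def execute_ready(self, apply):
        # Execute the longest contiguous prefix of decided slots, in order.
        while self.executed_up_to + 1 in self.decided:
            self.executed_up_to += 1
            apply(self.decided[self.executed_up_to])

log = ReplicaLog()
log.decide(2, "put y 2")           # slots can be decided out of order...
log.decide(1, "put x 1")
log.execute_ready(print)           # ...but they are executed in slot order: 1, then 2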
>>: So are you implying executing the Paxos [indiscernible] on every command
or are you not really doing that?
>> Iulian Moraru: I am going to get to that in the next slide. So this is
like the canonical execution of Paxos.
So the take away here is that in canonical Paxos we use a separate instance
of Paxos to decide every slot, and that it takes two round trips to commit a
command, because the first round trip is necessary to take ownership of a
slot. Only after a replica has taken ownership can it propose the command.
But, of course, as you remarked, this is inefficient because it takes two
round trips.
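As a rough illustration of those two round trips, here is a minimal
single-slot Paxos sketch in Python (illustrative only; the message names and
structure are assumptions, and retries, failures and the separation of roles
are simplified away): the Prepare/Promise round takes ownership of the slot,
and only then does the Accept round propose the command.

class Acceptor:
    def __init__(self):
        self.promised = 0            # highest ballot promised so far
        self.accepted = (0, None)    # (ballot, value) last accepted, if any

    def prepare(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, ballot, value):
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return "accepted"
        return "nack"

def propose(acceptors, ballot, value):
    # Round trip 1: take ownership of the slot.
    replies = [a.prepare(ballot) for a in acceptors]
    promises = [r for r in replies if r[0] == "promise"]
    if len(promises) <= len(acceptors) // 2:
        return None
    # If some value was already accepted, we must adopt the highest-ballot one.
    prior_ballot, prior_value = max((r[1] for r in promises), key=lambda bv: bv[0])
    if prior_value is not None:
        value = prior_value
    # Round trip 2: propose the command for this slot.
    acks = [a.accept(ballot, value) for a in acceptors]
    return value if acks.count("accepted") > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, ballot=1, value="put x 1"))   # committed after two round trips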
So, practical implementations of Paxos for state machine replication are
usually referred to as Multi-Paxos. And in Multi-Paxos one of these replicas
is the pre-established owner of all the slots. So for example in this case
the green replica is the pre-established owner of all the slots. And then
clients talk to only this replica, this one replica, which will decide which
commands will go into which slot.
And it will be able to do so after just one round trip, because again the
first round trip was to take ownership of the slot, and that's not necessary
here anymore; it is already the owner. But, unfortunately, the single stable
leader replica, as I am going to refer to it in this talk, can be a
bottleneck for performance and availability, right. It has to handle more
messages for each command than all the other non-leader replicas and if it
fails there is a window of time where there is no leader before another
replica is elected as leader.
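And to contrast with the previous sketch, this is roughly what the
Multi-Paxos fast path looks like (again a sketch with assumed names): because
the leader's ownership of all slots was established once, up front, each new
command costs only one Accept round to a majority.

class StableLeader:
    def __init__(self, acceptor_count, ballot=1):
        self.n = acceptor_count
        self.ballot = ballot       # ownership of every slot was taken once, beforehand
        self.next_slot = 1

    def commit(self, cmd, send_accept):
        slot = self.next_slot
        self.next_slot += 1
        acks = sum(send_accept(peer, slot, self.ballot, cmd) for peer in range(self.n))
        return acks > self.n // 2  # one round trip per command

# Toy transport in which every acceptor acknowledges the Accept message.
leader = StableLeader(acceptor_count=3)
print(leader.commit("put x 1", lambda peer, slot, ballot, cmd: True))   # True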
So the question that motivated this research is: can we have it all? Can we
have this high throughput and low latency given to us by Multi-Paxos? But,
at the same time we want to preserve the constant availability of canonical
Paxos. And we want to furthermore be able to distribute load evenly across
all our replicas so that we use our resources efficiently and get high
throughput. We also want to use the fastest replicas. Perhaps in our system
there are concurrent jobs running on some of these replicas and we might want
to avoid those replicas that experience high load when we commit commands.
>>: So let me ask what you mean by constant availability in standard Paxos.
In particular it seems to me that, if I remember the way you described it,
clients just send their commands to some replica and then the replica tries
to contend for the next slot, right.
>> Iulian Moraru: Correct.
>>: If the client chooses a replica that has failed then obviously that
replica is not going to do anything.
>> Iulian Moraru: That’s right, so for that it might appear that the whole
thing is unavailable.
>>: Right.
>> Iulian Moraru: But for other clients --.
>>: It’s the retries that makes it different.
>> Iulian Moraru: Exactly.
>>: So I guess the question is: is that really better than using Multi-Paxos?
>> Iulian Moraru: Well it matters in that other clients that have not been
unlucky to choose that replica that has failed will chose replicas that have
not failed and for them the system will be available.
>>: Yeah, but doesn't it come out in the wash? Because what will happen in
the canonical Paxos scheme when you take a replica failure is that, until you
have a timeout, because the client can change replicas, let's say with 3
replicas a third of the clients will see unavailability.
>> Iulian Moraru: That’s right.
>>: And in Multi-Paxos if you take a replica failure two-thirds of the time
you are not going to lose the leader so there will be no effect at all and
one-third of the time everybody will slow down.
>> Iulian Moraru: That’s right.
>>: So when you multiply this out they come out the same.
>> Iulian Moraru: That's right, but I am not talking about what happens if
you take the whole thing over a day. I am talking about what happens in that
moment, and in that moment one-third of the clients for canonical Paxos will
experience higher latency for those commands, whereas the other two-thirds of
the clients will not experience that.
For Multi-Paxos when the leader fails every client will experience higher
latency for all their commands.
>>: This is true, but it only happens a third as much.
>> Iulian Moraru: That’s right.
>>: Because it's going to have a third as many leaders as it has total
replicas. So you wind up with --. I mean it's not clear that it's that much
better, right.
[Inaudible]
>> Iulian Moraru: I think we agree on what I define by availability. I mean,
unavailability is when all the clients cannot do anything for a particular
time.
>>: Right, if you define availability as some client is able to do something
then canonical Paxos is better than Multi-Paxos.
>> Iulian Moraru: That’s right.
>>: Exactly.
>>: All right.
>> Iulian Moraru: And you may argue that’s not enough, but that’s what I mean
here.
>>: I mean it’s not clear that it’s actually in practice any better, but I
mean I understand what you are saying.
>> Iulian Moraru: Okay.
>>: Good.
>> Iulian Moraru: So the last property that we want is to be able to use the
closest replicas because in wide area replication we want to commit commands
after talking to our closest neighbors instead of having to go perhaps to a
neighbor that's furthest away.
So, as I said, canonical Paxos has these properties, but because
of the two round trips it's not exactly efficient, right. It doesn't have
high throughput and low latency. Multi-Paxos solves that, but it loses these
other properties in the process.
Now by contrast EPaxos has all the properties that I have mentioned and it
implements them efficiently. So much so that it has higher performance than
Multi-Paxos. In Egalitarian Paxos it's all about ordering. So we have seen
that previous strategies include contending for slots, which is the case for
canonical Paxos, and one replica deciding, which is the case for Multi-Paxos,
but also for
other versions of Paxos like Fast Paxos and Generalized Paxos.
And finally a newer version, a newer protocol based on Paxos called Mencius
has this property that replicas take turns in committing commands. So it is
pre-established that the first replica is the leader of every third slot
starting at one. The second replica is the command leader for every third
slot starting at two and so on. And Mencius is effective in balancing load
because at the same time there will be many commands being proposed
concurrently and all the replicas will be leaders for some commands and
acceptors for other commands.
Unfortunately, in Mencius the whole system runs at the speed of the slowest
replica because we cannot decide a slot, we cannot commit it, until we have
learned what happened in all the previous slots. And the previous slots
belong to all the replicas. So a third of the previous slots belong to every
replica. Furthermore any replica being unavailable causes the whole system
to become unavailable.
>>: Which makes you wonder what the point is.
>> Iulian Moraru: Better throughput, it does have better throughput.
>>: Okay, as opposed to just doing a single machine.
>>: Anyway, okay, right, that’s not your --.
>> Iulian Moraru: It is effective in balancing load.
>>: Right.
>> Iulian Moraru: So, in EPaxos we take a different approach. Instead of
having a linear sequence of pre-ordered slots we split that space into as
many subspaces as there are replicas. And we give pre-established ownership
to each replica over one of these rows. So this is the replicated state,
everyone has a copy of this two-dimensional array. And clients will propose
commands to any replica of their choosing and that replica will get that
command committed in one of its own slots, so a slot that belongs to its own
row. So for the green replica that's the first row and so on.
And this is good because there is no longer contention for slots, but it's no
longer clear how we order these commands; like, does 3.1 come before or after
2.2? So the way we do it is in the process of deciding what command is to be
committed in a slot we also decide ordering constraints for that command.
This means that B has an ordering constraint on A, so B should come after A,
it should be executed after A. And we do this for every command and when
these commands have been committed every replica sees the same command in the
same slot and with the same ordering constraints. So they can independently
analyze these constraints and come up with the same ordering.
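A small sketch of that replicated structure, with field names I made up for
illustration: one row of instances per replica, and each committed instance
carries both the command and its ordering constraints. How every replica
independently turns those constraints into the same execution order is
sketched later, after the dependency-graph discussion.

from dataclasses import dataclass, field

@dataclass
class Instance:
    command: str
    deps: set = field(default_factory=set)    # e.g. {"R1.1"} means "ordered after R1's slot 1"

# Every replica eventually holds an identical copy of this two-dimensional structure.
instances = {
    "R1.1": Instance("put a 1"),
    "R2.1": Instance("put a 2", deps={"R1.1"}),   # committed with a constraint: after R1.1
    "R3.1": Instance("put b 7"),                  # touches different state, so no constraints
}
for slot, inst in sorted(instances.items()):
    print(slot, inst.command, "after", inst.deps or "nothing")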
So the take away here is that we achieve load balancing, because every
replica is a command leader for some commands and we have the flexibility to
choose any quorum of replicas, any replicas to be part of a quorum to commit
a command. We no longer have to involve one particular replica in every
decision, like was the case for the stable leader in Multi-Paxos.
Question?
>>: Is there an assumption that a client issues only one command at a time?
I mean, if I am a client and I have a set of commands that depend upon one
another and I send those to different replicas that aren't communicating, I
could have those operations commit out of order in terms of what the client
expected [indiscernible].
>> Iulian Moraru: Sorry to interrupt, but what is your question? Are clients
sending commands one at a time? And the answer is yes.
>>: So they wait, they wait until [indiscernible] so you can’t pipeline
commands.
>> Iulian Moraru: If it is important for them that some commands are executed
in a particular order then what they do is they have to wait for the first
command to be acknowledged by the system before they propose next command.
>>: Is that true for all versions? Because with Multi-Paxos it seems like
that wouldn't exist, because since I serialize through a leader I can
sequence them and I can be sure that one of them will be executed.
>>: You can't really. So if it's Smarter it doesn't work that way.
>> Iulian Moraru: So even in Multi-Paxos you don’t get that guarantee.
>>: And the reason is you might change leaders.
[inaudible]
>> Iulian Moraru: That’s right.
>>: So Smarter let’s clients pipeline, but the ordering guarantee that you
get is that anything that you send down is ordered after things that you have
already seen completions for. So you can send out as many as you want, but
they will execute in whatever order.
>>: Okay.
>>: It’s just like doing Disk IO asynchronous system calls.
>>: Well unless you have [indiscernible].
>>: Unless you have [inaudible].
>>: Okay, I understand.
>> Iulian Moraru: Okay, I was saying that --.
>>: Can you do that by the way? Can you do the same thing where a client can
kind of pipeline if he doesn’t care about the order of commands?
>> Iulian Moraru: Sure.
>>: And they still wind up serialized.
>> Iulian Moraru: They wind up serialized, but not necessarily linearized.
>>: Right, yeah.
>>: So in Paxos you have to couple serialization guarantees with cubit
guarantees, similar to ordering.
>> Iulian Moraru: So as I was saying, we have the flexibility in EPaxos to
choose any replicas to be part of our quorums. There is no longer a special
replica that has to be part of every decision, right, like a stable leader.
And we don’t have to get some information from every other replica in the
system like was the case for Mencius. And this has important implications
for performance stability because we will be able to avoid replicas that are
slow or unresponsive and for wide-area commit latency because we will be able
to just talk to our closest neighbors.
Question, yes?
>>: [inaudible].
>> Iulian Moraru: Excuse me?
>>: Well, the important question here is: How are the dependencies
determined?
>> Iulian Moraru: Right, I will get to that in the next slide.
>>: All right.
>> Iulian Moraru: Is there a question?
>>: And maybe this is something you can answer at the end of the talk because
you are saying you want to combine low latency to [inaudible]. I am just
wondering if at the end you could take a few minutes to talk about what
scenarios [inaudible] where high throughput Paxos systems [inaudible].
>> Iulian Moraru: I think the simple answer is that in a local cluster
probably throughput is more important than latency, whereas in the wide area
latency becomes very important.
>>: Okay.
>> Iulian Moraru: Just because any sort of mistake or inefficiency that we
make is going to cost us a lot, tens of milliseconds.
>>: When we did Smarter we had it backed by a disk. High throughput was
really important, as SQL Server [inaudible].
>> Iulian Moraru: Okay, so to go into more detail of: How do we set these
ordering dependencies, I will go through an example. This is a time sequence
diagram and time flows from left to right. We have five replicas and let’s
assume that there is a command A proposed at replica 1. Replica 1 at this
point has no other commands so it says, "PreAccept A". That's what we call
these first messages; we call them PreAccepts.
So take A with the dependency that it depends on nothing. So essentially
there is no ordering dependency at this point. It sends this PreAccept to a
majority of replicas, that is, itself and two other replicas. The replicas R2
and R3 they agree because they haven’t seen other commands. And because
these two acceptors agreed with each other R1 can commit locally and notify
everyone else asynchronously.
Now let's assume that at about the same time there is another command
proposed at replica 5, command B. R5 has not seen anything for A, so it says
B depends on nothing. R4 agrees, but of course R3 has seen a PreAccept for A,
so it says B has to depend on A. Because the two acceptors R3 and R4
disagreed with each other, R5 has to take the union of these constraints and
use a second round of communication saying that these are the final ordering
constraints. And the replicas that receive these Accept messages only have to
acknowledge them. So they don't have to update these ordering constraints
anymore, even if there are some other commands being proposed concurrently.
So they only have to acknowledge.
When they have acknowledged R5 can commit locally and notify everyone else
asynchronously. And one last example let’s say C is proposed to replica 1,
R1 has not seen any message for B so it says C depends only on A. And then
R2 and R3 have both seen commits for B before getting the PreAccept for C.
So they say, “C has to depend on both A and B”. Because the two acceptors
agree with each other R1 can commit locally and notify every other replica
asynchronously, including the client.
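Here is a much simplified sketch of that commit path in Python (my own
illustration, not the talk's implementation: every command is assumed to
interfere with every other, ballots and failure handling are omitted, and the
fast-quorum rules are reduced to "all replies agree"): matching PreAccept
replies let the command leader commit in one round trip, while disagreement
forces a union of the constraints and one extra Accept round.

class Replica:
    def __init__(self, name):
        self.name = name
        self.seen = set()                   # ids of (interfering) commands seen so far

    def preaccept(self, cmd_id, deps):
        # Reply with the proposed dependencies plus anything else we have already seen.
        merged = deps | self.seen
        self.seen.add(cmd_id)
        return merged

    def accept(self, cmd_id, final_deps):
        self.seen.add(cmd_id)
        return True                         # second-round messages are only acknowledged

def commit(leader, quorum, cmd_id):
    deps = set(leader.seen)                 # the leader's own ordering constraints
    leader.seen.add(cmd_id)
    replies = [r.preaccept(cmd_id, deps) for r in quorum]
    if all(reply == replies[0] for reply in replies):
        return replies[0], "fast path, 1 round trip"
    final = deps.union(*replies)            # disagreement: take the union...
    for r in quorum:
        r.accept(cmd_id, final)             # ...and run one Accept round
    return final, "slow path, 2 round trips"

r1, r2, r3, r4, r5 = (Replica(f"R{i}") for i in range(1, 6))
print("A:", commit(r1, [r2, r3], "A"))      # nothing concurrent: fast path, no dependencies
print("B:", commit(r5, [r4, r3], "B"))      # R3 already saw A: slow path, B depends on A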
>>: Okay.
>> Iulian Moraru: So a simple analysis of this protocol shows you that when
commands are not concurrent it will take one round trip to commit, but when
commands are concurrent some of them might have to undergo two rounds of
communication until commit. And that seems like a bad thing, right. That’s
what we wanted to avoid with [inaudible].
[Coughing]
>>: Just to be clear, this does resolve [inaudible] order, right? You would
never wind up with any --. For any pair of reads, or sorry, for operations,
I can always tell what order they are in. There always was [inaudible].
>> Iulian Moraru: That’s right, that’s right. Now to our rescue comes this
observation made by Generalized Paxos before us and by generic broadcast
protocols before Generalized Paxos, which is that we don't actually
have to order every command with respect to every other command. We just
have to order those commands that interfere with each other. And an
intuitive way to think about this is that we only have to order those
commands that refer to the same state. If they don't refer to the same state,
for example, if we have two puts in a replicated key value store to different
keys it doesn’t matter which way we execute them on different replicas. As
long as we execute them both we will get the same state.
So then what this means for our system is that we will be able to commit
commands after just one round trip if they are either not concurrent or
concurrent but non-interfering, and we will take two round trips only for
those commands that are both concurrent and interfering. And it turns out
that in practice it is rarely the case that concurrent commands interfere
with each other. The next logical question is: how do we determine whether
commands interfere with each other before executing them? Because, remember,
we first have to order these commands, and in the ordering process we
have to determine whether they interfere or not and only then can we execute
them.
Now the answer is application specific. For NoSQL systems that are variations
of a key-value store it is very simple, because we just look at the operation
key: if two operations have the same key then they will interfere, otherwise
they don't. Another approach that we can take is the one taken by Google App
Engine, which requires developers to explicitly specify which sets of
transactions interfere with each other. And finally it is possible to infer
interference automatically even for more complex databases, like for example
relational databases, because it turns out that most OLTP workloads have a
few simple transactions that are executed most frequently. And they are so
simple that we can analyze them before executing them and tell with certainty
which tables and which rows in which tables they will touch.
So for these simple transactions we will be able to determine whether they
interfere or not and for the remaining transactions which are more complex
and perhaps we cannot analyze beforehand we can safely say that they
interfere with everything else. Okay, so now that we have these dependencies,
how do we actually order the commands? And this again is an example with
five commands: A, B, C, D, E, and the edges here are the dependencies. So A
has a dependency on B, B also has a dependency on A and
so on.
So the first thing that we do is we find the strongly connected components.
And then, because the graph where the SCCs, the strongly connected
components, are super-nodes is a DAG, a directed acyclic graph, we can sort
it topologically and execute every strongly connected component in inverse
topological order. When there are multiple commands in a strongly connected
component we use another ordering constraint that I haven't described. We
call it an approximate sequence number, but a simple way to think about it is
that it is just a Lamport clock, and that is what it is essentially. And we
execute in increasing order of this approximate sequence number.
So what we get is --.
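Here is a small sketch of that execution step in Python (illustrative only;
the dependency graph and sequence numbers are taken as given, as if read from
already committed commands): find the strongly connected components, execute
components dependencies-first, and break ties inside a component with the
sequence numbers.

def find_sccs(graph):
    # Tarjan's algorithm; graph maps each command to the set of commands it depends on.
    index, low, on_stack, stack, comps = {}, {}, set(), [], []
    counter = [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            comps.append(comp)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return comps        # a component is emitted only after everything it depends on

def execution_order(deps, seq):
    order = []
    for comp in find_sccs(deps):                          # dependencies first
        order.extend(sorted(comp, key=lambda c: seq[c]))  # ties broken by sequence number
    return order

# A and B were concurrent and interfere, so they ended up depending on each other;
# C was committed later and depends on both.
deps = {"A": {"B"}, "B": {"A"}, "C": {"A", "B"}}
seq = {"A": 1, "B": 2, "C": 3}
print(execution_order(deps, seq))                         # ['A', 'B', 'C'] on every replica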
>>: Can I ask you a question. You just move a little too fast for me. So
what’s a dependency? Is that interference? You started using the word
dependencies in the previous graph.
>> Iulian Moraru: So dependency is an ordering dependency.
>>: It’s an ordering dependency.
>> Iulian Moraru: It means that a command has a dependency on another command
if when committing this command, the second one, we saw something about the
first command, right, so we had a dependency. Now we only have dependencies
if those two commands interfere. So commands interfere if they refer to the
same state. If they interfere then we add some ordering dependencies between
them.
>>: More particularly if they have [inaudible], right.
>> Iulian Moraru: Correct, if they are both reads it's not necessary, right.
So the properties --.
Yes?
>>: So would it be the case that analyzing interference or not takes longer
time than long-term [inaudible]?
>> Iulian Moraru: For some applications it may take longer time, right. My
argument is that for many applications it is actually very easy, because we
have things like operation keys.
>>: Yeah, so --.
>> Iulian Moraru: And even for relational databases. If you take the TPC-C
benchmark, which is very popular in the database community, it has one
transaction, I believe it’s called a new order transaction, which is very
easy to analyze. And we can tell with certainty whether two of these new
order transactions can interfere or not.
>>: Right, but that depends, at that point, on what you are considering to be
the same thing, right. You mean the semantics are the same or the state stays
the same?
>> Iulian Moraru: I wanted to say that they commute.
>>: They commute.
>> Iulian Moraru: That’s the simplest way.
>>: So the system won’t resolve in [inaudible]?
>> Iulian Moraru: But that’s the case for Multi-Paxos too, because every
replica has to execute all the commands at exactly the same time.
>>: Um.
>> Iulian Moraru: As long as they will execute all the commands that have
been committed so far then they will have the same state.
>>: If they execute all the commands in the same order they will have the
same state, right. You can imagine, let's say, a page applier that just
grabs the next page and [inaudible].
>> Iulian Moraru: Sure, yeah.
>>: And that only matters when [inaudible].
>> Iulian Moraru: I am sorry, so you are suggesting that --.
>>: Well one of the things that we addressed in Smarter is that if you get
into a situation where, let’s say a replica goes down for awhile, right, it’s
not permanently lost, but it’s off for --.
>> Iulian Moraru: Right and it doesn’t get [inaudible].
>>: So you have two options: one is you can just save all the commands --.
>> Iulian Moraru: Oh, I see, so you are saying you [indiscernible] the state
and you send the state in bulk to the [inaudible].
>>: Or you checkpoint the state and [inaudible]. So you can send the
entirety of the state, that will always work, even regardless of this or you
can send only the pieces of the state that have changed and you can’t do
that.
>> Iulian Moraru: Absolutely, but you can checkpoint here. You can use this
special checkpoint command that interferes with everything.
>>: Right, but the point is --.
>> Iulian Moraru: And after you have executed that you will say --.
>>: Your states will be different, so you will --.
>> Iulian Moraru: In between checkpoints, sure.
>>: In between checkpoints. So to bring back a replica you either have to
continue remembering all the commands, which results in unbounded storage, or
you have to move all of the state.
>> Iulian Moraru: No, you can remember all the commands from the previous
checkpoint.
>>: Well, that's what I meant, but obviously it stops checkpointing when it
stops running. I mean, it's not a terrible thing, but.
>> Iulian Moraru: Yeah, since the last checkpoint that it has observed,
that’s right.
>>: Right, you just lose [inaudible].
>> Iulian Moraru: Yeah, that’s right.
>>: It’s actually kind of handy sometimes.
All right.
>> Iulian Moraru: Yes.
>>: Can I just ask one more question about that previous slide before you
move on. I am trying to understand the semantics of the circular dependency
there. We are trying to determine the order of transactions.
>> Iulian Moraru: So you are saying that, am I getting this correctly? You
are saying that you don't know why two commands might have a circular
dependency, or are you asking why that implies an ordering?
>>: I guess both, it just --.
>> Iulian Moraru: So two commands may have a circular dependency if, for
example, they are concurrent, but it can happen even if they are not
concurrent and they are both concurrent with a third one, for example. Let's
take the simple case: if two commands are concurrent, one of them sees the
other first on some replicas, and the other one is seen first on some other
replicas, just because replicas don't all receive the messages for the same
commands in the same order. Okay.
>>: I thought that was just a transient state. In other words if there is
disagreement I thought we had to keep iterating until finally everyone agrees
on [inaudible].
>> Iulian Moraru: No, that's not what we do. We don't continue iterating. We
only execute at most two rounds, and then if after the second round, if after
the first round, sorry, there has been a disagreement, we take the union of
those dependencies. And when we take the union that's when we get the
circular dependencies.
>>: So hang on, now I don't understand what these are, because I thought I
understood. I didn't think carefully enough.
>> Iulian Moraru: So this means that B got after E on some replica.
>>: After?
>> Iulian Moraru: Yeah, so some replica received some message for B after it
has received some message for E.
>>: It depends on E, so B depends on E?
>>: So you are saying in this case circular is some replica saw A before B
and others saw the opposite.
>> Iulian Moraru: Exactly.
>>: Okay.
>>: I think the confusion was earlier Bill said, “This always establishes a
total order” and you said, “Yes”.
>> Iulian Moraru: It does establish total order because after you have
executed this algorithm --.
>>: After you have executed this algorithm, but earlier, before you described
this slide --.
>>: Right, the first thing you showed does establish it. The two round
algorithm establishes a total order.
>> Iulian Moraru: Only after we have executed this algorithm on the
dependencies that this commit protocol establishes.
>>: So after the algorithm that you showed in the other slides you could
still run into circular dependencies.
>> Iulian Moraru: Exactly.
>>: So does every replica see the same graph?
>> Iulian Moraru: Every replica will see exactly the same graph.
>>: [inaudible].
>> Iulian Moraru: No, in this graph you can assume that they all interfere
with each other. So in the simplest way to imagine this you have some
commands that refer to object A and some commands that refer to object B and
they will have different graphs. Of course, you can have commands that refer
to both A and B and that’s when the graphs will be connected.
>>: Is it possible that they agree because [inaudible]?
>> Iulian Moraru: Right, so those commands will be in different connected
components and would have no bearing on this.
>>: But they will eventually [inaudible] the same order?
>> Iulian Moraru: Okay, exactly.
>>: Okay, I think I get it now.
>> Iulian Moraru: Okay, so the properties that this whole commit protocol and
execution protocol guarantee are linearizability, which is a strong property.
Linearizability implies serializability, and furthermore implies that if two
commands A and B interfere, and A is committed before B is even proposed,
then A will be executed before B everywhere. And the second property, which
is an important property, concerns the size of fast path quorums for state
machine replication that tolerates F concurrent failures. Fast path quorums
will have to contain F plus the floor of (F plus 1) over 2 replicas,
including the command leader.
What I mean by fast path quorums: how many replicas including the command
leader have to agree on a certain set of dependencies for that command to be
committed in the fast path? And that’s the number of replicas. What this
means is that for the most common deployments of Paxos, of 3 and 5 replicas
respectively, this comes down to the smallest majority, so this is optimal.
And it also means that it is better than Fast and Generalized Paxos by
exactly 1 replica. In practical terms this means that if we do
geo-replication and we have a replica in Japan, one on the West Coast --.
Oh, excuse me, sorry.
So if we do geo-replication and we have a replica in Japan, one on the West
Coast, one on the East Coast and one in Europe, a replica in Japan will be
able to commit a command after just talking to its closest neighbors, on the
US West Coast. It doesn't have to go all the way to Europe or all the way to
the East Coast.
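As a quick sanity check of those numbers, and assuming the fast-path size is
F plus the floor of (F plus 1) over 2, which is how I read the claim above, a
few lines of Python:

for f in (1, 2, 3):
    n = 2 * f + 1                  # total replicas needed to tolerate f failures
    majority = n // 2 + 1          # classic (slow-path) quorum
    fast = f + (f + 1) // 2        # fast-path quorum, command leader included
    print(f"N={n}: majority={majority}, fast path={fast}")
# N=3: majority=2, fast path=2    N=5: majority=3, fast path=3    N=7: majority=4, fast path=5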
So we have implemented EPaxos and we have also implemented Multi-Paxos,
Mencius and Generalized Paxos, and we compared them on a replicated key value
store. So first I am going to show you wide area commit latency. This is the
setup I have shown before. We have a replica in an Amazon EC2 datacenter in
North Virginia, one in California, one in Oregon, one in Ireland and one in
Tokyo. And this gives you a sense of the round trip times between those
locations. Now in this diagram I am going to show you the median commit
latency as observed at every location for every protocol. We have clients
co-located with every replica in each datacenter. And these clients commit
commands and then they measure the time it takes to get the reply back.
So for a client in California it has to wait for its local replica to talk to
Oregon and Virginia. And that’s also the case for our client of Multi-Paxos,
but only because the Multi-Paxos leader is located in California. For a
client that’s in Virginia that client has to first go to the leader in
California, wait for it to commit to the command and then get back to
Virginia, so it experiences higher latency.
Mencius has higher latency because, you remember, before committing a command
we have to wait for some information from every other replica in the system,
including the ones that are furthest away. And Generalized Paxos has higher
latency than EPaxos because its fast path quorums are larger. And in fact at
every location EPaxos has the smallest latency of all the protocols because
it has optimal commit latency in the wide area for 3 and 5 replicas.
Now of course, this is median latency. When there is command interference,
the 99th percentile latency is going to be double this. However, there are
ways to mitigate that, ways that we just haven't implemented yet. And we
believe that command interference will be rare enough that it will affect at
most the 99th percentile latency.
Now in the local cluster we compare Multi-Paxos, Mencius and EPaxos to see
how they fare with respect to throughput. This is a 5 replica scenario; for
EPaxos we show different rates of command interference. 0 percent means that
no commands interfere; 100 percent means that all commands interfere. And
what you can see here is that EPaxos has higher throughput than the other
protocols when command interference is low. Furthermore, when a replica is
slow, and for Multi-Paxos the slow replica has to be the leader for this to
have an effect, otherwise throughput remains the same, the performance of
EPaxos degrades more gracefully than for the other protocols. The reason is
we can simply avoid those replicas that are slower, whereas in the other
protocols we cannot.
>>: So a quick question.
>> Iulian Moraru: Yes.
>>: So looking at that top one, you are comparing Multi-Paxos versus the 100
percent EPaxos [inaudible]?
>> Iulian Moraru: Correct, yeah.
>>: So the difference is like 10 or 20 percent?
>> Iulian Moraru: So the difference here is probably smaller than 10 percent.
>>: Yeah, so can you comment on that?
>> Iulian Moraru: So what happens here? So the reason why Multi-Paxos is
slow here is because all the decisions go through the leader and the leader
is bottlenecked.
>>: Right.
>> Iulian Moraru: The reason why EPaxos is slow here is because it has to do
two round trips of communication before committing every command.
>>: Right, but it uses all the five servers.
>> Iulian Moraru: But, it uses all the five servers.
>>: So why is the difference so small?
>> Iulian Moraru: You are saying it should be larger?
>>: Yeah, right.
>> Iulian Moraru: I don’t have the intuition of why it should be larger. So
EPaxos does more work, but it distributes that work across all the replicas.
Multi-Paxos does less work, but it concentrates that work to the leader.
>>: So my practical experience with doing this is that the work of being the
leader is not very substantial.
>> Iulian Moraru: So the assumption here is that the amount of work that you
do on the leader is essentially --. It essentially has to do with
communication, because executing the command for example or sending it to the
disk, that’s not the bottleneck in the system. If that is the bottleneck in
the system then these results would look different.
>>: Okay, what’s the bottleneck?
>> Iulian Moraru: Here it's the CPU. The fact that you have to handle more
interrupts on the leader.
The fact that you have to handle more
>>: So you are [inaudible]?
>> Iulian Moraru: No, so the operations are small enough that this workload
is dominated by messaging.
>>: So these are [inaudible], right, all the leads?
>> Iulian Moraru: Right, sorry.
>>: Okay, and then if we look at the 0 percent conflict case, it's only a
twofold difference?
>> Iulian Moraru: Okay, almost twice, yeah.
>>: So not five times?
>> Iulian Moraru: No, it’s not five times because there is still some work
that has to be done on every machine. And that work sort of like --. So the
amount of work that you do in total in EPaxos is more than the amount of work
that you do in total in Multi-Paxos.
>>: So are you logging these?
>> Iulian Moraru: In this case they weren't logged. In our paper we have a
comparison between them. When we do log them synchronously to SSDs the
results are markedly closer to each other, that's true, right, because there
logging is the dominant factor.
>>: Right and I guess depending on your application you may not care about
persistence.
>> Iulian Moraru: That's the reason why I showed you this graph, because I
wanted to compare the protocols themselves, not necessarily the underlying
needs of the application, which may differ.
>>: You said the dominant overhead in messaging is the time the CPU takes to
service the message?
>> Iulian Moraru: No, no, no, it’s the fact that it has to handle more
interrupts to reply to more messages.
>>: Is that because there are more total messages in the protocol?
>> Iulian Moraru: Because there are more total messages on one replica.
>>: It’s because the [inaudible].
>> Iulian Moraru: So a leader --.
>>: [inaudible].
>> Iulian Moraru: So the leader handles order N messages for each command,
whereas a non-leader replica handles order 1 messages for each command;
that's where the imbalance comes from.
>>: Okay.
>>: [inaudible]. I have one question: in EPaxos are the replicas actually
measuring and sending all these messages to other replicas?
>> Iulian Moraru: Yeah.
>>: So how do you determine the slow replica?
>> Iulian Moraru: Oh, we send pings and we measure how long it takes to reply
to pings, echo messages essentially.
>>: And this is done once?
>> Iulian Moraru: It's done periodically, something like every 250
milliseconds or half a second.
>>: So for the second [inaudible] would you slow down one of the replicas?
>> Iulian Moraru: Yes.
>>: How did you slow it down? By reducing throughput or --?
>> Iulian Moraru: No by having a CPU intensive program, essentially an
infinite loop running on the same machine.
>>: [inaudible].
>> Iulian Moraru: What is the metric?
>>: What is the protocol --?
>> Iulian Moraru: Oh, it's an Amazon EC2 datacenter. I think it's, yeah, I
think [inaudible]. But that doesn't matter, because the bandwidth was not the
bottleneck.
>>: No, but you are saying [inaudible].
>> Iulian Moraru: It’s TCP, yeah.
Okay, so now the intuition of --. So batching is an approach that we
currently use to increase throughput and you might have the intuition that
Multi-Paxos might do better with batching because it has more of a chance to
use larger batches, right. So a Multi-Paxos leader receives all the
commands. It can just commit all these commands in one batch, in one round
of Paxos, whereas in EPaxos we get the same number of commands distributed
across multiple replicas. So there are going to be more batches for the same
number of commands.
This is a latency versus throughput graph. The Y axis is log scale. This is
the curve for Multi-Paxos; obviously it's better to be lower and to the right
because that means higher throughput, lower latency. EPaxos
is significantly better than Multi-Paxos and the reason is the communication
with the clients is now shared by all the replicas instead of having to fall
squarely on the leader.
And perhaps counterintuitively, EPaxos 100 percent is almost as good as
EPaxos 0 percent. The reason is that the second round of communication that
is necessary to commit commands when there is interference, when there are
conflicts, is spread across a large number of commands in a batch, so its
cost is amortized, right. Second round messages are also small; they don't
have to contain the commands themselves anymore, just the revised ordering
constraints.
And finally a nice side benefit, and this comes back to the discussion about
availability, is that in EPaxos we have constant availability. And this will
probably clarify what I mean by that. So let’s say that we have, this is
Multi-Paxos and what I am showing here is throughput over time when there is
leader failure. A replica failure, a non-leader replica failure, would not
be seen here.
So when there is a leader failure the throughput for the whole replicated
state machine goes down until a new leader has been elected. These are
commands from clients that have not been able to commit in the meantime. In
Mencius any replica failing will have this behavior, whereas in EPaxos a
replica failing will cause the throughput to go down by a third. In
this case we had 3 replicas. So the throughput goes down by a third because
those clients that were talking to that replica that has failed have to
timeout and then talk to some other replica.
>>: And again I think it's worth observing that in the Multi-Paxos case
two-thirds of the time nothing happens.
>> Iulian Moraru: That’s right.
>>: Right.
>> Iulian Moraru: But, when something happens it's worse than when something
happens in EPaxos.
>>: But, it happens less often.
>> Iulian Moraru: That’s right.
So instead of conclusions for this part of the talk I am going to try to
disentangle some of the most important insights behind EPaxos. The first one
was that we deal with ordering explicitly, instead of dealing with ordering
implicitly by using Paxos on some pre-ordered slots for commands. And this
gives us high throughput because we load balance better, it gives us
performance stability because we have the flexibility to avoid slow replicas,
and it gives us low latency because we can choose the replicas that are
closest to us.
Furthermore, it lets us optimize only those delays that matter. Previous
Paxos variants that optimize for low latency try to do away with the first
message delay, between the client and the first replica. These are Fast Paxos
and Generalized Paxos. Now, when we are doing wide area replication, where
latency is really very important, that first message delay is going to be
insignificant, because usually the client will be co-located with its closest
replica in the same datacenter.
So instead of focusing on doing away with that message delay we focus on
having smaller quorums which in turn gives us lower latency because we can
talk to fewer of our closest neighbors. And we have a technical report with a
proof of correctness for EPaxos. It also contains a TLA+ specification that
can be model checked. And we have released our implementation of EPaxos and
all the other protocols as open source.
Yes?
>>: Why is it called Egalitarian?
>> Iulian Moraru: Because replicas perform the same functions at the same
time, sort of.
>>: [inaudible].
>> Iulian Moraru: Instead of having a leader that dictates.
Okay, so to go back: the goal of my thesis work was to improve SMR, and I
think that Egalitarian Paxos goes a long way toward doing so because it has
higher throughput than previous state machine replication protocols, it has
optimally low wide area latency for 3 and 5 replicas, it offers better
performance robustness, constant availability and furthermore there is no
need for leader election, there is no leader.
Now can we improve other aspects of state machine replication? A very
important one that doesn't fall into this category is what happens when we
read from the replicated state machine. So how do we improve read
performance? Now, why is that different from command throughput, right? So
let's look at the different ways in which we can do reads from a replicated
state machine.
A simple way is to treat reads just like any other command. So if we have a
client, that client may try to read from a replica. This, by the way, is
Multi-Paxos, though the discussion in this second part of my talk applies to
any Paxos-type protocol; in this particular case the example is Multi-Paxos.
So let’s say that a client tries to read from a replica, that replica will
forward the read to the leader, the leader will just commit it just like it
would do with any other command and then will get back to the client with the
result. Now obviously this involves a lot of communication. This is not
ideal for a wide area.
A second, better way is to have the replica that got the read just talk to
any quorum of replicas, for instance its closest neighbors, and ask them what
is the latest command that they have seen, wait for that command to be
executed, and then get back to the client with the result. This is less
communication, fewer rounds of communication, fewer message delays; it's
better, but there is still some communication.
So, practical systems use leases. The most common variant of leases is
leader leases where the leader has a lease on all the objects in the system.
As long as that lease is active the leader can read from its local cache,
from its local store. Now even if the leader appears to have failed the new
leader will not be able to commit updates until the lease of the old leader
has expired. In this case the client can either read directly from the leader
or have its closest replica forward the read to the leader and get the result
back. Obviously the clients that are co-located with the leader will see
very low latency and other clients will see high latency.
And the other previous way in which we had leases was described in Megastore.
And these were leases where every replica in the system had a lease on every
object in the system. This means that a client can just read locally from
its closest replica and that's great for reads; however, we pay the price
when we do writes, because every write has to go synchronously to every other
replica in the system. We cannot commit that write until every replica in the
system has acknowledged it. So writes will have higher latency, and
furthermore, when a replica is unavailable the whole system is unavailable
for writes.
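A tiny sketch of the lease-based read path shared by these schemes (the
structure and names are mine, not Megastore's or any leader-lease system's
code); it is only safe because, as described above, an update cannot commit
while an unexpired lease is held by a replica that has not been involved in
that update.

import time

class LeasedReplica:
    def __init__(self, store, quorum_read):
        self.store = store                 # local copy of the replicated state
        self.leases = {}                   # object -> lease expiry time
        self.quorum_read = quorum_read     # slower fallback involving other replicas

    def grant_lease(self, obj, duration_s):
        self.leases[obj] = time.monotonic() + duration_s

    def read(self, obj):
        if self.leases.get(obj, 0.0) > time.monotonic():
            return self.store.get(obj)     # lease still valid: answer from local state
        return self.quorum_read(obj)       # otherwise pay a round of communication

replica = LeasedReplica({"x": 1}, quorum_read=lambda obj: ("asked a quorum for", obj))
replica.grant_lease("x", duration_s=5)
print(replica.read("x"))                   # served locally under the lease
print(replica.read("y"))                   # no lease on "y": goes to a quorum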
So if we look at the --.
>>: Until the lease expires.
>> Iulian Moraru: Until the lease expires, exactly.
So if we look at the design space for leases, so far we have these two
points. We have the leader lease, which has almost the same write performance
as a system that has no leases and has better read performance than a system
without leases, but the read performance is not great, right; we can only
read locally from the leader. And we have the Megastore lease, where the read
performance is great, we can read locally from every replica, but we pay for
that in write performance and availability.
And in our work we set out to explore the space in between and I am going to
argue that we can do so with something that we call Quorum leases. And
Quorum leases allow us to get the right trade off for our application between
read performance and write performance, and furthermore I am going to argue
that they actually get us most of the performance, most of the benefits of
both the Megastore and the leader lease.
And the idea here is pretty simple. Instead of having one replica, the
leader, hold the lease, or all the replicas hold the lease, we can let an
arbitrary subset of the replicas hold the lease. Now, in particular, we make
the observation that the fact that in Paxos we have to communicate
synchronously with a quorum before we can commit an update induces a natural
leasing strategy: we simply take the smallest majorities, the smallest
quorums, and we give them leases for objects.
And in fact we can have different quorums hold different leases for disjoint
subsets of objects. And the assumption here is that different objects are
popular, or are hot, at different replicas. Imagine we have, let's say, a
replicated social network; it is likely that the clients in Europe will send
a different set of requests to the datacenter in Europe than the clients in
the US will send to the datacenters in the US, right. So the assumption is
that there is some geographic locality in our workload.
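Here is a sketch of that idea with made-up datacenter and object names (my
illustration of the scheme as just described, not the actual system): each
object is leased to one majority-sized quorum chosen to match where it is
read, replicas in that quorum serve its reads locally, and a write to it must
synchronously reach every current lease holder, which costs nothing extra
when the lease quorum is also the majority used to commit the write.

# Five replicas; each object is leased to a different majority (3 of 5),
# chosen to match where that object is read most often.
lease_quorum = {
    "eu_profile": {"Ireland", "Virginia", "California"},
    "jp_profile": {"Tokyo", "Oregon", "California"},
}

def can_read_locally(replica, obj):
    # A replica answers reads for obj from local state only while it holds the lease.
    return replica in lease_quorum.get(obj, set())

def write_sync_set(obj):
    # A write to obj must be acknowledged by every lease holder before it commits;
    # since that set is already a majority, the write needs no extra synchronous step.
    return lease_quorum.get(obj, set())

print(can_read_locally("Ireland", "eu_profile"))   # True: local read in Europe
print(can_read_locally("Tokyo", "eu_profile"))     # False: must fall back to a quorum read
print(sorted(write_sync_set("jp_profile")))        # ['California', 'Oregon', 'Tokyo']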
>>: And that [inaudible] multiple objects. Like imagine that you have a B
tree, right, everybody is going to touch the root.
>> Iulian Moraru: Right, so when this happens too often we may not want to
use that. When this happens infrequently we can do two things: we can either
pay the price on the write and just forward it to all the quorums that are
pertinent to that composite update, or we can make sure that we configure the
quorums such that objects that are usually accessed together are part of the
same quorum lease.
>>: [inaudible].
>> Iulian Moraru: Right, right.
>>: [inaudible].
>> Iulian Moraru: Right, have a smaller quorum, exactly, right, right, right.
So I guess what you are saying is that if we have a read heavy object then
maybe we give the lease for that object to more replicas, and if it's a write
heavy object we give that read lease to fewer objects, I mean to fewer
replicas, right; that's a possibility.
>>: So does this also mean that once you start doing this with read leases,
before you are able to commit a write [indiscernible], don't you need to know
the set of objects [inaudible]?
>> Iulian Moraru: That’s right, yeah.
>>: Which isn’t necessarily an obvious thing in general state machine
replication.
>> Iulian Moraru: Uh, not in general, but I think in many cases it's fairly
easy.
>>: And certainly [inaudible].
>> Iulian Moraru: That’s right.
Okay, so instead of describing our design in detail I am just going to lay
out the challenges of implementing quorum read leases. So the obvious
question is which quorums should hold the lease to which objects? And it
becomes quickly apparent that quorum leases should be dynamic. Before quorum
leases we had the single leader lease, and the lease where every replica, all
the replicas, hold the lease, which means that it's very easy to decide who
holds the lease, right; it's either the leader or everyone. But now we have
to adapt to access patterns. We have to give the lease to those objects that
need it more, right; sorry, to those replicas that read those particular
objects more.
So now the challenges become how do we establish, maintain and update these
leases, right? And by that I mean establishing the timing information
necessary to be able to expire the leases in a timely way, and also migrating
different objects between different quorums. So we want to do
this cheaply for millions or many millions of objects concurrently with
minimal bandwidth overhead. So that’s one challenge. And the other
challenge of course is we want to do this while at the same time maintaining
the safety of the Paxos protocol underneath the quorum leases and the safety
of our leases, right. So we want every read to be a strongly consistent read.
And I am going to quickly show you some preliminary results that we have had.
We have used the YCSB benchmark with a skewed workload, the skewed
distribution workload in YCSB, and again, this is the same wide area setup.
And we observed that every replica, not just one replica as with the single
leader lease, can do 60 to 90 percent of the reads locally, and furthermore
60 to 90 percent of all writes have the minimal latency that we can get in a
Multi-Paxos system on this configuration.
Now we believe that with a bit more engineering we can get this number to
over 90 percent for the skewed workload for reads. I think that should be
pretty easy to do. In the local area, quorum leases also have the benefit of
increasing read throughput, because we can now do local reads at every
replica, at every one of our 5 replicas; read throughput increases by 4.6X.
>>: Wow, wow, okay, that depends, right. Again, if you look at what we did
with reads in Smarter, we read from all of them. You use the leader to
guarantee serialization, but you read from all the replicas.
>> Iulian Moraru: But we don’t do any communication for reads.
>>: Right, but in the local area case the communication is relatively cheap.
>> Iulian Moraru: Not for throughput.
>>: Not for throughput?
>> Iulian Moraru: No, I understand your point, which I think is that we
should compare against the read strategy in Smarter, and that's a valid
point, that's right. We should do that, yeah.
>>: Right, which doesn't work very well in the wide area, which it wasn't
designed to do.
>> Iulian Moraru: Right, no I understand.
Now this number is about 20 percent lower than with Megastore leases, right,
because Megastore leases have the best read performance, but we also don't
have the problems for writes that Megastore leases have, in particular the
write availability problem of Megastore leases.
So in summary I presented my work so far in improving state machine
replication. I have shown you that we can get optimally low latency in the
wide area, higher throughput and higher performance stability with
Egalitarian Paxos, and we can get high read performance without sacrificing
writes with quorum leases, and I believe there is extensive room for future
work here. I want, in particular, to study faster reconfiguration algorithms
that have minimal impact on availability. I also want to study more uses of
wall clock time beyond leases.
What I mean by that is, let's say we had synchronized clocks; what can we do
better? We know for sure that there are ways to implement EPaxos with
synchronized clocks that are easier than just a normal implementation of
Egalitarian Paxos. But conversely, if we don't have synchronized clocks and
we want to get things like Spanner's point-in-time snapshot reads, can we do
that with just quorum leases? And I think there might be a way to do that.
And finally, I would like to be able to put this all together and provide
general transactions on top of multiple EPaxos groups that also use quorum
leases.
And the last slide I am going to show is to point out that I have also worked
on other things, including hardware-aware data structures, in particular the
feed-forward Bloom filter, which is a variant of Bloom filters that has
benefits for multiple-pattern matching, like for example malware scanning.
And also on re-designing systems around new memory technologies like
phase-change memory and memristors.
And if your interests are aligned with these please ask me about it. Thank
you.
[clapping]
>> Amar Phanishayee: Are there any more questions, clarifications or do
people want to go back to slides that you might have said I will ask later?
>> Iulian Moraru: Yes.
>>: So this is something I didn’t understand and that is [inaudible].
>> Iulian Moraru: Right, that is not a regular clock; that's just a Lamport
clock.
>>: [inaudible].
>> Iulian Moraru: Yeah.
>>: So is that something that only works in this particular case or is that
something where you could use Lamport clocks to do order and --.
>> Iulian Moraru: So let me see if I understand your question. You are
saying what if we use just the Lamport clocks and not use the other
dependency thing with the graph and all the edges, right?
>>: Yeah.
>> Iulian Moraru: Yeah, so that's not possible. We thought of that and that's
not possible. The reason why it's not possible is that the approximate
sequence numbers for the committed commands don't have to cover the entire
space of that Lamport clock. So for example if a command gets sequence number
5 it doesn't necessarily mean that there have to be commands for 1, 2, 3 and
4, so we might wait indefinitely for something to be committed for 1, 2, 3
and 4.
Yes?
>>: Sorry, I don't have a question; I just have a clarification [inaudible].
If I understand this correctly, basically this trades off the impact of
failures by reducing that impact but spreading it out and making it more
likely over time.
>> Iulian Moraru: Uh huh.
>>: So why would it be better to actually increase the impact of failure, but
reduce [inaudible]? Like I am sort of struggling with it, seems like the
[indiscernible] is a better trade off.
>>: Well, no, it’s not clear, right. It depends on --. If you measure it by
--. If you assume that replicas are equally likely to fail and you measure
it by sort of by total amount of extra latency added by failures then it’s a
wash.
>>: I agree; I understand.
>>: It's better for predictability though. You would rather have a larger
number of smaller failures than a smaller number of large impact failures.
>>: Possibly, but it turns out that from the point of view of the individual
requests it's exactly the same.
>> Iulian Moraru: Let me point out --.
>>: So the question is: Is it useful to have the system partially functioning
some of the time? And it may or may not be; it depends on your application.
>> Iulian Moraru: Let me point out something else, some other way in which
Paxos might be better in this scenario. If we really do want no client to
observe any latency penalty when a replica fails, we can simply propose that
command at a quorum of replicas, right. In this case the problem is of course
that we would have lower throughput overall, right, because we propose each
command twice in a 3-replica system. We never have that latency penalty when
something fails, but our system has to be able to distinguish between the
same command proposed twice, right, and only execute the first one.
>>: Yeah, but you have to be really careful with that, right.
>> Iulian Moraru: No, no, no, because you will have to do that for
Multi-Paxos as well, because imagine a client that has sent a command to a
leader that fails and it re-sends that command, the same command, to another
leader. So any --.
>>: Yeah, but in that case only one of the --. If you do it correctly only
one of those is ever accepted so you don’t ever --.
>> Iulian Moraru: I don’t think so, the leader --.
>>: Trust me, it is in Smarter; it absolutely has that property in Smarter.
>> Iulian Moraru: And you don't use any per-client IDs for commands?
>>: No, there is a per-client ID.
>> Iulian Moraru: Well then that’s probably why it works in Smarter.
>>: Yeah.
>> Iulian Moraru: And that’s how you would solve it in the system.
>>: Well, I mean the way that it, so --.
>> Iulian Moraru: And that’s how you would solve it in any Paxos.
>>: Okay, all right, yeah, so you can do the same thing.
>>: You two need to have a white board later to hash this out.
[laughter]
>> Amar Phanishayee: Any other questions? If not, it's time to go.
[clapping]