>> Amar Phanishayee: So it's my pleasure to introduce Wyatt Lloyd. Wyatt is a grad student at
Princeton. He's actually defending sometime this week, later this week-
>> Wyatt Lloyd: No, next week.
>> Amar Phanishayee: Next week. Okay, there we go. And he works with Mike Freedman at
Princeton. Wyatt has also closely worked with Dave Andersen and Michael Kaminsky at CMU.
And Wyatt’s work sort of tackles building systems that are geo-replicated and offer low
latency to users of the system, and in doing so, they provide the standard things that are required
of large-scale systems today: high availability, high scalability, but also partition tolerance and
stronger notions of consistency. And not just that, but he also offers limited transactions. So at
this point you probably think I'm making things up, right? So I'm going to give it to Wyatt. Take
it away, and blow us away.
>> Wyatt Lloyd: All right. Thank you very much. I'll give these guys one second to trickle in. So
I like to begin the talk by talking about the types of problems that I like to look at which are
what I consider to be classic distributed systems problems, except through the lens of modern cloud
or web services, and so today I'm going to talk about how we re-examined replicated storage
for massive scale websites with the new requirements of low latency, massive scale, and geo-replication.
So what do I mean when I say geo-replicated storage? Well, I mean the backend of massive
scale websites. Things like Facebook, Reddit or Amazon and I mean the storage that holds the
data that represents that service. So on Facebook, this would be your profile, who you’re
friends with, what you like, your status updates and so on. And why is this geo-replicated
storage? Well, if we look at the architecture of these services, we'll see that they are located in
geographically different places, East Coast, West Coast, Europe, for instance. So what are some
reasons for this? Well, one is fault tolerance. If one of our data centers goes down, we still
want our service to be up and available so people can use it; but another one, and the one we
are more concerned with in this talk, is so that we can serve client requests quickly. We want
to take clients, direct them to a nearby data center, serve their request entirely inside that data
center, and then send a response back to them very quickly. And this is because many Internet
companies have done many studies, and they've seen that there’s this strong correlation
between page load time, user engagement, and revenue. So we want things to be really fast.
So we'll zoom in on one of these data centers.
So what happens when I actually connect to one of these Web services? Well, my web browser
is creating a connection with one of the machines in the front end web tier. And so this is the
machine that's building my page for me and interacting with the service on my behalf. And
these machines are, they don't have durable state on them. It’s not stored there, and
they’re independent, meaning they can each do the same thing. So if my service becomes
more popular and I want to be able to handle more clients, this is pretty simple, I just add more
machines to this front end web tier and I can direct clients to them and they can handle front
end requests.
So how does this machine in the front end web tier actually build my pages and interact with
the service? Well, it does this by reading and writing data from a separate storage tier. So in
contrast, this storage tier is where that data is actually durably stored, and it's a cooperative
system. So you have a large number of nodes cooperating together to provide this single
storage abstraction. So this storage tier will be the focus of the talk today. And there's two
interesting dimensions of the storage tier. So our first interesting dimension is the many node
dimension. So the scale of this state is massive. It's very large. It’s far too large to fit on a
single machine. So instead, we have to spread it out across a large number of machines. So the
typical way to do this, and the way that we do this is by sharding the data, or putting different
subsets of the data on different nodes in the system. So one way we can do this is by last
name. So we put Ada Lovelace's data on the server for L and Alan Turing’s on the server for T.
Now our other interesting dimension is the multiple data center dimension. So we have our
data in multiple places, East Coast, West Coast, Europe, so if we write data into one location,
we have to replicate it out to those other locations as well. So at a high level, we have three
goals for this storage. One, we want to serve client requests quickly. We want things to be
very fast because of that correlation between page load time and revenue. Two, we want to be
able to scale out the number of nodes in the data center. As our service becomes more
popular, we want to be able to continue to grow that storage tier to keep up with capacity and
throughput demands. And then third, we want to be able to interact with the data coherently in
a way that makes sense. Now you notice that I've been a little wishy-washy with this term
coherent. And that's because, as we'll see later, stronger things that we could say, like the
strongest forms of consistency, are actually theoretically incompatible with our first two
properties.
So what we do in this work and in this talk is we’re going to take these first two goals and we’re
going to say these are our requirements. We are going to ensure that we’re always going to be
fast and scalable, then we’re going to push on this third goal. How can I interact with this data
in a way that makes sense? So to do that we're going to provide stronger consistency so that
operations appear in an order that makes sense for users and programmers, and stronger
semantics or more powerful tools for programmers to interact with this data spread across
many machines.
>>: Stronger than what?
>> Wyatt Lloyd: So stronger than eventual. Yeah. Okay. So I want to say our first two goals a
little bit more precisely, that we want things to be fast and scalable. So we can do this with the
ALPS properties. So the ALPS properties say that we want a system that has availability, which
means that operations always complete successfully; low latency, which says that operations
always complete quickly, something that's on the order of a local round-trip time, and
something that's much faster than a wide area round-trip time; partition tolerance, which
says that our system continues to operate even if we have a network partition, for instance,
one separating the East Coast and West Coast data centers. And if you take these three
properties together, you get what’s called an always on data store. Operations always
complete, they're always successful, and they're always very fast. And in addition to these
three properties, we also want scalability. And this simply says that we can continue to grow
our cluster, increase our capacity and throughput, by adding more machines.
So in addition to these ALPS properties, let's say we are fast and scalable, we also want to be
able to interact with this data in a way that makes sense. So the way that we do that is through
consistency. And consistency is a restriction on the order and sometimes the timing of
operations in our system. We prefer stronger consistency because it makes programming
easier, there's fewer things for me to reason about as a programmer, and because it makes the
user experience better. Websites behave in a way that I intuitively expect them to. Yeah.
>>: When you say partition tolerance, do you also consider partitions within the data center or
is that outside your-
>> Wyatt Lloyd: Yeah. So we do not. So we assume the partitions happen only in the wide
area, not inside data centers. Okay. So I want to explain stronger consistency a little bit more
clearly through an example. So I want to explain what strong consistency is, and this is more
formally called linearizability. And this gives us two properties. The first, that there exists a
total order over all operations in the system so that everyone sees all operations in the
same order, and that this order agrees with real time, so that if operation A completes before
operation B, it will be ordered in this way. So intuitively, the way to think about this is that if I
write something on the East Coast and then call up my friend on the West Coast and say, hey,
check out this thing that I just wrote, my friend on the West Coast will be guaranteed to see it
in the linearizable system. So linearizability is great for programmers. It's easy to reason about,
and it’s great for users, behaves in a way that they expect.
So there's a spectrum of consistency models, and what consistency can actually achieve with
ALPS systems? So let's start at the top of the spectrum with linearizability, which I just
described, can I achieve this with ALPS systems? So unfortunately, the answer is no. So the
famous CAP theorem, which I expect pretty much everyone in here knows, says that you can't
have linearizability, availability, and partition tolerance. You can't have all three of these things.
So linearizability is out. So let's move down the spectrum a little bit and we get to sequential
consistency. So sequential consistency is the original correctness condition for multiprocessors. It still has that total order that exists over all operations, but it no longer has that
real-time requirement. Sequential consistency is still very, very useful, but unfortunately this is
also provably incompatible with the ALPS properties. Notably, the total order in sequential
consistency and our low latency requirement are incompatible. And this proof about the total
order also means that other consistency models, things like serializability which has a total
order over transactions, not operations, is also provably incompatible with low latency.
So we can't have linearizability or serializability or sequential consistency, what can we have?
Well, if we look at systems that people built, they were fast and scalable, things like Amazon's
Dynamo and Facebook and now Apache's Cassandra system, these, when they are in the low
latency configurations at least, only provide you with what’s called eventual consistency. So
eventual consistency is this weak catch all term that just means that data written in one
location will eventually show up somewhere else. And it doesn't give you any guarantees about
ordering, and it especially doesn't give you any guarantees about ordering for data that's
spread on different shards of the system. So the question is: what sits in this whitespace here
in between sequential consistency, which is impossible to achieve, and eventual consistency,
which doesn't give you very much? So this is where our work falls in, this is where our talk sits, and
we’re going to provide something called causal consistency. So causal consistency is going to
give us a partial order, not a total order, but a partial order over operations in the system that
agrees with the notion of potential causality. So if A happens before B and you see A and B,
you'll see A before you see B. So I'll explain this much more clearly with an example in two
slides.
First, I want to revisit these theoretical results though that say that there's this fundamental
choice between the ALPS properties and the strongest forms of consistency. And what I'm not
arguing is that you always want the ALPS properties. Sometimes you need the strongest forms
of consistency. And specifically, you need them when you want to enforce a global invariant.
So if I have something like a bank account that you want to guarantee never goes below zero, I
need the strongest forms of consistency for that. But in all other situations, we can provide you
stronger consistency while still guaranteeing low latency. So before our work, this was our
feeling in the community for scalable systems, so this is a quote from Amazon's Dynamo paper,
very influential from SOSP in 2007, and then it said: “for Dynamo to provide this always on
experience, this high level of availability, Dynamo sacrifices consistency.” And what our work
says, it says, don't do that. Don't settle for eventual consistency. Instead we can provide you
causal consistency.
So what is causal consistency? So I'll explain it so hopefully it makes sense now. So I’ll do this
through an example. So on our social network we’re going to have Alan Turing remove his boss
from his friends group, and he's going to post to his friends, I'm looking for a new job, and then
his friend Ada will read that post. So you have three operations here and three rules that
define causality. So let's go through all three of them. Our first rule is the thread-of-execution
rule. This says that operations done by the same thread-of-execution are ordered this way by
causality. So Alan’s first operation is before his second. So everyone will see them in that
order. Second rule is the reads-from rule that says operations that read a value are causally
after the operations that wrote that value. So Ada’s read of his post is after his write of that
post. And then our final rule is the transitive closure over our first two rules. So this means that
Ada’s read of his post is still after his first operation. This also means that any of Ada’s later
operations will appear to people after they see Alan's earlier operations. So causal consistency
is great for users because now websites work in the way that they expect. Yeah.
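To make the three rules above concrete, here is a minimal Python sketch (not from the talk; the operation names and helper structure are illustrative assumptions) that derives the happens-before partial order for the Alan and Ada example:

```python
# Minimal sketch of the three causality rules from the example above.
# Everything here (operation names, helpers) is illustrative, not COPS/Eiger code.

from itertools import product

# Operations: (client, sequence number within that client's thread, description)
ops = {
    "alan_unfriend": ("alan", 1, "remove boss from friends"),
    "alan_post":     ("alan", 2, "post 'looking for a new job'"),
    "ada_read":      ("ada",  1, "read Alan's post"),
}

happens_before = set()

# Rule 1: thread-of-execution -- same client, earlier op comes first.
for a, b in product(ops, ops):
    if ops[a][0] == ops[b][0] and ops[a][1] < ops[b][1]:
        happens_before.add((a, b))

# Rule 2: reads-from -- a read is ordered after the write it observed.
happens_before.add(("alan_post", "ada_read"))

# Rule 3: transitive closure of rules 1 and 2.
changed = True
while changed:
    changed = False
    for (a, b), (c, d) in product(list(happens_before), repeat=2):
        if b == c and (a, d) not in happens_before:
            happens_before.add((a, d))
            changed = True

print(sorted(happens_before))
# Includes ('alan_unfriend', 'ada_read') via transitivity: Ada's read is
# causally after the boss removal, so anyone who sees her later operations
# must also see the removal.
```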
>>: What would strong consistency [inaudible]?
>> Wyatt Lloyd: So what would strong consistency look like here? So it doesn't sort of map
directly to this. Everyone would see all of these things in this order, but strong consistency will
also avoid the drawbacks or like the sort of tricky thing you still have to reason about, which I'll
get to in a few slides, and it will allow you to enforce global invariants.
>>: Are you assuming that Alan's order goes to the same data center?
>> Wyatt Lloyd: Yeah. So one of our assumptions is that users are tied to a particular data
center. So the clients of our system are the front end web servers, but sort of the people that
we really care about causality for like meta-clients, the clients of the clients. So we assume that
they're connected to a single data center. So we assume that the web service that's using us
detects in some way, like in your cookie, which data center you've been using
recently, and then will redirect you to that data center if for some reason there's some routing
flap and you’re sent somewhere else. But, yes. We are assuming users are sticking to a data
center. Okay.
So causality is great for users because things work in the way that they expect. So here are
some examples. So when Alan removes his boss from his friends list and then posts that he
wants a new job, he does this because he expects those operations to show up in that order. In
an eventually consistent system, this new job post could show up before the boss removal. And
this is exactly what he didn't want to happen. Or on Amazon, if I add something to my cart and
then I click on my cart, in an eventually consistent system, what I added might not be there. In
a causally consistent system, my boss will never see my post, and in a causally consistent
system, I will always see what I just added to my cart. Or on Wikipedia, if I see some spammy
article and I delete it and then I refresh the page, I expect it not to be there, but it might be
in an eventually consistent system, and won't be in a causally consistent system.
So causal consistency is great for users in that websites work in the way they expect now, but
our bigger contribution is making programmers lives better. And programmers really like
causality because it simplifies what they have to reason about when they're programming. So
here are some examples of this. So let's say I take a photo and I upload it to a website, then I
take that photo and I add it to an album. In an eventually consistent system I can get this add
to album operation before I get the photo that it’s referring to. So I have to reason about well,
what do I do in this situation? Do I queue up a callback, do I even think about this, what's going
on? I have to reason about getting these operations out of order. I don't have to do that in a
causally consistent system. Here's another example. Let's say I like something on Facebook and
then I hide it from my timeline. And in an eventually consistent system, you have to reason about,
what happens if I get this hide-from-timeline operation before the thing I'm trying to hide
exists? Or on Amazon. You have to create an account before you can check out. In a causally
consistent system, when you go to check out, your account will always exist. In an eventually
consistent system it might not.
So causal consistency is great for users and programmers, but it’s not as good as strong
consistency. There's still some tricky things that we have to reason about, and specifically, we
have to reason about our concurrent writes. So these are called conflicts in causal consistency.
And these are writes that don't have any ordering between them. And they are writes to the
same data item. So this is what it looks like. We have two people at two different data centers
writing to the same data at the same time. So what happens here? Well, in traditional causal
consistency, in the formalization of traditional causal consistency at least, the operations are
propagated out to the other replicas, and whatever arrives later at a location overwrites what
existed earlier at that location.
So this means that on the East Coast and the West Coast and our other data centers, we might
end up having divergent values. We might have different values for the same data. And since
this is not what we want, we want everyone to see the same data everywhere, so how do we
handle this? We do one of two things. One thing that we do, or one option, is to just arbitrarily
pick one of the two operations. And say this one happened later than the other. And so this is
called the last-writer-wins rule, or Thomas's write rule. And this is the default thing that we
do in our system. So just choose one of the two and have it overwrite the other when they're
both present in a data center. We could also do something fancier. So we could take our two
update operations, one and two, and merge them together so we get three in each one of our
data centers. We can do this through a function that's commutative and associative that we
call an application specific conflict handler function.
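As a rough illustration of those two options, here is a small Python sketch; the version-tag layout (Lamport time plus node ID) and the handler signature are assumptions for illustration, not the system's actual interfaces.

```python
# Sketch of the two convergent conflict-handling options described above.
# The version-tag format and handler signature are assumptions for illustration.

def last_writer_wins(local, remote):
    """Last-writer-wins / Thomas's write rule: keep the write with the higher
    (lamport_time, node_id) tag."""
    return local if local["version"] >= remote["version"] else remote

def merge_with_handler(local, remote, handler):
    """Apply an application-specific commutative, associative merge handler."""
    merged_value = handler(local["value"], remote["value"])
    return {"value": merged_value, "version": max(local["version"], remote["version"])}

# Example handler: merge two concurrently edited friend sets by set union.
concurrent_a = {"value": {"ada", "alonzo"}, "version": (41, "east-3")}
concurrent_b = {"value": {"ada", "grace"},  "version": (40, "west-7")}

print(last_writer_wins(concurrent_a, concurrent_b))           # picks east-3's write
print(merge_with_handler(concurrent_a, concurrent_b,
                         lambda x, y: x | y))                  # {'ada', 'alonzo', 'grace'}
```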
So one of the things that we did in this work was that we named and formally defined this
combination of causal consistency and convergent conflict handling as causal plus consistency.
So if you look at previous causal systems from the distributed systems community, these
systems provided causal consistency, but they actually also provided causal plus consistency. This is,
intuitively, what we want our system to actually provide. So our contribution there was
naming and defining it. So if we take a look at these previous causal systems, they're based on
a similar idea. And that's this idea. Let's exchange logs of operations. So this is what it looks
like. So in our local data center we are going to take operations and we’re going to log them in
the order they occurred. Now we’re going to take this log of operations and ship it off to a
remote data center. Now in this remote data center we’re going to play back those operations
in that exact same order. So this log is functioning as a single serialization point.
So this is good and it’s going to implicitly capture and enforce our causal order. If A happens
before B, it will be logged before B, then when we ship this across the remote data center, it
will be played back before B. So that's great because that gives us causal consistency. But it
has one of two problems. So one, if we do this on a per shard basis, so remember our data is
spread out across a huge number of machines, if we just do this on sort of each shard we
don't have any causality across the shards and we are actually not providing causal consistency.
Or, if we do this data center wide, this is not going to be scalable. Let's take all the operations
in this data center and log them to one place. That's a big bottleneck. But an even bigger
bottleneck is when we ship this log off to another data center and try to play this back in that
exact same order across this big distributed system. So that scales even worse.
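For contrast, here is a toy Python sketch of that log-exchange approach, the prior technique being described above, with a single per-data-center log acting as the serialization point; all class and method names are illustrative, not from any real system.

```python
# Toy sketch of the log-exchange approach described above (the prior,
# non-scalable technique): a single per-data-center log is shipped and
# replayed in order at the remote site. All names are illustrative.

class LogExchangeDatacenter:
    def __init__(self):
        self.store = {}   # key -> value
        self.log = []     # every local write, in the order it was accepted

    def write(self, key, value):
        self.store[key] = value
        self.log.append((key, value))   # single serialization point

    def ship_log(self):
        batch, self.log = self.log, []
        return batch

    def replay(self, batch):
        # Replaying in the exact original order implicitly preserves causality,
        # but funneling every write through one log is the scalability bottleneck.
        for key, value in batch:
            self.store[key] = value

east, west = LogExchangeDatacenter(), LogExchangeDatacenter()
east.write("friends:alan", ["ada"])
east.write("wall:alan", "looking for a new job")
west.replay(east.ship_log())
print(west.store)
```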
So let's review the challenges that we've seen with consistency. We saw that the strongest
forms of consistency are theoretically incompatible with the ALPS properties. We saw that
eventual consistency, which is what people were settling for with these ALPS systems, doesn't
really provide us with any consistency. And then we saw this technique of log exchange, which
provides causal consistency, but is not scalable. So the contribution of our work is that we are
the first scalable system that's able to provide this causal consistency.
>>: So is Amazon’s entire e-commerce infrastructure built on eventual consistency?
>> Wyatt Lloyd: Their entire e-commerce, so I don't have sort of a full vision into what Amazon
does, so they have Dynamo there which is eventually consistent. I’ve sort of only heard rumors
of how much Dynamo is used internally versus not used, and my understanding it is not used
exclusively and it’s not used sort of ubiquitously. I don't know if anyone else here has more
insight than that.
>>: That's what we heard too.
>> Wyatt Lloyd: Yeah. Not used that much.
>>: Because the consistency[inaudible].
>> Wyatt Lloyd: Yeah, cool. Confirmation. Amar.
>> Amar Phanishayee: Just a quick question. So is it not scalable?
>> Wyatt Lloyd: Yes.
>> Amar Phanishayee: One of the assumptions that you make in your work is the user data
[inaudible]?
>> Wyatt Lloyd: No. User data is not owned by any data centers.
>>: [inaudible] if you wanted to write to [inaudible]? I mean the one post [inaudible] one data
center [inaudible].
>> Wyatt Lloyd: So the data is not owned by any data center, it’s that users are tied to a data
center for sort of a temporal period of time. So we assume that a user continues to use the
same data center, but I can be a user in data center East Coast and you can be on the West
Coast and we can be reading and writing the exact same data, and you can be updating data
that's stored under my key and things like this. But it’s for the sort of causality tracking and
enforcing it, which I'll get into soon, that we assume users are tied to a particular data center.
We do have ways to move users between data centers, but this happens, it’s sort of, you know,
when you're going to log out, when you're on the plane moving across the country, that kind of
thing.
>>: Actually, my question is within the data center. You have all these machines where data is
stored, systems like Isis or Bayou or whatever, right?
>> Wyatt Lloyd: Yeah.
>>: [inaudible] some sort of causal consistency [inaudible]. [inaudible] systems today data
items across servers. How much tracking needs to be done [inaudible]?
>> Wyatt Lloyd: So that’s like super workload dependent. So definitely some, but yeah, sort of
how much data is related across all of the different servers is highly dependent on workloads
and sort of slightly difficult to quantify.
>>: So something, the application has to decide how much data-
>> Wyatt Lloyd: Absolutely.
>>: [inaudible] previous techniques would scale or not. Is that right?
>> Wyatt Lloyd: So previous techniques are designed with the idea, which was correct and sort
of accurate in the 1990s, that your data fits on a single machine. And sort of the ways that you
would take this and generalize it out to a system where the data spread across different
machines, you know, we can guess and there’s sort of different ways to do this that are better
or worse, but I suspect that if you take this sort of all the way to its logical conclusion, if you
transform these to a fully distributed scalable system you'll end up with something that's very
similar to what I'm about to describe. Okay.
So, yeah. So our big contribution is this scalable causally consistent system. So how do we
provide this? What's our key idea for enabling this? Well, it's two parts. The first part is we are
going to capture causality with explicit dependency metadata. So we’re going to say that this
purple operation is after that blue operation. And then the second part is that we’re going to
enforce this ordering with distributed verifications. So we’re not going to expose a replicated
put operation, a replicated write operation, until we know that the causally previous operations
have been applied in this data center. So here's what this looks like: in the local data center we
have the data spread across these many shards, they're replicating this data out to the remote
data center in this scalable way that's distributed, and then the remote data center, these
nodes are doing a small number of checks with one another to make sure we are applying these
operations in that correct causal order.
So let's dive into the architecture of our system. Here we have a local data center with the data
spread across many nodes, sharded across many nodes there, and then you have two remote
data centers out in the distance. Inside the data center, this is our client. As I mentioned
earlier, the front end web server is our direct client, the one that we can easily reason about.
And this is what's building pages on behalf of meta-clients, actual users out there in the world.
So what we're going to do with this client is we’re going to keep all of its operations local. And
this is how I provide that always on experience, that available low latency partition tolerant
experience. And this is always on, so we’re never going out in the white area where we can hit
the partitions, and we’re never going out in the white area where things take a long time,
where things are slow. We’re going to keep everything local.
So if we take a look at our client a little bit more in depth, we have this client library that sits on
the client. And the client library does two things for us. The first is it mediates access to nodes
in the local cluster. So it knows where data sits on these different nodes, so we can send
request directly to them, and second and far more important, is it's going to track and attach
causal dependencies to our operations. So let's see how it actually does this. So here's a read
operation. So the web server sends a read operation to the client library, and the client
library’s going to send this out to the node in the local data center that's responsible for the
data we are interested in. That node will respond immediately. Excuse me, sorry. Okay. So it
will respond immediately and will update some metadata in that client library. The client library
will then return to the web server. Okay. Great.
So what about a write operation? This is a little bit trickier. So write operation comes into the
client library. We’re going to transform this into a write after operation. So this is a write
operation that also includes that ordering metadata. This operation is after these other
operations. We're going to apply this at the local data center immediately and queue it up for
replication in the background. Now we can apply it immediately because this client has been
interacting with this data center. So we know that everything the client previously read
was from this data center, and everything the client previously wrote was written into this data
center. So we
know that all causally previous operations have been applied here, so we can apply it right
away.
So we apply it right away, we return to the client library immediately, again, update some
dependency metadata there and return to the web server. Now in the background, we’re going to
replicate these write after operations out to the nodes in the other data centers that are
responsible for this particular data. So let's zoom in on one of our remote data centers. Okay,
question.
>>: So locally writes are synchronized?
>> Wyatt Lloyd: Local writes are synchronous. Yeah. So we assume linearizability or sort of we
build on top of systems that provide linearizability inside the local data center. Yeah. Or we
assume partitions don't occur. Okay. So we replicate these out to the nodes in the other data
centers responsible for this data. So let's see what happens in these remote data centers there.
So we zoom in. We get this replicated write after operation coming in and it has these attached
dependencies. So if we take a look at these dependencies, we’ll see that there's two parts. The
first part is this locator key. This lets us know, who do I ask to find out whether this causally
previous operation has been applied yet or not? So here it's L. So we’ll ask the node that's
responsible for storing key L.
And then the second part is this unique timestamp. And this uniquely identifies the operation
in the system so when we ask node L, it knows exactly what operation we are talking about. So
this timestamp comes from the accepting data center. The node that accepts that write assigns
each write operation this unique timestamp; it's based on the logical Lamport clocks that we
have in the system, and we append the unique node ID to ensure that these are globally
unique. Okay. So we can send out these dependency check operations that ask the other
node, have you applied this particular operation? And that node will check. If it has, it will
return right away, if it hasn't, it will block until it does. Okay, question?
>>: Is the timestamp globally unique or is the timestamp plus the [inaudible]?
>> Wyatt Lloyd: Yeah. So the timestamp is those things concatenated together. So you need
the global unique node ID at the end to ensure uniqueness. Yeah. Okay. So we send out these
dependency checks. We send them out in parallel for each one of our dependencies. And then
those nodes respond once those operations have been applied. So once we get all these
responses back, we know we can safely apply this write operation because all the causally
previous write operations have been applied in this data center. Yeah.
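Here is a rough Python sketch of that remote-data-center logic: a replicated write after operation carries (locator key, unique timestamp) dependencies, and it is applied only after the shards owning those keys confirm the dependencies have been applied. The threading and key-lookup details are simplified assumptions, not the actual implementation.

```python
import threading
from collections import defaultdict

class Shard:
    """One storage node in a remote data center (illustrative sketch only)."""
    def __init__(self):
        self.store = {}                       # key -> (value, unique_ts)
        self.applied = set()                  # unique timestamps applied at this node
        self.cv = threading.Condition()

    def dep_check(self, unique_ts):
        # Block until the causally previous write identified by unique_ts is applied here.
        with self.cv:
            while unique_ts not in self.applied:
                self.cv.wait()

    def apply_replicated(self, key, value, unique_ts, deps, shard_for):
        # deps: list of (locator_key, unique_ts); check them all in parallel.
        checks = [threading.Thread(target=shard_for(k).dep_check, args=(ts,))
                  for k, ts in deps]
        for t in checks: t.start()
        for t in checks: t.join()
        self.store[key] = (value, unique_ts)   # safe: all causally previous writes are in
        with self.cv:
            self.applied.add(unique_ts)
            self.cv.notify_all()

# Example: keys starting with "L" and "T" live on different shards.
shards = defaultdict(Shard)
shard_for = lambda key: shards[key[0]]

# The causally earlier write arrives and is applied with no dependencies...
shard_for("Lovelace:friends").apply_replicated(
    "Lovelace:friends", ["turing"], (7, "east-1"), [], shard_for)
# ...so the later write's dependency check on (7, "east-1") returns right away.
shard_for("Turing:wall").apply_replicated(
    "Turing:wall", "looking for a new job", (8, "east-1"),
    [("Lovelace:friends", (7, "east-1"))], shard_for)
print(shard_for("Turing:wall").store)
```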
>>: [inaudible] it seems like [inaudible] are implicit rather than explicit. If I remove my boss
from my friends list and then make a post, the act of making a post doesn’t explicitly depend on
my friends list, it’s just that I have to know in my brain that I already removed my boss from my
friends list?
>> Wyatt Lloyd: Right. So this is a very interesting point. So what we are capturing in our
system is a pessimistic notion of potential causality. We're not capturing sort of actual
causality, because as you notice, that's in your mind. We don't actually know what's going on.
So anything that could potentially cause it is what we are capturing. So just because you read
something earlier and then you write something after you've read that, we're going to ensure
that it shows up later even though we don't know if it really is causally related or not, we’re
going to capture every possible scenario just like that. Yeah.
Okay. So to summarize the basic architecture of our system, we’re going to keep all of our
operations local inside this local data center and then do our replication in the background.
And that's what allows us to provide that always on experience. Then we’re going to shard our
data across many nodes inside each one of these data centers, that's what allows us to provide
scalability, and then we’re going to control how we do replication using dependencies and
dependency checks so we can provide causal consistency in this scalable way across that
sharded base. Yeah. Question.
>>: [inaudible] application and maybe you’ll get to it, but [inaudible]. If you don't have latency
or low latency [inaudible]?
>> Wyatt Lloyd: Yes.
>>: My understanding in this, basically you are basically in causality, the write is also
dependent on read, right? Basically, let's say, basically I’m doing a Facebook post. Before I
post, I see two other posts.
>> Wyatt Lloyd: Yes.
>>: And you assume when this gets replicated and when the other person, for example, see
these posts, the order they appears is that you know, basically, your post, see the early, let’s
say post from [inaudible]. So my question is this: imagine A is posting to basically one addition
request after that. So basically you have your own post of A, you write a new, basically your
post, and then you write another post after you do this. Is this causality preserved that you can
always guarantee the other person's seeing your post before A’s new post?
>> Wyatt Lloyd: So causality exists only in the forward direction. So we would ensure that your
post appeared after the first post that you read. We would not ensure that some second post
that that person did appeared after yours. We would only do that if that person read your post
and then posted in response to you.
>>: Your basically causality does not guarantee this basically causality on the basically write
order? So say after your post, say, oh I don't agree with you, but this post potentially can move
up in the order [inaudible].
>> Wyatt Lloyd: So if the causality exists, we will enforce the order. So if there is actual causal
dependence, so the person saw my thing and said I don’t agree with you, then we would
enforce that order. If two people are commenting on the same post concurrently, we won’t
enforce any ordering there. So sort of any causality that exists, we will enforce it, but any
causality that doesn’t exist, we won't enforce it because then we gain greater parallel time and
we can be more efficient.
>>: The reason that I asked is basically, they are [inaudible] consistency basically. I think this
may cause some problems because you are reading some values, but this value may be
basically rewritten by the time your write basically arrives.
>> Wyatt Lloyd: Yeah.
>>: [inaudible] basically.
>> Wyatt Lloyd: Yeah. So the way to think about things in our system is everything is operation
based. And when I move on in the future and I start talking about limited transactions, these
will only be read-only or write only transactions. So that's sort of in a single atomic block,
you're only either reading data or you’re writing data. So there's no sort of assurance that you
read this data and then you wrote something based on it, and then when it's replicated across,
the data that you read still exists there. It could have been updated to something new
concurrently while you were doing your writes as well. Yeah.
>>: So I'm trying to understand your sense of causal consistency.
>> Wyatt Lloyd: Yes.
>>: If I post on someone's, if I respond to a post with a comment saying: you're an expletive
one and that same person replies back, well, yeah, you're an expletive two. Under your
definition, that you're an expletive two might occur before you’re an expletive one.
>> Wyatt Lloyd: So no. Because that person was responding, they must've read your post. And
because they read your post, they’re causally after your post. So with one and two,
because two saw one and was responding to it, two will show up after one.
>>: Okay.
>> Wyatt Lloyd: Yeah. So anything that sort of read or written in the system that actually
happens that is seen by the system, we enforce that order. Yeah.
>>: [inaudible] consistency exactly [inaudible] causal ordering-
>> Wyatt Lloyd: Absolutely.
>>: [inaudible] paper in 1970 [inaudible].
>> Wyatt Lloyd: 78. Exactly. Yeah. So causal consistency is exactly equivalent to enforcing the
happens before relationship. Yeah, exactly the same. Okay. Cool. So in the current system I've
described, there's this big challenge. And that's we’re going to have lots of dependencies. So
why are dependencies bad? Well, we have metadata associated with them. So we are wasting
space in the system, and we have to do these dependency checks. So it’s taking away throughput
from user facing operations. So we don't want to have lots of dependencies. And we have lots
of them because dependencies grow with client lifetime.
So here's an example. Let's say a user does a write operation and they do another write
operation and another write operation. Then they read some value and they read some other
value, well, their next write operation has a ton of dependencies. It depends on all of its previous
write operations because of that thread-of-execution rule. It depends on all of the operations
that wrote values that it read because of the reads-from rule, and then it depends on all the
operations that those operations depend on because of the transitive closure rule. So as you
can see, this can very quickly get out of hand, but luckily we can use the concept of nearest
dependencies to dramatically reduce what we have to track in the system.
So nearest dependencies are what I've marked in green here on this graph, and they're going to
be the nodes that transitively capture all of our ordering constraints. In graph theoretic terms,
the green nearest dependencies are the nodes that have a longest path of length one to our
current operation. Because they transitively capture all of our ordering constraints, we know
that if we are after the green nearest dependencies, we’re going to be after all the blue
dependencies as well. So this is great because this is all we really need to track in our system.
One really nice thing about these nearest dependencies is even if I have this huge graph of
causality, I still have a small number of nearest dependencies. And as I'll show you near the
very end of the talk, this allows us to have an implementation that's quite efficient.
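Purely to make that graph-theoretic definition concrete, here is a small Python sketch computing nearest dependencies from an explicit causal DAG; as the next point notes, the real system deliberately avoids ever materializing this graph, so this is illustration only.

```python
# Illustrative sketch only: computing "nearest" dependencies from an explicit
# causal DAG (nodes whose longest path to the current operation has length 1).

def longest_path_len(edges, src, dst):
    """Length of the longest path src -> dst in a DAG given as {node: [successors]}."""
    if src == dst:
        return 0
    best = -1
    for nxt in edges.get(src, []):
        sub = longest_path_len(edges, nxt, dst)
        if sub >= 0:
            best = max(best, sub + 1)
    return best

def nearest_dependencies(edges, current):
    return {n for n in edges
            if n != current and longest_path_len(edges, n, current) == 1}

# Tiny example: w1 -> w2 -> w3, and also w1 -> w3 directly.
edges = {"w1": ["w2", "w3"], "w2": ["w3"], "w3": []}
print(nearest_dependencies(edges, "w3"))   # {'w2'}: w1's longest path to w3 has length 2
```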
So there is one drawback of nearest dependencies and it's this: to actually figure out what the
nearest dependencies are we need a cut of this causal graph. We need a large subgraph around
sort of each one of these operations that we’re looking at to be able to calculate what is
actually a nearest dependency and what is not. So instead of tracking this optimal set of
nearest dependencies, instead we track what we call one hop dependencies. So these are what
I've marked in orange here. So you notice that the middle node in this graph is orange, but it
was not green. So it's a one hop dependency, but it's not a nearest. These are the nodes that
have a shortest path of length one to the current operation in the system. They are a superset of
nearest dependencies, so they're going to be sufficient for providing causal consistency, but
they're a lot easier to track. In particular, to track them all we need to do is track one, the last
write operation that we did, and two, all of the reads that happened after that. So now from
our thread-of-execution rule, we only have one dependency on that last write that we did.
From our reads-from rule, we now only have a limited number of dependencies on all the
operations that wrote values that we read after that last write. And then finally, we don't have
any dependencies from the transitive closure. None of them are there. Question.
>>: [inaudible]?
>> Wyatt Lloyd: So is this like tracking-
>>: [inaudible] reach parts of the global [inaudible], so there's a global vector [inaudible] parts
of it?
>> Wyatt Lloyd: Yeah. So vector clocks are sort of a very difficult issue to address in that the
sort of depending on what you do with them, we have this like big graph going on in the system
and vector clocks sort of compress things based on a single node and have a certain number of
checks going in between them. What we do is we have a much larger amount of parallelism
that's going on in the system, and we ensure that all operations can be accepted locally right
away. So it's sort of the thing that I would do if I was just using regular vector clocks to build
one of these systems is I would end up sort of, if I'm waiting for an operation to show up from
over here, to show up in this data center, I wouldn't be able to accept a new operation coming
into this node until that operation is applied. So I’d have to block this write operation. And I don't
want to do that, I want everything to be low latency. Slightly unsatisfying because exactly how
you would build vector clocks in the system is unclear. But it's related at a high level and we’re
sort of doing all the specific things that you need to make it efficient in this scalable
implementation.
>>: An interesting point [inaudible] value [inaudible] vector clocks; in theory it’s a huge vector
clock. [inaudible] so that would be [inaudible] dependencies. But the problem is that
[inaudible] vector [inaudible]. Maybe I’m wrong.
>> Wyatt Lloyd: If you had a vector clock with an entry per data location, I would say it would
be equivalent. But that would be, yes. That would be ungodly large. But yes. I would say that
would be equivalent.
>>: [inaudible].
>> Wyatt Lloyd: So this is, so you could think of what we're doing as sort of a clever way to take
this ungodly large vector clock and reduce it down to a minimal set that's easy for us to track
and that allows us to enforce these things in a causally consistent way. Yeah. Okay, cool.
So what do these one hop dependencies actually buy us, or what do they give us? Well,
checking them is sufficient for causal consistency because they are a superset of the nearest
dependencies. And there’s still going to be few enough of them that we’re going to be
competitive with eventually consistent systems, which I'll show you near the end of the talk. We
never have to store any dependencies on the server, and this is because we are not storing, we
don't need any of those transitive closure dependencies, and because we're not calculating the
nearest dependencies. So we never need any dependencies on servers. And it really simplifies
our client-side dependency tracking. All we need to do is every time we have a write operation,
attach all of our current dependencies to that write operation, clear it out entirely, and add a
new dependency on that write. So it's very simple.
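A minimal Python sketch of that client-library tracking rule, under simplified assumptions about the local store interface and timestamps (the class and method names here are made up for illustration):

```python
import itertools

_lamport = itertools.count(1)        # stand-in for per-node Lamport clocks

class LocalShard:
    """Stand-in for one storage node in the client's local data center."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.store = {}

    def get(self, key):
        return self.store[key]        # (value, unique_ts)

    def put_after(self, key, value, deps):
        unique_ts = (next(_lamport), self.node_id)   # Lamport time + node ID
        self.store[key] = (value, unique_ts)
        # 'deps' would travel with this write when it is replicated in the background.
        return unique_ts

class ClientLibrary:
    """Tracks one-hop dependencies: the last write, plus reads since that write."""
    def __init__(self, cluster):
        self.cluster = cluster         # key -> local shard responsible for it
        self.deps = set()              # {(locator_key, unique_ts)}

    def read(self, key):
        value, unique_ts = self.cluster[key].get(key)
        self.deps.add((key, unique_ts))          # reads-from rule
        return value

    def write(self, key, value):
        # Attach all current dependencies, clear them, depend only on this write.
        unique_ts = self.cluster[key].put_after(key, value, deps=list(self.deps))
        self.deps = {(key, unique_ts)}           # thread-of-execution rule, collapsed
        return unique_ts

shard = LocalShard("east-1")
client = ClientLibrary({"alan:friends": shard, "alan:wall": shard})
client.write("alan:friends", ["ada"])                 # no dependencies yet
client.write("alan:wall", "looking for a new job")    # depends only on the friends write
```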
So before I move onto the second half of the talk, I want to summarize how to provide causal
consistency in this scalable way. So we have this large number of clients that are concurrently
accessing the system sending read and write operations to their local replica of the data store.
These operations are serviced right away, so reads return right away and writes are applied
right away. And then responses are sent back immediately to the clients. We’re going to
update metadata on each one of these web servers so we can continue to track causality. And
then in the background, we’re going to replicate data out to the remote data centers, to the
nodes in the remote data centers that own the data we've just written. And these remote data
centers will do these explicit dependency checks and will only apply the values once those
dependency checks have been returned. And they're going to exploit the transitivity in that
graph of causality, so we can do this in an efficient way. Yeah.
>>: [inaudible] tiers, so I as a user always contact the same [inaudible] not just the same data
center?
>> Wyatt Lloyd: So we, in the current design of the system, we have users sticking to a front
end web server. And one of the questions that I'll often get is sort of if you're going to design
the system over from scratch, what one thing would you change? This is the thing that I would
change. I would not have the client library actually resident on one of these web servers, so
that I would have the client library resident on one of these machines; and we would connect
users to a particular server machine based on their cookie so that we could sort of remove all
state from these web servers. But right now on the current design they are sticking to web
servers.
>>: And what happens [inaudible] front end [inaudible]?
>> Wyatt Lloyd: So that's why I would move them.
>>: Moving them affects it because the failures can also then happen at the storage tier.
>> Wyatt Lloyd: Right. So one of the things that we're doing in the storage tier that we're not
doing in the web tier is that each one of these servers that I'm representing here is actually a
logical server. So there's an abstraction of something that does not have failures. And what's
going on underneath that is we have replicated state machines. So that's like a chain
replication group or a Paxos group going on underneath that's providing us fault
tolerance. And so because we already have that abstraction there, that's why we put the client
library there. These front end web servers, we don't want to do that. We don't want to do
any replication to provide fault tolerance.
>>: And it’s easy to provide that abstraction? Because I mean in the end-
>> Wyatt Lloyd: Is it easy? I mean, so there's lots of ongoing research into this which is sort of
one of the reasons why I wanted to have that abstraction because there's lots of exciting things
going on from, some from here, some from CMU and at different places. They're going on at
sort of this lower level for providing efficient replicated state machines. Lots of questions. Yes.
>>: So when you're applying [inaudible] data center, why [inaudible] another [inaudible] comes
from the data center, you just apply the [inaudible]? How do you order that?
>> Wyatt Lloyd: So how do I order it?
>>: Yeah.
>> Wyatt Lloyd: So when I get something over here, I have these dependencies attached to it
and I check the dependencies. Only once they return do I apply this operation.
>>: But then that means they [inaudible]?
>> Wyatt Lloyd: So, okay. Yeah. So the writes that return to clients, so in sort of the lifecycle of
a write there's the client facing part of the write which happens only in the local data center. So the
write comes in, it’s applied right away, and returned to the client. And the client is now done.
My write’s been applied. There's also a part that's in the background is not user facing. That's
when we replicate things out to the other data centers and then do this check. So things for
users, all user facing operations are very fast. They have low latency. But what sort of can have
longer latency, which does have longer latency, is this sort of latency of visibility. How long from
when I write something over here until it shows up over there? They will delay it until they can
show it in the correct consistent order.
>>: When do the conflicts [inaudible]?
>> Wyatt Lloyd: So the conflicts would surface when sort of once both conflicting operations
are both resident on a single server. So once this server gets an operation from the remote
data center, if it also had a concurrent write, then we could recognize it there. If we’re doing a
last writer wins rule we can actually do this obliviously. We just apply things in the order given
by the Lamport timestamps, and that automatically enforces the last writer wins rule. But
if you're trying to detect those, if you’ve told me that you want to detect those, then we’ll
detect those every time a write comes in, and we check against the current write to see if these
are causally ordered and there’s not a conflict, whether they're concurrent and there is a
conflict.
>>: So I'm curious what should the [inaudible] operator [inaudible], this write and later on the
[inaudible] potentially?
>> Wyatt Lloyd: It's not that the write is dumped in favor of another write. So like once you get
a write, we’re going to apply that elsewhere unless this data center blows up. But once you get
a write, we're going to apply that elsewhere. But you're not guaranteed that no one else is
also allowed to write that data. Yeah.
>>: Quick question. Here when the [inaudible] front end, reading [inaudible] basically
back end-
>> Wyatt Lloyd: Yeah.
>>: For each read there didn’t need to be a basic reading to [inaudible] meaning for a
[inaudible]?
>> Wyatt Lloyd: So depending on how sort of a logical server abstraction is set up, and I would
set it up in a way that's sort of cheap for reads, expensive for writes, but sort of if you just set it
up with a normal Paxos group, you're going to do a normal Paxos read.
>>: [inaudible]?
>>: [inaudible] down there [inaudible].
>> Wyatt Lloyd: There's lots of awesome stuff going on so you don’t have to do expensive
things. Yeah.
>>: [inaudible] it’s a browser-
>> Wyatt Lloyd: So our client is a front end web server. So this is like a machine running your
PHP server, your Apache server, whatever. And then the meta-client is the browser. And for
when we get really deep, the meta-meta-client is the person using the browser, using the web
server, using our system.
>>: [inaudible] abstractions? What’s this global abstraction? Is it a single partition or-
>> Wyatt Lloyd: Oh, yeah. So each server is a replicated state machine. Each one individually is
a replicated state machine.
>>: [inaudible] by [inaudible] data center [inaudible] exploit that?
>> Wyatt Lloyd: Yes. It’s the one that's accepting that write.
>>: So what's the [inaudible]? [inaudible] these objects, bytes, shards, what’s that-
>> Wyatt Lloyd: So in the first, so on each server it’s a shard. Sort of the level of abstraction,
the data model we’re providing is going to be different in the two different systems which is on
my next slide. So that's an excellent, this is a good transition.
>>: Just one question. [inaudible] meticulous way if there's a write in one data center?
>> Wyatt Lloyd: Yes.
>>: An update is propagated from this data center [inaudible] and so if there's a partition now,
all the writes that are going to go there are going to succeed but they're going to be concurrent,
[inaudible] right?
>> Wyatt Lloyd: Yes.
>>: So partition tolerant [inaudible] write to actually show this concurrent write it would be
causally correct but if you look at the timeline actually [inaudible].
>> Wyatt Lloyd: So depending on how things happen on different sides of a partition with the
current logical clocks, strange things could happen. If you look at like the 1978 original paper,
there’s sort of ways to incorporate real-time clocks into this as well, so you have a notion of
real-time, so you don’t have some strange thing happening where you have one really active
data center that is going to overwrite data that was written 10 minutes later in a non-active
data center elsewhere. But, yeah. Okay.
Great. So let's take it up to a little bit higher level. We’re talking about these geo-replicated
storage systems and these properties we want them to have. We want them to be fast and
scalable. These are the ALPS properties. We want to be able to interact with this data in a coherent
way, in a way that makes sense. So I just showed you how I can provide causal consistency for
these scalable systems and that was the big contribution of our COPS paper that was at SOSP
2011. And after that work we continued to push on how can I interact with this data in a way
that makes sense? And so our Eiger paper, which was at NSDI just a few weeks ago, makes a
number of new contributions. So one, it's going to move us from the key value data model that
we had in COPS, this very simple data model which is appropriate as an opaque cache for my
web service but not as my primary data store. We’re going to move to a rich data model, so I
could actually build a rich web service on top of this, something like Facebook. We're going to
also have read only and write only transactions that allow us to consistently interact with data
that's spread across many servers in the system. Okay.
So let's dive into this rich data model, which is a column family data model. It’s this widely used
hierarchical structure. Here's what it looks like abstractly, and don't worry I'll fill that in so it
hurts your eyes less. So this data model was introduced by Google with their Bigtable system.
So you can see why it's called a big table. And then it was open sourced by Facebook with the
Cassandra system which is now widely used outside of Facebook. Okay. So this column family
data model, it's very useful for all of these services, it’s used for all this stuff, how can I actually
build something like Facebook on top of this, a social network? So we have keys on the left
side. These are going to correspond to users in our system. So we have one for Ada Lovelace,
one for Alan Turing, and then one for his advisor from when he was at Princeton, or his boss,
Alonzo Church. Along the top I have super columns which you could also think of as categories.
And then underneath each one of these I have columns.
So columns are individual data locations and we can store things like your age in one, your
town in another and so on. Underneath, okay that's profile, so what about underneath the
friends and status? Well, here I only have an entry per data that exists. So I'll have an entry
under Ada, that she is friends with Alan and so we have some metadata associated with it like
the date, but I won't have an entry that says she's not friends with her, herself, and she's not
friends with Alonzo and she's not friends with our other 1 billion social network users. So this is
a sparse table. It's a [inaudible] data model. The same thing for status updates. You only have
an entry for something that exists. And then under counters we have a special type of column.
So we have friends here, and this is a number, and it's special in that it's going to be
commutatively updated. So when I move to a new town, I'm overwriting this opaque value that
existed there. In a counter, I'm not overwriting the value that existed before; instead I’m adding
something to it. Every time I add a friend I'm adding one thing to this value. So while we have
one operation that created that town value that says that I live in London, I have 631
operations that commuted together to provide this one single value that I have 631 friends.
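A small Python sketch of that layout, keys mapping to super columns mapping to columns, with a commutatively updated counter column; the schema and field names are illustrative, not the actual Eiger or Cassandra schema.

```python
from collections import defaultdict

# key -> super column -> column -> value (sparse: only existing entries are stored)
table = defaultdict(lambda: defaultdict(dict))

# Regular column: a write overwrites the opaque value that was there before.
table["alan_turing"]["profile"]["town"] = "London"

# Sparse friends/status entries: only relationships that exist get a row.
table["ada_lovelace"]["friends"]["alan_turing"] = {"since": "1842-06-05"}

# Counter column: updates commute, so we add a delta instead of overwriting.
def increment(key, column, delta=1):
    counters = table[key]["counters"]
    counters[column] = counters.get(column, 0) + delta

increment("alan_turing", "friend_count")
increment("alan_turing", "friend_count")
print(table["alan_turing"]["counters"]["friend_count"])   # 2: the two updates commute
```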
So one of our challenges is how do we represent this in an efficient way in our new system? So
this column family data model, again, it's storing tons of data, so it's spread across lots of
servers in the system, and this all existed before, it was widely used, so what's our
contribution? Well, it's providing causal consistency for this data model. So now when Alan
removes his boss from his friends list and then posts that he’s looking for a new job, these
operations will show up in that order across this rich data model, across all of these servers. In
addition to this, we provide read-only transactions. So I can now consistently read data spread
across this table and across all of these servers. And then write only transactions that allow me
to atomically update data that's spread across many servers in a single data center.
So the way you could think about this is now I can consistently update symmetric relationships.
I can make Ada and Alonzo friends atomically at a single time. So they both become friends at
the same time, and we never have an inconsistency in this graph of actual friendships. So our
Eiger system, it provides these ALPS properties that are going to be fast and scalable, provides
this rich data model so we can build rich web services on top of this, it provides causal
consistency for that rich data model though I actually didn't go over how we did that, and then
we also provide these read only and write only transactions, each of which I want to go over
briefly.
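The mechanics of the write-only transactions aren't covered in this talk, so the sketch below only illustrates the symmetric-friendship use case above at the API level; the TinyStore class and atomic_mutate name are stand-ins, and a single-process lock is obviously not how atomicity is achieved across many servers.

```python
import threading

class TinyStore:
    """Single-process stand-in; a real write-only transaction spans many servers."""
    def __init__(self):
        self.table = {}
        self._lock = threading.Lock()

    def atomic_mutate(self, ops):
        # All-or-nothing application of a batch of column writes (sketch only).
        with self._lock:
            for key, super_col, col, value in ops:
                self.table.setdefault(key, {}).setdefault(super_col, {})[col] = value

def become_friends(store, a, b, date):
    # Symmetric relationship updated atomically: both edges appear together,
    # so readers never see a state where only one side of the friendship exists.
    store.atomic_mutate([
        (a, "friends", b, {"since": date}),
        (b, "friends", a, {"since": date}),
    ])

store = TinyStore()
become_friends(store, "ada_lovelace", "alonzo_church", "2013-04-01")
print(store.table["ada_lovelace"]["friends"])
```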
So first, why do I even need read-only transactions? Why aren’t reads enough? At a high-level,
it's because we're sending asynchronous requests to distributed data. Here's what this looks
like. So we have Alan Turing’s operations where he removes his boss from his friends list and
then he posts that he’s looking for a new job. And here we have a data center where neither of
these operations has shown up yet. So this could be his local data center where he’s issued
these operations but they haven't been applied yet, or it could be a remote data center where they
haven't been replicated out to yet. We're going to track his operations with this purple bar
here. And here's the client of his boss, of Alonzo Church, this web server that's going to be
building his page for him. So this web server can go out before either of these operations have
been applied and request the friends list, which still includes his boss. Now Alan Turing’s
operations can appear and be applied in the correct causal order, and now our data store has
always been consistent; but now the web server goes out and gets this status update, and now
it's going to return inconsistent data back to the user even though it was reading an always
consistent data store. So instead what we need are read only transactions. Read-only
transactions give us a consistent and up-to-date view of data that's spread across many servers
in the system. The way to think about it is this: we have these data items, we have logical time
moving in that direction, and each one of these data items moves through progressively newer versions in logical time. What read-only transactions are going to do is return a view of the data store from one particular logical time. So we are either going to return when they're friends and there's the old status, or when they're not friends and the old status, or when they're not friends and there's the new status. We're never going to return
that inconsistent result where they are friends and there's the new status.
So what are some of the challenges that we overcame for these read only transactions? Well,
one was scalability. Traditional solutions for transactions have something like a transaction
manager which is centralized, but we want to be able to continue to scale the number of nodes
that we have in our system, so we achieve scalability through a fully decentralized algorithm. We also want to guarantee low latency, and we do this with an algorithm that's going to
take at most two parallel rounds of local reads inside the data center and then avoids all locks
and blocking. So all responses are sent back immediately. And then finally, we want to provide
high-performance. So our tagline for this work is: don't settle for eventual consistency because
we want performance that’s going to be competitive with eventual. So we achieve that with an
algorithm that's going to take one round of reads in the normal case.
So how do the read-only transactions actually work? Well, at a high-level it's this: we're going
to start out with a first round of optimistic parallel reads. So we’re going to ask the servers that
have the data, what’s your current value? And they're going to return that value along with
some validity metadata. So when in logical time are these values valid? Once we get all of our
responses back we’re going to calculate an effective time for the read-only transaction. This is
a single logical time at which the read-only transaction takes place. And we're going to calculate
this in a way that we are always going to ensure progress in the system. We’re always moving
forward, we are never stuck on an old consistent cut. Once we have this effective time, we'll
check our first-round results. Are they all valid at that one particular effective time? If they
are, we'll return and most of the time that will be the case. In fact, the only time it won't be the
case is if there is concurrent writes to the data that we are reading. If that happens, we might
have to reread some data in a second round, and we’ll do that with a second round of parallel
read-at-time operations. These are special read operations that include that effective time and ask the servers: return this data at this particular logical time that we chose. The servers
can return that data and then we can return this consistent cut of the data from this effective
time that we chose right away to the client.
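A minimal sketch of that two-round protocol, assuming each stored version carries a validity interval in logical time: the class and method names here are made up, reads are shown sequentially rather than in parallel, and the effective-time rule is simplified.

```java
// Sketch of the two-round read-only transaction under the assumptions above.
import java.util.HashMap;
import java.util.Map;

class VersionedValue {
    final byte[] value;
    final long validFrom;   // logical time at which this version was written
    final long validTo;     // Long.MAX_VALUE if it is still the newest version
    VersionedValue(byte[] value, long validFrom, long validTo) {
        this.value = value; this.validFrom = validFrom; this.validTo = validTo;
    }
}

interface StorageServer {
    VersionedValue readLatest(String key);            // round 1: optimistic read
    VersionedValue readAtTime(String key, long time); // round 2: read at the effective time
}

class ReadOnlyTxn {
    static Map<String, byte[]> run(Map<String, StorageServer> serverForKey) {
        // Round 1: optimistic reads of the current values plus validity metadata.
        Map<String, VersionedValue> first = new HashMap<>();
        for (Map.Entry<String, StorageServer> e : serverForKey.entrySet())
            first.put(e.getKey(), e.getValue().readLatest(e.getKey()));

        // Effective time: chosen so the transaction always makes progress; here,
        // the newest time at which any returned version became valid.
        long effective = first.values().stream()
                .mapToLong(v -> v.validFrom).max().orElse(0L);

        // Check the first-round results; only keys whose version is not valid at the
        // effective time need a second-round read-at-time, which can only happen when
        // there were concurrent writes to the data being read.
        Map<String, byte[]> result = new HashMap<>();
        for (Map.Entry<String, VersionedValue> e : first.entrySet()) {
            VersionedValue v = e.getValue();
            if (v.validFrom <= effective && effective < v.validTo) {
                result.put(e.getKey(), v.value);
            } else {
                result.put(e.getKey(),
                        serverForKey.get(e.getKey()).readAtTime(e.getKey(), effective).value);
            }
        }
        return result;
    }
}
```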
Now to support these read-at-time operations, the servers have to store a number of old values, but we can limit that significantly because we are always making progress in the system. We are always moving forward, so we never have to store old versions forever. So our Eiger system provides the ALPS properties, this rich data model, causal consistency for all of this, and then read-only transactions. Now I want to quickly get into write-only transactions. Write-only transactions allow us to atomically update data spread across many servers in a single data center. We're going to replicate these
out in the background in that causally consistent order, and the data will appear in the other
data centers atomically as well.
So what are some of the challenges that we overcame here? Well, again scalability: how do we do this in a way that's going to ensure scalability when our data is spread across many nodes? And again, we do this with a fully decentralized algorithm. How do we ensure low latency? For write-only transactions, we do it
with a limited number of local rounds of communication avoiding all locks and blocking again,
but really the tricky thing was how do we ensure low latency for concurrent read-only transactions that are reading the same data? And we do this by not blocking those concurrent read-only transactions, but instead indirecting them through one of the nodes
participating in the write only transaction to find out whether the write only transaction
happens before or after the effective time that we’ve chosen.
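Here is a very rough sketch of that indirection idea, inferred only from the high-level description above; all names are hypothetical, and the real protocol (the participant rounds, how pending writes and commit times are recorded) has more moving parts than shown.

```java
// Rough sketch: a read concurrent with a write-only transaction is not blocked;
// it is indirected through a participant of that write to learn its commit time.
import java.util.Optional;

class PendingWrite {
    final byte[] newValue;
    final String txnId;                 // identifies the write-only transaction
    PendingWrite(byte[] v, String id) { newValue = v; txnId = id; }
}

interface WriteParticipant {
    // One of the nodes participating in the write-only transaction; it knows
    // (or learns) the logical time at which that transaction commits.
    long commitTimeOf(String txnId);
}

class KeyState {
    byte[] committedValue;
    Optional<PendingWrite> pending = Optional.empty();

    // Called by the second-round read-at-time operation. Instead of blocking
    // until the pending write finishes, ask a participant whether the write
    // commits before or after the reader's chosen effective time.
    byte[] readAtTime(long effectiveTime, WriteParticipant participant) {
        if (pending.isEmpty())
            return committedValue;
        long writeCommitTime = participant.commitTimeOf(pending.get().txnId);
        return (writeCommitTime <= effectiveTime) ? pending.get().newValue
                                                  : committedValue;
    }
}
```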
So our Eiger system, it provides these ALPS properties: we’re going to be fast and scalable, the
rich data model, causal consistency so that operations appear in an order that makes sense to
users and programmers, and these limited forms of transactions, so I can consistently interact
with data spread across many servers, but what does all of this cost? Well to find out, I built
some prototypes. I built a prototype for the COPS system on top of FAWN-KV. This is about four and a half thousand lines of C++. And then for the more recent system, I took the open-source Cassandra system, forked it, and added about 5,000 lines of Java to it. So I'll show you
results from the more recent Eiger system.
So when we’re evaluating these systems, we wanted to answer two key questions. The first is
what is the cost of these stronger consistency, causal consistency and our stronger semantics,
these limited transactions compared to eventually consistent non-transactional Cassandra
system? So I’ll show you the overhead for real workload shared with us by our friend from
Facebook, and I’ll also show you the overhead for a large state space of different possible
workloads, not just the one that we really care about. And then second, the big contribution of
our work is being able to provide these properties in a scalable way for data that's spread
across a large number of nodes, so I'll show you empirically that our system scales to large
clusters. So here's our experimental setup. We're going to have a local data center that has N servers in it. We'll spread the data across these N servers. N will be eight in the first two experiments, and then it will grow to be much larger than this in the later experiments. We also have N client machines with many threads of execution. They're going to be fully loading these servers in the local data center. We also have a remote data center that we are replicating data out to. That will also have N machines in it.
So here are the results for that Facebook workload. This workload comes from Facebook's TAO system. It's an eventually consistent non-transactional geo-replicated production system
that they use at Facebook. So TAO is like that baseline I'm saying you don't have to settle for.
Along the y-axis, this is throughput for that eight server cluster; this is going to be in hundreds
of thousands of columns per second; so those are the individual data locations that I showed
you in that big table earlier. The Cassandra system has a throughput of about half a million of
these per second. Our Eiger system is very close. In fact, we see only about a 6 and a half
percent overhead for providing these much stronger properties. So I find this very encouraging.
So that's just one particular workload. What about for a large state space of possible
workloads?
Well to find out, we built a dynamic workload generator, and so this generator is going to have
seven parameters. These include things like: what is the size of values in the system? What do reads and writes look like in terms of how many keys they are spread across? How many columns within each key are they spread across? How often are we reading versus writing?
How often are those writes transactional? So in our system all reads are transactional but only
some writes are transactional. So what we did for each one of these parameters was we chose
a sensible default for each and then perturbed each one individually so we have seven graphs.
Here I'll show you three of the graphs. Excuse me. This should give you a good idea of what's
going on without being too overwhelming.
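For concreteness, a workload configuration along these lines might look like the following sketch; the talk names only some of the seven knobs, so the exact set and the default values shown here are illustrative rather than the ones used in the evaluation.

```java
// Hedged sketch of a workload-generator configuration; parameter names and
// defaults are illustrative, not the values from the paper.
class WorkloadConfig {
    int valueSizeBytes = 128;        // size of each column value
    int keysPerOperation = 5;        // how many keys a read or write is spread across
    int columnsPerKey = 5;           // how many columns under each key it touches
    double writeFraction = 0.1;      // how often we write versus read
    double writeTxnFraction = 0.5;   // fraction of writes issued as write-only transactions
    // ...the remaining parameters are omitted here.
}
```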
So each one of these graphs is normalized. So we’ll have the Cassandra system, that baseline
system, as a flat line on top. We want to be as close to that as possible. And so to show you
the results for value size, this is logarithmic. The results for write fraction, how often are we reading versus writing? We are reading more often at zero. The number of columns per read.
So sort of how many data locations are underneath a single key being updated or being read
from each read. Okay. So here are the results. The gray lines indicate our default values. And
as you can see, there are parts of the state space where our overhead is medium, about 30
percent here. This is where value sizes are tiny. That's because we have metadata in our
system, we have these dependencies. In the Eiger system, dependencies are 16 bytes. And so
when you compare this with a one byte or an eight byte value, most of our processing is not
going to the actual value itself. When we increase value size, however, to something that is still
not very large like 32 bytes, our overhead goes to what I would call low. This is true also for a
large part of the write fraction state space and sort of independent of the number of columns
within a read. So we have this very large area of the state space where our overhead is pretty low; we even have parts of the state space where our overhead is minimal. When we are only doing reads, or when values are large, like 512 bytes or larger, our overhead is very, very low. So this is very encouraging. So not just for that Facebook workload that
we really care about, but for this large state space of workloads our overhead is pretty
competitive as well. So, okay. Yeah.
>>: So you're varying one parameter here?
>> Wyatt Lloyd: Yes.
>>: So what were all the other parameters set at?
>> Wyatt Lloyd: Yeah. So they were set at their gray values. The default value. Yeah. Okay.
So here's the graph that I really like. It's my favorite graph. This shows how well our system
scales. So this is the money graph, right? Okay. So along the x-axis we’re going to double the
number of servers in each cluster, right? From one to two, four, eight, more servers in each cluster. And at the very extreme end of this graph, we have 128 servers in a cluster.
This is actually a pretty big experiment, for me at least as an academic, because we have 384
machines here. Two clusters of 128 servers and one cluster of 128 clients. Along the y-axis I have the normalized throughput. So I'm going to take the throughput of the cluster and divide it by the throughput of a single-server cluster. So ideally what's going to happen is as we double
the number of servers in each cluster, we’re going to double our throughput as well. So what
happens as I double the number of servers from 1 to 2, 2 to 4, 4 to 8? My throughput almost doubles each time. Here we see about a 72 percent increase for each
doubling of the number of servers. So this is good, but it's not ideal. Ideally I would have 100
percent increase for each doubling of servers. So what's going on here?
Well, most of our operations are spread across a large number of keys which are spread across
a large number of machines. In a single server cluster or in small clusters, many of these
operations are batched together into a single operation. So we get great benefits from
batching in this cluster and then a little bit less and a little bit less and so on. Okay? So what
happens is I continue to double the number of servers from 8 to 16 and all the way up to 128
servers in each cluster. Well, the effects of batching are decreased. I see about a doubling of
throughput for each one of these, and in fact, now I see a 96 percent increase every time I double
the number of servers. So this is super encouraging. What it says is we can scale out our
cluster. We can keep increasing our capacity and our throughput by adding more machines.
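Restating the metric behind this graph as a small sketch, with the numbers taken from the talk: normalized throughput divides a cluster's aggregate throughput by that of a single-server cluster, and ideal scaling doubles it with every doubling of servers.

```java
// Sketch only: the scaling metric described for the experiment above.
class ScalingMetric {
    static double normalized(double clusterThroughput, double singleServerThroughput) {
        return clusterThroughput / singleServerThroughput;
    }
    static double perDoublingGain(double throughputAt2N, double throughputAtN) {
        // Ideal: 2.0. The talk reports about 1.72 at small cluster sizes, where
        // cross-server batching is being lost, and about 1.96 from 8 up to 128 servers.
        return throughputAt2N / throughputAtN;
    }
}
```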
So to summarize this main part of the talk, I talked about geo-replicated storage that serves as the backend of large-scale websites, things like Facebook, Reddit, or Amazon. I
talked about the ALPS properties that precisely say we want to be fast and scalable. I talked
about causal consistency which we enforce in a scalable way through explicit dependencies and
distributed dependency checks. I showed you how we exploit transitivity to reduce the
overhead of our system. I talked about our stronger semantics, including this rich data model so
you can build services like Facebook on top of this. Read-only transactions, so I can consistently
read data spread across many machines. And write only transactions. So I can atomically
update data spread across many machines in a single data center. I showed you our system
was competitive with this eventually consistent baseline. And we can scale to many nodes.
I want to revisit the theme of my research which is that I like to look at these classic distributed
systems problems except through the lens of modern cloud or web services, and with that in
mind, I want to review some previous work and talk about a few future directions. So some of
my previous work was looking at fault tolerance for cloud services. So this included the TRODS system, which was at DSN in 2011, which looked at how I can recover connections to these
front end web servers with minimal overhead? So what's new in this setting? Well, one is we
can't modify these client machines. These are client browsers; I don't have any control over
them so I can't change them. The other new thing is these front end web servers are actually very similar to one another. They can each serve the same content. So what we did was we
restricted our domain a little bit. We said let’s just worry about what we call object delivery
services. So these are going to be services that are serving static content, data that's not
changing, so photos or videos or static webpages. And with this restriction we were able to
provide fault tolerance for these connections with dramatically lower overhead than previous
approaches. And our key technique is that we are actually going to embed application-level state in the mechanics of the TCP transport layer so that we're going to get these unmodified clients to
actually help us out even though they are unmodified.
I also had another project at NSDI in 2010 on a system called Prophecy. This was about scaling
byzantine fault tolerance. So byzantine fault tolerance provides protection against the worst
kinds of faults, malicious faults where failing nodes can even collude with one another. And in
traditional byzantine fault tolerance systems, you need four machines or even more to provide
the same throughput or similar throughput to a single non-fault tolerant machine. So in this
project we produced the first system that was able to provide byzantine fault tolerance
with slightly weaker consistency that had throughput competitive with non-fault tolerant
clusters for common web workloads.
I want to touch on a few interesting future directions that I'm excited to work on. One is: how
do we reason about partial replicas in geo-replicated storage? So right now and in this talk I
talked about, you know, we have a couple of data centers and we have our data in each one of
the data centers. So what happens when I have 20 data centers? Well, I'll only want to have a
subset of the data in each one of the data centers; I don't want 20 copies of it. So how do I
reason about that? How do I ensure low latency and consistency for this data where I only have
subsets of it in each one of these data centers? This is also really cool because it allows us to
extend our reasoning to edge devices. So your phone or your laptop could be viewed as a tiny
partial replica of the data that we actually have in the service.
Another exciting future direction is making general transactions more scalable. When I say
general transactions, I mean things that are more general than what I talked about in this talk,
but not fully general in the database sense. And so general transactions will allow you to
atomically read data and then write data in a single atomic block, for instance. I'm interested in
looking at how we can make these fully scalable as well so we can use them on data spread
across a huge number of machines.
I'm also interested in revisiting the network abstractions that we used to build these storage
systems. So the current abstractions that we have are from the 1970s, and we don't have to be
stuck with them. And in this big distributed system, I control all the nodes; so if I want to put a
new networking stack on there, I can do that. I can upgrade all of them at the same time. And I
think this is exciting, this is an exciting place to do networking research because of that, and
when I was building these systems I noticed that the abstractions underneath didn't actually
match what I wanted to do. What I want to do is issue a massive number of parallel RPCs.
My final direction of future work that I'm excited about is improving the programmability of
these distributed cloud storage services. So I talked about this fundamental impossibility result.
Either we can have strong consistency or we can have low latency. We can't have both.
Sometimes you need strong consistency and sometimes you need low latency. So what I'm
interested in doing is first, unifying these two abstractions for the programmers so they can
explicitly choose what do I need in each situation. And then iterating on that and improving it
using software engineering and programming language techniques so eventually we can get to
a place where programmers are telling us their invariants and we're enforcing them for them, and
they’re telling us declaratively what they want to do instead of imperatively how they want to
do it.
So to conclude, in general I like to look at classic distributed systems problems except through
the lens of modern cloud or web services; and today I showed you how we re-examined replicated
storage for these large web services with the new requirements of geo-replication, massive
scale and low latency; and I showed you how we can provide stronger causal consistency and
stronger semantics for them. Thank you.
>>: We have a few minutes for questions.
>>: What’s the scaling of the eventually consistent version of Cassandra look like?
>> Wyatt Lloyd: So it looks very similar. The sort of, the big bottleneck that we see there
where there’s sort of the non-ideal scaling characteristic is that we have lots of batching on a
single server cluster and not so much batching on a much larger cluster. This happens in an
eventually consistent system as well.
>>: But how does the slope compare to the slope that you showed?
>> Wyatt Lloyd: So I did not, I didn't publish the slope for the Cassandra system. But in my not
fully rigorous experimentation with it, it looked exactly the same.
>>: So for the Cassandra system and the Eiger systems, [inaudible] experiments, each of the
write [inaudible] replicated?
>> Wyatt Lloyd: Oh, yeah. No, no, no. Sorry. So in these experiments Cassandra does not
implement replicated state machines. So each one of these, so you take these results with sort
of a grain of salt in that each one is being written onto a single server locally, and we would see
different performance characteristics that would perhaps be a little bit worse if we were doing
something on top of sort of that actual logical server abstraction.
>>: For each of the systems, let's say basically it starts [inaudible] do you think the percentage
basically difference still would be the same or basically Eiger potentially may have more
overhead?
>> Wyatt Lloyd: So I think we did things very naively. Eiger would have more overhead. An
interesting direction that I didn’t talk about is looking at these two levels of abstraction on top
of one another. And I think sort of depending on what we put underneath will have good
performance for some workloads, worse performance for other workloads, so if you have a
super write heavy workload versus a super read heavy workload, you know you want different
abstractions underneath. And sort of exploring this I think is very interesting. And also seeing,
you know, can we do something that's going across these levels of abstraction that would be
more efficient? I think it’s an interesting direction of research as well. Yeah.
>>: So my understanding is that you communicate to a user that they’ve been migrated from
one web server to another or one data center to another by logging them out so they have to log back in?
>> Wyatt Lloyd: So no. So one, this is not implemented, this is sort of the level in front. But the
way that I would tell someone writing a web service in front of this is that in the cookie of the
user, [inaudible] something that is the last time or the last logical time this user interacted with
the service and what data center they were using. And then there's stuff that I didn't talk
about, like we have garbage collection that goes around and keeps track of this data in the different data centers, that says how far behind the different data centers are, so that we can do things like say there are no dependencies on stuff from yesterday, for instance. So if you read data from then, there are no dependencies. If you include that, you can do a check that says this user is using this data center, and once all data centers are past what that user has seen, you can sort of delete the state that says this user has been using data center x. So when they're on a plane this will happen, and when they get off the plane in the other location they can start using whatever data center DNS sends them to or whatever.
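A hypothetical sketch of that cookie-and-redirect check, which, as the speaker notes, is not implemented in Eiger; the fields and the replicated-up-to map are stand-ins for the bookkeeping a web service in front of the store would keep.

```java
// Hypothetical sketch of the cookie scheme described above.
class SessionCookie {
    String lastDataCenter;   // data center the user has been using
    long lastLogicalTime;    // last logical time the user saw there
}

class RequestRouter {
    // How far each data center's replication has progressed, e.g. fed by the
    // garbage-collection machinery mentioned in the answer above.
    java.util.Map<String, Long> replicatedUpTo = new java.util.HashMap<>();

    String pickDataCenter(SessionCookie cookie, String nearest) {
        // If the nearby data center has already applied everything this user has
        // seen, serve them there; otherwise fall back to the data center they had
        // been using (triangle routing in the worst case).
        long applied = replicatedUpTo.getOrDefault(nearest, 0L);
        return (applied >= cookie.lastLogicalTime) ? nearest : cookie.lastDataCenter;
    }
}
```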
>>: So you delay. So if the user, if the web server dies or the data center dies, you just delay
them by, you just say the user can't use the system until you are sure that all of this data has
copied, has dependencies [inaudible]?
>> Wyatt Lloyd: So if the user physically migrates, that will be fine, because they can only physically migrate slowly. If a user is directed to a different data center,
we’ll do this sort of redirection to the data center they had been using or we’ll do triangle
routing in the worst case. If that data center blows up, if that data center has a failure, that's sort of not ideal. We'll try to redirect things for a while, right? Yeah. So I mean we're going to
try to redirect things for a while. At some point you're going to give up and you're going to say
the user either has to wait for that data center to come back up, or the user is going to see
inconsistent data. That's sort of an administrative policy that you have to decide. And in the
worst case, let's say you write some data into this data center and we accept it locally right
away and send a response back to you and then this data center gets destroyed in like the
earthquake that they tell us is coming, right? So in that case we could have data loss because
we've only written in one place and we are optimistically sending it out elsewhere. I think that
that's not going to happen in a high fraction of cases there because we have, here's the web
server, here's the data storage, the web server writes something to the data storage, we write it right away and send something back; concurrently with that, in the background, we're sending it out to the other data centers. So if it doesn't make it out to the other data centers, it's likely that before this server responds, like [inaudible] responds with something that says your post went through, this web server will also be blown up.
>>: I was just wondering how you communicate to the user, and it sounds like a combination of
delays and non-communication.
>> Wyatt Lloyd: So you can think of primarily triangle routing and in the worst case we would
either delay or show something inconsistent. Yeah.
>>: So why would you show something inconsistent? [inaudible], right?
>> Wyatt Lloyd: So if you have the user accessing data center x and data center x blows up and
some of their updates and some of the stuff that they read is no longer available anywhere
else, you have to choose: do I have this person wait forever because that data center is no longer
there, or do I show them the data that doesn't include sort of everything that was there? This
is like sort of this extreme failure case, right?
>>: So it could be old data that's still causally consistent.
>> Wyatt Lloyd: It would be old data that was causally consistent with itself, but the user
might not see things like, I see your status update and we’re using the same data center; the
data center gets immediately blown up. We never have that again. If I access another data
center I cannot see your status update. I might even lose it.
>>: [inaudible].
>> Wyatt Lloyd: Yes. Exactly.
>>: [inaudible] replicate and you don't know [inaudible], I mean there’s nothing you can do
about that.
>> Wyatt Lloyd: Yes, yes.
>>: So if you're comparing system, [inaudible] system, what would you say [inaudible]?
>> Wyatt Lloyd: So Spanner is not a low latency system. So there's been a bunch of sort of cool
work pushing on both sides of this impossibility result. On the one side it’s sort of mainly my
work, on the other side it's the Walter system from NYU and Marcos Aguilera. There's also this Gemini paper at OSDI that's looking at how we can make
strongly consistent systems more efficient, and because of these impossibility results, none of
these systems can guarantee low latency like we can do.
So in Spanner, they're doing at least one wide-area round-trip. The initial motivation for Spanner was storing ad data. So with ad data it's reasonable for them to say we can have a reasonable amount of latency because, even though we are doing this wide-area agreement, it's all going to be on the East Coast, for instance. So we're going to have sort of a small amount of
latency here. In a social network setting you want everyone to be able to access this data, so if
you only write things on the East Coast, if anyone wants to read that data they have to access
the East Coast for that. So there's sort of a trade-off there; they're hiding it a little bit by not distributing things globally. And they always have to do something in the wide area anyway.
But they provide stronger consistency. So that's the trade-off. Yeah.
>>: [inaudible] linearizability which is not achievable and then causal [inaudible] is right
underneath there?
>> Wyatt Lloyd: Yeah.
>>: Do you have a sense for something useful in between those two, or is causal plus the best way?
>> Wyatt Lloyd: So there's an awesome, so there's an awesome theoretical result out of UT
Austin where they take, so we have causal consistency and we have causal plus consistency and
they have this thing called real-time causal consistency which is very similar. And they have this
very intricate cool proof that says that you cannot have a consistency model that is strictly stronger than real-time causal consistency in a low-latency wide-area system. So there's nothing
strictly stronger than that. There could be something that was incomparable that would be
more useful, but no one’s thought of it yet. You can’t prove something like that. But, yeah. It's
sort of, this exists and everything that we know that is more useful, that is better, has proofs
that say you can't have that.
>>: What about [inaudible] transactions?
>> Wyatt Lloyd: Sorry, I can't hear.
>>: [inaudible] read-only and write-only transactions?
>> Wyatt Lloyd: Yes.
>>: [inaudible] general [inaudible]?
>> Wyatt Lloyd: So interactive transactions are very tricky. General transactions that are
greater in scope than what we have but less than like interactive where I can start a transaction
at the beginning of the day and then at the end of the day like try to commit it, things that are
more interactive than that are really interesting, and I need them to transfer money between
our bank accounts and things like this. They're very tricky to handle in a low latency way.
Because as soon as you give me read write transactions, the first thing I want to do is
implement mutual exclusion, right? This field is empty, I grab the lock. But you’re somewhere
else, and if you're doing this at low latency, the field is also empty for you and you grab the
lock. So if we are doing read write transactions and we’re trying to guarantee low latency, we
have to have these like really funky conflict resolution policies, which some systems like Bayou
have, but I think are very difficult for programmers to sort of use in a meaningful way,
generally. So when I talk about general transactions there, I have this like bullet point where I
say like we can't guarantee low latency. I think we can't guarantee low latency because, to do anything with read-write transactions, the consistency that I need underneath is stronger than causal consistency.
>>: [inaudible] are interactive phase for read transactions, like same you do right now for write
transactions or right [inaudible] with just reading interactively. I mean, we don't know who is
my friends or who is my transaction with. They want to read the posts of your friends
[inaudible].
>> Wyatt Lloyd: Two phases of reads, like you read some stuff and you read some stuff after
that. So I think that's super useful, and it's very interesting, and it's something that I started
to design and then I had to graduate. But it’s very useful, and I think we probably could do that
in a low latency way. Yeah.
>>: Let's thank the speaker again.