>> John Douceur: Well, good morning, everyone. It's my pleasure to welcome
Chris Stewart, who is an assistant professor at the Ohio State University, who
is visiting us for the day as part of a larger west coast tour, we understand.
Chris's interests are in network and distributed systems, and particularly two
sort of areas. One in sustainability, and one in performance.
And based on the title of this talk, I assume we're going to be hearing more
about the latter. So take it away.
>> Christopher Stewart: All right. Excellent. So if I were Troy McClure, I
would start this talk with, you may remember me from such research as. For
those of you who are good friends, happy to see you guys again. For those just
meeting me, I graduated from the University of Rochester in late 2008. My work
there was mostly on modeling and system management for internet services.
So there, the idea, we started with some early work, published in NSDI '05, my
second year of grad school, and we had some work going all the way up through
2008. And the basic idea there was to come up with some first principle models
that gave us good ideas about what system, what response time would look like
as we changed the number of nodes in a system so we could do things like
capacity planning. Anomaly detection came a little bit later.
One of the great things about doing a set of slides like this is you get time
to sort of think about looking back at the high level contributions that these
things had, right? So here I think with this particular work especially, the
contribution was on the simplicity of the model. So we started looking at
queueing theory, which at the time people weren't looking really too deeply
into for internet services. I think there was a paper from Duke, and
[indiscernible] group was working on similar stuff. And we showed how to use
system metrics that you can routinely collect at some middle layer by -- we
instrumented the OS, but today it's done in a lot of middleware systems. We showed
how to collect these metrics and how to apply them in a fashion that system
managers could actually understand and implement.
So when I was at HP for basically the latter part of my Ph.D. thesis, which I
spent working with folks there, we ended up shipping this out to a product
group, and this paper, I'm told, influenced some other people at other
companies that ended up being able to apply these in practice.
So around the time this was winding down, conveniently, it was around the time
I was graduating. So I figured it was time to switch areas, and I did a one-year
post-doc with my advisor. So I started looking at performance anomaly
detection. And here the idea is to use these models that we were developing
before and use them now as the base system, instead of trying to predict what,
trying to use models to predict what the real system is like. We want to say
hey, the real system should look like these models.
And here, there was a really nice interplay between some machine learning
techniques, especially, here. So we looked at trying to characterize the whole
space for this anomaly detection stuff.
Most recently, or second most recently, I left the systems community altogether
and started looking at datacenter architectures in particular. In 2009, early
2009, I was caught up post-election cycle, I was all about renewable energy at
the time. So I was looking at how can we power datacenters directly with solar
panels. And we published one of the first papers on that at HotPower. It's
been pretty well received. So now there's an emerging community in the
sustainable computing area, and I'm the editor of the Sustainable Computing
Register, which is an IEEE publication from the IEEE STC on Sustainable
Computing. I'll say that again, the IEEE STC on Sustainable Computing. I encourage
everyone to join. It's actually free. This is a new model for IEEE, where
we're just trying to bring together experts, researchers in the field with a
common interest area.
So after this, I decided, okay, let's get back to the heart of what I work on,
and how can we do things that are very relevant to business, to industry in
terms of performance and system management without losing the sustainability
focus. And this is what my talk will be about today. And so I'm going to give
you guys some discussion about actually some things that are very recent. So
these are papers that are in the pipeline, stuff that hasn't been published
yet.
So I appreciate your feedback, and you can ask questions at any time. I think
I've slated this talk for -- I think I have an hour -- but the talk itself is
about 40 minutes without interruption. So feel free to ask questions at any
time, and I can manage. If I can't answer, I'll just pretend like I'm running
out of time and we'll go from there.
So at a high level, my current research goal comes from the intersection of
three trends. So from the hardware side, SSD is faster than disk. PCM is
faster than disk, and the total capacity of on-chip caches is still growing.
So we can still fit more transistors on chip at least for a few more years. So
we have the ability to access more data quickly than we ever did before.
On the software side, that has been a timely occurrence, because there's been
this huge data explosion. So we want to perform analytics on top of any and
everything. By we, I should say there are certain companies that want to
perform analytics on top of any and everything for whom analytics is their
profit margin. So being able to predict what people are doing is very
important.
And here, intuitively at a high level, the closer to realtime you can get with
data analytics, presumably the better your profit margins can become. It
doesn't hurt you to be able to do things faster.
And then finally, the human aspect, the human interaction aspect. Some would
say this is surprising: it still isn't getting much better. So the time that you
have to do all of this data analytics is still about the same. So there's a
really neat website called humanbenchmark.com. They just show you a red box
and they ask you to click as soon as this red box turns green. So just
measuring your response time. This is the range of the scores, 215 to about
500 milliseconds. Still the same place it was in the early 2000s when we first
started looking at --

>>: How much of that is the [indiscernible] from the brain to the fingers?
>> Christopher Stewart: I don't know. So I think that --

>>: Did you ever stop to notice what time it takes?
>> Christopher Stewart: Yeah, I don't know why we're so slow. In particular,
I don't know why I, myself, am slower than others, actually. When I did this
benchmark, I lose upward of 500 milliseconds plus. So there is a biological
answer here, and at a high level, you could say maybe we could find ways to
reinvent computers to circumvent some of these biological constraints, right.
Maybe allowing response time to be a function of how fast we can blink, rather
than click. Maybe that's a little tangential to the main point here, but it's
an interesting thought.
So the way I view these all sort of combining together, we're going to want to
be able to run data analytics on moderate scale, moderate amounts of data, very
quickly in as close to realtime as possible. So we're looking at emerging
applications that are defined not by the interaction time between service to a
human, but between a machine to a machine. And we need to do a lot of those
interactions before we go back out to the human.
So at a high level, we're trying to get at emerging applications that have a
demand for very low latency but also very strict SLAs. And I guess I'll define
that on the next slide. So the general outline here is first, I'm going to
talk about the types of applications that I've been targeting in my research
group. Then we'll talk very specifically about our key value store that
supports these types of applications, show some results and conclude.
So when I say very strict SLA with low latency, by low latency, we're talking
about the bound in a service level agreement. So K percent of requests
must complete within X amount of time, where X amount of time is defined in
terms of milliseconds. K percent of requests is defined in terms of number of
nines. So this is the long-held goal of trying to get to multiple nines, like the
telephone industry for access times.
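In code, checking a latency trace against an SLA of that form might look like this minimal sketch (the function names and the 15 ms / three-nines numbers are illustrative, not from the talk):

```python
# Minimal sketch: does a latency trace meet a "K percent within tau ms" SLA?
def service_level(latencies_ms, tau_ms):
    """Fraction of accesses that completed within the latency bound tau."""
    return sum(1 for l in latencies_ms if l <= tau_ms) / len(latencies_ms)

def meets_sla(latencies_ms, tau_ms=15.0, k=0.999):
    # Illustrative numbers: three nines of accesses within 15 milliseconds.
    return service_level(latencies_ms, tau_ms) >= k
```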
So here on the bottom, I'm showing the difference between what I see as
emerging services and traditional services. Your traditional service had a
couple of different layers. But in the end, when it got to the data store
layer, you only had a few machines issuing accesses before you responded back.
For instance, you did one look-up in a user query table to see what their most
recent purchases were, or to retrieve their cookies or something like that.
With these emerging services, what we're looking at is middleware that sort of
encapsulates a lot of what was previously done at the database layer. So
transaction processing. It also performs some complicated data analytics business
logic and is issuing a lot of requests to your back-end data store. Here the
thickness of the lines indicates the number of accesses issued by these
components.
So why do they issue more accesses than previously? Two reasons. One, you're
doing more complicated tasks, like data analytics of some sort. And these back
ends have gotten simpler. So where a database gave you the ability, for
instance, to do transactions, your key value stores may just give you the
ability to do gets and puts.
And --
>>: Is this because when you say data analytics, I'm trying to figure out what
your scoping target is, because there's one kind of data analytics, which is
like [indiscernible] over web pages and stuff like that. And there's another
type of data analytics, when you think about Facebook, things like that, so
somebody's profile page loads. You're trying to figure out which posts should
be shown, stuff like that. So there's much different scale requirements, stuff
like that, for both of those. And so can you give some example applications,
like --

>> Christopher Stewart: Right, right. So at a high level, you can follow the
-- those will highlight and those are all example applications that I'm going
to talk about. And on the next slide, James, it's coming, it's coming. But to
answer your question directly, it would be closer towards this Facebook side.
So a moderate amount of data that needs to be analyzed and searched, and very
quickly.
>>: The phone company, after all, the lifeline service, 99.9% reliability
makes sense. If you're trying to make money, wouldn't 90% do it? You make
money on nine customers and you lose on the tenth. If you're ahead of the game
in other ways --

>> Christopher Stewart: Yeah, so the reason here is the user isn't involved on
all of the requests here. So you're right. You know, if one out of ten users
has a bad experience, maybe we're willing to deal with it. The problem here is
most of your accesses are happening at this layer. That is, machine to machine
interactions. So a few slow accesses, every 1,000 requests, means that
everybody is going to see a very slow service. That's what we're getting at.
>>: But the user level, you lose 90% of the [indiscernible].
>> Christopher Stewart: Exactly, exactly. So take a look at some concrete
applications, right. So one domain that we've been targeting is the scientific
cloud. The goal here is not to replace the super computer. That works very
well for the types of applications that it targets. But there are many
scientific applications that don't utilize super computers well. So these, if
you run them, they run on a small portion of the super computer. They don't
need to run for months at a time. Maybe they're running for weeks now, or
several days.
And what we would like to do is make those simulations move from an order of
magnitude of days to hours or hours to minutes. And what I'm showing you here
is a concrete example. So we're looking at a smart grid simulator that we've
instrumented in my lab. And one of the collaborators that I've been working
with out here, Bonneville Power -- I think they're based in Oregon. I don't know
if they reach all the way out here or not. Maybe not.
So one question you might ask is, okay, there's been an increase in demand, so
should I increase the amount of hydropower that I'm producing, or should I ship
electricity in from elsewhere? And if you have a smart grid, you have also
devices that may adjust their behavior if you tell them to.
Well, how does all of this play out? You'd really want to have an idea of this
so you'd like to be able to run simulations to get an idea of what's going to
happen before you do it. So the way this particular grid simulator works is
you have a master node that's going to set up a lot of, for instance, global
variables and just, you know, do some maintenance work. And then it's going to
start all of these different simulation agents.
For instance, if you're modeling houses that all have smart air conditioners
that are going to react to messages from the power company to indicate when
electricity is quite costly and then turn up the heat, right, you want to have
those modeled. You want to have the free market price changes modeled. For
instance if you start producing more hydro, that's going to affect prices. And
you need to model people's personal behaviors.
All of these simulated agents have input and output files that they need to
access in order to do a correct simulation as well as global variables that
need to be accessed consistently over time and shared across these different
agents.
So one more example. So one of these simulations. One of these relatively
short simulations, so just simulating the effects across three days, you can
issue over 10,000 storage accesses. And right now, if you just put this on one
node and ran it, you're talking about a little over an hour -- no, a little
over two hours to do a single computation, which is too slow, because your
power operator really would like to be able to do these simulations on the
order of minutes to get an idea of demand response.
>>: So are these jobs IO bound? Are they [indiscernible] bound?
>> Christopher Stewart: They're a mix. So they go through different phases. At
some points, they may be IO bound. At some points, they may be CPU bound,
because, for instance, maybe an agent is trying to update or model a person or
something like that.
So once again, going back to the earlier question, so if you have 10,000
requests in here, if you're only meeting an SLA of 95 percent, that means
you're going to have a lot of slow accesses, right. Five percent of 10K is a
relatively large number. So you really need to have a high SLA to guarantee
you'll have low latency on all of the accesses issued by the smart grid
simulator.
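To put numbers on that (a back-of-envelope using the figures from the example, not data from the talk):

$$0.05 \times 10{,}000 = 500 \ \text{expected slow accesses at a 95\% SLA},$$

whereas getting all 10,000 accesses under the bound with, say, 99% confidence requires a per-access service level near $0.99^{1/10000} \approx 99.9999\%$ -- hence the push for multiple nines.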
>>: [indiscernible] try to see how that follows, because it seems like if you
model one of these agents that has to get through 10,000 accesses, it seems like
what would drive the total simulation time is the average access time, not the
worst case access time.
>> Christopher Stewart: Right, right. So the intuition here is that this is a
time step type simulation. So in order to simulate the next minute, you need
the results from the prior minute. So if you have a couple of slow accesses,
what they're going to do is slow down -- they'll prevent you from moving
forward. Whereas if you -- for instance, if this were more like a web server,
right, where just everybody throws all their requests against it and you get what
comes back, then you care about the average, right.
But here you can be -- we'll show some data on exactly how that has an effect a
little later.
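A toy simulation (my own sketch, with made-up delay numbers) of why the barrier makes the tail, not the average, the driver:

```python
# Toy sketch: in a time-step simulation, each step waits on its slowest
# access (a barrier), so a rare straggler delays the whole step.
import random

def step_time(n_accesses=100, mean_ms=1.0, p_straggle=0.02, straggle_ms=50.0):
    times = [random.expovariate(1.0 / mean_ms) +
             (straggle_ms if random.random() < p_straggle else 0.0)
             for _ in range(n_accesses)]
    return max(times)  # the step ends only when its last access does

# With 100 accesses per step, about 87% of steps contain at least one
# straggler, so run time tracks the tail latency, not the ~1 ms mean.
print(sum(step_time() for _ in range(1000)))
```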
>>: So it's like the [indiscernible] problem, you get a barrier.

>> Christopher Stewart: Exactly, exactly. So looking at some of the data
analytics features -- uh-huh?
>>: Sorry. I guess intuitively, it seems like can't they just have a rule of
thumb, and a formula they plug in how many smart air conditioners they have,
and the cost, couldn't you get a good approximation that you could run in a
second? Why do you need to do this elaborate simulation? [indiscernible]. Is
it such a chaotic system where small change and -- I mean, I can't imagine that --

>> Christopher Stewart: Well, the smart grid in -- if its ultimate glory
were ever realized, the smart grid would be an incredibly complex distributed
system. You can imagine if everyone had their own meter that's fluctuating and
changing over time and people have the ability, for instance, to turn that
meter on and off, right, it can be -- now, I think your question is very valid.
What is the actual impact of those things, right. Is it that it just sort of
converges to some global average. That I don't know. But presumably, we would
like to be able to run the test to see if that happens. This is just something
that has been of interest to the folks that are building the smart grid. We'll
see how that works out in practice.
So from the data analytics perspective, I guess what we're looking at, the
example I'm providing here is in terms of email. What if you wanted to do some
sort of targeted ad placement, and actually some previous work here from
Microsoft sort of motivated this. So they were looking at searches at Bing and
they found that if you could look at the recent log of searches that had been
performed, you could improve the recall accuracy by something like 25 to 50
percent. It was an interesting study.
More generally, I think social networks and whatnot provide even deeper data
here. Like so can you detect when a crowd is -- when there's some sort of
crowd phenomena happening or when there's a trend that is changing the meaning
or context of words for better targeting of ads, right.
So here, the challenge is the data's changing frequently. So you could use
some old data. You could use -- which is what people pretty much do nowadays.
But more up to date is better. And it would be even better if you could do
some sort of dynamic feature extraction from the query. And to give a very
concrete example, we sort of view this as being very akin to deep Q&A, like the
IBM Watson project.
What happens here is you give input of some text that, for instance in Watson
may have been a question or, I'm sorry, a sentence. He said this, and then you
want to extract from that text not just keywords, but some context also. And
the way this is done, actually, was very interesting to me as we looked into
the NLP approaches to this.
I call it a big set of if statements.
So we have an idea of what context looks
like here, here, here, here. We need to check against all of these. And based
on that, you issue a lot of storage accesses and try to retrieve relevant
content.
So this idea sort of follows into a map reduce model. But the key idea is
you're time bound. So Watson, they were trying to do two to three seconds to
win at Jeopardy. By the way, this quote, does anybody know this one?
Simplicity is a great virtue, but it takes hard work to achieve and education
to appreciate. This is Edsger Dijkstra. One of my favorites.
So ideally, you guys will believe that my ideas are relatively simple. But
also impactful. That's where we're trying to go with this. So once again,
this is a real system that we have up and running, so we have GridLAB-D up and
running in my lab. We have OpenEphyra, which is actually what the earliest
versions of Watson were built on, running on our system with data sets. One fun
data set is from USENIX. So we've culled all the USENIX web pages and we can
tell you about your history of service to that community, in particular, what
PCs and whatnot.
And then finally, because I do still care about the sustainability side -- uh-huh?
>>: We were talking about stragglers before. So to first approximation, what
fraction of the problems that you're trying to solve are, quote unquote, just
fragment problems?
>> Christopher Stewart: We hear a lot about stragglers. Actually --
>>: Would you say that's the main thing that keeps people from satisfying the
SLAs, the fact that if there were no stragglers then satisfying SLAs would
actually be fairly straightforward. But the fact is there are stragglers that
straggle different ways.
>> Christopher Stewart: There's still some issues with just queueing and
managing workload fluctuations, but this talk is specifically about the
stragglers. So when we get into some of the more technical aspects, that's
what we will be talking about.
So yeah. So the idea here is you have these, with the ability to beat human
perception now, pretty easily, computers are fast enough that we can respond
pretty quickly, one way to think about that is it can give you some leeway with
the SLA. Some opportunities to invest in other things when you would have
originally invested in, for instance, more nodes. So our idea here is we want
to meet an SLA using as few resources as possible, and then exploit other
competitive advantages. For instance, going carbon neutral, Microsoft, right?
And how can we turn those into profit?
So another application that we've set up in my lab, you'll notice the URL is
just sort of like a random IP address. Greenmail or mantismail or I don't
know. We have to come up with some internal name for it. This is a carbon
neutral email IMAP cache, and what we'd like to be able to do is provide
differentiated SLAs to different users based on their desired level of
greenness. That is, the amount of carbon offsets that they would like to
receive as they access the site and the frequency with which they access the
site.
Here, once again, we want to use as few nodes to meet the SLA as possible so
that we can have as many -- so that we can buy, within a budget, as many carbon
offsets as possible for our sets of users. So this is another one that we have
set up.
All of this comes together for the technical focus of this talk, which is
Zoolander. This is our key value store on the back end. So without further
delay, the basic idea behind Zoolander is this. For years, we've been looking
at using replication and partitioning as our ways to manage SLAs, right. And
so here I'm depicting an awful JPEG figure. The basic concept's here.
So partitioning means you have some set of keys or groups of keys, often called
shards, so you put keys together. And any writes that are going to one shard go
to one node. Writes going to this other shard go to another node, right.
Replication, traditional replication, or what we call replication for
throughput, is, hey, I want to evenly divide the accesses, so I'm just going
to, whenever I get a write, I'll send it to one node. Next write, I'll send to
another node, so forth and so forth.
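As a sketch in code (hypothetical node names, not Zoolander's implementation), the two policies differ only in how an access is routed:

```python
# Sketch of the two classic policies: partitioning pins each key's shard to
# one node; replication for throughput spreads accesses round-robin.
import itertools
import zlib

NODES = ["node0", "node1", "node2", "node3"]

def route_partitioned(key: str) -> str:
    shard = zlib.crc32(key.encode()) % len(NODES)  # key -> fixed shard -> node
    return NODES[shard]

_next_node = itertools.cycle(NODES)

def route_replicated_for_throughput(key: str) -> str:
    return next(_next_node)  # every node holds a copy; alternate accesses
```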
Now, the basic idea behind both of these approaches is that you want to reduce
queueing. You want to reduce the amount of work that each node has to do.
And, you know, it's worked very well. It clearly can get us up to this 95th
percentile SLAs. The challenges here, traditionally, have been in terms of
managing hot spots and consistency. So if you do partitioning, if you have
more accesses to one key than another, you have a hot spot, which means you've
added a node, but you haven't divided the workload in half. If you do
replication for throughput, sure, you divide the accesses evenly, but in order
to maintain a consistent view of data, the nodes have to interact with each
other. You still haven't really divided the workload in half.
Now, that's the traditional problem. But as James was pointing out, we're
actually concerned about a different problem. We're actually concerned about
stragglers. And so the key problem here is even if you have perfect division,
even distribution with workload, and very light arrival rates, your ability to
reach these very strict SLAs that we need to ensure that something like 10,000
accesses all happen very quickly is actually still limited.
So here you're looking at an experiment that we did on Zookeeper, where we had
just a very small key range that we accessed with no concurrency. And on the X
axis, you're looking at the latency of each access. The Y axis is the cumulative
distribution. So your SLA that says something like 98% of accesses is just the
question of reading horizontally along here to the intersection of the line on
the X axis, and that tells you your latency bound at that point.
This figure should be a little bit depressing, because what we're saying is you
can see all of these very heavy tails here. And what it means is that 98%, 96%,
91% of read-only accesses, writes, and writes to an actually dependable store
can actually be processed within -- that is, that bar is three times the mean.
So that means most of the time, that means a high -- I shouldn't say with a high
probability. But getting to 99% at a bound that is relatively close to the mean
isn't possible under just this experiment.
>>: This is a function of the [indiscernible], right. So if you
[indiscernible] --

>> Christopher Stewart: No, no, no arrival rate here. This is just take an
access, issue it, wait for it to come back. Issue another access, wait for it
to come back. This is --

>>: [indiscernible].

>> Christopher Stewart: No concurrency. Like I said, sort of depressing,
right? So you can divide this.
>>: [indiscernible].
>> Christopher Stewart: So I'll get there. So the -- I just want to point out
before that, that this actually happens across other key value stores, right.
So we've looked at Cassandra, Zookeeper, Memcache. These are all popular key
value stores that people are using. So what are the reasons for this? You have
background jobs, right? So, for instance, Cassandra has this write buffer
issue where, you know, you issue a write, it tries to respond as quickly as
possible by keeping a buffer of your writes. This is great, high throughput,
right? I can batch all of these buffered writes together.

The problem is, every so often, a request can get stuck behind one of these
buffered writes. So if a write buffer flush occurs once every second, and a
typical access is taking you just one millisecond, then you have this
good chance that -- not good chance, but you have 5 percent of your accesses
are going to be affected by possibly this write buffer flush.

Now, if the write buffer flush stalls the request, if it requires stalling the
entire system -- maybe a write buffer flush, you may say, doesn't have to do
that, but definitely garbage collection or DNS timeouts could -- then you're
going to talk about having almost a 10X increase for that particular request
for these background jobs.
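A back-of-envelope version of that example (my arithmetic; the spoken numbers are approximate): if a background job stalls the node for $d$ milliseconds once every $T$ milliseconds, then roughly

$$\Pr[\text{access lands in a stall}] \approx \frac{d}{T}$$

of uniformly arriving accesses get delayed, each by up to $d$ -- negligible for throughput, but exactly the kind of rare, large delay that builds the heavy tail.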
>>: It seems like there is an arrival rate. In other words, if there's a
background job and there's some arrival rate, it's not that you're issuing the
request in isolation. You're issuing the requests and other people are also
issuing requests concurrently. Is that what's happening?
>> Christopher Stewart: You can call it queueing. You can call it queueing,
but these aren't particularly requests that are coming from nodes in your
system for storage access, right. This is a part of the system itself.
Cassandra just keeps this write buffer.
>>: [inaudible]. Are you talking about background jobs, or things accessing?
>> Christopher Stewart: So just to maybe explain this example a little better.
So Cassandra keeps a write buffer. Eventually, you have to flush this write
buffer just to keep your data dependable, to get it on a place that is
reliable. That is a background job. The process of flushing the write buffer
is a background job. Garbage collection would be what we're calling a
background job here. Or updating or --

>>: [indiscernible].
>> Christopher Stewart: So if we look at this, this is an interesting issue.
So what we found is that these things that are causing the heavy tail tend not
to be correlated with a specific access. So it's not a function of the input
I'm giving to the key value store. It's just something that has to happen over
time. And, as we saw if I go back, these are some pretty big delays. So every
now and then, you just really get whammed with slow delays.
And this has a big impact. So earlier, somebody was asking me about the
barriers and the impact of stragglers, right. So if we look at that grid lab D
workload again, and now we're going to actually inject stragglers into the
system. So on 2% of our accesses, we're going to inject a delay. And we're
going to compare that to injecting the same delay, the same total amount of
delay, across all 100 -- all accesses.
And the idea, the take-away from this, is this dotted line represents linear
slowdown. So if we inject across all accesses, we can actually do better than
the linear slowdown. Why? Because we can mask things with computation.
Because you have to compute, so you can compute while you're doing IO in the
background. And it's a small delay. It's not a long delay. So this is okay.
But if you focus these delays on just a couple of stragglers, you start to run
into trouble. Because now, the delay is long enough that you're delaying these
barriers as you go through.
>>: So you're saying that delaying all requests by 20% has less effect than
going 2% of the --

>> Christopher Stewart: It's the same aggregate amount of delay that you're
injecting.

>>: 20%. So you're taking the same total delay and concentrating it into a
small number?

>> Christopher Stewart: Exactly. So here, on this line, you may be adding,
say, for instance, just 200 microseconds to each request, rather than,
divided by 50, two milliseconds to each request.
>>: The average delay you're adding is the same; it's the standard deviation,
if you will, of the delay you're adding. In other words, at 20 --

>> Christopher Stewart: Exactly, exactly.

>>: So which line is the linear?

>> Christopher Stewart: The dotted line.

>>: Oh, the one that's --
>> Christopher Stewart: Yeah, sorry. So this is reversed. That is reversed.
Whoops. Okay. So here's the key insight behind Zoolander. I really like this
work. So key insight is we want to revisit this very old technique called
replication for predictability, which I think falls in line with this old paper
by Jeff Mogul called an old dumb idea whose time has come.
The basic idea behind replication for predictability is you make N copies of
your data. But instead of sending each request to one of the N, you send all
requests to all of the N. And you take the fastest response back.
Now, and so it's a dumb idea. Why is it a dumb idea, just off the bat? You don't
improve throughput this way. So you're issuing the same amount of work to all
of your nodes. Why are you doing this? It turns out to be a useful idea now,
because it reduces the variance. So that heavy tail, you're less likely to get
stuck behind the heavy tail if you do something like this. And we're here
showing this, right.
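In code, the read path of replication for predictability might look like this minimal sketch (kv_get is an assumed client function standing in for any key value store, not Zoolander's API):

```python
# Sketch: issue the same access to every duplicate, keep the first response.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def get_fastest(duplicates, key, kv_get):
    pool = ThreadPoolExecutor(max_workers=len(duplicates))
    futures = [pool.submit(kv_get, node, key) for node in duplicates]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)  # don't block on the straggling duplicates
    return next(iter(done)).result()
```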
>>: You're reducing the capacity of the system by a third. So you're
comparing now the 20% of the 2% to some [indiscernible]. Your hypothesis
[indiscernible].
>> Christopher Stewart: Yep, yep, yep, yep, yep, yep, yep. We'll get to that.
We'll get to how we can answer that. So here, we're just showing a high
level depiction of what's happening here. So here, along this strip and along
this strip we have requests that are arriving. And here we're showing one of
these anomalies happening. That is, replica 2, or when we do replication for
predictability to use different terms, we use a duplicate. So replica versus
duplicate, partition versus duplicate.
So the key point here is the second request on traditional approaches would get
stuck behind this write buffer flush and wouldn't finish until here. And if
you have a barrier, meaning this third get can't happen until this responds and
it does some computation, well, you've delayed the entire execution as a
function of the write buffer. That's what we just saw on the previous graph.
>>: So I'm probably getting ahead of you, but you probably are going to say
this is on slide 22, but what about the sources that will happen? If you're
doing the same thing on everybody, then maybe you'll trigger the write buffer
flushes at the same time on everybody. The flush takes place at 3:00 p.m. How
do you make them get skewed?

>> Christopher Stewart: We'll get to some of the things that we do to deal
with that.
So the key idea here is instead, just using two nodes, Jeremy, right, to do
duplication -- replication for predictability here -- you would actually get this
response back sooner, and then you'd be able to start the third request earlier
and you'd get a speedup here. So this is the situation in which it helps.
>> Christopher Stewart: Well, if you use the same number of nodes, it keeps
power consumption the same. So here, you're just looking at two copies, two
copies of two nodes running the same copy of data. Presumably, maybe you could
argue if you're energy efficient and you're, you know, most nodes today,
though, you know, you have two nodes running. It's 2X the power.
>>: Let's be clear. So that means that your claim is that power is
essentially just a straight linear function of how many nodes you have in the
system.
>> Christopher Stewart: Rather than the amount of data hosted on each node.

>>: Or the number of requests it's processing. I guess that's what he's
getting at.

>> Christopher Stewart: Yeah.

>>: Right now, seems like what you're doing is actually giving more work to
the nodes.
>> Christopher Stewart: That's fair. You're right. In that sense. So we
haven't considered energy yet. If you are energy proportional as a function of
the request accesses or actually as a function of the amount of data stored on
the node, this approach, I could see some issues.
>>: [indiscernible] that's the key point.

>> Christopher Stewart: Yeah.
>>: I can't buy the claim that it's just faster in an absolute sense sending
to both. Because in a system that has a lot of competing requests, one of them,
all of them are going to be delayed by the fact that, for example, duplicate
two, which had the faster response, could have been processing somebody else's
request instead. So it would have finished faster, I mean, you can make an
argument that this particular request you're drawing finished faster than it
otherwise would have, but that was at the expense of some other request that
would have finished sooner and is now finishing much later because it got
pushed back in the queue.
>> Christopher Stewart: So there is -- yeah, so right now, I think we're
trying to motivate the high level intuition, but we do not contend that this
should replace all of the prior work that we've done on partitioning and
replication for throughput and these other techniques.

But maybe in some cases, this is something that we may want to insert and use
when it is appropriate for certain appropriate workloads. So the answer is, if
you agree with me that this is a situation where it can benefit, then I should
only have to convince you that I can figure out when this situation occurs and
that this situation occurs in nontrivial cases that are not made up.
>>: I think you've convinced me that the first request that comes in is
faster, but it seems like all the subsequent requests would be slower.
>>: In a system where you contend, for example, for seeks on a disk, on a system
where you have multiple [indiscernible], seems like this is going to increase
queueing time. Because in this system, you're actually sending more requests
to each disk. So it's having to move around, possibly wasting [indiscernible],
it seems like, because in this example, the replication for predictability, so --

>> Christopher Stewart: So not to get too far ahead, but yes. If you have
shared resources on the back end of this, if you have a shared disk, for
instance, on the back end of this that would create problems, that would
definitely be something that we would want to avoid. So there is a need to
avoid having -- I'm sorry. I put my hand at the wrong place. So if these two
duplicates had shared state here, right, you have to account for these
dependencies. And this is something that we've tried to account for.
>>: It's a shared state, right. You look at the workload, a duplicate of one
versus replica of one; replica of one would have to process two requests.
Duplicate of one would have to process three requests. There's more work
duplicate one had to do than replica one. So if you had a situation where
there was queueing, I think [indiscernible] then you're going to increase the
queueing delay in the second case. Even though the individual request may finish
sooner, you've increased the queueing because you've increased the load.
>> Christopher Stewart: I agree.

>>: A violation where you say, look, it really doesn't work.
>> Christopher Stewart: Yeah, I totally agree with you guys. In fact, I think
your intuition is exactly right. So a key plot that I will show hopefully in
the next ten minutes will be a function of utilization over time and how this
approach compares with partitioning and replication for through-put. We are
getting there.
>>: My intuition is you showed us where the tail is ten, a hundred times. If
it's more than two times, then you can afford to halve the number of machines
that you have available so you can -- you double the queueing [indiscernible],
it will move the mean over by a factor of two, but you're still better off
because the tail --

>> Christopher Stewart: And your intuition is exactly right as well.

>>: That's a problem.

>> Christopher Stewart: So go ahead.
>>: So data consistency, so if you're [indiscernible] if you really want to be
[indiscernible].
>> Christopher Stewart: Yeah, we'll get to this. Actually, I won't focus on
this too much in the talk. At a high level, what I can -- at a high level, what
I can say to you now is, so you saw this GridLAB-D is hosting shared variables,
right. So it requires consistent access. So what I'm going to tell you at a
high level right now, and sort of skip over a little bit in this talk just
because I'm short on time, is our key value store supports that. So we can do
strong consistency. We can do [indiscernible] and write consistency, or we can
do write order consistency.
And I'll describe some of the mechanisms for that support, but we could talk
about it offline to talk more about exactly how to do that.
>>: We'll talk a little [indiscernible].

>> Christopher Stewart: Now, we're not the first people to do that.
Actually, Microsoft, either in its hires or in its own papers, has been doing
this before us. So Mantri at OSDI, these guys were looking at doing this
MapReduce task scheduling, and they were running things twice and taking --
running two copies of the same task, taking the one that responded fastest,
under certain nested conditions.
>>: The first Google paper does that. The original MapReduce [indiscernible],
they just do it in a smarter way. The general idea of assigning two replicas
[indiscernible]. Paper by Google.
>> Christopher Stewart: Okay. I believe you. So and then Peter's work, SCADS
at FAST, is actually I think the most related. So they were looking at
specifically high SLA key value store and they were issuing their reads to two
copies and waiting for the fastest one to respond.
So the difference between our work is we're providing full support for this
approach. So we actually think this approach is really neat. Rather than just
trying this out on a couple of duplicates, sort of conservatively, saying hey,
we've got two copies and we only want to do it for reads, we want to support
reads and writes at scale. This is our goal. I'll show you some tests on up
to a 32-node cluster for a single partition.
So in order to provide that full support, we have to do a couple of things. We
have to manage these read and write accesses for consistency. Support scaling
out. And, importantly, for this talk, what I'll focus on is the modeling,
because this is the part that I really love. And, of course, there's also
adapting to workload change, failure recovery, some of these other traditional
aspects of managing a data store, which we deal with also, but I won't focus on
too much here.
So I want to give you a high level idea of the type of support that we have to
have and how we go about addressing that. So if we add this into the mix, our
whole point here is we want to add this into the mix. So we want to be able to
scale the number of replicas, but we don't want to say that this is all we want
to do, right. We want to also do partitioning when it's appropriate. We want
to do the others.
The problem is we've just had an SLA violation. How do we decide which of
these approaches is the appropriate approach to take to resolve this SLA
violation?
And so for that, what we're going to do is come up with a performance model
that is a little bit different from past models. Rather than predicting
something like average latency, right, what we want to predict here is the
service level, where the service level, if you recall from some previous
slides, is that percentage of requests that will complete within a latency
bound.
So this is an important point here. We're predicting percentages here, not
latency.
>>: You say it's an SLA violation. It isn't actually an SLA violation yet,
right, that you're expecting. But if it's 4% of requests that take over 15
milliseconds, you're allowed up to 5%. Isn't that the case?

>> Christopher Stewart: No, we're trying to hit SLAs that are multiple nines,
right. So no less than 99%, this is what we want to --

>>: You're not trying to predict the fact that we want to take action now so
we don't really mess up.

>> Christopher Stewart: Yeah, you really missed it. So here's sort of the
bird's eye view of our model. So we're taking as input some system metrics,
like arrival rate, service time distributions, and network latency. As well as
target SLA in terms of service level and latency bound. And I'm going to show
you a whole bunch of equations inside of here that really are just a function
of some first principles about the way replication for predictability should
work, based on the example that I've shown you.
At a high level, just the model by itself is going to output the expected
service level. So we're going to get a percentage out. You give us a latency
bound, we tell you what kind of an SLA you could actually support at that
latency bound.
The way you really want to use this is you want to iterate over this, right. So
you want to take as an output that service level, compare it to your target,
see, well, maybe I'm not meeting the target yet, so I should change my node
layout some kind of way, and iterate over here. So there's a replication
policy as well. And that replication policy is going to tell us should I use
partitioning, how much partitioning should I use, how much replication for
predictability. And our -- the jargon that we'll use to do that is say, for
instance, we'll have four one-duplicate partitions. So here, each shard is on
an individual node. So this is just traditional partitioning, and two
duplicate partitions means we have two nodes, each serving these shards. So
I'll sort of fall back to this in explaining a little later. But this is the
jargon.
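As a sketch (hypothetical names; the model's real interface isn't shown in the talk), the iteration amounts to searching layouts with the model:

```python
# Sketch: search (partitions x duplicates) layouts and keep the cheapest one
# the model predicts will meet the target service level at latency bound tau.
def choose_layout(predict_level, max_nodes, target, tau):
    # predict_level(partitions, duplicates, tau) -> expected service level;
    # stands in for Zoolander's model (assumed signature).
    best = None
    for parts in range(1, max_nodes + 1):
        for dups in range(1, max_nodes // parts + 1):
            nodes = parts * dups
            if predict_level(parts, dups, tau) >= target:
                if best is None or nodes < best[0]:
                    best = (nodes, parts, dups)
    return best  # e.g., (4, 4, 1) is "four one-duplicate partitions"
```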
>>: When you say service time distributions, you're talking about here being
observed [indiscernible] of --

>> Christopher Stewart: Yes.

>>: [indiscernible] that's I guess Jay's question.

>> Christopher Stewart: Yeah, they can be, yes. They can and they cannot be.
Our model considers both, but we focus on the case where they're the same.

>>: Anybody tried to replicate the shard by [indiscernible]? So you ask for
three copies, any --
>> Christopher Stewart: Yes, so there's this relationship to quorums, right.
Quorums are great if you want to do this on top of reads. But when you start
dealing with consistency in writes, you need some more support.
So the key idea here is this. So there would actually be two problems if we
did this fully. So if we gave you a model that accurately predicted what
happened when you added partitions or when you did replication for through-put
and you had this consistency, that's a problem in its own right. And people
have been struggling with how to model and understand the effects of hot spots.
Our approach sort of simplifies that problem by saying, look, existing systems
give you great support for partitioning and traditional replication, and we're
just going to assume that they're really good at it. So we're going to bias in
favor of these approaches and bias against this new approach that we're arguing
that you should revisit, replication for predictability. How do we do that?
We can just assume that accesses are going to be evenly distributed across
partitions.
So that means we're increasing the performance we would expect to see from
partitions or replication for throughput. And then we're going to be accurate
on replication for predictability. So that's what this model does. And the
basic intuition here is we look at a CDF distribution like this one, and from
the CDF distribution, we take our latency bound. We're looking at a bound of
about 15 milliseconds. We look at the intersection point with the line and
that tells us the probability of an SLA violation with no concurrency, no
queueing, not considering any of the dependencies that happen between network
latency and, also, possibly workload changes.
So here's a whole big chunk of formulas. I'll motivate really just the
intuition here. So the probability that an SLA violation is going to occur is
going to be one minus the result from the CDF. We can come up with a very
general formula if we don't want to assume that the CDFs are the same across
the duplicates. But if we do make that assumption, this actually just boils
down to good old calculus of geometric series, where the expected SLA is one
minus -- oh, typo. Raised to the N here. Gosh. It's bad when you have things
highlighted that it's a typo.
So one minus the probability of SLA violation raised to the N. So that means
in order to have a violation, all of our duplicates have to incur this
violation independently, together.
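Writing that out (with the slide's typo corrected), if $F(\tau)$ is a duplicate's latency CDF evaluated at the bound $\tau$, and the $N$ duplicates miss the bound independently:

$$P_{\text{viol}} = 1 - F(\tau), \qquad \text{expected service level} = 1 - \big(1 - F(\tau)\big)^{N}.$$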
So we do consider two key dependencies. So you've got network latency and
you've got queueing delay. Both of these things are going to make it harder to
meet a latency bound for all of your duplicates, because they all suffer from
these measures, these effects. So we use an M/G/1 queueing delay model, and we
just use the CDF, the cumulative distribution function, of your network latency.

And the way, the novel idea, is how do you incorporate these. So going back
here, tau is our latency bound. So our point here is that adding in queueing
theory, or adding in queueing delay, and adding in network delay affects your
tau for each node.

So now we have a model that is a function, that has this modified tau for each
node, a modified latency bound. And workload changes require some additional
support, like online monitoring, which we're also doing to deal with changes.
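One way to write the adjustment down (my reading of the description; the talk doesn't show the exact equations) is to take the mean M/G/1 waiting time and the mean network latency out of the latency budget before applying the CDF:

$$W = \frac{\lambda\,\mathbb{E}[S^2]}{2(1-\rho)}, \qquad \rho = \lambda\,\mathbb{E}[S], \qquad \tau' = \tau - W - \ell_{\text{net}}, \qquad \text{service level} \approx 1 - \big(1 - F(\tau')\big)^{N},$$

where $\lambda$ is the arrival rate, $S$ the service time, and $\ell_{\text{net}}$ the mean network latency.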
>>: So you're just using the average value of the average queueing delay,
then, or are you actually modeling the variability.
>> Christopher Stewart: So we are using the average queueing delay, the
average network latency. We could go more complex, but in a couple of slides,
I'm going to show you that we're pretty accurate so we decided to keep it as
simple as possible for now. There is, I think, another metrics paper here that
looks more into that variability, but we're not at that point yet.
So here's the key plot on the model's accuracy. So we issued nothing but
writes. Still not dealing with queueing delay at all. So bear with me on
that. One million accesses, and we add a duplicate after each one of these
rounds of tests. And recall, we're predicting percentages here. So absolute
error, as you've typically seen it, isn't really relevant here. So we look at
the absolute percentage point error between our predicted percentage and what
we actually observe.
And you basically see only one line here. Which is great. It shows that our
model is accurately predicting the service level as we scale up. And to make
that more clear, each point we're plotting the prediction error, which in terms
of absolute percentage points, is pretty darn low.
>>: So what is your [indiscernible], is that simulation or is this actual
hardware?
>> Christopher Stewart: This is on actual hardware. We just ran Zookeeper.
Zookeeper with one node apiece. So this is a very fast version of Zookeeper.
Yeah, this is on our --
>>: When you say service system, you're now in the replication version, or
there's partitions?
>> Christopher Stewart: These are just duplicates. These are -- so we're only
looking at our ability to predict the behavior of replication for
predictability right now. Because that's where we want to be accurate.
>>: So when you say [indiscernible].
>> Christopher Stewart: Exactly, exactly. And so the other take-away from
this is not just our accuracy, but how high we can get. So you can start to
see an idea of the level of 9s that you can achieve through this approach that
we've historically dismissed because it doesn't improve through-put, but it
does improve predictability a lot.
>>: These are duplicates, meaning they don't know about each other? Seems
like if you're doing writes then the read is going to be the worst case
[indiscernible] you have to wait for all eight to come back and see which one's
the most recent?
>>: [inaudible].

>> Christopher Stewart: So --

>>: Is there a story about that, or should I wait?
>> Christopher Stewart: Yeah, so the story about that is there's not a story
about that here. Maybe we'll talk about it offline. We deal with that.
You're right, reads can be slower. We evaluate that. It's not bad. It's a
function of your through-put slowing down, not a function of your latency
slowing down, which is what we really care about. But we'll talk about that.
>>: How many trials is this to get you to this point?

>> Christopher Stewart: One million. One million accesses apiece.

>>: [indiscernible].
>> Christopher Stewart: So once again, this is percentage point error. So
we're looking at -- and this is very different from latency. So you're looking
at a tail where a small percentage point error can correspond
actually to a lot of latency. So even a one percent error, for instance, at
the 99th percentile actually would be awful in terms of latency. A one percent
error would mean that you were mispredicting by almost 3X, right? The
difference between the 99th percentile and the 99.9th percentile. So take that
with a little bit of a grain of salt here. This is a metric on percentage
points, not on latency. So it's a little different.
>>: To restate Jay's astonishment, it seems like with real hardware, it's
unlikely that you could even run the benchmark twice. Forget prediction for a
second. It seems like repeatability wouldn't even reach that level, let alone
predictability. Especially if you have these asynchronous background tasks
like buffer flushes that are happening out of your control, or DNS runs. The
list of explanations that you gave for --

>> Christopher Stewart: Yeah, so --

>>: Over a million.
>> Christopher Stewart: You're looking over a large sample space is the whole
point here. So if I, for instance, tried to convince you -- actually, yeah, my
students have trouble with this all the time. They often want to run smaller
experiments. Hey, look, we want to understand 99.9%. Just run 3,000
experiments, right. But that only gives you three sample points. So now your
variability is very high if you're trying to understand this 0.009 level. But,
you know, you run a million, that variability is going to go toward a mean of
some sort, right. Just using the law of large numbers.
>>: You're still only getting a thousand points, right? If you're trying to
figure out your 99.99 percentile, then it's dictated by a thousand out of a
million requests.
>> Christopher Stewart: Five is enough for a T-test. A thousand is quite a
bit higher than that, right?
>>: I would have expected prediction errors of one in a million, given an N of a
thousand. Maybe I'm misunderstanding what the prediction error is. It's the
prediction error of what? Of the service level?
>> Christopher Stewart: Yeah, so it's, for instance, we predict 99.10%, and we
observe 99.102%. Something like that. So, well, in any case, we'll sort of
move on past this. I would like to talk a little bit about queueing and maybe
breeze through the actual results before taking questions.
So what we're looking at here is probably the important plot for those that
want to understand how you combine these two techniques together. So on the
X axis, we're looking at utilization. So we're increasing the utilization on
each duplicate up to one. And on the Y axis, we're looking at the expected
service level, according to our model. So we're able to understand the
performance of using just replication for predictability only, using the results
that I just showed you. So that's just if we have four duplicates. Their
utilization level will all be the same. And you just feed that into our model.
We can also understand the effect of using these traditional approaches only,
right, where we biased toward no consistency overhead, evenly distributed
partitions.
And then we compare what we're claiming, what we really want to sort of move
people toward doing, what we're doing in my group, and it's working pretty
well, is the mixed approach, where we're doing a little bit of both, depending
on the workload.
So first thing to note is that we really should reconsider replication for
predictability in our system design. Why? Because if you want to have a very
tight latency bound, here we're looking at 3.5 milliseconds, which is just a
little bit longer than our mean latency, I think about 25% longer, you're just,
you can achieve service levels that you just can't achieve otherwise.
So if we use four duplicates, we can get a service level that's pretty darn
high. But if we use just one duplicate, we're talking about orders of
magnitude lower.
I'm sorry. If we used only replication for predictability. So that is, if we
just distributed each access to just one node and didn't get any of the
variability gains that we get from replication for predictability.

If we -- the second take-away here is the comparison between these two plots.
So the problem with replication for predictability is when you have queueing.
As queueing increases, you get this huge drop-off, because once queueing gets
to be larger than the benefit that you're getting from cutting into this tail,
you know, the performance benefit goes straight down, because your node is
basically fully utilized and you're not getting any performance, any gains.
So if we did this at the three and a half millisecond level, even though we can
achieve very high SLAs under very light workload, as soon as the utilization
gets larger, we hit a cliff.
Now, that said, if you increase tau up to 15 milliseconds, you can go a lot
longer. So take-away two is we revisit this, because for realistic latency
bounds that we care about now, on the order of maybe tens of milliseconds, we
can still, even just using this approach, get up to a decent level of
utilization.
>>: So in practice, what was the utilization you see?
>> Christopher Stewart: So we're looking at 20 to 30 percent utilization for a
key value store that needs to give you very low response times.
>>: So like, that's what you'd see if you looked at Facebook, if you worked at
those workloads.

>> Christopher Stewart: I don't know Facebook's, but --

>>: For reasonable workload generation.
>> Christopher Stewart: Right, right. The issue here is we've long known that
queueing delay means slow response times. And while there was a lot of work,
maybe mid-2000s, that was looking at consolidation and how we can consolidate
things to increase the utilization, really, once you get around 50%
utilization, people get skittish about the effects of queueing delay. So I
think in practice, you're really looking somewhere around this range of, well,
I mean, some people run as low as 10%. For instance, our CIO's data center,
they just don't use -- they have no utilization at all.
Some more cost efficient approaches may push you up to 55%. And then the
final, most important take-away from this is hey look. You don't have to deal
with this drop-off at all if you take a more modest approach. So if you look
at the utilization that you're facing and decide to use a mixture of
replication for predictability and partitioning, you can get the same type of
curve that you would see here with just using partitioning only, but you can
just tick up the service level that you can meet by using the two in combo.
All right. Big picture slide. I won't take any more questions so I can finish in a couple more minutes. This is an early depiction of our implementation. The key idea to support writes and write order consistency is that we issue all writes through a message repeater, a multicast message repeater. Ideally, that would be in hardware. We did it in software and we still got pretty good performance results. We took a lot of pains to implement this in a way that is agnostic to the key value store on the back end. So we've implemented this with Zookeeper back ends and with Cassandra, and with different configurations of both.
We collect some system monitoring data, and we use a couple of tried and true systems implementation techniques. Because if we put this message repeater in the loop on the data path, we've created exactly what I was telling you we were worried about, which is something common to all duplicates, right, that could cause problems for all duplicates.
So to minimize its effect, our implementation uses a low-overhead callback, so we send writes directly back -- or, we send responses directly back to the client, rather than going RPC style back through the repeater.
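A minimal sketch of that write path follows; the send primitive is a hypothetical one-way message. The repeater stamps a sequence number and multicasts each write, and every duplicate acknowledges the client address carried in the message directly, keeping the repeater off the response path.

    def send(node, message):
        """Hypothetical one-way message to a duplicate node."""
        raise NotImplementedError

    class MulticastRepeater:
        def __init__(self, duplicates):
            self.duplicates = duplicates
            self.seq = 0  # a single sequencer fixes the write order

        def write(self, client_addr, key, value):
            self.seq += 1
            for node in self.duplicates:
                # Each duplicate applies the write in sequence order
                # and acks client_addr directly, rather than going
                # RPC style back through the repeater.
                send(node, ("write", self.seq, key, value, client_addr))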
General overview. Last but not least, I wanted to show you guys this stuff about Gridlab-D. So this is sort of looking at the layout of agents in Gridlab-D, issuing reads, doing some computation and finally, at the end of a time step, issuing a write back to the key value store. So you can see this sort of, I don't want to call it a barrier, but you can see the issue here: until these writes complete, you can't move forward to the next time step.

And James, along the lines of your question, this multi-phase scientific computation, it's been well studied, the fact that these scientific workloads that run a long time go in and out of different types of phases. And ours is no different.
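In code, the access pattern being described is roughly the loop below; the agent and store interfaces are hypothetical. Every agent reads, computes, and issues its write, and nobody enters time step t+1 until all of step t's writes complete, so a single straggling write paces the whole simulation -- which is why cutting the tail matters here.

    def run_simulation(agents, store, num_steps):
        for step in range(num_steps):
            pending = []
            for agent in agents:
                state = agent.compute(store.read(agent.key))
                pending.append(store.write_async(agent.key, state))
            # The implicit barrier: the next time step cannot start
            # until every write, including the slowest, has finished.
            for write in pending:
                write.wait()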
So on an open workload -- I'll just skip to the end-to-end tests. So here is the take-away plot. We're showing performance improvement as you would typically expect it, right: the relative performance improvement between two approaches, using just the partitioning approach versus using the combo approach of partitioning and replication for predictability, where the combo approach is chosen by our model.
At the far end here, we're looking at the total running time of a Gridlab-D simulation over the course of three days. And we're getting about a 20% improvement when we compare same-size storage systems. So using 8, 16 or 32 nodes, where we use only partitioning or the mixed approach, right, so same number of nodes, we're getting a consistent performance gain there. And we get that by reducing the 99th percentile significantly. So we've significantly reduced the straggler effect here.
And to be sure that this isn't just something that happens on my cluster and,
you know, nowhere else, we took this, we ran it on EC2 in order to scale up to
larger sizes, and as you can see, results are still consistent there.
Actually, EC2 works even better, because there you have a lot of variability.
I mean, we were able to, at one point, literally watch our CDF change without
touching anything.
So this is the end-to-end test.
Conclusion. Basically, if you walk away with anything here, it's these four points. We were looking at these emerging workloads; they need low latency on thousands of accesses. This is our target. This is what we see going forward into the future.
What we're suggesting is that the systems community may want to revisit
replication for predictability, this old approach where you send many requests
to -- you send the same request to many nodes and take the fastest response.
It actually makes sense now for these types of workloads.
And you know, Zoolander provides all this support that I had previously
mentioned, and you just saw our results on end-to-end scientific computing
workload.
And then these are all of my students who have contributed to some of these projects that I've talked about today in some way or another. Even though they are not here, in case they see these slides, I just wanted to make sure they knew I was thinking about them. That's it. So time for Q&A. Ran over a little bit.
>>: So the simulation, I just want to revisit the consistency thing a little bit. How many barriers were there in the run? Was there one global barrier that separates two phases, or were there different barriers throughout?
>> Christopher Stewart: So it was more like one global barrier. So all of your agents would do what they were going to do for one time step. Then --

>>: I guess my question is how many times -- in other words, how many times in the execution was there a barrier? There must be a lot of them.
>> Christopher Stewart: We did it at three-second granularity over three days, but that's not exact because you have intermediate steps. So if you want to do the math, it's however many three-second intervals fit in three days.
>>: 86,400. So it still seems like if there's this barrier where everybody has to write before anybody starts to read in the next time step, there would be this problem: if you're releasing the write barrier after the fastest write, then the readers would have to wait for the slowest read to make sure they got the fastest writer. Or conversely, you could wait for all writers to complete, and then you could wait for the fastest reader, but you can't do both.
>> Christopher Stewart: Right. So here's the way we handle that. If you need strong consistency, what you do is you send writes and reads through this multicast repeater that I showed you. So what that does is serialize everything. And that can get a throughput of --

>>: [inaudible].
>> Christopher Stewart: Yeah, you send everything through this multicast repeater.

>>: [inaudible].
>> Christopher Stewart: Yeah, so then you'd have to use multiple. But this is standard practice for achieving this type of strong consistency. Zookeeper does the same thing. Zookeeper, LazyBase was just a [indiscernible] paper this year. So if you want strong consistency, you have to pay a performance penalty for it.
>>: Zookeeper is not [indiscernible].

>> Christopher Stewart: Right. So --

>>: [indiscernible].
>> Christopher Stewart: Yeah, you can go with quorums. And what quorums can do is allow you to sort of mask the effect on reads while lowering your throughput on writes. Or you can take this approach where you just pay the penalty on both. And so that's sort of our trade-off.
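For contrast, here is a sketch of the quorum alternative just mentioned: textbook R/W quorums with R + W > N, not Zoolander's mechanism. Reads stay fast by waiting on only R replies, at the price of writes waiting on W acknowledgments; the messaging helpers are hypothetical.

    def send_write(replica, key, value):
        """Hypothetical async write RPC; returns a handle."""
        raise NotImplementedError

    def send_read(replica, key):
        """Hypothetical async read RPC; returns a handle."""
        raise NotImplementedError

    def wait_for(handles, count):
        """Hypothetical: block until `count` handles finish, return replies."""
        raise NotImplementedError

    def quorum_write(replicas, key, value, w):
        # Fire the write at all N duplicates, but declare it complete
        # once W of them acknowledge.
        wait_for([send_write(r, key, value) for r in replicas], count=w)

    def quorum_read(replicas, key, r):
        # With R + W > N, at least one of the R replies comes from a
        # node that saw the latest completed write.
        replies = wait_for([send_read(rep, key) for rep in replicas], count=r)
        return max(replies, key=lambda reply: reply.version).value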
If you want read-my-own-write consistency, we can do that a little bit more efficiently. So when we respond back with the fastest completing node on the write, we can also respond back with an address, which allows you to issue your read directly back to that node again. And if that node is, for instance, a Zookeeper group that is 3X replicated, then you can scale that read performance by adding nodes. Does that make sense?
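Client-side, that read-my-own-write path looks roughly like this sketch; repeater_write and kv_get are hypothetical stand-ins. The write acknowledgment names the fastest duplicate to finish, and the client routes its follow-up read straight to that node, which already holds the new value.

    def repeater_write(key, value):
        """Hypothetical blocking write through the repeater; the ack
        names the fastest duplicate to complete the write."""
        raise NotImplementedError

    def kv_get(node, key):
        """Hypothetical read sent directly to one node or node group."""
        raise NotImplementedError

    class Client:
        def __init__(self, any_node):
            self.any_node = any_node   # fallback target for plain reads
            self.last_writer = {}      # key -> node that acked our write first

        def write(self, key, value):
            ack = repeater_write(key, value)
            self.last_writer[key] = ack.node
            return ack

        def read(self, key):
            # Read-my-own-write: prefer the node named in our last ack;
            # if that node is itself a 3x-replicated Zookeeper group,
            # read capacity scales by adding members to the group.
            return kv_get(self.last_writer.get(key, self.any_node), key)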
>>: Are there background tasks on the centralized -- I mean, seems like now --
>> Christopher Stewart: On the software repeater? Yeah, yeah. That's exactly why, earlier, I mentioned that we have to take approaches to implement that as well as we possibly can. You're right. Some things are just going to be caused by this message repeater in between.
>>: [indiscernible], right?
>> Christopher Stewart: Yeah, yeah. So it's not that hard to implement it very well, I think. But, you know, I implement systems. And then finally, if you want write order consistency, then if you can deal with some inconsistency in your reads, which, you know, a lot of these other systems support, you just get the write acknowledgment back from the fastest node, and then you issue your reads to whomever, and you're guaranteed that everybody will eventually give you the same write.
So our raw throughput numbers are comparable to existing systems, right. So we're not making any fancy tradeoffs there.

>>: In general [indiscernible] you have a scaled down [indiscernible].
>> Christopher Stewart: Well, that's only -- but you're talking about a very specific type of workload, right. So you're really talking about, hey, if I only want sequential strong consistency, then yeah, you have the single bottleneck that is a problem, right. But if you don't need that, we can still scale your reads, right, if you're willing to relax that. I think, you know, I don't want to get into trouble by saying anything too bold, but when people scale out nodes to support sequential consistency, it's often either to improve read throughput or for availability. You very rarely -- by very rarely, I mean I can't think of anybody off the top of my head that gets good high write performance while supporting strong consistency. So that's not a fair comparison there.
>> John Douceur: We're running late so I don't want to go too long. Thank our speaker again.
>> Christopher Stewart: Thanks for hosting me. It was a pleasure.