>> Jeremy Elson: Hello everyone. It's my pleasure to introduce Yang Wang. He’s here from UT
Austin. He's worked on a variety of distributed protocols and storage systems and also, at least
according to his webpage, enjoys basketball and traveling. And anyway, so I'll let him
get started. He's talking about separating data from metadata for robustness and scalability.
Yang, please.
>> Yang Wang: Thank you, Jeremy. Good morning everyone. My name is Yang Wang. Today I
will talk about how to build a storage system that can be robust against different kinds of
errors and be scalable to [inaudible]. And I will present how to achieve this by using the
key idea of separating data from metadata. The fundamental goal of a storage system is not to
lose data, and in practice this is hard to achieve because the storage components can
fail in different ways. For example, they can crash, they can lose power, and what is worse, they can
fail in some weird ways. For example, a disk may experience a bit flip, it may lose a write
because of a [inaudible], it may even misdirect a write to a wrong location. And
errors in all the other components can also be propagated to disks and permanently damage
our data.
This is a long-standing problem that other people have been studying for decades, and people
have developed different kinds of techniques to tolerate different kinds of errors. And as this
graph shows, it is typically the case that the more kinds of errors we want to tolerate the
more cost we are going to pay, and therefore people have to make a painful tradeoff between
robustness and efficiency. And nowadays this problem is made worse by the trend that the
scale of storage systems is growing rapidly, mainly because the quantity of data is growing
almost exponentially. As a result, many big companies have developed their own large-scale storage systems to store the growing amount of data; and nowadays we are talking about
at least thousands of servers and tens of petabytes of data, and these two numbers are growing
really fast because the quantity of data is still growing.
In such a scalable system the trade-off between robustness and efficiency becomes even more
painful because the overhead of those strong protection techniques will be magnified by the
scale of the system. For example, in the old days when we had only ten machines the question
was like, do we want to add five more machines to make my system more robust? And
nowadays for [inaudible] Google the question is like, we already have a million machines, do
we want to add 500 thousand more machines to make my system more robust? So of course
they have to make a careful balance between the cost to tolerate different kinds of errors and
the price they are going to pay if certain errors happen and their system is not designed to
tolerate them.
>>: You’re talking about increasing robustness by adding machines?
>> Yang Wang: I'm sorry?
>>: You’re talking about the only way to increase robustness is by adding machines instead of-
>> Yang Wang: No, no, no. I’m just [inaudible] as an example.
>>: Okay.
>> Yang Wang: Okay. And nowadays the [inaudible] is around here. Note that this point is not
necessarily the point where those two lines meet each other. Actually this is the point where
the sum of the two lines is minimum, and theoretically there can be multiple such
[inaudible] points, while in practice people find that crashes and single bit flips are
pretty common so we have to tolerate them. Whether it is necessary to tolerate network
partition is controversial. Google has decided to pay the cost to tolerate it, but many other
companies have decided not to do so, mainly for cost concerns.
Of course, as a result of such a balance, if an error happens on the right side of the balance point
then the system is not designed to tolerate it. And some kinds of errors do happen
in practice, causing these systems to either lose data or become unavailable. That kind of
failure will not only hit the income of the companies but will also hurt their reputation. And in
this talk I will try to answer this question: can we achieve both robustness and efficiency for
large-scale storage systems?
This is the graph I have already shown. Our approach is to significantly reduce the cost to
tolerate different kinds of errors so that we can move the balance point to the right.
Actually this is a big and hard problem that I do not plan to fully address in this talk, but by
the end of the talk I hope to convince you that it's at least possible for certain storage systems.
So I have devoted almost all of my PhD career to investigating this problem. And we found that there's
one key idea that always guides our way, which is separating data from metadata. If we look
into the [inaudible] of data, they are actually not equally important. Some parts are used to
describe the relationship of other parts [inaudible], and we call such data metadata in
storage systems, and it is usually considered more important to the [inaudible] of the
system. And our finding is that by applying those strong but potentially expensive techniques
to only the metadata and applying minimum protection to the data, we can actually achieve
stronger guarantees for both metadata and data.
So in the first half of my talk I will present how to use this idea to build a scalable and
robust storage system. And as I said, it's a hard problem, so let me start with the simple
part: how can we replicate data in a small-scale system? On one hand, restricting our
attention to a small-scale system significantly simplifies the problem; and on the other hand, a
[inaudible] replication protocol is a fundamental building block in almost any larger-scale
system. And here I will show that it is possible to tolerate both crashes and timing errors with
a significantly lower cost than previous solutions. Then I will move to the large-scale system,
and here you will see that replication is still useful but it's definitely not enough. We
also need to address challenges introduced by the scale of the system. And here I will show
that it is possible to improve the robustness of a large-scale storage system to tolerate almost
arbitrary errors while our prototype can still provide comparable throughput and
scalability.
So now let me start with replication. As I said, replication is used in almost any large-scale
storage system to provide [inaudible]. The major question here is how much cost we are going to
pay, and actually the answer depends on what kind of errors we want to tolerate. This is
one of the typical examples where people experience the painful tradeoff between robustness
and efficiency.
On the left side people have developed those primary backup systems, which can tolerate only
crash failures while they are really [inaudible]. They only need f plus 1 replicas to tolerate f
failures, which is the minimum you can expect. But of course crashes are not the only errors
that can happen in practice. People find that timing errors, which can be caused by network
partitions or slow machines, can also happen in these data centers, and therefore people
have developed the Paxos protocol to tolerate such kinds of errors, and it is more expensive: it
requires 2f plus 1 replicas to tolerate f failures. And on the right side people have also
developed Byzantine fault tolerance techniques, which can tolerate arbitrary errors with an
even higher cost. As I said, Google is already using Paxos in their systems, but many other
companies are still using primary backup, mainly for cost concerns, and in this talk I will present
a protocol called Gnothi which targets both f plus 1 replication and the ability to tolerate
timing errors.
>>: [inaudible] using Paxos, but in the traditional GFS papers they were using Paxos through
exactly the method you're talking about, which is separating metadata from data, right?
[inaudible].
>> Yang Wang: [inaudible] Paxos?
>>: No. They were separating metadata from data. They were using Paxos to manage the
metadata of the storage system and keeping data [inaudible] replicated on machines.
>> Yang Wang: I'm sorry, which work are you talking about?
>>: The original Google file system paper from a decade ago.
>> Yang Wang: The Google file system paper-
>>: The classic design for large-scale systems uses Paxos to manage your metadata because of the cost.
And then you keep all your data on the machines themselves.
>> Yang Wang: But the problem with the Google file system is that their replication protocol itself
does not provide the same guarantees as Paxos. They use Paxos for metadata, they use
primary backup for data, okay? So they are kind of saying we provide stronger guarantees for
metadata than the guarantees we can provide for data. And in this [inaudible] I will show
that it is [inaudible] to, by only replicating metadata with Paxos, get strong guarantees
for both.
>>: So when you say timing, do you have some example of what kinds of things you call a timing
error?
>> Yang Wang: Okay. We'll talk about that later. But it's kind of, for example, if we use, let's say,
a timeout to detect whether a node has failed or not, sometimes a network partition can cause
your timeout to be inaccurate in such a way that even though the remote node is still alive, if
it's in another partition then you cannot receive messages from that node and you misclassify
it as a failed one.
Okay. So in this talk I will present a protocol called Gnothi which targets f plus 1 replication
and the ability to tolerate timing errors. If you are familiar with distributed systems you may
wonder whether this is even possible. And actually this is proven to be impossible in
general, but Gnothi comes close to [inaudible] by restricting its application to a specific storage
system, a block store. A block store functions like a remote disk. It provides a number of fixed-size
blocks to different users, and a user can read or write a single block.
>>: So you're ordering constraints between both parts?
>> Yang Wang: Okay. We'll talk about that later.
>>: Well, I was wondering if that was [inaudible] assumption you were making.
>> Yang Wang: No, we aren’t talking about that. There are ordering guarantees.
>>: [inaudible].
>> Yang Wang: Despite its simplicity, a block store is still widely used in practice, as
demonstrated by the success of Amazon's Elastic Block Store, EBS. And in a few slides you will
see how Gnothi benefits from this simple [inaudible]. The key idea which differentiates Gnothi
from previous systems is that in Gnothi we don't insist that all nodes must have identical and
complete state. And actually for a block store this is fine as long as the node knows which of its
blocks are fresh and which of them are stale. That's actually how we got the name of the
system. In Greek, Gnothi Seauton means know yourself, and in our system, as long as a
node knows itself it can process requests correctly.
But before I go into the details of our design, let's first see why this tradeoff is challenging. The basic
idea of replication is pretty straightforward. To ensure data safety despite f failures we need to
store data at at least f plus 1 nodes. The major challenge here is how to coordinate different
servers or replicas so that requests are executed in the same order on different nodes. Let me
show a simple example to explain why this is necessary. When two clients are sending two
requests, A equals one and A equals two, to two different replicas of a service, it will be bad if
they execute them in different orders, because they will reach different states; and in this case
if a client tries to read the data in the future it will get inconsistent replies. That's why we have
to ensure that requests are executed in the same order on different replicas.
Now let's see how different systems try to achieve that. In primary backup systems the basic
idea is that all the clients send their requests to a single node, which we call the primary; the
primary will assign an order to the different requests and [inaudible] such ordering information together
with the requests to the other nodes, which we call backups. Of course here the question is what can
we do if the primary is not responding? Primary backup systems rely on a synchrony assumption
that if a node does not respond in time then it must have failed. So in this case the
system will promote a backup node into a new primary and the clients will send their new
requests to the new primary. However, in order for this synchrony assumption to be true we
have to set the timeout in a pretty conservative way. Otherwise, if a correct primary is
misclassified as a failed one and a backup is promoted, then we will have two primaries in
the system and they may assign different orders to a request.
And a conservative timeout will hurt the availability of the system, because the system is
unavailable after the primary fails but before the backup is promoted, before the timeout is
triggered. To solve this problem people have developed those Paxos-like protocols. Their basic
idea is that we cannot rely on such an assumption; even a correct node may not respond in
time because of timing errors. Therefore, instead of relying on a single primary to order requests,
in Paxos the basic idea is that any such ordering must be agreed on by a majority of the nodes.
As long as this is true it is impossible for different replicas to execute requests in different
orders. And note that the system should still be able to make progress despite f failures. This means
that even considering f failures our system should still have a sufficient number of nodes to
form a majority quorum. And by some simple math we can get that
we need at least 2f plus 1 nodes, which is of course more expensive than the primary backup
approach.
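The counting behind that 2f plus 1 can be written out in one line; a standard sketch of the majority-quorum reasoning (not taken from the slides):

    n - f \;\ge\; \left\lfloor \tfrac{n}{2} \right\rfloor + 1 \quad\Longrightarrow\quad n \;\ge\; 2f + 1

Liveness requires the n minus f surviving nodes to still form a majority, and because any two majorities of n nodes intersect, no two conflicting orderings can both be agreed on.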
So can we tolerate timing errors with only f plus 1 replication? Actually this is also a
long-standing problem that people have investigated a lot. For example, the basic
idea of previous works like Cheap Paxos and ZZ is that we actually only need agreement from f
plus 1 nodes; we need the extra f replicas to cover possible failures and timing errors,
like paying insurance for future disasters. And of course if the disaster does not happen our
cost is somewhat wasted.
So can we only pay such additional cost when failures happen? I guess everybody prefers this
idea instead of paying insurance every month. So following this idea it is pretty natural to come
up with the following approach. Instead of sending requests to all servers, the clients can
choose to send a request to f plus one servers first. If they can reach agreement then it's fine.
If they can't reach agreement, because of either failures or some other reasons, then the system
can activate the backup ones. The benefit of this approach is that in the failure-free case the
replication cost is low and the system can use the backup ones for other purposes. But the
problem with this approach is that both Cheap Paxos and ZZ are designed for general-purpose
replication, so they want to ensure that all nodes are identical. Therefore, before the system
can use the new node as a working node they have to copy all the state from one existing
node to the new node. And for a storage system, which can contain at least a terabyte of data,
the data copy can take hours and the system is unavailable during this period, which is totally
undesirable.
>>: [inaudible]?
>> Yang Wang: Because in this case, assuming this one is not responding, we need agreement
from a majority but this one is not activated.
>>: Why not just keep more replicas? Why not increase f to plan for the fact that you're going
to have no maintenance and down nodes and you still have the majority?
>> Yang Wang: Increasing f, of course, increases your cost in the failure-free case.
>>: Sure. I'm trying to show you how to reduce the cost of f in every case.
>>: Yes, he's doing that. This isn't planned downtime; this is something failing.
>>: You can't have unavailability on failure-
>>: And he wants to make that cheap.
>> Yang Wang: So why do we need to ensure that all nodes are identical? This is because
in general, before a replica executes a new request it has to ensure that all the previous
requests have already been executed. Otherwise the execution might be wrong. But as I
mentioned earlier, in a block store this is not necessary as long as a node knows itself. And since
a block store is designed to process only writes and reads, let's see how an incomplete node can
process writes and reads in Gnothi. Actually it's pretty straightforward to process a write,
because the write request will overwrite the block anyway, no matter whether it is fresh or
stale. And a read can also be processed correctly as long as the node knows whether the
targeted block is fresh or stale. Let me show what I mean. For example, when a client is trying
to read a block from an incomplete node, as long as the node knows that it does not have the
current version of the data, it can tell the client so that the client can retry from another node.
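A minimal sketch of that behavior in Go, assuming each replica keeps a per-block freshness flag; the structure and names are illustrative, not Gnothi's actual code:

    package gnothi

    import "errors"

    // ErrStale tells the client that this replica does not hold the current
    // version of the block, so the client should retry at another replica.
    var ErrStale = errors.New("block is stale here; retry at another replica")

    // Replica is an incomplete node: it may be missing the latest data for
    // some blocks, but it always knows which of its blocks are fresh.
    type Replica struct {
        blocks map[uint64][]byte // block contents this replica holds
        fresh  map[uint64]bool   // true if the local copy is the current version
    }

    func NewReplica() *Replica {
        return &Replica{
            blocks: make(map[uint64][]byte),
            fresh:  make(map[uint64]bool),
        }
    }

    // Write can always be applied locally: it overwrites the block no matter
    // whether the old copy was fresh or stale.
    func (r *Replica) Write(id uint64, data []byte) {
        r.blocks[id] = data
        r.fresh[id] = true
    }

    // Read answers only if the replica knows its copy is current; otherwise
    // it refuses and the client retries elsewhere.
    func (r *Replica) Read(id uint64) ([]byte, error) {
        if !r.fresh[id] {
            return nil, ErrStale
        }
        return r.blocks[id], nil
    }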
That's the basic idea of Gnothi, but then the question is how can we let each node know itself?
And that's why we apply the idea of separating data from metadata.
The client will separate the write request into two parts, the data part and a small metadata
part which is used to identify the request. Then the client will send the data to f plus 1 nodes
first, and then it will send the metadata to all nodes through a Paxos-like protocol to ensure
that all nodes know exactly which requests the system has processed. In this example, nodes 1
and 2 will know that they have the current version of the data, and node 3 knows that it does not
have the current version of the data but that the data must be stored somewhere else. The benefit
of this protocol is that since the size of the metadata is very small compared to the size of the
data, despite the fact that it is fully replicated to all nodes, our replication cost is very close to f
plus 1 in the failure-free case.
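A sketch of that write path, again with illustrative names: the bulk data goes to only f plus 1 replicas, while the small metadata record goes to every replica. The StoreMeta call below simply stands in for the Paxos-like agreement step and is not an implementation of it.

    package gnothi

    // WriteMeta is the small record that identifies a write.
    type WriteMeta struct {
        BlockID uint64 // which block the write targets
        Version uint64 // identifies this particular write
    }

    // ReplicaConn is whatever RPC handle the client holds for one replica.
    type ReplicaConn interface {
        StoreData(meta WriteMeta, data []byte) error // bulk data plus metadata
        StoreMeta(meta WriteMeta) error              // metadata only
    }

    type Client struct {
        replicas    []ReplicaConn // all 2f+1 replicas
        f           int
        nextVersion uint64
    }

    // Write ships the data to f+1 replicas and the metadata to all replicas,
    // so even a replica without the data learns that a newer version exists
    // somewhere else (and can answer reads with "retry elsewhere").
    func (c *Client) Write(blockID uint64, data []byte) error {
        c.nextVersion++
        meta := WriteMeta{BlockID: blockID, Version: c.nextVersion}

        // Bulk data to only f+1 replicas (here simply the first f+1).
        for _, r := range c.replicas[:c.f+1] {
            if err := r.StoreData(meta, data); err != nil {
                return err
            }
        }
        // Small metadata record to every replica; in the real protocol this
        // goes through a Paxos-like log rather than one RPC per replica.
        for _, r := range c.replicas {
            if err := r.StoreMeta(meta); err != nil {
                return err
            }
        }
        return nil
    }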
And to actually achieve higher throughput with this protocol we use a pretty standard way
to perform load balancing. We just divide the virtual disk space into multiple slices and we allocate
those slices to different nodes in [inaudible] order. By using this approach Gnothi can achieve at
least 50 percent higher throughput compared to the full replication approach. Now
let's evaluate our idea. So here we hope to answer two questions. First, what is the
performance of Gnothi? To answer that question we compare the throughput of Gnothi to a
state-of-the-art Paxos-based block store called Gaios, which performs full replication of data.
And the second question is, what is the availability of Gnothi? To answer that question we
compare Gnothi to Gaios and Cheap Paxos; and as a reminder, Cheap Paxos is the one that
performs partial replication during the failure-free case, activates the backup ones
during failures, and copies all the data to the backup ones.
To answer the first question we measured the throughput of Gnothi and Gaios on different
workloads, and here I show the 4-K random write workload. The Y axis is the
throughput, which is measured in requests per second. Gnothi can achieve at least 50 percent
higher throughput compared to Gaios, because in Gnothi data is only replicated to f plus 1
nodes while Gaios' data is replicated to 2f plus 1 nodes in the failure-free case. To answer the
second question we compared Gnothi to Gaios and Cheap Paxos. So in this graph the X axis
is the time, which is measured in seconds, and the Y axis is the throughput, which is measured in
megabytes per second. The top orange line is Gnothi, the middle red line is Gaios, and the
bottom blue line is Cheap Paxos. To measure the availability we actually kill a server in
all three systems at about time 200. First we can see that both Gnothi and Gaios don't need to
block, because they still have enough replicas to perform agreement. But on the other hand,
Cheap Paxos needs to block for the data copy, and in our experiment it takes about
half an hour to copy 100 gigabytes of data. And during this period the system is not available.
And then we start a new server with a blank state at time 500, and the new
server of course needs to copy all the data from the other nodes. And we can see that during
this period Gnothi can achieve about 100 percent to 200 percent more write throughput
compared to Gaios, while the two systems can still complete the recovery at the same time. The
reason for this is that in Gnothi one node only needs to store two thirds of the data, and of
course during recovery it only needs to fetch two thirds of the data, which means that the system
can allocate more resources to process new requests.
So far I have focused on a single replicated group-
>>: Can you go back to the [inaudible] previously? The last slide. Yeah, this one. So do you
have a primary backup scheme compared here as well? [inaudible]?
>> Yang Wang: We did not do an experiment with primary backup, but their performance should be,
I would say it's not a fair comparison. For example, for Gaios and Gnothi it was
[inaudible] where both needed three nodes. For primary backup you only need two
nodes.
>>: Right.
>> Yang Wang: In that case the throughput should be close to the throughput of a single disk,
assuming all our machines are equipped with a single disk. But I would not say it's a fair
comparison, because you would be using a two-node experiment to compare to a three-node
experiment.
>>: But it would be different guarantees, right?
>> Yang Wang: Yes. Different guarantees at different costs. So that's why we kind of use a
single disk as like a baseline.
>>: You said it’s different guarantees and different costs, but I'd like you to defend a stronger
claim which is the costs are, Gnothi dominates, right?
>> Yang Wang: Yes.
>>: Because other than the fact that you have to allocate the minimum cost of entry with your
machines, after that it has better throughput and better availability. Like there's no trade-off
here.
>> Yang Wang: So compared to primary backup it does not have better throughput. It has the same
throughput with better availability, I would say, compared to primary backup.
>>: It looks like it has guaranteed, has better throughput.
>> Yang Wang: But that's because it has more machines.
>>: Oh, sure. Okay. I guess you’re saying you divide by the number of machines.
>> Yang Wang: Yeah, yeah, yeah. The per machine throughput is, yeah.
>>: Is this why, why is the throughput going up when f equals two? [inaudible] per machine, and
so is it load-balancing or is that just like experimental noise or-
>> Yang Wang: No, it should be higher, because now you have five machines and data is only
replicated to three of them. They should get like 1.66 times higher throughput.
>>: Okay. So is it the same issue which is that this isn’t normalized for the number of machines
you have?
>> Yang Wang: Yes. I would say that.
>>: So how close is that to the 1.66 you'd expect, because you're adding 1.6 times [inaudible]
machines? It looks like five-
>> Yang Wang: It's not. I think it's more complicated than 1.66, because in Paxos we still need to
perform full replication for metadata. That kind of thing is more expensive when you have more
replicas, so your metadata replication part is heavier; that's why our average throughput is slightly
smaller than 1.66.
>>: If you take the single disk number, 390, and you multiply by 1.5 you get 585, which is a
little higher than that. So something's [inaudible].
>> Yang Wang: For the random [inaudible], actually something I have not
mentioned here is that since in Gnothi one node only needs to store two thirds of the data,
the average [inaudible] time also becomes slightly smaller. So that's why
this part is actually slightly higher, like 15 percent higher.
>>: 585 is a little higher. The other is supposed to be 647 which does meet the prediction.
>> Yang Wang: So, so far I have focused on a single replicated group. But as I mentioned earlier,
a large-scale system is much more complex, in at least two ways. First, a single replicated
group is not enough to hold all the data, and that's why most companies choose to shard
their data across multiple [inaudible] replicated groups. For example, Google can choose to
store the Gmail data of users 1, 2, 3 on shard one and 4, 5, 6 on shard two and so on.
The second complexity comes from the fact that there are usually multiple layers in the system.
For example, if we write something in Gmail we don't talk to Google's file system directly. We
need to talk to a web server first. The web server may need to talk to Bigtable, which
may finally talk to Google's file system. In such a complex system, ensuring robustness and
efficiency inside a single replicated group is necessary but it's definitely not enough, because we
also need to provide guarantees across multiple components. And in this talk I will talk about
how to address two problems: first, how to provide ordering guarantees across different
shards, and second, how to provide end-to-end guarantees across different layers.
So let me go to the first one. You may wonder why I even need to provide ordering
guarantees. This is because a block store requires a specific semantic called the barrier
semantic, which means that the user of a block store, which is usually a file system here, can
specify some of its requests as barriers, and all of the requests before a barrier must be
executed before the barrier is executed, and all requests after the barrier must be executed
afterwards. Such a barrier semantic is crucial to the correctness of the file system.
>>: When you say all of the requests you mean all the requests in a given client stream? There
is not a barrier globally across all-
>> Yang Wang: Right, right, right, right.
>>: For a given client.
>> Yang Wang: For a given client. Actually a block store is usually accessed by just a single
client. It functions like a disk, usually. You're right. And a violation of the barrier semantic can
cause the file system to lose all its data in the worst case. Let's see how it can go wrong. Let
me show a simple example where the client is trying to send two requests to two different
shards and request two is attached with a barrier. It is possible that a client failure can
cause request one to be lost while request two is still received and committed by the second
shard.
>>: Can you explain the sharding scheme?
>> Yang Wang: The sharding, I would say, is somewhat similar to what the Google file system did.
It will use one replicated group to store, maybe based on a hash, maybe based on
the file name, some part of your data on one group.
>>: That's not a block store one client. That's a shared file store.
>> Yang Wang: Yes.
>>: If you had single client, single disk semantics, why not shard at the client? What is this-
>> Yang Wang: I would say-
>>: Can you relate, is this related, I'm confused by how the Google file system sharding example
relates to a single block store.
>> Yang Wang: So our model is that for each block store there's a single user, but our system
should provide a large number of virtual disks to a large number of users. That is the usage
model. And then-
>>: So the client has multiple disks that they're writing to?
>> Yang Wang: Yes. Sometimes a client wants to have higher throughput than a single disk. And
also this approach will allow you to get better load-balancing.
>>: So like the client has virtual [inaudible] over virtual blocks?
>> Yang Wang: Yes. Not called a [inaudible] but the idea is similar; it's not a [inaudible]. The
replication is-
>>: It's striped?
>> Yang Wang: It's called striping, or RAID zero. Of course, there's a more naive solution to
this problem, which is that the client can choose to not send request two until request one is
completed. But this approach will just lose all the parallelism the sharding approach is trying
to achieve, and of course it will also hurt the scalability of the system. Our solution to
this problem is based on a key idea: such kind of out-of-order write is actually fine as
long as the client still sees the data in the correct order. This is like saying, if there's no
evidence there's no crime.
Let's see what I mean with a simple example. Assume three different shards are receiving
three different requests, A, B, and C, and the third is attached with a barrier, and the second one
is lost somehow. We are saying that this is actually fine as long as the client doesn't see the last
update, even if it has made it to disk. In this example it's fine to see the new version of A and the old
versions of B and C. And based on this idea we have developed a protocol called pipelined
commit. Its basic idea is that different shards should log data in parallel, but they should
coordinate together to make sure that the data becomes visible to the clients sequentially.
>>: Does this only apply when you’re updating existing data or does it apply for new data as
well? If you think about a file system term-
>> Yang Wang: I'm sorry?
>>: Does it apply only to overwrites of existing data or does it also apply to new data that's
being written? Like, if you think about striping ext3 and you have a file system journal and a set
of transactions that you could log, does this problem apply there as
well?
>> Yang Wang: So first, our system is designed for a block store, so it's always updates. There's
no new data in a block store; it provides a fixed number of blocks to users. A file system
is actually built on such a block store, and that's why the requests actually need ordering
guarantees; but if the block store provides such guarantees to the file system then the file
system should not have any problems.
Then let's see how it works exactly. I will use the same example as shown on the previous
slide. Here each server is actually a replicated shard, but for simplicity I will just
show a single node for each shard. So instead of just sending those requests to the servers, the client will also
attach a small piece of metadata to each request which identifies the location of the next request. Then
it will send such data plus metadata in parallel to the different shards, and the different shards will also
log them to disk, again in parallel. But at this point we will not make the newer version of the
data visible to the clients. Instead, a shard needs to wait for a notification from the previous server
saying that the previous data has already been made visible. And in this case it can make the
newer version of A visible to the clients. Then it will also send a notification to the next
server, and so on. The benefit of this protocol is that the first phase, the durability phase, can be
executed in parallel, and that's actually where the large block of data is transferred over the
network and also written to disk. Therefore, executing the first phase in parallel allows us
to achieve most of the scalability of the sharding approach. And executing the second
phase, the visibility phase, sequentially allows us to achieve ordering guarantees without
significantly hurting the scalability of the system.
Let me go to the second problem, how to provide robustness across different layers. The major
challenge here is that despite the fact that the storage layers are usually well protected, errors,
and especially corruptions, in the middle layers can still be propagated to either the end users or
the storage layers. So of course we also need to protect those layers, but this may immediately
remind some of you of those BFT techniques, which are usually perceived as too
expensive in practice.
So here is what we did in Salus. We're asking ourselves again, can we achieve the ability
to tolerate almost arbitrary errors with a significantly lower cost? And our key idea is based on
the idea of decoupling safety and liveness. Actually what we found is that for safety we only
need f plus 1 replicas. We can require that every request must be agreed on by all of the f plus 1
replicas, so that we know the agreed value is vouched for by at least one correct node. And of course
this is unanimous consent in our system, but the problem with unanimous consent is
that it does not provide liveness at all, since a single failure causes the system to stop
making progress. That is why those BFT approaches need more replicas in general. But as we
already saw in Gnothi, a general solution may not be the best solution for a storage
system.
So here, do we have another way to restore liveness without significantly increasing
the replication cost? Actually we found that the answer is yes again, and the key observation which
allows us to achieve that is that those middle layers usually don't have any persistent state;
they usually store their persistent state on the storage layer. Now let's see how to
leverage this observation to restore liveness. The major challenge here is that with f plus 1
replication, to tolerate f arbitrary failures it is impossible to know which node is faulty. To
address this challenge we take a drastic approach: we just replace all the middle layer nodes
with a new set of nodes, and then we leverage the storage nodes to allow the new set of
middle layer nodes to agree on what the correct state is. And if, because of further failures, they still
can't reach agreement, we will replace them again, and we will keep doing this until they
can reach agreement. The reason why we can do this is exactly because those middle layer nodes
don't have any persistent state and they can be recovered from the storage nodes. And one
surprising fact about this active storage protocol is that it not only can improve the robustness of
the system but it can also improve its performance under some conditions. The reason is that now,
since both the middle layer and the storage nodes are replicated, we can co-locate them on the same
physical machines. And this can save a lot of network consumption in tasks like garbage collection,
in which the middle layer nodes just receive data from the storage nodes, perform some
computation, and then write the data back. In such tasks the network consumption can
almost be eliminated. Now let's evaluate our idea.
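A minimal sketch of that unanimous-agreement rule, assuming the check is a byte-for-byte comparison of the f plus 1 middle-layer replies; the names and the replacement hook are illustrative, not the actual Salus code:

    package activestorage

    import "bytes"

    // agree reports whether every middle-layer replica produced exactly the
    // same output; with f+1 replicas and at most f faulty ones, a unanimous
    // answer is always vouched for by at least one correct replica.
    func agree(replies [][]byte) bool {
        if len(replies) == 0 {
            return false
        }
        for _, r := range replies[1:] {
            if !bytes.Equal(r, replies[0]) {
                return false
            }
        }
        return true
    }

    // commitOrReplace accepts a result only on unanimous agreement. On any
    // disagreement it throws away the (stateless) middle-layer replicas,
    // rebuilds a fresh set from the storage layer, and retries until the new
    // set agrees.
    func commitOrReplace(replies [][]byte, replaceMiddleLayer func() [][]byte) []byte {
        for !agree(replies) {
            replies = replaceMiddleLayer()
        }
        return replies[0]
    }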
>>: You said that you could eliminate this in the garbage collection example. Is that
[inaudible] happening below the reliability layer in the storage nodes? Because if it's happening
above the reliability layer then you can't eliminate the network transfers, because you have
to talk to the other nodes to do your reliability protocol. So I don't have intuition for why you
can make garbage collection free unless the garbage collection operation is somehow below the
layer of reliability.
>>: Could you clarify what kind of garbage collection you're talking about?
>> Yang Wang: Let's see. For example, HBase or Bigtable use a log-structured approach.
They only append to the Google file system, but [inaudible] they need to do a compaction of the
data to discard the old data. That's what I call the garbage collection. It is usually initiated and
executed by the middle layer nodes, like the tablet servers in Bigtable. So previously, those
nodes would read data from the storage nodes, perform the computation, and then write the
data back. The storage nodes perform the reliability protocol. Is that what you mean by
reliability protocol?
>>: What I mean is the storage nodes are running some replication protocol that tolerates a
number of failures.
>> Yang Wang: Yes, we kind of moved the replication to the middle layers. It's kind of
coordinated around here, but for garbage collection they actually don't need any coordination,
mainly because we can make it deterministic. You could think of it as, we moved the
replication to the upper layer. Now each one just writes to its local replica.
>>: That's more layer.
>>: And you were saying that arbitrary errors can happen during this process. Like what kind of
arbitrary errors?
>> Yang Wang: We [inaudible] almost arbitrary errors. The only thing, so first arbitrary errors
really means arbitrary, whatever kind of errors you can imagine. That's called arbitrary errors.
But our system cannot,
>>: Like what?
>> Yang Wang: Corruptions or even a malicious user take control of one of the servers.
>>: Malicious control. [inaudible].
>> Yang Wang: I don't think all the corruptions can be detected. For example, these corruptions
can definitely be detected, but [inaudible] memory corruption, I know there are some memory
[inaudible], but if the memory chips do not provide [inaudible] then it's pretty expensive to
implement your own [inaudible].
>>: [inaudible] bigger structure, then you have protection whenever there's an error to help us.
>> Yang Wang: Yes. But for example, if [inaudible] has an error, [inaudible] corruption during
like a control flow, I don't know how to, maybe there's a way, but I don't know how to use a
check [inaudible] to protect against that. For example, if you do a comparison and somehow the
result should be true but it's corrupted into [inaudible], that kind of thing-
>>: Your cache [inaudible] have checksums as well. But I think what you're saying is you can
have arbitrary corruption that causes your instruction pointer to go off the rails and execute
some other code. That's possible even with checksums or [inaudible].
>> Yang Wang: And another problem is that I read your paper that, for example, in memory chips
when something's wrong there could be multiple corruptions happening at the same time in the
same, like, what do they call it, I don't remember the name, yeah.
>>: That discussion has just confused me about [inaudible]. So you have, still at the end you
have a client that's one client that's reading and writing multiple virtual disks, right?
>> Yang Wang: Yes.
>>: So that can have arbitrary corruption and the whole thing will fail anyway because you
won't try to replicate that.
>> Yang Wang: Yes. That's actually-
>>: Couldn't you just do end-to-end checks in the client, checks on what you're writing, checks
on what you read? [inaudible]?
>> Yang Wang: Actually in our prototype there is an end-to-end check at the client side that I will
not discuss in this talk, because we just use an existing technique from [inaudible]. But the
problem with that check is that it can only check reads; it cannot check writes. For example,
if you write data to the middle layer node and the middle layer is somehow corrupted and
writes corrupted data to all the storage nodes, then the data is actually lost. So even if you have a
checksum at the client side, it can only assure you that you will not read corrupted data; it cannot
prevent the data from being lost on the write path.
>>: Aren't there known techniques to check to see whether the data that was written is not
corrupted? I mean, just read after write?
>> Yang Wang: Read after write-
>>: It's just a matter of when you decide the data [inaudible]. What does [inaudible] mean? So
disks do this, right?
>> Yang Wang: Yes. So read after write, that's definitely a possible approach. I think that's also
used at Google, but of course it's more expensive. And the other problem is that sometimes
the system will perform [inaudible] tasks like garbage collection. Of course, you could do
read after write for every write, but in my personal experience it is pretty expensive, not only
because you need to do additional reads, it also destroys the sequential pattern of the system. You
write some data and then you read it and then you write again, so it's not a sequential
pattern anymore. That hurts the disk throughput quite a lot.
Okay. So we have implemented our idea in our prototype called Salus, which descends from
the code base of HBase and HDFS. Here we also want to answer two questions. First, what is
the overhead of the better robustness guarantees introduced by Salus? Second, does such
overhead grow with scale? To answer the first question we measure the throughput of Salus
and HBase on different workloads. And we performed the first set of experiments in an
environment where there's plenty of network bandwidth. Here we find that the
better properties of Salus do not come at a cost in throughput, and Salus can achieve
comparable throughput to HBase on almost all workloads. What is more, in an environment
where there's limited network bandwidth Salus actually allows you to have your cake and
eat it too. And such an environment is actually not uncommon: for example, if you have a
cluster of machines equipped with 1 Gigabit network capability and more than two disks, then
probably the network bandwidth is your IO bottleneck.
We also performed one set of experiments in such an environment, and we find that in
such an environment, [inaudible] Salus' ability to tolerate more errors, it can also outperform
HBase by 74 percent. This is because the Salus active storage protocol can eliminate almost all
the network consumption in garbage collection, thereby making better use of network
resources.
So, so far we have seen that better robustness actually does not hurt throughput. Now let's see
whether it will hurt scalability or not. To measure this we rented about 180-plus Amazon EC2
instances and we ran both HBase and Salus on them. Here the Y axis is the throughput per server. The
[inaudible] here is that if this number does not change when the scale of the system increases,
then our system is scalable. So first we can see that under the sequential write workload both
systems are scalable to 108 servers. And then under the random write workload we can see
that both systems experience a pretty significant performance drop. But at least the
overhead of Salus over HBase remains constant at 28 percent, which suggests that Salus is as
scalable as HBase. And the reason for this performance drop in the first place is pretty complex.
The short answer here is that when the scale of the system increases, the IO sizes on
each server actually decrease as a result of the random distribution, and therefore in the 108-server
experiment each server is actually processing a larger number of smaller IOs, which is usually
bad for disk-based storage systems. And I'm happy to provide more details on this offline.
>>: How big are the writes? Are these like 4-K blocks?
>> Yang Wang: 4-K blocks and it's doing it in [inaudible]. So each batch we use about 100
requests. So it's 100 4-K requests for each.
>>: I'm just trying to figure out the difference between random and sequential. So you're
seeking somewhere in your random 4-K and then going somewhere else?
>> Yang Wang: I don't think that's a major problem here, because both HBase and our system
actually turn random writes into a [inaudible] version in a log file system. So it's not
actually [inaudible].
>>: So the problem doesn’t come from seeking. The problem comes from smaller batches?
>> Yang Wang: Smaller batches.
>>: So this explains why throughput is lower with more nodes. You're saying there's some
timeout, maybe, where inside it you just have to write whatever you have in the log, and you
can't fill that write with as many operations in that timeframe.
>> Yang Wang: That's probably it.
>>: If you increase the offered load does this graph look better?
>> Yang Wang: There are two kinds of things you can do. The first is to increase the number of clients.
But I don't think that will help a lot, because they log different requests into different disks.
If you increase the load from a single client, my guess is it will probably help, but the problem is
that in practice usually a client will not have a very large number of outstanding requests
that it can feed to the storage system. That's why we don't want to increase it a lot.
>>: Forgive me if this is my question again, but is the read performance uninteresting?
>> Yang Wang: Read performance is, I would say, less interesting, because reads usually don't
need to go through the replication protocol, so it's easier to make reads scalable.
>>: I guess I was wondering, well I was wondering about the way you do the barriers because
the way you do barriers is you just write to everybody and then you do this pipeline commit
stage and so presumably if you read something that was recently written you might have to
wait for a while before you can actually read the correct value for that log. Or did I
misunderstand?
>> Yang Wang: A read can still get a stale one, if you [inaudible]. So let me-
>>: Oh, I see. So you don't tell the client that the write is completed until you've done, okay.
>> Yang Wang: And if you really need the new version of the data then you have to wait.
>>: And that's [inaudible]?
>> Yang Wang: That's not common, because when you write to a remote disk it kind of first caches
it at the server. When you read it, it will get it from the cache.
>>: I'm misunderstanding everything. It seems like that way, the thing of delaying the
notification to the client would make the throughput look better if you're measuring it on the
server than if you were looking at the client latency. In other words, you might be keeping the
servers busy but the client-
>> Yang Wang: Yes. It's merely designed for throughput.
>>: [inaudible]. It's a virtual disk and you're waiting for each operation to complete-
>>: [inaudible] comparison of client latency between the two systems.
>> Yang Wang: We have that graph in the paper, but I haven't put it in the slides. So it will
hurt the latency of the client, but it will not hurt a lot, mainly because in storage systems the
major latency comes from the latency of the disk, which is usually in the millisecond range.
For pipelined commit we don't commit with [inaudible] to disk; it's just in memory. So it's
usually just network latency plus some memory access latency, which is [inaudible].
>>: Okay. And then you commit to disk asynchronously in the background?
>>: I’m still thinking about this write pipeline commit issue. When the storage system tells the
client that the write is completed the client is going to go [inaudible] cache or whatever, right,
because it’s typically writing things out asynchronously from some cache. And once it’s written
[inaudible] underneath that memory is it the case that in your storage system there's one point
in the write of which you know you've got the data [inaudible] that it’s not going to get lost and
so the write operation can't fail and then [inaudible] you can commit it because you're
guaranteeing the reads will actually return the latest value?
>> Yang Wang: What is the difference of this one? I don't-
>>: You [inaudible] but delay the notification of writes because you need to give correct
semantics to [inaudible] so you're guaranteed that once a write is returned complete then
[inaudible].
>> Yang Wang: Yes.
>>: Okay. But often all I care about when I do a write is I want to know that it's not
going to get lost. Maybe, I guess the two things are kind of related.
>>: It might be durable or readable-
>> Yang Wang: Right. That's actually a very good question. Actually, in our pipelined commit, if
you consider failures, some durable data may never become visible because of errors. So that's why we
don't want to let you know before it's actually visible. For example, if of requests 1 to 3, request 2 is
lost and then the client fails, then request 3 should never be made visible at any time. So actually we
have a protocol in the background to detect such kinds of things. That's why we don't want to let
you know that it's durable.
Okay. That actually could be the end of my talk, but at the end of the Salus project there was one more
question that was deeply dissatisfying to me, and probably also to some of you. We set
out to build a robust and scalable system, but have we really succeeded? So far I have shown
that our system can scale to about 108 servers. And this number is still pretty small compared
to the size of a typical industrial deployment, which can have thousands of nodes and may have
tens of thousands of nodes in the future. So how well does our system perform at
such a scale? Actually, at the end of the Salus project we had no way to find out. This is
a fundamental methodology question that applies not only to us but also to all other researchers
in the same field. The question is, how can we validate the scalability of a large-scale storage
system?
This problem [inaudible] because for researchers, usually we don't have enough resources to run
our prototypes at full scale. As I said, a typical industrial deployment already has tens of
petabytes of data and thousands of nodes, and they are growing; and on the other hand, for
researchers, hundreds of terabytes of space and hundreds of nodes are not easy to get
[inaudible]. I don't know what the experience is here, but that's my own experience. That's
why many of the recent works are only evaluated with hundreds of servers, and those
even include Google's prototype, Spanner. So how do we address this problem? One standard
approach in distributed systems is to extrapolate large-scale results from observed results on a
small-scale testbed. For example, if with 100 nodes we [inaudible] the network is ten percent
utilized and the CPU is five percent utilized, we can extrapolate that our system can scale to
probably 1000 nodes. However, in order for this approach to work we have to rely on the
assumption that the resource consumption grows linearly with the scale of the system, which
may not always hold in practice. For example, sometimes we see that an error only happens
when the scale of the system reaches a certain limit. And sometimes we see that the
resource consumption grows super-linearly with the scale of the system, and such a trend is not
obvious when the scale of the system is small.
So we kind of give up on all such inaccurate approaches and really run our prototypes at full scale,
of course with fewer machines. To achieve that, of course, we need to co-locate multiple processes
on the same physical node, and actually co-location itself is not hard thanks to virtual machine
techniques. The real problem here is that usually the bottleneck of a testbed is the IO
resources. For example, each co-located process may write to a disk at a speed of 100
megabytes per second; now if we co-locate three of them on the same disk, each of them can only
write to the disk at a speed of thirty-three megabytes [inaudible]. And of course then they
cannot run at their full speed, and we still cannot push our system to its limit. So how can we
address this problem?
Since we have no magic way to increase the IO resources in our testbed, we are wondering, can
we somehow significantly reduce the resource requirements of each process while it still presents
the same performance profile to the rest of the system? So once again this
is impossible in general, but in storage systems there's one key observation that allows us
to achieve it. The observation is that for a storage system, usually the content of the data does
not matter. For example, if I write something to a local disk or local file system, the actual
contents I read or write do not affect how the system executes, because it simply treats
them as a black box. What really matters is the metadata, such as the length of the data
or where we perform the read and write. So this motivated us to use synthetic data at the
client side and abstract away data on all IO devices, so that we can significantly reduce the
resource requirements of each process, so that even if it is co-located with many other
processes it can still run at full speed.
Of course now the question is how can we abstract away data? So the simplest approach is
to discard it completely. This approach is actually used in a previous work called David. They
have successfully applied this idea to evaluate a local file system. But I want to clarify that it
does not work in large-scale storage systems mainly because there are usually multiple layers in
the system, and the upper layers usually store their own metadata as data on the lower layers.
And in this case it is not fine for the lower layers to just discard the data because it also contains
the metadata from the upper layers. And if you discard them of course the system will not
function correctly. Our answer to this question is that we should compress data instead of
completely discarding it. So before I go to the design let me first present the requirements of
our compression algorithm.
>>: Do you model things like network and disks themselves?
>> Yang Wang: Not yet but we should. I will talk about that later. So the first three
requirements are pretty straightforward. First, we need our compression algorithm to be
lossless because we cannot risk of losing metadata. Second, we should be able to achieve a
high compression ratio so we can look at many processes on the same node. And third, it
should also be CPU efficient because we don't want to replace our old bottleneck with a new
CPU bottleneck. Actually this [inaudible] rules out those general compression algorithms like Gzip
because they are pretty CPU heavy. The final requirement is [inaudible]. We require
that our algorithm should be able to work with mixed data and metadata. Let me elaborate a
bit. The major challenge here is that despite the fact that we have full control over the client's
data, the system itself may still add its own metadata into the data, and this is not
something that we have control over. And what is worse, the system itself sometimes splits such
data plus metadata in unpredictable [inaudible] ways and then sends them to the lower layers. And
therefore when a lower layer receives some input it does not know where the metadata is.
So the key of our compression algorithm is that we should design the data pattern of our client
data in a way that lets us efficiently locate metadata inside data. For that purpose we
have designed a specific data pattern and corresponding compression algorithm called Tardis.
We use the name Tardis because it can achieve very efficient space and time compression. So
first, to locate metadata inside data we at least need to make sure that data-
>>: Sorry. Why didn't you just [inaudible]?
>> Yang Wang: If we write all zeroes, that's actually the first approach we tried.
If you write all zeroes and some metadata is inserted, then you need to scan all the [inaudible] to
find all those non-zeroes.
>>: Oh. And this way you can avoid doing the scan. Okay. I'll shut up.
>> Yang Wang: So first, to locate metadata inside data, we at least need to make sure that
data is distinguishable from metadata. For that purpose we have introduced a specific
sequence of bytes called a flag, which does not appear in metadata. And then the question is
how to efficiently locate metadata. As we have already learned in our algorithms class, it is
always easier to locate something in a sorted array because we can use binary search. This
somehow motivates us to keep our Tardis data pattern sorted. For that purpose we have
introduced another sequence of bytes which we call a marker, which is an integer representing
the number of bytes to the end of the data chunk. So a Tardis data chunk is actually a
combination of flags and markers, in which the flags allow such data to be distinguishable
from metadata even when it is split or merged, and the markers somehow keep the data
pattern sorted so that we can use binary search to locate metadata. And if a client wants to
write a one-kilobyte data chunk, this is how it looks, assuming both flags and markers are
four bytes in this example. We will start with the flag, followed by the integer 1016, which means
that 1016 bytes are remaining, and then another flag, and 1008, and so on.
Now let's see how we can locate metadata. Here is an example which consists of the second
half of the data chunk from the previous example, plus some metadata, and also some bytes
from the next data chunk. The [inaudible] is going to start by searching for a flag; then it can
retrieve the marker after the flag; then it will try to skip those 504 bytes. But before it performs
the skip it will make a check: the last eight bytes of the region it is about to skip must be a flag
followed by a zero. If that is true, we know that there is no metadata inserted; if it is not true,
we just use binary search to locate the metadata inside it. In this case, for example, we just
compress those bytes into a more compact format that contains only two integers: the first one
is the starting point of the data chunk and the second one is the length of the data chunk. Then
it searches for a flag again. Here we see that the flag is actually not adjacent to the end of the
previous data chunk, which means that there must be some metadata inserted. In this case,
since metadata is incompressible, it simply copies it over. Then it retrieves the marker, skips
again, and performs the compression again. The benefit of this approach is that, assuming the
size of the data is much larger than the size of the metadata, our algorithm can skip, and avoid
scanning, most of the bytes in the input, which makes it very efficient. Actually, in our
experiments our compression algorithm is about 33,000 times faster than Gzip when
compressing one megabyte of data. Of course this is not a fair comparison, because Gzip is a
general-purpose algorithm, but it simply shows that by choosing our own data format we can
significantly reduce the cost of compression.
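The skip-and-check step just described can be sketched as follows; this is only an illustration under the same assumptions as before (four-byte flags and markers, a hypothetical flag value), and it falls back to copying bytes one by one where the real algorithm would use binary search over the sorted markers.

    import java.nio.ByteBuffer;

    public class TardisCompressSketch {
        static final int FLAG = 0x7A3D5C1E; // hypothetical flag value

        // Compress Tardis-patterned input into (start offset, length) pairs;
        // anything that fails the checks is treated as metadata and copied verbatim.
        // NOTE: the framing needed to tell compact records apart from raw metadata
        // bytes on decompression is omitted here for brevity.
        static void compress(byte[] input, ByteBuffer out) {
            ByteBuffer in = ByteBuffer.wrap(input);
            while (in.remaining() >= 8) {
                int start = in.position();
                if (in.getInt(start) != FLAG) {    // not a flag: copy one metadata byte
                    out.put(in.get());
                    continue;
                }
                int marker = in.getInt(start + 4); // bytes from end of marker to chunk end
                int end = start + 8 + marker;      // expected end of this data chunk
                // Cheap check instead of a scan: the last eight bytes before 'end'
                // must be a flag followed by a zero marker. If not, metadata was
                // inserted and the real algorithm would binary-search for it.
                if (end <= input.length
                        && in.getInt(end - 8) == FLAG && in.getInt(end - 4) == 0) {
                    out.putInt(start);       // compact form: start offset ...
                    out.putInt(end - start); // ... and length of the skipped run
                    in.position(end);
                } else {
                    out.put(in.get());       // simplified fallback
                }
            }
            while (in.hasRemaining()) out.put(in.get()); // trailing bytes
        }
    }

Because the data chunks are large relative to the inserted metadata, most iterations take the skip path, which is why this is so much cheaper than a general-purpose compressor.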
>>: So the underlined one is what's being sent across the network, correct?
>> Yang Wang: Yes, to network and to disk.
>>: So who is generating the top one?
>> Yang Wang: The clients.
>>: So why doesn't the client generate the [inaudible]?
>> Yang Wang: So the clients will generate this one and send it over the network, but we want
the server to see this one: it is decompressed into this at the server side. This is to ensure that
the server [inaudible] behaves the same as if the data were not compressed. Let me use the
Google File System as an example. Basically you can think of the Google File System as storing
blocks, each of which is 64 megabytes of data, and whenever it needs to create a block it needs
to contact the metadata server. So we want to ensure that the server sees exactly the same
amount of data, so that it creates exactly the same number of blocks, so that it can [inaudible].
In this case, if you just sent the compressed [inaudible] to the server, the server would see only
a few bytes, but it still creates a block only for every 64 megabytes of data, and therefore it
would create fewer blocks, which would affect the accuracy of the [inaudible]. But we really
want to ensure that the server still sees this one, so all the server code sees this one. That is
why we want the IO layer [inaudible]. So this is the basic
idea of the-
>>: Won't having multiple servers writing lots of smaller blocks to the same disk, how is that
similar in performance to a single server trying to optimize the disk for [inaudible] larger data?
You said you were trying to [inaudible] the performance characteristics of the disks.
>> Yang Wang: I will talk about that later; we actually use another approach. In order for this
algorithm to be lossless, the flag cannot appear in metadata. Otherwise we would misclassify
some metadata as data and lose it. So how can we find such an appropriate flag? One approach
is to scan all the possible metadata bytes and try to find some bytes that do not appear in them.
Actually, we find that in practice we can use a much simpler approach, mainly because Tardis is
only used for testing, so we don't actually need any rigorous guarantee on the flag. In practice,
if our chosen flag does appear in the metadata and breaks the system, then we can simply
choose another flag and rerun the test. And it turns out that a randomly chosen eight-byte flag
works for both HDFS and HBase.
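A minimal sketch of this pragmatic approach, purely illustrative: draw a random flag, and if it later turns out to occur in metadata and break a test run, draw another and rerun. The class and method names are hypothetical.

    import java.security.SecureRandom;

    public class FlagPicker {
        // Draw a random flag of the given length in bytes.
        static byte[] randomFlag(int length) {
            byte[] flag = new byte[length];
            new SecureRandom().nextBytes(flag);
            return flag;
        }

        // Returns true if 'flag' occurs anywhere inside 'metadata'; if a test run
        // ever hits this, the harness would simply draw a new flag and rerun.
        static boolean appearsIn(byte[] metadata, byte[] flag) {
            outer:
            for (int i = 0; i + flag.length <= metadata.length; i++) {
                for (int j = 0; j < flag.length; j++) {
                    if (metadata[i + j] != flag[j]) continue outer;
                }
                return true;
            }
            return false;
        }
    }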
So now we have the Tardis compression. Now let's see how to use it to achieve our original
goal. Remember that our original goal is to measure the scalability of our prototype; for that
purpose we use a combination of real nodes and emulated nodes as a microscope to focus on
the bottlenecks of the system. In such a setting, the clients send Tardis data to all nodes, the
emulated nodes run Tardis compression and decompression on their IO devices, and the real
nodes run with unmodified data. Running a large number of emulated nodes allows us to put
enough pressure on the bottleneck nodes, while running the bottleneck nodes as real nodes
allows us to get an accurate measurement of the throughput of the bottleneck, which is critical
to the scalability of the system. If you are really [inaudible], you can try to use this microscope
on different components in the system.
We have implemented emulated devices for disk, network, and memory, and we have achieved
transparent emulation for both disks and networks: by using byte code instrumentation, we
simply replace Java's IO classes with our own [inaudible] that perform Tardis compression
automatically. The usage is really simple; we just need to add an option to the Java command
line. We have not been able to find a way to support memory compression transparently yet,
mainly because in Java there is no clear interface for memory accesses, so for applications that
store a lot of things in memory it requires code modification. In our experience, HDFS does not
need any memory compression because it does not store a lot of things in memory. HBase does
store a lot of things in memory, and so required about 71 lines of code modification to support
memory compression.
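To make the IO substitution concrete, here is a conceptual sketch of what an instrumented output stream might look like; in the actual prototype the replacement is done by byte code instrumentation rather than by wrapping streams by hand, and the compress method below is a self-contained placeholder, not the real algorithm.

    import java.io.FilterOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.Arrays;

    public class TardisOutputStream extends FilterOutputStream {
        public TardisOutputStream(OutputStream device) {
            super(device);
        }

        @Override
        public void write(byte[] buf, int off, int len) throws IOException {
            // Compress recognizable Tardis data before it reaches the device;
            // metadata passes through unchanged.
            out.write(compress(buf, off, len));
        }

        // Placeholder: a real version would run the Tardis compression sketched
        // earlier; here we just copy the range so the class stays self-contained.
        private static byte[] compress(byte[] buf, int off, int len) {
            return Arrays.copyOfRange(buf, off, off + len);
        }
    }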
We have applied our system to HDFS and HBase and measured their scalability, and when we
find a problem, we try to analyze its root cause and fix it. We ran our experiments on a
[inaudible] cluster. Here I will just show some of the results from HDFS. HDFS is a typical
sharding system: it has a single metadata server, called the NameNode, which is usually
believed to be the bottleneck of the system, and a lot of data nodes to store the data. So
obviously we should apply emulation to the data nodes and run the NameNode as a real node.
Here the X axis is the number of emulated data nodes, and we have actually achieved a
co-location ratio of 1 to 100: for example, to emulate about 9.6 K data nodes, we only need 96
physical machines. The Y axis is the [inaudible] throughput, measured in gigabytes per second.
As we increase the number of data nodes, we find that the system quickly saturates at about
1 K data nodes, and our profiling shows that the problem is that the default number of
[inaudible] on the NameNode is too small. After fixing that, our system can reach a throughput
of about 300 gigabytes per second, and our profiling shows that at this point the bottleneck is
actually in the logging system of the NameNode. The NameNode needs to log two types of
information to disk: one is the metadata operation log and the other is debug information. It is
suggested that they be put on two separate disks, but unfortunately our test machines only
have a single disk, and therefore we decided to put the debug information in tmpfs, because it
is not crucial to the correctness of the system. After fixing that, we were able to achieve about
400 gigabytes per second. This number is the same as reported by the HDFS developers, who
did their own experiment on a [inaudible] cluster at Facebook, and we were able to reproduce
the same result with only 96 machines. We then wanted to investigate further whether we can
increase the throughput of HDFS.
>>: [inaudible] emulated nodes [inaudible]?
>> Yang Wang: Yes, yes.
>>: Do you have an idea why they did it with two times fewer real nodes?
>> Yang Wang: So first, they also used an extrapolation approach; they didn't get an exact
number from the [inaudible] cluster. I don't remember what number they got, but it is also not
a full-scale experiment, while we are closer to a real full scale.
>>: Oh, they weren’t running it on 4000 [inaudible]cluster?
>> Yang Wang: Huh?
>>: They weren't running on [inaudible] cluster?
>> Yang Wang: They ran on a [inaudible] cluster, but they were not able to saturate their
system, so they still extrapolated, from 4000 to some number I don't remember. The other
reason is that the machines may not be the same. In our emulated [inaudible] we assume that
each machine is equipped with two disks; of course, if each machine came with ten disks, then
you would need ten times fewer machines.
>>: I think a slight generalization to Ed's question is: do you have, have you done an experiment
to validate-
>> Yang Wang: That's a very good question.
>>: I was going to say, if you have 100 physical nodes, it would be very impressive if you could
use one node to emulate 100 and have that be the same result as 100 real. And I believe that
100 real could tell you what 10,000 actually-
>> Yang Wang: So that's a good question. Actually, in the ideal case we should have a
10,000-node [inaudible] to run an experiment and validate it, but of course this is impossible
for us.
>>: [inaudible] 100. One physical node-
>> Yang Wang: We used about 1500 nodes on our [inaudible] cluster, and at least up to that
point our results are pretty consistent with our emulation. But one thing I want to mention is
that the purpose of our emulator is not to give you an accurate performance measurement of
the system. It is mainly used to tell you where the bottleneck is, or to test where the bottleneck
is.
>>: There's a danger in that if you're inaccurate in modeling the system, you're going to fix the
wrong parts. Lots of storage systems have bugs, inefficiencies, that don't matter because
they're not [inaudible], they're not the bug you notice. You could spend a lot of time chasing
bugs that, when you get to scale, aren't the important bugs to fix. Does that make sense? I
think that's why Jeremy was asking that.
>> Yang Wang: Yeah, that's definitely a very good question. I have to say we have no good way
to tell you whether a bug we find really matters or not, but as you will see on the next slide,
when we find a scalability bottleneck we try to find its [inaudible] in the source code, and then
we can see whether it would really happen in a large-scale experiment. So I would regard that
as an indirect validation of our prototype. Ideally we should really have a 10,000-node
experiment; it's really hard.
So at this point we find that the bottleneck is still in the logging system of the NameNode, but
since we have no magic way to increase the speed of the disk, we can only assume that in the
future there might be some faster device like persistent memory. To emulate that case we also
put the metadata log in tmpfs. Configured in this way, our system can reach a throughput of
about 680 gigabytes per second. At this point our profiling shows that the bottleneck is actually
in the synchronization of different [inaudible] on the NameNode, and fixing that would require
a significant redesign.
>>: Would you consider partitioning the NameNode and just [inaudible] the bottleneck?
>> Yang Wang: I actually put it in my future work.
>>: The scale you're talking about, the thing that is considered, okay.
>> Yang Wang: Apart from those configuration problems, we also found some implementation
problems in HDFS. For example, we find that HDFS can experience a pretty significant
performance drop as the size of a file grows large. This is pretty surprising, because HDFS is
designed for big files. Our profiling shows that the problem lies in this piece of code: when the
NameNode needs to add a block to an existing file, it needs to compute the length of the
existing file, and in the current implementation it does this by scanning all the existing blocks,
which of course becomes heavier and heavier as the size of the file grows. Our fix is pretty
straightforward: we just add an integer to each file to record its current length, and you can see
that by applying our fix, the problem no longer exists.
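The fix can be illustrated with a small sketch (hypothetical names, not the actual HDFS code): instead of recomputing the file length by summing over all blocks on every block allocation, the inode keeps a running length that is updated in place.

    import java.util.ArrayList;
    import java.util.List;

    public class InodeSketch {
        private final List<Integer> blockSizes = new ArrayList<>();
        private long length = 0; // the extra integer kept with the file's metadata

        // Old behavior: recompute the file length by scanning all existing blocks,
        // which gets slower and slower as the file grows.
        long computeLengthByScan() {
            long sum = 0;
            for (int size : blockSizes) sum += size;
            return sum;
        }

        // New behavior: constant-time bookkeeping when a block is added; the updated
        // length rides along with the metadata write that happens anyway, so no
        // extra IO is introduced.
        void addBlock(int size) {
            blockSizes.add(size);
            length += size;
        }

        long getLength() {
            return length;
        }
    }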
>>: Where is the integer kept?
>> Yang Wang: Huh?
>>: Where is the integer kept?
>> Yang Wang: For each, I would say, inode?
>>: So you have to do two commits then for each [inaudible]? You have to update the integer
and the block?
>> Yang Wang: No, no. The inode is on the NameNode, so you anyway need to-
>>: You add an additional IO [inaudible] to update the integer? You are adding IO, right, if
you're keeping the integer?
>> Yang Wang: No, because when you add a block you already have an IO there. We don't add
an IO; we just put the information in the existing inode.
>>: You add a count to the block, so you read the last block of the-
>> Yang Wang: Yes.
>>: And you see if the number of block [inaudible]?
>> Yang Wang: Yes. So what is your question? So this is the end of my thesis work, but I'm also
broadly interested in how to provide fault tolerance in distributed systems. For example, I have
worked on the Eve project, which aims at replicating multithreaded applications, and I have
also worked on the UpRight project, which aims at making BFT a practical [inaudible] for real
systems.
This is almost the end of my talk. At the beginning I showed that my final goal is to provide a
robust and scalable storage system. So far I have shown how to improve the robustness of a
scalable block store and also how to validate its scalability. Of course, there is still a long way to
go. On one hand, there are other kinds of storage systems, such as file systems, key-value
stores, and databases, and they all have different kinds of workloads and requirements. On the
other hand, scalable systems are also used in other fields, such as medical care,
high-performance computing, and so on, and they also have different kinds of requirements
and workloads, which will of course present new challenges to our existing techniques. And in
the future, as I mentioned, the scale of these systems is still growing, and people are
continuously introducing new techniques to support such growing scale, such as [inaudible]
data and sharding metadata, which will of course present new challenges to our existing
research.
So I will just present two concrete projects I am interested in for the near future. First, as you
have already seen, metadata should probably be stored separately from data, because they are
totally different, so it might be beneficial to provide robust metadata storage as a single
service. Actually this is not a new idea; for example, Chubby and ZooKeeper provide a tree
abstraction to other services. But the problem is that they are not scalable, and as I mentioned
earlier, even the metadata may not fit into a single machine in the future, so we also need to
distribute it across different machines. One key challenge with a tree abstraction is that in a
tree the upper-layer nodes are usually accessed more frequently than the lower nodes, so how
to distribute them to different machines and how to achieve load balancing is an interesting
question to me.
Another question I'm interested in is how to automatically find the root causes of performance
bottlenecks. This is motivated by the fact that I have spent so much time in my PhD finding such
root causes and fixing them. One of the obvious causes is that some resource is exhausted: if
you see that the CPU is 100 percent utilized, then it's probably the bottleneck of the system.
During my internship at Facebook, we worked on a prototype to find such exhausted resources,
and it has already been deployed.
But there are also other kinds of root causes that are harder to find. For example, sometimes
the problem is caused by inefficient usage of resources, and sometimes by how different
components coordinate with each other. The bad thing is that there are no obvious signs of
such problems; you find that no resources are exhausted. So how to find which machine or
which piece of [inaudible] is actually the problem is an interesting question to me.
So to conclude, the final goal of my research is to provide robustness and efficiency
simultaneously for scalable systems, and I have shown that the key of my approach is to
process data and metadata differently, because they serve completely different goals in storage
systems. One more lesson we have learned along the way is that some problems which are
usually considered hard or even impossible in general are, when it comes to storage systems,
not only solvable but solvable in an efficient and scalable way. I hope you have enjoyed my talk.
Now I'm happy to take questions. Thank you.
>>: One question?
>>: [inaudible].
>>: Okay. Thank you.
>> Yang Wang: Thank you.