>> Jeremy Elson: Hello everyone. It's my pleasure to introduce Yang Wang. He's here from UT Austin. He's worked on a variety of distributed protocols and storage systems and, at least according to his webpage, also enjoys basketball and travel. Anyway, I'll let him get started. He's talking about separating data from metadata for robustness and scalability. Yang, please. >> Yang Wang: Thank you, Jeremy. Good morning everyone. My name is Yang Wang. Today I will talk about how to build a storage system that can be robust against different kinds of errors and scalable to [inaudible], and I will present how to achieve this by using the key idea of separating data from metadata. The fundamental goal of a storage system is not to lose data, and in practice this is hard to achieve because the storage components can fail in different ways. For example, they can crash, they can lose power, and what is worse, they can fail in some weird ways: a disk may experience a bit flip, it may lose a write because of a [inaudible], it may even misdirect a write to a wrong location. And errors in other components can also be propagated to disks and permanently damage our data. This is a long-standing problem that people have been studying for decades, and people have developed different kinds of techniques to tolerate different kinds of errors. As shown in this qualitative graph, it is typically the case that the more kinds of errors we want to tolerate, the more cost we are going to pay, and therefore people have to make a painful tradeoff between robustness and efficiency. And nowadays this problem is made worse by the trend that the scale of storage systems is growing rapidly, mainly because the quantity of data is growing almost exponentially. As a result, many big companies have developed their own large-scale storage systems to store the growing amount of data; nowadays we are talking about at least thousands of servers and tens of petabytes of data, and these two numbers are growing really fast because the quantity of data is still growing. In such a scalable system the trade-off between robustness and efficiency becomes even more painful, because the overhead of those strong protection techniques will be magnified by the scale of the system. For example, in the old days when we had only ten machines, the question was: do we want to add five more machines to make my system more robust? And nowadays, for a company like Google, the question is: we already have a million machines; do we want to add 500 thousand more machines to make my system more robust? So of course they have to make a careful balance between the cost to tolerate different kinds of errors and the price they are going to pay if certain errors happen and the system is not designed to tolerate them. >>: You're talking about increasing robustness by adding machines? >> Yang Wang: I'm sorry? >>: You're talking about the only way to increase robustness is by adding machines instead of- >> Yang Wang: No, no, no. I'm just [inaudible] as an example. >>: Okay. >> Yang Wang: Okay. And nowadays the balance point is around here. Note that this point is not necessarily the point where those two lines meet each other; actually it is the point where the sum of the two lines is minimum, and theoretically there can be multiple such [inaudible] points. At least in practice, people find that crashes and single bit flips are pretty common, so we have to tolerate them.
Whether it is necessary to tolerate network partitions is controversial. Google has decided to pay the cost to tolerate them, but many other companies have decided not to do so, mainly for cost concerns. Of course, as a result of such a balance, if an error happens on the right side of the balance point then the system is simply not designed to tolerate it. And some kinds of errors do happen in practice, causing these systems to either lose data or become unavailable. That kind of failure will not only hit the income of these companies but will also hurt their reputation. In this talk I will try to answer this question: can we achieve both robustness and efficiency for large-scale storage systems? This is the graph I have already shown. Our approach is to try to significantly reduce the cost to tolerate different kinds of errors so that we can move the balance point to the right. This is a big and hard problem that I do not plan to fully address in this talk, but at the end of the talk I hope to convince you that it is at least possible for certain storage systems. I have devoted almost all of my PhD career to investigating this problem, and we find that there is one key idea that always guides our way, which is separating data from metadata. If we look into the contents of the data, they are actually not equally important. Some parts are used to describe the relationships of other parts [inaudible], and such data about data is what we call metadata in storage systems, and it is usually considered more important to the [inaudible] of the system. Our finding is that by applying those strong but potentially expensive techniques to only the metadata, and applying minimal protection to the data, we can actually achieve strong guarantees for both metadata and data. So in the first half of my talk I will present how to use this idea to build a scalable and robust storage system. And as I said, it's a hard problem, so let me start with a simpler one: how can we replicate data in a small-scale system? On one hand, restricting our attention to a small-scale system significantly simplifies the problem; on the other hand, a [inaudible] replication protocol is a fundamental building block in almost any larger-scale system. Here I will show that it is possible to tolerate both crashes and timing errors with a significantly lower cost than previous solutions. Then I will move to the large-scale system, and there you will see that replication is still useful but it's definitely not enough; we also need to address challenges introduced by the scale of the system. There I will show that it is possible to improve the robustness of a large-scale storage system to tolerate almost arbitrary errors while our prototype can still provide comparable throughput and scalability. So now let me start with replication. As I said, replication is used in almost any large-scale storage system to provide [inaudible]. The major question here is how much cost we are going to pay, and actually the answer depends on what kinds of errors we want to tolerate. This is one of the typical examples of the painful tradeoff between robustness and efficiency. On the left side, people have developed primary backup systems, which can tolerate only crash failures but are really efficient: they only need f plus 1 replicas to tolerate f failures, which is the minimum you can expect. But of course crashes are not the only errors that can happen in practice.
People find that timing errors, which can be caused by network partitions or slow machines, can also happen in data centers, and therefore people have developed the Paxos protocol to tolerate that kind of error, and it is more expensive: it requires 2f plus 1 replicas to tolerate f failures. And on the right side people have also developed Byzantine fault tolerance techniques, which can tolerate arbitrary errors with an even higher cost. As I said, Google is already using Paxos in their systems, but many other companies are still using primary backup, mainly for cost concerns, and in this talk I will present a protocol called Gnothi which targets both f plus 1 replication and the ability to tolerate timing errors. >>: [inaudible] using Paxos, but in the traditional GFS papers they were using Paxos through exactly the method you're talking about, which is separating metadata from data, right? [inaudible]. >> Yang Wang: [inaudible] Paxos? >>: No. They were separating metadata from data. They were using Paxos to manage the metadata of the storage system and keeping data [inaudible] replicated on machines. >> Yang Wang: I'm sorry, which work are you talking about? >>: The original Google file system paper from a decade ago. >> Yang Wang: Google file system paper- >>: The classic design for large scale uses Paxos to manage your metadata because of the cost, and then you keep all your data on the machines themselves. >> Yang Wang: But the problem with the Google file system is that their data replication protocol itself does not provide the same guarantees as Paxos. They use Paxos for metadata and they use primary backup for data, okay? So they are kind of saying, we provide stronger guarantees for metadata, but they cannot provide the same guarantees for data. And in this [inaudible] I will show that it is [inaudible] to, with only replicating metadata with Paxos, get strong guarantees for both. >>: So when you say timing, do you have some example for what kinds of things you call a timing error? >> Yang Wang: Okay. We'll talk about that later. But it's kind of, for example, if we use, let's say, a timeout to detect whether a node has failed or not, sometimes a network partition can cause your timeout to be inaccurate in such a way that even though the remote node is still alive, because it's in another partition you cannot receive messages from that node, and you misclassify it as a failed one. Okay. So in this talk I will present a protocol called Gnothi which targets f plus 1 replication and the ability to tolerate timing errors. If you are familiar with distributed systems you may wonder whether this is even possible. Actually this is proven to be impossible in general, but Gnothi comes close to [inaudible] by restricting its application to a specific storage system, a block store. A block store functions like a remote disk. It provides a number of fixed-size blocks to different users, and a user can read or write a single block. >>: So are there ordering constraints between block writes? >> Yang Wang: Okay. We'll talk about that later. >>: Well, I was wondering if that was [inaudible] assumption you were making. >> Yang Wang: No, we aren't talking about that. There are ordering guarantees. >>: [inaudible]. >> Yang Wang: Despite its simplicity, a block store is still widely used in practice, as demonstrated by the success of Amazon's Elastic Block Store, EBS. And in a few slides you will see how Gnothi benefits from this simple interface.
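To make the block store abstraction concrete, here is a minimal sketch of such an interface in Java. The names and the 4 KB block size are illustrative assumptions, not Gnothi's actual API.

```java
// A minimal sketch of the block store interface described above:
// a fixed number of fixed-size blocks, read and written one block at a time.
public interface BlockStore {
    int BLOCK_SIZE = 4096; // assumed fixed block size, e.g. 4 KB

    /** Read the current contents of one fixed-size block. */
    byte[] read(long blockId);

    /** Overwrite one fixed-size block; data.length must equal BLOCK_SIZE. */
    void write(long blockId, byte[] data);
}
```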
The key idea which differentiates Gnothi from previous systems is that in Gnothi we don't insist that all nodes must have identical and complete state. For a block store this is actually fine, as long as each node knows which of its blocks are fresh and which of them are stale. That's actually how we got the name of the system: in Greek, Gnothi Seauton means know yourself, and in our system, as long as a node knows itself it can process requests correctly. But before I go into the details of our design, let's first see why this is challenging. The basic idea of replication is pretty straightforward: to ensure data safety despite f failures we need to store data at least at f plus 1 nodes. The major challenge here is how to coordinate different servers, or replicas, so that requests are executed in the same order on different nodes. Let me show a simple example to see why this is necessary. When two clients are sending two requests, A equals one and A equals two, to two different replicas of a service, it will be bad if they execute them in different orders, because they will reach different states; and in this case, if a client tries to read the data later it will get inconsistent replies. That's why we have to ensure that requests are executed in the same order on different replicas. Now let's see how different systems try to achieve that. In primary backup systems the basic idea is that all the clients send their requests to a single node, which we call the primary; the primary assigns an order to the different requests and sends that ordering information, together with the requests, to the other nodes, which we call backups. Of course, the question here is what we can do if the primary is not responding. Primary backup systems rely on a synchrony assumption: if a node does not respond in time then it must have failed. So in this case the system will promote a backup node into a new primary, and the clients will send their new requests to the new primary. However, in order for this synchrony assumption to be true, we have to set the timeout in a pretty conservative way. Otherwise, if a correct primary is misclassified as a failed one and a backup is promoted, then we will have two primaries in the system and they may assign different orders to requests. And a conservative timeout will hurt the availability of the system, because the system is unavailable after the primary fails but before the timeout is triggered and the backup is promoted. To solve this problem people have developed Paxos-like protocols. Their basic idea is that we cannot rely on such a synchrony assumption: even a correct node may not respond in time because of timing errors. Therefore, instead of relying on a single primary to order requests, the basic idea in Paxos is that any such ordering must be agreed by a majority of the nodes. As long as this is true, it is impossible for different replicas to execute requests in different orders. And note that the system should be able to make progress despite f failures. This means that even after f failures, the system should still have a sufficient number of nodes to form a majority quorum. By some simple math we can see that we need at least 2f plus 1 nodes, which is of course more expensive than the primary backup approach. So can we tolerate timing errors with only f plus 1 replication? Actually this is also a long-standing problem that people have investigated a lot.
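The "simple math" mentioned a moment ago can be spelled out as follows; this is the standard majority-quorum argument, not anything specific to the talk's slides: the system must still contain a majority quorum after f nodes stop responding.

```latex
% A quorum must be a majority, and a quorum must still exist
% when f of the n replicas are unresponsive:
n - f \;\ge\; \left\lfloor \tfrac{n}{2} \right\rfloor + 1
\quad\Longrightarrow\quad n \;\ge\; 2f + 1 .
```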
For example, the basic idea of previous works like Cheap Paxos and ZZ is that we actually only need agreement from f plus 1 nodes; we need the extra f replicas only to cover possible failures and timing errors, like paying insurance for future disasters. And of course, if the disaster does not happen, that insurance cost is wasted. So can we pay such additional cost only when failures happen? I guess everybody prefers that to paying insurance every month. Following this idea, it is pretty natural to come up with the following approach: instead of sending requests to all servers, the clients can choose to send a request to f plus 1 servers first. If those can reach agreement, then it's fine. If they still can't reach agreement, because of failures or some other reasons, then the system can activate the backup ones. The benefit of this approach is that in the failure-free case the replication cost is low, and the system can use the backup nodes for other purposes. But the problem with this approach is that both Cheap Paxos and ZZ are designed for general-purpose replication, so they want to ensure that all nodes are identical. Therefore, before the system can activate a backup node as a working node, it has to copy all the state from an existing node to the new node. And for a storage system, which can contain at least a terabyte of data, the data copy can take hours, and the system is unavailable during this period, which is clearly unacceptable. >>: [inaudible]? >> Yang Wang: Because in this case, assuming this one is not responding, we need agreement from a majority, but this one is not activated. >>: Why not just keep more replicas? Why not increase f to plan for the fact that you're going to have node maintenance and down nodes and you still have a majority? >> Yang Wang: Increasing f, of course, increases your cost in the failure-free case. >>: Sure. I'm trying to show you how to reduce the cost of f in every case. >>: Yes, he's doing that. That is all planned downtime; this is something that failed. >>: You can't have unavailability on failure- >>: And he wants to make that cheap. >> Yang Wang: So you may wonder why we need to ensure that all nodes are identical. This is because, in general, before a replica executes a new request it has to ensure that all the previous requests have already been executed; otherwise the execution might be wrong. But as I mentioned earlier, in a block store this is not necessary as long as a node knows itself. And since a block store is designed to process only writes and reads, let's see how an incomplete node can process writes and reads in Gnothi. Actually, it's pretty straightforward to process a write, because the write request will overwrite the block anyway, no matter whether it is fresh or stale. And a read can also be processed correctly as long as the node knows whether the targeted block is fresh or stale. Let me show what I mean. For example, when a client is trying to read a block from an incomplete node, as long as the node knows that it does not have the current version of the data, it can tell the client, so that the client can retry from another node. That's the basic idea of Gnothi, but then the question is how we can let each node know itself. And that's where we apply the idea of separating data from metadata. The client will separate the write request into two parts: the data part and a small metadata part which is used to identify the request.
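To make the "know yourself" idea concrete, here is a minimal sketch in Java of a replica that tracks which of its blocks are fresh. The class and field names are illustrative assumptions, not Gnothi's actual code, and it anticipates the metadata replication described next: the metadata path tells every replica the latest version of each block, the data path (sent to only f plus 1 replicas) overwrites the local copy, and a read succeeds only when the local copy matches the latest known version; otherwise the replica tells the client to retry at another node.

```java
// Sketch only: a replica that "knows itself" by comparing the latest version it has
// heard about through metadata with the version of the data it actually stores.
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class GnothiLikeReplica {
    private final Map<Long, byte[]> blocks = new HashMap<>();        // local block contents
    private final Map<Long, Long> currentVersion = new HashMap<>();  // latest version learned via metadata
    private final Map<Long, Long> localVersion = new HashMap<>();    // version this replica actually stores

    // Metadata path (replicated to all nodes): every replica learns that this write exists.
    void onMetadata(long blockId, long version) {
        currentVersion.merge(blockId, version, Math::max);
    }

    // Data path (sent to only f+1 nodes): a write simply overwrites the block.
    void onData(long blockId, long version, byte[] data) {
        blocks.put(blockId, data);
        localVersion.merge(blockId, version, Math::max);
    }

    // A read succeeds only if the local copy is known to be current;
    // an empty result means "stale here, retry at another replica".
    Optional<byte[]> read(long blockId) {
        long current = currentVersion.getOrDefault(blockId, 0L);
        long local = localVersion.getOrDefault(blockId, 0L);
        return (local == current) ? Optional.ofNullable(blocks.get(blockId)) : Optional.empty();
    }
}
```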
The client will send the data to f plus 1 nodes first, and then it will send the metadata to all nodes through a Paxos-like protocol, to ensure that all nodes know exactly which requests the system has processed. In this example, nodes 1 and 2 will know that they have the current version of the data, and node 3 knows that it does not have the current version of the data, but that the data must be stored somewhere else. The benefit of this protocol is that, since the size of the metadata is very small compared to the size of the data, despite the fact that it is fully replicated to all nodes, our replication cost is very close to f plus 1 in the failure-free case. And to actually achieve higher throughput with this protocol, we use a pretty standard way to perform load balancing: we just divide the virtual disk space into multiple slices and we allocate those slices to different nodes in [inaudible] order. By using this approach Gnothi can achieve at least 50 percent higher write throughput compared to the full replication approach. Now let's evaluate our idea. Here we hope to answer two questions. First, what is the performance of Gnothi? To answer that question we compare the throughput of Gnothi to a state-of-the-art Paxos-based block store called Gaios, which performs full replication of data. And the second question is, what is the availability of Gnothi? To answer that question we compare Gnothi to Gaios and Cheap Paxos; as a reminder, Cheap Paxos is the one that performs partial replication in the failure-free case, and during failures it activates the backup nodes and copies all the data to them. To answer the first question we measured the throughput of Gnothi and Gaios on different workloads, and here I highlight the 4K random write workload. The Y axis is the throughput, which is measured in requests per second. Gnothi can achieve at least 50 percent higher throughput compared to Gaios, because in Gnothi data is only replicated to f plus 1 nodes while Gaios' data is replicated to 2f plus 1 nodes in the failure-free case. To answer the second question we compared Gnothi to Gaios and Cheap Paxos. In this graph the X axis is time, measured in seconds, and the Y axis is throughput, measured in megabytes per second. The top orange line is Gnothi, the middle red line is Gaios, and the bottom blue line is Cheap Paxos. To measure availability we kill a server in all three systems at about time 200. First we can see that both Gnothi and Gaios don't need to block, because they still have enough replicas to perform agreement. On the other hand, Cheap Paxos needs to block for the data copy, and in our experiment it takes about half an hour to copy 100 gigabytes of data; during this period the system is not available. Then we start a new server with a blank state at time 500, and the new server of course needs to copy all its data from the other nodes. We can see that during this period Gnothi can achieve about 100 percent to 200 percent more write throughput compared to Gaios, while both can still complete the recovery at about the same time. The reason for this is that in Gnothi one node only needs to store two thirds of the data, and of course during recovery it only needs to fetch two thirds of the data, which means that the system can allocate more resources to process new requests. So far I have focused on a single replica- >>: Can you go back to the [inaudible] previously? The last slide. Yeah, this one.
So do you have a primary backup scheme compared here as well? [inaudible]? >> Yang Wang: We did not do an experiment with primary backup, but I would say it would not be a fair comparison. For example, for Gaios and Gnothi we both need three nodes; for primary backup you only need two nodes. >>: Right. >> Yang Wang: In that case the throughput should be close to a single disk's throughput, assuming all our machines are equipped with a single disk. But I would not say it's a fair comparison, because you would be comparing a two node experiment to a three node experiment. >>: But it would be different guarantees, right? >> Yang Wang: Yes. Different guarantees at different costs. That's why we kind of use a single disk as the baseline. >>: You said it's different guarantees and different costs, but I'd like you to defend a stronger claim, which is that Gnothi dominates, right? >> Yang Wang: Yes. >>: Because other than the fact that you have to pay the minimum cost of entry with your machines, after that it has better throughput and better availability. Like there's no trade-off here. >> Yang Wang: So compared to primary backup it does not have better throughput. It has the same per-machine throughput with better availability, I would say, compared to primary backup. >>: It looks like it has better throughput. >> Yang Wang: But that's because it has more machines. >>: Oh, sure. Okay. I guess you're saying you divide by the number of machines. >> Yang Wang: Yeah, yeah, yeah. The per-machine throughput is, yeah. >>: Why is the throughput going up when f equals two? [inaudible] per machine, and so is it load balancing, or is that just experimental noise, or- >> Yang Wang: No, it should be higher, because now you have five machines and data is only replicated to three of them. You should get about 1.66 times higher throughput. >>: Okay. So is it the same issue, which is that this isn't normalized for the number of machines you have? >> Yang Wang: Yes. I would say that. >>: So how close is that to the 1.66 you'd expect, because you're adding 1.6 times [inaudible] machines? It looks like five- >> Yang Wang: It's not, I think, more than 1.66, because in Paxos we still need to perform full replication for metadata, and that's more expensive when you have more replicas. So the metadata replication part is heavier, and that's why our average throughput is slightly smaller than 1.66 times. >>: If you take the single disk number, 390, and you multiply by 1.66 you get 585, which is a little higher than that. So something's [inaudible]. >> Yang Wang: For the random [inaudible], actually, something I have not mentioned here is that, since in Gnothi one node only needs to store two thirds of the data, the average [inaudible] time also becomes slightly smaller. That's why this part is actually slightly higher, like 15 percent higher. >>: 585 is a little higher. The other is supposed to be 647, which does meet the prediction. >> Yang Wang: So, so far I have focused on a single replicated group. But as I mentioned earlier, a large-scale system is much more complex, in at least two ways. First, a single replicated group is not enough to hold all the data, and that's why most companies choose to shard their data across multiple replicated groups. For example, Google can choose to store the Gmail data of users 1, 2, and 3 on shard one, users 4, 5, and 6 on shard two, and so on.
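As a small illustration of the sharding and striping just mentioned, here is one way a client might map a block of a virtual disk to a replicated group. The round-robin placement of fixed-size slices is an assumption for the example (the talk only calls it a standard load-balancing scheme), and the class is not taken from any of the systems discussed.

```java
// Illustrative sketch: stripe a virtual disk across shards by assigning
// consecutive slices of blocks to replicated groups in round-robin order.
class StripedPlacement {
    private final int blocksPerSlice;   // how many consecutive blocks form one slice
    private final int numGroups;        // number of replicated groups (shards)

    StripedPlacement(int blocksPerSlice, int numGroups) {
        this.blocksPerSlice = blocksPerSlice;
        this.numGroups = numGroups;
    }

    /** The replicated group (shard) responsible for a given block of the virtual disk. */
    int groupFor(long blockId) {
        long sliceId = blockId / blocksPerSlice;
        return (int) (sliceId % numGroups);   // round-robin over groups
    }
}
```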
The second complexity comes from the fact that there are usually multiple layers in the system. For example, if we write something in Gmail we don't talk to Google's file system directly: we need to talk to a web server first, the web server may need to talk to a Bigtable server, and that may finally talk to Google's file system. In such a complex system, ensuring robustness and efficiency inside a single replicated group is necessary but it's definitely not enough, because we also need to provide guarantees across multiple components. In this talk I will address two problems: first, how to provide ordering guarantees across different shards, and second, how to provide end-to-end guarantees across different layers. So let me go to the first one. You may wonder why we even need to provide ordering guarantees. This is because a block store requires a specific semantic called the barrier semantic, which means that the user of a block store, which is usually a file system here, can mark some of its requests as barriers; all requests issued before a barrier must be executed before the barrier is executed, and all requests issued after the barrier must be executed afterwards. Such barrier semantics are crucial to the correctness of the file system. >>: When you say all of the requests, you mean all the requests in a given client stream? There is not a barrier globally across all- >> Yang Wang: Right, right, right, right. >>: For a given client. >> Yang Wang: For a given client. Actually a block store is usually accessed by just a single client; it functions like a disk, usually. You're right. And a violation of the barrier semantic can cause the file system to lose all its data in the worst case. Let's see how it can go wrong. Let me show a simple example where the client is trying to send two requests to two different shards, and request two is marked as a barrier. It is possible that a client failure can cause request one to be lost while request two is still received and committed by the second shard. >>: Can you explain the sharding scheme? >> Yang Wang: The sharding, I would say, is somewhat similar to what the Google file system did. You use one replicated group to store some part of your data, maybe based on a hash, maybe based on the file name. >>: That's not a block store with one client. That's a shared file store. >> Yang Wang: Yes. >>: If you have single client, single disk semantics, why not shard at the client? What is this- >> Yang Wang: I would say- >>: Can you relate, is this related, I'm confused by how the Google file system sharding example relates to a single block store. >> Yang Wang: So our model is that for each block store there's a single user, but our system should provide a large number of virtual disks to a large number of users. That is the usage model. And then- >>: So the client has multiple disks that they're writing to? >> Yang Wang: Yes. Sometimes a client wants higher throughput than a single disk, and this approach also allows you to get better load balancing. >>: So the client has virtual [inaudible] over virtual blocks? >> Yang Wang: Yes. It's not called a [inaudible] but the idea is simple; it's not a [inaudible]. The replication is- >>: It's striped? >> Yang Wang: It's striped, like RAID zero. Of course, a more naive solution to this problem is that the client can choose not to send request two until request one is completed.
But that approach would just lose all the parallelism the sharding approach is trying to achieve, and of course it would also hurt the scalability of the system. Our solution to this problem is based on a key idea: such out-of-order writes are actually fine as long as the client still sees the data in the correct order. This is like saying, if there's no evidence, there's no crime. Let's see what I mean with a simple example. Assume three different shards are receiving three different requests, A, B, and C, the third is marked as a barrier, and the second one is lost somehow. We are saying that this is actually fine as long as the client doesn't see the last update, even if it has been made durable on disk. In this example it's fine to see the new version of A and the old versions of B and C. Based on this idea we have developed a protocol called pipelined commit. Its basic idea is that different shards should log data in parallel, but they should coordinate to make sure that the data is made visible to the clients sequentially. >>: Does this only apply when you're updating existing data, or does it apply for new data as well? If you think about it in file system terms- >> Yang Wang: I'm sorry? >>: Does it apply only to overwrites of existing data, or does this also apply to new data that's being written? Like, if you're striping ext3 and you have a file system journal with transactions that you log, does this problem apply there as well? >> Yang Wang: So first, our system is designed for a block store, so it's always updates; there's no new data in a block store, because it provides a fixed number of blocks to users. A file system is actually built on such a block store, and that's why its requests actually need ordering guarantees; but if the block store provides such guarantees to the file system, then the file system should not have any problems. Now let's see how it works exactly. I will use the same example as shown on the previous slides. Here each server is actually a replicated shard, but for simplicity I will just show a single node for each. So instead of just sending those requests to the servers, the client will also attach a small piece of metadata to each request which identifies the location of the next request. Then it will send such data plus metadata in parallel to the different shards, and the different shards will also log them to disk in parallel. But at this point a shard will not make the newer version of the data visible to the clients. Instead, it needs to wait for a notification from the previous server saying that the previous data has already been made visible. In this case it can then make the newer version of A visible to the clients, and then it will also send a notification to the next server, and so on. The benefit of this protocol is that the first phase, the durability phase, can be executed in parallel, and that's actually where the large block of data is transferred over the network and also written to disk. Therefore, executing the first phase in parallel allows us to achieve most of the scalability of the sharding approach, and executing the second phase, the visibility phase, sequentially allows us to achieve ordering guarantees without significantly hurting the scalability of the system.
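Here is a simplified sketch of the pipelined commit idea in Java. It is not Salus's actual protocol: it omits failure handling and the background cleanup discussed later, and the names are illustrative. It only shows the two phases, durability in parallel and visibility passed sequentially from shard to shard.

```java
// Sketch only: each shard logs its write durably right away (phase 1),
// but exposes it to readers only after the previous shard in the client's
// sequence reports that everything before it is already visible (phase 2).
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class PipelinedShard {
    interface ShardRef { void notifyVisible(long seqNo); }

    private static final class Pending {
        final long blockId; final ShardRef nextShard;
        Pending(long blockId, ShardRef nextShard) { this.blockId = blockId; this.nextShard = nextShard; }
    }

    private final ConcurrentMap<Long, byte[]> durable = new ConcurrentHashMap<>();  // logged, not yet readable
    private final ConcurrentMap<Long, Pending> pending = new ConcurrentHashMap<>();
    private final ConcurrentMap<Long, byte[]> visible = new ConcurrentHashMap<>();  // readable by clients

    // Phase 1 (runs in parallel on all shards): make the write durable and remember
    // which shard holds the next request in the client's sequence.
    void logDurably(long seqNo, long blockId, byte[] data, ShardRef nextShard) {
        durable.put(seqNo, data);                          // in a real system: append to the on-disk log
        pending.put(seqNo, new Pending(blockId, nextShard));
    }

    // Phase 2 (runs sequentially): the previous shard reports that all writes before
    // seqNo are visible, so this shard may now expose its own write and pass the token on.
    void onPreviousVisible(long seqNo) {
        Pending p = pending.remove(seqNo);
        byte[] data = durable.remove(seqNo);
        if (p == null || data == null) return;             // nothing logged here for this sequence number
        visible.put(p.blockId, data);                      // clients may now read the new version
        p.nextShard.notifyVisible(seqNo + 1);
    }

    byte[] read(long blockId) { return visible.get(blockId); }
}
```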
Now let me go to the second problem: how to provide robustness across different layers. The major challenge here is that, despite the fact that the storage layers are usually well protected, errors, and especially corruptions, in the middle layers can still be propagated to either the end users or the storage layers. So of course we also need to protect those layers, but this may immediately remind some of you of those BFT techniques, which are usually perceived as too expensive in practice. So here is what we did. We asked ourselves again, can we achieve the ability to tolerate almost arbitrary errors with a significantly lower cost? Our key idea is based on decoupling safety and liveness. What we find is that for safety we only need f plus 1 replicas: we can require that every request must be agreed by all of the f plus 1 replicas, so that we know at least one of them must be from a correct node. Of course this requires unanimous consent in our system, and the problem with unanimous consent is that it does not provide liveness at all, since a single failure causes the system to stop making progress. That is why those BFT approaches need more replicas in general. But as we already saw with Gnothi, a general solution may not be the best solution for a storage system. So here, do we have another way to restore liveness without significantly increasing the replication cost? Actually we find that the answer is yes again, and the key observation which allows us to achieve that is that those middle layer nodes usually don't have any persistent state; they usually store their persistent state on the storage layer. Now let's see how to leverage this observation to restore liveness. The major challenge here is that with f plus 1 replication, to tolerate f arbitrary failures, it is impossible to know which node is faulty. To address this challenge we take a drastic approach: we just replace all the middle layer nodes with a new set of nodes, and then we leverage the storage nodes to allow the new set of middle layer nodes to agree on what the correct state is. And if, because of further failures, they still can't reach agreement, we replace them again, and we keep doing this until they can reach agreement. The reason why we can do this is exactly because those middle layer nodes don't have any persistent state, and they can be recovered from the storage nodes. One surprising fact about this active storage protocol is that it not only improves the robustness of the system but can also improve its performance under some conditions. The reason is that, since both the middle layer nodes and the storage nodes are now replicated, we can co-locate them on the same physical machines. This can save a lot of network consumption in tasks like garbage collection, in which the middle layer nodes just receive data from the storage nodes, perform some computation, and then write the data back. In such tasks the network consumption can almost be eliminated. Let's see how we evaluate this idea. >>: You said that you could eliminate this in the garbage collection example. Is that [inaudible] happening below the reliability layer in the storage nodes? Because if it's happening above the reliability layer then you can't eliminate the network transfers, because you have to talk to the other nodes to do your reliability protocol. So I don't have intuition for why you can make garbage collection free unless the garbage collection operation is somehow below the layer of reliability. >>: Could you clarify what kind of garbage collection you're talking about? >> Yang Wang: Let's see. For example, HBase or Bigtable use a log-structured file system.
They only append to the Google file system, but they [inaudible] need to do a compaction of the data to discard the old data. That's what I call garbage collection. It is usually initiated and executed by the middle layer nodes, like the tablet servers in Bigtable. So previously, without co-location, those nodes would read data from the storage nodes, perform the computation, and then write the data back, and the storage nodes would perform the reliability protocol. Is that what you mean by reliability protocol? >>: What I mean is the storage nodes are running some replication protocol that tolerates a number of failures. >> Yang Wang: Yes, we kind of move the replication to the middle layers. It's coordinated there, but for garbage collection they actually don't need any coordination, mainly because we can make it deterministic. You could think about it as, we moved the replication to the upper layer; now each one just writes to its local replica. >>: That's more layer. >>: And you were saying that arbitrary errors can happen during this process. Like what kind of arbitrary errors? >> Yang Wang: We [inaudible] almost arbitrary errors. The only thing, so first, arbitrary errors really means arbitrary, whatever kind of errors you can imagine. That's what's called arbitrary errors. But our system cannot- >>: Like what? >> Yang Wang: Corruptions, or even a malicious user taking control of one of the servers. >>: Malicious control. [inaudible]. >> Yang Wang: I don't think all the corruptions can be detected. For example, these corruptions can definitely be detected, but for memory corruption, I know there are some memory [inaudible], but if the memory chips do not provide [inaudible] then it's pretty expensive to implement your own [inaudible]. >>: [inaudible] bigger structure then you have protection whenever there's an error to help us. >> Yang Wang: Yes. But for example, if [inaudible] has an error, a corruption in something like the control flow, I don't know how to, maybe there's a way, but I don't know how to use a checksum to protect that. For example, if you do a comparison and somehow the result should be true but it's corrupted into false, that kind of thing- >>: Your cache [inaudible] have checksums as well. But I think what you're saying is you can have arbitrary corruption that causes your instruction pointer to go off the rails and execute some other code. That's possible even with checksums or [inaudible]. >> Yang Wang: And another problem is that I read a paper that, for example, with memory chips, when something goes wrong it could be multiple corruptions happening at the same time in the same, I don't remember the name, yeah. >>: That discussion has just confused me about [inaudible]. So you have, still at the end, you have a client, one client, that's reading and writing multiple virtual disks, right? >> Yang Wang: Yes. >>: So that client can have arbitrary corruption and the whole thing will fail anyway, because you don't try to replicate that. >> Yang Wang: Yes. That's actually- >>: Couldn't you just do an end-to-end check in the client, checks on what you're writing, checks on what you read? [inaudible]? >> Yang Wang: Actually in our prototype there is an end-to-end check at the client side that I will not discuss in this talk, because we just use an existing technique from the [inaudible]. But the problem with that check is that it can only check reads. It cannot check writes.
For example, if you write data to the middle layer node, and the middle layer is somehow corrupted and writes corrupted data to all the storage nodes, then the data is actually lost. So even if you have a checksum at the client side, it can only assure you that you will never read corrupted data; it cannot prevent the data from being lost. >>: Aren't there known techniques to check whether the data that was written is not corrupted? I mean, just read after write? >> Yang Wang: Read after write- >>: It's just a matter of when you decide the data [inaudible]. What does [inaudible] mean? So disks do this, right? >> Yang Wang: Yes. So read after write, that's definitely a possible approach. I think that's also used at Google, but of course it's more expensive. And the other problem is that sometimes the system performs [inaudible] tasks like garbage collection. Of course, you can't do read after write for every write, and in my personal experience it is pretty expensive, not only because you need to do extra reads, but also because it destroys the sequential access pattern of the system: you write some data, then you read it, then you write again, so it's not a sequential pattern anymore. That hurts the disk throughput quite a lot. Okay. So we have implemented our idea in a prototype called Salus, which descends from the code base of HBase and HDFS. Here we also want to answer two questions. First, what is the overhead of the better robustness guarantees introduced by Salus? Second, does such overhead grow with scale? To answer the first question we measured the throughput of Salus and HBase on different workloads. We performed the first set of experiments in an environment where there is plenty of network bandwidth. Here we find that the better properties of Salus do not come at a cost of throughput, and Salus can achieve comparable throughput to HBase on almost all workloads. What is more, in an environment where there is limited network bandwidth, Salus actually allows you to have your cake and eat it too. And such an environment is actually not uncommon; for example, if you have a cluster of machines equipped with one-gigabit network cards and more than two disks, then probably the network bandwidth is your IO bottleneck. We also performed a set of experiments in such an environment, and we find that, despite Salus' ability to tolerate more errors, it can also outperform HBase by 74 percent. This is because the Salus active storage protocol can eliminate almost all the network consumption in garbage collection, thereby making better use of the network resources. So far we have seen that better robustness actually does not hurt throughput. Now let's see whether it hurts scalability or not. To measure this we rented about 180 Amazon EC2 instances and we ran both HBase and Salus on them. Here the Y axis is the throughput per server. The idea here is that if this number does not change when the scale of the system increases, then our system is scalable. First we can see that under the sequential write workload both systems are scalable to 108 servers. Then, under the random write workload, we can see that both systems experience a pretty significant performance drop, but at least the overhead of Salus over HBase remains constant at 28 percent, which suggests that Salus is as scalable as HBase. And the reason for this performance drop in the first place is pretty complex.
The short answer here is that when the scale of the system increases, the IO sizes on each server actually decrease as a result of the random distribution, and therefore in the 108-server experiment each server is actually processing a larger number of smaller IOs, which is usually bad for disk-based storage systems. I'm happy to provide more details on this off-line. >>: How big are the writes? Are these like 4K blocks? >> Yang Wang: 4K blocks, and it's doing it in batches. In each batch we use about 100 requests, so it's 100 4K requests for each batch. >>: I'm just trying to figure out the difference between random and sequential. So you're seeking somewhere for your random 4K and then going somewhere else? >> Yang Wang: I don't think that's a major problem here, because both HBase and our system actually turn random writes into sequential writes in a log file system. So it's actually not [inaudible]. >>: So the problem doesn't come from seeking. The problem comes from smaller batches? >> Yang Wang: Smaller batches. >>: So this explains why throughput is lower with more nodes. You're saying there's some timeout, maybe, where you just have to write whatever you have to the log, and you can only fill that write with as many operations as arrive in that timeframe. >> Yang Wang: That's probably it. >>: If you increase the offered load does this graph look better? >> Yang Wang: There are two kinds of things you can do. The first is to increase the number of clients, but I don't think that will help a lot because they will log different requests to different disks. If you increase the load from a single client, my guess is it will probably help, but the problem is that in practice a client will usually not have a very large number of outstanding requests that it can feed to the storage system. That's why we don't want to increase it a lot. >>: Forgive me, this is my question again: is the read performance uninteresting? >> Yang Wang: Read performance is, I would say, less interesting, because reads usually don't need to go through the replication protocol, so it's easier to make reads scalable. >>: I guess I was wondering about the way you do the barriers, because the way you do barriers is you just write to everybody and then you do this pipelined commit stage, and so presumably if you read something that was recently written you might have to wait for a while before you can actually read the correct value for that block. Or did I misunderstand? >> Yang Wang: A read can still get a stale one. If you [inaudible]. So let me- >>: Oh, I see. So you don't tell the client that the write is completed until you've done, okay. >> Yang Wang: And if you really need the new version of the data then you have to wait. >>: And that's [inaudible]? >> Yang Wang: That's not common, because when you write to a remote disk it is kind of first cached at the server; when you read it, you will get it from the cache. >>: I'm misunderstanding everything. It seems like the thing of delaying the acknowledgment to the client would make the throughput look better if you're measuring it on the server than if you were looking at the client latency. In other words, you might be keeping the servers busy, but the client- >> Yang Wang: Yes. It's mainly designed for throughput. >>: [inaudible]. It's a virtual disk and you're waiting for each operation to complete- >>: [inaudible] a comparison of client latency between the two systems. >> Yang Wang: We have that graph in the paper, but I haven't put it in the slides.
So it will hurt the latency of the client, but it will not hurt it a lot, mainly because in storage systems the major latency comes from the latency of the disk, which is usually in the millisecond range. For pipelined commit we don't commit with [inaudible] to disk; it's just in memory. So it's usually just network latency plus some memory access latency, which is [inaudible]. >>: Okay. And then you commit to disk asynchronously in the background? >>: I'm still thinking about this write pipelined commit issue. When the storage system tells the client that the write is completed, the client is going to go [inaudible] cache or whatever, right, because it's typically writing things out asynchronously from some cache. And once it's written [inaudible] underneath that memory, is it the case that in your storage system there's one point in the write at which you know you've got the data [inaudible] that it's not going to get lost, and so the write operation can't fail, and then [inaudible] you can commit it, because you're guaranteeing the reads will actually return the latest value? >> Yang Wang: What is the difference of this one? I don't- >>: You [inaudible] but delay the notification of writes because you need to give correct semantics to [inaudible], so you're guaranteed that once a write is returned complete then [inaudible]. >> Yang Wang: Yes. >>: Okay. But often all I care about when I do a write is that I want to know that it's not going to get lost. Maybe, I guess the two things are kind of related. >>: It might be durable but not readable- >> Yang Wang: Right. That's actually a very good question. Actually, in our pipelined commit, if you consider failures, some durable data may never be made visible. That's why we don't want to let you know before it's actually visible. For example, if you issue requests one to three, request two is lost, and then the client fails, then request three should never be made visible at any time. So actually we have a protocol in the background to detect such kinds of things, such bad things. That's why we don't want to tell you that it's durable. Okay. That could actually be the end of my talk, but at the end of that project there was one more question that was deeply dissatisfying to me, and probably also to some of you. We set off to build a robust and scalable system, but have we really succeeded? So far I have shown that our system can scale to about 108 servers, and this number is still pretty small compared to the size of a typical industrial deployment, which can have thousands of nodes and may have tens of thousands of nodes in the future. So how well does our system perform at such a scale? Actually, at the end of the Salus project we had no way to find out. This is a fundamental methodology question that applies not only to us but also to all other researchers in the same field: how can we validate the scalability of a large-scale storage system? This is hard because researchers usually don't have enough resources to run our prototypes at full scale. As I said, a typical industrial deployment already has tens of petabytes of data and thousands of nodes, and they are growing; on the other hand, for researchers, hundreds of terabytes of space and hundreds of nodes are not easy to get [inaudible]. I don't know what your experience here is, but that's my own experience. That's why many of the recent works are only evaluated with hundreds of servers, and that even includes Google's prototype, Spanner.
So how do we address this problem? One standard approach in distributed systems is to extrapolate large-scale results from results observed on a small-scale testbed. For example, if with 100 nodes we observe that the network is ten percent utilized and the CPU is five percent utilized, we can extrapolate that our system can probably scale to 1000 nodes. However, in order for this approach to work, we have to rely on the assumption that resource consumption grows linearly with the scale of the system, which may not always hold in practice. For example, sometimes an error only happens when the scale of the system reaches a certain limit. And sometimes resource consumption grows super-linearly with the scale of the system, and such a trend is not obvious when the scale of the system is small. So we kind of give up on all such inaccurate approaches and really run our prototypes at full scale, but of course with fewer machines. To achieve that we need to co-locate multiple processes on the same physical node, and co-location itself is actually not hard thanks to virtual machine techniques. The real problem is that usually the bottleneck of a testbed is its IO resources. For example, if without co-location a process can write to a disk at a speed of 100 megabytes per second, when we co-locate three of them on the same disk each of them can only write to the disk at a third of that speed, so around 33 megabytes per second. Then of course they can no longer run at their full speed, and we still cannot push our system to its limit. So how can we address this problem? Since we have no magic way to increase the IO resources in our testbed, we wondered, can we somehow significantly reduce the resource requirements of each process while it still presents the same [inaudible] performance profile to the rest of the system? Once again this is impossible in general, but for storage systems there is one key observation that allows us to achieve it. The observation is that for a storage system, usually the content of the data does not matter. For example, if I write something to a local disk or a local file system, the actual contents I read or write do not affect how the system executes, because the system simply treats them as a black box. What really matters is the metadata, such as the length of the data or where we perform the read or write. This motivated us to use synthetic data at the client side and abstract away the data on all IO devices, so that we can significantly reduce the resource requirements of each process, so that even if it is co-located with many other processes it can still run at full speed. Of course, now the question is how we can abstract away data. The simplest approach is to discard it completely. This approach is actually used in a previous work called David, which successfully applied this idea to evaluate a local file system. But I want to clarify that it does not work in large-scale storage systems, mainly because there are usually multiple layers in the system, and the upper layers usually store their own metadata as data on the lower layers. In this case it is not fine for the lower layers to just discard the data, because it also contains the metadata from the upper layers, and if you discard it, of course the system will not function correctly. Our answer to this question is that we should compress data instead of completely discarding it.
So before I go to the design, let me first present the requirements for our compression algorithm. >>: Do you model things like the network and the disks themselves? >> Yang Wang: Not yet, but we should. I will talk about that later. So the first three requirements are pretty straightforward. First, we need our compression algorithm to be lossless, because we cannot risk losing metadata. Second, we should be able to achieve a high compression ratio, so that we can co-locate many processes on the same node. And third, it should also be CPU efficient, because we don't want to replace our old bottleneck with a new CPU bottleneck. Actually this requirement rules out general compression algorithms like Gzip, because they are pretty CPU heavy. The final requirement is [inaudible]: we require that our algorithm should be able to work with mixed data and metadata. Let me elaborate a bit. The major challenge here is that, despite the fact that we have full control over the client's data, the system itself may still add its own metadata and insert it into the data, and this is not something we have control over. What is worse, the system sometimes splits such data plus metadata in unpredictable ways and then sends the pieces to the lower layers, and therefore when a lower layer receives some input it does not know where the metadata is. So the key to our compression algorithm is that we should design the data pattern of our client data in a way that lets us efficiently locate metadata inside data. For that purpose we have designed a specific data pattern and a corresponding compression algorithm called Tardis. We use the name Tardis because it can achieve very efficient compression in space and time. So first, to locate metadata inside data, we at least need to make sure that data- >>: Sorry. Why didn't you just [inaudible]? >> Yang Wang: If we write all zeroes, that's actually the first approach we tried. If you write all zeroes and some metadata is inserted, then you need to scan all the bytes to find the non-zeroes. >>: Oh. And with this you can avoid doing the scan. Okay. I'll shut up. >> Yang Wang: So first, to locate metadata inside data, we at least need to make sure that data is distinguishable from metadata. For that purpose we have introduced a specific sequence of bytes called a flag, which does not appear in metadata. Then the question is how to efficiently locate metadata. As we have all learned in our algorithms class, it is always easier to locate something in a sorted array, because we can use binary search. This motivates us to keep our Tardis data pattern sorted. For that purpose we have introduced another sequence of bytes which we call a marker, which is an integer representing the number of bytes to the end of the data chunk. Our Tardis data chunk is actually a combination of flags and markers, in which the flags allow such data to be distinguishable from metadata even when it is split or merged, and the markers keep the data pattern sorted so that we can use binary search to locate metadata. If a client wants to write a one-kilobyte data chunk, this is how it looks, assuming both flags and markers are four bytes: we start with a flag, followed by the integer 1016, meaning that 1016 bytes are remaining, then another flag and 1008, and so on.
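As a concrete illustration, here is a sketch in Java of generating that flag-and-marker pattern, assuming the four-byte flags and markers of the slide example. The flag value here is arbitrary and hypothetical, not the one used in the real system.

```java
// Sketch of generating a Tardis-style data chunk: repeated [flag][marker] pairs,
// where each marker counts the bytes remaining to the end of the chunk.
import java.nio.ByteBuffer;

final class TardisPattern {
    static final int FLAG = 0x7A3D19C5;   // illustrative flag; it must not occur in metadata

    /** Fill a chunk (length must be a multiple of 8 bytes) with the flag/marker pattern. */
    static byte[] generate(int chunkLength) {
        ByteBuffer buf = ByteBuffer.allocate(chunkLength);
        for (int offset = 0; offset < chunkLength; offset += 8) {
            buf.putInt(FLAG);
            buf.putInt(chunkLength - offset - 8);   // bytes remaining after this flag/marker pair
        }
        return buf.array();
    }
}
// For a 1024-byte chunk this produces FLAG, 1016, FLAG, 1008, ..., FLAG, 0,
// matching the example in the talk.
```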
Now let's see how we can locate metadata. Here is an example, which consists of the second half of the data chunk from the previous example, plus some metadata, and also some bytes from the next data chunk. The compressor is going to start by searching for a flag; then it can retrieve the marker after the flag; then it will try to skip those 504 bytes. But before it performs the skip it makes a check: the eight bytes right before the end of the skip must be a flag followed by a zero. If that is true, then we know that there is no metadata inserted; if it is not true, we just use binary search to locate the metadata inside the region. So in this case, for example, we will just compress those bytes into a more compact format which contains only two integers: the first one is the starting point of the data chunk and the second one is its length. Then it will search for a flag again. Here we will see that the flag is actually not adjacent to the end of the previous data chunk, which means that there must be some metadata inserted, and in this case, since metadata is incompressible, it will simply copy it over. Then we retrieve the marker, try to skip again, and perform the compression again. The benefit of this algorithm is that, assuming the size of the data is much larger than the size of the metadata, the algorithm can avoid scanning most of the bytes in the input, which makes it very efficient. Actually, in our experiments our compression algorithm is about 33,000 times faster than Gzip when compressing one megabyte of data. Of course this is not a fair comparison, because Gzip is a general-purpose algorithm, but it simply shows that by choosing our own data format we can significantly reduce the cost of compression. >>: So the underlined one is what's being sent across the network, correct? >> Yang Wang: Yes, to the network and to the disk. >>: So who is generating the top one? >> Yang Wang: The clients. >>: So why doesn't the client generate the [inaudible]? >> Yang Wang: So the clients will generate this and send this over the network, but the server will see this: it will decompress this into this at the server side. This is to ensure that the server behaves the same as if the data were not compressed. Let me use the Google file system as an example. Basically you can think of the Google file system as storing data in blocks of 64 megabytes, so when it needs to create a block, it needs to contact the metadata server. We want to ensure that the server sees exactly the same amount of data, so that it will create exactly the same number of blocks, so that it can [inaudible]. If you just sent the compressed form to the server, the server would see only a few bytes; it would still create a block for every 64 megabytes of data it sees, and therefore it would create fewer blocks, which would affect the accuracy of the emulation. So we really want to ensure the server still sees this one, so all the server code sees this one; only the IO devices see the [inaudible]. So this is the basic idea of the- >>: Won't having multiple servers writing lots of smaller blocks to the same disk, how is that similar performance? A single server is trying to optimize the disk for [inaudible] larger data. You said you were trying to [inaudible] the performance characteristics of the disks. >> Yang Wang: I will talk about that later. We actually use another approach.
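And here is a correspondingly simplified sketch of the compression scan described above, reusing the FLAG constant from the sketch before it. It is not the real Tardis code: it falls back to a pair-by-pair walk where the real algorithm uses binary search, and it glosses over how the (offset, length) records are distinguished from copied metadata in the output.

```java
// Sketch only: skip over regions of flag/marker pattern, replacing each with a
// compact (offset, length) record, and copy inserted metadata bytes verbatim.
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

final class TardisCompressor {
    static byte[] compress(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ByteBuffer buf = ByteBuffer.wrap(input);
        int pos = 0;
        while (pos < input.length) {
            if (pos + 8 <= input.length && buf.getInt(pos) == TardisPattern.FLAG) {
                int marker = buf.getInt(pos + 4);            // bytes remaining to the end of this chunk
                int end = pos + 8 + marker;
                boolean cleanSkip = marker >= 0 && end <= input.length
                        && buf.getInt(end - 8) == TardisPattern.FLAG
                        && buf.getInt(end - 4) == 0;         // a chunk always ends with FLAG, 0
                if (cleanSkip) {                             // no metadata hidden in this region:
                    writeRun(out, pos, end - pos);           // emit one compact (offset, length) record
                    pos = end;
                } else {                                     // metadata inside, or the chunk is cut off:
                    int runStart = pos;                      // walk pair by pair here; the real
                    while (pos + 8 <= input.length           // algorithm uses binary search instead
                            && buf.getInt(pos) == TardisPattern.FLAG) {
                        pos += 8;
                    }
                    writeRun(out, runStart, pos - runStart);
                }
            } else {
                out.write(input[pos]);                       // metadata is incompressible: copy it as-is
                pos++;
            }
        }
        return out.toByteArray();
    }

    private static void writeRun(ByteArrayOutputStream out, int offset, int length) {
        byte[] rec = ByteBuffer.allocate(8).putInt(offset).putInt(length).array();
        out.write(rec, 0, rec.length);
    }
}
```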
In order for this algorithm to be lossless, the flag cannot appear in metadata; otherwise we would misclassify some metadata as data and lose it. So how can we find such an appropriate flag? One approach is to scan all the possible metadata bytes and try to find a byte sequence that does not appear in them. Actually, we find that in practice we can use a much simpler approach, mainly because Tardis is only used for testing, so we don't need any rigorous guarantee on the flag: if our chosen flag does appear in the metadata and breaks the system, we can simply choose another flag and rerun the test. And it turns out that a randomly chosen eight-byte flag works for both HDFS and HBase. So now we have the Tardis compression; let's see how to use it to achieve our original goal. Our original goal is to measure the scalability of our prototype, so for that purpose we use a combination of real nodes and emulated nodes as a microscope to focus on the bottlenecks of the system. In such a setting the clients send Tardis data to all the nodes; the emulated nodes run Tardis compression and decompression on their IO devices, and the real nodes run unmodified, as in the wrapper sketched below. Running a large number of emulated nodes allows us to put enough pressure on the bottleneck nodes, while running the bottleneck nodes as real nodes allows us to get an accurate measurement of the throughput of the bottleneck, which is critical to the scalability of the system. If you are really [inaudible] you can try to apply this microscope to different components of the system. We have implemented emulated devices for disk, network, and memory, and we have achieved transparent emulation for both disks and networks: by using bytecode instrumentation we simply replace Java's IO classes with our own [inaudible] that perform Tardis compression automatically. The usage is really simple; we just need to add an option to the Java command line. We have not been able to find a way to support memory compression transparently yet, mainly because in Java there is no clear interface for memory accesses, so applications that store a lot of things in memory require code modification. In our experience, HDFS does not need any memory compression because it does not store a lot of state in memory; HBase does store a lot of things in memory and so required about 71 lines of code modification to support memory compression. We have applied our system to HDFS and HBase and measured their scalability, and when we find a problem we try to analyze its root cause and fix it. We have run our experiments on the [inaudible] cluster. Here I will just show some of the results from HDFS. HDFS is a typical sharded system: it has a single metadata server, called the NameNode, which is usually believed to be the bottleneck of the system, and a lot of data nodes that store the data. So obviously we should apply emulation to the data nodes and run the NameNode as a real node. Here the X axis is the number of emulated nodes, and we have achieved a colocation ratio of 1 to 100; for example, to emulate about 9.6 K data nodes we only need 96 physical machines. The Y axis is the [inaudible] throughput, measured in gigabytes per second.
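As an aside before the results: to make the transparent emulation concrete, here is a rough sketch of the kind of stream wrapper that bytecode instrumentation could substitute for Java's I/O classes on an emulated node, so that writes are Tardis-compressed before reaching the emulated disk or network. The class name, the buffering policy, and the reuse of the compression sketch above are all assumptions; the talk only says the instrumentation is enabled by an option on the java command line.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical wrapper an emulated node might use in place of a file or socket stream. */
public class TardisOutputStream extends FilterOutputStream {
    private final byte[] buf = new byte[64 * 1024];  // illustrative buffer size
    private int len = 0;

    public TardisOutputStream(OutputStream device) {
        super(device);
    }

    @Override
    public void write(int b) throws IOException {
        buf[len++] = (byte) b;
        if (len == buf.length) flushBuffer();
    }

    @Override
    public void flush() throws IOException {
        flushBuffer();
        out.flush();
    }

    private void flushBuffer() throws IOException {
        if (len == 0) return;
        byte[] chunk = java.util.Arrays.copyOf(buf, len);
        // Compress recognizable Tardis data, pass metadata through verbatim
        // (see TardisCompressSketch above).
        out.write(TardisCompressSketch.compress(chunk));
        len = 0;
    }
}
```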
When we increase the number of data nodes, we find that the system quickly saturates with about 1 K data nodes, and our profiling shows that the problem is that the number of [inaudible] on the NameNode is too small. After fixing that, our system can reach a throughput of about 300 gigabytes per second, and at this point the bottleneck is in the logging subsystem of the NameNode. The NameNode needs to log two types of information to disk: one is the metadata operation log and the other is debug information. It is suggested that they should be put on two separate disks, but unfortunately our test machines only have a single disk, so we decided to put the debug information in tmpfs because it is not crucial to the correctness of the system. After fixing that we were able to achieve about 400 gigabytes per second. This number is the same as reported by the HDFS developers, who did their own experiment on a [inaudible] cluster at Facebook, and we were able to reproduce the same result with only 96 machines. Then we wanted to further investigate whether we can increase the throughput of HDFS. >>: [inaudible] emulated nodes [inaudible]? >> Yang Wang: Yes, yes. >>: Do you have an idea why they did it with two times fewer real nodes? >> Yang Wang: So first, they also used an extrapolation approach; they don't get the exact number from the [inaudible] cluster. I don't remember what number they got, but it's also not a full-scale experiment, while we are close to a real full scale. >>: Oh, they weren't running it on a 4000-[inaudible] cluster? >> Yang Wang: Huh? >>: They weren't running on the [inaudible] cluster? >> Yang Wang: They ran on the [inaudible] cluster, but they were not able to saturate their system, so they still extrapolated from 4000 to some number I don't remember. The other reason is that the machines may not be the same: in our emulated [inaudible] we assume that each machine is equipped with two disks. Of course, if each machine had ten disks then you would need ten times fewer machines. >>: I think a slight generalization of Ed's question is, have you done an experiment to validate- >> Yang Wang: That's a very good question. >>: I was going to say, if you have 100 physical nodes it would be very impressive if you could use one node to emulate 100 and have that be the same result as 100 real ones. And I believe that 100 real could tell you what 10,000 actually- >> Yang Wang: So that's a good question. In the ideal case we should have a 10,000-node [inaudible] to run an experiment and validate it, but of course that is impossible for us. >>: [inaudible] 100. One physical node- >> Yang Wang: We used about 1500 nodes on our [inaudible] cluster, and at least up to that point our results are pretty consistent with our emulated results. But one thing I want to mention is that the purpose of our emulator is not to give an accurate performance measurement of the system; it is mainly used to tell you, or test, where the bottleneck is. >>: There's a danger in that if you're inaccurate in modeling the system you're going to fix the wrong parts. Lots of storage systems have bugs, inefficiencies that don't matter because they're not [inaudible], they're not the bug you notice. You could spend a lot of time chasing bugs at scale that, when you get to scale, aren't important bugs to fix. Does that make sense?
I think that's why Jeremy was asking that. >> Yang Wang: Yeah, that's definitely a very good question. I have to say we have no magic way to tell you whether a bug we find really matters or not, but as you will see in the next slide, when we find a scalability bottleneck we try to find the [inaudible] in the source code, and then we can see whether it would really happen in a large-scale experiment. So I would regard that as an indirect validation of our prototype. Ideally we should really have a 10,000-node experiment, but that's really hard. So at this point we find that the bottleneck is still in the logging subsystem of the NameNode, but since we have no magic way to increase the speed of the disk, we can only assume that in the future there might be some faster devices like persistent memory. To emulate that case we also put the metadata log in tmpfs. Configured this way, our system can reach a throughput of about 680 gigabytes per second. At this point our profiling shows that the bottleneck is in the synchronization of different [inaudible] on the NameNode, and fixing that would require a significant redesign. >>: Would you consider partitioning the NameNode and just [inaudible] the bottleneck? >> Yang Wang: I actually put that in my future work. >>: The scale you're talking about, the thing that is considered, okay. >> Yang Wang: Apart from those configuration problems, we also found some implementation problems in HDFS. For example, we find that HDFS can experience a pretty significant performance drop when the size of a file grows large. This is pretty surprising because HDFS is designed for big files. Our profiling shows that the problem lies in this piece of code: when the NameNode needs to add a block to an existing file, it needs to compute the length of the existing file, and in the current implementation it does this by scanning all the existing blocks, which of course becomes heavier and heavier as the file grows. Our fix is pretty straightforward: we just add an integer to each file to record its existing [inaudible], and you can see that by applying our [inaudible] the problem no longer exists (a toy sketch of this fix follows below). >>: Where is the integer kept? >> Yang Wang: Huh? >>: Where is the integer kept? >> Yang Wang: For each file, I would say, in the inode. >>: So you have to do two commits then for each [inaudible]? You have to update the integer and the block? >> Yang Wang: No, no. The inode is on the NameNode, so you anyway need to- >>: You add an additional IO [inaudible] to update the integer? You are adding IO, right, if you're keeping the integer? >> Yang Wang: No, because when you add a block you already have an IO there. We don't add an IO; we just put the information in the existing IO. >>: You add a count to the block so you read the last block of the- >> Yang Wang: Yes. >>: And you see if the number of blocks [inaudible]? >> Yang Wang: Yes. So what is your question? So this is the end of this line of work, but I'm also broadly interested in how to provide fault tolerance in distributed systems. For example, I have worked on the Eve project, which aims at replicating multithreaded applications, and I have also worked on the UpRight project, which aims at making BFT a practical [inaudible] for real systems. This is almost the end of my talk. At the beginning I showed that my final goal is to provide a robust and scalable storage system.
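Returning for a moment to the block-counting fix mentioned above, here is a toy sketch under hypothetical class names (BlockMeta and FileMeta are stand-ins, not the real HDFS classes): the original path recomputes a file's length by scanning every block on each block addition, which is O(#blocks) per append, while the fix keeps one integer up to date in the same operation, making it O(1).

```java
import java.util.ArrayList;
import java.util.List;

/** Per-block metadata (stand-in for the real HDFS block class). */
class BlockMeta {
    final long numBytes;
    BlockMeta(long numBytes) { this.numBytes = numBytes; }
}

/** Per-file metadata (stand-in for the real HDFS file inode class). */
class FileMeta {
    private final List<BlockMeta> blocks = new ArrayList<>();
    private long length = 0;                 // the added integer

    /** Original behavior: recompute the length by scanning all blocks. */
    long computeLengthByScan() {
        long len = 0;
        for (BlockMeta b : blocks) len += b.numBytes;
        return len;
    }

    /** Fixed behavior: maintain the length incrementally as blocks are added,
     *  so no extra log write is needed beyond the add-block operation itself. */
    void addBlock(BlockMeta b) {
        blocks.add(b);
        length += b.numBytes;
    }

    long length() { return length; }
}
```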
So far I have been able to show how to improve the robustness of a scalable block store and also how to validate its scalability. Of course, there is still a long way to go. On one hand, there are other kinds of storage systems, such as file systems, key-value stores, and databases, and they all have different kinds of workloads and requirements. On the other hand, scalable systems are also used in other fields, such as medical care, high-performance computing, and so on, and they too have different kinds of requirements and workloads, which will of course present new challenges to our existing techniques. And in the future, as I mentioned, the scale of the system is still growing and people are continuously introducing new techniques to support such a growing scale, such as [inaudible] data and sharding metadata, which will of course present new challenges to our existing research. So I will just present two concrete projects I am interested in for the near future. First, as you have already seen, metadata should probably be stored separately from data because they are totally different, so it might be beneficial to provide robust metadata storage as a single service. Actually this is not a new idea; for example, Chubby and ZooKeeper provide a tree abstraction to other services. But the problem is that they are not scalable, and as I mentioned earlier, even the metadata may not fit into a single machine in the future, so we will need to distribute it across different machines. One key challenge with a tree abstraction is that the upper-layer nodes are usually accessed more frequently than the lower ones, so how to distribute them across different machines and achieve load balancing is an interesting question to me. The other question I'm interested in is how to automatically find the root causes of performance bottlenecks. This is motivated by the fact that I have spent so much time in my PhD finding such root causes and fixing them. One of the obvious cases is that some resource is exhausted: if you see that the CPU is 100 percent utilized then it is probably the bottleneck of the system, and during my internship at Facebook we worked on a prototype to find such exhausted resources, and it is already deployed. But there are also other kinds of root causes that are harder to find. For example, sometimes the problem is caused by inefficient use of resources, and sometimes it is caused by how different components coordinate with each other. The bad thing is that there are no obvious signs for such problems; you find that no resource is exhausted. So how to find which machine, or which piece of [inaudible], is actually the problem is an interesting question to me. To conclude, the final goal of my research is to provide robustness and efficiency simultaneously for scalable systems, and I have shown that the key of my approach is to process data and metadata differently because they serve completely different goals in storage systems. One more lesson we have learned in this process is that some problems which are usually considered hard or even impossible in general, when it comes to storage systems, are not only solvable but can also be solved in an efficient and scalable way. I hope you have enjoyed my talk. Now I'm happy to take questions. Thank you. >>: One question? >>: [inaudible]. >>: Okay. Thank you. >> Yang Wang: Thank you.