>> Rich Draves: All right folks, why don't we get started? I know we're still 2 minutes early of our traditional time, but I think we have critical mass and we'll just punish anyone that shows up late. I'm pleased to introduce Prince. Prince is of course from UT Austin, one of Mike Dahlin's students there. I guess Mike and Lorenzo together, really, right? >> Prince Mahajan: Yeah, I think that Mike was my primary advisor. >> Rich Draves: Okay. Actually Prince is not a stranger to MSR. Prince interned with us back in 2004 working with Galen on similarity. More recently he's done an internship with our colleagues in Silicon Valley working with Ted Lobber [phonetic] and Doug Terry and other folks there, so he knows something about MSR. He's with us here today and tomorrow interviewing for a postdoc position, so welcome Prince. >> Prince Mahajan: Thank you. Thanks, everyone, for coming. Today I'll be talking about cloud storage with minimal trust. The body of this talk is going to be about a system called Depot which attempts to embody this philosophy. Towards the end of the talk I will also talk briefly about some ongoing work on bridging the gap between ACID distributed databases and these data stores that provide [inaudible], based on the workload. Let me start by talking about Depot. Depot is a cloud storage system in which clients do not have to trust, that is, assume the correct operation of, the storage servers or the other clients. To motivate this model, consider the example of a hypothetical picture-sharing website called CloudPic that uses a storage service provider (SSP) such as Amazon S3 to store its data. Whenever a user has new pictures to store, CloudPic simply pushes them to the SSP, and whenever the user wants to access his pictures, CloudPic fetches them from the SSP on demand. This model is appealing because the SSP handles all of the details that are necessary to build highly reliable storage, details such as geographic replication, scrubbing for latent errors, provisioning for load spikes, et cetera. Because of economies of scale it can do so at a very attractive cost, so considering these benefits it should not be surprising that a lot of people have adopted this model over the last few years. Despite these benefits, however, I argue that there remain significant risks associated with cloud storage. The first one is that cloud storage is a black box to its users, its clients. The users are not really aware of the best practices followed by the storage service or of the replication policies that are implemented to ensure reliability. So when I store an object onto the storage service I don't necessarily know if it's going to be geographically replicated or not, and if so, at what granularity. Is it going to be replicated to additional sites within a few seconds, within a few minutes, within a few hours, or at the granularity of days? >>: [inaudible]? >> Prince Mahajan: Yeah, but you don't necessarily know. For example, for geographic replication, when you push something to Amazon S3 you don't necessarily know if it's immediately going to be replicated to two nodes and then gradually replicated to three nodes, and if so at what granularity that is going to happen. So the policies may be there at some level, but they are necessarily vague, perhaps for good reason. >>: Okay. >> Prince Mahajan: The second challenge with cloud storage is that, like any large-scale distributed system, cloud storage is complex.
It's prone to both software failures and hardware failures, and failures that result from external factors such as natural disasters. The repeated occurrence of outages in these large-scale services certainly does not help user confidence in them. The key problem in using cloud storage, I argue, is that at this point we have a conflict. On one hand cloud storage offers numerous benefits, but on the other hand there are real risks associated with using a cloud storage service. The fundamental reason behind this conflict is trust: at this point cloud storage services are not entirely reliable yet, but still we trust them. We trust them with the durability of our data. We trust them to remain available whenever we want to access our data, and we trust them to return the right data whenever they send any data back. So in this work I'm going to argue for an approach of minimizing trust. What I mean by that is that we want to minimize assumptions about both the correctness and the availability of the storage service. While we expect that the storage service will usually return the right data, we should be prepared to handle situations where the data it returns is potentially inconsistent, where the data is corrupted, or even where the storage becomes unavailable. Let's see how something like this can be implemented. Our goal is really to enable the client to separate good data from bad data, so whenever the cloud service responds with data, if the data is correct the client will accept it. However, if the data is incorrect, potentially inconsistent, or corrupted, the client will be able to reject it. Furthermore, to handle situations where the cloud service becomes unavailable, we want to enable the client to specify what replication policies it wants to implement. For example, the client might want to use multiple SSPs, or it might want to keep a local copy of its data in its local data store, and it should be able to access either of these stores transparently without having to compromise consistency or correctness. I argue that this approach would be appealing to both the users and the SSP. The reason it would be appealing to users is that they can sleep peacefully at night knowing that an overnight failure of the SSP will not make them bankrupt. For the SSP, this increased trust can translate into increased cloud utilization, which can in turn translate into greater revenue. The key question, however, remains: can we actually build a useful system despite these weak assumptions? The answer, it turns out, is yes. We built a system called Depot which embodies this approach. It's a key-value store similar to Amazon S3, and in this system clients do not need to trust the servers. As a result the system can continue functioning despite routine failure of a single machine, correlated failure of multiple machines, or even a situation where the whole cloud service goes down. We further tolerate situations where some of the clients behave arbitrarily. Perhaps surprisingly, despite these weak assumptions, Depot is able to provide stronger properties than those provided by existing highly available storage deployments. In particular, Depot provides a variation of causal consistency on each of its volumes. In contrast, Amazon S3 only guarantees eventual consistency.
Furthermore, Depot is able to ensure availability despite fail-stop failures, corruption, or correlated failure of a set of machines, and Depot is able to guarantee durability despite correlated failures of machines in the cloud. So for the rest of the talk I will consider three key properties that Depot provides: correctness, availability, and durability. For each of these properties I'll explain where trust lies in existing systems and what the key goal of Depot is in terms of minimizing that trust. Then I will explain the key techniques that Depot uses to achieve this goal of minimal trust. Finally I'll come to the point where I discuss the cost of achieving this minimal-trust guarantee in Depot. Let me start by talking about correctness. In an existing cloud storage service, whenever the storage service returns an answer, the client expects that the answer will be correct, so that's where the trust lies. Yes? >>: When you say client do you mean Yelp, or do you mean the customer of Yelp? >> Prince Mahajan: In this case we talk about the enterprises that use the storage service, so that would be Yelp. Whenever the client receives an answer from the storage service it expects that the answer is going to be correct, so in some sense it is trusting the service to return the right answer. As a comparison point, in traditional state machine replication or quorum-based systems, the client trusts that whenever a quorum of machines returns matching responses, the response is going to be correct. In contrast, the goal in Depot is to eliminate the need to trust either the SSP or the other clients. We are going to achieve this goal by enabling the client to distinguish correct data from incorrect data. In order to go further I need to first define what it means for data to be correct, and I am going to do that by defining the consistency that Depot provides. The diagram here depicts the spectrum of consistency semantics that are provided by existing systems. On the extreme right-hand side we have very strong consistency, which is provided by systems like Azure, MegaStore, [inaudible] and so on. These semantics offer a very intuitive property: each GET that you perform is going to return the result of the most recent PUT that was performed on that object. However, some fundamental trade-offs prevent us from achieving these properties with high availability and minimal trust. In particular, the CAP theorem argues that you cannot achieve strong consistency without compromising availability, and similarly, the trade-offs of Byzantine agreement argue that you cannot achieve this property with minimal trust. On the extreme left-hand side we have systems like Amazon S3 or Dynamo that provide very weak eventual consistency properties. These properties are achievable; however, they are very weak and they make the task of programming very difficult. The key reason is that these systems do not provide even a very simple property that programmers expect, which is that if I perform a PUT and immediately afterwards perform a GET, with no other PUT happening in between, I expect that the GET will reflect the result of the PUT I just performed; but these systems don't guarantee this property. As a result, hackers have to build in artificial best-effort mechanisms, adding artificial delays and hoping that replication happens within that period of time, and so on and so forth.
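To make that last point concrete, here is a minimal sketch of such a best-effort delay hack (not from the talk); `store` is a hypothetical client handle with put()/get() methods, and the whole function illustrates the problem rather than a recommended pattern:

```python
import time

def put_then_get(store, key, value, delay_s=2.0):
    """Best-effort read-your-writes over an eventually consistent store."""
    store.put(key, value)
    time.sleep(delay_s)        # hope the PUT has propagated by now
    got = store.get(key)
    if got != value:           # the GET may still miss our own PUT
        raise RuntimeError("read-your-writes violated despite the delay")
    return got
```

Nothing bounds the replication delay, so no choice of delay_s actually guarantees the property; that is exactly the gap the stronger semantics discussed next are meant to close.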
Furthermore, these semantics do not preserve dependencies between PUTs performed by the clients. For example, if a client reads the result of one PUT and then performs another PUT that depends on the first PUT it read, the system does not guarantee that these two PUTs will be observed in the same order by all of the other clients. To address these limitations of existing semantics, we have designed a new consistency semantic called fork-join causal consistency. Our goal with fork-join causal consistency is to provide the strongest possible semantics that can be provided with high availability and with minimal trust. Fork-join causal consistency is a slight technical weakening of causal consistency, designed to accommodate environments where you need to minimize trust. In particular, fork-join causal consistency retains the essence of causal consistency. That is, it provides the property that if a client performs a PUT and then performs a GET, with no other operations in between, the GET is going to return the result of the PUT it just performed. Similarly, if there are two dependent PUTs, they will be observed in the same order by all of the other correct clients. This property can be useful for a number of reasons. First, it improves user experience because the expected behavior of the system matches the real behavior offered by the system. Secondly, it makes the task of programming easier, because suppose that I add a new object and then I add a reference to this object: I am guaranteed that the reference will not be seen without the original object that I added. Finally, it also helps us in layering other properties on top of fork-join causal consistency. >>: [inaudible] the places where it gets weaker than causal consistency is when somebody else adds an object and then you add a reference to that object. You may not have a guarantee that the reference is valid. Is that the…? >> Prince Mahajan: No, that's not it, so you'll still get that property. It retains most of the properties that causal consistency offers. The way in which it is weaker than causal consistency is that in causal consistency you have an expectation that all of the operations performed by a given client will be totally ordered. If clients can be arbitrarily faulty, then you cannot uphold that expectation in this environment, so for clients that are faulty this total-ordering property is compromised. The dependencies, however, are still preserved in all scenarios. To explain the benefits of fork-join causal consistency, let's revisit our example from before. Suppose that my advisor and I use the hypothetical picture-sharing website to share pictures, and suppose that the system is just eventually consistent. Recently I went on a secret trip, and I didn't want Mike to see the pictures of the secret trip, so I decided to remove Mike from the list of people who are authorized to see my album. Unfortunately, a simple [inaudible] partition, or even simple replication policies, can prevent the propagation of this update to all of the servers in the system. Now, later on, if I try to add new pictures, it is very much possible that, due to a simple load-balancing decision or due to another [inaudible] partition, this request is processed by a different set of servers than the ones that processed my earlier request.
A consequence of this partitioning, of this eventual consistency, is that if Mike later tries to access my album, he will be able to access the pictures that I didn't want him to see. In contrast, if the system provided fork-join causal consistency, then no node could process my second request without having processed my first request, and so my advisor would be prevented from seeing the pictures that I didn't want him to see. Yeah? >>: In this example all of your operations are going through one cloud [inaudible] node. >> Prince Mahajan: Right. >>: All of Mike's operations are going to another node. Is that something that a provider like CloudPic has to do to ensure, if they want to… >> Prince Mahajan: No, the fork-join causal consistency property is provided to the clients of the cloud storage, which in this case are CloudPic's servers. >>: [inaudible] treated as one? >> Prince Mahajan: No, no. These requests can be processed by the same client or they can be processed by a different client. That does not affect the correctness provided by the system. The one requirement, however, that I would like to point out is that these guarantees will only be preserved if the requests of the same user are processed by the same client, so… >>: [inaudible]. >> Prince Mahajan: Yeah, so there are multiple ways to either enforce that or to fix situations where it is not enforced. You can obviously maintain some kind of session to ensure that requests by the same user that are issued in a small time window are processed by the same machine in CloudPic. But alternatively you can also employ cookies in the user's browser, so that these cookies carry a short summary of the state that this user has seen; later on, if the user migrates to a new machine due to load balancing or due to a failure, the new machine can ensure that it has seen the prefix of state that this user observed at the previous machine in CloudPic. So you can enforce those kinds of policies in the browser or at the enterprise level to make sure that the good properties you are getting from the cloud are also transferred to your users if you need to. So far I discussed which consistency should be provided by a system like Depot that intends to be highly available despite minimal trust. Next I'm going to describe some of the key techniques that Depot uses to achieve these properties. At a high level, this diagram depicts the architecture of Depot. Data is partitioned into volumes, and each volume is replicated over a set of servers that do not require overlapping read and write quorums, to ensure high availability. Ideally we would like Depot to run only at the client in the form of a library; however, our current prototype requires us to run code on both the client machines and the cloud servers. But we have some ideas on how to change that, how to provide most of Depot's properties with simple API changes in the cloud interface. This diagram depicts how Depot can provide fork-join causal consistency with minimal trust. >>: On the previous slide, which cloud service providers does the Depot prototype work with? >> Prince Mahajan: Right now we don't work with any; we basically have our own store which we need to run on the server, so… >>: [inaudible]?
>> Prince Mahajan: You can, so the purpose of designing a system like Depot was to understand what properties can be provided, and also to derive from that understanding what API changes are potentially needed to provide these properties in subsequent systems. To some extent we have succeeded in that goal, because we now understand better what will be needed from these cloud storage providers to be able to provide these properties: basically, if you have all of these API changes you get all of these properties; if you have fewer changes, these are the properties you get; and so on and so forth. >>: Will you be talking about that, maybe about what those changes need to be? >> Prince Mahajan: Not in this talk, but I can talk about them offline. >>: [inaudible] the clients have some idea about the servers at the data center, which ones are running on the same physical machine and which ones aren't? >> Prince Mahajan: No, because we are dealing with virtual machines. All the clients need to do is be able to send messages to a node with a given virtual ID and be able to authenticate messages that are coming from that virtual ID. Whether those messages are coming directly or through some other route is not relevant. >>: You have some table of metadata around the storage itself and [inaudible] are you… >> Prince Mahajan: Right. Right now we store some metadata and perform some checks on the storage servers because, again, in Depot we have this goal of providing extreme fault tolerance, so we even wanted to isolate one server from faults that happen at another server. But it's plausible that a deployment might want to weaken those assumptions and say, I am willing to accept data from other servers in the same deployment without actually checking it, and in that case we can eliminate the need to store all of this metadata and perform a lot of these checks on the servers. >>: But it seems like the magic here is some consistency table that's maintained around all of the storage and all of the transactions. Is that roughly accurate, or am I missing something? >> Prince Mahajan: I don't understand what you mean by a consistency table. >>: Right, so the only way you would know definitively that one node has been updated and another unrelated node hasn't gotten that change yet is if you recorded those updates somewhere and then you check them against… >> Prince Mahajan: The reason you don't need to do that is because, while you do indeed need to record those changes in some form, you don't necessarily need to know authoritatively what the most recent version of an object is. The reason is that we are not trying to provide strong consistency semantics. If you had very strong consistency that [inaudible] reliability, then you would need to know what the most recent version of this object is and where it is placed. On the other hand, if you're trying to provide the weaker semantics of causal consistency, you do not need to ensure that each GET returns the most recent PUT performed on that object. It is sufficient to ensure that the GET returns the most recent PUT known to the client who is performing the GET. So the properties are weaker, and as a result some of the requirements that you would otherwise have to enforce are not needed. Does that make sense? >>: It does. I just, does it meet your criteria of making sure your boss won't see your pictures that you don't want him to see?
>> Prince Mahajan: It does, because if the same client is performing successive operations, then these operations remain dependent on each other and they will be observed in the same order, and that is how the criterion is enforced: any client that observes the second operation is guaranteed to also observe the first operation, because these are dependent operations performed by the same client. You would actually get more than that, and I can try to explain some of the additional properties you get with this, but note that for causal consistency you have to enforce those requirements. Yeah. >>: I have a quick question before you go on. When you say minimal trust, do you mean minimal in the mathematical sense or the English sense? >> Prince Mahajan: In the English sense [laughter]. >>: I don't know, maybe I [inaudible] [laughter]. >> Prince Mahajan: All right. Let's see how we can enforce fork-join causal consistency with minimal trust. What we do is attach some metadata to each PUT. Logically this metadata summarizes the history of this PUT, that is, all of the previous PUTs that have been observed by the client before this PUT was performed. This metadata is then replicated to all of the machines, all of the clients as well as all of the servers in the system. It forms a part of the local state, and each subsequent GET that a client performs is going to be checked against this local state that the client maintains. This is the key to enforcing consistency. And finally, before accepting new metadata, clients perform some checks to ensure that the new metadata is consistent with what they have seen in the past. So let me talk about what exactly this metadata is and what the key checks are that a node performs. >>: It needs to be replicated at all nodes? >> Prince Mahajan: All of the clients… >>: Do you mean eventually or… >> Prince Mahajan: Eventually. >>: Because you can't possibly mean write everywhere. >> Prince Mahajan: Yeah. So basically all of the clients as well as all of the servers that share a volume will end up having some metadata for all of the updates that have been performed in the system; just the metadata, though. >>: [inaudible] but a client would check for metadata… >> Prince Mahajan: You can't check it on demand. >>: [inaudible] and the GET and that would be the way you would ensure this consistency. >> Prince Mahajan: Yeah. >>: All right. >> Prince Mahajan: So let me talk about what this update metadata is. It has some expected fields such as the node ID, the key that is being updated, and the hash of the value that is being added. In addition, it includes two new fields. First is a version number that is assigned by the client who is performing the PUT, and second is a compact encoding of the history, and this encoding consists of a version vector and a secure hash computed over the local history. Nodes store this update metadata until it is garbage collected, at which point they compress it into a version vector. So let's see what checks a node needs to perform when it receives new metadata. Whenever a node receives new metadata, a new update, it performs two checks.
First, it ensures that all of the updates that are present in the history of this update are also present in its local history. This amounts to performing a simple version vector inclusion check, verifying that the version vector included in the update is subsumed by the version vector maintained at the client; in addition we need to compare the history hash to deal with situations where corruption faults can corrupt the state. And secondly, we need to check that all of the versions created by a given client are monotonically increasing, so that the client does not end up reusing versions for different updates. So the system works fine if there are no arbitrary faults in the system. However, if a client experiences a loss of state or a corruption fault, it can lead to a problem, and this problem is called forking. In particular, what can happen is, suppose that a client uses up versions through version number five and then it dies or loses its most recent state, so it reboots and starts reusing versions from three onwards. As a result there will be two different updates, both with version number three. If the faulty client F exposes these two different updates with the same version number to different clients, then those two clients become forked. What I mean by forking is that each of these clients has individually seen a consistent view of the system, but taken together the overall state of the system is not consistent. As a result of this forking these clients cannot subsequently exchange updates, because if client A tries to send messages to client B that depend on the different version of the update that it has seen, client B will not be able to verify them, and similarly when client B tries to send messages to client A, client A will not be able to verify them. And this is not a new problem. The concept of forking was introduced about 10 years back by the SUNDR system. The key problem in all of these systems is that once the system gets forked, it cannot attain eventual consistency, because the correct clients that have seen mutually inconsistent histories are logically partitioned from each other. They are prevented from exchanging updates from this point onwards, and this is a fundamental limitation: it is not acceptable for a storage system to be unable to provide eventual consistency. To address this problem Depot has a new mechanism for joining forks. When a client observes that there are two mutually incompatible histories, it pretends that the faulty node F is instead a collection of two correct virtual nodes, F prime and F double prime. By doing this conversion, these correct nodes are subsequently allowed to exchange updates; moreover, we have learned that this faulty node F has created inconsistent updates, and so we can subsequently evict this node from the system, and Depot includes mechanisms for doing that kind of eviction. So by converting a faulty node into multiple correct nodes, we have basically converted corruption faults, or arbitrary faults, into logical concurrency, which these systems are prepared to handle. In this part of the talk, what we learned is what it means to minimize trust for correctness and what consistency Depot provides to achieve its goal of minimal trust; a small sketch of the update checks and fork joining follows.
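Here is a minimal sketch of those acceptance checks, under stated assumptions: the class and field names below are illustrative rather than Depot's actual wire format, signatures are omitted, and the fork-joining step is only stubbed out.

```python
from dataclasses import dataclass

@dataclass
class Update:
    """Illustrative update metadata; field names are ours, not Depot's."""
    node_id: str       # writer's (possibly virtual) identity
    key: str
    value_hash: str    # SHA-256 hash of the value being PUT
    version: int       # per-writer version number
    history_vv: dict   # version vector summarizing the writer's observed history
    history_hash: str  # hash chained over that observed history
    # a real update also carries the writer's signature over all of these fields

class Node:
    def __init__(self):
        self.local_vv = {}    # node_id -> highest version accepted so far
        self.seen = {}        # (node_id, version) -> Update already accepted

    def check_update(self, u):
        """The two acceptance checks from the talk, heavily simplified."""
        # Check 1: every update in u's history must already be local
        # (version vector inclusion; a real node also recomputes the
        # history hash to catch state corruption).
        for nid, v in u.history_vv.items():
            if self.local_vv.get(nid, -1) < v:
                return False   # missing dependencies; fetch them first
        # Check 2: per-writer version numbers must be monotonically increasing.
        if u.version <= self.local_vv.get(u.node_id, -1):
            prior = self.seen.get((u.node_id, u.version))
            if prior is not None and prior != u:
                # Two distinct updates share (writer, version): a fork.
                self.join_fork(u.node_id, prior, u)
            return False
        return True

    def join_fork(self, node_id, branch_a, branch_b):
        """Fork joining: treat the faulty writer as two correct virtual
        nodes (F' and F''), so correct nodes can keep exchanging updates
        and later evict the faulty writer."""
        ...  # re-tag the two branches as node_id + "'" and node_id + "''"

    def accept(self, u):
        if self.check_update(u):
            self.local_vv[u.node_id] = u.version
            self.seen[(u.node_id, u.version)] = u
```

The point of the sketch is the shape of the logic: dependency inclusion plus per-writer monotonicity is what lets a correct node reject inconsistent metadata, and a detected duplicate (writer, version) pair is what triggers fork joining.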
To recap: we discussed a new consistency semantic called fork-join causal consistency that is strong enough to be useful, yet weak enough to be enforceable under Depot's weak assumptions; the key idea was to reduce failures to concurrency, and in addition we provided a protocol for enforcing fork-join causal consistency. The next thing I want to talk about is availability. Let's try to understand where trust exists for availability in existing systems. In existing systems a client trusts that the SSP will remain available whenever it wants to access data. Similarly, for a typical state machine replication system, the client expects that a set of nodes will be available whenever it wants to access data. In contrast, the goal of Depot is to enable the client to define a replication quorum for each object, so that the client can control its replication policies; in Depot we want to ensure that an object remains available as long as there is at least one copy of this object present at any of the available machines. For example, a client can specify that it wants to replicate data onto multiple SSPs, and then the data should be available as long as either of those SSPs is available. Similarly, a client may keep a local copy of an object in its own data store, and in this case the data should be available if either the local copy or the SSP is available. This might seem simple, but the main reason we are able to achieve this goal is that we have eliminated the trust for consistency. In contrast, in traditional quorum systems, if we tried to apply this approach we would risk reading inconsistent or stale data. So let me illustrate why a problem can arise if we didn't have this support of minimal trust for consistency. Suppose that a client performs an update, pushes it out to one of the SSPs, and receives the notification from the SSP that the write is completed. Later on the SSP fails and the client tries to retrieve this object from the backup store, which could be a second SSP or its local data store; in this case the backup store may not have a copy of the object that the client just pushed to the first SSP, and it will end up returning that the album is empty. Now, absent Depot's mechanisms for distinguishing correct data from incorrect data, the client will simply accept this response from the second SSP and do further processing based on it, which would violate the consistency requirement that we want to provide: that if the client performs a PUT and then performs a GET, the GET reflects the result of the PUT it just performed. However, if you have the--sorry. >>: To have that property you've got to have somebody else that knows--if you write something to one place and nowhere else, there is no record of it anywhere else, and if that place forgets it, then it is as if it never happened. So you are storing this history back on the client also, right? It's got to be somewhere. >> Prince Mahajan: Sorry? >>: It's got to be somewhere or you can't have this property. >> Prince Mahajan: What I'm saying is that it's not easy to simply replicate data on multiple SSPs if you don't have mechanisms to detect which copy is acceptable and which copy isn't. Depot does that, but if you didn't have those mechanisms, if you used a simple client… >>: [inaudible] how Depot does that. Where is that knowledge stored [inaudible]? >> Prince Mahajan: At the client. >>: It's at the client?
>> Prince Mahajan: Yes. >>: All right. >> Prince Mahajan: So if the client has this knowledge, and if the clients are able to separate correct data from incorrect data, then the client can identify that this response is potentially stale or incorrect and that it should not be processing it. So far I discussed how Depot minimizes trust for availability. Next I'm going to talk about durability. For durability we want to make sure that the data eventually becomes sufficiently replicated. We have enabled the client to define where it wants to replicate data, but for durability our goal is to ensure that this data eventually does become replicated at each of those machines. What this means is that we need an agent in the system who is responsible for pushing the data to all of the replicas at which you want to replicate an object. This agent could be the client who is performing the PUT; it could be the SSP, configured to push the objects to the other SSPs; or it could be some background job reading data from one SSP and pushing it to another SSP. In any of these cases we are putting some trust in that agent to correctly perform this task of replication, and this is the trust that can lead to a compromise of durability. In particular, in the system so far, every time a client reads or writes an object it is trusting that the replication agent has performed replication for this object correctly. If the replication agent fails to fulfill this task, it can lead to a situation where we compromise durability. So let me illustrate how that can happen, taking the example of the client performing the PUT. Let's take a situation where we designate the client to perform this task of pushing the data to both of these SSPs. In the normal case, the client ensures that it has pushed the data to both SSPs, so that if another client reads data from the first SSP and later on that SSP fails, that client can still access the data from the second SSP. However, there might be a small window of time where the client has pushed the data to the first SSP but hasn't yet pushed it to the second SSP. At this point the data is visible to all of the clients who are accessing the first SSP, so another client might still end up reading this data, but the data is not sufficiently replicated. If this client now fails, the data may never become sufficiently replicated. Furthermore, this object is living dangerously at this point, because if the first SSP now fails, the durability of this object will be compromised. So the goal of Depot is to ensure that any object that is read or written by a correct client remains durable as long as some replica in the client's sufficient replication quorum survives. If I have designated my object to be replicated at two SSPs, this object should remain durable if it has been read or written by a correct client and at least one of those SSPs survives. The way we are going to achieve this goal is by minimizing the trust that is needed to enforce the client's replication policy. The logic for this is relatively simple. All we need to do is add some receipts to the system. In particular, every time the client stores data on one replica, the replica issues a receipt certifying that this data has been durably stored on that replica; a small sketch of such a receipt follows.
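Roughly, a receipt might look like the following sketch. Depot uses public-key signatures for receipts; the HMAC below is a stand-in used only to keep the sketch self-contained, and all names are illustrative:

```python
import hashlib
import hmac

def issue_receipt(replica_id: str, replica_key: bytes, update_bytes: bytes) -> dict:
    """A replica certifies that it has durably stored this update.

    Hypothetical helper: a real deployment would sign with the replica's
    private key rather than an HMAC, so anyone can verify the receipt.
    """
    digest = hashlib.sha256(update_bytes).hexdigest()
    proof = hmac.new(replica_key, digest.encode(), hashlib.sha256).hexdigest()
    return {"replica": replica_id, "update": digest, "proof": proof}
```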
These receipts are then attached to the metadata associated with an object, and they are propagated throughout the system using the mechanisms that Depot provides. Now, whenever a client receives an object, it checks whether the object is sufficiently replicated or not. If the object is already sufficiently replicated, the client does not need to do anything else. >>: How does it check? >> Prince Mahajan: Because the object has its receipts associated with it. >>: [inaudible] check its receipts? >> Prince Mahajan: Right, so if the replication policy requires replication to two SSPs, then the client can check whether the object is already replicated at two SSPs or not. >>: Does it have to go back and request those receipts? >> Prince Mahajan: No, they are attached to the metadata, so the client gets them. >>: I see. When the metadata updates come, they contain a receipt as well. >> Prince Mahajan: Yeah. So this is the easy case, where the object has already been sufficiently replicated and the client reads it; the client sees that the object is sufficiently replicated, and so it doesn't need to do anything in this case. >>: After you write a second copy do you need to go back and update the first copy? >> Prince Mahajan: Yes, right. >>: That just makes the write take longer. >> Prince Mahajan: No, the write is complete as soon as you perform the first operation at the first SSP. And then you… >>: [inaudible] has to go back? >> Prince Mahajan: You're doing some extra work, which you can piggyback, for that write, but the write is complete and visible as soon as you perform the first operation. However, if a client receives an object and sees that the object is not sufficiently replicated, then that client acquires a copy of that object along with whatever receipts there are, and the client stores this copy until it learns that the object has become sufficiently replicated. So the client ends up storing this copy locally and then it can continue processing; later on, when it learns that the item is sufficiently replicated, at that point it can discard this copy of the object. >>: If the client doesn't trust its own local store, it might not want its writes exposed until sufficient receipts [inaudible]? >> Prince Mahajan: Right. Actually I come to that in the next slide. This approach of minimizing trust for durability actually helps us address a fundamental trade-off in geographically replicated distributed systems. In these systems you have a trade-off between being able to complete an operation with low latency, and by low latency I mean a latency smaller than the wide-area replication delays, and durability in the presence of data center failures. Let me illustrate what the problem is. Whenever the client sends a request to a server in one data center, the server has two choices. It can either respond back to the client before ensuring replication to the other remote data centers, in which case you get low latency, but at this point the write is not durable if a data center failure happens. Or the server can wait until it receives responses from all of the remote servers, in which case the object is certainly durable even if a data center failure happens, but the request takes much longer and the operation is not low latency.
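Before looking at how Depot navigates that trade-off, here is what the client-side sufficiency check described above might amount to; this is a hypothetical helper matching the receipt sketch earlier, not Depot's actual API, and receipt signature verification is elided:

```python
def sufficiently_replicated(receipts, required_replicas):
    """Check a PUT's receipts against the client's replication policy.

    `receipts` are the receipt dicts attached to the update's metadata,
    so no extra round trip is needed. `required_replicas` is the set of
    replica IDs the client's policy demands, e.g. two SSPs.
    """
    certified = {r["replica"] for r in receipts}  # who durably stored it
    return required_replicas <= certified

# A reading client that finds this False keeps its own copy of the object
# until enough receipts arrive, and only then discards it.
```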
In contrast, with a mechanism like Depot, when a client sends an object to the first server, the server can respond with a receipt saying, I am replicating this object locally. The client at this point holds onto the object instead of throwing it away, as is normally done. This operation is still low latency, because the server doesn't need to replicate data to all of the remote data centers before responding back to the client. However, if this data center later fails, the client still has a copy of the object, and as a result the client can push this copy to the remote data centers and ensure that the object becomes sufficiently replicated and is durable. >>: The client can push it to the remote? >> Prince Mahajan: The client can push. So this approach is desirable because you get some implicit fault tolerance, since the client is typically logically or physically separated from the servers, and in some sense the client also shares its object's fate: if the client survives, the object will also survive; if the client fails, then the object will fail. >>: Not addressing conflict resolution then? >> Prince Mahajan: This talk is not really about conflict resolution, because that problem arises irrespective of whether you use this mechanism or not. >>: [inaudible] you've introduced some other last-writer [inaudible]. >> Prince Mahajan: No, no. For conflicts, the approach that Depot offers is that it returns all of the concurrent writes that have been performed on an object, and with each write you can associate whatever application data you want in order to ensure the conflict is resolved properly. So, for example, you can attach a timestamp and then pick a winner based on that timestamp, or you can [inaudible] the concurrent writes in some meaningful way based on the application semantics. We are not introducing a new last writer here, because the timestamp that you would associate here would be the timestamp that the client assigned to this object when it performed the original operation. >>: So you are just adding an additional staleness factor. >> Prince Mahajan: Yeah. It's possible that if the data center fails you end up having more staleness than you would see otherwise. Later on, when the client gathers the receipt from this other remote data center, at that point it can discard this copy of the object. >>: [inaudible] not in the data center that it's trying to write to. >> Prince Mahajan: Even if it is within the data center, it still gets some additional fault tolerance because the servers may indeed be isolated. >>: It can be in the data center. It's just got to be aware of all of the available regions. >>: Well, if it's in the data center then it's correlated with the data center. >> Prince Mahajan: [inaudible] may be correlated, but again, there are two arguments. One is that they can be logically separated entities, and secondly, if the client also fails, then it is sort of like losing all recollection of what it has done, and so maybe that might be acceptable. I'm not arguing that it is acceptable, but it might be okay in that circumstance. >>: What if the client writes object A with intentionally low durability and then makes a dependent write to B with high durability, does the high durability of B extend to A?
Because I can see that if A becomes unavailable, now B, with supposedly high durability, will... everybody will say, well, I can't process B and… >> Prince Mahajan: You can still process B; it's just that if you are writing something with low durability, maybe you are okay with losing that data. That would be your intent. So if, for example, for object A I specify that it's enough to replicate that object in any single data center, whereas for object B I require that the object be replicated at two data centers, then what might happen in that circumstance is that you lose object A but you still have access to object B; and presumably, because you wrote object A with low durability, you should be prepared to handle the situation where you lose object A, and perhaps you are able to re-create a copy of object A through some other mechanism. >>: But how does Depot handle it? I mean, Depot just sees, oh, B has this pointer back from B to A, and it will say, I can't provide you B until I… >> Prince Mahajan: No, B does not include a pointer to A. I mean, it logically does, but B includes a pointer to A's metadata, which will be available. The only thing that you will end up losing is A's data, so you won't be able to read A, but you still have all of the information that you need to enforce consistency. So this is a graph that depicts a situation where, in a local cluster, we kill all of the cloud servers at about 220 seconds, and it shows that Depot's PUTs continue functioning even after the servers have been removed. They function with low latency once the clients recognize that the SSP is down, because they don't have to go to the servers anymore; they periodically keep checking whether the SSPs have come back, and that's why you see these latency spikes in between. At about 660 seconds we bring the SSP back, and the clients at that point push their state back to the SSP and start working from the storage service again. In this configuration we used a local backup and one SSP as our deployment. So next let me talk about how much it costs to provide the properties that Depot provides. In particular, I'm going to talk about latency, resource utilization, and the dollar cost of running a system like Depot. The testbed that we used consisted of eight clients and four servers; we were performing about one request per second, and each volume contained 1,000 objects. Just as a reminder, these are the sources of overhead in Depot. We attach metadata to each object, and this metadata consists of a signature, potentially receipts, a partial version vector, a history hash, and a data hash. In order to accept the metadata, a node performs a SHA-256 check, an RSA signature verification, a history check, and then the receipt checks. And then to perform a GET, the client performs a SHA-256 check. So this diagram depicts the latency of several variants of the baseline system that we constructed. In order to provide comparison points, we constructed several baseline variants, and these variants try to emulate cloud storage by disabling the replication of metadata to the clients and disabling all of the checks that Depot performs.
So, for example, the Base variant does not have any metadata at the client, and the clients and the servers do not perform any checks; whereas in the Hash variant the clients compute a hash of the data and attach it to the object before storing it, and then when they read, they verify the hash. Similarly, in the Sign variant the clients sign the object before storing it and then check the signature when reading it. The Depot-minus-receipts variant does not implement the receipt logic of Depot, and the main reason for providing this variant is that Depot uses cryptographic receipts, but for many deployments it may be acceptable to instead use receipts that just carry the index information; if nodes trust each other to provide the right information, then that might be acceptable, and many of the overheads that we incur due to cryptographic operations can be eliminated. The key thing to note here is that the overhead on the GETs is very, very small, because the only extra computation that the client needs to do is a SHA-256 check on these GETs. In contrast, for PUTs the overheads are high because we are performing several cryptographic operations. In particular, in the complete variant we need to perform two SHA-256 computations and two signatures on the critical path, and that's why the cost of the complete Depot version is significantly higher. But like I said, there is the potential to remove a lot of these cryptographic operations and reduce the overheads of Depot significantly. >>: [inaudible]. >> Prince Mahajan: [inaudible]. Ones that are [inaudible]. >>: [inaudible] 5 seconds to do… >> Prince Mahajan: Yeah. >>: [inaudible]. >> Prince Mahajan: Each signature costs about 4.4 milliseconds, I think. >>: [inaudible] could go faster. >> Prince Mahajan: Yeah, using [inaudible]. [laughter]. So next I want to talk about the resource utilization costs of Depot in comparison to these other variants. In this graphic I'm trying to depict the normalized cost of Depot in comparison to these other approaches; in particular, I compare three resource utilization metrics: the network utilization between clients and servers, the CPU utilization of the client, and the CPU utilization of the server. For GETs, as you can see, the extra network cost is negligible, and at the client the extra cost comes primarily from the extra hash computation that it needs to perform. The CPU utilization of the server was very, very small, so it's pretty much noise. >>: What is the data that you're using for this? >> Prince Mahajan: Sorry? 10 kB objects. >>: 10 kB objects, and your network overhead looks like zero, but you have to be moving… >>: [inaudible]. >> Prince Mahajan: So the basic metadata without the receipts is about 200 to 300 bytes, but we don't add the cost of that metadata to the GETs; instead we charge it to the PUTs, because each client ends up storing that metadata, so in the accounting we add that cost to the cost of a PUT instead of adding it to the GET. >>: [inaudible] the objects you're putting are rather large [inaudible] relative to the metadata… >> Prince Mahajan: Yeah, so if you have small objects, indeed, the cost of the system may be prohibitive.
In this case we looked at 10 kB objects, so if you have smaller objects, like 500 bytes or smaller, then you would end up paying a higher cost. >>: Is the metadata a fixed size or can this storage factor vary? >> Prince Mahajan: It is fairly fixed in size; I think in most cases you can argue that the version vectors are going to remain of fixed size. >>: [inaudible] writers, which in real life doesn't happen. >> Prince Mahajan: The other reason why our version vectors stay fixed in size is that, instead of attaching a complete version vector, each update carries only an incremental version vector over the previous update. Because the system is processing updates in a consistent order, you can remove the information that was already present in the previous version vector and only encode the new entries. In practice, that does not change much. So this is the overhead for PUTs. As you can see, Depot, as well as Depot without receipts, incurs about a 20% cost for metadata transfer between clients and servers. It is about 20% because we have about 300 bytes of metadata that gets replicated to each of the eight clients, so about 2 kB of metadata transfer against the 10 kB of data being transferred. >>: [inaudible] replicate the metadata everywhere, so you end up using 20 or 30% of… You can only have cloud storage that is three or four times larger than your local storage, because your metadata winds up being a third of your total. You lose a lot of the advantages of cloud storage if your objects are that tiny, because you have to have the metadata… >> Prince Mahajan: Yeah, you need to measure that. >>: On the other hand… >>: We don't lose the advantages. It just costs a lot more. >> Prince Mahajan: It costs a lot more. >>: You've got to have local storage. You can only multiply local storage by three if you have 10K objects. Of course the answer is don't have 10K objects [laughter]. >> Prince Mahajan: Yeah. So again, the storage cost at the client is high because in a traditional system you don't have to store anything at the client, whereas in Depot we need to store this metadata at the client; furthermore, in this deployment the clients were also configured to store a copy of the data that they create, because we are using the local-backup approach for ensuring durability, so there was one copy stored at the SSP and another copy stored in the local data center, and that's why the additional cost is even more significant. The CPU utilization of the client is high because you are performing checks for these receipts, and all of that adds up; similarly, the CPU utilization of the server is particularly high for the receipt variant because the server is now generating these cryptographic signatures. Next, let's look at how much it actually costs in dollar terms to run a system like Depot. To do that we can weight these resource utilization numbers using cost numbers that we gathered from existing pricing. What this tells us is that, as expected, to perform a terabyte of GETs you pay very little extra cost, less than 5%. But to perform a terabyte of PUTs you pay almost 100% more; you double the cost needed to perform the terabyte of PUTs. >>: [inaudible].
>> Prince Mahajan: This is for 10 kB objects. And for storage you again pay about 20% extra cost, because you are storing metadata as well as extra copies of data. So once again, the point is that these costs are high, but they are not that high. They seem acceptable to me. I don't know if they will be acceptable to clients or not, but again, these overheads can be dramatically reduced if you remove cryptography from… >>: [inaudible]. >> Prince Mahajan: Or you don't have 10 kB objects. >>: Which you want, right? Anytime we have looked at file systems and stuff, what you find is that even if you have a bunch of small objects, your mean object size is going to be orders of magnitude bigger than that, and your overhead is fixed per object, so you are probably making yourself look worse than in reality. >>: File systems, you said? >>: File systems. [inaudible] file systems, what you find out is that the mean file size is hundreds of kilobytes to megabytes. And there are a tiny number of giant ones. >>: Is that a direct comparison to database systems, though? I think this was designed more as a stand-in for a database rather than a stand-in for a… >> Prince Mahajan: Well, this is… [inaudible] [multiple speakers]. >>: Right. Your database records are going to come in under 10K probably. >>: Well, is there any data from Amazon on how big the S3 objects are? >> Prince Mahajan: Well, they are expected to be several gigabytes. >>: Perhaps for S3, but Dynamo is a better… >> Prince Mahajan: Yeah, Dynamo objects are much smaller. In this part of the talk I introduced a system called Depot that takes the approach of minimizing trust, and we showed how, despite minimal trust, we can achieve some very strong properties, stronger than those provided by existing systems. In terms of related work, the key point to take away is that there is a lot of related work focused on building systems that minimize trust or improve availability, but the main distinction between all of these systems and Depot is that Depot is designed to really minimize trust, whereas these other systems go halfway: they try to reduce trust in the system but don't entirely eliminate it, and similarly, they don't achieve the same level of availability that Depot promises. Broadly speaking, my research is focused on investigating the trade-offs between consistency, availability, and fault tolerance. What Depot tries to do is understand the practical limit of the properties that can be provided if you want to hold on to availability and fault tolerance. In contrast, I've also done some theoretical work in trying to understand the theoretical limit of the properties that can be provided with high availability and fault tolerance. What we ended up showing is that a variant of causal consistency is really the optimal you can get if you want to provide high availability, and we have some similar results, though not as strong, for environments where you want to minimize trust in addition to maximizing availability. In the next part of the talk I am going to briefly touch upon a system that I am currently working on, which is focused on the other end of the question: if you want to hold onto consistency, if you always want to be able to support ACID transactions, then what is the best availability that you can get?
In particular, we are designing a system called Salt that integrates ACID transactions with the BASE approach. So next I am going to talk about how we mix ACID and BASE in a system called Salt. It is not a chemistry talk. [laughter]. We all fully understand the benefits of ACID transactions. They are used pretty much everywhere to encode business logic, to perform banking transactions, and so on. And they provide you the very attractive properties of atomicity, meaning that either the entire transaction completes or no effect of the transaction is left; consistency, meaning that the database transitions from one consistent state to another; isolation, meaning that transactions do not observe the intermediate state of other transactions; and durability, meaning that the results of committed transactions remain durable. The reason distributed databases have low availability with ACID transactions is that, in order to perform an ACID transaction that spans multiple components, you want each of those components to be available and reachable. For example, if you want to perform a transaction that involves two components X and Y, then if either X is down or unreachable, or Y is down or unreachable, your transaction will not be available. In contrast to this ACID approach, there is an approach called the BASE approach. Unlike the ACID approach, there are no well-defined semantics for BASE; there are various variants of this approach, and different systems implement various flavors of it. The only underlying idea that is common to all of them is that we are trying to compromise consistency to get better availability and better performance. This approach can be applied on top of an existing database, by using the database to perform only some very minimal tasks, or it can be used on top of the NoSQL systems out there, or weakly consistent data stores like Bigtable and so on. The main idea is that the application now has to deal with the task of enforcing the properties that were previously enforced by the system. The reason this approach improves performance is that the application does not have to deal with providing the transaction properties [inaudible]. Instead, the application can issue each of these operations separately, so the system remains available: if I have to update X, the system will be available if X is available, and similarly, when I have to process Y, the system will be available when Y is available. So I don't need simultaneous availability of both X and Y in order to perform the same computation I was doing before. This comes with better availability and better performance, because you don't need to perform the two-phase commit and you don't have to hold the locks throughout the duration of the transaction. However, it also complicates programming, because it's possible that while a particular operation is partially completed, some other operation can come in in between and access this inconsistent state. Typically you would handle that situation by signaling to the application, or the application might signal to the user, that the results of some of the most recent operations might not be reflected in the result that I am showing you.
For example, the user is trying to see his account balance, and there might be a notification or a disclaimer that it may not reflect the most up-to-date balance. So there is, again, a conflict between ACID and BASE. Both have their advantages: ACID provides simplicity of programming but at the cost of availability; BASE provides availability and high performance at the cost of ease of programming. In this work we are asking: can we actually provide the benefits of both of these approaches without being subject to their limitations? That's why we are building a system that combines the ACID approach with the BASE approach. This seems like a good thing to do for two reasons. First, most workloads do not consist mostly of performance-critical transactions. Most of the time you have very few transactions for which you care about performance, and a lot of other transactions which are performance-oblivious. If you are able to combine these two approaches in the same implementation, I don't have to rewrite all of those performance-oblivious transactions; I just need to focus on rewriting, and doing more work for, the transactions that are performance-critical. Secondly, this also provides an incremental adoption path. Think of an enterprise that gets acquired by Microsoft: it's going to see a gradual increase in workload, but the increase will not happen overnight, so the enterprise can incrementally identify the bottlenecks in its workload and fix them over a period of time rather than having to atomically transition from an ACID system to a BASE system. There are various challenges in supporting both of these approaches simultaneously, and I am going to talk about one particular challenge next. The key problem is that the presence of the BASE paradigm alongside the ACID approach can corrupt the consistent view that the ACID transactions expect to see. Suppose we have a transaction that we are executing using the BASE approach, and we have executed only the first part of it. In between, an ACID transaction comes in and reads both X and Y. You would expect the invariant that the sum of X plus Y remains constant to be preserved, but because the ACID transaction has seen the incomplete state of a BASE transaction, the invariant will not be preserved. To address this and other limitations we introduce a new primitive called a BASE transaction. A BASE transaction has a similar structure to an ACID transaction, as you can see here, and it provides a number of properties very similar to those we expect from ACID transactions. In particular, it provides a notion of atomicity, which means that the entire transaction will eventually be executed to completion. It provides durability for completed transactions. For integrity enforcement, it introduces exceptions: every time the BASE transaction tries to perform an operation that violates the application's integrity constraints, the result is an exception, which the BASE transaction can handle locally. It may contain ACID transactions to simplify programming. The trickier property it provides, in addition to these other good features, is that the presence of these transactions does not affect the correctness of concurrently executing ACID transactions.
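A minimal sketch of the integrity-exception idea, again in illustrative Python; the class and function names are mine, not Salt's API. A step that would violate an application invariant raises an exception that the BASE transaction handles locally instead of aborting everything.

```python
# Illustrative only: exceptions for integrity enforcement in a BASE
# transaction, handled locally by the transaction itself.
class IntegrityViolation(Exception):
    pass

balance = {"alice": 20}
pending_credits = []

def debit(acct, amount):
    if balance[acct] - amount < 0:       # invariant: no overdrafts
        raise IntegrityViolation(acct)
    balance[acct] -= amount

def base_transfer(src, dst, amount):
    try:
        debit(src, amount)                     # first fragment
        pending_credits.append((dst, amount))  # deferred second fragment
    except IntegrityViolation:
        # Handled locally: notify the user, retry, or compensate.
        print(f"transfer of {amount} from {src} rejected")

base_transfer("alice", "bob", 50)   # trips the invariant; exception handled
```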
Furthermore, it also does not affect the performance of concurrently executing ACID transactions, and I will explain on the next slide why this might be a problem. Finally, it ensures that we can provide the same availability and performance as we expect from the BASE approach using this BASE transaction primitive. Let me give you a sense of the two key ideas we use to implement BASE transactions. First of all, a BASE transaction is serialized as a sequence of atomic transactions, so you can think of each fragment as an ACID transaction, although they are not strictly ACID transactions for technical reasons. So in the overall serialization history, a BASE transaction is going to appear as several small fragments; the same BASE transaction appears at multiple points in the history. In order to ensure the correctness of ACID transactions, these BASE transactions acquire what we call tainted locks, which mark that the objects they cover are being modified by an incomplete BASE transaction. The result of these tainted locks is that while other concurrently executing BASE transactions can come and access the intermediate state of a BASE transaction, ACID transactions are prevented from accessing that incomplete state. To ensure that ACID transactions give the same performance in the presence of BASE transactions, we need to add a few more mechanisms. The key problem here is that because we execute a BASE transaction as a series of small atomic transactions, and each one of them requires performing a disk write, a BASE transaction consisting of, say, five separate mini-transactions will take roughly five times as long as an ACID transaction of the same length, because an ACID transaction performs all of its disk writes in parallel. As a result, without the mechanism I'm about to describe, BASE transactions might significantly impede the progress of ACID transactions by blocking them. Instead, the key idea we use is to break the commit of a transaction into two parts. First is the consistency commit, which ensures that we still perform the [inaudible] and release all locks, but the data is stored only in memory after this first commit. Later we perform a second commit, the durability commit, which happens asynchronously in a distributed fashion, so it does not require any coordination between nodes: each node can individually flush its data to disk, and there is a recovery protocol to ensure that when we recover the system, we recover it to a consistent snapshot. And there is some logic to ensure that only the results of transactions that are both committed and durable are exposed to the application, so the application does not see the effect of a transaction that has released its locks but has not yet made it safely to disk.
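Here is one way the split commit could look, as a hedged Python sketch; this is my own simplification under assumed names, not Salt's implementation. The consistency commit releases locks and makes results visible in memory, the durability commit flushes to disk asynchronously, and results reach the application only once durable.

```python
# Illustrative two-part commit: consistency commit now, durability later.
import threading

memory_state = {}   # consistent state, possibly not yet durable
disk_state = {}     # durable snapshot used by recovery

class MiniTxn:
    def __init__(self, writes):
        self.writes = writes              # {key: value}
        self.durable = threading.Event()

def consistency_commit(txn):
    # Locks would be released here; other transactions proceed without
    # waiting for any disk write.
    memory_state.update(txn.writes)
    threading.Thread(target=durability_commit, args=(txn,)).start()

def durability_commit(txn):
    # Asynchronous and per-node: no cross-node coordination needed.
    disk_state.update(txn.writes)         # stand-in for a log flush
    txn.durable.set()

def read_for_application(key, txn):
    # Expose only committed-and-durable results, so a crash never
    # reveals a write that released its locks but was lost.
    txn.durable.wait()
    return memory_state[key]

t = MiniTxn({"x": 42})
consistency_commit(t)
print(read_for_application("x", t))
```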
To summarize this part of the talk: I discussed an approach to combining ACID transactions with BASE transactions, and to do so we introduced a new primitive, the BASE transaction, which enables us to isolate the intermediate state of BASE transactions from concurrently running ACID transactions while preserving the latter's correctness as well as performance. These are some of the other publications I've had, and I would be happy to talk about any of this related work if you would like. In conclusion, my research has explored extreme points in the consistency-availability trade-off. With the Depot work and with some security work, I explored the point where we hold on to availability and fault tolerance and try to see what is the best consistency you can get. With my recent work, we are investigating situations where you want to hold on to strong consistency in the form of ACID transactions and seeing how we can improve the availability of those systems. With that I will be happy to take any questions. [applause]. >>: I might have just one. For Salt, do you have specific applications in mind, like customers for the system? >> Prince Mahajan: For the first part of the talk or the second? >>: The first part. >> Prince Mahajan: For Depot? >>: For Salt. >> Prince Mahajan: For Salt? We are looking at the TPC benchmark as an example, and we also looked at the [inaudible] benchmark. There are a lot of workloads we have looked at where it makes sense to combine ACID with BASE. Some examples are situations where you're computing an index but the consistency of the index is not critical. For example, when you add a new item to be sold on eBay, the item has a category, and you would normally want the item to show up when people search for that category, but it's usually acceptable if the item doesn't show up instantly and instead shows up a little later. Similarly, you can use the BASE approach to perform very quick e-commerce checkouts, because one thing I didn't discuss about these BASE transactions is that they include a notion of a transaction where you can specify that this part should be executed right now and the rest of the work done asynchronously in the background. So if a customer tries to purchase an item, you can have the stock quantity of that item decremented asynchronously, and then you can charge the credit card, update the order, and do all of the other background tasks in the remaining BASE transaction; that's another place where it makes sense. Similarly, in banking applications it might make sense to decrement the source balance first in a synchronous transaction and then increment the balance of the destination account later in the background. So pretty much anywhere you can think of relaxing consistency, or of deferring part of the work, you can benefit from the BASE approach. >>: I have a question about the [inaudible] with the [inaudible] multiple versions of an object. Does that only happen because of forks, or are there other… >> Prince Mahajan: No, it can happen due to legitimate concurrency as well, because if two clients concurrently try to update an object, that can certainly result in concurrent versions of the object. It can happen in Amazon S3 as well.
In Amazon S3, however, when you do a GET, the system automatically tries to resolve the conflict for you using some [inaudible] timestamps. You could overlay that kind of approach on top of Depot, saying I am going to associate timestamps with versions and I just want to see the version with the highest timestamp. That might make sense in some applications, but in other applications it might make sense to see all of the concurrent versions and try to merge them in some way. So basically we leave this out, and you can layer whatever approach makes sense for your application on top of Depot. >> Rich Draves: Okay. Let's thank Prince again. [applause].