
>> Rich Draves: All right folks, why don't we get started? I know we're still 2 minutes ahead of our traditional time, but I think we have critical mass and we'll just punish anyone that shows up late. I'm pleased to introduce Prince. Prince is of course from UT Austin, one of Mike Dahlin's students there. I guess Mike and Lorenzo together, really, right?
>> Prince Mahajan: Yeah, I think that Mike was my primary advisor.
>> Rich Draves: Okay. Actually Prince is not a stranger to MSR. Prince interned with us back in 2004 working with Galen on similarity. More recently he's done an internship with our colleagues in Silicon Valley working with Ted Wobber and Doug Terry and other folks there, so he knows something about MSR. He's with us here today and tomorrow interviewing for a postdoc position, so welcome Prince.
>> Prince Mahajan: Thank you, and thanks everyone for coming. Today I'll be talking about cloud storage with minimal trust. The body of this talk is going to be on a system called Depot, which attempts to embody this philosophy. Towards the end of the talk I will also talk briefly about some ongoing work on bridging the gap between ACID distributed databases and these data stores that provide [inaudible] based on workflow. Let me start by talking about Depot. Depot is a cloud storage system in which clients do not have to trust, that is, assume the correct operation of, the storage servers or the other clients. To motivate this model, consider the example of a hypothetical picture-sharing website called CloudPic that uses a storage service provider (SSP) such as Amazon S3 to store its data. Whenever a user wants to store new pictures, CloudPic simply pushes them to the SSP, and whenever the user wants to access his pictures, CloudPic fetches them from the SSP on demand. This model is appealing because the SSP handles all of the details that are necessary to build highly reliable storage, details such as geographic replication, scrubbing for latent errors, provisioning for load spikes, et cetera. Because of economies of scale it can do so at a very attractive cost, so considering these benefits it should not be surprising that a lot of people have adopted this model over the last few years. Despite these benefits, however, I argue that there remain significant risks associated with cloud storage. The first is that cloud storage is a black box to its users, or its clients. The users are not really aware of what best practices the storage service follows or what replication policies are implemented to ensure reliability. So when I store an object onto the storage service I don't necessarily know if it's going to be geographically replicated or not, and if so, at what granularity. Is it going to be replicated to additional sites within a few seconds, within a few minutes, within a few hours, or at the granularity of days?
>>: [inaudible]?
>> Prince Mahajan: Yeah, but you don't necessarily know. For example, for geographic replication, when you push something to Amazon S3 you don't necessarily know if it's immediately going to be replicated to two nodes and then gradually replicated to three nodes, and if so, at what granularity that is going to happen. So the policies may be there at some level, but they are necessarily vague, perhaps for good reason.
>>: Okay.
>> Prince Mahajan: The second challenge with cloud storage is that, like any large-scale distributed system, cloud storage is complex. It's prone to both software failures and hardware failures, and to failures that result from external factors such as natural disasters. The repeated occurrence of outages in these large-scale services certainly does not help user confidence in them. The key problem in using cloud storage, I argue, is that at this point we have a conflict. On one hand, cloud storage offers
numerous benefits, but on the other hand there are certain risks associated with using cloud
storage service. The fundamental reason behind this conflict is trust, so at this point cloud
storage services are not entirely reliable yet, but still we trust them. We trust them with the
durability of our data. We trust them to remain available whenever we want to access our data
and we trust them to provide the right data whenever they send anything back. So in this work I'm going to argue for an approach to minimizing trust. What I mean by that is that we want to
minimize assumptions about both the correctness as well as the availability of the storage
service. What this means is that while we expect that usually the storage service will return the
right data, we should be prepared to handle situations when the data it returns is potentially
inconsistent, or in situations where the data is corrupted or even in situations where the
storage becomes unavailable. Let's see how something like this can potentially be
implemented. Our goal is really to enable the client to be able to separate good data from the
bad data, so whenever the cloud service responds back with the data, if the data is correct the
client will end up accepting it. However, if the data is incorrect or potentially inconsistent or
corrupted, the client will be able to reject the data. Furthermore, to handle situations where
the cloud service becomes unavailable, we want to enable the client to specify what replication
policies it wants to implement. For example, the client might want to use multiple SSPs or it
might want to keep a local copy of its data in its local storage, and it should be able to access either of these stores transparently without having to compromise consistency or correctness. So I argue that this approach would be appealing to both the users as well as the
SSP. The reason it would be appealing to users is because they can sleep peacefully at night
knowing that an overnight failure of the SSP will not make them bankrupt. This increased trust can translate into increased cloud utilization, which for the SSP can in turn lead to greater revenue. The key question, however, remains: can we actually build a useful system despite
these weak assumptions? The answer it turns out is yes. We built a system called Depot which
embodies this approach. It's a key-value store similar to Amazon S3, and in this system clients do not need to trust the servers. As a result the system can continue functioning despite routine failure of a single machine, correlated failure of multiple machines, or even a situation where the cloud service goes down. We further tolerate situations where some of the clients may behave arbitrarily. Perhaps surprisingly, despite these weak assumptions, Depot is able
to provide stronger properties than those provided by existing highly available storage
deployments. In particular, Depot provides a variation of causal consistency on each of its
volumes. In contrast, Amazon S3 only guarantees eventual consistency. Furthermore, Depot is able to ensure availability despite fail-stop faults, corruption, or correlated failures of a set of machines, and Depot is able to guarantee durability despite correlated failures of machines in
the cloud. So for the rest of the talk I will basically consider three key properties that Depot
provides: correctness, availability, and durability. For each of these properties I'll explain where the trust lies in existing systems and what Depot's key goal is in terms of minimizing that trust. Then I will explain the key techniques that Depot uses to achieve this goal of minimal trust. Finally, I'll discuss the cost of achieving these minimal-trust guarantees in Depot. Let me start by talking about
correctness. In an existing cloud storage service whenever that storage service returns an
answer, the client expects that the answer will be correct, so that's where the trust lies. Yes?
>>: When you say client do you mean Yelp, or do you mean the customer of Yelp?
>> Prince Mahajan: In this case we talk about enterprises that use the storage service so that
would be Yelp. Whenever the client receives an answer from the storage service it expects that
the answer is going to be correct and so in some sense it is trusting the service to return the
right answer. As a comparison point, in traditional state machine replication or quorum-based systems, the client trusts that whenever a quorum of machines returns matching responses, the response is going to be correct. In contrast, the goal in Depot is to eliminate the need to trust either the SSP or the other clients. We are going to achieve this goal by enabling the client to distinguish correct data from incorrect data. In order to go further I need to first
define what it means for data to be correct. So I am going to do that by defining the
consistency that Depot provides. The diagram here depicts the spectrum of consistency
semantics that are provided by existing systems. On the extreme right-hand side we have very strong consistency, which is provided by systems like Azure, MegaStore [inaudible] and so on. These semantics offer a very intuitive property: each GET that you perform is going to return the result of the most recent PUT that was performed on that object. However, some fundamental trade-offs prevent us from achieving these properties with high availability and minimal trust. In particular, the CAP theorem argues that you cannot achieve strong consistency without compromising availability, and similarly, lower bounds for Byzantine agreement argue that you cannot achieve this property with minimal trust. On the extreme left-hand side we have systems like Amazon S3 or Dynamo that provide very weak eventual
consistency properties. These properties are achievable, however, they are very weak and they
make the task of programming very difficult. The key reason is that these systems do not provide even a very simple property that programmers expect: if I perform a PUT and immediately afterwards perform a GET, with no other PUT happening in between, I expect that the GET will incorporate the result of the PUT I just performed. But these systems don't guarantee this property. As a result, programmers have to build in artificial best-effort mechanisms, adding artificial delays and hoping that replication happens within this period of time, and so on and so forth. Furthermore, these semantics do not preserve dependencies between PUTs that are performed by the clients. For example, if a client reads the result of one PUT and then performs another PUT that depends on the first PUT that it read, the system does not guarantee that these two PUTs will be observed in the same order by all of the other clients. To
address these limitations of existing semantics we have basically designed a new consistency
semantic that is called fork-join causal consistency. Our goal with fork-join causal consistency is to provide the strongest possible semantics that can be provided with high availability and minimal trust. Fork-join causal consistency is a slight technical weakening of causal consistency, designed to accommodate environments where you need to minimize trust. In particular, fork-join causal consistency retains the essence of causal consistency. That is, it provides the property that if a client performs a PUT and then performs a GET, with no other operations in between, the GET is going to return the result of the PUT it just performed. Similarly, if there are two dependent PUTs, they will be observed in the same order by all of the other correct clients. This property can be useful for a number of reasons. First, it improves user experience, because the real behavior of the system matches the behavior users expect. Secondly, it makes the task of programming easier, because if I add a new object and then add a reference to this object, I am guaranteed that the reference will not be seen without the original object that I added. Finally, it also helps us in layering other properties on top of fork-join causal consistency.
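To make the dependency guarantee concrete, here is a toy sketch, not Depot's actual code, of a node that buffers an incoming update until every update it causally depends on has been applied, so that a reference is never exposed without the object it points to. The class and field names are hypothetical, and Depot encodes dependencies compactly as a version vector plus a history hash rather than as explicit sets of update IDs as done here.

```python
# Toy sketch of dependency-preserving delivery (hypothetical names).
# An update is applied only after all updates it depends on are applied.

class Node:
    def __init__(self):
        self.applied = set()   # ids of updates already applied locally
        self.pending = []      # updates whose dependencies are still missing
        self.store = {}        # key -> value

    def receive(self, uid, deps, key, value):
        """deps: ids of updates that happened before this one."""
        self.pending.append((uid, frozenset(deps), key, value))
        self._drain()

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for entry in list(self.pending):
                uid, deps, key, value = entry
                if deps <= self.applied:       # all dependencies seen
                    self.store[key] = value
                    self.applied.add(uid)
                    self.pending.remove(entry)
                    progress = True

n = Node()
# The reference arrives first, but it depends on the object's update,
# so it is buffered instead of being exposed out of order:
n.receive("u2", {"u1"}, "ref", "points to obj")
assert "ref" not in n.store
n.receive("u1", set(), "obj", "the object")
assert n.store == {"obj": "the object", "ref": "points to obj"}
```

Once the missing dependency arrives, the drain loop applies both updates, so any node that sees the reference has necessarily seen the object first.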
>>: [inaudible] the places where it gets weaker than causal consistency is when somebody else adds an object and then you add a reference to that object. You may not have a guarantee that the reference is valid. Is that the…?
>> Prince Mahajan: No, that's not it; you'll still get that property. It retains most of the properties that causal consistency offers. The way in which it is weaker than causal consistency is that in causal consistency you have an expectation that all of the operations performed by a given client will be totally ordered. If clients can be arbitrarily faulty, you cannot uphold that expectation in this environment, so for clients that are faulty this total-ordering property is compromised. Dependencies, however, are still preserved in all scenarios. To explain the benefits of fork-join causal consistency, let's revisit our example from before. Suppose that my advisor and I use a hypothetical picture-sharing website to share pictures.
Now suppose that the system is just eventually consistent. Recently I went on a secret trip, and I didn't want Mike to see the pictures from this trip, so I decided to remove Mike from the list of people who are authorized to see my album. Unfortunately, a simple [inaudible] partition, or even simple replication policies, can prevent the propagation of this update to all of the servers in the system. Now, if I later try to add new pictures, it is very much possible, due to a simple load-balancing decision or due to another [inaudible] partition, that this request is processed by a different set of servers than the ones that processed my earlier request. A consequence of this partitioning, or of this eventual consistency, is that if Mike later tries to access my album, he will be able to see the pictures that I didn't want him to see. In contrast, if the system provided fork-join causal consistency, then no node could process my second request without having processed my first request, and so my advisor would be prevented from seeing the pictures that I didn't want him to see. Yeah?
>>: In this example all of your operations are going through one cloud [inaudible] node.
>> Prince Mahajan: Right.
>>: All of Mike's requests are going to another node. Is that something that a provider like CloudPic has to do to ensure, if they want to…
>> Prince Mahajan: No. The property that we provide, fork-join causal consistency, is provided to the clients of the cloud storage, which in this case are CloudPic's servers.
>>: [inaudible] treated as one?
>> Prince Mahajan: No, no. These requests can be processed by the same client or they can be processed by different clients. That does not affect the correctness provided by the system. The one requirement, however, that I would like to point out is that these guarantees will only be preserved if the requests of the same user are processed by the same client, so…
>>: [inaudible].
>> Prince Mahajan: Yeah, so there are multiple ways to either enforce that or to fix situations where it is not enforced. You can obviously maintain some kind of session to ensure that requests by the same user that are issued within a small time window are processed by the same machine in CloudPic. Alternatively, you can also employ cookies in the user's browser, so that these cookies carry a short summary of the state that this user has seen; later on, if the user migrates, due to load balancing or due to a failure, to a new machine, the new machine can ensure that it has seen the prefix of state that this user saw at the previous machine. So you can enforce those kinds of policies in the browser or at the enterprise level to make sure that the good properties you are getting from the cloud are also transferred to your users if you need to. So far I
discussed the issue of which consistency should be provided by a system like Depot which
intends to be highly available despite minimal trust. Next I'm going to describe some of the key techniques that Depot uses to achieve these properties. At a high level, this
diagram depicts the architecture of Depot. Data is partitioned into volumes and each volume is
replicated over a set of servers that do not require overlapping read and write quorums to
ensure high availability. Ideally we would like Depot to run only at the client in the form of a library; however, our current prototype requires us to run code on both the client machines and the cloud servers. But we have some ideas on how to change that, how to provide most of Depot's properties with simple API changes in the cloud interface. This diagram depicts how Depot can provide fork-join causal consistency with minimal trust.
>>: On the previous slide, which cloud service providers does the Depot prototype work with?
>> Prince Mahajan: Right now we don't work with any; we basically have our own store, which we need to run on the server, so…
>>: [inaudible]?
>> Prince Mahajan: You can, so the purpose of designing the system like Depot was to
understand what properties can be provided and also drive from that understanding what are
the API changes that are potentially needed to provide these properties in subsequent systems.
To some extent we have succeeded in that goal, because we now understand better what will be needed from these cloud storage providers to provide these properties: basically, if you have all of these API changes you get all of these properties; if you have fewer changes, these are the properties you get, and so on and so forth.
>>: Will you be talking about what those changes need to be?
>> Prince Mahajan: Not in this talk, but I can talk about them off-line.
>>: [inaudible] the clients have some idea about how servers at the data center or which ones
are running on the same physical machine and which ones aren't?
>> Prince Mahajan: No, because the clients are dealing with virtual machines. All the clients need to do is be able to send messages to a node with a given virtual ID and be able to authenticate messages that are coming from that virtual ID. Whether those messages are coming directly or through some other route is not relevant.
>>: You have some table of metadata around the storage itself and [inaudible] are you…
>> Prince Mahajan: Right. Right now we store some metadata and perform some checks on the storage servers because, again, in Depot we have this goal of providing extreme fault tolerance, so we even wanted to isolate one server from faults that happen at another server. But it's plausible that a deployment might want to weaken those assumptions and say, I am willing to accept data from other servers in the same deployment without actually checking it; in that case we can eliminate the need to store all of this metadata and perform a lot of these checks on the servers.
>>: But it seems like the magic here is some consistency table that's maintained around all of
the storage and all of the transactions. Is that roughly accurate, or am I missing something?
>> Prince Mahajan: I don't understand; what consistency table?
>>: Right so the only way you would know definitively that one node has been updated and
another unrelated node hasn't gotten that change yet is if you recorded those updates
somewhere and then you check them against…
>> Prince Mahajan: You do indeed need to record those changes in some form, but you don't necessarily need to know authoritatively what the most recent version of an object is. The reason is that we are not trying to provide strong consistency semantics. If you had very strong consistency that [inaudible] reliability, then you would need to know what the most recent version of this object is and where it is placed. On the other hand, if you're trying to provide the weaker semantics of causal consistency, you do not need to ensure that each GET returns the most recent PUT performed on that object. It is sufficient to ensure that the GET returns the most recent PUT known to the client who is performing the GET. So the properties are weaker, and as a result some of the requirements that you would otherwise have to enforce are not needed. Does that make sense?
>>: It does. I just, does it meet your criteria of making sure your boss won't see your pictures
that you don't want him to see?
>> Prince Mahajan: It does, because if the same client is performing successive operations, then these operations will remain dependent on each other and they will be observed in the same order, and that is how the criterion is enforced: any client that observes the second operation is guaranteed to also observe the first operation, since these are two dependent operations performed by the same client. You would actually get more than that, and I can try to explain some of the additional properties you get with this, but note that for causal consistency you have to enforce those requirements. Yeah.
>>: I have a quick question before you go on. When you say minimal trust, do you mean
minimal in the mathematical sense or English sense?
>> Prince Mahajan: In the English sense [laughter].
>>: I don't know, maybe I [inaudible] [laughter].
>> Prince Mahajan: All right. Let's see how we can enforce fork-join causal consistency with minimal trust. What we do is attach some metadata to each PUT. Logically, this metadata summarizes the history of the PUT, that is, all of the previous PUTs that were observed by the client before this PUT was performed. This metadata is then replicated to all of the machines, all of the clients as well as all of the servers in the system. It forms a part of the local state, and each subsequent GET that a client performs is going to be checked against this local state that the client maintains. This is the key to enforcing consistency. And finally, before accepting new metadata, clients perform some checks to ensure that the new metadata is consistent with what they have seen in the past. So let me talk about what exactly this metadata is and what key checks a node performs.
>>: It needs to be replicated at all nodes?
>> Prince Mahajan: All of the clients…
>>: Do you mean eventually or…
>> Prince Mahajan: Eventually.
>>: Because you can't possibly mean write everywhere.
>> Prince Mahajan: Yeah. So basically all of the clients as well as all of the servers that share a volume will end up having some metadata for all of the updates that have been performed in the system; but again, just the metadata.
>>: [inaudible] but a client would check for metadata…
>> Prince Mahajan: You can't check it on demand.
>>: [inaudible] and the GET and that would be the way you would ensure this consistency.
>> Prince Mahajan: Yeah.
>>: All right.
>> Prince Mahajan: So let me talk about what this update metadata is. It has some expected fields, such as the node ID, the key that is being updated, and a hash of the value that is being written. In addition, it includes two new fields. First is a version number that is assigned by the client performing the PUT, and second is a compact encoding of the history; this encoding consists of a version vector and a secure hash computed over the local history. Nodes store this update metadata until it is garbage collected and compressed into a version vector. So let's see what checks a node needs to perform when it receives new metadata. Whenever a node receives a new update, it performs two checks. First, it ensures that all of the updates that are present in the history of this update are also present in its local history. This amounts to performing a simple version-vector inclusion check, verifying that the version vector included in the update is subsumed by the version vector maintained at the client; in addition we compute a hash to deal with situations where corruption faults can corrupt the state. Second, the node checks that the versions created by a given client are monotonically increasing, so that the client does not end up reusing versions for different updates. So the system works fine if there are no arbitrary faults in the system; however, if a client experiences a loss of state or a corruption fault, it can lead to a problem, and this problem is called forking.
a loss of state or a corruption fault, it can lead to a problem and this problem is called forking.
In particular, what can happen is suppose that the client uses up versions to version number
five and then it dies or loses its most recent state so it sort of reboots and it starts reusing
versions three, four, three onwards. So as a result what's going to happen is that there will be
two different updates both with a version number three. In this case is a faulty client F exposes
these two different updates with the same version number to different clients, then what can
happen is these two clients become forked. What I mean by forking is now each of these
clients has individually seen a consistent view of the system, but when taken together the
overall state of the system is not consistent. As a result of this forking these clients cannot
subsequently exchange updates because if client A tries to send messages to client B which
depend on the different version of update that it has seen, the client B will not be able to verify
it and similarly when client B tries to send messages to clients A, client A will not be able to
verify it. And this is not a new problem. The concept of forking was invented about 10 years
back by the Sunder [phonetic] system. The key problem in all of these systems is that once the
system gets forked, the system cannot attain eventual consistency, because these correct
clients have seen mutually inconsistent histories are logically partitioned from each other. They
are prevented from exchanging updates from this point onwards, and this is a fundamental
limitation which is not acceptable for a storage system to not be able to provide eventual
consistency. To address this problem Depot has a new mechanism for joining forks. What
happens when a client observes that there are two mutually incompatible histories is that it
pretends that the faulty node F is instead a collection of two correct virtual nodes F prime and F
double prime. By doing this, converging, subsequently these correct nodes are allowed to
exchange updates, but we have learned that this faulty node F has created inconsistent updates
and so frequently we can evict this node from the system and Depot includes mechanisms for
doing that kind of an eviction. So by doing this conversion of a faulty node into multiple correct
nodes, we are basically converted corruption faults, or an arbitrary fault into logical
concurrency which these are prepared to handle. In this part of the talk what we learned is
what it means to minimize trust for correctness, what consistency Depot provides to achieve its
goal of minimal trust. In particular, we discussed a new consistency mechanism called for join
causal consistency that is strong enough to be useful, but yet weak enough to be enforceable in
Depot’s weak assumptions and the key idea was to be able to reduce failures into concurrency
and in addition to all of this, we provided a protocol for enforcing for join causal consistency.
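The two acceptance checks and the fork detection just described can be sketched roughly as follows. This is a simplified illustration with hypothetical names; real Depot summarizes history with a version vector plus a secure hash of the history, whereas here the history check is only the bare version-vector inclusion test.

```python
# Simplified sketch of the acceptance checks. An update carries
# (writer, version, deps, payload), where deps is a version vector that
# summarizes the history the writer had observed. A receiver accepts the
# update only if (1) its own version vector already covers deps, and
# (2) the writer's versions are monotonically increasing. Two different
# updates carrying the same (writer, version) reveal a fork.

class Replica:
    def __init__(self):
        self.vv = {}     # node id -> highest version accepted from that node
        self.seen = {}   # (node id, version) -> payload, for fork detection

    def check_and_apply(self, writer, version, deps, payload):
        # Check 1: history inclusion. Everything the writer had observed,
        # this replica must also have observed (Depot additionally verifies
        # a secure history hash, omitted here).
        for node, v in deps.items():
            if self.vv.get(node, 0) < v:
                return "reject: missing dependencies"
        # Fork detection: same (writer, version) but different contents
        # means the writer issued two incompatible updates; Depot would
        # split this writer into two correct virtual nodes.
        key = (writer, version)
        if key in self.seen and self.seen[key] != payload:
            return "fork detected"
        # Check 2: a writer's versions must be monotonically increasing,
        # so versions cannot be silently reused for new updates.
        if version <= self.vv.get(writer, 0):
            return "reject: stale or reused version"
        self.seen[key] = payload
        self.vv[writer] = version
        return "accepted"

r = Replica()
assert r.check_and_apply("A", 1, {}, "put x=1") == "accepted"
assert r.check_and_apply("B", 1, {"A": 1}, "put y=2") == "accepted"
# Depends on A's version 3, which this replica has never seen:
assert r.check_and_apply("B", 2, {"A": 3}, "put y=3").startswith("reject")
# The faulty writer A reuses version 1 for a different update: a fork.
assert r.check_and_apply("A", 1, {}, "put x=99") == "fork detected"
```

The last case is where the fork-join mechanism would take over: the two conflicting version-1 updates from A would be reattributed to virtual writers A' and A'', turning the fault into ordinary concurrency.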
The next thing I want to talk about is availability. Let's try to understand where trust exists for availability in existing systems. In existing systems, a client trusts that the SSP will remain available whenever it wants to access data. Similarly, in a typical state machine replication system, the client expects that a set of nodes will be available whenever it wants to access data.
In contrast, the goal of Depot is to enable the client to define a replication quorum for each object, so that the client can control its replication policies; in Depot we want to ensure that an object remains available as long as at least one copy of the object is present at some available machine. For example, a client can specify that it wants to replicate data onto multiple SSPs, and then the data should be available as long as at least one of those SSPs has not failed. Similarly, a client may keep a local copy of an object in its own data store, and in this case the data should be available as long as either the local copy or the SSP is available. This might seem simple, but the main reason we are able to achieve this goal is
because we have eliminated the trust for consistency. In contrast, in traditional quorum
systems if we try to apply this approach we risk reading inconsistent or stale data. So let me
illustrate why a problem can arise if we didn't have this support of minimal trust for consistency. Suppose that a client performs an update, pushes it to one of the SSPs, and receives notification from the SSP that the write has completed. If the SSP later fails and the client tries to retrieve this object from the backup source, which could be a second SSP or its local data store, the backup store may not have a copy of the object that the client just pushed to the first SSP, and it will end up returning that the album is empty. Now, absent Depot's mechanisms for distinguishing correct data from incorrect data, the client will simply accept this response from the second SSP and do further processing based on it, which would violate the consistency requirement that we want to provide: if the client performs a PUT and then performs a GET, the GET should reflect the result of the PUT it just performed. However, if you have the--sorry.
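As a rough sketch of how a client that remembers its own writes can reject such a stale response, here is a toy illustration with hypothetical names and a deliberately simplified SSP interface; it is not Depot's actual protocol, which compares full history summaries rather than only the client's own versions.

```python
# Sketch: a client that remembers the versions of its own PUTs can reject a
# backup SSP's answer that silently omits them, instead of accepting an
# "empty album" as the truth.

class SSP:
    def __init__(self):
        self.data = {}
        self.history = {}        # node id -> highest version stored here

    def store(self, key, value, writer, version):
        self.data[key] = value
        self.history[writer] = max(self.history.get(writer, 0), version)

    def fetch(self, key):
        return self.data.get(key), dict(self.history)

class Client:
    def __init__(self, node_id):
        self.node_id = node_id
        self.version = 0
        self.observed = {}       # node id -> highest version this client saw

    def put(self, ssp, key, value):
        self.version += 1
        self.observed[self.node_id] = self.version
        ssp.store(key, value, writer=self.node_id, version=self.version)

    def get(self, ssp, key):
        value, history = ssp.fetch(key)
        # Reject any response whose history does not cover what we've seen:
        for node, v in self.observed.items():
            if history.get(node, 0) < v:
                raise RuntimeError("stale or incomplete response; try another replica")
        return value

primary, backup = SSP(), SSP()
c = Client("cloudpic-1")
c.put(primary, "album", ["trip.jpg"])          # written to the primary only
assert c.get(primary, "album") == ["trip.jpg"]
try:
    c.get(backup, "album")                     # backup never saw the PUT
    assert False, "stale response should have been rejected"
except RuntimeError:
    pass
```

Instead of silently processing the empty album, the client detects that the backup's history omits its own PUT and can fail over to another replica.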
>>: To have that property you've got to have somebody else that knows--if you write something to one place and nowhere else, there is no record of it anywhere else, and if that place forgets it, then it is as if it never happened. So you are storing this history back on the client also, right? It's got to be somewhere.
>> Prince Mahajan: Sorry?
>>: It's got to be somewhere or you can't have this property.
>> Prince Mahajan: What I'm saying is that it's not easy to simply replicate data on multiple SSPs if you don't have mechanisms to detect which copy is acceptable and which copy isn't. Depot does that, but if you didn't have those mechanisms, if you used a simple client…
>>: [inaudible] how Depot does that. Where is that knowledge stored [inaudible]?
>> Prince Mahajan: At the client.
>>: It's at the client?
>> Prince Mahajan: Yes.
>>: All right.
>> Prince Mahajan: So if the client has this knowledge and is able to separate correct data from incorrect data, then it can identify that this response is potentially stale or incorrect and that it should not process it. So far I discussed how Depot
minimizes trust for availability. Next I'm going to talk about durability. For durability, we want to make sure that the data eventually becomes sufficiently replicated. We have enabled the client to define where it wants to replicate data, and for durability our goal is to ensure that this data eventually does become replicated at each of those machines. What this means is that we need an agent in the system who is responsible for pushing the data to all of the replicas at which you want to replicate an object. This agent could be the client performing the PUT; it could be the SSP, configured to push the objects to the other SSPs; or it could be some background job reading data from one SSP and pushing it to another SSP. In any of these cases we are putting some trust in that agent to correctly perform this task of replication, and it is this trust that can lead to a compromise of durability. In
particular, so far in the system that [inaudible] every time a client reads or writes an object it is
trusting that the replication agent has performed replication for this object correctly. If the
application agent fails to fulfill this task it can lead to a situation where we compromise
durability. So let me illustrate how that can happen and let's take the example of client
performing the PUT. Let's take a situation where we designate the client to perform the task of
pushing the data to all of these SSPs. In the normal case the client ensures that it has pushed
the data to both SSPs, so that if another client reads data from the first SSP and that SSP later
fails, the client can easily access the data from the second SSP. However, there might be a
small window of time where the client has pushed the data to the first SSP but hasn't yet
pushed it to the second SSP. At this point the data is visible to all of the clients accessing the
first SSP, so another client might end up reading this data even though it is not sufficiently
replicated. If the writing client now fails, the data may never get sufficiently replicated.
Furthermore, this object is living dangerously at this point, because if the first SSP now fails,
the durability of this object will be compromised. So the goal of Depot is to ensure that any
object that is read or written by a correct client remains durable as long as some replica in the
client's sufficient-replication quorum survives. So if I have designated my object to be
replicated at two SSPs, this object should remain durable as long as it has been read or written
by a correct client and one of those SSPs survives. The way we are going to achieve this goal is by
minimizing the trust that is needed to enforce the client's replication policy. The logic for this is
relatively simple: all we need to do is add some receipts to the system. In particular, every time
the client stores data on one replica, the replica issues a receipt certifying that this data has
been durably stored on that replica. These receipts are then attached to the metadata
associated with the object, and they are propagated throughout the system using the
mechanisms that Depot provides. Now, whenever a client receives an object it checks whether
the object is sufficiently replicated or not. If the object is already sufficiently replicated, the
client does not need to do anything else.
>>: How does it check?
>> Prince Mahajan: Because the object has associated receipts.
>>: [inaudible] check its receipts?
>> Prince Mahajan: Right, so if the replication policy requires replication to two SSPs, then the
client can check whether the object is already replicated at two SSPs or not.
>>: Does it have to go back and request those receipts?
>> Prince Mahajan: No, they are attached to the metadata so the client gets them.
>>: I see. When the metadata updates come, they contain a receipt as well.
>> Prince Mahajan: Yeah. So this is the easy case: the object has already been sufficiently
replicated, and when the client reads it, it sees that the object is sufficiently replicated and so it
doesn't need to do anything.
>>: After you write a second copy do you need to go back and update the first copy?
>> Prince Mahajan: Yes, right.
>>: That just makes the write take longer.
>> Prince Mahajan: No, so the write is complete as soon as you perform the first operation at
the first SSP. And then you…
>>: [inaudible] has to go back?
>> Prince Mahajan: You're doing some extra work for that write, which you can piggyback, but
the write is complete and visible as soon as you perform the first operation. However, when a
client receives an object and sees that the object is not sufficiently replicated, that client
retains a copy of the object along with whatever receipts there are. The client stores this copy
locally until it learns that the object has become sufficiently replicated; meanwhile it can
continue processing, and later on, when it learns that the object is sufficiently replicated, at
that point it can discard this copy of the object.
>>: If the client doesn't trust its own local store to be sufficient, it might not want its writes
exposed until sufficient receipts [inaudible]?
>> Prince Mahajan: Right. Actually I come to that in the next slide. This approach of minimizing
trust for durability also helps us address a fundamental trade-off in geographically replicated
distributed systems. In these systems you have a trade-off between being able to complete an
operation with low latency, by which I mean a latency smaller than the wide-area replication
delay, and durability in the presence of data center failures. Let me illustrate the problem.
Whenever a client sends a request to a server in one data center, the server has two choices. It
can respond to the client before ensuring replication to the other remote data centers, in
which case you get low latency, but at this point the write is not durable and a data center
failure can lose it. Alternatively, the server can wait until it receives responses from all of the
remote servers, in which case the object is certainly durable if a data center failure happens,
but the request took much longer and the operation is not low latency. In contrast, with a
mechanism like Depot, when a client sends an object to the first server, the server can respond
with a receipt saying, I am replicating this object locally. The client at this point holds onto the
object instead of throwing it away as it normally would. This operation is still low latency,
because the server doesn't need to replicate data to all of the remote data centers before
responding to the client. However, if this data center later fails, the client still has a copy of the
object, and as a result the client can push this copy to the remote data centers and ensure that
the object becomes sufficiently replicated and durable.
>>: The client can push it to the remote?
>> Prince Mahajan: The client can push. So this approach is desirable because you get some
implicit fault tolerance: the client is typically logically or physically separated from the servers,
and in some sense the object is sharing the client's fate, so if the client survives, the object will
also survive; if the client fails, then the object fails with it.
>>: Not addressing conflict resolution then?
>> Prince Mahajan: Conflict resolution is a problem that--so this talk is not really about conflict
resolution, because that problem arises irrespective of whether you use this mechanism
or not.
>>: [inaudible] you've introduced some other last-writer semantics because you [inaudible].
>> Prince Mahajan: No, no. Actually, for conflicts the approach that Depot offers is that it
returns all of the concurrent writes that have been performed on an object, and with each
write you can associate whatever application data you want in order to ensure the conflict is
resolved properly. So, for example, you can attach a timestamp and then pick a winner based
on that timestamp, or you can merge the concurrent writes in some meaningful way based on
the application semantics. We are not introducing a new last writer here, because the
timestamp that you would associate would be the timestamp that the client assigned to this
object when it performed the original operation.
>>: So you are just adding an additional staleness factor.
>> Prince Mahajan: Yeah. It's possible that if the data center fails you end up seeing more
staleness than you would otherwise. Later on, when the client gathers the receipt from the
other remote data center, at that point it can discard its copy of the object.
>>: [inaudible] not in the data center that it's trying to write to.
>> Prince Mahajan: Even if it is within the data center it still gets some additional fault
tolerance because the services may indeed be isolated.
>>: It can be in the data center. It's just got to be aware of all of the available regions.
>>: Well, if it's in the data center then it’s correlated with the data center.
>> Prince Mahajan: [inaudible] may be correlated, but again, there are two arguments. One is
that they can be logically separated entities, and secondly, if the client also fails, then it is sort
of like losing all recollection of what it has done, and so maybe that might be acceptable. I'm
not arguing that it is acceptable, but it might be okay in that circumstance.
>>: What if the client writes object A with intentionally low durability and then makes a
dependent write to B with high durability, does the high durability of B extend to A? Because I
can see that if A becomes unavailable, then B, with supposedly high durability, will say, well, I
can't process B until I…
>> Prince Mahajan: You can still process B; it's just that if you are writing something with low
durability, maybe you are okay with losing that data. That would be your intent. So if, for
example, for object A I specify that it's enough to replicate the object at any single data center,
whereas for object B I require that the object be replicated at two data centers, what might
happen in that circumstance is that you might lose object A but still have access to object B.
And presumably, because you wrote A with low durability, you should be prepared to handle
the situation where you lose object A, and perhaps you are able to re-create a copy of it
through some other mechanism.
>>: But how does Depot handle it? I mean, Depot just sees, oh, B has this pointer back to A,
and it will say, I can't provide you B until I…
>> Prince Mahajan: No, B does not include a pointer to A. I mean, it logically does, but B
includes a pointer to A's metadata, which will be available. The only thing that you will end up
losing is A's data, so you won't be able to read A, but you still have all of the information that
you need to enforce consistency. So this is a graph that depicts a situation where, in a local
cluster, we kill all of the cloud servers at about 220 seconds, and it shows that Depot's PUTs
continue functioning even after the servers have been removed, and they function with low
latency: once the clients recognize that the SSP is down, they don't have to go to the servers
anymore, and they just periodically keep checking whether the SSPs have come back, which is
why you see these latency spikes in between. At about 660 seconds we bring back the SSP, and
at that point the clients push their state back to the SSP and start working with the storage
service again. In this configuration we used a local backup and one SSP as our deployment. So
next let me talk about how much it costs to
provide the properties that Depot provides. In particular, I'm going to talk about latency,
resource utilization and the dollar cost of building a system like Depot. The testbed that we
used consisted of eight clients and four servers; we were performing about one request per
second and each volume contained 1000 objects. Just as a reminder, these are the sources of
overheads in Depot. We attach metadata to each object, and this metadata consists of a
signature, potentially receipts, a partial version vector, a history hash and a data hash. In order
to accept the metadata a node performs a SHA256 check, an RSA verification, a history check
and then receipt checks. And then to perform a GET the client performs a SHA256 check. So
this diagram depicts the latency of several variants of the baseline system that we have
constructed. In order to provide comparison points, we constructed several baseline variants
that try to emulate cloud storage by disabling the replication of metadata at the clients and
disabling all of the checks that Depot provides. So, for example, the base system does not keep
any data at the client, and neither the clients nor the servers perform any checks, whereas in
the hash variant the clients compute a hash of the data and attach it to the object before
storing it, and then verify the hash when they read. Similarly, in the sign variant the clients sign
the object before storing it and then check the signature when reading it. The Depot-minus-
receipts variant does not implement the receipt logic of Depot, and the main reason for
providing this variant is that Depot uses cryptographic receipts, but for many deployments it
may be acceptable to instead use receipts that just carry the index information; if nodes trust
each other to provide the right information then that might be acceptable, and many of the
overheads that we incur due to cryptographic operations can be eliminated. The key thing to
note here is that the overhead on the GETs is very, very small, because the only extra
computation that the client needs to do is a SHA256 check on these GETs. In contrast, for PUTs
the overheads are high because we are performing several cryptographic operations. In
particular, in the complete variant we need to perform two SHA256 computations and two
signatures on the critical path, and that's why the cost of the complete Depot version is
significantly higher. But like I said, there is a potential to remove a lot of these cryptographic
operations and reduce the overheads of Depot significantly.
>>: [inaudible].
>> Prince Mahajan: [inaudible]. Ones that are [inaudible].
>>: [inaudible] 5 seconds to do…
>> Prince Mahajan: Yeah.
>>: [inaudible].
>> Prince Mahajan: Each signature costs about 4 milliseconds, I think.
>>: [inaudible] could go faster.
>> Prince Mahajan: Yeah, using [inaudible]. [laughter]. So next I want to talk about the
resource utilization cost of Depot in comparison to these other variants. In this graph I'm
trying to depict the normalized cost of Depot in comparison to these other approaches, and in
particular I compare three resource utilization metrics: the network utilization between clients
and servers, the CPU utilization of the client and the CPU utilization of the server. For GETs, as
you can see, the network overhead is negligible, and at the client the extra cost primarily
comes from the extra hash computation that they need to perform. The CPU utilization of the
server was very, very small, so it's pretty much in the noise.
>>: What is the data that you're using for this?
>> Prince Mahajan: Sorry? 10 kB objects.
>>: 10 kB objects, and your network overhead looks like zero, but you have to be moving…
>>: [inaudible].
>> Prince Mahajan: It's about--so the basic metadata without the receipts is about 200 to 300
bytes, but we don't add the cost of that metadata to the GETs; instead we charge it to the
PUTs, because each client ends up storing that metadata, so in the accounting we add that cost
to the cost of a PUT instead of adding it to the GET.
>>: [inaudible] the objects you're putting are rather large [inaudible] compared to the metadata…
>> Prince Mahajan: Yeah, so if you have small objects, indeed, the cost of the system may be
prohibitive. In this case we looked at 10 kB objects; if you have smaller objects, like 500 bytes
or smaller, then you would end up paying a higher cost.
>>: Is the metadata a fixed size, or can this storage factor vary?
>> Prince Mahajan: It is fairly fixed in size; I think in most cases you can argue that the version
vectors are going to remain a fixed size.
>>: [inaudible] writers which in real life doesn't happen.
>> Prince Mahajan: Also, the other reason why our version vectors stay a fixed size is that,
instead of attaching a complete version vector, each update carries only an incremental
version vector over the previous update. Because the system is processing updates in a
consistent order, you can remove the information that was already present in the previous
version vector and only encode what is new. In practice, that does not change much. So this is
the overhead for PUTs. As you can see, Depot, as well as Depot without receipts, incurs about
20% extra cost for metadata transfer between clients and servers. It comes to about 20%
because the metadata is about 300 bytes and we are transferring 10 kB objects; replicated to
the eight clients, that is roughly 2 kB of metadata transfer per 10 kB PUT.
>>: [inaudible] replicate the metadata everywhere, so you end up using 20 or 30%… You can
only have cloud storage that is three or four times larger than your local storage, because your
metadata winds up being a third of your total. You lose a lot of the advantages of cloud
storage if your objects are that tiny, because you have to have the metadata…
>> Prince Mahajan: Yeah, you need to sort of measure that.
>>: On the other hand…
>>: We don't lose the advantages. It just costs a lot more.
>> Prince Mahajan: It costs a lot more.
>>: You've got to have local storage. You can only multiply local storage by three if you have
10K objects. Of course the answer is don't have 10K objects [laughter].
>> Prince Mahajan: Yeah. So again, the storage cost at the client is high because in a
traditional system you don't have to store anything at the client, whereas in Depot we need to
store this metadata at the client. Furthermore, in this deployment the clients were also
configured to store a copy of the data that they create, because we are using the local-backup
approach for ensuring durability, so there was one copy stored at one SSP and another copy
stored in the local data center, and that's why the additional cost is even more significant. The
CPU utilization of the client is high because you are performing checks on these receipts, and
all of that adds up; similarly, the CPU utilization of the server is particularly high for the receipt
variant, because the server is now computing cryptographic signatures for that variant.
Next, let's look at how much it actually costs, in dollar terms, to implement and run a system
like Depot. To do that we can weight some of these resource utilization numbers using cost
numbers that we gathered from existing pricing. What this tells us is, as expected, that to
perform a terabyte of GETs you pay very little extra, less than 5%. But to perform a terabyte of
PUTs you pay almost 100% extra; you double the cost needed to perform that terabyte of
PUTs.
>>: [inaudible].
>> Prince Mahajan: This is for 10 kB objects. And for storage you again pay about 20% extra
cost, because you are storing metadata as well as extra copies of the data. So once again,
basically the point is that these costs are high, but not prohibitively so. They seem acceptable
to me. I don't know if they will be acceptable to clients or not, but again, these overheads can
be dramatically reduced if you remove cryptography from…
>>: [inaudible].
>> Prince Mahajan: Or you don't have 10 kB objects.
>>: Which you want, right? Anytime we have looked at file systems and stuff, what you find is
that even if you have a bunch of small objects, your mean object size is going to be orders of
magnitude bigger than that, and since your overhead is fixed per object, you are probably
making yourself look worse than in reality.
>>: File systems, you said?
>>: File systems. [inaudible] file systems, what you find out is that the mean file size is
hundreds of kilobytes to megabytes. And there are a tiny number of giant ones.
>>: Is that more of a direct comparison to database systems, though? I think this was designed
more as a stand-in for a database rather than a stand-in for a…
>> Prince Mahajan: Well, this is…
[inaudible] [multiple speakers].
>>: Right. Your database records are going to come in under 10K probably.
>>: Well, is there any data from Amazon on how big the S3 objects are?
>> Prince Mahajan: Well, they are expected to be several gigabytes.
>>: Perhaps for S3, but Dynamo is a better comparison.
>> Prince Mahajan: Yeah, Dynamo objects are much smaller. So, in this part of the talk I
introduced a system called Depot that takes the approach of minimizing trust, and we showed
how, despite minimal trust, we can achieve some very strong properties, stronger than those
provided by existing systems. In terms of related work, the key point to take away is that there
is a lot of related work focused on building systems that minimize trust or improve availability,
but the main distinction between all of those systems and Depot is that Depot is designed to
really minimize trust, whereas the other systems go halfway: they try to reduce trust in the
system but don't entirely eliminate it, and similarly they don't achieve the same level of
availability that Depot promises. Broadly speaking, my
research is focused on investigating the trade-offs between consistency, availability and fault
tolerance. What Depot tries to do is to understand the practical limit of the properties that can
be provided if you want to hold on to availability and fault tolerance. In contrast, I've also
done some theoretical work trying to understand the theoretical limit of the properties that
can be provided with high availability and fault tolerance. What we ended up showing is that a
variant of causal consistency is really the optimal you can get if you want to provide high
availability, and similarly we have some results, not quite as strong, for environments where
you want to minimize trust in addition to maximizing availability. In the next part of the talk I
am going to briefly touch upon a system that I am currently working on, which is focused on
the other end of the question: if you want to hold on to consistency, if you always want to be
able to support acid transactions, then what is the best availability that you can get? In
particular, we are designing a system called Salt that integrates acid transactions with the base
approach. So next I am
going to talk about how we mix acid and base in the system called Salt. It is not a chemistry
talk. [laughter]. We all fully understand the benefits of acid transactions. They are used pretty
much everywhere to encode business logic, to perform banking transactions and so on. And
they provide the very attractive properties of atomicity, meaning that either the entire
transaction completes or no effect of the transaction is left; consistency, meaning that the
database transitions from one consistent state to another; isolation, meaning that transactions
do not observe the intermediate state of other transactions; and durability, meaning that the
result of committed transactions remains durable. The reason distributed databases have low
availability with acid transactions is
because, in order to perform an acid transaction that spans multiple components, you need
each of those components to be available and reachable. For example, if you want to perform
a transaction that involves two components X and Y, then if either X is down or unreachable,
or Y is down or unreachable, your transaction will not be available. In contrast to this acid
approach, there is an approach called the base approach. Unlike the acid approach, there are
no well-defined semantics for the base approach; there are various variants of it, and different
systems implement different flavors. The only underlying idea common to all of these flavors
is that we are trying to compromise consistency to get better availability and better
performance. This approach can be applied on top of an existing database by using the
database to perform only some very minimal tasks, or it can be used on top of the NoSQL
systems out there, or on data stores like Bigtable and so on. The main idea is that the
application now has to deal with the task of enforcing the properties that previously were
being enforced by the system. The reason this approach improves performance is that the
application does not have to deal with providing the transaction properties [inaudible].
Instead, the application can issue each one of these operations separately, so the system
remains available: if I have to update X, the system is available as long as X is available, and
similarly, when I have to update Y, the system is available as long as Y is available, so I don't
need simultaneous availability of both X and Y in order to perform the same computation I was
doing before. This comes with better availability and better performance, because you don't
need to perform two-phase commit and you don't have to hold locks throughout the duration
of the transaction. However, it also complicates programming, because while a particular
operation is partially completed, some other operation can come in and access this
inconsistent state. Typically you would handle the situation by signaling to the application, or
the application
might signal to the user that the result of some of the most recent operations might not be
reflected in the result being shown. For example, the user is trying to see his account balance,
and there might be a disclaimer that this may not reflect the most up-to-date balance. There
is, again, this conflict between acid
and base. Both of them have their advantages: acid provides simplicity of programming, but at
the cost of availability; base provides availability and high performance, but at the cost of ease
of programming. In this work we are trying to ask the question: can we provide the benefits of
both of these approaches without being subject to their limitations? That's why we are
building a system that combines the acid approach with the base approach. This seems like a
good thing to do for two reasons. First, most of the workloads actually do not
consist of performance-critical transactions. Most of the time you have very few transactions
for which you care about performance, and a lot of other transactions which are performance-
oblivious. If you can combine these two approaches in the same implementation, I don't have
to rewrite all of those performance-oblivious transactions; I just need to focus on rewriting the
transactions that are performance-critical. Secondly, this also provides an incremental
adoption path. Think of an enterprise that gets acquired by Microsoft: it's going to see a
gradual increase in workload, but the increase will not happen overnight, so the enterprise can
incrementally identify the bottlenecks in its workload and fix them over a period of time,
rather than having to atomically transition from an acid system to a base system. There are
various challenges in
supporting both of these approaches simultaneously, and I am going to talk about one
particular challenge next. The key problem is that the presence of the base paradigm in
conjunction with the acid approach can corrupt the consistent view that the acid transactions
expect to see. Suppose we have a transaction that we are trying to execute using the base
approach, and we have executed the first part of it. In between, an acid transaction comes in
and tries to read both X and Y. You would expect the invariant that the sum of X plus Y
remains constant to be preserved, but because the acid transaction has seen the incomplete
state of a base transaction, the invariant will not be preserved. To address
this and other limitations, we introduce a new transaction primitive called a base transaction.
A base transaction has a similar structure to an acid transaction, as you can see here, and it
provides a number of properties that are very similar to those we would expect from acid
transactions. In particular, it provides a notion of atomicity, which means that the entire
transaction will eventually be executed to completion. It provides durability for completed
transactions. For integrity enforcement it introduces exceptions: every time the base
transaction tries to perform an operation that violates the application’s integrity constraints, it
results in an exception, which can be handled by the base transaction locally. It may contain
acid transactions to simplify programming. The trickier properties that it provides, in addition
to these other good features, are that the presence of these transactions does not affect the
correctness of concurrently executing acid transactions, and also does not affect their
performance, and I will talk about in the next slide why that might be a problem. Finally, it
ensures that we are able to provide the same availability and performance as we expect from
the base approach using this base transaction primitive. Let me give you a sense of the two key
implementation ideas that we use to implement base transactions. First of all, base
transactions are serialized as a sequence of atomic transactions and so you can think of them as
an acid transaction, but they are not strictly acid transactions for technical reasons. So there
are going to be in the overall execution history base transactions are going to appear as several
small fragments. The same base transactions would appear at multiple points in the overall
serialization history. In order to ensure correctness of the acid transactions, these base
transactions are going to acquire some tainted locks which are going to mark that these objects
are being modified by an incomplete base transaction. The results of these tainted locks is that
while other concurrently executing base transactions can come and access the intermediate
state of other base transactions, acid transactions will be prevented from accessing the
incomplete state of the system, incomplete state of a base transaction. To ensure that the acid
transactions retain the same performance in the presence of base transactions, we need a few more mechanisms. The key problem is that a base transaction executes as a series of small atomic transactions, each requiring a disk write. If a base transaction consists of five mini-transactions, its duration will be roughly five times that of an equal-length acid transaction, because an acid transaction performs all of its disk writes in parallel. Without the mechanism I am about to describe, base transactions might therefore significantly impede the progress of acid transactions by blocking them.
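The tainted-lock mechanism just described can be sketched roughly as follows (an illustrative Python sketch with hypothetical names, not the actual Salt implementation): objects touched by an incomplete base transaction stay visible to other base transactions but are off-limits to acid transactions until the base transaction's final mini-transaction commits.

```python
# Illustrative sketch of "tainted" locks. All class and method names
# here are hypothetical, invented for this example.

class TaintedLockManager:
    def __init__(self):
        self.tainted = {}  # object key -> id of the base txn that tainted it

    def base_write(self, txn_id, key):
        # A mini-transaction of a base txn marks the object as tainted.
        owner = self.tainted.get(key)
        if owner is not None and owner != txn_id:
            raise RuntimeError("object tainted by another base txn")
        self.tainted[key] = txn_id

    def acid_access(self, key):
        # An acid transaction must not observe intermediate base state:
        # returns False if the caller has to block or abort.
        return key not in self.tainted

    def base_commit(self, txn_id):
        # The final mini-transaction of the base txn clears its taints,
        # making the objects visible to acid transactions again.
        self.tainted = {k: v for k, v in self.tainted.items() if v != txn_id}

mgr = TaintedLockManager()
mgr.base_write(txn_id=1, key="stock")  # first mini-txn taints "stock"
assert not mgr.acid_access("stock")    # acid txns are kept out...
assert mgr.acid_access("price")        # ...but only from tainted objects
mgr.base_commit(txn_id=1)              # last mini-txn clears the taint
assert mgr.acid_access("stock")
```

Note that in this sketch a second base transaction is still free to read the tainted object's intermediate state; only acid accesses are fenced off.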
Instead, the key idea we use is to break the commit of a transaction into two parts. The first is the consistency commit, which still performs the [inaudible] and releases all locks, but after which the data is stored only in memory. Later we perform a second commit, the durability commit, which happens asynchronously and in a distributed fashion, so it requires no coordination between nodes: each node can individually flush its data to disk, and a recovery protocol ensures that when we recover the system, we recover it to a consistent snapshot. There is also some logic to ensure that only the results of transactions that are both committed and durable are exposed to the application, so the application never sees the effect of a transaction that has released its locks but has not yet made it safely to disk. To summarize this part of the talk: I discussed an approach
combine acid transactions with base transactions, and to do so we introduced a new primitive
for base transactions which enables us to isolate the intermediate state of base transactions
from concurrently running acid transactions while preserving their correctness as well as
performance. These are some of the other publications that I've had and I would be happy to
talk about any of this related work if you would like. In conclusion, my research has explored extreme points in the consistency-availability trade-off. With the Depot work, and with some of my security work, I explored the point where we try to maximize availability and ask what is the best consistency you can get. With my recent work, we investigate situations where you want to hold on to strong consistency in the form of acid transactions and ask how we can improve the availability of those systems. With that, I will be happy to take any questions. [applause].
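The split commit described earlier can be sketched roughly as follows (an illustrative Python sketch, not the actual Salt implementation; all names are hypothetical): the consistency commit releases locks with data only in memory, a later durability commit flushes to disk asynchronously, and the application is only shown durably committed state.

```python
# Illustrative sketch of the split commit. Locks themselves are not
# modeled here; the point is the two-stage exposure of writes.

class Node:
    def __init__(self):
        self.memory = {}  # consistency-committed, not yet durable
        self.disk = {}    # durably committed

    def consistency_commit(self, txn_writes):
        # Locks (not modeled) are released at this point; data is
        # only in memory, so the transaction is not yet durable.
        self.memory.update(txn_writes)

    def durability_commit(self):
        # Runs asynchronously on each node, with no cross-node
        # coordination; a recovery protocol (not modeled) restores a
        # consistent snapshot after a crash.
        self.disk.update(self.memory)
        self.memory.clear()

    def application_read(self, key):
        # The application only ever sees durably committed results.
        return self.disk.get(key)

node = Node()
node.consistency_commit({"x": 1})
assert node.application_read("x") is None  # committed, not yet durable
node.durability_commit()
assert node.application_read("x") == 1
```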
>>: I might have just one. For Salt do you have specific applications in mind like for customers
for the system?
>> Prince Mahajan: For the first part of the talk or for the second?
>>: The first part.
>> Prince Mahajan: For the Depot?
>>: For Salt.
>> Prince Mahajan: For Salt? As examples, we are looking at the TPC benchmark, and we also looked at the [inaudible] benchmark. There are a lot of workloads we have looked at where it makes sense to combine acid with base. Some examples are situations where you are computing an index but the consistency of the index is not critical. For example, when you add a new item to be sold on eBay, the item has a category, and you would normally want the item to show up when people search for that category, but it is usually acceptable if the item does not show up instantly and instead shows up a little bit later. Similarly, you can use the base approach to perform very quick e-commerce checkouts, because one thing I did not discuss about these base transactions is that they let you specify that one part of the transaction should execute right now and the rest of the work should be done asynchronously in the background. So if a customer tries to purchase an item, you can have the stock quantity of that item decremented asynchronously, and you can charge the credit card, update the order, and do all of the other background tasks in the remaining base transaction. Similarly, in banking applications it might make sense to decrement the balance of the source account first in the synchronous step and then increment the balance of the destination account later in the background. So pretty much anywhere you can think of relaxing consistency, or of deferring part of the work, you can benefit from the base approach.
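The eBay category-index example above can be sketched roughly as follows (illustrative Python only; the names and the queue-based design are assumptions, not Salt's actual mechanism): the critical write commits immediately, while the index update is deferred to a background step, so a search may briefly miss the new item, which this workload tolerates.

```python
from collections import deque

# Hypothetical sketch of deferring the non-critical index update.
items = {}           # item id -> category (the critical state)
category_index = {}  # category -> list of item ids (may lag)
deferred = deque()   # background work queue

def add_item(item_id, category):
    items[item_id] = category             # critical part, done now
    deferred.append((item_id, category))  # index update deferred

def run_background_step():
    # Drains one unit of deferred work; in a real system this would
    # run asynchronously, off the critical path of the request.
    if deferred:
        item_id, category = deferred.popleft()
        category_index.setdefault(category, []).append(item_id)

add_item("i1", "cameras")
# Immediately after add_item, the item exists but the index lags...
assert "cameras" not in category_index
run_background_step()
# ...until the background step catches up.
assert category_index["cameras"] == ["i1"]
```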
>>: I have a question about the [inaudible] with the [inaudible] multiple versions of an object.
Does that only happen because of forks or are there other…
>> Prince Mahajan: No, it can happen due to legitimate concurrency as well, because if two clients concurrently try to update an object, that can certainly result in concurrent versions of the object. It can happen in Amazon S3 too. In Amazon S3, however, when you do a GET the system automatically tries to resolve the conflict for you using [inaudible] timestamps. You can overlay that kind of approach on top of Depot as well: associate timestamps with versions and say that you just want to see the version with the highest timestamp. That might make sense in some applications, but in other applications it might make sense to see all of the concurrent versions and try to merge them in some way. So we leave it out of Depot, and you can layer on top of it whatever approach makes sense for your application.
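The two resolution policies just mentioned can be sketched roughly as follows (an illustrative Python sketch; the function names are hypothetical, and in this model a GET simply exposes the set of concurrent versions for the application to resolve):

```python
# Each version is a (timestamp, value) pair; a GET returns every
# concurrent version, and the application picks a policy.

def get_versions(store, key):
    return store.get(key, [])

def last_writer_wins(versions):
    # S3-style resolution: keep only the highest-timestamped version.
    return max(versions, key=lambda tv: tv[0])[1]

def merge_all(versions, merge):
    # Application-specific resolution: merge all concurrent versions.
    return merge([value for _, value in versions])

store = {"photo-album": [(3, {"a.jpg"}), (5, {"b.jpg"})]}
versions = get_versions(store, "photo-album")
assert last_writer_wins(versions) == {"b.jpg"}          # newest only
assert merge_all(versions, lambda vs: set().union(*vs)) \
    == {"a.jpg", "b.jpg"}                               # union merge
```

The design point is that neither policy is baked into the store itself; last-writer-wins loses concurrent updates, while a merge keeps them all, and only the application knows which is right.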
>> Rich Draves: Okay. Let's thank Prince again. [applause].