>> Christian Konig: Good morning, everyone. Thank you for coming. It's my great pleasure to
introduce Sudip Roy. He is joining us from Cornell University, where he's co-advised by
Johannes Gehrke and Christoph Koch -- or Koch, which is how you correctly pronounce it. And also,
he has done a Microsoft Research internship here, and he's also done a Google internship. He is
the co-winner of the SIGMOD 2011 Best Paper Award, together with some of his colleagues at
Cornell, and today he will talk to us about lazy transaction execution models.
>> Sudip Roy: Thanks for the introduction, Christian. So today I'm going to present my thesis
work on lazy transaction execution models. So let me start by reminding you of what a
transaction is. A transaction is a single execution of a user program over a shared database state.
Informally, it is a basic unit of change, which the database sees, and this execution of the
program is guaranteed to satisfy the ACID properties. And I'm sure that all of you are familiar
with what ACID is, so I'm not going to go into the details of that, but let me show you what such
a user program usually looks like. So consider Mickey's transaction to book a seat on Flight 123.
So the fact that this program has to be executed transactionally is indicated by the keyword
START TRANSACTION. It has these four statements. First, Mickey selects a seat on Flight
123. It checks if there is something which is available. If not, then the transaction rolls back. If
there is something available, then it does two updates to the database. It deletes that seat from
the available table and it inserts a tuple corresponding to the reservation into the bookings table.
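To make the classical execution concrete, here is a minimal sketch of such a booking transaction in Python on top of the built-in sqlite3 module. The schema, the table and column names, and the seat data are assumptions made purely for illustration, not the actual program from the talk:

```python
import sqlite3

def setup(conn):
    conn.executescript("""
        CREATE TABLE available (flight TEXT, seat TEXT, is_window INTEGER);
        CREATE TABLE bookings  (flight TEXT, seat TEXT, passenger TEXT);
        INSERT INTO available VALUES
            ('F123', '1A', 1), ('F123', '1B', 0), ('F123', '1C', 0);
    """)

def book_any_seat(conn, flight, passenger):
    # Classical execution: a concrete seat is chosen and bound before commit.
    conn.execute("BEGIN")
    row = conn.execute(
        "SELECT seat FROM available WHERE flight = ? LIMIT 1", (flight,)
    ).fetchone()
    if row is None:                          # nothing available: roll back
        conn.execute("ROLLBACK")
        return None
    seat = row[0]
    conn.execute("DELETE FROM available WHERE flight = ? AND seat = ?", (flight, seat))
    conn.execute("INSERT INTO bookings VALUES (?, ?, ?)", (flight, seat, passenger))
    conn.execute("COMMIT")
    return seat

conn = sqlite3.connect(":memory:")
conn.isolation_level = None                  # manage BEGIN/COMMIT explicitly
setup(conn)
print(book_any_seat(conn, "F123", "Mickey")) # e.g. 1A, bound at commit time
```

The point to keep in mind for what follows is that the concrete seat is picked and fixed at commit time.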
Now, this of course is a very simplified form of what you would see in the real world. So before
I talk about how and why lazy execution is good, let me show you how we can execute this
transaction in a classical model and why this leads to suboptimal results. So consider this
following scenario, in which you have a flight in which you have three seats available. Now, the
available seats are in green. The already-reserved seats are in red, so 1A, 1B and 1C are
available. Let's say Mickey issues the transaction, the program which I just showed you, to book
any seat, and he gets seat 1A. After that, let's assume that Donald issues a similar transaction,
and he gets seat 1B. Finally, Minnie issues a transaction. However, Minnie has an additional
constraint that she only wants a window seat. Now, the only window seat which was available,
1A, has already been allotted to Mickey, and therefore Minnie's transaction had to abort. Now, if
you had known that Minnie's transaction was going to arrive, then you could have given Mickey the
seat 1C. Mickey didn't really care about which seat he got -- in which case, Minnie's transaction
would have committed. So let us see how a lazy execution model addresses this issue. So,
again, consider the same scenario. Now, Mickey says, book me any seat, and instead of
assigning a single seat to Mickey, I'm going to commit Mickey's transaction, and I'm going to
defer the assignment of seat to Mickey. Subsequently, I'm going to do the same for Donald's
transaction. I'm going to ensure that Mickey and Donald both have some seat, but I'm not going
to tell them which exact seat they have. Finally, in this case, when Minnie says that I want a
window seat, I can actually assign Minnie the window seat which was available, in this case, 1A.
So I made these two assumptions of what these transactions are doing. One is that there is a
flexibility in the value that is being written. That is, Mickey does not care which exact seat he
gets, as long as it satisfies a certain number of constraints, and second, that there is a delay
between the point at which the transaction commits and the point at which you read the values
which are written by the transaction. Now, humor me for now and assume that there is a broad
class of applications which satisfy these two assumptions, and I'm going to come back and
precisely identify what this class of applications is. Now, assuming that there is a class of
applications over which these two assumptions hold, the key idea is that we can lazily bind the
unread values in the transaction, and by doing so, we are creating some room to maximize some
notion of global utility. In this particular application, the global utility was to satisfy the
maximum number of user constraints, or equivalently to allow the maximum number of
transactions to successfully commit. So let me give you another example in which being lazy
helps. So consider a simple voting application in which we are using this votes table to keep a
tally of the election status. So in this case, the Democrats have 100,000 votes cast for them. The
Republicans have 75,000 votes. And I have three transactions which the application -- or three
user programs which the application can execute as transactions. One is to cast a vote for
Democrats, which basically just goes and updates the count variable of Democrats by one, a
second, to cast a vote for Republicans, which increments the value corresponding to
Republicans, and a third transaction, which checks who is leading, so it reads the Democrats' and
Republican counts, it compares the two values and displays who is the current leader in the
election. Furthermore, let us assume that this is a nationwide election, and I'm replicating this
votes table across two datacenters, one on the east coast and one on the west coast, and also that
the initial state of the database is that the Democrats are leading the Republicans by around
25,000 votes. So what happens if we execute this transaction under strong consistency? So
whenever I cast a vote -- in this case, a vote is cast for Republicans, I need to consistently change
my replica state across these two datacenters, right? So I need to inform -- so if I am executing
on the west coast datacenter, I need to synchronously inform the east coast datacenter of this
change, and I have to incur at least one round-trip latency. Now, of course, under strong
consistency, the programming model is very simple, because the user never has to bother about
perceiving inconsistent or two different replica states. The other extreme is to be eventually
consistent, in which you say that I'm not going to inform the other datacenter synchronously. I'm
going to do that asynchronously, so when my transaction executes on the west coast datacenter, I
commit it locally and I keep my fingers crossed and hope that this transient inconsistency
between the two replica states is not perceived by the application. And in this particular case,
it is actually not perceivable by the application, because all the application cares about is who is
leading, and that transaction is going to evaluate to the same thing over these two states.
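To make that concrete, here is a toy Python sketch of the three transaction programs and of the equivalence just described. The dictionaries stand in for the replicated votes table, and the concrete counts are only illustrative:

```python
def vote_democrat(db):        # T1: increment the Democrat tally
    db["Democrat"] += 1

def vote_republican(db):      # T2: increment the Republican tally
    db["Republican"] += 1

def who_is_leading(db):       # T3: the only read the application ever performs
    if db["Democrat"] > db["Republican"]:
        return "Democrats"
    if db["Republican"] > db["Democrat"]:
        return "Republicans"
    return "tie"

west = {"Democrat": 100_000, "Republican": 75_000}
east = {"Democrat": 100_000, "Republican": 75_000}
vote_republican(west)          # applied only on the west coast replica so far

# The replicas now hold different counts, but every read the application can
# issue (T3) returns the same answer, so they sit in the same equivalence class.
assert west != east and who_is_leading(west) == who_is_leading(east)
```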
However, you can imagine scenarios where the Democrats are leading by one, at which point if
you are executing in an eventually consistent model, then transaction T3 can see two different states,
and the application can basically perceive one in which the Democrats and Republicans are tied
and another in which the Democrats are actually leading. So this exposes -- so this type of
inconsistency then has to be handled at a higher level in the application, right? So can we get the
best of both worlds? That is, can we get the clean semantics of strong consistency as well as the
fast response times of eventual consistency? And the answer is, yes, we can, at least under
certain assumptions and for a certain class of applications, we can. So let me show you how. So
the key idea is to exploit flexibility in the reads which the application is making. So in the earlier
case, for this particular application, if these are the only three transactions which the application
can execute, then it doesn't really matter what exactly the Democrats' and Republican vote counts
are, as long as in both these database states the Democrats are leading, because the
application cannot actually perceive this difference. So, in some sense, they belong to the same
equivalence class of database states. In this case, a class in which the Democrats are leading. So
the idea then is to be lazy yet strongly consistent, and how can we do so? Instead of requiring
that the two replicas are always identical, which is what strong consistency does, we are now
going to enforce that the two replicas are always in equivalent states, as opposed to being
completely identical. And this will allow my two replicas to diverge, but I'm going to establish
certain bounds within which they are allowed to diverge, and these bounds are again defined by
the equivalence class. So how do I do it? I do it by using these global treaties. We can assume
these global treaties are contracts which all the replicas sign and say that, as long as any changes
which I am making are not going to violate the global treaty, I am good. Whenever I am good, I can execute things locally, but whenever I am in danger of violating a global treaty,
I need to inform the other replica. So, of course, I have shifted the onus of communication from
the transaction to enforcing this global treaty, and unless I have a mechanism of efficiently
enforcing this global treaty in a distributed manner, I would go back to the strongly consistent
case. So how do I enforce this global treaty? I project it into two local treaties such that, if each
of these replicas are making changes which do not violate the local treaties, I am sure that the
global treaty is not going to be violated. So intuitively, you can imagine that I had a budget of
around 25,000 votes until the boundary of the equivalence class, and I have partitioned that into
two, each of 12,500 votes. And if it is a little vague right now, it will become more concrete
when I get to the technical details and precisely define what this projection is, how we get these
global treaties. Yes.
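Just to give a rough sense of that budget intuition in code (the numbers are made up, and the real treaties are logical formulas derived from the application rather than simple counters), a sketch might look like this:

```python
# Global treaty for the voting example: the Democrats stay ahead, i.e. the
# combined state never leaves the "Democrats leading" equivalence class.
dem, rep = 100_000, 75_000
slack = dem - rep                      # 25,000 votes of room before the class boundary

# Project the global treaty into one local treaty per replica: each replica may
# absorb at most half of the remaining slack in new Republican votes before it
# has to synchronize with the other replica and renegotiate.
local_budget = {"east": slack // 2, "west": slack // 2}

def cast_republican_vote(replica):
    """Returns True if the vote was applied purely locally, with no round trip."""
    if local_budget[replica] == 0:     # local treaty about to be violated:
        return False                   # time to synchronize and renegotiate
    local_budget[replica] -= 1
    return True

print(cast_republican_vote("west"))    # True: handled locally
```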
>>: Are you also going to talk about works that relate to this idea, as well, because I would then
hold off my question until after then?
>> Sudip Roy: Yes, I am going to.
>>: Let me ask my question, and then if you are going to talk about it, you can defer it. How does it relate to two works? One is consistency rationing, from 2009 I think. And also escrow transactions and demarcation protocols.
>> Sudip Roy: Right, so consistency rationing basically said that we are going to classify -- we are going to have three classes of objects, one which have to be strongly
consistent, one which can be eventually consistent and one which are in between. So our work
basically says that you don't have to classify.
>>: But you are between categories, right? Because within certain constraints, you switch from
eventually consistent to strongly consistent and vice versa.
>> Sudip Roy: No. We are going to be always strongly consistent, except that if the application
doesn't actually require strong -- the application always requires strong consistency, but if the
application cannot perceive some inconsistencies, then I am going to exploit that flexibility in the
application to be inconsistent sometimes. But using these treaties, I'm going to ensure that from
the application's point of view, it is always strongly consistent. Regarding the escrow transaction
and demarcation protocol, and there are other protocols, like distributed divergence protocols, let
me come back to that. Right, so how are transactions executed in this lazy yet strongly
consistent way? So now, when you issue a set of transactions, they go to the west coast
datacenter, they are executed locally, and you can execute them locally as long as this local
treaty is not violated. Now, in this case, the Republican count has reached kind of a border of the
local treaty, and the next transaction pushes it over, as in it violates the local treaty, at which
point I synchronize with the east coast datacenter. I update it with the changes which had
happened in the east coast datacenter, and I renegotiate and establish a new set of local treaties.
So again, I made two assumptions. One is from the application's point of view, there are many
database states which are equivalent, and it doesn't really perceive how they are different, and the
second, that communication is expensive, which is a very mild assumption and is true in many
scenarios. And again, I'll request you to humor me for now, and I'll come back and identify a
class of applications over which these two assumptions hold. So assuming that these two
assumptions hold, the key idea is to lazily synchronize distributed state, and by doing this lazy
synchronization, we can minimize the amount of coordination without actually sacrificing the
consistency requirement. So the takeaway so far is that many applications have some
flexibility in the transactions. By exploiting this flexibility in transactions, we can be lazy, and
I've shown you one example in which this laziness creates room for optimizing. And I have
shown you another case where we can exploit this flexibility to be lazy, and this laziness would
reduce the amount of coordination required without sacrificing consistency. So that was my
introduction, and the outline for the rest of the talk is that I am going to first present a solution
for how we can be lazy and optimize resource allocation, and the class of applications for
which it is applicable. Second, I am going to show how laziness allows us to minimize
coordination, and that's my project on homeostasis. And, finally, I'm going to show some
experiments. So any questions on the high-level ideas so far? So let me start by revisiting the
original example, which I just showed you. So I had told you that we had these three
transactions. There was some flexibility in the values which were written, and I had also made
the second assumption that there is a delay between the point at which the transaction commits
and the point at which the values are read. And the key idea is that we were going to delay the
binding for these values which are not read by the transactions, and this will create some room
for optimization and we can maximize the global utility, and in this particular case, it was to
allow the maximum number of transactions to go through. So coming back to my earlier
promise of identifying what this class of application is, so there are many database applications
which use transactions to allocate -- yes.
>>: On this scenario, if you look at flight reservation applications of today, they don't
necessarily commit you to a seat, unless you specifically ask for it. So they would not be solving this at the database; they already solve this at the app level.
>> Sudip Roy: So the idea is that, yes, for this particular -- so the idea is not that you can do it
on -- okay, let me rephrase that. So, yes, you can write custom application logic to do so, which
is outside the database. But what we are claiming is that it is a more fundamental problem, and
therefore we are presenting an abstraction for all of these applications which can be used. More
than that, there are some interesting issues which arise now that you are executing this transaction and you have removed some part of it to execute at a later point: what happens to the traditional properties, the traditional ACID properties? In some sense, you're not executing it atomically. How do you reason about isolation, because now one transaction can actually be affected by another transaction? I don't know if that answers your question.
Right. So I'm going to use the word resources as an abstraction for these objects which are
allotted, and I'm going to assume that they are represented as data items in the database, and
you're using transactions to change the state which is associated with these data items. So this is
precisely an example of an application where those two assumptions hold. So SeatID is
basically a social seating platform, which provides social plugins, so that you can basically
choose who you sit next to in a flight. It may be one of your friends. It may be you can specify a
constraint like I want to sit with someone else from Microsoft Research or another technical guy.
Of course, you do not want to be in situations like this. Another field is -- another area where this kind of reservation arises is hotels, where you make a reservation but you don't really know which room you are allotted. You are allotted the room when you actually get to the check-in point, and FrontDesk Upsell is such a piece of hotel reservation software which, as they advertise,
intelligently makes the right offer for the right guest at check-in, but from the hotel's point of
view, they are maximizing the revenue by allocating rooms efficiently. Finally, I am sure all of
you have run into a scenario where you have some meetings which are scheduled, and someone
higher up in the hierarchy schedules another meeting, which leads to a cascading rescheduling of
meetings. In this case, of course, the time slots correspond to these resources. And this is
usually bad for graduate students like us, who end up with no sign of their advisers.
>>: Are you going to talk about them?
>> Sudip Roy: Yes, you could use Quantum Databases for it. Right, so again, going back to our solution,
we are going to delay the assignment of resources beyond the point at which the transaction commits, so as opposed
to a classical model in which first a user requests some resources with constraints, the system
assigns a resource and then the transaction commits, now we are going to move to a lazy model,
where the user requests the resource with some constraints, the transaction commits if there is a
feasible assignment which exists, and the actual assignment of resource takes place at some point
in the future when a read is performed over the database. That is when Mickey needs to know
which seat he is sitting in. And between the point at which the transaction commits and the seat
assignment takes place, the database is in a partially uncertain state, and we call this state a
quantum state. In this state, Mickey has a seat, but which seat is unknown. And the database
which manages this uncertainty is called a Quantum Database. So let me first show you at a
conceptual level how Quantum Database supports this lazy execution model. So let us assume a
scenario in which we have an empty flight reservation table. Now, Mickey's transaction arrives
and is executed, as opposed to a classical model of execution, in which the database transitions to
a single next state, which corresponds to whichever seat was allotted to Mickey. A Quantum
Database transitions to three possible states. That is, it maintains all possibilities, one in which
Mickey is sitting in 1A, one in which Mickey is sitting in 1B, and a third in which Mickey is
sitting in 1C. After that, when Donald's transaction arrives, Donald's transaction executes in
each of these three possible worlds, and that leads to an even larger number of possibilities. Finally,
when Minnie's transaction arrives, Minnie's transaction can only execute on two of these possible
worlds, the one in which there is a window seat which is available. So what we have effectively
done is, by delaying Mickey's seat assignment, and we delayed it by maintaining all of these
possibilities, we have allowed Minnie's transaction to successfully commit. More formally, the
Quantum Database is nothing but a set of possible database states which are reachable through
different choices made in the transactions. And you may find them similar to uncertain or
probabilistic databases, and they basically differ from probabilistic or incomplete databases in
three main ways. One, we are deliberately introducing some uncertainty, and we are doing so to
enable this late binding. Second, we always need to maintain a guarantee that the Quantum
Database eventually resolves to a single state. It doesn't really make sense for Mickey to have
two seats. And the third is a key design choice, which is from where the name Quantum
Database arises, and that is to keep uncertainty internal to the database. And let me come back to
this key design choice in a few slides. So so far, I have introduced what at a conceptual level
Quantum Database is. Let me now give you one specific way of implementing Quantum
Databases. Clearly, enumerating all of these possible worlds is infeasible. In fact, there can be
an exponential number of possible worlds, exponential in the number of transactions which you
are delaying, and there is a rich literature on maintaining these uncertain databases, Codd-tables,
C-tables and PC-tables. However, we choose the simple representation, so what we do is we
partition the Quantum Database into two states, one which is deterministic, and the other which
is a sequence of transactions which have committed but whose seat assignment or whose value
assignment has not taken place. So because these sequence of transactions has already
committed, the Quantum Database needs to ensure that there is a feasible assignment of
resources. We do not want Mickey to be in a situation where the transaction has already
committed and later you see that, well, I don't have a seat for you anymore. So we need to
maintain some kind of a system invariant. We need to maintain the logical formula which would
guarantee that this sequence of transactions can always execute. So the next question is, how do
we construct this invariant automatically? And in order to do this, we need to extract the user's
constraints from the transaction itself automatically. And doing this in its full generality is
difficult, and therefore we restrict it -- we require some hints from the user, and we require the user to write the transactions as these resource transactions in an extended SQL language, which
looks like this. So it has a SQL -- it has a conjunctive query initially, which says what are the
resources which are acceptable to me, in this case, only window seats on Flight 123. As opposed to a LIMIT ONE keyword, we now use a CHOOSE ONE keyword, which explicitly encodes this choice or flexibility. And finally, we have a FOLLOWED BY clause, which contains all the writes which are dependent on the resource which is selected, and these are the writes which are going to get
delayed and are going to get executed at some point in the future. Now, given transactions
which are written in this SQL form, I'm going to use equivalent datalog-like representation in
which the body of the datalog is going to correspond to this conjunctive query which is up here,
and the followed by clauses will be in the head, and I'm going to use the minus notation for a
deletion. I'm going to use a plus notation for insertion and updates can be modeled as a sequence
of deletion followed by another insertion. So going back to the problem of constructing this
invariant, we can do it now in two steps. First, we convert these two transactions to this
equivalent datalog-like form, and now we want to compose these transactions to construct a
single logical invariant, and we do this by unification. Yes.
>>: What is your followed by, the class of SQL inside? What is the class of constraints?
>> Sudip Roy: What is it?
>>: Does it delete values inside values, or do you allow sub-queries and stuff in there?
>> Sudip Roy: No. It just deletes and inserts.
>>: There isn't sort of atomic queries.
>> Sudip Roy: Yes. It's probably possible to extend it further, but we haven't looked into that.
Right, so let's say that this is the datalog-like representation for Mickey's transaction. This is the
datalog-like representation for Donald's transaction. Now, I construct an equivalent larger
transaction, which is a sequential composition of these two transactions. Now, you want to be
careful, because Donald's transaction executes on a database state which is obtained after
Mickey's transaction has executed, and therefore it should perceive the writes which Mickey's
transaction would have done. In this case, it would have deleted this particular seat, and
therefore it results in this additional constraint, which is based on unification between the heads
of all previous transactions and the body of the latest transaction. Now, this of course is a simple
example of how we do this composition. We have a general algorithm for composition and
proof of correctness in the paper, but I won't have time to go into that in this talk, but I am happy
to talk about it later offline. So, assuming that -- okay, so let me just point out that now that we have this composed transaction, as long as the body of this composed transaction has a valid grounding over the database, we are sure that this sequence of transactions can commit. So that
was the original goal of constructing this invariant. So how does the transaction execute? Over
a Quantum Database, effectively, it basically checks if the invariant, which can be -- if the
invariant with the extended sequence of transactions has a valid assignment or a valid grounding; if so, then you update the quantum state and you commit the transaction. But now the
assignment has not taken place. If not, then the transaction aborts. Okay. So finally, what
happens when you perform reads over the Quantum Database? At some point in time,
Mickey has to actually know his seat, so what happens in that case? And this goes back to the
design choice which we made earlier to keep uncertainty completely internal to the Quantum
Database. So let's say that this is the initial Quantum Database, one in which Mickey has both
seat 1B and 1C, and now, if Mickey queries -- Mickey issues a read query over this Quantum
Database, the Quantum Database, in order to keep the uncertainty completely internal, collapses
all possible worlds in which Mickey can have two different seats. Sorry, it collapses to a set of
possible worlds over which the read query has a completely deterministic answer. In this case, it
has eliminated one of these possible worlds. In general, it can actually be a set of possible
worlds. And we have a unification-based algorithm, which is not optimal, but it works in
practice. In fact, the optimal solution is actually [indiscernible]-complete, and it can be related
to a completely different problem of information disclosure through views, this famous paper of
Miklau and Suciu. So I hope if you understand this, then you now understand why we call it
Quantum Database. There's an analogy you can draw to Schrodinger's cat, that when the cat is
inside the box, it can be both dead and alive. But as soon as you open the box, which is in this
case issuing a read query, the cat can be either dead or alive, but not both.
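To make the mechanics concrete, here is a toy Python sketch of a Quantum Database for the seat example. It brute-forces the possible worlds rather than using the datalog-based composition and unification just described, which only works at this tiny scale; the names and data structures are illustrative assumptions, not the actual implementation:

```python
from itertools import permutations

SEATS = {"1A": {"window": True}, "1B": {"window": False}, "1C": {"window": False}}
pending = []          # committed requests whose seat binding is still deferred
bound = {}            # bindings that earlier reads have already fixed

def possible_worlds():
    # All assignments of distinct seats to the pending requests that satisfy
    # every request's constraint and agree with the bindings fixed by reads.
    worlds = []
    for combo in permutations(SEATS, len(pending)):
        world = {name: seat for (name, _), seat in zip(pending, combo)}
        ok_constraints = all(pred(SEATS[world[name]]) for name, pred in pending)
        ok_bound = all(world[name] == seat for name, seat in bound.items())
        if ok_constraints and ok_bound:
            worlds.append(world)
    return worlds

def request_seat(name, pred=lambda attrs: True):
    # Commit only if some possible world still exists; defer the actual binding.
    pending.append((name, pred))
    if possible_worlds():
        return "committed"
    pending.pop()
    return "aborted"

def read_seat(name):
    # A read collapses the quantum state: keep the largest set of worlds over
    # which the answer is deterministic, and fix the binding accordingly.
    counts = {}
    for world in possible_worlds():
        counts[world[name]] = counts.get(world[name], 0) + 1
    seat = max(counts, key=counts.get)
    bound[name] = seat
    return seat

print(request_seat("Mickey"))                          # committed, seat deferred
print(request_seat("Donald"))                          # committed, seat deferred
print(request_seat("Minnie", lambda a: a["window"]))   # committed: 1A is still free in some world
print(read_seat("Minnie"))                             # 1A, the only window seat
```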
>>: What happened to the state after the query? Do you take it back to the quantum state, or it
stays?
>> Sudip Roy: No. Once a read is -- so in effect, what is happening is a read is now also
changing the database state. A read internally may be converted to an update. Yes.
>>: What is the impact going to be?
>> Sudip Roy: So the impact is basically, in order to -- so the whole point of having these
possible worlds is by maintaining as many of these possibilities, I can optimize my resource
allocation, right? So to minimize the impact of reads, I want to maximize the number of possible
worlds which I retain and yet can answer the query deterministically.
>>: So what's the objective function you're optimizing when you collapse the --
>> Sudip Roy: In this case, the objective function we are maximizing is just to retain the maximum number of possible worlds after the collapse.
>>: How is that specified? Who specifies it?
>> Sudip Roy: We assume that that's the default in some sense. You can think of applications
where you would want to maximize some other notion, so let's say that if you want to maximize
revenue, then some possible worlds may be more beneficial for you than others.
>>: Do you need an extension to some syntax or new syntax to specify this, or is it --
>> Sudip Roy: Yes, you would. We don't support that as of now, but it's definitely something
which can be extended. Right, so the takeaway was that we exploited this flexibility in the
transactions which are executing to be lazy in binding some of the values which are not read
immediately in the transaction. And I presented Quantum Database, which basically optimizes
this resource allocation using lazy binding. So that concludes the part on Quantum Databases,
and I'm going to now move on to Homeostasis. So any other questions on Quantum Databases
so far?
>>: Any performance numbers? Like in terms of doing it outside the database, was it worth the
benefit -- do you get a gain in terms of performance, or do you gain in terms of --
>> Sudip Roy: You gain in terms of utility, so it's not exactly in terms of performance. You
may gain in terms of performance by implementing Quantum Database inside the database. Our
implementation was in the form of a middle tier which sits outside the database, and actually I'm
not going to show you the performance numbers for Quantum Databases just due to lack of time.
I have some backup slides where we can go for them.
>>: But to establish utility, you need to have a rich framework, right? I mean, it's also related to
the next question. So take your two examples. One was flight, another was hotel. In the hotel
case, there was an explicit goal that you want to maximize the revenue, so how you do express
that in your Quantum Database?
>> Sudip Roy: As I -- at this point, we don't support it. Yes. If we are to build a system, that's
definitely a useful add on that has to be supported. Other questions? Okay. So let us go back to
this example in which we were lazy, yet we were strongly consistent, and we achieved this by
exploiting the fact that we are going to allow these two database states to be in two different
states, yet two different states as long as they are equivalent to each other. And I kind of said
that we are going to use this global treaty, we are going to project it to these local treaties, and all
of this was a bit abstract. So in this part, I'm going to formalize and make all of this concrete.
So before I do that, I had also promised that I have these couple of assumptions, and I'm going to
come back and identify what exactly this class of applications are. So let us see a few examples.
Firstly, why is low latency important? Why do we really care about saving on the network round
trips? Now, there is a good deal of anecdotal evidence which suggests that even a 100-millisecond latency, in the case of Amazon, causes a 1% loss of revenue, and usually this figure
rises exponentially with the average latency. So, clearly, latency is something which is
important in order to -- which can directly be related to dollar values. And there are many
applications which satisfy the previous assumption. Let's say online shopping, in which the data
is actually replicated across different datacenters, and the flexibility is you can imagine -- and I'm
going to actually show you, my experiments are going to be over TPC-W benchmark, which is
an online shopping benchmark. So you don't really have to know how many items are there in
the stock exactly, so there is some flexibility in that, as long as they are sufficient for your order
to go through. Similarly, in auction systems, you only need to maintain which are the top set of
auctions. It doesn't really matter what the other lower values of auctions are. And, finally, this is
something which probably doesn't directly apply, at least right now, but you can imagine that, if
you can partition the application state for mobile devices, in which part of your application state
is in mobile, you are basically saying that you can make changes to some part of the application
state, which is on your mobile device. And as long as you're doing that, you don't have to
communicate to the server, then you can improve the app response time, because not every one of
your actions is now going to require communicating with the server. So here's the overview of
our solution. So in the first step, we are basically going to analyze the application transactions to
automatically identify this notion of flexibility, and the intuition is that we want to basically
identify which database states are equivalent, and therefore we are going to partition the space of
database states into equivalence classes. And we are going to build upon a rich literature on
program analysis, because effectively the transactions, as I said initially, are user programs. And
in the second step, once we identify these equivalence classes, we are going to exploit this
flexibility to minimize coordination. And again, the intuition is that instead of trying to enforce
that the two replicas are in completely identical state, I'm going to instead enforce that the two
replicas are in equivalent state. They may be non-identical. And there is -- coming back to
[Sudipta's] question about escrow transactions, demarcation protocols and distributed divergence control protocols, it may be a bit vague right now, but we are a significant generalization over
each of these techniques. Moreover, we do a number of other things, and I hope it will be more
obvious by the end of the talk. Let me come back at the end of the talk to revisit how exactly we
are different from each of them. So let us apply the solution to the voting example, right? So the
input in the voting example was this set of three transaction types. The output of the first step
would be these three equivalence classes, one in which the Democrats are leading, one in which
the Republicans are leading and one in which they are tied. And this is going to feed into the
second step, and then the output of the second step is going to be a protocol which ensures
consistency by requiring that the replicas always stay in the same equivalence class. So
whenever you are actually changing from one equivalence class to another, then the protocol is
going to ensure that that happens consistently and no one perceives that you are in two different
states. And that's how we are going to achieve strong consistency. So, with that, let me dive into
how we do step one, and I'll get on to step two later. So doing this analysis in full generality is
difficult, and I do not expect you to parse this. So we restrict the transactions to be expressed in
a particular subset of the language. This is the language. I do not expect you to actually parse
this. Let me just highlight a few key points. We assume that the database is a collection of
integers. We have these IO statements to read and write from the database. Right now, we
support only conditionals, if, then, else. We do not support for loops and while loops, but for
OLTP transactions, this is not a big restriction. And, finally, we have arithmetic expressions and
Boolean expressions. So assuming that transactions are executed -- transactions are expressed in
this language, this is what a transaction would look like. So it has a read statement, and I use the
hat notation to indicate local variables. The non-hat variables are stored in the database. So the
read(X, X-hat) would read the value of x from the database into the local variable X-hat.
Read(Y) would do the same for y, and then the transaction checks if X+Y is less than 10. If so, it
increments X. Otherwise, it decrements X, and finally, it writes that value back into the
database. So this is going to be my running example for the rest of the talk. So let us try to
formalize this notion of flexibility a bit more. Assume that we have these three database states.
These three all have different values of X and Y, and yet, from this transaction's perspective, if you execute this transaction on each of these database states, it is going to produce an identical
effect, the effect being increment X by one. So how can we represent concisely this entire set of
database states? To do so, we use symbolic tables, which basically have two columns. The first
column corresponds to a partition of the space of database states, and the second column is what
effect the execution of the transaction has. So if you consider this tuple, it says that over all
database states in which X+Y < 10, executing this transaction would have the effect of
incrementing X by one, and similarly for the other case. Now, of course, an application would have
multiple transactions, not just one transaction, so let's add another transaction to the mix. It's
very similar to the first transaction, except that now instead of writing to X, it is actually writing
to Y. And also, I have changed the threshold from 10 to 20. And here, you can see that that's
basically the symbolic table for transaction T2. Now, if these are the only two transactions
which are executed in the application, I can combine them to construct a joint symbolic table,
and I do so by taking a cross-product. Now, in normal cases, the cross-product would have four
tuples. One of them is degenerate, and therefore I have eliminated it, and therefore it has three
tuples. What does it say? It basically says that over all database states over which X+Y < 10,
executing Transaction 1 has the effect of incrementing X by one. Executing Transaction T2 has
the effect of incrementing Y by one. So I basically didn't explain how we construct the symbolic
table from this transaction, so let me show you how we do that. And again, we have a set of
inductive rules for constructing these symbolic tables from the transaction code. I do not expect
you to parse through them. Instead, let us look at an example construction. So this is, again, a
control flow graph for the transaction which I showed earlier, and we construct the symbolic
table in a bottom-up manner. So we start with the last statement -- in this case, it is a write, in
which case, executing only this statement overall database states would have the effect of
assigning the value of the local variable X-hat to X. And that's why the true indicates that it'll
have the same effect over all states. And as you work your way backward, in this case, you see
that, well, it is going to have the effect of incrementing along this branch. It'll have the effect of
decrementing along this branch. When you see an if statement, you see that in order to take this
part in the code, X+Y must be greater than 10. To take the other part, it has to be less than 10.
When you see a read statement, then you basically remove the local variables and substitute with
the corresponding database variables. You do the same thing for read(X), and finally, you end
up with the symbolic table, and this is exactly the symbolic table which I had shown you earlier.
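Here is a small Python sketch of the symbolic tables for the two running-example transactions and of the joint table built by the cross-product. Representing guards and effects as Python functions is just a shorthand for illustration, not the formalism used in the system:

```python
# Each symbolic-table row is (guard over the database state, effect on the state).
T1_table = [
    (lambda db: db["X"] + db["Y"] < 10,  lambda db: {**db, "X": db["X"] + 1}),
    (lambda db: db["X"] + db["Y"] >= 10, lambda db: {**db, "X": db["X"] - 1}),
]
T2_table = [
    (lambda db: db["X"] + db["Y"] < 20,  lambda db: {**db, "Y": db["Y"] + 1}),
    (lambda db: db["X"] + db["Y"] >= 20, lambda db: {**db, "Y": db["Y"] - 1}),
]

def joint_table(t1, t2):
    # Cross-product of the two symbolic tables. Rows whose combined guard is
    # unsatisfiable (e.g. X+Y < 10 and X+Y >= 20) are the degenerate ones the
    # talk drops, which a real implementation would prune.
    return [
        (lambda db, g1=g1, g2=g2: g1(db) and g2(db), (e1, e2))
        for g1, e1 in t1
        for g2, e2 in t2
    ]

db = {"X": 3, "Y": 4}
for guard, (effect1, effect2) in joint_table(T1_table, T2_table):
    if guard(db):   # X+Y = 7: both T1 and T2 take their increment branch
        print(effect1(db), effect2(db))   # {'X': 4, 'Y': 4} {'X': 3, 'Y': 5}
```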
Now, the key thing to note here is that the symbolic table only uses variables which are in the
database and does not have any references to local variables, because we have already
substituted these local variables with their corresponding database variables when they were read. So now that
we have constructed these symbolic tables, let us see how we can use these symbolic tables to
construct a protocol. So, again, the input to the second step is the output of the first step, in this
case, this giant symbolic table. And let us assume for simplicity that we are in a distributed case
in which one of the sites has the variable X, the other site has the variable Y, and the initial states
are 12 and 13. So what the Homeostasis Protocol does is it checks to which equivalence class
does my current state of the database belong to? So in this case, the values of X and Y being 12
and 13 indicate that it belongs to the third equivalence class, and it is going to use that to be a
global treaty. Let us assume that there is an efficient way of actually maintaining this global
treaty without requiring communication, and I'll come back to that in the next step. So now,
when I execute a transaction, what I do is basically I go and look up what effect that Transaction
T1 has in this particular equivalence class. In this case, it just decrements the value of X. So I
can keep on executing these transactions, as long as the overall state satisfies this global treaty.
So, finally, I will reach a stage where a transaction may actually cause a violation of this global
treaty, at which point I recheck and establish a new global treaty, and that begins a new round of
this Homeostasis Protocol. So what we have done is basically we have executed six transactions
in this case and incurred the cost of only two network latencies. If you had done it in a strongly
consistent manner, you would have incurred six network latencies. Now, of course, how many
network latencies you actually incur will depend on how big your equivalence class is. Yes.
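A naive Python sketch of one such execution on the running example is shown below. Note that this version still evaluates the treaty over the full state, which is exactly the issue the next question raises; the concrete numbers and the renegotiation step are placeholders for illustration:

```python
state = {"X": 12, "Y": 13}
round_trips = 0

def class_of(total):
    # Return the guard of the equivalence class containing a given value of X + Y.
    if total >= 20:
        return lambda db: db["X"] + db["Y"] >= 20
    if total >= 10:
        return lambda db: 10 <= db["X"] + db["Y"] < 20
    return lambda db: db["X"] + db["Y"] < 10

treaty = class_of(state["X"] + state["Y"])      # global treaty: X + Y >= 20

def run_T1():
    # In this equivalence class T1's effect, read off the joint symbolic table,
    # is simply "decrement X"; no other values need to be read to execute it.
    global treaty, round_trips
    next_state = {**state, "X": state["X"] - 1}
    if not treaty(next_state):
        round_trips += 1                        # treaty would be violated: synchronize
        treaty = class_of(next_state["X"] + next_state["Y"])   # and establish a new one
    state["X"] -= 1

for _ in range(6):                              # six executions of T1, as in the talk
    run_T1()
print(state, "synchronizations:", round_trips)  # one round trip instead of six
```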
>>: When you're in the state Y = 11, you don't know what is the global value of X, so how do
you validate the global treaty locally without communicating?
>> Sudip Roy: Right, so that comes back to the question of this magic, which I am going to
come to in the next slide. But before I do that, so we have a theorem that proves that the
Homeostasis Protocol actually produces one-copy serializable schedules. So, of course, the naive approach to
enforce this global treaty is to be aware of the global state, and that will require communication
and knowing the values of both X and Y at every step, which kind of defeats the whole purpose,
because we'll be back in the world of strong consistency. We want a lazy approach, and to do
this, we basically project this global treaty into a set of locally enforceable treaties. And, of
course, because we are projecting it into these locally enforceable treaties, and these locally
enforceable treaties are working on a limited state, they have to be more conservative, but we
require that these locally enforceable treaties would together imply the global treaty. So in this
case, to enforce that X+Y is greater than or equal to 20, one possible set of local treaties would
be X > 10 and Y > 10. So as opposed to now enforcing the global treaty, I'm now going to
enforce these local treaties. So I keep on executing transactions until I run into a treaty violation.
Now, given that these local treaties have to be more conservative, this violation is going to occur
earlier than in the previous case which I showed you, at which point you renegotiate and establish a
new set of treaties and the protocol goes on. So, of course, there are multiple possible ways of
projecting a global treaty into a set of local treaties, and if you assume that you know something
about the workload -- that is, you assume that you know that Transaction T1 is more frequent
than the other transaction, then you can find an optimal projection, projections which are least
likely to be violated. So in this case, this was the suboptimal solution of choosing 10 and 10,
which only allowed you to execute four transactions. As it turns out, this -- for this particular
sequence of transactions, the optimal projection is to have X greater than or equal to 9 and Y
greater than or equal to 11. This will allow you to execute six transactions without a violation.
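As a sketch of how such a projection might be computed with an SMT solver, here is a small example using the z3-solver Python bindings; the talk mentions Z3, but the exact encoding and the workload-driven objective the system uses are not reproduced here, so this only finds one valid projection and notes where the optimality criterion would go:

```python
from z3 import Ints, Solver, sat

X0, Y0 = 12, 13                 # state at the last synchronization point
tx, ty = Ints("tx ty")          # local treaties will be X >= tx (site 1) and Y >= ty (site 2)

s = Solver()
s.add(tx + ty >= 20)            # together, the local treaties must imply the global treaty X + Y >= 20
s.add(tx <= X0, ty <= Y0)       # and each local treaty must hold in the current state
s.add(tx >= 0, ty >= 0)

if s.check() == sat:
    m = s.model()
    print("one valid projection: X >=", m[tx], "and Y >=", m[ty])

# The real negotiator additionally encodes the expected workload and asks for the
# projection least likely to be violated -- for example, X >= 9 and Y >= 11 in the
# talk's example sequence -- rather than settling for any satisfying assignment.
```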
Yes?
>>: What happens if the write sets on the two sides have something in common, like if you're writing X and Y in both places? In this example, you're writing X on one side and Y on the other side, so you can kind of work around it, right? But if the write sets intersect, will it cause a problem?
>> Sudip Roy: It does cause a problem, yes, in which case you would actually be back in the
world of strong consistency, and there is nothing which you can do. Now, of course, in a
replicated scenario, all the state is available locally.
>>: But the same problem would show up in -- let's take your voting example. So there are two tables, and they're not -- they're replicated. You can't actually increment the vote in either place. You're sending all the writes to one place, but the reads --
>> Sudip Roy: No, no, no. No. I can make writes to both places. So in the example which I showed, I was actually casting both Republican and Democrat votes at both the datacenters.
>>: When you know the final tally, some subset of votes got incremented here, some subset
there. We just know the result. You won't know the actual data.
>> Sudip Roy: Yes, but that's kind of the whole point, that the application doesn't really need to
know the exact tally. At some point, when you synchronize, you are going to know the tally.
You are going to merge these two states, so it's not that the state is going to be always divergent.
It is going to reconcile periodically at synchronization points. No? Okay.
>>: Sudip, without knowing the workload, of course, you don't really know what the optimal
global treaty should be, right? Meaning you could have all the workload on one side or the other
and all the updates to only one of the variables, and you wouldn't -- your global treaty wouldn't
be able to adjust dynamically for that unless you waited more steps?
>> Sudip Roy: Right. So in which case you can have something which is similar to the idea of
doing -- dynamically estimating what the workload is. That is, if during the day some items are
ordered more frequently on the west coast -- sorry, in America than on the other side of the
globe, then you would allocate more budget to the datacenter in the US. So putting all of this
together, we basically developed this system called Homeostasis, and it has a number of
components, and let me just briefly walk you through it. So we assume that we are given a set of
transactions, which are to be run, and then we use a compiler to construct these giant symbolic
tables. Of course, we do not actually construct a single large, giant symbolic table for the entire
set of transactions. We in fact use techniques from the SDD-1 paper, which partitions -- it's actually [indiscernible] work, which partitions this entire set of transactions into
groups of interdependent transactions based on conflict graph analysis, and we construct a joint
symbolic table for each such interdependent group of transactions. We maintain a treaty for each
such group, so whenever a transaction is executed, the treaty enforcer allows a local execution if
the local treaty is not violated. If it is violated, then it goes and talks to -- then it initiates a round
of negotiation. The treaty negotiator goes and talks to the other replicas. It merges the changes
which have happened at the other replica since the last synchronization. Based on this new state,
new synchronized state of the database, it constructs an instance of a satisfiability problem. And
the solution to the satisfiability problem is the optimal partitioning of the global treaty into local
treaties. And then, it sets the new treaty, and that starts a new round of this homeostasis
protocol. So the overall takeaway is that there's a class of applications which have some
flexibility in their transactions. We can exploit these flexibilities to lazily propagate writes
without sacrificing consistency, and I showed you the Homeostasis, which is a system that
identifies and exploits this flexibility to minimize communication between different nodes in a
distributed or a replicated system. So with that, let me present to you some experimental results,
and as I had pointed out earlier, my experiments are going to focus on Homeostasis, but I'm
happy to talk about results on Quantum Databases after the talk. So the goal is to evaluate the
applicability of -- you had a question?
>>: You were talking about this global treaty. Do you have any constraints on what kind of
global treaty you can support? And are there any guidelines as to, once given the global treaty,
how can you translate into these more -- these local treaties? You gave an example, but it was --
>> Sudip Roy: Right. So the first question was what are the constraints on the global treaties?
We actually restricted the language to the fragment I just showed you, and by analyzing those
transactions, you can only get a certain class of global treaties, and precisely that's going to be
Peano arithmetic first-order logic. Now, in general, solving satisfiability problems over Peano arithmetic first-order logic is undecidable. We use some tricks to actually convert it into
Presburger arithmetic first-order logic, and that is decidable, as well as solvable. In fact, we use
Z3 software to do this, which is actually a Microsoft Research technology. Does that answer
your question? Right. So coming back to the experiments, we want to evaluate the benefits of
Homeostasis in a georeplicated setting, and more precisely, we want to answer the question as to
how often we can actually avoid coordination for realistic application workloads? Secondly, we
want to study this tradeoff between how much time we are spending in finding this optimal
projection of global treaties into local treaties and how does that correlate to how much savings
we get in coordination? So for the workload, we use TPC-W buy-confirm-like transactions, so we assume that there are 10,000 items in a database. We are assuming that each transaction is purchasing one to four pieces of a particular item. Initially, the database is populated with stock levels ranging from zero to 100 for each item, and based on the TPC-W specs, every time the
level actually goes to zero, the transaction automatically replenishes the stock level by adding
100 new pieces of the item. And we basically ran our experiments on EC2. We used m3.xlarge
instances, and the system was deployed across five different datacenters, Virginia, Ireland,
Oregon, Sao Paulo and Singapore. For the first two experiments, I'm just going to use two
replicas. That is going to be Virginia and Ireland. And for the third experiment, I'm going to
show from two to five, the behavior of the system as we add sites. Sorry. So let me explain
what this graph is. So on the X-axis, I have the sequence of transactions issued for one particular
item -- in this case, let's say item A. On the Y-axis, on this side, I have what is the view of the
stock level from the point of view of Replica 1. And on the Y-axis on the other side, I have the
transaction latencies. So the red line here corresponds to the stock value, and the green line
corresponds to the transaction latencies. So let us walk through this graph from left to right. So
let's say that I will start with this value of 100 for the red line. So now I'm executing transactions
locally using the Homeostasis Protocol, and that manifests itself in these low latencies, because
the transactions are executed locally. At this point, I witness a local treaty violation, so I need to
run a round of synchronization, and that requires communication between the replicas, which is
why there is a spike in this green plot. And, also, I witness a sharp cliff in the red line,
which corresponds to that, and that is because I am now synchronizing the state, so this change
was the number of purchases which happened at the other replica while I was running my
transactions locally. At this point, I have established a new set of local treaties, and the
execution continues locally again until I reach zero, at which point I replenish my stock and the
protocol proceeds. So how often do we benefit from this? To understand this, basically, this is a
transaction latency profile. On the X-axis, I have latencies in the log scale. On the Y-axis, I
have the cumulative probability. That is what fraction of the transactions are executing under a
particular latency value. So we are comparing against 2PC, which is strongly consistent and it
always takes a round-trip hit -- in this case, 200 milliseconds. And, of course, there's a sharp
cliff, because after 200 milliseconds, all transactions will be able to execute. However, these
four lines basically correspond to different settings of the optimization parameter, so the higher
value of L means that you are spending more time in finding optimal treaties, and therefore, you expect a larger number of transactions to execute locally. So in this case, almost for all four of
these parameters, you see that more than 85% of transactions were executed locally. Now, how
does that behavior change as we increase the number of sites? So what exactly happens when
we increase the number of sites? As I pointed out earlier, when you factorize a global treaty into
locally enforceable treaties, you need to be conservative. So those local treaties have to be more
and more conservative. And if you are factorizing it into more number of fragments, then they
have to be even more conservative. So as you go from two replicas to five replicas, your local
treaties are going to be increasingly more conservative and therefore more likely to be violated
easily. And this manifests itself in this downward shift of the inflection point, which basically
says that slightly fewer transactions are executed locally and you witness the treaty
violations more frequently. Now, the takeaway from this is basically, even with five sites, more
than 80% of the transactions were executed locally. So with that, let me mention some of the
related works. There has been quite a bit of interest in the database community. I'm sorry?
>>: Throughput.
>> Sudip Roy: Throughput, I probably have a backup slide on that. We did have an experiment
on that. I'm happy to show that to you offline.
>>: What are the inter-site communication delays?
>> Sudip Roy: So between -- so, of course, that depends on which two datacenters we are
talking about. It ranges between 100 milliseconds between east coast and west coast -- actually,
around 85 milliseconds -- to more than 250 milliseconds between Virginia and Singapore.
>>: Can you back up then? So when you're in the 10-millisecond range, everything's local.
>> Sudip Roy: Yes.
>>: And then you have this big jump as soon as the communication -- as soon as you start
renegotiating, then you're into cross-datacenter communication. That's when you get the big
jump?
>> Sudip Roy: Yes. No, that's when you get this jump in the X-axis. So what happens is, you
can actually do better. You can have some anti-entropy protocol, which runs in the background
and periodically reconciles the state between the two so that you can eliminate some of these
local treaty violations. However, you cannot eliminate them completely, because whenever you
transition across an equivalence class boundary, that has to happen consistently.
>>: Do your updates commute in this case?
>> Sudip Roy: Yes.
>>: So all you're really doing is you're just looking for better consistency of reads? Is that
what's going on? Because if the updates commute, then you can do multi-master replication
with impunity and you don't need to worry about cross-database delays or anything and
renegotiate. As long as the updates eventually reach their destination, everything will wash out.
>> Sudip Roy: However, I showed you in the voting example where -- yes, you were right
earlier when you said that we are doing reads. We want to ensure read consistency.
>>: It's all about read consistency.
>> Sudip Roy: Yes.
>>: His reordering transaction sort of requires read consistency, at least at that point, right? But
you could do an incremental -- you could save yourself a lot of renegotiating of treaties by
simply incrementally sending the updates from one side to the other offline, if you will, or in the
background.
>> Sudip Roy: Yes, so in this particular case, you would actually not witness this vertical
latency if you do this.
>>: And if you were willing to soften your requirements about updating so that you had
reordering by saying anything less than 10 you'll reorder, you could also soften the spike at that
point, as well.
>> Sudip Roy: Yes. So going back to the related work, there's been quite a bit of interest in this
field in general, but we are the first ones who actually adapt the consistency which the data store
provides to what is required by the application, and we do so by doing this program analysis.
Moreover, we are also the first ones who tried to adapt based on the transaction workload, which
none of the other protocols do. Moreover, these other protocols always assume that you are
given a simple constraint, which is like an inequality constraint, and you want to maintain
that constraint. Our protocol generalizes what this class of constraints is, as well as allows you
to switch from one constraint to another. Only that has to happen consistently. There's also been
some work in the programming languages community for program analysis and automatically
identifying atomic sections and also, in systems community, to assert a formula in a distributed
manner. So with that, let me summarize. So the key idea is that we want to exploit flexibility in
transactions, and I have identified classes of transactions where such flexibility is available to be
lazy when possible. And I have showed you one example in which this laziness created room for
optimization, and I presented Quantum Databases as a system which does that. I have showed
you another instance where this laziness minimizes the amount of coordination required without
sacrificing consistency, and I presented Homeostasis, which provides such semantic-based
adaptive consistency. So let me mention something about what I plan to do if I get an
opportunity here at Microsoft Research. One of the interesting things which I want to pursue is
can we synthesize concurrency control protocols automatically. Assume that you are given
correctness criteria in terms of, let's say, one-copy serializability, and you are given some information about the environment, as in what kind of hardware support you have, what is efficient, what is not efficient, some specification of the environment. And then can we
automatically synthesize the best concurrency control protocol? And there's been quite a bit of
recent very interesting work in program synthesis, some of it from Microsoft Research itself, and
it would be really interesting to investigate how we can apply some of those techniques to
automatically synthesize concurrency control protocols. The other direction of research which
would be interesting to pursue is no-knob cloud services. As computing moves to the cloud,
managing cloud services by administrators becomes increasingly difficult, so you
would want to have a system in which the cloud somehow automatically detects performance
anomalies or any other sort of anomalies and takes actions itself as far as possible to remove any
kind of anomalies. And the first step in this is, of course, diagnostics, and we have done some
initial work with Christian on this. Of course, the interesting question is once you have
identified what these anomalies are with some high-level idea of what the reason is, can you
close the loop and automatically improve the -- automatically take actions which improve the
performance of the cloud service? So I talked about Homeostasis and Quantum Databases today.
I have also worked -- I have also done some initial work on the Youtopia Project, which is about
designing declarative abstractions for data-driven coordination. You may have heard of
entangled queries and transactions. So the key idea there is basically with the rise of social
networking, you would want users -- users would actually want to issue transactions, which can
now talk to each other and take joint decisions. And we designed abstractions which allow you
to do this in a clean and efficient manner. Finally, I have done three internships, one with
Christian, in which we initiated this new project for robust diagnostic for cloud platforms. I have
done two other internships, one -- actually, both of them in Google Research in the Fusion
Tables team. For the first, I worked on spatial query processing in the Fusion Tables back end.
In fact, if you have used Fusion Tables or if you use it today, it's very likely that my code is
executed on the back end. And in the next internship, I worked on faceted navigation for data
exploration, and I'm happy to talk about any of these projects in the one-on-one meetings which I
have. So, with that, that concludes my talk. Thanks a lot for attending. I'm happy to answer any
other questions which you may have.
>> Christian Konig: Okay, any more questions? All right. Let's thank the speaker again.