>> Sudipto Das: Hello, everyone. It's my pleasure to introduce Bailu Ding. She is a PhD
student at Cornell, working with Johannes Gehrke. She has been spending the past two years
here at UW, also collaborating with a bunch of UW folks. She was an intern with Phil and me,
working on Hyder and she did a great job during the internship, helped us build a prototype for
Hyder, and she has also done some great work in optimizing optimistic transactions, optimistic
concurrency control for distributed transactions. That's her thesis work, as well, which she's
going to talk to us about in the next hour or so. Bailu?
>> Bailu Ding: Thank you. Okay, good morning. So today I'm going to present my thesis work
on optimizing optimistic concurrency control, so as Sudipto just introduced, I'm a PhD student
and still at Cornell, but right now, I'm visiting University of Washington in Seattle in the
database group there. I work with Magda and a bunch of students in the group, as well. So
transaction processing is sort of everywhere in the current age, like when you do any searching,
when you do shopping, when you do Googling, Twittering and Facebooking. So there are two
main major ways to manage transactions. One way is to manage transactions with a locking-based
approach, which means when you try to access something, you first acquire a read or a write lock
on the item you want to access, so that others won't interfere with yours when they access their
items. The other way is called optimistic concurrency control. The idea is that you don't acquire
locks, so you just optimistically assume that you are the only one who executes the transactions,
but you do some verification afterwards to verify whether your assumption is true or not. So
there has been some interesting renewed interest in optimistic concurrency control in recent systems, mainly
for two reasons. The first one is that, obviously, since you don't acquire locks, you don't
need lock management, so the overhead of the transactions is lower. The other part is that
one nice feature of optimistic concurrency control is that you don't have this blocking behavior,
which means that the readers don't block the other readers. They also don't block other writers.
So this is very desirable, especially for those web applications. But before I go in deep to what
my thesis is on -- sorry. Sorry. Before going to the details that my thesis is on, let me first give
you a brief review of how optimistic concurrency control works. So, for example, assuming you
start with a transaction, T3, it first goes through a read phase. So in the read phase, what it does is
it reads from the storage. It does some execution on the items. Maybe it does a bunch of
additional reads, and then it comes up with some writes. When it does these writes, first, it only
updates the things in its local workspace, which is not visible to other transactions. And
afterwards, after it finishes its execution, it continues to our validation phase, where in the
validation phase it is checked against all the previously committed transactions to see whether
they will be conflicting with this transaction or not. So if there is no conflict, this transaction
continues to the write phase, where it stores the updates to the storage and makes them persistent and
available for others to read. But if there is any conflict, the transaction aborts, and upon its abort,
the system will restart the transaction to execute it another time.
So -- sorry. So the key idea of the validation phase of optimistic concurrency control is to
compare the read set of a transaction with the write sets of previously committed transactions.
So, for example, assuming we have a transaction T0, it first enters the read phase, where it just
makes a local update to X in its workspace. Then it continues to the validation phase. Since
there is no prior transaction to T0, it just commits trivially, and at the same time, a transaction T1
starts. It first enters the read phase, where it reads X from the storage, and then it continues to
the validation phase. But since, when T1 started and read from the storage, the writes from T0
had not yet been applied to the storage, the validation finds that T1 should have read a newer
version of X, but it didn't, so there is a conflict. That's why T1 aborts.
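As an illustrative aside, a minimal sketch of this kind of read-set/write-set check might look like the following. It is a sketch only, not Centiman's actual code, and all names and structures are assumptions.

    #include <set>
    #include <string>
    #include <vector>

    // Illustrative OCC backward validation: a transaction passes only if no
    // transaction that committed after it started wrote an item it has read.
    struct CommittedTxn {
        long commit_ts;                    // timestamp assigned at validation
        std::set<std::string> write_set;   // items it wrote
    };

    struct PendingTxn {
        long start_ts;                     // when its read phase began
        std::set<std::string> read_set;    // items it read
    };

    bool validate(const PendingTxn& t, const std::vector<CommittedTxn>& committed) {
        for (const auto& c : committed) {
            if (c.commit_ts <= t.start_ts) continue;       // already reflected in t's reads
            for (const auto& item : t.read_set) {
                if (c.write_set.count(item)) return false; // t read a stale version: abort
            }
        }
        return true;                                       // no conflict: commit
    }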
So after giving some idea of what optimistic concurrency control is, now let's look at two challenges in
optimistic concurrency control. The first one is that, as we're seeing, the
transactions should go through the validation phase in a serial order, so essentially the validation
is serial and centralized, so that we can validate the transactions one after another. This imposes
some issues in the scalability of the protocol. The other part of the challenge is that as we're just
seeing, a transaction can abort, and an abort in OCC is not a rare event, so when there is a data
conflict, the transaction can abort and get restarted. So when the data contention increases,
the chance of conflicts becomes higher, and the transaction can get aborted and restarted a
million times. This basically means the system wastes a lot of resources on doing meaningless
work. So in this case, the performance of the system will suffer. So this is another challenge
that is -- also to say, how do we handle data contention to reduce conflicts. So my thesis
addresses both problems. The first part is how we scale the validation phase. We actually
propose two kinds of scalability approaches from different aspects. One is how we introduce
pipeline parallelism into the validation phase. The main idea is that we chunk the validation
phase into smaller stages and assign a different thread to work on each of the stages. In this
sense, we can work on different stages of the validation phase in parallel. The other part of this
is that we try to introduce, let's say, a vertical parallelism into the system, which means that we
try to divide the work of the validation onto different instances, so each of the instances will
handle one part of the work. In this sense, we also parallelize the work and decentralize the
validation. So for the other challenge, we proposed a way to reduce the conflicts due to data
contention by reordering. The main idea is that in optimistic concurrency control, the order of
the transactions is not decided up front; it is only serialized at the validation phase, so this
gives us some flexibility in how we organize the execution of the operations in transactions. So
we thought that if we model this kind of order of execution carefully, we could reorder the
operations to reduce the conflicts between transactions. So due to the time constraint, I'll mainly
talk about two projects in detail. One is the Centiman project, where we do the parallel
validation. The other one is the transaction batching project, where we do operation batching
and reordering to reduce the conflicts. Okay, so next I will introduce and discuss the Centiman
project and how we distribute the optimistic concurrency control, in particular the validation phase.
Okay. So when we designed the Centiman project, the idea of this -- let's say the guiding
principle of this project is that we want to design a system that has good modularity, which
means we separate the processing and the storage of the system, so that it is easy to deploy this
kind of system in the context of the cloud. So the design [indiscernible] is that we want to have a
simple design, and we also want to reduce the coupling between different components as much
as possible. So recall the three phases of optimistic concurrency control, where we have the read
phase, the validation phase and the write phase. If we want to do a distributed architecture of
this protocol, it naturally translates to three components in a system, so one component is the
processor component, which corresponds to the read phase and execution of the transactions.
And the other one is the storage component, where it actually does the reads and writes of the
items in the system. And finally, we have the validator component, which takes charge of the
validation of the system. So what we want to do is try to make each of the components
distributed. The first one that comes to our mind is to make the processor component distributed.
This one is pretty straightforward: we just add a number of processor instances and spread the
traffic across the different processor instances, and we're done with that. Next, we want to
make the storage distributed. The idea is that we plug in our favorite key-value store and
make it versioned. So a versioned key-value store means something like this: each write comes
with a key, a value and also a version, so if the version is newer than the existing version in the
storage, we apply the update to the storage. If the version is older than the existing version in
the storage, which means the update is out of date, we just discard or ignore this update, because
it comes too late. So now, let me give you
an example.
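As a rough illustration of the versioning rule just described -- a sketch only, with assumed names and types, not the actual storage code -- a single-version versioned put could look like this:

    #include <string>
    #include <unordered_map>

    // Sketch of a single-version "versioned" key-value store: a write is applied
    // only if its version is newer than the stored one; late writes are ignored.
    struct VersionedValue {
        long version = 0;
        std::string value;
    };

    class VersionedStore {
    public:
        void put(const std::string& key, long version, const std::string& value) {
            auto& slot = data_[key];
            if (version > slot.version) {      // newer version: apply the update
                slot.version = version;
                slot.value = value;
            }                                  // older version: discard, it came too late
        }
        VersionedValue get(const std::string& key) { return data_[key]; }
    private:
        std::unordered_map<std::string, VersionedValue> data_;
    };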
>>: There's only one version of each item?
>> Bailu Ding: So let me say it this way.
>>: Timestamped version.
>> Bailu Ding: Yeah, yeah, exactly. So let me say it this way: there are two ways you can do
versioned storage. One is that you just store the latest version of the item in the storage, which
is basically a single-version system. The other way is that you store multiple versions in the
storage, where you have a multi-version storage and you can do things like snapshot isolation.
So in our case, we assume we just have single-version storage, where you just keep the latest
version of each item in the storage. Okay? So let me
give you an example of how the version storage works. So first assume we have a transaction
T1. It reads -- so the storage starts with the version 0 of X and version 0 of Y, so assuming we
have a transaction T1, it starts with reading a version of X from the storage, which is now
version 0, and in its read phase, it also updates X in its local workspace. And next, the
transaction T1 enters the validation phase. Since there is no prior transaction before it, it just
commits trivially, and this time, when it passes the validation, the validation component assigns
a timestamp to this transaction, assuming the timestamp is just 1. So when transaction T1 enters
the write phase, it just updates X to version 1, where version 1 is its timestamp. So assume a
transaction T2 starts; when it reads Y from the storage, it's still version 0, and it
updates Y in its local workspace. Then it enters the validation phase, passes the validation and
gets the timestamp 2. So then it continues to update Y to version 2 in the storage. Now,
assuming we have a transaction T3 that starts, it first enters the read phase to read X from the
storage, which has now been updated by transaction T1 to version 1, and then it makes some
update to X in its local workspace, and it enters the validation, passes the validation and gets
timestamp 3, and finally updates X to version 3. So the first challenge that comes from the
version storage is that -- I am so sorry -- is that we can have like reads from inconsistent
snapshot from the storage. Recall that in the original optimistic concurrency control, we want to
ensure that the validation phase and the write phase are in a critical section, so that the writes
to the storage are atomic for a transaction, and we apply the writes in the order of validation.
But now that we have a distributed key-value store, we don't have this nice property,
so we suffer from two kinds of anomalies with the key-value store. The first one is that, since the
updates are not atomic, we could read partial updates from the storage. For example, assuming
we have a transaction T0, which writes to both X and Y, and then we have a transaction T1 which
reads X and Y, but since the order in which we issue the reads and writes is not
atomic, we could read the update from T0 on X, but not the update on Y. So
this means we can read partial updates from the storage. The next issue is probably more subtle.
This means we could actually read an inconsistent snapshot from the database. So for example,
assuming we have a transaction T0, which writes to X and D, and then we have a transaction T1 which
reads D in its read phase, where it reads the update from T0, and then in its write phase, it
updates Y. So T1 is good, because we can serialize T1 after T0. But the problem is with T2,
which wants to read both X and Y. Because of the order in which we issue the reads, we only
read the version of X from before T0 updated the storage, but we read the version of Y from after T1
updated the storage. This basically means we read the updates from T1 but not the updates from
T0. But because we must serialize T1 after T0, this means we read from an
inconsistent snapshot: a snapshot that contains updates from T1 must contain updates
from T0. We only read the updates from T1 and not T0, so this means that we read from an
inconsistent snapshot of the database. So there are a couple of proposals that can solve this kind
of inconsistency. One is that we can just do a two-phase commit to apply the
writes to the storage. But because we are building on top of a key-value store, we don't really
want to add more layers on top of the key-value store or add more APIs to the key-value store,
so instead of doing all the heavy lifting in storage, we actually limit it to the
validation phase, where the validator can guard against these kinds of anomalies, so we are not
worried about doing atomic updates or reading a consistent snapshot from the storage. Okay,
now we are good. Okay, so now we are good with distributed storage, with a versioned key-value
store. Now, what is left is to do distributed validation. So the idea of doing distributed
validation is actually fairly straightforward. We just partition the key space into different shards,
and then we ask each of the validators to check conflicts on its own shard. So when we issue a
transaction for validation, we first split the transaction based on the data it accesses. For
example, if it accesses data on just one validator, we just send the whole
transaction to that validator. However, if a transaction accesses multiple validators, we split the
transaction into smaller parts and send each of the parts to one validator in the system. After
each of the validators checks the conflicts based on its own data and sends back its local decision,
the processor acts as a coordinator: it receives all the decisions and makes a final decision for the
transaction.
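A minimal sketch of this splitting step, under the assumption of a simple hash-based sharding of keys; the names and the shard count are made up for illustration, not taken from the system:

    #include <cstddef>
    #include <functional>
    #include <map>
    #include <set>
    #include <string>

    // Sketch: partition a transaction's read/write sets by key shard so each
    // validator only sees the keys it owns, then AND the local votes together.
    constexpr std::size_t kNumValidators = 4;          // assumed shard count

    std::size_t shard_of(const std::string& key) {
        return std::hash<std::string>{}(key) % kNumValidators;
    }

    struct ValidationRequest {
        std::set<std::string> reads;
        std::set<std::string> writes;
    };

    // Split one transaction into per-validator sub-requests.
    std::map<std::size_t, ValidationRequest> split(const ValidationRequest& txn) {
        std::map<std::size_t, ValidationRequest> parts;
        for (const auto& k : txn.reads)  parts[shard_of(k)].reads.insert(k);
        for (const auto& k : txn.writes) parts[shard_of(k)].writes.insert(k);
        return parts;
    }

    // The processor acts as the coordinator: commit only if every shard votes commit.
    bool combine_votes(const std::map<std::size_t, bool>& votes) {
        for (const auto& v : votes) {
            if (!v.second) return false;               // any local abort is a global abort
        }
        return true;
    }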
Up to this point, we are fine, but it turns out this protocol does not work nicely in
practice. As we will see, the abort rate of transactions will rise fairly quickly if
we use this approach. So the problem here is divergent decisions. The idea is that since
we might split a transaction across different validators, it might be the case that the different
validators come up with different decisions. For example, assume we have a transaction T3,
which writes to X and writes to Y. It sends X to validator A, which holds the item X, and it sends
the other part of the validation request to validator B, which holds the item Y. So assume validator
A finds no problem with X, and it votes to commit, but validator B finds something
conflicting with Y, so it votes to abort. Because our coordinator, our processor, is smart,
after it collects the two decisions, it knows that one of the validators is not comfortable
with the update, so it will abort the transaction. Now, T3 is fine, but what if we have T4 come
afterwards, where T4 wants to read X and write to X? Because validator A previously thinks
transaction T3 has committed, it will cache the updates from T3. But since T3 eventually aborts,
its updates never make it to the storage, so T4 will not be able to read the updates from T3.
Now, when T4 comes, validator A thinks T3 has committed, but T4 has not read its updates, so it
votes to abort. So what this means is that our transaction T4 is aborted due to a transaction T3
that did not commit. We call this kind of abort a spurious abort, because it's a false abort:
T4 should commit. And we call the updates left in the validator cache by aborted transactions
spurious updates. So how are we going to eliminate the spurious
update? So the first proposal is the very proactive proposal, where we just ask the processor,
after it collects the decisions from all the validators, it just sends back the final decision to all the
validators, to notify them that the transaction has aborted and they should revoke any spurious
updates left in their cache. However, if we do this synchronously, it actually slows down the
system, because the feedback is now on the critical path of each transaction. But if we do this
asynchronously, which means after we get the decision, we asynchronously propagate back the
decision to the validator, it adds complexity in the system. So first, we need to implement the
logic to propagate back the decision and revoke the updates. The second part, which probably is
the more -- let's say the more troublesome one -- is that we need to handle the case where we
somehow lose messages in the system, which means the updates are never going to be
revoked from the system. And that will create a lot of trouble and aborts for later transactions.
Okay, so this is one proposal. So on the other extreme, we can have a very lazy proposal. The
idea is that we are accumulating the updates in the validator cache, but we're not going to
accumulate things forever. At some point, we need to discard the old updates or garbage collect
the old updates. So what about we just count on the garbage collection process to remove those
spurious updates from the validator cache? So we did a simple experiment with a
straightforward garbage collection mechanism. The idea is that we just calculate
or get some expectation of how long it will take for the writes or for the transactions to hit the
storage and be persistent, and after we wait long enough, we think we will be safe to remove the
updates from the validator cache. So we proposed -- we implement this kind of garbage
collection logic in the system, and we tried let's say a different period, like how long we wait
before we expire the updates in the cache. So we did an experiment on different expiration
times, ranging from 10 seconds to a minute. So what we observe here is that for all the different
configurations, we first get a little bit of a rise in the abort rate in the system in
the first few minutes. This means the system starts to accumulate the spurious updates in the
validator cache, so we see the rise of the aborted transactions. But after a while, the system sort
of reaches equilibrium, and the abort rates sort of plateau. This means we finally stay
stable at some abort rate, because we cache the spurious updates for some amount of time
and then start to expire them. So as we can see from the figure, letting the updates stay in
the cache for a minute before you do garbage collection is clearly not an option, because we get
up to like 90% and 99% abort rates, which basically means the system is not making significant
progress. But even if we just -- yes?
>>: How big an abort rate do you have initially?
>> Bailu Ding: Yes, I forgot to mention that. The initial abort rate is less than 1%, so it's a very
low-contention workload.
>>: With 100% spurious abort rate, that presumably you're doubling that.
>> Bailu Ding: No, so this is not a spurious abort. It's a total abort. Yes, this means like 99% of
the transactions actually are aborts in the system, because we cached -- yeah.
>>: So what is the workload here?
>> Bailu Ding: Oh, okay, the workload here is the same workload. We have a transaction with
10 items, like five of them are reads and five of them are writes. It's over 100,000 items in the
system.
>>: I don't understand how an abort rate of 1% can, with spurious aborts, suddenly become a
100% abort rate.
>> Bailu Ding: So the idea is that when you have a couple of validators, and some of them will
cache the spurious updates, so you will basically -- when you store things for 60 seconds, it
basically means you cache the spurious updates for 60 seconds. So during that time, all the
transactions that read those items with spurious updates will abort on this validator, but
what makes it worse is that it might abort on one validator, but it commits on the other validator,
so it sort of pollutes the other validator, even if it aborts.
>>: It's almost as if you're turning a lot of transactions into very long-running transactions. It's
almost like that.
>> Bailu Ding: The long-running transaction will never hit the storage, or after 60 minutes -- 60
seconds, yes.
>>: What's the transaction rate, so we get a sense of 60 seconds is like how many transactions?
>> Bailu Ding: I see, I see. So I think in this case, we only run about 10,000 transactions per
second.
>>: So if you wait 60 seconds, then you're accumulating transactions or some such number.
>> Bailu Ding: Exactly, so this is why we even cache for -- I think 60 seconds is not too long, if
we want to wait for something to hit the storage in case any jitter happens. But even 60
seconds is not going to work in this case for spurious aborts. Even if we just cache for 10
seconds, which is fairly short, we still reach more than like a 10% abort rate,
even though the original data contention would cause only like 1% of aborts. So the other proposal is: what if we use an even shorter time to cache the updates, or do the
garbage collection more aggressively. The idea is that we reduce the expiration time, so that we
can garbage collect quickly after some time. But the risk here is that if we do garbage collection
too aggressively, we may suffer the problem of aborted transactions due to insufficient
information. For example, assuming we have three transactions, we first start with T2, which
updates X locally, passes validation and writes version 2 of X. Now we start T10, which
reads X with version 0 and enters validation; since T2 has updated X to version 2, there is a
conflict, and we abort T10 normally. But then we start another transaction, T15, which reads
version 2 of X, which means it reads the latest version of X and it should commit. But if we
garbage collect too aggressively -- for example, if we garbage collect everything before
timestamp 5 -- the update from T2 is gone, and in this case we cannot validate T15: the system
has no idea whether there were any updates between the version 2 that it read and timestamp 5,
so in order to ensure the correctness of the system, the validator
has to abort the transaction conservatively. So this means if we garbage collect too aggressively,
we are at the risk of aborting transactions due to insufficient information or insufficient history.
So our proposal is something in between those two extremes, so which we called a reactive
approach. So the idea is that we asynchronously propagate information about the completion of
transactions throughout the system. The benefit of this approach is that, firstly, it's
asynchronous. Next, it tolerates loss, which means even if we lose a message or the message
comes a little bit late in some cases, we are still good; we just need to tolerate a little bit of
inaccuracy here. And the third one is that it is fairly configurable, in the sense that we can
configure how frequently we propagate the information throughout the system, which gives
us some room in the tradeoff between accuracy and the cost of communication.
So let me give you some idea of what a watermark is. So the idea is that the watermark has the
same type as the timestamp in the system, and when we do a read from the storage, we associate
the read with a watermark. This watermark basically gives the guarantee we want in the garbage
collection case: when I get a watermark, I have the guarantee that all updates to this
record made by transactions before that watermark have been reflected in the read. What this
means is that if we got a watermark, we don't need to worry about transactions that updated the
record before this watermark. So, for example, assuming we have a transaction T20, it reads X
with version 10 and a watermark 15. This means that, of all the updates cached in the validator,
those from transactions with timestamps before 15 have been reflected in this read, so in the
validator we only need to worry about transactions that have a timestamp larger than 15, which
means we can basically age out the spurious updates before timestamp 15. So in this case, if we
try to validate the read on X, although we have a spurious update from T13 in the system, since
it is below 15, we don't need to worry about whether T13 will be conflicting with the transaction
or not; we only need to consider transactions with timestamps larger than 15. In this case, there
is no transaction with a timestamp larger than 15 updating X, so we can commit this transaction's
read on X. And as we can see, we age out the updates before timestamp 15, including the
spurious update from T13.
Yes.
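Here is a minimal sketch of how a validator might use the per-read watermark during its conflict check; it is illustrative only, with assumed data structures rather than the actual Centiman implementation:

    #include <map>
    #include <set>
    #include <string>

    // Sketch of validating one read using its watermark. The read carries the
    // version it observed and a watermark guaranteeing that all committed updates
    // with timestamps at or below the watermark are already reflected in the read.
    struct Read {
        std::string key;
        long version;    // e.g., X read at version 10 ...
        long watermark;  // ... with watermark 15
    };

    // cached_writes: for each key, the commit timestamps of writes cached in the
    // validator (possibly including spurious entries from aborted transactions).
    bool validate_read(const Read& r,
                       const std::map<std::string, std::set<long>>& cached_writes) {
        auto it = cached_writes.find(r.key);
        if (it == cached_writes.end()) return true;
        // Entries at or below the watermark are either already reflected in the
        // read or spurious, so they can be ignored (and aged out of the cache).
        // Any cached write above the watermark is a potentially missed update.
        return it->second.upper_bound(r.watermark) == it->second.end();
    }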
>>: But you have to know these transactions -- there's no transactions that can have better
version of earlier than 15?
>> Bailu Ding: So I think the idea is that it basically means we know the updates from transactions
before timestamp 15 should have hit the storage, and we should have read everything from the
storage whose timestamp is less than 15. Which means if T13 were not a spurious update, it
would have hit the storage and been reflected in the read; since it was not, we are sure -- we have
the guarantee -- that this update from T13 is a spurious update, and we don't need to consider it
anymore in the validation protocol. Does
that make sense to you?
>>: Well, I don't fully follow it, but why don't you go on? I'll think about it.
>>: So the condition is, given that transaction 20 read version 10 at watermark 15, it knows that
there will be no committed updates in the range 10 to 15 that arrive later.
>> Bailu Ding: Exactly, exactly. If it is still in the cache, then it must be a spurious update. So
in this case, we age out the -- eliminate the spurious update in the cache, when the watermark
bumps up.
>>: Is it there to address the spurious update issue?
>> Bailu Ding: Yes, yes, this is to address the spurious update issue. And we will see you can
also do something else with the watermark. Yes. Okay. So here, as I said, the watermark is
configurable, which means in two senses. The first sense is you can configure how frequently
you update the watermark. The second one is at what granularity you apply the watermark. Here, I
just give an example of how you implement the watermark at a node level. So the idea is that, as per our design principle, we don't want to create new connection channels. We don't want
to create new APIs for different components, so we just take advantage of the key value store in
the storage. What we do here is that each processor keeps track of a completion watermark
locally, which means all transactions whose timestamps are before the completion watermark
have been completed, meaning their writes are persistent in storage. So each of the processors
propagates this information to each of the storage nodes, and the storage nodes come up with a
completion watermark, which is the minimum of the completion watermarks across the
processors. Afterwards, when we do a read from the storage, we also read off the completion
watermark from each of the storage nodes, and we come up with the read watermark as the
minimum of all the completion watermarks from the storage nodes. So this is actually lazy
propagation of the information about which transactions have been completed before some kind
of threshold in the system. So this is a node-level implementation of the watermarks. You can
definitely implement this at a finer granularity, which means you can implement it at a shard
level, where you keep track of the completion watermark per shard, or, in the extreme case, per item.
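A tiny sketch of the node-level watermark arithmetic just described, with assumed, non-empty inputs: the storage-side completion watermark is the minimum over the processors' reports, and a read's watermark is the minimum over the storage nodes it touched.

    #include <algorithm>
    #include <vector>

    // Sketch of node-level watermark computation (inputs assumed to be non-empty).
    long storage_completion_watermark(const std::vector<long>& reported_by_processors) {
        return *std::min_element(reported_by_processors.begin(),
                                 reported_by_processors.end());
    }

    long read_watermark(const std::vector<long>& from_storage_nodes) {
        return *std::min_element(from_storage_nodes.begin(), from_storage_nodes.end());
    }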
So now we have come up with the final architecture of the system, where we have distributed
processors, we have the versioned key-value store, and we have sharded distributed validation,
and this makes the system fully distributed. So our implementation is about 20,000 lines of code
in C++, and for the storage part, we just implement an in-memory key-value store, which is
implemented with a hash map. We didn't use a realistic, real key-value store, because the
performance of key-value stores cannot keep up with the rate we want to target. And for the
processor part --
>>: Is it open source?
>> Bailu Ding: Yes, open source. I think the open-source one has about like 10,000 operations
per second, which is too low.
>>: Would you influence [indiscernible].
>> Bailu Ding: Oh, at that time, there was no LevelDB, but we tried something like
[indiscernible] and HBase and things like those.
>>: 10,000 operations per second at a given node?
>> Bailu Ding: Yes, at a given node, especially if we want to support versioning, yes. Okay.
So on the processor side, we implement the processor in a model where, instead of issuing one
thread per transaction, we actually sort of multiplex transactions, which means we have a message queue for
the processor, where it gets a message -- sorry, gets a request from a message queue, where it can
issue the read request or write request or validation request, depending on the stage of the
execution of the transactions. So in this sense, we can sustain a high concurrency level of
transactions per node, instead of issuing one thread per transaction, which is blocking. Also, for
the timestamp assignment, for the purpose of our experiment, because we use heterogeneous --
sorry, homogeneous hardware and a homogeneous workload, we assume that each of the nodes
assigns timestamps at roughly the same rate, and we assume the clocks between the different
nodes are roughly synchronized. Under
those two assumptions, we just assign the timestamp locally in each of the processors, where the
timestamp is something like the ID of the processor plus the ID of the transaction in the system.
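The exact encoding is not spelled out in the talk; one hypothetical way to combine a local counter with a processor ID into unique, roughly ordered timestamps is the following sketch.

    #include <cstdint>

    // Hypothetical local timestamp assignment: interleave a per-processor counter
    // with the processor ID so timestamps are globally unique and, assuming similar
    // rates and synchronized clocks, roughly ordered across processors.
    std::uint64_t next_timestamp(std::uint64_t local_counter,
                                 std::uint32_t processor_id,
                                 std::uint32_t num_processors) {
        return local_counter * num_processors + processor_id;
    }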
And we update the completion watermark at the node level every 10,000 transactions or so in
the system. We also tried updating the watermark at other frequencies, like every 1,000
transactions, but we found that updating the watermark every 10,000 transactions suffices for our
workload. For the validator, we divide the validation phase into smaller stages. For example,
one thread is dedicated to receiving the messages and putting the messages in order. One thread
is responsible for doing the real check on the conflicts of the items, and we have another thread
which does the wrap-up for the transaction and prepares the networking messages. And
because we are using -- because the timestamp is assigned locally in each of the processors, and
they can come out of order, we have a timeout strategy, where we wait for the transaction with
a certain timestamp to come, but if it comes too late, we just move on and reject the transaction.
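A rough sketch of the pipelined validator structure just described -- three stages connected by queues, one thread each. The queue and the stage contents here are placeholders and assumptions, not the actual implementation.

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>

    // Minimal blocking queue used to connect the pipeline stages.
    template <typename T>
    class BlockingQueue {
    public:
        void push(T item) {
            { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(item)); }
            cv_.notify_one();
        }
        T pop() {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [&] { return !q_.empty(); });
            T item = std::move(q_.front());
            q_.pop();
            return item;
        }
    private:
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<T> q_;
    };

    struct Request  { long timestamp; /* read/write sets omitted */ };
    struct Decision { long timestamp; bool commit; };

    int main() {
        BlockingQueue<Request> ordered;
        BlockingQueue<Decision> decided;

        std::thread order_stage([&] {   // stage 1: receive requests in timestamp order (fabricated here)
            for (long ts = 0; ts < 100; ++ts) ordered.push(Request{ts});
        });
        std::thread check_stage([&] {   // stage 2: conflict check (stubbed to always commit)
            for (int i = 0; i < 100; ++i) {
                Request r = ordered.pop();
                decided.push(Decision{r.timestamp, true});
            }
        });
        std::thread reply_stage([&] {   // stage 3: wrap up and send the decision (dropped here)
            for (int i = 0; i < 100; ++i) decided.pop();
        });
        order_stage.join(); check_stage.join(); reply_stage.join();
    }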
For the networking part, the pattern of networking can be very bad for performance in our
system -- for example, sending out a lot of very, very small messages in the
system, like we send out the decision or we send out just a request. So in order to optimize for
the throughput, we use a model where we just periodically send out networking messages, like
we periodically send out -- we send out a message like every 10 milliseconds. This definitely
increases the latency of the transactions, but it reduces the overhead from networking and gives
us better throughput in terms of networking. Okay. So we first ran an experiment on a
variant of the TPC-C benchmark, so the idea is that we want to test the scalability under a
workload that's similar to the TPC-C benchmark. So as we say, we batch the networking
messages so that we have latency in terms of tens of milliseconds. For the updating transactions,
we only run the updating transactions in the system: 50% New-Order transactions and
50% Payment transactions. In order to reduce conflicts, we did a bunch of modifications to the
benchmark. The first one is that we did vertical partitioning of the tables, so that updates to
different fields of a record don't conflict with each other. The second
part is that we relaxed the specification on the counter assignment: instead of assigning the
counter in increasing order, we just assign a unique ID for each of the transactions that requires a
counter. And the third one is that we replicate some of the hotspots, for example the year-to-date
amount for the [indiscernible], and these will not create any problems for the
updating transactions. But if you really want to run the whole benchmark and you want to run
the read-only transactions, like the stock-level transaction, you need to read the different replicas
and do some sort of aggregation for the read-only transactions. Okay, so for the deployment part -- yes?
>>: You're not exactly replicating them. You're kind of partitioning the value, in a sense, so the
real value is going to be the sum of all of them.
>> Bailu Ding: Yes. When I say replica, I just mean we have one per node, or something
like that. Okay, so for the deployment, we run the experiment on EC2 clusters, with 50 storage
nodes and 50 processor nodes. Okay, and sorry, I modified that in the new slides, but it's not one
warehouse per storage node. It's like the number of warehouses increases with the number of
storage nodes, but we shuffle the data randomly across the storage nodes, so we will not
take advantage of the locality of warehouse access here. And finally, we do about 200
concurrent transactions per processor in the system at max. So for the experiment, we set up
the 50 nodes, vary the number of validators, and increase the concurrency level on each of
the nodes until we reach a peak throughput for a certain number of validators. So here is the
result we have on the TPC-C variant, where we increased the throughput -- sorry, increased the
concurrency level. We scaled from one validation node to eight validation nodes,
where eight validation nodes are sufficient to saturate the storage and the 50 processor instances,
and we increased the concurrency level until we reached a peak throughput for each number of
nodes. So as we can see, we started with about 50,000 transactions per second, then got to about
80,000 transactions per second for two nodes, until we reached about 230,000 transactions
per second. So it's not exactly linear scalability, but as we increase the number of nodes, we
do see the throughput increase continuously. Yes.
>>: So you have the number of storage nodes and partition and all of that isn't changing in this
experiment. What's changing is the amount of parallel processing that you have going into
validation?
>> Bailu Ding: Yes, exactly, and the way we try to saturate the validation nodes is that we increase
the concurrency level per processor node, and then we get a max of about like 200 concurrent
transactions per node in this experiment.
>>: What's the abort rate with something like this?
>> Bailu Ding: Yes, so abort rate, because we have done a lot of engineering on reducing the
aborts, we actually got about 3% of aborts, mainly due to conflicts on the stock level -- sorry, on
the stock item table. And the abort rate is stable, because it's low, so it's stable in all the
configurations. Yes.
>>: How many nodes are there in the system?
>> Bailu Ding: So there are 50 nodes for storage and 50 -- and 50 nodes for processor.
>>: And how many operations per transaction?
>> Bailu Ding: On average, we have about 16 reads and 15 writes.
>>: And you have 30 operations?
>> Bailu Ding: Yes, 30 operations. It's the same profile as the TPC-C benchmark. Yes. Okay,
so I'm going to mention one more optimization we use the watermarks for. The idea is for read-only
transactions. In a normal optimistic concurrency control system, we can optimize read-only
transactions by running them in snapshot isolation, which means we run them against a snapshot
and we don't need to do validation for those read-only transactions. But we cannot do this in our
system, because, as we mentioned, we have
this kind of reading from an inconsistent snapshot problem from the version key value store. So
this is bad, because the read-only transaction will not create any problem for conflicts, but we
still need to send it for validation in a distributed manner. So instead, we're using a
watermarking, because the watermarking gives some extra information on the reads we have, so
we try to utilize the watermarks to optimize for read-only transactions. The idea is simple. So
assuming we have read a bunch of items from the storage -- in this case, we read X of version 3
with watermark 5, and read Y of version 4 with watermark 7. Because the watermark gives us
the guarantee that the read is still good up to the watermark, we know that the version of X we
read is good for snapshots 3 to 5, and the version of Y we read is good for snapshots 4 to 7. So
if this is the case, we just intersect the intervals from the different reads and get the intersection,
snapshots 4 to 5. This means we can basically run the read-only transaction as if we ran it
against either snapshot 4 or snapshot 5. And because this kind of intersection check is done
purely within the processor, locally, if we do the intersection and find that it is not empty, we
can bypass the validation phase. We don't need to send the transaction to validation anymore.
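The local check can be written as a simple interval intersection. This is a sketch under the assumption that each read returns a (version, watermark) pair as described; the names are illustrative.

    #include <algorithm>
    #include <limits>
    #include <vector>

    // Sketch of the local read-only check: a read of version v with watermark w is
    // valid for any snapshot in [v, w]; if all the reads' intervals intersect, the
    // transaction saw a consistent snapshot and can commit without validation.
    struct ReadInterval {
        long version;    // e.g., X read at version 3 ...
        long watermark;  // ... with watermark 5 -> interval [3, 5]
    };

    bool read_only_commits_locally(const std::vector<ReadInterval>& reads) {
        long lo = 0;
        long hi = std::numeric_limits<long>::max();
        for (const auto& r : reads) {
            lo = std::max(lo, r.version);
            hi = std::min(hi, r.watermark);
        }
        return lo <= hi;  // non-empty intersection: commit as of any snapshot in [lo, hi]
    }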
sum up, the validation workflow in the system is like this. So we firstly have a transaction, and
if it is an updating transaction, we just send it for distributed validation. However, if it is a readonly transaction, we first will try to check the intersection of the reads locally, and if it is a
success, we just commit the transaction. But if we cannot find the [indiscernible] intersection,
we send it to distributed validation to see whether it has some true conflicts or not. So if there is
no conflict, we'll commit those transactions, and after the validation, if there is conflict, we just
abort the transaction and restart the transaction again. So here, we give some idea -- let me just
jump ahead a little bit to the experiment on a TPC-W variant, where the transaction workload is
about 80% read-only transactions, and the profile of the transactions is fairly small: on average,
we just have like one to two items per transaction. So it's very low contention, and we scale the
size of the database. We again do the same experiment, and here the goal is to see how much
benefit the read-only optimization will bring us. In this case, when we don't run the read-only
optimization, we get up to about 1 million transactions per second. However, if we turn on the
read-only optimization, we get up to more than 4 million transactions per
second. This roughly corresponds to the 80% read-only workload, which basically means
most of the read-only transactions pass the local check
and don't need to do the validation distributively. Okay. Finally, I've still got a little bit of time
to go through the transaction batching work, just give you some idea of what's the main idea of
that. So first, as we say, the drawback of the OCC is that we can waste our resources when there
are conflicts for the transaction, because we need to restart the transaction. However, we
observe that the serialization order of OCC is decided only at the validation phase, so
this gives some flexibility in how to reorder the operations before they go to validation. So
we propose transaction batching, because the batching is used -- let's see, is used in the execution
of the transaction anyway, because we need to batch for networking messages, for example. So
our idea is that we sort of move the batching process forward, so instead of just batching at the
very low level, we also batch the executions of the transaction together in different stages, so that
the batching gives us a larger scope of what are the transactions, what are the operations we
have, so that we have more flexibility and chance to reorder the transactions to reduce the
conflicts. Okay, let me jump a few slides. So the first place where we can do some sort of
batching is at the storage, because the storage receives a bunch of read and write requests, and if
we assume the storage receives a read on X, a write on X with version 1, and a read on
Y. If we execute the requests in this order, the read on X will give us version 0, which will be
overwritten by the later write on X with version 1. So the transaction that issued the read on
X will eventually abort, because there is a conflict. However, if we put these requests in a
group, we know that we will later overwrite X with version 1, so we can first process all the
writes before processing all the reads in the group. So the reads in the group will have a fresher
view of the data, so that we can avoid conflicts on those reads. Yes,
exactly.
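A minimal sketch of this write-before-read batching at a storage node, with an assumed in-memory store; it is not the actual storage code.

    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Sketch of batching at the storage node: within one batch, apply all writes
    // first so that the reads in the same batch see the freshest versions.
    struct Request {
        bool is_write;
        std::string key;
        long version;        // for writes: the committed version being installed
        std::string value;   // for writes: the value; for reads: filled in below
    };

    using Store = std::unordered_map<std::string, std::pair<long, std::string>>;

    void process_batch(std::vector<Request>& batch, Store& store) {
        for (const auto& r : batch) {                          // pass 1: writes
            if (r.is_write && r.version > store[r.key].first) {
                store[r.key] = {r.version, r.value};
            }
        }
        for (auto& r : batch) {                                // pass 2: reads
            if (!r.is_write) {
                r.version = store[r.key].first;                // return the latest version
                r.value = store[r.key].second;
            }
        }
    }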
>>: How can you be sure that the pressure [indiscernible] reduces conflicts?
>> Bailu Ding: I think it's more like a best effort. So the optimal strategy we can use given a
scope, given our scope, on the operations is to first process all the writes. Does that make sense?
So it might be the case that --
>>: It's not clear to me that possibly all the reads might not get more commits.
>> Bailu Ding: Yes, yes. So the idea here is that in this case we know that some item will be
overwritten by some write request after the read request, it's the best strategy to process the write
first and then process the reads. Does it make sense?
>>: So you're sending me the freshest version at all times.
>> Bailu Ding: Yes, I try to read the latest version all the time.
>>: So you [indiscernible] older or new version.
>> Bailu Ding: Yes, yes, exactly. Yes.
>>: It's dependent on the abort rate being low, so that the write is ultimately --
>> Bailu Ding: Yes.
>>: The write is a committed write.
>> Bailu Ding: Yes, it's a committed write. It's doing optimistic concurrency control, so it is
already -- yes.
>>: So you're only writing after you know it's committed.
>> Bailu Ding: Yes, exactly, it's optimistic, so the write must be committed, right. Yes. Okay,
so the second place where we can do some sort of reordering is the validator. For example, say
we have two transactions, where the first transaction reads version 0 of Y and version 0 of X and
writes to X, and the second transaction reads X with version 0 but writes to Y. If we
validate the two transactions in this order, the first transaction gets committed and is assigned
timestamp 1, and it will later update X to version 1. This will conflict with the
second transaction, which read version 0 of X. However, if we have a batch
of transactions and we know that we have those two transactions in the batch, we can reorder
them so that we validate the second transaction first. So the second transaction gets
timestamp 1. It commits, and it will not cause the first transaction to abort, so the first
transaction can also commit in this case. In this case, instead of aborting one transaction, we
commit both transactions. So the idea of doing this validation batching is very straightforward,
but it turns out to be a hard graph problem if we model the problem properly. So in order
to represent the batch formally, we first create a dependency graph, or conflict
graph, between transactions, where the nodes in the graph are transactions and the edges in
the graph are the read-write dependencies. For example, if a transaction T1 writes to X and
transaction T2 later reads X, we create an edge from T2 to T1. This basically means that if we
want to commit both transactions, we must
commit T2 before T1. Otherwise, T2 gets aborted, because T1 has a conflicting write
with the item it has read. Okay. So if we create this kind of dependency graph, one thing we
notice is that if a node does not have any incoming edges, committing this transaction will not
cause any other transaction to abort. So in this case, if the
dependency graph is acyclic, we can repeatedly commit the transactions that have no incoming
edges. For example, in the graph on the left, we could commit the transactions in the order T3,
T2 and T1. However, when the graph becomes cyclic, we have a bit of a problem,
because we cannot commit any transaction without aborting some other transaction. So in this
case, what we want to do is choose a set of victims, where we
proactively abort those transactions so that the remaining graph is acyclic. So
in this case, if we choose to abort T1, then we can commit all the transactions
left in some order.
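Here is a small sketch of this batched reordering idea: build the dependency graph, repeatedly commit nodes with no incoming edges, and, when only cycles remain, greedily abort a high-degree victim. This is a simplified illustration, not the exact greedy algorithm from the work.

    #include <cstddef>
    #include <set>
    #include <vector>

    // Edge u -> v means transaction u must be validated/committed before v
    // (e.g., u read an item that v writes).
    struct Batch {
        int n;                               // transactions 0..n-1
        std::vector<std::set<int>> out;      // out[u]: edges u -> v
        std::vector<std::set<int>> in;       // in[v]:  edges u -> v
    };

    // Returns a commit order; proactively aborted victims are appended to `victims`.
    std::vector<int> reorder(Batch g, std::vector<int>& victims) {
        std::vector<int> order;
        std::set<int> remaining;
        for (int i = 0; i < g.n; ++i) remaining.insert(i);

        auto remove_node = [&](int u) {
            for (int v : g.out[u]) g.in[v].erase(u);
            for (int v : g.in[u])  g.out[v].erase(u);
            remaining.erase(u);
        };

        while (!remaining.empty()) {
            int next = -1;
            for (int u : remaining) {
                if (g.in[u].empty()) { next = u; break; }   // safe to commit now
            }
            if (next != -1) {
                order.push_back(next);
                remove_node(next);
            } else {
                // No node can commit safely: the remaining graph contains cycles,
                // so greedily abort a high-degree victim to break them.
                int victim = -1;
                std::size_t best = 0;
                for (int u : remaining) {
                    std::size_t deg = g.in[u].size() + g.out[u].size();
                    if (victim == -1 || deg > best) { victim = u; best = deg; }
                }
                victims.push_back(victim);
                remove_node(victim);
            }
        }
        return order;
    }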
Okay. So it turns out that if we want to minimize the number of aborts, which translates to
picking the least number of victims from the graph to make it acyclic, this is the minimum
feedback vertex set problem, which is among the first known NP-hard problems, and it is also
hard to approximate, which means we cannot find a constant-factor approximation algorithm in
polynomial time. So if this is the case, the best we can do is a greedy algorithm, which comes up
with something good enough for us. Fortunately, we could have something to simplify this
problem in our case. That is, we can control the size of a batch in the system, which basically
means we can control the number of nodes in the graph, which basically means we can control
the complexity of the graph in our case. So if this is the case, we propose a number of greedy
algorithms that work best for smaller graphs with fewer edges, which gives us, let's say, very
good performance. So given the time constraint, I guess I will skip the detailed
algorithm of how we do the greedy selection.
>>: [Indiscernible].
>> Bailu Ding: Then I'll just give you some idea of one of the algorithms that we propose. The
idea is based on the intuition that if a node is on a cycle, then the nodes on that cycle can all
reach each other, which means the cycle is contained in a strongly connected component of the
graph. So if this is the case, we can just partition the graph by the strongly connected
components, and for each of the components, we choose some victim in the component to try to
break the cycles. The idea is that we can choose the one with a high degree, which potentially
breaks a lot of cycles, and after we choose the victim, we recursively process
the graph on the rest of the remaining nodes until we reach a solution where there are no cycles
in the graph. For example, in this example, we start with 12 nodes, and we first
partition the graph by strongly connected components, and we are left with three components.
For those three components, we pick one node that breaks the cycles in the component. For
example, we take node 3, then node 6 and node 8, and finally, we come up with an acyclic
graph after removing the three nodes from the graph. We have some optimizations on this
graph after removing the three nodes in the graph. We have some optimizations on this
algorithm, because this algorithm, as you can see, also finding the strongly connected
components is linear, and we probably need to recall this procedure a couple of times during the
partition of the graph. So it basically makes this algorithm a little bit expensive. This is bad,
because when the graph algorithm is expensive, it actually increases the latency of the
transactions, which in turn increases the abort rate of the transactions. So we actually come up
with something cheaper than that, but the idea is sort of similar. Okay, so to conclude, in this
talk, I discussed two challenges in optimistic concurrency control. One is parallelism: how to
make the serial and centralized validation phase more efficient. The other one is how to handle
the high abort rate when the data contention is high.
So I first introduced Centiman, which is a loosely coupled and elastic distributed OCC system,
where we proposed the watermarks to reduce the spurious updates and to optimize read-only
transactions. And we can also use the watermarks to do elastic validation, where we can increase
the number of validators or shrink the number of validators in a very short time. And the second
part is that I introduced the transaction batching idea, where we use the batching and reordering
to increase the throughput of our system, and in our preliminary experiment, it increased the
throughput of the system by three times, and it also halves the latency of the transactions,
because transactions are less likely to be restarted. So there are a couple of things left in these
two pieces of work, and the pieces left actually share a common theme: how do we do
things dynamically, autonomously and adaptively. So for the Centiman work, one thing is
how we do automatic timestamp adjustment. For example, in our experiment, since we have
homogeneous hardware and a homogeneous workload, we assume we can assign timestamps
locally with synchronized clocks, but what if this is not the case? If this is not the case, we need
to have a way to adjust the rate at which we assign timestamps, and to adjust how the
timestamps from different processors should proceed relative to each other. So the second part is that when we do the
validation, distributed validation, what we can do is that we can increase the number of
validators or decrease the number of validators on the fly in a short time. But what we don't do is
do it automatically, which means we manually decide when to scale up and scale
down. What is left is that it would be nice to do this automatically, like monitoring the statistics of
the system and deciding what would be the best configuration of the system. The second part
for the transaction batching, it is the same. So one thing we noticed in our experiment is that, as
we said, doing the reordering in the validator is sort of costly, so it is not always beneficial to
turn on the validator reordering, especially when the data contention is
very low or the data contention is very high. So one thing I want to know is how we
dynamically enable the validator batching depending on the performance characteristics of the
workload. The other thing we noticed is that the size of a batch controls
how much flexibility we have for the reordering, but on the other hand, it also increases the latency of
the transactions. So we also want to find a good sweet spot for the size of the batch we want to
have, depending on the workload in the system. Yes, I think that's it. Yes?
>>: You said that when the -- was it when the conflict rate is low, batching isn't helpful.
Nothing matters.
>> Bailu Ding: Yes, exactly.
>>: And then you said when the conflict rate is high --
>> Bailu Ding: It's very high.
>>: It also doesn't matter, because?
>> Bailu Ding: Yes, so there are two reasons for that. The first reason is that when the abort
rate is very high, in the process of finding a good order, you actually spend more time,
because you need to do more partitioning and removing of nodes -- so one thing is you spend more
time finding a good order. The other thing is that when the contention is really, really high,
there are a lot of conflicts that cannot be resolved by reordering the operations, so you actually
get less out of the batching, as well. So those two factors actually make it sometimes worse than
not using batching in the case when contention is really, really high.
>>: So you're doing all this work to reorder, but it's futile, because you're not going to find an
order.
>>: You end up with a serial execution of your transactions anyway, I guess.
>>: Well, there's just too many conflicts. There's no way to pull out a moderate number of
transactions and still get the rest of them to commit.
>> Bailu Ding: Yes, especially when it's a read after write conflict, you are not going to do
anything better than just running one at a time. Yes. Okay.
>> Sudipto Das: Let's thank the speaker.
>> Bailu Ding: Thank you.