>> Sudipto Das: Hello, everyone. It's my pleasure to introduce Bailu Ding. She is a PhD student at Cornell, working with Johannes Gehrke. She has been spending the past two years here at UW, also collaborating with a bunch of UW folks. She was an intern with Phil and me, working on Hyder, and she did a great job during the internship and helped us build a prototype for Hyder. She has also done some great work on optimizing optimistic concurrency control for distributed transactions. That's her thesis work as well, which she's going to talk to us about in the next hour or so. Bailu? >> Bailu Ding: Thank you. Okay, good morning. So today I'm going to present my thesis work on optimizing optimistic concurrency control. As Sudipto just introduced, I'm a PhD student at Cornell, but right now I'm visiting the University of Washington in Seattle, in the database group there. I work with Magda and a bunch of students in her group as well. So transaction processing is sort of everywhere in the current age, like when you do any searching, when you do shopping, when you do Googling, Twittering and Facebooking. There are two major ways to manage transactions. One way is to manage transactions with a locking-based approach, which means when you try to access something, you first acquire a read or a write lock on the item you want to access, so that others won't interfere with you when they access their items. The other way is called optimistic concurrency control. The idea is that you don't acquire locks; you just optimistically assume that you are the only one who is executing transactions, but you do some verification afterwards to check whether your assumption is true or not. So there has been renewed interest in optimistic concurrency control in recent systems, mainly for two reasons. The first one is that, obviously, since you don't acquire locks, you don't need lock management, so the overhead of the transactions is lower. The other part is that one nice feature of optimistic concurrency control is that you don't have blocking behavior, which means that readers don't block other readers, and they also don't block other writers. So this is very desirable, especially for web applications. But before going into the details of what my thesis is about, let me first give you a brief review of how optimistic concurrency control works. So, for example, assume you start a transaction, T3. You first go through a read phase. In the read phase, what it does is it reads from the storage, it does some execution on the items, maybe it does a bunch of additional reads, and then it comes up with some writes. When it does these writes, it only updates the items in its local workspace, which is not visible to other transactions. Afterwards, after it finishes its execution, it continues to a validation phase, where it is checked against all the previously committed transactions to see whether they conflict with this transaction or not. If there is no conflict, this transaction continues to a write phase, where it stores the updates to the storage and makes them persistent and available for others to read. But if there is any conflict, the transaction aborts, and upon its abort, the system will restart the transaction to try to execute it another time.
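To make the three phases concrete, here is a minimal sketch of that lifecycle in Python. It only illustrates the read / validate / write flow described above; the names (Storage, run_occ, and the validate callback) are made up for this sketch and are not the talk's actual implementation, and the conflict check itself is passed in as a callback.

    # Sketch of the OCC lifecycle: read phase into a local workspace,
    # validation, then either a write phase or an abort-and-restart.
    class Storage:
        def __init__(self):
            self.data = {}                      # key -> value

        def read(self, key):
            return self.data.get(key)

        def write(self, key, value):
            self.data[key] = value

    def run_occ(storage, validate, transaction_logic, max_retries=10):
        for _ in range(max_retries):
            read_set, write_set = {}, {}        # local workspace, invisible to others

            def read(key):                      # read phase: read from storage once
                if key not in read_set:
                    read_set[key] = storage.read(key)
                return write_set.get(key, read_set[key])

            def write(key, value):              # buffer writes locally
                write_set[key] = value

            transaction_logic(read, write)      # execute against the local workspace

            if validate(read_set, write_set):   # validation phase: check for conflicts
                for key, value in write_set.items():
                    storage.write(key, value)   # write phase: make updates visible
                return True
            # conflict detected: abort and let the loop restart the transaction
        return False

    # usage with a trivial always-commit check, just to show the shape of the API:
    # run_occ(Storage(), lambda r, w: True, lambda read, write: write("x", 1))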
So the key idea of the validation phase of optimistic concurrency control is to compare the read set of a transaction with the write sets of previously committed transactions. For example, assume we have a transaction T0. It first enters the read phase, where it just has a local update to X in its workspace. Then it continues to the validation phase. Since there is no transaction prior to T0, it just commits trivially, and at the same time, a transaction T1 starts. T1 first enters the read phase, where it reads X from the storage, and then it continues to the validation phase. But since, when T1 started and read from the storage, the write from T0 had not been applied to the storage yet, the validation finds that T1 should have read a newer version of X, but it didn't, so there is a conflict. That's why T1 aborts. So now that we have some idea of what optimistic concurrency control is, let's look at two challenges in optimistic concurrency control. The first one is that, as we just saw, transactions go through the validation phase in a serial order, so essentially the validation is serial and centralized, so that we can validate the transactions one after another. This imposes some issues for the scalability of the protocol. The other challenge is that, as we just saw, a transaction can abort, and aborts in OCC are not a rare event: when there is a data conflict, the transaction aborts and gets restarted. So when the data contention increases, the chance of conflicts becomes higher, and a transaction can get aborted and restarted many times. This basically means the system wastes a lot of resources doing meaningless work, and in that case the performance of the system will suffer. So this is the other challenge, that is to say, how do we handle data contention to reduce conflicts. My thesis addresses both problems. The first part is how we scale the validation phase. We actually propose two kinds of scalability approaches from different aspects. One is to introduce pipeline parallelism into the validation phase. The main idea is that we chunk the validation phase into smaller stages and assign a different thread to work on each of the stages. In this sense, we can work on different stages of the validation in parallel. The other part is that we try to introduce, let's say, a vertical parallelism into the system, which means that we try to divide the work of the validation onto different instances, so each of the instances handles one part of the work. In this sense, we also parallelize the work and decentralize the validation. For the other challenge, we propose a way to reduce the conflicts caused by data contention by reordering. The main idea is that in optimistic concurrency control, the order of the transactions is not decided up front; it is only fixed at the validation phase. This gives us some flexibility in how we organize the execution of the operations in transactions. So we thought that if we manage this order of execution carefully, we could reorder the operations to reduce the conflicts between transactions. Due to the time constraint, I'll mainly talk about two projects in detail. One is the Centiman project, where we do the parallel validation. The other one is the transaction batching project, where we do operation batching and reordering to reduce the conflicts.
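To make the validation rule above concrete, here is a minimal sketch of the check that compares a transaction's read set against the write sets of previously committed transactions. The data layout (a map from key to the timestamp of its latest committed write) is an illustrative simplification, not the talk's code.

    # Backward validation: a transaction conflicts if some committed transaction
    # overwrote an item after this transaction read it.
    class Validator:
        def __init__(self):
            self.last_commit_ts = {}    # key -> timestamp of latest committed write
            self.next_ts = 0            # commit timestamps, assigned at validation

        def validate(self, read_set, write_set):
            """read_set maps key -> version (timestamp) that was read.
            Returns a commit timestamp on success, or None on conflict."""
            for key, version_read in read_set.items():
                if self.last_commit_ts.get(key, 0) > version_read:
                    # A committed transaction wrote this key after we read it,
                    # so we read a stale version: abort (like T1 in the example).
                    return None
            self.next_ts += 1           # no conflict: commit and record our writes
            for key in write_set:
                self.last_commit_ts[key] = self.next_ts
            return self.next_ts

    v = Validator()
    assert v.validate({}, {"x": "T0's update"}) == 1            # T0 commits, timestamp 1
    assert v.validate({"x": 0}, {"x": "T1's update"}) is None   # T1 read version 0: abort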
Okay, so next I will introduce the Centiman project and discuss how we distribute the validation phase of optimistic concurrency control. Okay. So when we designed the Centiman project, the guiding principle was that we want to design a system that has good modularity, which means we separate the processing and the storage of the system, so that it is easy to deploy this kind of system in the context of the cloud. The design [indiscernible] is that we want to have a simple design, and we also want to reduce the coupling between different components as much as possible. So recall the three phases of optimistic concurrency control, where we have the read phase, the validation phase and the write phase. If we want a distributed architecture for this protocol, it naturally translates to three components in a system. One component is the processor component, which corresponds to the read phase and the execution of the transactions. Another one is the storage component, which actually does the reads and writes of the items in the system. And finally, we have the validator component, which takes charge of the validation in the system. So what we want to do is make each of the components distributed. The first one that comes to mind is to make the processor component distributed. This one is pretty straightforward: we just add a number of processor instances and spread the traffic across the different processor instances, and we're done with that. Next, we want to make the storage distributed. The idea is that we plug in our favorite key value store and make it versioned. A versioned key value store means something like this: each of the writes comes with a key, a value and also a version, so if the version is newer than the existing version in the storage, we apply the update to the storage. If the version is older than the existing version in the storage, which means the update is out of date, we just discard or ignore this update, because it comes too late. So now, let me give you an example. >>: There's only one version of each item? >> Bailu Ding: So let me say it this way. >>: Timestamped version. >> Bailu Ding: Yeah, yeah, exactly. So let me say it this way: there are two ways you can do versioned storage. One is that you just store the latest version of the item in the storage, which basically is a single-version system. The other way is that you keep multiple versions in the storage, where you have a multi-version storage, where you can do things like snapshot isolation. In our case, we assume we just have a single-version storage, where you just keep the latest version of the item in the storage. Okay? So let me give you an example of how the versioned storage works. The storage starts with version 0 of X and version 0 of Y. Assume we have a transaction T1. It starts by reading a version of X from the storage, which is now version 0, and in its read phase, it also updates X in its local workspace. Next, the transaction T1 enters the validation phase. Since there is no prior transaction before it, it just commits trivially, and at this time, when it passes the validation, the validation component assigns a timestamp to this transaction; assume the timestamp is just 1.
So when transaction T1 enters the write phase, it just updates X to version 1, where version 1 is its timestamp. Similarly, when a transaction T2 starts, it reads Y from the storage, which is still version 0, and it updates Y in its local workspace. Then it enters the validation phase, passes the validation and gets timestamp 2. It then continues to update Y to version 2 in the storage. Now, assume we have a transaction T3 start. It first enters the read phase to read X from the storage, which has now been updated by transaction T1 to version 1, and then it does some update to X in its local workspace, and it enters the validation, passes the validation and gets timestamp 3, and finally updates X to version 3. So the first challenge that comes with the versioned storage is that we can have reads from an inconsistent snapshot of the storage. Recall that in the original optimistic concurrency control, we want to ensure that the validation phase and the write phase are in a critical section, so that the writes to the storage are atomic for a transaction, and we apply the writes in the order of the validation. But now that we have a distributed key value store, we don't have this nice property, so we suffer from two kinds of anomalies with the key value store. The first one is that, since the updates are not atomic, we could read partial updates from the storage. For example, assume we have a transaction T0 who writes to both X and Y, and then we have a transaction T1 who reads X and Y. Since the reads and writes are not issued atomically, we could read the update from T0 on X, but not the update on Y. This means we can read partial updates from the storage. The next issue is probably more subtle. It means we could actually read an inconsistent snapshot of the database. For example, assume we have a transaction T0 who writes to X and D, and then we have a transaction T1 who reads D in its read phase, where it reads the update from T0, and then in its write phase, it updates Y. So T1 is good, because we can serialize T1 after T0. But the problem is with T2, where we want to read both X and Y. Because of the order in which we issue the reads, we read the version of X from before T0 updated the storage, but we read the version of Y from after T1 updated the storage. This basically means we read the updates from T1 but not the updates from T0. Because we must serialize T1 after T0, this means we read from an inconsistent snapshot: a snapshot that contains the updates from T1 must contain the updates from T0, but we only read the updates from T1 and not T0, so we read from an inconsistent snapshot of the database. There are a couple of proposals that can solve this kind of inconsistency. One is that we can just do something like two-phase commit to apply the writes to the storage. But because we are building on top of a key value store, we don't really want to add more layers on top of the key value store or add more APIs to the key value store, so instead of doing all the heavy lifting in the storage, we actually push it to the validation phase, where the validator can guard against these kinds of anomalies, so we are not worried about doing atomic updates or reading a consistent snapshot from the storage. Okay, so now we are good with distributed storage, with a versioned key value store.
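Here is a minimal sketch of the versioned put rule described above, where a write carries a version and is dropped if it arrives after a newer version has already been applied. The names are illustrative; this is not the talk's storage code.

    # Single-version storage: keep only the latest version; ignore stale writes.
    class VersionedStore:
        def __init__(self):
            self.data = {}                            # key -> (version, value)

        def put(self, key, version, value):
            current_version, _ = self.data.get(key, (0, None))
            if version > current_version:
                self.data[key] = (version, value)     # newer write: apply it
            # else: the write is out of date, so discard it; it came too late

        def get(self, key):
            return self.data.get(key, (0, None))      # (version, value)

    store = VersionedStore()
    store.put("x", 1, "from T1")       # T1 writes version 1 of x
    store.put("x", 3, "from T3")       # T3 writes version 3 of x
    store.put("x", 2, "late write")    # a delayed older write is ignored
    assert store.get("x") == (3, "from T3")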
Now, what is left is to do distributed validation. The idea of doing distributed validation is actually fairly straightforward. We just partition the key space into different shards, and then we ask each of the validators to check conflicts on its own shard. So when we issue a transaction for validation, we first split the transaction based on the data it accesses. For example, if it accesses data on just one validator, we just send the whole transaction to that one validator. However, if a transaction accesses data on multiple validators, we split the transaction into smaller parts and send each of the parts to one validator in the system. After each of the validators checks the conflicts based on its own data and sends back its local decision, the processor acts as a coordinator: it receives all the decisions and makes the final decision for the transaction. Up to this point, we are fine, but it turns out this protocol does not work nicely in practice. As we will see, the abort rate of the transactions rises fairly quickly if we use this approach. So the problem here is divergent decisions. Since we might split a transaction across different validators, it might be the case that the different validators come up with different decisions. For example, assume we have a transaction T3, which writes to X and writes to Y. It sends the part for X to validator A, which holds the item X, and it sends the other part of the validation request to validator B, which holds the item Y. Assume validator A finds no problem with X, and it votes commit, but validator B finds something conflicting with Y, so it votes abort. Because our coordinator, our processor, is smart, after it collects the two decisions, it knows that because one of the validators is not comfortable with the update, it will abort the transaction. Now, T3 is fine, but what if we have a T4 come afterwards, where T4 wants to read X and write to X? Because validator A previously thinks the transaction T3 has committed, it will cache the updates from T3. Since T3 eventually aborts, its updates never make it to the storage, so T4 will not be able to read the updates from T3. Now, when T4 comes, validator A thinks T3 has committed, but T4 has not read its updates, so it votes to abort. What this means is that our transaction T4 is aborted due to a transaction T3 that never committed, so we call this kind of abort a spurious abort, because it's a false abort: T4 should commit. And we call this kind of update, which is left in the validator cache from aborted transactions, a spurious update. So how are we going to eliminate the spurious updates? The first proposal is a very proactive proposal, where we just ask the processor, after it collects the decisions from all the validators, to send back the final decision to all the validators, to notify them that the transaction has aborted and they should revoke any spurious updates left in their cache. However, if we do this synchronously, it actually slows down the system, because the feedback is on the critical path of each transaction. But if we do this asynchronously, which means after we get the decision, we asynchronously propagate the decision back to the validators, it adds complexity to the system. First, we need to implement the logic to propagate back the decision and revoke the updates.
The second part, which probably is the more troublesome one, is that we need to handle the case where we somehow lose messages in the system, which means the updates are not going to be revoked from the system, and that will create a lot of trouble and aborts for later transactions. Okay, so this is one proposal. On the other extreme, we can have a very lazy proposal. The idea is that we are accumulating the updates in the validator cache, but we're not going to accumulate things forever. At some point, we need to discard the old updates, or garbage collect the old updates. So what if we just count on the garbage collection process to remove those spurious updates from the validator cache? We did a simple experiment on a straightforward proposal for the garbage collection mechanism. The idea is that we just get some expectation of how long it will take for the writes of the transactions to hit the storage and become persistent, and after we wait long enough, we think it is safe to remove the updates from the validator cache. So we implemented this kind of garbage collection logic in the system, and we tried different periods, that is, how long we wait before we expire the updates in the cache. We did an experiment on different expiration times, from 10 seconds to a minute. What we observe here is that for all the different configurations, we first get a bit of a rise in the abort rate in the system in the first few minutes. This is when the system starts to accumulate the spurious updates in the validator cache, so we see the rise of the aborted transactions. But after a while, the system sort of reaches equilibrium, and the abort rate sort of plateaus. This means we finally stabilize at some abort rate, because we have cached all the spurious updates within some window and we start to expire things. As we can see from the figure, letting the updates stay in the cache for a minute before garbage collection is clearly not an option, because we get up to like 90% and 99% abort rate, which basically means the system is not making significant progress. But even if we just -- yes? >>: How big an abort rate do you have initially? >> Bailu Ding: Yes, I forgot to mention that. The initial abort rate is less than 1%, so it's a very low contention workload. >>: With a 100% spurious abort rate, presumably you're doubling that. >> Bailu Ding: No, so this is not a spurious abort rate. It's the total abort rate. Yes, this means like 99% of the transactions actually abort in the system, because we cached -- yeah. >>: So what is the workload here? >> Bailu Ding: Oh, okay, the workload here is the same workload. We have transactions with 10 items, like five of them are reads and five of them are writes, over 100,000 items in the system. >>: I don't understand how an abort rate of 1% can, with spurious aborts, suddenly become a 100% abort rate. >> Bailu Ding: So the idea is that you have a couple of validators, and some of them will cache the spurious updates. When you store things for 60 seconds, it basically means you cache the spurious updates for 60 seconds.
So during that time, all the transactions that read those items with spurious updates will abort on this validator, but what makes it worse is that a transaction might abort on one validator but commit on the other validator, so it sort of pollutes the other validator, even though it aborts. >>: It's almost as if you're turning a lot of transactions into very long-running transactions. It's almost like that. >> Bailu Ding: The long-running transaction will never hit the storage, or only after 60 seconds, yes. >>: What's the transaction rate, so we get a sense of 60 seconds is like how many transactions? >> Bailu Ding: I see, I see. So I think in this case, we only run about 10,000 transactions per second. >>: So if you wait 60 seconds, then you're accumulating that many transactions, or some such number. >> Bailu Ding: Exactly. This is why we even cache for -- I think 60 seconds is not too long, if we want to wait for something to hit the storage in case any jitter happens. But even 60 seconds is not going to work in this case because of spurious aborts. Even if we cache for just 10 seconds, which is fairly short, we still reach more than about 10% abort rate, even though the original data contention would cause only about 1% of aborts. So the other proposal is: what if we use an even shorter time to cache the updates, or do the garbage collection more aggressively? The idea is that we reduce the expiration time, so that we can garbage collect quickly. But the risk here is that if we do garbage collection too aggressively, we may suffer the problem of aborting transactions due to insufficient information. For example, assume we have three transactions. We first start with T2, which updates X locally, passes validation and writes version 2 of X. Now we start T10, which reads X with version 0 and enters validation. Because we see T2 has updated X to version 2, there is a conflict, and we abort T10 normally. But then we start another transaction, T15, which reads version 2 of X, which means it reads the latest version of X and it should commit. But if we garbage collect too aggressively -- for example, if we garbage collect anything before T5 -- the update from T2 is gone, and in this case, when we validate T15, the system has no idea whether there were any updates between the version 2 that it read and T5. So in order to ensure the correctness of the system, the validator has to abort the transaction conservatively. This means that if we garbage collect too aggressively, we are at the risk of aborting transactions due to insufficient information, or insufficient history. So our proposal is something in between those two extremes, which we call a reactive approach. The idea is that we asynchronously propagate information about the commits of transactions throughout the system. The benefits of this approach are that, first, it's asynchronous. Next, it's tolerant, which means even if we lose a message or the message comes a little bit late in some cases, we are still good; we just need to tolerate a little bit of inaccuracy. And the third one is that it is fairly configurable, in the sense that we can configure how frequently we propagate this information throughout the system, which gives us some space in the tradeoff between accuracy and the cost of the communication. So let me give you some idea of what a watermark is.
So the idea is that a watermark has the same type as a timestamp in the system, and when we do a read from the storage, we associate the read with a watermark. This watermark gives the guarantee we want in the garbage collection case, which means when I get a watermark, I have the guarantee that all updates to this record made by transactions before that watermark have been reflected in the value I read. What this means is that if we get a watermark, we don't need to worry about transactions that updated the record before this watermark. For example, assume we have a transaction T20, and it reads X with a version 10 and a watermark 15. This means that, of all the updates cached in the validator, all the transactions with timestamps before 15 have been reflected in this read, so in the validator, we only need to worry about transactions that have a timestamp larger than 15, which means we basically age out the spurious updates before T15. So in this case, if we try to validate the read on X, although we have a spurious update from T13 in the system, since it's below T15, we don't need to worry about whether T13 conflicts with the transaction or not. We only need to check transactions with timestamps larger than 15. In this case, there is no transaction with a timestamp larger than 15 updating X, so we can commit this transaction. And as we can see, we sort of age out the updates before T15, including the spurious update from T13. Yes. >>: But you have to know that there are no transactions that can have written a version earlier than 15? >> Bailu Ding: So I think the idea is it basically means we know the updates from transactions before T15 should have hit the storage, and we should have read anything from the storage whose timestamp is less than T15. So if T13 is not a spurious update, its write should have hit the storage, and if it did not, we are sure -- we have the guarantee -- that this update from T13 is a spurious update, and we don't need to consider it anymore in the validation protocol. Does that make sense to you? >>: Well, I don't fully follow it, but why don't you go on? I'll think about it. >>: So the condition is, given that transaction 20 read version 10 at watermark 15, it knows that there will be no committed updates in the range 10 to 15 that arrive later. >> Bailu Ding: Exactly, exactly. If there is one in the cache, then it must be a spurious update. So in this case, we age out -- eliminate -- the spurious updates in the cache when the watermark bumps up. >>: Is this there to address the spurious update issue? >> Bailu Ding: Yes, yes, this is to address the spurious update issue. And we will see you can also do something else with the watermark. Yes. Okay. So here, as I said, the watermark is configurable in two senses. The first is you can configure how frequently you update the watermark. The second is at what granularity you apply the watermark. Here, I just give an example of how you implement the watermark at the node level. Following our design principle, we don't want to create new connection channels and we don't want to create new APIs for different components, so we just take advantage of the key value store in the storage.
What we do here is that for each processor, we keep track of a completion watermark at the local processor, which means all transactions whose timestamp is before the completion watermark have been completed, that is, their writes are persistent in storage. Each of the processors propagates this information to each of the storage nodes, and each storage node comes up with a completion watermark, which is the minimum of the completion watermarks of the processors. Afterwards, when we do a read from the storage, we read the item and we also read off the completion watermark from each of the storage nodes, and we come up with the read watermark as the minimum of all the completion watermarks we saw. So this is a lazy propagation of the information about which transactions have been completed before some kind of threshold in the system. This is a node-level implementation of the watermarks. You can definitely implement this at a finer granularity, which means you can implement it at a shard level, where you keep track of the completion watermark per shard, or, in the extreme case, per item. So now we have come to the final architecture of the system, where we have distributed processors, we have the versioned key value store and we have sharded distributed validation, and this makes the system fully distributed. Our implementation is about 20,000 lines of code in C++, and for the storage part, we just implement an in-memory key value store with a hash map. We didn't use a real key value store, because the performance of those key value stores cannot keep up with the rate we want to target. And for the processor part -- >>: Is it open source? >> Bailu Ding: Yes, open source. I think the open-source one has about 10,000 operations per second, which is too low. >>: Would you influence [indiscernible]. >> Bailu Ding: Oh, at that time, there was no LevelDB, but we tried something like [indiscernible] and HBase and things like those. >>: 10,000 operations per second at a given node? >> Bailu Ding: Yes, at a given node, especially if we want to support versioning, yes. Okay. So on the processor side, instead of issuing one thread per transaction, we actually multiplex transactions, which means we have a message queue for the processor, where it gets a request from the message queue, and it can issue a read request, a write request or a validation request, depending on the stage of the execution of the transaction. In this sense, we can sustain a high concurrency level of transactions per node, instead of issuing one thread per transaction, which is blocking. Also, for the timestamp assignment, for the purpose of our experiments, because we use homogeneous hardware and a homogeneous workload, we assume that each of the nodes assigns timestamps at roughly the same rate, and we assume the clocks of the different nodes are roughly synchronized. Under those two assumptions, we just assign the timestamp locally in each of the processors, where the timestamp is something like the ID of the processor plus the ID of the transaction in the system. And we update the completion watermark at the node level every 10,000 transactions or so.
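A minimal sketch of the node-level watermark bookkeeping just described, and of how a validator can use a read's watermark to ignore cached updates below it. The function names and the flat lists of watermarks are illustrative simplifications, not the talk's implementation.

    # Each processor reports a completion watermark: all of its transactions with
    # timestamps at or below it have persisted their writes. A storage node keeps
    # the minimum over the processors, and a read carries the minimum over the
    # storage nodes it touched.
    def storage_completion_watermark(processor_watermarks):
        return min(processor_watermarks)

    def read_watermark(storage_node_watermarks):
        return min(storage_node_watermarks)

    def conflicts(version_read, watermark, cached_commit_timestamps):
        """Check one read against the validator's cached committed writes to the
        same key. Updates at or below the watermark are already reflected in the
        read (or are spurious), so only newer timestamps can conflict."""
        return any(ts > watermark and ts > version_read
                   for ts in cached_commit_timestamps)

    # The example from the talk: T20 reads x at version 10 with watermark 15.
    # A cached (possibly spurious) update from T13 sits below the watermark,
    # so it is ignored and the read validates.
    assert not conflicts(version_read=10, watermark=15, cached_commit_timestamps=[13])
    assert conflicts(version_read=10, watermark=15, cached_commit_timestamps=[17])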
We also tried updating the watermark more frequently, like every 1,000 transactions, but we find that updating the watermark every 10,000 transactions suffices for our workload. For the validator, we divide the validation phase into smaller stages. For example, one thread is dedicated to receiving the messages and putting the messages in order, one thread is responsible for doing the real check on the conflicts of the items, and we have another thread which does the wrap-up for the transaction and prepares the networking messages. And because the timestamps are assigned locally in each of the processors and they can arrive out of order, we have a timeout strategy, where we wait for the transaction with a certain timestamp to arrive, but if it comes too late, we just move on and reject that transaction. For the networking part, the messaging pattern is very bad for performance in our system; for example, we send out a lot of very, very small messages, like we send out a decision or we send out just a request. So in order to optimize for throughput, we use a model where we just periodically send out batched networking messages, like we send out a message every 10 milliseconds. This definitely increases the latency of the transactions, but it reduces the overhead from networking and gives us better throughput in terms of networking. Okay. So we first ran an experiment on a variant of the TPC-C benchmark. The idea is that we want to test the scalability under a workload that's similar to the TPC-C benchmark. As we said, we batch the networking messages, so we have latency on the order of tens of milliseconds. We only run the updating transactions in the system, that is, 50% new order transactions and 50% payment transactions. In order to reduce conflicts, we made a bunch of modifications to the benchmark. The first one is that we did vertical partitioning of the tables, in the sense that we don't want updates to different fields of a record to conflict with each other. The second part is that we relaxed the specification on the counter assignment. Instead of assigning the counter in increasing order, we just assign a unique ID for each of the transactions that requires a counter. And the third one is that we replicate some of the hotspots, for example, the year-to-date amount for the [indiscernible], and these will not create any problems for the updating transactions. But if you really want to run the whole benchmark and you want to run the read-only transactions, like the stock level transaction, you need to read the different replicas and do some sort of aggregation for those read-only transactions. Okay, so for the deployment part -- yes? >>: You're not exactly replicating them. You're kind of partitioning the value, in a sense, so the real value is going to be the sum of all of them. >> Bailu Ding: Yes. When I say replica, I just mean we have one sum per node, or something like that. Okay, so for the deployment, we run the experiment on EC2 clusters, with 50 storage nodes and 50 processor nodes. And sorry, I modified that in the new slides, but it's not one warehouse per storage node. The number of warehouses increases with the number of storage nodes, but we shuffle the data randomly across the storage nodes, so we do not take advantage of the locality of warehouse access here.
And finally, we run about 200 concurrent transactions per processor in the system at most. So in the experiment, we set up the 50 nodes, fix a number of validators, and increase the concurrency level on each of the nodes until we reach a peak throughput for that number of validators. Here is the result we have on the TPC-C variant, where we increased the concurrency level. We scaled from one validator node to eight validator nodes, where eight validator nodes are sufficient to saturate the 50 processor instances, and we increased the concurrency level until we reached a peak throughput for each number of nodes. As we can see, we started with about 50,000 transactions per second, then got to about 80,000 transactions per second with two nodes, until we reached about 230,000 transactions per second. So it's not exactly linear scalability, but as we increase the number of nodes, we do see the throughput increase continuously. Yes. >>: So the number of storage nodes and the partitioning and all of that isn't changing in this experiment. What's changing is the amount of parallel processing that you have going into validation? >> Bailu Ding: Yes, exactly, and the way we try to saturate the validation nodes is that we increase the concurrency level per processor node, and we get to a max of about 200 concurrent transactions per node in this experiment. >>: What's the abort rate with something like this? >> Bailu Ding: Yes, so for the abort rate, because we have done a lot of engineering on reducing the aborts, we actually get about 3% aborts, mainly due to conflicts on the stock level -- sorry, on the stock item table. And the abort rate is stable, because it's low, so it's stable in all the configurations. Yes. >>: How many nodes are there in the system? >> Bailu Ding: So there are 50 nodes for storage and 50 nodes for the processors. >>: And how many operations per transaction? >> Bailu Ding: On average, we have about 16 reads and 15 writes. >>: And you have 30 operations? >> Bailu Ding: Yes, about 30 operations. It's the same profile as the TPC-C benchmark. Yes. Okay, so I'm going to mention one more optimization we use the watermarks for. The idea is for read-only transactions. In a normal optimistic concurrency control system, we can optimize read-only transactions by running them under snapshot isolation, which means we run them on a snapshot and we don't need to do validation for those read-only transactions. But we cannot do this in our system, because, as we mentioned, we have this problem of reading from an inconsistent snapshot with the versioned key value store. So this is bad, because a read-only transaction will not create any conflicts, but we still need to send it for validation in a distributed manner. Instead, because the watermark gives us some extra information about the reads we have, we try to utilize the watermarks to optimize read-only transactions. The idea is simple. Assume we have read a bunch of items from the storage -- in this case, we read X of version 3 with watermark 5, and read Y of version 4 with watermark 7. Because the watermark tells us the read is still good until watermark 5, we know that the version of X we read is good for any snapshot from 3 to 5, and the version of Y is good for any snapshot from 4 to 7 (a small sketch of this check follows).
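A minimal sketch of this local check for read-only transactions: each read is usable for any snapshot between the version it returned and its watermark, so the transaction can commit locally exactly when those intervals intersect. Illustrative code only, not the talk's implementation.

    def read_only_check(reads):
        """reads: list of (version, watermark) pairs, one per item read.
        Returns True if some snapshot is consistent with all the reads,
        in which case the read-only transaction can skip distributed validation."""
        low = max(version for version, _ in reads)        # earliest usable snapshot
        high = min(watermark for _, watermark in reads)   # latest usable snapshot
        return low <= high

    # The talk's example: x read at version 3 with watermark 5, y at version 4
    # with watermark 7. The intersection is snapshots 4 to 5, so it commits locally.
    assert read_only_check([(3, 5), (4, 7)])
    assert not read_only_check([(3, 5), (6, 9)])   # no common snapshot: validate remotely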
So if this is the case, we just intersect the intervals from the different reads, and the intersection here is snapshot 4 to snapshot 5. This means we can basically run the read-only transaction as if we ran it on either snapshot 4 or snapshot 5. Because this kind of check is done purely within the processor, locally, if we can do this intersection and find that the intersection is not empty, we can bypass the validation phase. We don't need to send the transaction for validation anymore. So to sum up, the validation workflow in the system is like this. We first have a transaction, and if it is an updating transaction, we just send it for distributed validation. However, if it is a read-only transaction, we first try to check the intersection of the reads locally, and if that succeeds, we just commit the transaction. But if we cannot find a [indiscernible] intersection, we send it to distributed validation to see whether it has some true conflicts or not. If there is no conflict, we commit the transaction, and if after the validation there is a conflict, we just abort the transaction and restart it again. So here, let me just jump a little bit to the experiment on TPC-H, where the transaction workload is about 80% read-only transactions, and the profile of the transactions is fairly small: on average, we just have one to two items per transaction. So it's very low contention, and we scale the size of the database. We again do the same experiment, and here the goal is to see how much benefit the read-only optimization brings us. In this case, when we don't run the read-only optimization, we get up to about 1 million transactions per second. However, if we turn on the read-only optimization, we get up to more than 4 million transactions per second. This corresponds to the 80% read-only workload, which basically means most of the read-only transactions pass the local check and don't need to do the validation distributively. Okay. Finally, I've still got a little bit of time to go through the transaction batching work, to just give you some idea of the main idea of that. So first, as we said, the drawback of OCC is that we can waste resources when there are conflicts, because we need to restart the transactions. However, we observe that the serialization order in OCC is not fixed prior to the validation phase, so this gives some flexibility in how to reorder the operations before they go to validation. So we propose transaction batching, because batching is used in the execution of the transactions anyway; we need to batch the networking messages, for example. Our idea is that we move the batching process forward, so instead of just batching at the very low level, we also batch the execution of the transactions together at different stages, so that the batching gives us a larger scope of what transactions and what operations we have, so that we have more flexibility and more chances to reorder to reduce the conflicts. Okay, let me jump a few slides.
So the first place where we can do some sort of batching is at the storage, because the storage receives a bunch of read and write requests. Assume the storage receives a read on X, a write on X with version 1, and a read on Y. If we execute the requests in this order, the read on X will return version 0, which will be overwritten by the later write on X with version 1, so the transaction that issued the read on X will eventually abort, because there is a conflict. However, if we put the first few requests in a group, we know that we will later overwrite X with version 1, so we can first process all the writes before processing all the reads in the group. The reads in the group will then have a fresher view of the data, so we can avoid conflicts but still read. Yes, exactly. >>: How can you be sure that the [indiscernible] reduces conflicts? >> Bailu Ding: I think it's more like a best effort. So the optimal strategy we can use, given a scope on the operations, is to first process all the writes. Does that make sense? So it might be the case that -- >>: It's not clear to me that processing all the reads first might not get more commits. >> Bailu Ding: Yes, yes. So the idea here is that in this case we know that some item will be overwritten by some write request after the read request, so it's the best strategy to process the writes first and then process the reads. Does it make sense? >>: So you're sending me the freshest version at all times. >> Bailu Ding: Yes, I try to read the latest version all the time. >>: So you [indiscernible] older or new version. >> Bailu Ding: Yes, yes, exactly. Yes. >>: It's dependent on the abort rate being low, so that the write is ultimately -- >> Bailu Ding: Yes. >>: The write is a committed write. >> Bailu Ding: Yes, it's a committed write. We are doing optimistic concurrency control, so it is already -- yes. >>: So you're only writing after you know it's committed. >> Bailu Ding: Yes, exactly. It's optimistic, so the write must be committed, right. Yes. Okay, so the second place where we can do some sort of reordering is the validator. For example, say we have two transactions, where the first transaction reads version 0 of X and writes X, and the second transaction reads version 0 of X but writes Y. If we validate the two transactions in this order, the first transaction gets committed and is assigned timestamp 1, and it will later update X to version 1. This will conflict with the second transaction, which read version 0 of X, so the second transaction aborts. However, if we have a batch of transactions, we know that we have those two transactions together, and we can reorder them so that we validate the second transaction first. The second transaction then gets timestamp 1 and commits, and it will not cause the first transaction to abort, so the first transaction can also commit. In this case, instead of aborting one transaction, we commit both transactions. So the idea of doing this validation batching is very straightforward, but it turns out to be a hard graph problem if we model the problem properly. In order to represent the problem in a formal manner, we first create a dependency graph, or conflict graph, between transactions, where the nodes in the graph are transactions and the edges in the graph are the read-write dependencies.
For example, if a transaction T1 writes to X and a transaction T2 has read X (without seeing T1's write), we create an edge from T2 to T1. This basically means that if we want to commit both transactions, we must commit T2 before T1; otherwise, T2 gets aborted, because T1's write conflicts with the item T2 has read. Okay. So if we create this kind of dependency graph, one thing we notice is that if a node in the graph has no in-degree, then committing this transaction will not cause any other transaction to abort. So in this case, if the dependency graph is acyclic, we can repeatedly commit the transactions that have no incoming edges. For example, in the graph on the left, we could commit the transactions in the order of T3, T2 and T1. However, when the graph becomes cyclic, we have a bit of a problem, because we cannot commit any transaction without aborting some other transaction. So in this case, what we want to do is choose a set of victim transactions, which we proactively abort, so that we are left with a graph that is acyclic. In this case, if we choose to abort T1, then we can commit all the remaining transactions in some order. Okay. So it turns out that if we want to minimize the number of aborts, which translates to picking the smallest number of victims from the graph to make it acyclic, this is the minimum feedback vertex set problem, which is among the first list of NP-hard problems, and it is also hard to approximate, which means we cannot find a constant-ratio approximation algorithm in polynomial time. If this is the case, the best we can do is a greedy algorithm, which comes up with something good enough for us. Fortunately, we have something that simplifies this problem in our case: we can control the size of a batch in the system, which basically means we can control the number of nodes in the graph, which means we can control the complexity of the graph in our case. So we propose a number of greedy algorithms that work best for smaller graphs with fewer edges, which give us, let's say, very good performance. I guess given the time constraint, I will skip the detailed algorithms. >>: [Indiscernible]. >> Bailu Ding: Then I'll just give you some idea of one of the algorithms that we propose. The idea is based on the intuition that if a node is in a cycle, then the nodes on the cycle can reach each other, which means the cycle is contained in a strongly connected component of the graph. If this is the case, we can just partition the graph into its strongly connected components, and for each of the components, we choose some victim in the component to try to break the cycles. The idea is that we can choose a node with a lot of edges, which potentially breaks a lot of cycles, and after we choose the victim, we recursively process the graph on the rest of the remaining nodes until we reach a solution where there are no cycles in the graph. For example, in this example, we start with 12 nodes, and we first partition the graph by strongly connected components, and we are left with three components. For each of those three components, we pick one node that breaks the cycles in the component, as sketched below.
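Here is a minimal sketch of the strongly-connected-component-based greedy victim selection just described. The graph encoding (a dict from transaction to its set of successors) and the degree-based victim choice are illustrative simplifications of the idea, not the talk's actual implementation.

    # Greedy victim selection: repeatedly find strongly connected components,
    # abort the highest-degree node in each component that still contains a
    # cycle, and stop when the remaining conflict graph is acyclic.
    def strongly_connected_components(graph):
        """Tarjan's algorithm; graph maps each node to a set of successors."""
        index, low, on_stack, stack, sccs = {}, {}, set(), [], []

        def visit(v, counter=[0]):
            index[v] = low[v] = counter[0]
            counter[0] += 1
            stack.append(v)
            on_stack.add(v)
            for w in graph.get(v, ()):
                if w not in index:
                    visit(w)
                    low[v] = min(low[v], low[w])
                elif w in on_stack:
                    low[v] = min(low[v], index[w])
            if low[v] == index[v]:                  # v is the root of a component
                component = set()
                while True:
                    w = stack.pop()
                    on_stack.discard(w)
                    component.add(w)
                    if w == v:
                        break
                sccs.append(component)

        for v in graph:
            if v not in index:
                visit(v)
        return sccs

    def choose_victims(graph):
        """Return a set of transactions to abort so the rest can commit in some order."""
        victims = set()
        while True:
            cyclic = [c for c in strongly_connected_components(graph)
                      if len(c) > 1 or any(v in graph.get(v, ()) for v in c)]
            if not cyclic:
                return victims
            for component in cyclic:
                # Greedily abort the node with the most edges inside its component.
                victim = max(component,
                             key=lambda v: len(graph.get(v, set()) & component))
                victims.add(victim)
            graph = {v: {w for w in ws if w not in victims}
                     for v, ws in graph.items() if v not in victims}

    # Example: T1 -> T2 -> T3 -> T1 form a cycle; aborting any one of them suffices.
    assert len(choose_victims({1: {2}, 2: {3}, 3: {1}})) == 1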
So in the example, we take node 3, then node 6 and node 8, and finally we are left with an acyclic graph after removing those three nodes from the graph. We have some optimizations on this algorithm, because, as you can see, although finding the strongly connected components is linear time, we probably need to call this procedure a couple of times while partitioning the graph. That makes this algorithm a little bit expensive. This is bad, because when the graph algorithm is expensive, it actually increases the latency of the transactions, which in turn increases the abort rate of the transactions. So we actually came up with something cheaper than that, but the idea is similar. Okay, so to conclude, in this talk I discussed two challenges in optimistic concurrency control. One is parallelism: how to make the serial and centralized validation phase more efficient. The other one is how to handle the high abort rate when the data contention is high. I first introduced Centiman, which is a loosely coupled and elastic distributed OCC system, where we proposed the watermarks to reduce the spurious updates and to optimize read-only transactions. We can also use the watermarks to do elastic validation, where we can increase or shrink the number of validators in a very short time. The second part is that I introduced the transaction batching idea, where we use batching and reordering to increase the throughput of the system. In our preliminary experiments, it increased the throughput of the system by three times, and it also halves the latency of the transactions, because the transactions are less likely to be restarted. So there are a couple of things left in these two pieces of work, and what is left actually has a common theme: how do we do things dynamically, autonomously and adaptively. For the Centiman work, one thing is how we do automatic timestamp adjustment. For example, in our experiments, since we have homogeneous hardware and a homogeneous workload, we assume we can assign timestamps locally with synchronized clocks, but what if this is not the case? If this is not the case, we need a way to adjust the rate at which we assign timestamps, and how the timestamps from different processors should proceed relative to each other. The second part is that when we do the distributed validation, what we can do is increase or decrease the number of validators on the fly in a short time. But what we don't do is do it automatically, which means we manually decide when to scale up and scale down. What is left is that it would be nice to do this automatically, to monitor the statistics of the system and decide what would be the best configuration of the system. It's the same for the transaction batching. One thing we noticed in our experiments is that, as we said, doing the reordering at the validator is somewhat costly, so it is not always beneficial to turn on the validator reordering, especially when the data contention is very low or when the data contention is very high. So one thing we want to know is how to dynamically enable the validator batching depending on the performance characteristics of the workload.
So the other thing is that, as we noted, the size of a batch controls how much flexibility we have for reordering, but on the other hand, a larger batch also increases the latency of the transactions. So we also want to find a good sweet spot for the size of the batch, depending on the workload in the system. Yes, I think that's it. Yes? >>: You said that when the conflict rate is low, batching isn't helpful. Nothing matters. >> Bailu Ding: Yes, exactly. >>: And then you said when the conflict rate is high -- >> Bailu Ding: Very high. >>: It also doesn't matter, because? >> Bailu Ding: Yes, so there are two reasons for that. The first reason is that when the abort rate is very high, you actually spend more time in the process of finding a good order, because you need to do more partitioning and remove more nodes, so the first thing is you spend more time finding a good order. The other thing is that when the contention is really, really high, there are a lot of conflicts that cannot be resolved by reordering the operations, so you actually get less out of the batching as well. Those two factors make it sometimes worse than not using batching when contention is really, really high. >>: So you're doing all this work to reorder, but it's futile, because you're not going to find an order. >>: You end up with a serial execution of your transactions anyway, I guess. >>: Well, there are just too many conflicts. There's no way to pull out a moderate number of transactions and still get the rest of them to commit. >> Bailu Ding: Yes, especially when it's a read-after-write conflict, you are not going to do anything better than just running one at a time. Yes. Okay. >> Sudipto Das: Let's thank the speaker. >> Bailu Ding: Thank you.