>> David Lomet: So it's my pleasure today to introduce Alan Fekete to you all. I think Alan actually knows everybody in the room, so in some sense he needs no introduction at all. He's from the University of Sydney, and he has a long history of being involved with doing research in transaction processing. Over the last several years he's focused his efforts on exploiting snapshot isolation in multiple ways and has done some very nice work, including the SIGMOD Best Paper Award in, what was it, 2008 on how to use the versions that you have from snapshot isolation to provide serializable transactions as well. So Alan continues to work in this area, and the talk today is on performance aspects of that sort of work. So I'm really happy to welcome Alan here. He's been a longtime friend and colleague, and I'm sure the talk will be very interesting. Thanks. >> Alan Fekete: Thanks very much, Dave. So I should just set the context. Today's talk is purely an experimental talk. We don't have any new algorithms, we don't have any new proposals. What we have is some measurements. And so the idea is just to look at some of the algorithms that have been published, including the one of our own that Dave mentioned, but also a number of others, and see what happens when you actually run them. So just a quick bit of marketing. I'm on sabbatical at the moment at Berkeley, and I noticed that company talks always start with "we are hiring." We're a university so we can't say that, but at least we can advertise that we exist and that there is database research at the University of Sydney, and should you have any interns or whatever who at any stage are looking to come out for a period, that would be a wonderful thing. >>: Notice the sunshine in that picture. >> Alan Fekete: Yes. >>: For those of you who don't recognize it anymore. >> Alan Fekete: So this work is in the context of OLTP-style activities. So we're looking at programs where there is a significant amount of modification going on of the data, not just analytic workloads, and the standard ACID properties that people want, particularly here focusing on isolated and the different things that isolated might mean. I wasn't sure about the audience; with this audience, I think I can skip the concept of what concurrency control is. Quickly, back to serializability, which is sort of the academic approach to capturing what isolated might mean. There's this nice definition that one covers in a database course, and, you know, Phil in particular has spent years developing people's intuitions about this. The key thing here is that the importance of serializability to me is fundamentally that it gives you this preservation of integrity constraints even when the database doesn't know about the integrity constraints. It's easy enough to have a database that will maintain a declared integrity constraint, but if there's an integrity constraint which is complicated to express, the nice thing about serializability is that that integrity constraint will be maintained even without the platform knowing about it, provided you write each program to be sensible on its own. So it turns a global problem into a local problem at each program. And there's the standard theory that you can do this by looking for the absence of cycles in a graph showing the conflicts. So that's the academic situation.
The reality is of course that a lot of the programs out there don't run with serializability as -- I don't know, does anybody know if there is any platform at all for which the default is serializable? I'm not aware of one. All the ones I am aware of give you something like read committed as the default. And a lot of people are running their transactions without realizing that they need to do something. So there's a lot of weaker isolation out there -- >>: According to the [inaudible] paper, they run serializability. >> Alan Fekete: Right. Okay. So that will be -- >>: I don't know that they really do. >> Alan Fekete: Yes. >>: [inaudible]. >> Alan Fekete: Yes. So, anyway, weak isolation is something that's a practical reality. And given the people in this audience, I don't think I need to say too much about snapshot isolation. It's been successful in practice. Oracle have, I think, done everybody a substantial disservice by confusing the terminology. So snapshot isolation is an isolation level which is not serializable, but it is a very valuable isolation level in many ways. It has plenty of good properties. You don't get the typical anomalies. You don't block readers. So that's very helpful for a lot of applications. One of the things I should say about snapshot isolation that I found in teaching: many students, when you explain serializability, the intuition that they build up is actually snapshot rather than serializability. The way students develop a sense of what isolated ought to mean, I have found, is that when you have concurrent transactions they should be isolated from each other, that neither should see the other, and of course that's what snapshot isolation gives you, rather than "there is a serial ordering of them," which is the academic definition. So clearly snapshot isolation hit a very sweet spot in the space. >>: What you're calling the academic definition was actually developed at IBM. >> Alan Fekete: It's the one that is taught academically. So thanks to Phil for pointing out that it's actually the industry definition, but not all of the industry at the present time at least. And it's the one that is covered in the textbooks. Maybe I should say the textbook definition. So snapshot isolation does not, in fact, give you the textbook property of serializability in every situation. You can get this anomaly of write skew, where two transactions don't see each other and violate integrity constraints because each of them is modifying different data. It doesn't happen in most of the benchmarks that are out there, but you can design programs that make it happen. >>: There's a reason why it doesn't happen in the benchmarks: Oracle was involved in negotiating what the benchmarks were. >>: Is that really true, though? >>: Oracle wasn't going to participate in TPC-C because they were asked to run serializable, and they could only support true serializable with table-level locks. And they negotiated, and the TPC Council thought it was very important to have Oracle in the process, and so the idea was you don't have to support serializability, you have to run serializable for the benchmark. >>: So in that case, if that's true, then TPC-C somehow violates serializability under SI? >>: No. It runs serializable even though you use SI. >> Alan Fekete: Yeah. So the TPC-C benchmark gives serializable executions on SI. And -- >>: The point is that there was some earlier definition of the benchmark from [inaudible].
>>: I don't know which one was the first one, but after TPC-C insisted on serializable execution, Oracle threatened not to participate unless they could do it without table-level locks. >> Alan Fekete: I wasn't there. People here are much closer to the action. I'll defer to them. So I hope I don't need to walk through the example. This is one I liked because you can actually get a sense of how important it could be to have integrity constraints be protected. This is one where you're running a hospital and you want an integrity constraint which says for every day there's at least one doctor on duty. And by having programs concurrently take doctors off duty, what you can end up with is a final state in which nobody is on duty. And so that seems a very real-world impact from the loss of textbook serializability from running snapshot isolation. One of the most important things, though, again, which I find is not intuitive: for most low levels of isolation, like read committed, the normal way people think about it is to look at your program and decide whether your program needs serializability or whether your program could run read committed. Snapshot isolation is not like that. You can't look at the program and see whether it has some feature that is a problem under snapshot isolation. It's not a property of each program. It's a property of the collection of programs as a whole. It's purely from interactions. And it's even more complicated. You can have three programs which give serializable executions under snapshot isolation; you introduce a new program, and then you need to alter what's being done in one of the previous set of programs because of the new program. So in order to reason about this, we need some form of serializability theory that deals with multiple versions. There have been a number out there. The papers I have worked on typically use the version of the theory from Raz, which is slightly simplified from the most general definitions. But it's easy to use and easy to write down in a paper. In particular, for anything like snapshot isolation, you have a version order that's given to you. The most general definitions, the ones Phil and I worked on a lot, involved quantifying over version orders in the theory. In our case we just take the commit order as the version order, because that's what snapshot isolation does. And so we operate that way. So you can draw your conflict graphs for multiversion histories, and you get the standard result that if there are no cycles in the graph, that tells you that your execution is serializable. The important thing to notice is it's the read-to-write conflicts which are particularly critical in snapshot isolation. They're the ones where you have a transaction which reads a version and does not see the effect of another transaction which is writing the same item but is producing a later version than the one that the first transaction reads. Already Atul Adya's work, now almost 13 years ago, established the importance of these conflicts for weaker levels of isolation. So, anyway, using that we draw a serialization graph, and we distinguish the edges which come from a read-to-write conflict and are between transactions which are actually concurrent with one another. Those edges are the ones that we're going to pay particular attention to in the diagram, so I'm going to generally show them as a dashed edge.
That's what's called the antidependency. So the theory which I developed, extending stuff that Atul Adya had done, but a theory for snapshot isolation, is that the crucial thing to look for is a situation where you have two of these vulnerable edges in a row in a cycle. So it's not just that you have a cycle, but you have a cycle with two of these antidependencies between concurrent transactions successively. And the theorem -- this is work I did with Pat and Betty O'Neil in 2005 -- is that that's the critical thing to look for when understanding whether you've got serializable executions or not in a snapshot isolation system. So following on from that, Dave already mentioned work principally by Michael Cahill, who was at that stage a Ph.D. student of ours at Sydney, has since graduated, and is currently working, I believe, for WiredTiger. He proposed slightly changing the concurrency control algorithm so that it is very similar to snapshot isolation, has the same nice properties of snapshot isolation, but guarantees that the executions are serializable according to the textbook definition. The basic idea is very simple: we track when you get these vulnerable edges, when you have a read-to-write dependency between concurrent transactions, and we look for cases where you have two of those in a row, and whenever you have two of those in a row, you abort. That should ensure that you have serializable executions, because nonserializable executions would have to have two of those in a row, and if you abort one of the transactions, that throws it away. But it should be a lot cheaper than traditional optimistic concurrency control, where you essentially prevent all the conflicts from happening. So you should have to do a lot less. So Michael proposed mechanisms -- yes. >>: [inaudible] could you show how this -- like on this example of the doctors on duty, this is exactly what happens [inaudible]? >> Alan Fekete: Sure. So with this situation here, what you have is -- transaction one is reading both these records and writing this one. Transaction two is reading both these records and writing this one. So transaction one has an antidependency to transaction two, because transaction one has read the Smith row and did not see the change made by transaction two; transaction two produces a version which transaction one did not see. So that is an edge from T1 to T2 of the read before a write. If we look at the Jones row, that's a case where transaction two reads it and does not see transaction one's write. So that gives you a dashed edge. So what we have is a cycle of two dashed edges, one after the other, and they're the whole cycle. There are no others. Okay? So that's what's happening here. >>: So your theory proved that this is a necessary and sufficient condition [inaudible]? >> Alan Fekete: Okay. So [inaudible] no. What we show is only one direction: if you do not have that structure, then you get serializability. Sorry, let me say it the other way: if you have a failure of conflict serializability, then you have that structure in a cycle. Now, it is still possible that even if it's not conflict serializable, it is serializable in, say, the view serializable sense, because of some of those writes being blind writes or whatever. So conflict serializability is actually an approximation to view serializability.
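To make the Smith/Jones walkthrough above concrete, here is a minimal Python sketch of write skew under snapshot isolation. The SnapshotTxn class and the data are invented for illustration; a real SI engine would also apply first-committer-wins, which does not fire here because the two write sets are disjoint.

    # Each transaction reads from a private snapshot taken at start
    # and buffers its writes until commit, as under snapshot isolation.
    class SnapshotTxn:
        def __init__(self, db):
            self.snapshot = dict(db)   # reads see the state at txn start
            self.writes = {}

        def take_off_duty(self, doctor):
            # The program is "sensible on its own": it only removes a
            # doctor if its snapshot shows at least two on duty.
            on_duty = [d for d, s in self.snapshot.items() if s == 'on']
            if len(on_duty) >= 2:
                self.writes[doctor] = 'off'

        def commit(self, db):
            db.update(self.writes)

    db = {'Smith': 'on', 'Jones': 'on'}
    t1, t2 = SnapshotTxn(db), SnapshotTxn(db)   # concurrent: same snapshot
    t1.take_off_duty('Smith')   # T1 reads both rows, writes Smith
    t2.take_off_duty('Jones')   # T2 reads both rows, writes Jones
    t1.commit(db)
    t2.commit(db)
    print(db)   # {'Smith': 'off', 'Jones': 'off'} -- nobody is on duty

Each transaction preserves the constraint against its own snapshot, yet the combined result violates it; this is exactly the cycle of two dashed edges in the walkthrough above.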
But for our purposes, our goal is to ensure serializable executions. It's okay if we occasionally abort unnecessarily. I mean, there's a performance issue and you have to worry about how much that is. Okay. So essentially Michael's approach -- the basic idea is look for these two edges in a row and abort when you see that. Pragmatically, how Michael suggested to do that was: every transaction has two flags, one flag which says there is some edge coming in, and one flag which indicates there's an edge going out. Every time you see one of these read-to-write dependencies, you set the out flag in the first transaction and the in flag in the second transaction, and whenever you see a transaction with both its flags set, you abort it. It's not quite that simple, because sometimes you discover that both the flags are set in a transaction which has in fact already committed. Basically that transaction is committed, and you're now detecting another dependency from another transaction. So at the time you discover it, you have to abort a different transaction, but you still abort one of the transactions in the cycle and you still protect things. How does Michael propose to actually detect these situations? So the simplest case is somebody did a write, you're now reading the item, but you do not see their write because you're operating on a snapshot. That case is particularly easy to detect, because when you do the read, the algorithm for finding the correct version for you to return goes down the list of versions, sees this one, sees that its timestamp is greater than what you're trying to read, and skips over it. So the moment you skip over it, you know here is a case where an edge has been produced. The other case is where you do a read and then later a concurrent transaction comes along and produces a version. Michael proposes to detect that by sticking in the lock manager a special lock mode to record that the read happened. And when this write happens, it's not blocked by that lock, but it knows that there is an edge, and so it goes and puts the appropriate flags into the transactions. And one of the things is you actually have to keep these locks around longer than transaction commit time. You've got to keep them basically until every transaction that was active when this transaction finished has itself finished. So you have to keep them a little bit longer. But they don't block anything, so they're just sitting there in the lock table. So Michael's approach does sometimes abort unnecessarily. It's a conservative approximation. It, for example, doesn't check whether you've got two edges in a cycle. As soon as it sees two edges, it aborts. And sometimes that means that you're aborting things which will never show up in a cycle. Michael implemented this in InnoDB, the back end for MySQL. It was very, very simple. He ended up adding 230 lines of code to the storage manager, most of which were simply to track these flags and keep them around and then garbage collect them as necessary. So we have the implementation of Michael's algorithm that he did.
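Here is a minimal sketch of the two-flag mechanism just described, under simplifying assumptions. The names are invented; the real algorithm also handles the case where the flagged transaction has already committed, keeps the read locks past commit, and is wired into the version store and lock manager, none of which is modeled here.

    class Txn:
        def __init__(self, tid):
            self.tid = tid
            self.in_flag = False    # some rw-antidependency edge comes in
            self.out_flag = False   # some rw-antidependency edge goes out
            self.aborted = False

    def on_rw_antidependency(reader, writer):
        # Record the vulnerable edge reader -> writer (reader read a
        # version that does not reflect writer's concurrent write) and
        # abort any transaction that now has both flags set.
        reader.out_flag = True
        writer.in_flag = True
        for t in (reader, writer):
            if t.in_flag and t.out_flag and not t.aborted:
                t.aborted = True    # two vulnerable edges meet here
                return t
        return None

    # T1 -> T2 -> T3: T2 ends up with both flags set and is the victim.
    t1, t2, t3 = Txn(1), Txn(2), Txn(3)
    on_rw_antidependency(t1, t2)            # nobody has both flags yet
    victim = on_rw_antidependency(t2, t3)   # T2 now has in and out
    print(victim.tid)                       # 2

Note how the Boolean flags lose information: once t2's in_flag is set, nothing records that it came from t1, which is the dangling-flag problem discussed below.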
And what they proposed they call PSSI, for precise serializable snapshot isolation. Essentially it is a very old algorithm, serialization graph testing. They keep the serialization graph and check it for cycles, and if there's a cycle, they abort. Okay. So that was their proposal, and they implemented it in InnoDB. And for their paper they did an evaluation where they compared it to an algorithm which they and we refer to as ESSI. It's basically the same idea as Michael Cahill's, and they use it as a proxy for Michael's algorithm: namely, you just look for two edges and don't bother with checking for a full cycle. But their implementation is actually very different in the code, and we'll discuss that. So they had these two, and Michael Cahill had his, all done in InnoDB, so we thought this would be an interesting place to actually do an experimental comparison and see how these different algorithms play out. Let's look at this. There are basically several different implementation design decisions in these systems. The first in some sense goes to the heart of what Michael was trying to do. Michael tried to make his algorithm very lightweight, not intrusive on the InnoDB code. And so Michael tracks things very lightly: basically two flags for each transaction. The alternative is you keep this global serialization graph, which is a big data structure that everybody shares. Then there is the one which is in some sense the algorithmic rather than the implementation issue: how accurate are you going to be in deciding when to abort something? The more approximate you are, the less close to exact, the more unnecessary aborts, so there could well be a performance impact from that. And then the other one is again an implementation issue: when you do the checking. And so I'll go through each of these in turn. So, as I said, in Michael's algorithm you have two flags for each transaction, one which says do you have a dependency in, one which says do you have a dependency out. The benefit of this is that you really have very local information. Each transaction just keeps its own flags. When there's a conflict, you have to update one other transaction's flags, but there isn't a global structure. The downside is that those flags are just Boolean: is there an edge in? So if you spotted an edge in and then the transaction at the other end aborts, you don't have any way of knowing which transaction that came from and cleaning it out. You leave the flag there, and that can cause you to abort unnecessarily. PSSI, of course, tracks the full dependency graph. And Revilak's implementation that he used as a proxy for Michael's, ESSI, also keeps that whole dependency graph and tracks everything at the level of which edge there is: which transaction you are depending on, or which transaction there's an edge out to. The downside is that you have this huge data structure that everybody's sharing. For PSSI you can be incredibly precise, but even for ESSI, you're much more accurate than just with a Boolean flag of was there an edge. Then: do you look for a whole cycle, or do you just look for two edges in a row? Here PSSI goes for cycle detection, whereas both SSI and Revilak's proxy ESSI abort just when there are two edges.
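As a contrast with the two-edge test, here is a toy sketch of the PSSI-style approach: serialization graph testing with a cycle check at commit. This is illustrative only; the names are invented, and real PSSI builds its edges from the lock table and version store rather than being handed them.

    from collections import defaultdict

    class SerializationGraph:
        # Global dependency graph shared by all transactions.
        def __init__(self):
            self.out_edges = defaultdict(set)   # txn -> txns it points to

        def add_edge(self, src, dst):
            self.out_edges[src].add(dst)

        def in_cycle(self, start):
            # Iterative DFS: is there a path from start back to start?
            stack, seen = list(self.out_edges[start]), set()
            while stack:
                node = stack.pop()
                if node == start:
                    return True
                if node not in seen:
                    seen.add(node)
                    stack.extend(self.out_edges[node])
            return False

    g = SerializationGraph()
    g.add_edge(1, 2)       # T1 has an antidependency to T2
    g.add_edge(2, 1)       # and T2 has one back to T1
    print(g.in_cycle(1))   # True: the committing transaction must abort

The two-edge tests (SSI and ESSI) would abort in some situations where this cycle check would not, which is where their false positives come from.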
And obviously if you don't bother checking whether your edges are in a cycle, that allows false positives, and unnecessary aborts will occur. And then the other one, again, is an implementation issue. Michael Cahill's code basically does this on each operation. When you do a read, you see if you're skipping over a version and, if so, you set the flags at that point. When you do a write, you see if there is a conflicting but not blocking lock indicating a previous read and, if so, you set the flags. And every time you go in and set flags, you check whether somebody has two flags set and, if so, you abort. So Michael's algorithm is doing work on every operation. PSSI and Revilak's proxy for Michael's algorithm, ESSI, both do no work at all during read and write -- no work other than what is happening anyway. They do everything at commit time. So when the transaction goes to commit, you go into the lock table and see what other transactions had set locks, you build the extra edges into your conflict graph at that point, you do a cycle detection, and you abort if necessary. So here we have the potential advantage with Michael's that you might detect things earlier. On the other hand, that also means that you're doing that work frequently. And if you have many, many operations, it may well end up being more of a hit than just doing it once at commit time, where you can get it all in a single hit. So at least on the surface it isn't clear which of these will perform better. There are definite advantages and disadvantages each way and complicated tradeoffs. So we thought there would be an interesting experimental evaluation to compare them. So that's what we did. Maybe we should start with what we had. We got the code from Steve Revilak that they used in their system, which was in a relatively recent version of MySQL with InnoDB at the back. Michael Cahill's system had been implemented in a somewhat older version of InnoDB; Michael Cahill's work was in 2008, whereas Revilak's is 2011, so we've got several years of progress in between. So we took Michael's code, and taking advantage of the fact that we had Michael around to consult, ported it into the same release that Revilak had used. So we had a common code base with Revilak's implementation and an adaptation of Michael Cahill's. So we had Michael's SSI, and Michael had also implemented pure SI as a comparison, so we took his port of that as well. This is work that was done in collaboration between people at Sydney University and Seoul National. At Seoul National they have plenty of nice equipment, and so we took great advantage of that. So we've got a 24-core machine: four dies of six cores each ("6 corers" on the slide is probably the spell checker at work). The structure is core-private L1 and L2 caches and a per-die L3 cache. We followed both Michael Cahill's experiments and Revilak's experiments and set it up to do something which is almost group commit. It's not proper group commit, because MySQL doesn't have proper group commit, but you set it so that the log gets flushed only periodically. So it stops the log flushing becoming the bottleneck, which otherwise it would tend to be very early in these experiments. >>: Quick question. Does generally this support [inaudible] in your program? >> Alan Fekete: Okay. So InnoDB itself is a complicated system. It is a multiversion system. Its serializable level, however, is two-phase locking based.
So it doesn't use the versions if you're running with serializable. The versions are used for its concept of read committed, which is not the usual read committed; it's some very bizarre different thing. So Michael implemented snapshot isolation using the versions that were there for their read committed and the code to read a particular version. So snapshot isolation is not there out of the box. What we have is Michael's code for snapshot isolation, but it's a very minimal change. I mean, the nice thing about InnoDB, the reason Michael used it, is it already had both the versions and the lock manager. Unlike Postgres, which had the versions but no lock manager at the time, and other systems which had a lock manager and no versions, InnoDB gave both the pieces that he needed. So, as I said, his whole code, doing both serializable SI and SI itself, was 230 lines of code. And then the clients are running on a separate machine, which is tossing requests at this. So we use a microbenchmark. It's basically three tables. The transactions read one of the tables and write a different one, arranged so that cycles can come up. Here's one experiment. So, remember, our goal was to see how these different algorithms compare. So the third one here is Michael Cahill's, then the Revilak proxy for it is here, and the precise, cycle-detecting one is at the end. And the first bar is always just snapshot isolation as a baseline. So sometimes the Revilak ones do better, sometimes Michael Cahill's does better. There doesn't seem to be much in it. The difference from snapshot isolation is not that great, though it does seem that these are suffering. And so probably, I'd say, the conclusion of the comparison is: it depends, but there's not much in it. Perhaps the most interesting thing is that, in fact, sometimes even Revilak's proxy for Michael's algorithm does better than the precise one. But really I think if you were looking at this you would say it's not important enough to worry about. One other thing that we did: the first experiment I showed you was a case where it's 75 percent read-only transactions and 25 percent updating transactions. If we vary the proportion of read-only transactions to updating transactions, we do see a significant difference. But, again, what you're seeing is not much difference in the performance between the algorithms at any particular workload. Though with a very high rate of read-only transactions, we do see that snapshot isolation is pulling well ahead, and then you're paying more for serializability, whereas with the cases where you have a more limited amount of read-only, the costs are in the range of 10, 15 percent, which is what people, both Revilak and Michael Cahill's paper, report. >>: So 100 percent [inaudible] there's no writes. >> Alan Fekete: That's correct. >>: So SI has like 20 percent or maybe more than the -- >> Alan Fekete: Yes. >>: -- SSI [inaudible] so it's just flush. >> Alan Fekete: Yes. Okay. So things are getting interesting. And indeed the most interesting figure is what happens if -- >>: Things are coming into focus. >> Alan Fekete: Yes. So previously I showed you a graph up to here. Let's have a look at allowing the MPL to keep going. >>: What is MPL? >> Alan Fekete: Multiprogramming level.
The number of concurrent clients that are throwing requests at the database. So this is going back to the workload which is 75 percent read-only, 25 percent update, which is the one we regard as in some sense the most natural one to look at. And this is what we found. And basically the rest of my talk will be not about the differences between these three algorithms, but in fact about this much more noticeable issue, which is that the performance of all of these algorithms collapsed catastrophically as the multiprogramming level went up, as you try to get concurrency into the system. Okay? People expect to reach a bottleneck and plateau, but a collapse is a sign of some worry in the engineering of the system, I would have said. Yes. >>: How big [inaudible] how many reads and how many writes? >> Alan Fekete: So these transactions are I think the ones which did 20 writes and a hundred reads each, of pretty small records. We did experiments where we made the transactions bigger by an order of magnitude. It changes the numbers but not the phenomenon of the collapse. >>: 100 reads and 20 writes, those are [inaudible] transactions? >> Alan Fekete: Yeah. But, I mean, not that big. >>: [inaudible]. >> Alan Fekete: Yeah. So we're seeing a distinct collapse. And that's really what this talk is about: that collapse and our attempts to understand and explore it. So if we look at the runtimes of the transactions, what we see is that the collapse goes along with transactions taking much, much longer. The read-only transactions of snapshot isolation don't go too badly, though they do get a bit longer. The updating transactions get a lot longer. But with the serializable versions they really slow down, and you're seeing a very substantial penalty showing up in the runtime there. We also looked at abort rates. And while at MPL 30 you're beginning to get aborts, the performance collapses already at 20, and you're not getting a lot of aborts at MPL 20. So it's not from lots of aborts and retries. It really is that the transactions are running much longer. Some profiling. So this one is set up so that for each algorithm we show what's happening as the MPL increases. So this is snapshot isolation, this is Revilak's proxy, Michael's, and Revilak's precise version. And what we're seeing is the huge amount of time in holding the mutex. So the -- >>: The mutex on what, the lock table? >> Alan Fekete: It's a mutex on the entire database kernel. InnoDB has a mutex on the database kernel. It did. There is a release candidate -- it's not yet, I think, the official current version of InnoDB -- where InnoDB is moving towards a situation where it's refactored to try to divide the mutex. But at this stage it is a single kernel mutex. >>: [inaudible] in the InnoDB runtime? [inaudible] InnoDB internal data structures for one shared mutex? >> Alan Fekete: Yes. >>: So is the 24-core machine essentially bottlenecked? >> Alan Fekete: Yes. So, I mean, it's not every operation -- it's only if you have to modify the structures. But, yes, this has basically been the problem. So essentially you're taking these mutexes -- it's the latch to avoid concurrent modification of shared data structures. What's even worse is this is what's happening with a read-only workload. Okay? So there is no logical contention here.
You're still getting, however -- if you're running with these algorithms, you are still getting the various locks. You're recording things. You're not blocking anybody, but you are recording them, or you're doing a check at commit time of this graph. It will not have any edges in it, but you still have to go and do the check. And we're seeing this performance collapse. With a pure read-only workload, snapshot isolation is managing to keep going, but the other ones are all collapsing appallingly. I'm sorry for the blurring of this; something seems to have happened when I took stuff out, so I'll interpret this figure so you can see. These are the same sort of graphs, MPL from 1 to 30. The bars are in the same order. This is what happens when we run on four of the cores. So the platform has the capacity to essentially make some of the cores invisible; you can say use only certain cores. So if you run on four cores, you get this. If you run on 12 of the cores, you get the collapse already. And fundamentally, running on 24 cores you do worse than if there was only one core there. That's the basic story. >>: [inaudible] magnify the performance? >> Alan Fekete: So we did experiment with whether the cores were on the same die or otherwise. There are some differences, but it's not the dominant thing. So whether you're sharing the level 3 cache or not is not the big effect. There's a small effect in that for PSSI and ESSI sometimes it works out that if you're on a single die, the whole serialization graph fits into the cache on that die. So that can give you a bit of advantage. But it's not a strong effect. The collapse is not from that. So this is one where we've actually plotted it out, from 1 core to 24 cores. And you see that at MPL 30 your performance is as bad on 24 cores as it is on one core. And already at 12 cores you've got a fairly substantial collapse. Four cores you do pretty well on. And it's interesting: the earlier papers of both Michael Cahill's and Revilak's were typically looking at machines with two cores or thereabouts, but 24 cores is really causing problems. >>: So is the reason that [inaudible] database mutex to mark the dependencies? >> Alan Fekete: Yes. All of these algorithms are taking that database mutex in one place or another, and we'll now drill into where those places are. This again shows the experiments on different numbers of cores. At MPL 10 you don't get a dramatic increase in the amount of time you spend in the mutex. But if you look at MPL 30, you're getting this huge effect. Even once you get to 12 cores, you're getting lots and lots of time spent with the kernel mutex. So let's have a look at where the mutex gets taken and why. So with Michael Cahill's code -- Michael Cahill went to a lot of trouble to have it so that each transaction has transaction-local flags. But you still end up scanning the list of transactions to check the other transactions. And that is a place where you end up taking the kernel mutex. So you do that on every read and write. You don't do much inside that, so you take it only for a short time, but you do it often. With Revilak's algorithm you do nothing while the transaction is running, but when you go into commit, you take the kernel mutex and you then have this huge operation which updates the global dependency graph and runs a cycle detection over it. So both of them have problems.
What we do see is that the ones which operate on the global dependency graph tend to collapse earlier and more severely. But Michael Cahill's still collapses, even though he went to all the work of trying to have transaction-local flags, just because of the problem of finding the transactions in the transaction table. Okay? So the problem, as has been alluded to, is the fact that you have a kernel mutex which everybody was using to protect everything. And that's where the collapse came from. Just as an interesting thing, we decided also to profile the Postgres serializable snapshot isolation. This came about after we'd done the previous work. Postgres introduced their implementation, which was based on Michael Cahill's original ideas, but they implemented it in Postgres. And it's not just a research prototype as ours were; this is now the production code of Postgres. Postgres is now using Michael Cahill's algorithm with some further refinement. This was a VLDB Industrial paper last year. So they have structures where they don't have a single global mutex for the kernel. They have things separated out, and they have a separate latching structure for the different little bits of the shared data. So in a sense it's similar to Michael Cahill's algorithm. However, it's a bit more accurate. Firstly, they don't just set Boolean flags; they actually track which transaction is at the other end. So you get fewer cases of a dangling flag left by a transaction that aborted. They also have some additional optimizations, including some for analytics, which don't matter for this. So we also took their code and measured that. For this we actually used the TPC-C++ benchmark, which is one that Michael Cahill invented: it's basically TPC-C, and you add one more transaction to the mix so that you actually have a situation where snapshot isolation wouldn't work correctly. This is done on a 32-core machine. And what we see here is there is still a performance drop-off, but it happens much, much later and it's not nearly as bad. So this is MPL 128. You're down by a third from the peak at about MPL 32, but you haven't collapsed catastrophically. Though it's still the case that with high MPL you are definitely paying for serializability compared to snapshot isolation, which PostgreSQL still has available: if you ask for repeatable read, you get snapshot there, and if you ask for serializable, you get serializable snapshot. So this shows that with serializable snapshot isolation, the performance collapse is not something intrinsic to the algorithm; it comes very much from the implementation details and how you share the mutexes. But even here, with some careful work -- and the Postgres people did measure it on machines with various numbers of cores -- as you get to 32 cores and higher multiprogramming levels, you are really still hurting in performance. I guess the conclusion I would give -- and here you guys are way ahead of us -- is the great importance of thinking about the implementation details, and particularly the latching that goes on, in order to get decent performance on the high-end servers that are coming. And so really thinking about what's shared and how you protect what's shared becomes vital.
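A small illustration of the latching difference just described, with invented names (this is not InnoDB or Postgres code; it only shows why partitioned latches scale where a single kernel mutex does not):

    import threading

    # InnoDB-style: one kernel mutex protects everything, so every read
    # that must record a lock serializes all cores through one lock.
    kernel_mutex = threading.Lock()
    lock_table = {}

    def record_read_global(txn_id, item):
        with kernel_mutex:   # all cores queue here, even for reads
            lock_table.setdefault(item, set()).add(txn_id)

    # Postgres-style alternative: partition the shared state and latch
    # each partition separately, so unrelated reads rarely collide.
    N_PARTITIONS = 16
    latches = [threading.Lock() for _ in range(N_PARTITIONS)]
    partitions = [{} for _ in range(N_PARTITIONS)]

    def record_read_partitioned(txn_id, item):
        p = hash(item) % N_PARTITIONS
        with latches[p]:     # only same-partition readers queue
            partitions[p].setdefault(item, set()).add(txn_id)

Under a read-only workload, the first version still funnels every core through kernel_mutex, which matches the collapse seen in the profiles; the second spreads that traffic across independent latches.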
We agree with all the previous papers that the serializable algorithms don't have much penalty compared to snapshot isolation when there's one core or two cores, but at the moment the algorithms are still not taking good advantage of lots of cores. And what we were suggesting was not so much algorithmic rethinking of the concurrency control but implementation rethinking to avoid contention over latches. And that's where I end, and I guess I will hear more about the way you've gone all around this and managed to do stuff with both concurrency control changes and implementation ones. >> David Lomet: Thank you very much. [applause]. >>: I presume that some of these papers, past papers on serializable snapshot isolation, had something to say about the performance drop-off from snapshot isolation, perhaps simulating it or doing it in some more controlled setting. I mean, all concurrency control algorithms ultimately become a bottleneck under sufficiently high contention. Can you give us a sense of what would be expected if that were the only effect? >> Alan Fekete: So they -- >>: We may actually just see it on those graphs. >> Alan Fekete: Okay. Yeah. So I don't have it. Actually, let me just go in. I may have some of Michael's slides among the hidden stuff. So let me just see if I've got some of those. >>: [inaudible]. >> Alan Fekete: Okay. So, anyway, maybe let me go back and summarize. What Michael found in his experiments was there were lots of situations where the concurrency control wasn't the bottleneck. So in most of the cases he did, snapshot isolation and his serializable snapshot isolation performance was very close. However, he could tune the situation so it became a bottleneck, and then you can start getting substantial drop-offs. Most of the time that wasn't where he found it. >>: Well, the question is how much the divergence, and you're going to get some drop-off even for snapshot isolation if you get a high enough conflict level, so the question is what's the spread between them? >> Alan Fekete: So Michael got graphs where -- many of his graphs basically had SI, SSI, and 2PL essentially on the same curve. But he was able also to get graphs where you had SI, SSI, and 2PL spread by something like 30 to 50 percent, by choosing the weight of the transactions appropriately, putting the right delays in them. >>: [inaudible], you know, I don't know, we have an implementation of [inaudible] Ph.D. thesis simulation of concurrency control algorithms that we used for an earlier study you may recall. >> Alan Fekete: Yes. >>: It'd be interesting to get some kind of apples-to-apples comparison of these things, just purely based on the concurrency control's behavior. Because there's so many moving parts, to try to actually do it any other way is really going to be really, really difficult. And it's worth knowing. >> Alan Fekete: Yes. So, I mean, the first thing of course is just trying to understand what environment it will be that makes the concurrency control the bottleneck rather than all the other things. >>: You know, all the bench -- I mean, all the transaction benchmarks are designed to have that effect. And so -- >> Alan Fekete: That wasn't what -- well, I guess the first thing is you have to do something to make sure that the disk writes of the log don't become a bottleneck.
So you have to have something like group commit; otherwise I think that's the one that hits you first. All of these are done with everything fitting into main memory, so that takes that out. >>: [inaudible] I'm not expecting an answer, sort of a rhetorical question, it might be if you're interested in pursuing this line of understanding performance, I think that would be a useful -- >> Alan Fekete: Yeah. I mean, what we found -- so one of the experiments I didn't show you was one where we really crunched down onto a hot spot, to try to get contention in the concurrency control as the issue. But basically it doesn't turn out to be that much. What we found there was that when you make it a hot spot, all three of the SSIs are dropping off, they're all dropping off far more than SI is, so they are all collapsing. And even -- yep. >>: [inaudible] nature by design and by definition you have more conflicts, so you can come up with an artificial workload that can cause I guess as many conflicts [inaudible] even by the theory [inaudible] analytical model which cause the problem, whatever amount of [inaudible]. >> Alan Fekete: Yeah. But so there's the difference between SI and the others. But then within the SSI variants, there's also how accurate they are: how many false positives you get from just looking for two edges as against being more precise, how many false positives you get from not cleaning up an edge from an aborted transaction. >>: But you're saying -- I think my takeaway from this is those sorts of differences at the concurrency control level are small. This is really sensitive to how you build the concurrency control mechanism and what mutexes you use to protect the data structures within the concurrency control mechanism itself. >> Alan Fekete: [inaudible] yes. >>: [inaudible] maybe it does a lot better than if it isn't. >> Alan Fekete: Yeah, but, I mean, Phil's question is suppose you got away from all of that and really just looked at the concurrency control. >>: So it's kind of a weird presentation. I mean, you had this rather lengthy buildup about serializable snapshot isolation, and then in the end you told us about the way database kernels behave badly, in a way where the choice of concurrency control algorithm is quite irrelevant. >> Alan Fekete: Yes. >>: I mean, so, which is -- >> Alan Fekete: But, I mean, it's not completely irrelevant because -- >>: [inaudible] talk about, but -- >> Alan Fekete: I mean, you still have the fact that snapshot isolation doesn't have the collapse nearly as badly as all the serializable variants. >>: It's pretty interesting that they're that crude. >> Alan Fekete: Yes. But, I mean, there is still a takeaway, which is that the price of getting serializability -- which Michael and we have argued was manageable, like 10 percent at the most, compared to snapshot under most conditions -- turns out not to be true with enough cores, with the sorts of implementations that we had. Now, we're trying to find better implementations which don't have these issues. And the question is: can you get it close to what snapshot isolation has under those conditions? >>: The takeaway that I get from this is that what really shows up here is how efficiently you can maintain the data structures that you have to maintain, how efficiently you can do it under high concurrency. Right? >> Alan Fekete: Yes.
>>: Those effects are now so big that you can't really see the difference between the various algorithms. >> Alan Fekete: Yes. Yes. And, I mean, the thing is that for pure snapshot isolation, when you read, you don't have to do any modification of the internal state. And so that's fast, and it doesn't cause all these problems. Whereas with all of these things, when you do a read, you have to record the fact, and that's what you're paying for. >>: So if you want to get this fixed, you really have to have essentially a lock manager [inaudible] very high levels of concurrency. >> Alan Fekete: Yes. Yes, you need a smarter lock manager that can do this. And you have to think about the whole system in terms of what you're doing at each stage of the checks. I mean, Michael had tried to do this in designing it so the flags were transaction local. But in the implementation, he just didn't fix all the other bits. >>: With a big mutex there isn't that much you can do. >> Alan Fekete: Yes. Yes. >>: [inaudible]. >> Alan Fekete: Yes. >> David Lomet: Well, thank you very much, Alan. [applause].