>> David Lomet: So it's my pleasure today to introduce Alan Fekete to you all. I think Alan
actually knows everybody in the room, so in some sense he needs no introduction at all. He's
from the University of Sydney, and he has a long history of being involved with doing research
in transaction processing.
Over the last several years he's sort of focused his efforts on exploiting in some ways snapshot
isolation in multiple ways and has done some very nice work, including the SIGMOD Best Paper
Award in, what was it, 2008 on how to use the versions that you have from snapshot isolation to
provide serializable transactions as well.
So Alan continues to work in this area, and the talk today is on performance aspects of that sort
of work. So I'm really happy to welcome Alan here. He's been a longtime friend and colleague,
and I'm sure the talk will be very interesting. Thanks.
>> Alan Fekete: Thanks very much, Dave. Yeah. So I should just set the context. Today's talk
is purely an experimental talk. So I'm -- we don't have any new algorithms, we don't have any
new proposals. What we have is some measurements. And so the -- the idea is just to look at
some of the algorithms that have been published, including the one of our own that Dave
mentioned, but also a number of others, and see what happens when you actually run them.
So just a quick bit of marketing. I noticed -- I'm on sabbatical at the moment at Berkeley, and I noticed
that company talks always start with "we are hiring." So we're a university so we can't say that,
but at least we can sort of advertise that we exist and that there is database research at the
University of Sydney and should you have any interns or whatever who at any stage are looking
to come out for a period, that would be a wonderful thing.
>>: Notice the sunshine in that picture.
>> Alan Fekete: Yes.
>>: For those of you who don't recognize it anymore.
>> Alan Fekete: So this work is in the context of OLTP-style activities. So we're looking at
programs where there is a significant amount of modification going on of the data, not just
analytic workloads, and the standard ACID properties that people want, particularly here focus
on isolated and the different things that isolated might mean.
And I guess with -- I wasn't sure about the audience. With this audience, I think I can skip the
concept of what concurrency control is. Quickly, back to serializability, which is sort of the
academic approach to capturing what isolated might mean.
And so there's this nice definition that one covers in a database course that you have the -- and,
you know, Phil in particular has spent years developing people's intuitions about this.
The key things here are that the importance of serializability to me is fundamentally the fact that
it gives you this preservation of integrity constraints even when the database doesn't know about
the integrity constraints. It's easy enough to have a database that will maintain a declared
integrity constraint, but if there's an integrity constraint which is complicated to express, the nice
thing about serializability is that that integrity constraint will be maintained even without the
platform knowing about it, provided you write each program to be sensible on its own. So you --
it turns a global problem into a local problem at each program. And there's the standard theory
that you can do this by looking for the absence of cycles in a graph showing the conflicts.
So that's the academic situation. The reality is of course that a lot of the programs out there don't
run with serializability as -- I don't know, does anybody know if there is any platform at all for
which the default is serializable? I'm not aware of one. All the ones I am aware of give you
something like read committed as the default. And a lot of people are running their transactions
without realizing that they need to do something. So there's a lot of weaker isolation out there --
>>: According to the [inaudible] paper, they run serializability.
>> Alan Fekete: Right. Okay. So that will be --
>>: I don't know that they really do.
>> Alan Fekete: Yes.
>>: [inaudible].
>> Alan Fekete: Yes. So, anyway, so weak isolation is something that's a practical reality. And
so given the people in this audience, I don't need I think to say too much about snapshot
isolation. It's been successful in practice. Oracle have I think done everybody a substantial
disservice by confusing the terminology. So snapshot isolation is an isolation level which is not
serializable, but is a very valuable isolation level in many ways. It has plenty of good properties.
You don't get the typical anomalies. You don't block readers. So that's very helpful for a lot of
applications.
One of the things I should say about snapshot isolation that I found in teaching. Many students,
when you explain serializability, the intuition that they build up is actually snapshot rather than
serializability. Students, I have found, the way they develop a sense of what isolated ought to
mean is that when you have concurrent transactions they should be isolated from each other; that
neither should see the other, and of course that's what snapshot isolation gives you, rather than
the -- there is a serial ordering of them, which is the academic definition. So clearly snapshot
isolation hit a very sweet spot in the space.
>>: What you're calling the academic definition was actually developed at IBM.
>> Alan Fekete: It's the one that is taught academically. So thanks to Phil for pointing out that
it's actually the industry definition, but not all of the industry at the present time at least. And it's
the one that is covered in the textbooks. Maybe I should say the textbook definition.
So snapshot isolation does not, in fact, give you the textbook property of serializability in every
situation. You can get the standard -- you can get this anomaly of write skew where two
transactions don't see each other and violate integrity constraints because each of them is
modifying different data. So it doesn't happen in most of the benchmarks that are out there, but
you can design programs that make it happen.
>>: There's a reason why it doesn't happen in the benchmarks: Oracle was involved in
negotiating what the benchmarks were.
>>: Is that really true, though?
>>: Oracle wasn't going to run -- wasn't going to participate in TPC-C because they were asked to
run serializable, and it could only support true serializable with table-level locks. And they
negotiated, and then the TPC Council thought it was very important to have Oracle in the
process, and so the idea was you don't have to support serializability, you have to run serializable
for the benchmark.
>>: So in that case, if that's true, then TPC-C somehow violates serializability under SI?
>>: No. Run serializable even though you use SI.
>> Alan Fekete: Yeah. So T -- the TPC-C benchmark gives serializable executions on SI.
And --
>>: The point is that there was some earlier definition of the benchmark from [inaudible].
>>: I don't know which one they -- was the first one, but after TPC-C insisted on serializable
execution, Oracle threatened not to participate in this unless they could do it without table-level
locks.
>> Alan Fekete: I wasn't there. People here are much closer to the action. I'll defer to them. So
I hope I don't need to sort of walk through the example. This is what I liked because it's one
where one can actually sort of get a sense of how important it could be to have integrity
constraints be protected.
So this is one where you're running a hospital and you want an integrity constraint which says for
every day there's at least one doctor on duty. And by having programs concurrently take doctors
off duty, what you can end up with is that the final state is one in which nobody is on duty. And
so that seems a very real-world impact from the loss of textbook serializability from running
snapshot isolation.
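Just to make that concrete, here is a minimal sketch of the write skew (illustrative Python only, not from the talk; the Smith and Jones rows follow the example on the slide):

    # Write skew under snapshot isolation: both transactions read the same
    # snapshot and write disjoint rows, so SI's first-committer-wins check,
    # which only looks at write-write conflicts, lets both commit.
    db = {"Smith": "on duty", "Jones": "on duty"}

    def take_off_duty(snapshot, doctor, writes):
        # The integrity check runs against the transaction's private snapshot.
        on_duty = [d for d, s in snapshot.items() if s == "on duty"]
        if len(on_duty) >= 2:  # "at least one doctor must stay on duty"
            writes[doctor] = "off duty"

    snap1, snap2 = dict(db), dict(db)   # both start before either commits
    w1, w2 = {}, {}
    take_off_duty(snap1, "Smith", w1)
    take_off_duty(snap2, "Jones", w2)

    # SI aborts only on overlapping write sets; these are disjoint.
    assert not (w1.keys() & w2.keys())
    db.update(w1)
    db.update(w2)
    print(db)   # both doctors off duty: the constraint is violated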
One of the most important things, though, again, which I find is not intuitive -- I mean, for most
low levels of isolation, like read committed, the normal way people think about it is to say look at
your program and decide whether your program needs serializability or whether your program
could run at read committed.
Snapshot isolation is not like that. You can't look at the program and see whether it has some
feature that is -- needs snapshot isolation. It's not a property of each program. It's a property of
the collection of programs as a whole. It's purely from interactions.
And it's even more complicated. You can have three programs which work serial -- which give
serializable executions under snapshot isolation. You introduce a new program, and then you need to
alter what's being done in one of the previous set of programs because of the new program.
So in order to reason about this, we need some form of serializability theory that deals with
multiple versions. There have been a number out there. The papers I have worked on typically
use the version of the theory from Raz, which is slightly simplified from the most general
definitions. But it's easy to use and easy to write down, you know, in a paper.
In particular for everything like snapshot isolation, you have a version order that's given to you.
The most general definitions, the ones Phil and I worked on a lot, involved quantifying over
version orders in the theory. And in our case we just take the commit order as the version order
because that's what snapshot isolation does. And so we've -- we operate that way.
So you can draw your conflict graphs for multiversion histories, and you get the standard result
that if there are no cycles in the graph, that tells you that your execution is serializable.
The important thing to notice is it's the read-to-write conflicts which are particularly critical in
snapshot isolation. They're the ones where you have a transaction which reads a version and
does not see the effect of another transaction which is writing the same item but is producing a
later version than the one that the first transaction reads. And already Atul Adya's paper, now
almost 13 years ago, sort of established the importance of these conflicts for weaker levels of
isolation.
So, anyway, using that we draw a serialization graph and we distinguish the edges which are
from read-write -- read-to-write conflict and between transactions which are actually concurrent
with one another. So those edges are the ones that we're going to pay particular attention to in
the diagram, so I'm going to generally show them as a dashed edge. That's what's called the
antidependency.
So the theory which I developed, I mean, extending stuff that Atul Adya had done, but a theory
for snapshot isolation is that the crucial thing to look for is a situation where you have two of
these vulnerable edges in a row in a cycle. So it's not just that you have a cycle, but you have a
cycle with two of these antidependencies between concurrent transactions successively.
And the theorem -- so this is a paper -- work I did with Pat and Betty O'Neil in 2005 is that that's
the critical thing to look for when understanding whether you've got serializable executions or
not in a snapshot isolation system.
So following on from that, Dave already mentioned work principally by Michael Cahill, who
was at that stage a Ph.D. student of ours at Sydney, has since graduated and is currently working,
I believe, for WiredTiger. He proposed slightly changing the concurrency control algorithm to be very
similar to snapshot isolation, have the same nice properties of snapshot isolation, but guarantee
that the executions were serializable according to the textbook definition.
The basic idea is very simple: you track when you get these vulnerable edges, when
you have a read-to-write dependency between concurrent transactions, and you look for cases
where you have two of those in a row. Whenever you have two of those in a row,
you abort. That should ensure that you have serializable executions, because
nonserializable executions would have to have two of those in a row, and by aborting
one of the transactions, you throw the cycle away. But it should be a
lot cheaper than traditional optimistic concurrency control, where you essentially prevent all the
conflicts from happening. So you should have to do a lot less.
So Michael proposed mechanisms -- yes.
>>: [inaudible] could you [inaudible] could you show how this -- like on this example of the
doctors on duty, this is exactly what happens [inaudible]?
>> Alan Fekete: Sure. So with this, this situation here, what you have is -- so transaction one is
reading both these records and writing this one. Transaction two is reading both these records
and writing this one.
So what we have is we have a -- so transaction one has a -- an antidependency to transaction two
because transaction one has read the Smith row, it did not see the change made by transaction
two, so transaction two produces a version which transaction one did not see. So that is an edge
from T1 to T2 of the read before a write.
If we look at the Jones row, that's a case where transaction two reads it and does not see this
one's write. So that gives you a dashed edge. So what we have is a cycle of two dashed edges,
one after the other, and they're the whole cycle. There are no others. Okay? So that's what's
happening here.
>>: So your theory proved that this is a necessary and sufficient condition [inaudible]?
>> Alan Fekete: Okay. So [inaudible] no. So what we show is -- is only one direction. If you
have a -- if you do not have that structure, then you get serializability. Obviously -- sorry. Let
me -- if you have that structure in a cycle, then you have a failure of conflict serializability.
Now, it is still possible that even if it's not conflict serializable, it is serializable in, say, view
serializable sense because of some of those writes being blind writes or whatever.
So conflict serializable is actually an approximation to view serializability. But for our purposes,
our goal is to ensure serializable executions. It's okay if we occasionally abort unnecessarily. I
mean, there's a performance issue and you have to worry about how much that is.
Okay. So essentially Michael's approach -- I mean, the basic idea is look for these two edges
in a row and abort when you see that. Pragmatically, how Michael suggested to do that was in
every transaction you have two flags: one flag which says there is some edge coming in, and
one flag which indicates if there's an edge going out. And every time you see one of these
read-to-write dependencies, you set the out flag in the first transaction and the in flag in the
second transaction, and whenever you see a transaction with both its flags set, you abort it.
It's not quite that simple, because sometimes you discover that both the flags are set in a
transaction which has in fact already committed. Basically that transaction has committed, and
you're now detecting another dependency from another transaction. So at the time you discover it,
you have to abort a different transaction, but you still abort one of the transactions in the
cycle and you still protect things.
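As a minimal sketch of that flag rule (illustrative Python, not Michael's actual InnoDB code; the names here are made up):

    class Txn:
        def __init__(self, tid):
            self.tid = tid
            self.committed = False
            self.in_conflict = False    # some transaction has a rw edge into us
            self.out_conflict = False   # we have a rw edge out to some transaction

    def record_rw_edge(reader, writer):
        # Called when `reader` read a version that the concurrent `writer`
        # overwrote: an antidependency (dashed edge) from reader to writer.
        reader.out_conflict = True
        writer.in_conflict = True
        for pivot, other in ((reader, writer), (writer, reader)):
            if pivot.in_conflict and pivot.out_conflict:
                # Two consecutive vulnerable edges meet at `pivot`: abort it,
                # or abort the other transaction if the pivot already committed.
                return other if pivot.committed else pivot
        return None   # no dangerous structure yet; nobody needs to abort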
How does Michael propose to actually detect these situations? So the simplest case is somebody
did a write, you're now reading the item, but you do not see their write because you're operating
on a snapshot.
That case is particularly easy to detect because when you do the read, the algorithm for finding
the correct version for you to return goes down the list of versions, sees this one, sees that its
timestamp is greater than what you're trying to read and skips over it. So the moment you skip
over it, you know here is a case where there is an edge that's been produced. The harder case is
where you do a read and then later somebody comes along and produces a version from a
concurrent transaction. Michael proposes to detect that by sticking in the lock manager a
special lock mode to record that the read happened. And when this write happens, it's not
blocked by that lock but it knows that there is an edge and so it goes and puts the appropriate
flags into the transactions.
And Michael -- I mean, one of the things is you actually have to keep these locks around longer
than transaction commit time. You've got to keep them, you know, basically until
every transaction that was running when this transaction committed has finished. So you have to
keep them a little bit longer. But they don't block anything, so they're just sitting there in the
lock table.
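Put together, the two detection points might look roughly like this (a sketch only; the real checks live in InnoDB's row-read path and lock manager, `record_rw_edge` is the flag routine sketched above, and `concurrent` is an assumed helper testing whether two transactions overlapped in time):

    def read(txn, item, version_chain, lock_table):
        # Versions are kept newest-first. Every version that is too new for
        # our snapshot was made by a writer we will not see: each skip is a
        # rw antidependency from us (the reader) out to that writer.
        visible = None
        for version in version_chain[item]:
            if version.commit_ts > txn.snapshot_ts:
                record_rw_edge(reader=txn, writer=version.writer)
            else:
                visible = version     # first version our snapshot can see
                break
        # Leave a non-blocking SIREAD lock so later writers can find us.
        lock_table.setdefault(item, []).append(("SIREAD", txn))
        return visible.value if visible else None

    def write(txn, item, lock_table):
        # A concurrent SIREAD holder read a version this write supersedes;
        # the SIREAD lock does not block us, it just records the edge.
        for mode, reader in lock_table.get(item, []):
            if mode == "SIREAD" and reader is not txn and concurrent(reader, txn):
                record_rw_edge(reader=reader, writer=txn)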
So Michael's approach does sometimes abort unnecessarily. It's a conservative approximation.
It, for example, doesn't check whether you've got two edges in a cycle. As soon as it sees two
edges, it aborts. And sometimes that means that you're aborting things which will never show up
in a cycle.
Michael implemented this into InnoDB, so the back end for MySQL. It was very, very simple.
He ended up adding 230 lines of code to the storage manager, most of which were simply to, you
know, track these flags and keep them around and then garbage collect them as necessary.
So we have an implementation of Michael's algorithm that he did. Then in 2011 another
proposal was published for serializable snapshot isolation. So this was by Steve Revilak, a
student of Pat and Betty O'Neil from UMass Boston. And what they proposed was to say let's try
to get rid of these false positives, these unnecessary aborts.
And what they proposed they call PSSI for precise serializable snapshot isolation. And
essentially it is a very old algorithm, serialization graph testing. Right? They keep the
serialization graph and check it for cycles, and if there's a cycle, they abort.
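A minimal sketch of that commit-time cycle test (illustrative Python; PSSI's actual InnoDB implementation differs):

    def closes_cycle(edges, committing):
        # edges: dict mapping each transaction to the set of transactions it
        # has a dependency edge to. Depth-first search from the committing
        # transaction: commit is unsafe only if some path leads back to it.
        stack = list(edges.get(committing, ()))
        seen = set()
        while stack:
            t = stack.pop()
            if t is committing:
                return True     # a cycle runs through the committing transaction
            if t not in seen:
                seen.add(t)
                stack.extend(edges.get(t, ()))
        return False

    # At commit: gather this transaction's edges from the lock table, then
    # abort if closes_cycle(edges, txn); otherwise commit, keeping the node
    # around until it can be garbage collected.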
Okay. So that was their proposal. They did an evaluation in that -- so they implemented that
into InnoDB. And for their paper they did an evaluation where they compared it to an algorithm
which they and we refer to as ESSI.
So it's basically a similar idea -- it's the same idea as Michael Cahill's. And, I mean, they use it
as a proxy for Michael's algorithm. Namely, you just look for two edges and don't bother with
checking the full cycle. But their implementation is actually very different in the code. And
we'll discuss that.
So they had this and this and Michael Cahill had his all done in InnoDB, so we thought this
would be an interesting place to actually try and do an experimental discussion and see how
these different algorithms play out.
Let's look at these: there are basically several different implementation design decisions in these
systems. So the first one in some sense goes to the heart of what Michael was trying to do. So
Michael tried to make his algorithm be very lightweight, not intrusive on the InnoDB code. And
so Michael tracks things very lightly. Basically has two flags for each transaction. The
alternative is you keep this global serialization graph, which is a big data structure that
everybody shares.
Then there is the one which is the -- in some sense the algorithmic rather than the
implementation issue: how accurate are you going to be in deciding when
to abort something. The more approximate, the further you are from exact, the more unnecessary
aborts, so there could well be a performance impact from that.
And then the other one is again an implementation issue when you do the checking. And so I'll
go through each of these in turn.
So, as I said, Michael's algorithm, you have two flags for each transaction, one which says do
you have a dependency in, one which says do you have a dependency out.
So the benefit of this is that you really have very local information. Each transaction just keeps
its own flags. It has to -- when there's a conflict, you have to update one other transaction's
flags, but there isn't a global structure.
The downside is that those flags -- firstly, they're just Boolean: is there an edge in. So if you
spotted an edge in and then the transaction at the other end aborts, you don't have any way of
knowing that the flag came from that particular transaction, so you can't clean it out. You leave
the flag there, and that can cause you to abort unnecessarily.
PSSI, of course, tracks the full dependency graph. And Revilak's implementation that he used as
a proxy for Michael's, ESSI, also keeps that whole dependency graph and tracks everything at
the level of which edge there is, which transaction you are depending on, or which transaction
there's an edge out to.
The downside is that you have this huge data structure that everybody's sharing, but -- I mean,
for PSSI you can be incredibly precise, but even for ESSI, you're much more accurate than just
with a Boolean flag of was there an edge.
Then do you look for a whole cycle or do you just look for two edges in a row. So here PSSI
goes for cycle detection, whereas both SSI and Revilak's proxy ESSI do it just when there are
two edges. And obviously if you don't bother checking if your edges are in a cycle, that allows
false positives and unnecessary aborts will occur.
And then the other one, again, an implementation issue, Michael Cahill's code basically does this
on each operation. When you do a read, you see if you're skipping over a version. And, if so,
you set the flags at that point. When you do a write, you see if there is a conflicting -- a
conflicting but not blocking lock indicating a previous read. And, if so, you set the flags. And
every time you go in and set flags, you check if there are -- somebody has two flags set. And, if
so, you abort.
So Michael's algorithm is doing work on every operation. PSSI and Revilak's proxy for
Michael's algorithm, ESSI, both do no work at all during read and write, no -- no work other than
what is happening anyway. They do everything at commit time. So when the transaction goes to
commit, you go into the lock table and see what other transactions had set locks and you build
the extra edges into your conflict graph at that point and you do a cycle detection and you abort.
So here we have the potential advantage with Michael's that you might detect things
earlier. On the other hand, that also means that you're doing sort of that work frequently.
And if you have many, many operations, it may well end up being more of a hit than just doing it
once at commit time where you can get it all in a single hit.
So at least on the surface it isn't clear which of these will perform better. There are definite
advantages and disadvantages each way and complicated tradeoffs. So we thought it would
be interesting to do an experimental evaluation comparing them.
So that's what we did. So maybe we should start with what we had.
So we basically got the code from Steve Revilak that they used in their system, which was in a
sort of -- a relatively recent version of MySQL with InnoDB at the back. Michael Cahill's
system had been implemented in a somewhat older version of InnoDB. I mean, Michael Cahill's
work was in sort of 2008, whereas Revilak's is 2011. So we've got several years of progress.
So we took Michael's code, and taking advantage of the fact that we had Michael around to
consult, took it into the same release that Revilak had used. So we had a common code base
with Revilak's implementation and adaptation of Michael Cahill's. So we had Michael's SSI, and
Michael had also implemented pure SI as a comparison. So we took his port of that also, so we
had that.
So this is work that was done in collaboration between people at Sydney University and Seoul
National. At Seoul National they have plenty of nice equipment, and so we took great advantage
of that. So we've got a 24-core machine. So that's -- they're not 6 corers, they're 6 cores; I
think that's probably the spell checker at work. Four dies, six cores each. The structure is
core-private L1 and L2 cache and per-die L3 cache.
We followed both Michael Cahill's experiments and Revilak's experiments, and we set it up to do
something which is almost group commit. It's not proper group commit because MySQL doesn't
have proper group commit, but you set it so that the log gets flushed only periodically. So it
stops the log flushing becoming the bottleneck, which otherwise it would tend to be very early
in these experiments.
>>: Quick question. Does generally this support [inaudible] in your program?
>> Alan Fekete: Okay. So InnoDB itself is a complicated system. It is a multiversion system.
Its serializable level, however, is two-phase locking based. So it doesn't use the versions if
you're running with serializable. The versions are used for its concept of read committed, which is
not the usual read committed, it's some very bizarre different thing.
So Michael implemented snapshot isolation using the versions that were there for their read
committed and the code to read a particular version.
So snapshot isolation is not there out of the box. So what we have is we have Michael's code for
snapshot isolation, but it's very minimal change. I mean, the nice thing about InnoDB, the
reason Michael used it, is it already had both the versions and the lock manager.
Unlike Postgres, which had the versions but no lock manager at the time, and other systems
which had a lock manager and no versions, InnoDB gave both the pieces that he needed. So, as I
said, his whole code doing both snapshot -- serializable SI and SI itself was 230 lines of code.
And then the clients are running on a separate machine, which is tossing requests at this.
So we use a microbenchmark. It's basically three tables. The transactions read one of the tables
and write a different one and -- because they have -- otherwise you get cycles coming up.
Here's one experiment, you know. So, remember, our goal was to see how these different
algorithms compare. And, you know, so there are some times -- so the middle -- so the third one
here is Michael Cahill's, then the Revilak proxy for it is here, and the precise -- the cycle
detecting one is at the end. And the first bar is always just snapshot isolation as a baseline.
So sometimes the Revilak ones do better, sometimes Michael Cahill's do better. There doesn't
seem to be much in it. The difference from snapshot isolation is not that great, though it does
seem that these are suffering. And, you know, so probably, you know, I'd say so the conclusion
of the comparison is it depends, but not much in it.
Perhaps the most interesting thing is that, you know, in fact, you know, sometimes even
Revilak's proxy for Michael's algorithm does better than precise. But really I think if you were
looking at this you would say really it's not important enough to worry about.
So one other thing that we did was, you know, so the first experiment I showed you was a case
where it's 75 percent read-only transactions and 25 percent updating transactions. If we vary the
proportion of read-only transactions to updating transactions, we do see a significant difference.
But, again, what you're seeing is not much difference in the performance between the algorithms
at any particular workload.
Though with a very high rate of read-only transactions, we do see that snapshot isolation is
pulling well ahead, and then you're paying more for serializability, whereas with sort of the cases
where you have a more limited amount of read-only, the costs are in the range of sort of 10, 15
percent, which is what people, both Revilak and Michael Cahill's paper, report.
>>: So 100 percent [inaudible] there's no writes.
>> Alan Fekete: That's correct.
>>: So SI has like 20 percent or maybe more than the --
>> Alan Fekete: Yes.
>>: -- SSI [inaudible] so it's just flush.
>> Alan Fekete: Yes. Okay. So things are getting interesting. And indeed the most interesting
figure is what happens if --
>>: Things are coming into focus.
>> Alan Fekete: Yes. So if instead -- previously I sort of showed you a graph up to here. Let's
have a look at allowing the MPL to keep going.
>>: What is MPL?
>> Alan Fekete: Multiprogramming level. The number of concurrent clients that are throwing
requests at the database.
So this is going back to the workload, which is 75 percent read-only, 25 percent update, which is
sort of the one we sort of regard as the -- in some sense the most natural one to look at. And this
is what we found.
And basically the rest of my talk will be not talking about the differences between these three
algorithms, but in fact this much more noticeable issue which is that the performance of all of
these algorithms collapsed catastrophically as the multiprogramming level went up, as you try to
get concurrency into the system. Okay?
So the -- so people expect to reach a bottleneck and plateau, but a collapse is a sign of some
worry in the engineering of the system I would have said. Yes.
>>: How big [inaudible] how many reads and how many writes?
>> Alan Fekete: So these transactions are I think the ones which did 20 writes and a hundred
reads each of pretty small records. We did experiments where we made the transactions bigger
by an order of magnitude. I mean, it changes the numbers but not the phenomenon of the
collapse.
>>: 100 reads and 20 writes, those are [inaudible] transactions?
>> Alan Fekete: Yeah. But, I mean, not that big.
>>: [inaudible].
>> Alan Fekete: Yeah. So -- but we're seeing -- we're seeing a distinct collapse. And that's
really what this talk is about, that collapse and our attempts to understand that collapse and
explore it.
So if we look at the runtimes of the transactions, what we see is that the collapse goes along with
transactions taking much, much longer. So snapshot isolation transactions at the -- so the
read-only transactions of snapshot don't go too badly, though they do get a bit longer. The
updating transactions get a lot longer. But with the serializable versions they really slow down,
and you're seeing a very substantial penalty showing up in the runtime there.
We also looked at abort rates. And while at MPL 30 you're beginning to get aborts, the
performance collapses already at 20 and you're not getting a lot of aborts at MPL 20. So it's not
from lots of aborts and retries. It really is that the transactions are running much longer.
Some profiling. So this one is set up so that for each algorithm we show what's happening as the
MPL increases. So this is snapshot isolation. This is Revilak's, Michael's, and Revilak's precise
version. And what we're seeing is the huge amount of time in holding the mutex.
So the --
>>: The mutex on what, the lock table?
>> Alan Fekete: It's a mutex on the entire database kernel. InnoDB has a mutex on the database
kernel. It did. There is a release candidate. It's not yet I think the official current version of
InnoDB. So InnoDB is moving towards a situation where it's refactored to try to divide the
mutex. But at this stage it is a kernel mutex, single kernel mutex.
>>: [inaudible] in the InnoDB runtime? [inaudible] InnoDB internal data structures for one
shared mutex?
>> Alan Fekete: Yes.
>>: So is the 24-core machine essentially bottlenecked?
>> Alan Fekete: Yes. So, I mean, it's not every -- I mean, it's only if you have to modify the
structures. But, yes, this has basically been the problem.
So essentially you're taking these mutexes -- it's the latch to protect shared data structures
during modification. What's even worse is this is what's happening with a read-only workload.
Okay? So there is no logical contention here. You're still getting, however -- if you're running
with these algorithms, you are still getting the various locks. You're recording things. You're
not blocking anybody. But you are recording them or you're doing a check at commit time of
this graph. It will not have any edges in it, but you still have to go and do the check. And we're
seeing this performance collapse.
Pure read-only workload is -- so snapshot isolation is managing to keep going, but the other ones
are all collapsing appallingly.
Let's also -- I'm sorry for the blurring of this. This seems to have happened when I took stuff out,
so I'll interpret this figure so you can see. These are the same sort of graphs, MPL from 1 to
30. The bars are in the same order. This is what happens when we run on four of the cores.
So the platform has the capacity to essentially make some of the cores invisible. You can say
use only certain cores. So if you run on four cores, you get this. If you run on 12 of the cores,
you get the collapse already.
And fundamentally running on 24 cores you do worse than if there was only one core there.
That's the basic story.
>>: [inaudible] magnify the performance?
>> Alan Fekete: So we did experiment with doing it whether the cores were on the same die or
otherwise. There are some differences, but it's not the dominant thing. So whether you're sharing the level 3 cache or not, it's not the big effect.
There's a small effect in that for the PSSI and ESSI sometimes you can work so that if you're on
a single die, the whole serialization graph fits into the cache on that die. So that can give you a
bit of advantage. But it's not a strong effect. The collapse is not from that.
So here -- so this is one where we've actually plotted it out. So this is 1 core to 24 cores. And
you see that at MPL 30 your performance is as bad on 24 cores as it is on one core. And already at
12 cores you've got a fairly substantial collapse. Four cores you do pretty well on.
And it's interesting, you know, the earlier papers of both Michael Cahill and Revilak were
typically looking at machines with two cores or thereabouts, but 24 cores is really causing
problems.
>>: So is the reason that [inaudible] database mutex to mark the dependencies?
>> Alan Fekete: So yes. So all of these algorithms are taking that database mutex in one place
or another, and we'll now drill into where those places are. This just again shows the experiments
on different numbers of cores. And at MPL 10 you don't get a dramatic increase in the amount of
time you spend in the mutex. But if you look at MPL 30, you're getting this huge effect. Even
once you get 12 cores, you're getting lots and lots of time spent with the kernel mutex.
So let's have a look at where the mutex gets taken and why. So with Michael Cahill's code -- so
Michael Cahill went to a lot of trouble to have it so that each transaction has transaction local
flags. But you still end up scanning the list of transactions to find the transactions -- to check the
other transactions. And that is a place where you end up taking the kernel mutex.
So you do that on every read and write. You don't do much inside that. So you take it only for a
short time, but you do it often.
With Revilak's algorithm you do nothing while the transactions are running, but when you go into
commit, you take the kernel mutex and you then have this huge thing which updates the global
dependency graph and runs a cycle detection over it.
So both of them have problems. What we do see is that the ones which operate on the global
dependency graph tend to collapse earlier and more extremely. But Michael Cahill's still
collapses, even though he went to all the work of trying to have stuff with sort of
transaction-local flags just because of the problem of finding the transaction in the transaction
table.
Okay? So the problem -- as has been alluded to, the problem is the fact that you have a kernel
mutex which everybody was using to protect everything. And that's where the collapse came
from.
Just as an interesting thing, we decided also to profile the snapshot -- serializable snapshot
isolation. I mean, this came about after we'd done the previous work. Then Postgres introduced
their implementation, which was based on Michael Cahill's original ideas, but they implemented
it in Postgres. It's not just a research prototype as ours were; this is now the
production code of Postgres. Postgres is now using Michael Cahill's algorithm with some further
refinement. So this is a VLDB Industrial paper from last year.
So they have structures where they don't have a single global mutex for the kernel. They have
stuff sort of separated out, and they have sort of a separate latching structure for the
different little bits of the shared data.
So in a sense it's similar to Michael Cahill's algorithm. However, it's a bit more accurate.
Firstly, they don't just set Boolean flags, they actually track which transaction is at the other end.
So you get fewer cases of a dangling flag left by a transaction that aborted. They also have some
additional optimizations, including some for analytics which don't matter for this.
So we also took their code and measured that. For this we actually used the TPC-C++
benchmark, which is one that Michael Cahill invented, so it's basically TPC-C, and you add one
more transaction to the mix so that you actually have a situation where snapshot isolation
wouldn't work correctly. And so that was run -- this is done on a 32-core machine.
And what we see here is there is still a performance drop-off, but it happens much, much later
and it's not nearly as bad. So here -- so this is MPL 128. You're down by a third from the peak
at about MPL 32, but you haven't collapsed catastrophically. Though it's still the case that with
high MPL you are definitely paying for serializability compared to snapshot isolation, which
PostgreSQL still has available. So if you ask for repeatable read, you get snapshot there. And if
you ask for serializable, you get serializable snapshot.
So this shows that, you know, serializable snapshot isolation, it's not something sort of intrinsic
to the algorithm that you have this performance collapse, it comes very much from the
implementation details and how you share the mutexes. But even here, with some careful work --
and, you know, the Postgres people did measure it on machines with various numbers of cores --
as you get to 32 cores and higher multiprogramming levels, you are still really hurting in
performance.
I guess the conclusion I would give is that -- and here you guys are way ahead of us, is the great
importance of thinking about the implementation details and particularly the latching that goes
on in order to get decent performance on sort of the high-end servers that are coming. And so
really thinking about what's shared and how you protect what's shared becomes vital.
We agree with all the previous papers that the serializable algorithms do -- don't have much
penalty compared to snapshot isolation when there's one core or two cores, but at the moment the
algorithms are still not taking good advantage of lots of cores.
And the work, though -- what we were suggesting was not so much algorithmic rethinking of the
concurrency control but implementation rethinking to avoid contention over latches.
And that's where I end, and I guess I will hear more about the way you've sort of gone all around
this and managed to do stuff with both concurrency control changes and implementation ones.
>> David Lomet: Thank you very much.
[applause].
>>: I presume that some of these papers, past papers on serializable snapshot isolation, had
something to say about the performance drop-off from snapshot isolation perhaps simulating it or
doing it in some more controlled setting. I mean, all concurrency control algorithms ultimately
become a bottleneck under sufficiently high contention. Can you give us a sense of what would
be expected if that were the only effect?
>> Alan Fekete: So they --
>>: We may actually just see it on those graphs.
>> Alan Fekete: Okay. Yeah. So I don't have it. Actually let me just go in. I may have some
of Michael's slides among the hidden stuff. So let me just see if I've got some of those.
>>: [inaudible].
>> Alan Fekete: Okay. So, anyway, maybe let me go back to summarize. What Michael found
in his experiments was there were lots of situations where the concurrency control wasn't the
bottleneck. So most of the cases he did snapshot isolation, and his serializable snapshot
isolation, the performance was very close. However, he could tune the situation so it became a
bottleneck and then you can start getting substantial drop-offs. Most of the time that wasn't
where he found it.
>>: Well, the question is how much the divergence, and you're going to get some drop-off even
for snapshot isolation if you get a high enough conflict level, so the question is what's the spread
between them?
>> Alan Fekete: So Michael got graphs which -- where he -- in many of his graphs basically, you
know, you had SI, SSI, and 2PL essentially on the same curve. But he was able also to get
graphs where you had SI, SSI, and 2PL with something like 30 to 50 percent between them, by choosing
the weight of the transactions appropriately, putting the right delays in them.
>>: [inaudible], you know, I don't know, we have a -- we have an implementation of [inaudible]
Ph.D. thesis simulation of concurrency control algorithms that we used for an earlier study you
may recall.
>> Alan Fekete: Yes.
>>: It'd be interesting to get some kind of apples-to-apples comparison of these things, just
purely based on the concurrency control's behavior. Because there's so many moving parts, to
try to actually do it any other way is really going to be really, really difficult. And it's worth
knowing.
>> Alan Fekete: Yes. So, I mean, yeah, so the first thing of course is just trying to understand
what environment it will be that makes the concurrency control the bottleneck rather than all
the other things.
>>: You know, all the bench -- I mean, all the transaction benchmarks are designed to have that
effect. And so --
>> Alan Fekete: That wasn't what -- well, I guess the first thing is, you know, you have to do
something to make sure that the log, the disk -- the disk writes of the log don't become a
bottleneck. So you have to have something like group commit, otherwise I think that's the one
that hits you first.
All of these are done with everything fitting into main memory. So that takes that out.
>>: [inaudible] I'm not expecting an answer, sort of a rhetorical question, it might be if you're
interested in pursuing this line of understanding performance, I think that would be a useful --
>> Alan Fekete: Yeah. I mean, what we found -- so one of the experiments I didn't show you
was we -- we did the one where we really sort of crunched down onto a hot spot to try to get
contention in the concurrency control as the issue. But basically it doesn't turn out to be that
much.
So what we found there was that when you make it a hot spot, all three of the SSIs are dropping
off, they're all dropping off far more than SI is, so they are all collapsing.
And even -- yep.
>>: [inaudible] nature by design and by definition you have more conflicts, so you can come up
with an artificial workload that can cause I guess as many conflicts
[inaudible] even by the theory [inaudible] analytical model which causes the problem, whatever
amount of [inaudible].
>> Alan Fekete: Yeah. But so -- I mean, there's the difference between SI and the others. But
then within the SSI variants, there's also how exact they are, so how many false positives you get
from just looking for two edges as against being more precise, how many false positives you get
from not cleaning up an edge from an aborted transaction.
>>: But you're saying -- I think my takeaway from this is those sort of differences -- the
differences at the concurrency control level -- are small; this is really sensitive to how
you build the concurrency control mechanism and what mutexes you use to protect the data
structures within the concurrency control mechanism itself.
>> Alan Fekete: [inaudible] yes.
>>: [inaudible] may be it does a lot better than if it isn't.
>> Alan Fekete: Yeah, but, I mean, Phil's question is suppose you got away from all of that and
really just looked at the concurrency control.
>>: So it's kind of a weird presentation. I mean, you had this rather lengthy buildup about
serializable snapshot isolation, and then in the end you told us about the way database kernels
behave badly, in which case the choice of concurrency control algorithm is quite irrelevant.
>> Alan Fekete: Yes.
>>: I mean, so, which is --
>> Alan Fekete: But, I mean, it's not completely irrelevant because --
>>: [inaudible] talk about, but --
>> Alan Fekete: I mean, you still have the fact that snapshot isolation doesn't have the collapse
nearly as badly as all the serializable variants.
>>: It's pretty interesting that they're that crude.
>> Alan Fekete: Yes. But, I mean, there is still a takeaway, which says that the price of getting
serializability -- which Michael and we have argued was manageable, like 10 percent at
the most, compared to snapshot under most conditions -- turns out not to be true with enough
cores, with the sorts of implementations that we had.
Now, one could maybe -- and we're trying to find better implementations which don't have these
issues. And the question is can you get close to what snapshot isolation has under those
conditions.
>>: The takeaway that I get from this is that what matters -- what really shows up here -- is how
efficiently you can maintain the data structures that you have to maintain, how efficiently
you can do it under high concurrency. Right?
>> Alan Fekete: Yes.
>>: Those effects are now so big that you can't really see the difference between the various
algorithms.
>> Alan Fekete: Yes. Yes. And, I mean, the thing is that for snapshot isolation, pure snapshot
isolation, when you read, you don't have to do any modification of the internal state. And so
that's fast. And it doesn't cause all these problems. Whereas all of these things, when you do a
read, you have to record the fact. And that you're paying for.
>>: So if you want to get this right, you really need a very -- essentially a lock
management [inaudible] very high levels of concurrency.
>> Alan Fekete: Yes. Yes, you need a smarter lock manager that can do this. And you have to
think about the whole system in terms of what you're doing at each stage of the checks.
I mean, Michael had tried to do this in designing it so the flags were transaction local. But in the
implementation, he just didn't check all the other bits.
>>: With a big mutex there isn't that much you can do.
>> Alan Fekete: Yes. Yes.
>>: [inaudible].
>> Alan Fekete: Yes.
>> David Lomet: Well, thank you very much, Alan.
[applause].