>> Vivek Narasayya: So it's my pleasure to introduce Andy Pavlo who is a PhD student about to
graduate from Brown University. And so today he's going to talk about his adventures at the
dog track.
>> Andy Pavlo: All right. Thanks everybody for coming today. So he's right. I am going to talk
about my two main passions in life. And that's the science of database systems and gambling
on greyhounds at the dog track. Now, I realize, I hear some snickers, for a lot of you these two
things seem like they have nothing to do with each other whatsoever. But what I'm going to
show today is that there's been specific research challenges or problems that we faced when
trying to scale up database systems to support modern transaction processing workloads or
modern workloads where the answers to those problems have come directly from things I’ve
either seen or learned while at the dog track.
So before I get into this, I want to give a quick overview of what sort of the state of the
database world is right now for people that want to run front end applications, and I'll loosely
categorize the type of systems that are out there today into three groups. So the first is what
I'll call our traditional database systems. So these are things like DB2, SQL Server, Oracle,
MySQL, Postgres, and the key thing about these systems is that they make the same
architectural design assumptions and hardware assumptions that were made in the 1970s
when the original database systems, System R and Ingres, were invented. Obviously a lot of
things have changed since then. The second group has come around in about the last decade or so,
and these are colloquially referred to as NoSQL systems. So these are things like MongoDB,
Cassandra, Riak, and these systems are really focused on being able to support a
large number of concurrent users at the same time because they want to be able to support
web-based and internet-based applications. And so for them, these traditional database
systems weren't able to scale up to support their needs.
So my work, my research is really focused on this emerging class of systems that has come
about called NewSQL. And we are trying to have the best of both worlds. We’re trying to
maintain all the transactional guarantees that you would get in a traditional database system,
but while still being able to scale up and support a large number of concurrent users in the
same way that the NoSQL guys can. And so for this talk, I'm not really going to talk about the
NoSQL guys other than to say that their work is complementary to ours. Right? So there's
certain applications where you'd want to use a NoSQL system and not a NewSQL system and
vice versa.
So now let's look at what one of these modern transactional workloads looks like. So here we
have a workload that's derived from the real software system that powers the Japanese version
of American Idol. I realize this is not called American Idol in Japan, but just go with me for this.
And so in this application, what you have are people either calling in or using their laptops to go
online and vote for contestants that they like on the show. And so when one of these requests
comes in, one of these calls comes in, the application starts a transaction that will check in the
database to see whether this person has called before, voted before. And if they haven’t, it will
go ahead and create a new vote entry for them and update the number of votes that a
contestant has gotten. All right? So this seems like a pretty simple application. It seems like a
pretty simple workload that really any modern database system should be able to support
without any problems.
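[For illustration, here is a minimal sketch of that three-step vote transaction written as a plain JDBC method. The schema, table, and column names (votes, contestants, phone_number, num_votes) are assumptions made up for this sketch, not the actual benchmark code.]

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class VoteTransaction {
        // One vote request: reject if the phone number has already voted, otherwise
        // record the vote and bump the contestant's running total.
        public static boolean vote(Connection conn, long phoneNumber, int contestantId)
                throws SQLException {
            conn.setAutoCommit(false);
            try {
                // Step 1: has this caller voted before?
                PreparedStatement check = conn.prepareStatement(
                        "SELECT COUNT(*) FROM votes WHERE phone_number = ?");
                check.setLong(1, phoneNumber);
                ResultSet rs = check.executeQuery();
                rs.next();
                if (rs.getLong(1) > 0) {
                    conn.rollback();
                    return false;               // already voted
                }
                // Step 2: create a new vote entry.
                PreparedStatement insert = conn.prepareStatement(
                        "INSERT INTO votes (phone_number, contestant_id) VALUES (?, ?)");
                insert.setLong(1, phoneNumber);
                insert.setInt(2, contestantId);
                insert.executeUpdate();
                // Step 3: update the number of votes the contestant has gotten.
                PreparedStatement update = conn.prepareStatement(
                        "UPDATE contestants SET num_votes = num_votes + 1 WHERE contestant_id = ?");
                update.setInt(1, contestantId);
                update.executeUpdate();
                conn.commit();
                return true;
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }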
So to test this hypothesis, we took two open source traditional database systems, MySQL and
Postgres, and we tuned them for these type of sort of front-end transactional workloads. And
we want to measure how well they can scale up and get better performance as we give them
more resources. So on one machine we are going to run the database system, and we’re going
to scale up the number of CPU cores that we’re going to allocate to the system to process these
transactions. And then on another node, we’re going to simulate people calling in and voting
for contestants. And what we found is that in both cases, both these systems, they can't break
past 10,000 transactions a second. Right? In the case of Postgres, the performance actually gets
worse as you give them more CPU cores. And so I'm not trying to pick on MySQL and Postgres
here, but I'll say that these results are emblematic of other performance results that we've seen
in other traditional database systems. So we've done other experiments where we took Oracle
and paid for a very expensive DBA to come out and tune it for us, and we saw the same kind
of results. As you add more CPU cores you don't get better performance.
So now the question is: what's going on? What is it about these traditional database systems
that is causing them to not be able to scale up in what seems like a simple workload? So in
another project in our research group, they took another open source traditional database
system and they instrumented the code to allow them to measure how much time is spent, or
how many CPU cycles are spent, in different components of the system when they run one of
these transactional processing workloads. They found that about 30 percent of the time is spent
in the buffer pool. So this is managing an in-memory cache of records that have been pulled
from disk and running some eviction policy to decide, when we need room, what to evict and
what to write back out to disk. Another 30 percent of the time is spent in the locking
mechanisms of the system. So because there's a disk, transactions can be run at the same time,
and one of them could try to touch data that's not in the buffer pool, and therefore has to get
stalled while the record that it needs gets fetched into main memory. So this all accounts
for about 30 percent of the overhead if you have to do this.
Another 28 percent of the time is spent in the recovery mechanisms of the system. So this is
because we have an in memory buffer pool; there could be dirty records that have not been
safely written to disk yet. So they have to use things like a redo log or write-ahead log or other
mechanisms to make sure that any transaction that gets committed, that all of the changes are
durable and persistent if there's a crash. You don't lose anything that's already been
committed. So that leaves us a paltry 12 percent of time left over to actually do useful work for
the transactions. So this is why these traditional systems are not scaling up, right? Because we
just simply have all this other overhead for all these other components. Right? And so where
does that leave an application developer? Yes.
>>: I don’t quite see why that's scalability and not simply characteristic of even a uniprocessor
system.
>> Andy Pavlo: So your question is-
>>: What's the multicore angle on this?
>> Andy Pavlo: So this is like if, in a multicore system, right, I see what you're saying. The
question is: this is more emblematic of traditional systems rather than being a multicore
issue. And I'll say, sort of like the idea is that, well, in these types of workloads you are usually
CPU bound, right; these experiments here are with a main memory database system. So there
is no disk. So that's why we're showing the CPU latencies. So this is just showing that if you
want to be able to scale your systems to run on this modern hardware with a lot of cores,
as you start to scale them up you think you get better performance but you don't because
you're paying sort of all this extra overhead to do all of this locking stuff. You're shaking your
head no. Sorry. Yes.
>>: That’s not the point. One says that the overhead doesn’t [inaudible] per se.
>> Andy Pavlo: The question is-
>>: So there's something else going on?
>> Andy Pavlo: Yes. Correct. Well, no. So it's, I mean I’ll come to this, yes. There are other
things. But the basic idea is that there is all this architectural baggage from the 1970s that all
these three things are representative of. And so if you have, if you take a new look at what
these workloads looked like and what the hardware can do, what you can do with that
hardware, then maybe you don't need these three things.
>>: Should I think of the gray parts as having lots of synchronization [inaudible] in them and
green parts being like-
>> Andy Pavlo: Absolutely. So yes, in the case, especially in the lock manager, right?
>>: Okay.
>> Andy Pavlo: So if you have concurrent transactions-
>>: That's kind of the answer.
>> Andy Pavlo: Locks, latches, and mutexes and other things.
>>: The other problem about the overhead is the synchronization that's done in all these-
>>: But these types of operations, like maintaining a buffer pool, recovery, they may be
inherently synchronization-
>> Andy Pavlo: Yeah, because you have to pin pages, yes.
>>: In order to prevent scalability, you would have to argue that the percentage of time spent
in each of these increases as you increase the number of cores. It's not just a matter of having
this overhead. If this overhead remained constant while you increased the number of cores,
you get scalability.
>> Andy Pavlo: Right. So let me go further, and if you have more questions when I say sort of
what our system is doing, we can talk about that. Okay. So where does that leave
application developers? Well, up until a few years ago, like I said, there were really only two choices.
You could go with a traditional database system. A lot of people do this because they provide
the strong transactional guarantees, and it's actually easier to write programs when you have
transactional semantics. But these systems are notoriously hard and notoriously expensive to
scale up. So anecdotally I'll say I have a colleague that works at one of the big three database
vendors, which is in Microsoft, and he tells me that their, one of their largest customers is a
major bank in the US that pays about a half billion dollars per year just to run their
transaction processing system. And so for most companies, most organizations, that's simply
infeasible. So for a lot of people, in the last decade or so, we've seen the rise of these NoSQL
systems because they're able to get the better performance that you want for Internet and
web-based applications when you have a large number of concurrent users. But these systems
achieve this performance over the traditional database systems by forgoing all the transactional
guarantees that the traditional systems provide. So in the application you have to write code
to be able to reason about inconsistent or eventually consistent views of the database.
So again, our focus is going to be on the class, the NewSQL class of systems where again, we’re
trying to have the best of both worlds. We're trying to be able to scale up and get better
performance while still maintaining support for transactions. So the real research
problem we are trying to solve here is: how do we actually do this? And what I'll say is we're not
going to do this by being a general-purpose system. We're really going to focus on a certain class of
applications that have key properties that we can exploit in our system. And we're not going to
try to claim to be a one-size-fits-all database system for everyone. And so the first question is:
well, what are these properties in these types of applications that we are going to focus on,
that we want our system to be optimized for, that we are really going to take into
consideration? And so the answer to this first problem can actually be found at the dog
track.
So specifically, there's three important characteristics of greyhounds and dog racing and the
types of transactions and the applications we want to support that are actually directly analogous to
each other. So the first is that both of these things are very fast. So in the case of
greyhounds, they're one of the fastest animals you can have on the
planet. They can run at almost 40 miles per hour. And similarly, our transactions are the
fastest type of transactions you can have in database systems. So we’re talking about
transactions that can finish on the order of milliseconds rather than minutes or even seconds.
We’re not talking about long running transactions.
The second thing is that both these things are very repetitive. So in greyhound racing, the dog
just runs around in a circle on the track and that’s it. Right? There's nothing else that it
actually can do, right? Similarly, in our applications we’re going to focus on, the database
system is going to be doing the same set of operations repeatedly over and over and over
again. All right? So if you remember back from our American Idol example, when someone
calls in, there's only one transaction that's ever going to be invoked and that one transaction
only has three steps. So we're not going to focus on optimizing the system to allow people to write
arbitrary transactions or open up a terminal and write a random query in.
And the last thing is that both of these items are very small or have a small footprint. So what I
mean by that is greyhounds actually have a small footprint or paw print for a dog of their size or
stature; and similarly, our transactions are going to have a small footprint in the overall
database. So the data set itself could be quite large, but each individual transaction is only
going to touch a small number of records at a time. So we're not talking about long running
queries that are doing full table scans, doing complex joins to compute aggregates and things
like that. We are really talking about transactions that come in, use an index to do point
queries to find the individual records they want to read, and only process them.
So now based on these three properties, we've designed a system called H-store that's
optimized from the ground up to work specifically for, to operate efficiently for, transactions in
these types of applications; and this is work that I've done as part of my dissertation, along with
colleagues at Brown, MIT, Yale, and, at the time, Vertica Systems. And so in H-store, and this
maybe gets to address some of your questions, there are three key
design decisions that we're going to make that are a direct reaction to the bottlenecks that we
saw in the traditional systems. So in the traditional systems, they're inherently disk oriented.
So all that machinery that I talked about before in that pie chart, a lot of that you have
to have because, you know, a transaction could try to touch data that's not in the
buffer pool but that’s on disk. But for these modern workloads that we’re looking at, in many
cases the database can fit entirely in main memory. You can buy a small number of machines
that have enough RAM that's able to store the entire database entirely in main memory. And
so in H-store, we are going to have a main memory storage engine. We’re going to assume that
the database is small enough to be able to fit in RAM. So we’re talking about databases that
are usually, for these types of applications, a couple hundred gigabytes. And the
largest one that I know of is Zynga's, which is roughly about 10 terabytes. So again, it's perfectly
feasible to buy enough machines that have enough memory to do this.
The second thing is, in a traditional system, because there's a disk, they have to allow
transactions to run concurrently because at any time one could stall because it tried to touch
something that's not in the buffer pool. And so again, to do this you have to have a
concurrency control scheme that's using locks and latches and mutexes and other
synchronization methods to make sure that one running transaction does not violate the
consistent view of another transaction that's running at the same time. But now if everything is in
main memory, you're never going to have those kinds of disk stalls. So maybe it does not make
sense to actually have a lock manager and have concurrent transactions anymore. So in H-store,
we're going to have serial execution of transactions, meaning we're going to execute
transactions one at a time on a single core. And this sort of makes sense because if the cost of
going and acquiring a lock in main memory is the same as actually just accessing the data in
main memory, you might as well go access the data.
And lastly, in a traditional system, they have to use a more heavyweight recovery mechanism to
make sure that all changes are persistent and durable after a crash. So they use something
where they record the individual changes that were made on each record that was read or
written to by a transaction. So in H-store we're going to use a more compact logging scheme
that's more lightweight and more efficient where we only need to store what a transaction was
rather than what it actually did. And I'll explain a little bit more about what I mean by that in a
second.
So now, the basic architecture of H-store is that the databases can be split up into disjoint
subsets called partitions that are stored entirely in main memory. So in this example here,
say I have a single node with two partitions, and so my database is going to be split into
disjoint subsets where one half the database will be in one partition, the other half will be in
the other partition. And for each of these partitions, it's going to be assigned a single threaded
execution engine that has exclusive access to all the data at that partition. And what that
means, if any transaction needs to touch data at that partition, it has to first get queued up and
then wait to be executed by that partition's engine, and because we have these [inaudible]
locks at the partition level, when a transaction's running, since the engine is single threaded, it
knows that no other transaction is running at the same time, so we don't have to set any
fine-grained locks or latches down within the underlying data structures within the partition. So
these transactions run from beginning to end without ever stalling.
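[A minimal sketch of that single-threaded, per-partition execution model: each partition owns its data plus one engine thread that drains a queue of transactions one at a time. Class and method names are made up for illustration and are not H-store's actual code.]

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // One of these per partition: exclusive, single-threaded access to the
    // partition's data, so no fine-grained locks or latches are needed inside
    // its data structures.
    class PartitionEngine implements Runnable {
        private final BlockingQueue<Runnable> txnQueue = new LinkedBlockingQueue<>();

        // Transactions that need this partition's data get queued here; being at
        // the front of the queue is effectively holding the partition-level lock.
        void submit(Runnable txn) {
            txnQueue.add(txn);
        }

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Runnable txn = txnQueue.take(); // exclusive access: one txn at a time
                    txn.run();                      // runs start to finish, never stalls on disk
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }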
So now to execute a transaction, the application comes along and it's going to pass in the name
of the stored procedure that it wants to invoke and then the input parameters for that
transaction. So in H-store, the primary execution API is going to be through stored procedures.
And stored procedures are essentially, in our world, a Java class file where you have a bunch of
predefined queries that each have a unique name, and then a run method that takes in the
input parameters sent in by the application and invokes program logic that will make
invocations of the predefined queries. And so we have a very important constraint in our
stored procedures in H-store and that is they have to be deterministic. And what I mean by
that is they’re not allowed to make, you know, use a random number generator inside the run
method, or go grab the current time, or make invocations using our PC to some outside system.
All the information that it needs, and to process that transaction, has to be contained within a
past in from the client. And this will be important later on for recovery mechanisms. Yes.
>>: Does that mean you can't depend upon the prior state of the database? You can't have a
conditional logic which-
>> Andy Pavlo: That's okay. Yeah. So the determinism really has to be if we re-execute this
transaction at a later date in the same order that we process it originally, we need to end up
with the same ending state. It's perfectly fine to do a query, read back the state and then if
branched, do something different. That's fine.
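[To illustrate, here is a rough sketch of the vote transaction recast as a stored procedure in this style: a class with predefined queries and a run method that receives only the client-supplied parameters and stays deterministic. The base class and its executeQuery helper are invented scaffolding for this sketch; the real H-store API differs in its details.]

    // Illustrative scaffolding only; stands in for the engine's real stored-procedure API.
    abstract class StoredProcedure {
        abstract Object run(Object... params);
        // Dispatches one of the predefined queries to the partition's execution engine.
        protected Object[] executeQuery(String sql, Object... args) {
            throw new UnsupportedOperationException("stub for illustration");
        }
    }

    class Vote extends StoredProcedure {
        // The predefined queries; the engine knows about these ahead of time.
        static final String CHECK  = "SELECT COUNT(*) FROM votes WHERE phone_number = ?";
        static final String INSERT = "INSERT INTO votes (phone_number, contestant_id) VALUES (?, ?)";
        static final String UPDATE = "UPDATE contestants SET num_votes = num_votes + 1 WHERE contestant_id = ?";

        @Override
        Object run(Object... params) {
            long phone = (Long) params[0];
            int contestant = (Integer) params[1];
            // Deterministic: no clocks, no random numbers, no RPCs to outside systems.
            long priorVotes = (Long) executeQuery(CHECK, phone)[0];
            if (priorVotes == 0) {                  // branching on data read back is allowed
                executeQuery(INSERT, phone, contestant);
                executeQuery(UPDATE, contestant);
            }
            return priorVotes == 0;
        }
    }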
Okay. So now the transactional request will be queued up at the partition that has the data
that it needs, and once it reaches the front of that engine's queue, it will have the global lock
for that partition, and it's going to be allowed to start running. Now when it finishes, we're
going to go commit its changes right away; but before we send the result back to the
application, we have to write out the same information the application sent us originally out to
a command log on disk. Yes.
>>: [inaudible] an individual transaction can't access multiple data from multiple partitions?
>> Andy Pavlo: No, it can. And we'll get to that too. That's later. Yes. And so I’ll say this
command log, this writing this out is done on a separate thread; we are not blocking the main
engine. Yes.
>>: Does this mean as you get up in cores you have to do a finer-grain partition?
>> Andy Pavlo: Yeah. So you could. I mean for one core there's one partition. Finer grain in
the sense of like-
>>: You have more cores, you have more partitions.
>> Andy Pavlo: Absolutely. Yes. So there is an upper limit, that's sort of related to his
question. There’s an upper limit as you partition more. You could end up with more multipartition transactions. Yeah. And so we’ll do, we're going to batch these entries into command
log together and do a group command where it’s just one [inaudible] to write them all at the
same time and sort advertising the cost of doing that write across multiple transactions. So
now once this node’s safely written endurable out to the command log, it's a safe for us to go
ahead and write, send back the result to the application. Now there's also a replication scheme
that's going on here where we are doing active-active replication where we can just forward
these transactional requests from the application to our replica nodes and then process them in
parallel, but I'm not going to talk about that today, right now because it sort of complicates
everything that we’ll talk about later on. But if you want to know more about it I'll be happy to
talk about it afterwards.
So now, while this is all going on, the database system in the background is taking
asynchronous snapshots of the partitions in memory and then writing them out to disk as well.
So we're going to use a copy on write mechanism so we don't slow down the main execution
pipeline of the execution engines. And so now if there's a crash, all we need to do is load in the
last checkpoint that we took and then we can replay the command log to put us back into the same
database state again. So this is why they have to be deterministic. On recovery, we want to
make sure we end up with the same result.
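[A minimal sketch of the command-logging idea just described: each entry records what the transaction was, its procedure name and input parameters, rather than the individual record changes it made; on recovery, the last snapshot is loaded and the entries are replayed in order, and determinism guarantees the same end state. The class names and serialization choices are illustrative assumptions.]

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.util.List;

    class CommandLogEntry implements Serializable {
        final long txnId;            // global order to replay in
        final String procedureName;  // which stored procedure was invoked
        final Object[] parameters;   // only the client-supplied inputs, nothing else
        CommandLogEntry(long txnId, String procedureName, Object[] parameters) {
            this.txnId = txnId;
            this.procedureName = procedureName;
            this.parameters = parameters;
        }
    }

    class CommandLog {
        private final FileOutputStream file;
        private final ObjectOutputStream out;
        CommandLog(String path) throws IOException {
            file = new FileOutputStream(path, true);
            out = new ObjectOutputStream(file);
        }
        // Called off the engine's critical path; a whole batch of entries is flushed
        // with a single fsync (group commit), amortizing the cost of the write, and
        // only then are the results released back to the clients.
        synchronized void append(List<CommandLogEntry> batch) throws IOException {
            for (CommandLogEntry e : batch) {
                out.writeObject(e);
            }
            out.flush();
            file.getFD().sync();
        }
    }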
>>: You need a transaction consistent checkpoint, right?
>> Andy Pavlo: Correct. Yes. But you need to make sure all the nodes are doing the checkpoint
at the same time, but then you use a copy on write mechanism to make sure that you don't
block everybody else. But yes. Okay. So now if we go back to our Japanese American Idol
workload we looked at before, and this time we're going to run the same workload on the
same hardware using H-store. But you see, when we give H-store a single CPU core to process
transactions, it can do over 20,000 transactions a second. But now the main difference is as we
scale up the number of cores that we give the system, we can get better and better
performance, up to a factor of about 25X over what we can get in the traditional systems on
eight cores. And so now immediately every single one of you here, when you see a
performance gain like this, should have a red light or siren going off in the back of your head
telling you to be skeptical. And what I'll say is that this is not a parlor trick. All three systems
are running the same workload, the same serializable isolation level, and they're running the
same durability and persistence guarantee. So if the database crashes, all three systems can
recover any transactions that were committed.
So we can actually look at some other workloads and see the same kind of performance results.
So TPC-C, as I'm sure everyone is aware, is the canonical benchmark that everyone uses to
measure the performance of these sort of types of systems. And we see the same kind of
performance results as we give more cores to H-store we can get a better performance,
whereas the traditional database systems it simply flat lined. Telecom One or TM1 or it’s
actually referred to as TDP now, is a workload from Ericsson where that simulates someone
driving down the highway with their cell phone and the cell phone has to update the towers
and say, if you need to call me, here's where to find me. And we see the same kind of result as
well. As we give more cores to H-store, we can do better, whereas the traditional systems, the
performance actually gets worse.
So now you would look at these results and say, well this is great. H-store does much better
than the traditional system; why would I ever want to use a traditional system today when
something like the architecture, of a main memory architecture like H-store, can get much
better performance? And the answer should be quite obvious to everyone here, and that is the
inherent problem with a main memory database system: you're limited to databases that
can fit in main memory. But that's okay, because out-of-the-box H-store supports multi-node
deployment. So here we looked at the same three workloads that we started off with before,
and this time we’re going to scale up the number of nodes in our cluster, so we're going to go
from one to two to four nodes, with eight cores per node, and we see the same kind of thing. As we
add more hardware to the system, we are able to get better performance, and this checkered
line here is sort of marking where we ought to be in terms of achieving linear scalability, which
is the gold standard, what you want to have in a distributed database. So as we double
the number of nodes, we want to get double the performance. In the case of the voter
benchmark, we're pretty close to achieving that. In the case of TPC-C and Telecom One, we're a
little bit off because of the nature of the workload. And so again, now you look at us and say
well this is great because in H-store I can add more machines, I can run databases that are
larger than the memory on a single machine, why would I ever want to use a traditional database
system when I can just buy more machines and scale out with H-store? And again, sort of
related to the gentleman's question here is: in these three workloads, all the transactions
were single partitioned, meaning-
>>: It's not CPU cores, it's the number of machines?
>> Andy Pavlo: CPU cores, divide by eight. So it's one to two to four. Actually I marked that.
Sorry. So for these workloads here, all of the transactions only touch a single partition. So
when they ran, they did not need to coordinate or synchronize with any other node in the
cluster. And that's why we're able to get this really good performance. But now if you have a
transaction that has to touch multiple partitions, you end up with what is known as a
distributed transaction. And this has really been the main bottleneck, the main problem of
why a lot of the distributed databases from the 1980s did not really become popular,
because these systems aren't able to scale up. And although H-store is a NewSQL
system, with its modern code base and modern hardware assumptions, is not immune to this
problem either. So now we go back to the TPC-C benchmark again and this time we make 10 percent
of the transactions be distributed, meaning 10 percent of the transactions need to touch data
at two or more partitions. Now we see, as we scale up the number of nodes, the performance is
terrible. It's completely flat-lined, and we are nowhere near where we want to be in terms of
linear scalability. So now you look at this and say this is God-awful, why would I ever want to
use H-store, because if I try to scale up and have multiple nodes in my cluster, I'm paying for more
hardware, I’m paying for more energy, I’m paying for more maintenance for those machines,
and I'm not getting better performance at all. I might as well go back to traditional database
system where at least if I pay more money I can try to scale up and get better hardware that
way.
And so this is a real problem because not all transaction, not all workloads can be perfectly
single partitioned in the way that we assumed before. So in the early days of this project, we
actually visited PayPal, and PayPal had this legal requirement where customers from different
countries couldn’t be on the same partitions. So if you had an account in Italy, if you had an
account in the US and you wanted to send money between the two people, they had some
legal requirements that it had to be a distributed transaction. So an architecture like H-store
simply would not work for them. In many cases, the application schema itself is not easily
partitionable at all either. So you would end up with this bottleneck. So this is the main thing
we’re trying to solve here. How can we try to achieve linear scalability? And when we have
distributed transactions.
So to do this, the first thing we have to do is to figure out well, what's going on? What is it
about these distributed transactions in a system like H-store that's causing this bottleneck? So
now we want to look at an example where we have a multi node cluster. Say we have an
application that comes along and submits a transactional request to this system, and this time
this transaction needs to touch data at these four partitions. So the way H-store’s concurrency
control protocol works is that we have to acquire the locks for these partitions first before the
transaction’s allowed to start running. And the reason why we have to do it first is we don't
have to do deadlock detection if we sort of have more fine grain locking now because that
would be expensive to do in a distributed environment, especially if you have transactions that
are finishing the order of milliseconds.
But now the first problem we are going to hit is we don't actually know what partitions this
transaction actually needs before it starts running. So we have to lock the entire cluster even
though we're never going to need most of those guys. So now once we do this, and the
transaction is allowed to start running, it can issue the query requests to these remote
partitions to either access or modify data that's located at the other nodes, and we’re going to
see a second problem. That is, if we actually knew the number of queries this transaction
needed to execute at each node, we would see that it needs to execute more queries at this node
at the bottom, or touch more data at the node at the bottom. So what we really wanted to do
is when the request came in, we wanted to be able to automatically redirect it to this node
down here and run the stored procedure there because that would result in a fewer number of
network messages because most of the data you need is, would be local, we won’t have to go
to the network to send query requests to these remote nodes. But again, we don't know this
information because we’re dealing with arbitrary stored procedures. Right? So this is a difficult
problem, and you could try to apply things like static code analysis and other things, but that
would be too slow to do on every single transaction that comes in.
So luckily for us, the solution to this problem can actually be found back at the dog track. So I
wouldn't say that I was going all the time, I wouldn’t go every day, but you know, you go a
couple times a week, holidays, Fourth of July, Memorial Day, Mother's Day, stuff like that. And
it was one of those things where you start going to the same place over and over again you
start to notice the same people in the same patterns, right, of people doing the same thing
every single time. And this guy, I met this guy named Fat-faced Rick; and he first came to my
attention because he was winning every single bet he was making with all the bookies at the
track. He wasn't betting a large amount each time, but every single time he made a bet, he was
right almost 100 percent of the time. And it took me a while, but I finally figured out what he
was doing. Every single morning before a race he would go down to the parking lot where all
those trainers would bring in their dogs and sort of check in for that night's race, and he would
pretend to be a vet from the state gaming commission, and he would tell the trainers I need to
look at your dogs, make sure that they're up to regulation and they don't have any health code
problems, right? But what he was really doing was checking them out to figure out which ones
were in the best shape, which ones were the strongest, and which ones didn’t have any injuries.
And those are the ones he would go make his bets on. And that's why he was always winning.
So this is the same thing we need to do in our database system. We need to know what things
are going to do when they come in before they start running. So to do this, we built a machine
learning framework called Houdini that we've integrated into H-store that's going to allow us to
predict the behavior of transactions right when the request comes in without having to run them first.
we have a very important constraint in this work of how we make these predictions, and that is,
we can't spend a lot of time figuring these things out. So we can't spend, say, 100
milliseconds figuring out what a transaction’s going to do if that transaction’s only going to run
for five milliseconds.
So the underlying component of how Houdini works is that we’re going to create Markov
models, or probabilistic models, of all the stored procedures that the application could execute.
And so we’re going to build these based on, from training sets of previously executed
transactions. So for each model, we're going to have the starting and terminal states for the
transactions, so the beginning, the commit, the abort states. And then we're going to have the
various execution states that the transaction could be in at runtime. So these execution states
are represented by the name of the query being executed, how many times you've executed
the query in the past, and what partitions this query’s going to touch. Now each of these states
are going to be connected together by edges that are weighted by the probability that if a
transaction’s at one state, it’ll transition to another state. So now at runtime, when a request
comes in, Houdini will grab the right model for that unique request, unique transaction
invocation, and it will estimate some path through this model, and based on the states that the
transaction will visit when it traverses through the model, that will tell us what are the
optimizations we can apply at runtime.
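[A rough sketch of that model structure: each execution state records the query, how many times it has been executed so far, and the partitions it touches, and edges carry transition probabilities; following the most probable edges gives an estimated path and therefore the set of partitions to lock up front. The class names and the greedy traversal are illustrative assumptions, not Houdini's actual implementation.]

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    class MarkovState {
        final String queryName;        // which predefined query is being executed
        final int invocationCount;     // how many times it has been executed so far
        final Set<Integer> partitions; // partitions this query touches
        final Map<MarkovState, Double> edges = new HashMap<>(); // next state -> probability
        MarkovState(String queryName, int invocationCount, Set<Integer> partitions) {
            this.queryName = queryName;
            this.invocationCount = invocationCount;
            this.partitions = partitions;
        }
    }

    class MarkovModel {
        final MarkovState begin;
        MarkovModel(MarkovState begin) { this.begin = begin; }

        // Greedily follow the most probable edge from the begin state and collect
        // the partitions the transaction is expected to touch along the way.
        Set<Integer> estimatePartitions() {
            Set<Integer> needed = new HashSet<>();
            MarkovState current = begin;
            int steps = 0;
            while (current != null && !current.edges.isEmpty() && steps++ < 1000) {
                needed.addAll(current.partitions);
                MarkovState next = null;
                double best = -1.0;
                for (Map.Entry<MarkovState, Double> e : current.edges.entrySet()) {
                    if (e.getValue() > best) { best = e.getValue(); next = e.getKey(); }
                }
                current = next;
            }
            if (current != null) {
                needed.addAll(current.partitions); // include the terminal state's partitions
            }
            return needed;
        }
    }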
Now if we get our predictions wrong, it's okay because at runtime, we are actually going to
follow along with what state transitions that transaction actually does make. And so we're
maintaining internal counters of how many times it goes across some path. So if we start noticing
that our predictions are deviating from reality, or from the actual runtime behavior of transactions,
we can just re-compute these edge weights really quickly online, with a cheap computation,
to get us back in sync with what the application's actually doing. So
how we’re going to generate these models is that again, we are going to have a training set, a
previous executed transactions, yes.
>>: So I can understand why you predict [inaudible] more frequent than aborts. But it seems
to me what you really need to do is be able to predict which partition’s being referenced.
>> Andy Pavlo: Okay, so again, that's what we do. So the state has the name of the query
being executed, how many times we've executed the query in the past, and what partition this
invocation of that query will go to. And because it's a Markov model, we have to encode all the
history at any state. We also have the history of all the partitions we touched in the past.
>>: So your weightings there suggest an overwhelming [inaudible]. But you could easily end
up with a path which there are three possible partitions and they reach around 33 percent.
>> Andy Pavlo: Yes. Give me like two minutes and I'll solve your problem. Okay. Great
[inaudible]. So we have a training set of previously executed transactions, right? So these are
all the queries and input parameters that were invoked in each transaction, and we’re going to
feed that first into a feature clusterer that's going to split them up based on characteristics or
attributes of the transaction’s input parameters that create the most accurate model. So these
can be things like the length of an array parameter, the hash value of another parameter. And
now with these bucketed training sets, we're going to first feed that into our model generator
that will create the Markov models for each bucket, and then we'll have a classifier
create a decision tree that will split them up based on the features that we originally
clustered them on. So now, one of the features could be: what's the hash value of some
parameter? And that will tell us what partition we’re going to execute this query on, what
partition we’ll execute this transaction on, and that will tell us now that it almost becomes like
a linear state machine where we don't have that equal probability of what partition we take
because it is no longer a giant monolithic [inaudible] model; it's more individualized. Yes.
>>: There are many models, many classifiers that you could have used. Why did you choose a
decision tree?
>> Andy Pavlo: We chose a decision tree because we wanted to be able to quickly traverse it
and say at runtime-
>>: Runtime speed?
>> Andy Pavlo: Correct. So this whole top part here is actually what I was going to say next.
This whole top part here we’re doing off-line, so it could take a while, that's fine. But now at
runtime, we can quickly traverse the decision tree and then quickly estimate some path through the
model. So we can do this bottom part here in microseconds per transaction.
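[A sketch of that runtime routing: a small decision tree over features of the input parameters, such as a parameter's hash or an array's length, picks the specialized Markov model to use for a given invocation. It reuses the MarkovModel class from the earlier sketch; everything here is an illustrative assumption about the structure, not the actual Houdini code.]

    import java.util.function.Predicate;

    class DecisionNode {
        final Predicate<Object[]> test;     // e.g., params -> hash of params[0] maps to partition 3
        final DecisionNode onTrue, onFalse; // null at leaves
        final MarkovModel model;            // non-null only at leaves

        DecisionNode(Predicate<Object[]> test, DecisionNode onTrue, DecisionNode onFalse,
                     MarkovModel model) {
            this.test = test;
            this.onTrue = onTrue;
            this.onFalse = onFalse;
            this.model = model;
        }

        // Traversal is just a few predicate checks, so it costs microseconds per request.
        MarkovModel route(Object[] params) {
            if (model != null) {
                return model;                       // leaf: the specialized model for this bucket
            }
            return (test.test(params) ? onTrue : onFalse).route(params);
        }
    }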
>>: So just to make sure I understood. So the parameter value is being [inaudible]?
>> Andy Pavlo: Yes.
>>: From that you can exactly know, it’s not a prediction, you know exactly which partition
that's connected [inaudible] or you don't know?
>> Andy Pavlo: It tells us, it suggests to us what partition to run the stored procedure at. But
now within that stored procedure, it could touch any number of partitions. It just so happens in
the case-
>>: So then how does it help you with the lock? [inaudible] some sort of computed example.
>> Andy Pavlo: Yes.
>>: So let’s say [inaudible]. How do you know which partitions to lock?
>> Andy Pavlo: Yes. So when a request comes in, we grab the right model, do a little hand-waving
magic with the machine learning, so we take the model that is best representative
based on the decision tree, and we estimate what the path is through it. And so when we're
trying to figure out what state transitions we're making, since we
know what the tables are partitioned on, because we have to be told that ahead of time, and we
know what the input parameters are to that transaction, that can tell us what
partition the query is actually going to have to touch. Yes.
>>: There might be some dependencies here that are actually stored in the database.
>> Andy Pavlo: Correct.
>>: So if I look up my main customer and the hash of my main customer now determines the
partition.
>> Andy Pavlo: Yes.
>>: What is your boundary? So in many cases, this should work fine-
>> Andy Pavlo: If you have to read the state of the database and say, based on that, now you
hash whatever that is, where the output of one query is being used as input for another query, these
models don't capture or encapsulate that information.
>>: Right.
>> Andy Pavlo: So we take care of that in other ways. We take care of that, again, for the
partitioning work to be done, the automatic database [inaudible] we've done, we can figure out
hey, we see this pattern happened a lot, it's usually read only, read mostly, so we’ll have a,
create secondary indexes that are replicated in every single node so that we can do that lookup
and then that will direct us to the right locations. So there are other things beyond that.
Right?
>>: But you would know, right? So in these cases, if you have any mode of saying well, for
these transactions we can get a very accurate model, for these cases we just don't know. So
we'll back off from our prediction and do something more conservative because even though
your most likely path may not be very likely.
>> Andy Pavlo: Yes. So your question is, is there a way for us to identify whether different
types of workloads that we see, that have this dependency where you're reading stuff in the
database system, and then maybe not apply these optimizations for those?
>>: Or just more general, do you know when your model works well and when it doesn't?
>> Andy Pavlo: So we have not done anything formal about that, but I can tell you sort of off
the cuff, things like, again, doing that lookup and using the output of one query as input to another
query, that won’t work well with this. But again, we take care of that in other cases. If you
have sort of large range queries that have to touch multiple partitions, and it’s arbitrary what
partitions you have to touch, that won't work well in this, but I'll say for those types of
workloads, for that second type of query, we don't see often in, for the type of applications
we’re focusing on. That's more getting into the real-time analytical stuff, which we haven't
focused on yet. And I'll talk about that in future work. Any other questions? Yes.
>>: So are they able to do optimizations for blocking the partitions as well, based on your
predictions on which partitions the transaction is going to access?
>> Andy Pavlo: So we can do like, this case here, we have our path, right? So we know what
partitions we think it’s going to touch, so now we only lock the ones we need.
>>: And what if the partition model is incorrect, and during the runtime you decide [inaudible]?
>> Andy Pavlo: So we’d say, yeah. [inaudible]? If we predict, if we fail to predict that we need
a partition in the beginning and we don't lock it, when the transaction tries to actually access
that partition, we'll abort it, roll back any changes, and restart it, and acquire the locks.
>>: That’s to prevent any deadlocks?
>> Andy Pavlo: Correct. So then, yeah. You can't touch anything without having a lock
beforehand.
>>: So [inaudible] the partition is just the transaction that works [inaudible]?
>> Andy Pavlo: No. There's other things as well. So like you try to lock something you don't
actually end up needing, and now the thing is just sitting idle and can't do anything. I don't
have a good sense of what’s worse.
>>: Okay.
>> Andy Pavlo: I know that, I guess I'll just jump to it now.
>>: My other question is how big is the difference, so if I remember correctly from our previous
conversation, the difference between a transaction that runs correctly, is predicted correctly,
runs either completely low color or in feature transaction, versus a transaction that needs to
run locking everything. To get between these is huge, right? The idea of running a transaction
and aborting halfway to [inaudible]. The [inaudible] of that abort is actually, if I understand
correctly, is not that relevant. Is that a fair statement?
>> Andy Pavlo: The overhead of aborting the transaction and restarting it-
>>: Yeah. So you're trying to run the fast version of it-
>> Andy Pavlo: Yes.
>>: I fail to predict and I run the slow version-
>> Andy Pavlo: When you say slow version, it's more like you restart it and you lock, you restart
it locking the full cluster. If you have to lock the full cluster, so this is the naive prediction
scheme. This is when you assume that everything is single partition, if you get it wrong you
abort, restart, acquire the locks that you need, right? And the top line is when we actually use
Houdini of when we are accurately predicting what we actually need. So it's about a 2X
difference between the two. Right? [inaudible] I don't know whether it's better to lock
something you don't end up needing or miss a lock. My sense is it's roughly the same, but if
you have to lock the entire cluster every single time, the performance is absolutely terrible. So
this is what we can do if we use our model, sorry, yes.
>>: So this model, how would it compare to, say, doing just a dirty run of the transaction-
>> Andy Pavlo: To like simulate it? Yeah.
>>: Use that as a model.
>> Andy Pavlo: So I mean, there's other techniques that you could use. You could simulate the
transaction, you could use static code analysis, you could use [inaudible] checking. For the
simulating one, I think the overhead of doing it might be, the problem with that is you miss
things like if I read back a value and do an [inaudible] branch.
>>: But the same in here is that the fast version is much faster than the slow version, you're
doing an even faster fast version which didn't acquire locks just to see where you think it might
touch.
>> Andy Pavlo: But you still have to acquire the locks. Are you suggesting you just simulate the
transaction, not in the engine, in a separate thread-
>>: Yeah.
>> Andy Pavlo: See what data it tries to touch-
>>: Right.
>> Andy Pavlo: And that tells us how to schedule.
>>: [inaudible] you almost at the same cost [inaudible] transaction?
>>: Not acquiring the locks is a big delay though. Like you have to-
>> Andy Pavlo: So for your case, you would still have to acquire the locks.
>>: No. I'm saying like when you do the dirty simulation, you acquire no locks. So you use
some IO, but-
>> Andy Pavlo: So you're doing an optimistic concurrency control kind of thing.
>>: [inaudible] transaction? On a separate thread that's actually real data, pretending that
there's no other transaction that exists-
>> Andy Pavlo: Right. And that will tell you what partitions the transaction needs to lock. But
now when you want it for real, you have to do the concurrency control scheme with, acquire
the real locks.
>>: That's right. I'm just wondering like, this dirty simulation is also a model of what could be
locked.
>> Andy Pavlo: Correct. Yes.
>>: So you have sort of like a fancy machine learning static analysis. I'm just wondering what's
the strongman, I guess is what I'm trying to get at. Like how should I think of this model as
being better or worse than other attempts to guess at what to log?
>> Andy Pavlo: Right. I got that. I have not done that simulation example. That is something I
have to do for my dissertation work. There's other things that we can do these models that I'm
not really going to talk about today; so we can do things like we can identify when we’re done
with a partition, so we can go ahead and send like the early two phase commit message. We
can also, I don't want to bring it up because Phil’s here, but you can do things like, if you know
there's never going to be a user abort with absolute certainty, maybe you don't need your undo
logging, and you can get about 10 percent speedup as well. But I don't have a good
answer of how much you get just using simulation. There's other things I'll talk about later on,
how we can leverage these models into [inaudible] execution, other things like that. We can
talk about afterwards whether we can still do simulation for that. But it's more, we’re doing
more than just figuring out where should we send it and what should we lock. And that's my
hunch on why just doing a quick and dirty simulation might be insufficient. Any other
questions?
Okay. So this is what we get, about 2X improvement over that naive prediction scheme; but we
compare how well we did versus what I'll call the optimal case, so this is if you have an oracle
that knew exactly what every single transaction was going to do. This is the best performance
that you can get. So we’re about 98, 99 percent accurate. So there's some cases where we
lock something we didn't end up using, or we don't lock
something that we do need later on. And so we're not that far off from where we want to be in
that case. And actually, if we run this even longer over time, we can learn more, and our models
get improved, and we get closer and closer. Now a 2X improvement is always welcome-
>>: You say optimal, you mean perfect prediction?
>> Andy Pavlo: Yes. So I hardcoded something in the system that said, here's a request, what is
it going to do?
>>: [inaudible] how these models can be so accurate? [inaudible]. Let's go back to the
example I had before. Let's say that our transactions [inaudible].
>> Andy Pavlo: Yes.
>>: So you have a history of that parameter value showing up; a specific value, right?
>> Andy Pavlo: So it's more like-
>>: But imagine per second.
>> Andy Pavlo: Yes.
>>: When you see a parameter value you’ve never seen before.
>> Andy Pavlo: Yes.
>>: How [inaudible]?
>> Andy Pavlo: Again, we are doing-
>>: [inaudible] I would imagine there would be a very long tail of parameter you’ve never seen
before, even though the stored procedure’s exactly the same.
>> Andy Pavlo: Yes. Correct. So we are, this is like, let me hold your question, and then I have
a slide later on that I'll show you. Because again, we are doing hash partitioning. We only have
to encode, here's Andy, here's Vivek, individual records of how they map to partitions.
>>: [inaudible] learning based on the result of the hash function?
>> Andy Pavlo: Correct. Yes. But I'll show how we can be more deterministic in our selection later
on. So 2X improvement is always pretty good. It's always a welcome improvement when
you’re doing database research. But the problem is if we change the graph to be reflective
where we’re going to be in terms of linear scalability, we are still not in the right direction. So
the absolute numbers have improved, but the trend is still not where we want to be. So this
doesn't help us. So again, we are adding more machines and we are not getting the better
performance that we want. So the question is: what's going on? What’s causing us not to be
able to scale up? So it has to do with the inherent nature of our concurrency control model.
And that is, because we have these locks at the partition level, when a distributed
transaction is running and it holds the lock at remote nodes, these
remote nodes, the engines for them, are idle doing nothing because they have to wait
until the distributed transaction sends a message over the network to tell them to
execute a query and send back the result or start that two-phase commit process to finish the
transaction. So they're essentially doing nothing.
So now, when the stored procedure does send a request to these remote nodes, they have
something to do. They can process the query. But now we go back, we sort of flip, and now
the guy at the bottom, he’s idle because he has to wait for the results to come back before he
can make forward progress. And once it does come back, again, the remote guys go back to
being idle. Right? And this is because we are optimizing our system for these single partition
transactions, which are the majority of the workloads that we're looking at for these types of
applications. But because we have such coarse-grained locks and we have these long wait times, this
is why our system is just being completely slowed down. So once again, one last time, the
answer to how to solve this problem can be found back at the dog track.
So I first met these guys, they were ex-taxi drivers from Argentina, and they were kind of like a
seedy group. They didn't really talk to anybody else, they were always talking amongst themselves. And
I first noticed them because they were always running around looking very, very busy. So most
people go to the dog track to relax, at least that's what I do because there’s actually not a lot of
things to do while you're at the track because there's only about 14 to 15 races per night, and
each race is about 50 seconds. It’s over pretty quickly. So there's a lot of time you’re just sort
of sitting, eating food, waiting for the next thing to happen. But these guys weren't there to
relax. They were at the track to make money. And so what they would do is they would go
down to the payphones by the bathrooms and whenever we were in between a race at the
track we were at, they would go call their bookies at tracks in the next county over so they can
make more bets and so they can make more money.
So again, they’re something useful when everyone else is sitting around idle. So again, this is
the same thing we want to do in our database system. We want to be able to do some kind of
useful work whenever we know that the engine is blocked waiting on the network. So to do
this, we developed a new protocol that allows us to [inaudible] execute single partition
transactions at an execution engine for a partition that is blocked because of a distributed
transaction. And the key thing about this is that we have to make sure that we maintain
the serializability of the database system. So my apologies, Phil, that you have to sit through
this.
So for a serializable database, what we want is to have the end state of
the data be the same as if we executed the transactions sequentially one after the other. And this
is essentially what we're doing now. So we have a distributed transaction and two single
partition transactions, and each of those single partition transactions does not start executing
until it knows that the previous one has committed successfully. What we see from the case of
the distributed transaction, we have this huge block of time here, where we’re idle, where
we’re waiting for somebody to come over the network to tell us to do work, or waiting for the
result to come back before we can continue to forward in our transaction. So what we're going
to try to do here, we're going to try to find a schedule where we can interleave these single
partition transactions during this time when we are blocked, and then we're going to hold on to
the results until the end, and we're going to do a verification process to check that nobody
created any conflicts, and that the end database state is still the same as it would have been if we executed
them sequentially, and so then, you know, we have a consistent view at all times.
So this sort of looks like optimistic concurrency control, right? But in that original paper from
Professor Kung in 1981, they assume that conflicts are very rare. So the number of transactions
you have to abort in the verification step because there was a conflict is small. But in
a lot of the applications that we've looked at, there usually is a lot of skew, either temporal
skew or popularity skew. So conflicts are not rare, and you end up having to abort a lot of
things over and over. So what we're going to do instead is we’re going to use precomputed
rules to allow us to identify whether two transactions conflict, and if they don't, we can, we
know that it’s safe to interleave them. And now at the verification step, because we scheduled
them based on the predictions that we generate from our Markov models, since we
selected them so that they wouldn't conflict, as long as they did what we thought they were going
to do, we know there aren't any conflicts and we can commit everybody all at the end safely.
So now let's look at an example here. So let's say we have a distributed transaction that needs
to touch data at these two partitions, so this stored procedure is going to run on the top, and it
acquires the lock for the partition at the bottom, so when it starts running, at some point it’s
going to issue a query request to this partition at the bottom, and then that means that stored
procedure will get blocked and is idle because it has to wait for the result to come back. So
now when this occurs, HERMES will kick in and is going to look at this engine’s transaction
queue to try to find single partition transactions that it can interleave. So we have two
important requirements how we’re going to do this scheduling. First is that we have to make
sure that single partition transactions will finish in time before the distributed transaction
needs to resume. So we've extended our models that we used in Houdini to now include the
estimated runtime in between state transitions.
>>: Why does that matter?
>> Andy Pavlo: Because cascading [inaudible]. Because you’re holding these locks, you want to
finish up and have everything commit as soon as possible. So let's say we have a distributed
transaction, we have the prediction that we generated from the model in the beginning, and so
it just executed this query here, and we anticipate that it's going to execute this next query
here. So now our models include the estimated elapsed time in between these transitions at
this partition. So when we go look in our queue to try to figure out what guy we want to
execute, we want to make sure that they'll finish within this time. But then the second requirement is that we need to make sure we don't have any read-write or write-write conflicts. So let's say, for this speculation candidate, if we scheduled it, we would have this problem where the distributed transaction just read the value of this record X from the database at this partition. Then it gets blocked, because it has to wait to execute some query at a remote node, but if we execute this candidate here, it's going to write some value that changes the value of that record X, the same one that the distributed transaction read. Normally this would be okay, except that when the distributed transaction resumes, it's going to try to read that value back again, and that's going to be a non-repeatable read; it's going to be an inconsistent result. So this is a conflict that we can't allow to occur, so we're not going to choose to speculatively execute this guy; we're going to skip it and go on to the next one. This next guy will finish in time, and it doesn't have any conflicts, so we know it's safe to execute. So we're going to pull it out of the queue and execute it directly on top of the distributed transaction at that partition, as if it had gone through the normal locking process.
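Here is a rough sketch, in Python, of the two scheduling checks just described: does the candidate fit within the estimated stall window, and is it free of read-write and write-write conflicts with the stalled distributed transaction. The field names and numbers are invented; the real HERMES scheduler drives these checks from the Markov-model predictions.

```python
# Hypothetical speculative-scheduling check. All values are illustrative.
from dataclasses import dataclass, field

@dataclass
class Txn:
    name: str
    est_runtime_ms: float          # predicted execution time at this partition
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)

def conflicts_with(stalled: Txn, candidate: Txn) -> bool:
    """Read-write or write-write conflict against the stalled distributed txn."""
    return bool(candidate.write_set & (stalled.read_set | stalled.write_set)) or \
           bool(candidate.read_set & stalled.write_set)

def pick_speculative(stalled: Txn, queue: list, stall_window_ms: float):
    """Return the first queued single-partition txn that fits the stall window
    and does not conflict with the stalled distributed transaction."""
    for cand in queue:
        if cand.est_runtime_ms > stall_window_ms:
            continue                       # would not finish before the resume
        if conflicts_with(stalled, cand):
            continue                       # e.g. it would overwrite record X
        return cand
    return None

if __name__ == "__main__":
    dist = Txn("DistributedTxn", 0.0, read_set={"X"})
    queue = [
        Txn("Candidate1", 2.0, write_set={"X"}),   # conflict: writes what dist read
        Txn("Candidate2", 2.0, read_set={"Y"}),    # safe: fits and no conflict
    ]
    chosen = pick_speculative(dist, queue, stall_window_ms=5.0)
    print(chosen.name)   # -> Candidate2
```

A chosen candidate's results would then be buffered rather than released immediately, as comes up in the discussion below, since the distributed transaction may still abort.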
>>: Can you prove that it will never have any conflict, or are you saying it's very unlikely and it will abort if [inaudible]?
>> Andy Pavlo: I'm not claiming that we can do this for all stored procedures. For some things it's too complex to figure out. But for basic things like, this guy reads table foo and that guy reads and writes table bar, there's no conflict, so it's okay to do that.
>>: Okay.
>> Andy Pavlo: It's rules like that. Heuristics.
>>: But my question is, so imagine there are two versions of these, you can imagine.
>> Andy Pavlo: Yes.
>>: So one is like, it's absolutely guaranteed by the construction of these two transactions that they will not interfere.
>> Andy Pavlo: Yes.
>>: So in that case, I potentially cannot even check afterwards what happened because-
>> Andy Pavlo: We still can; we'll check at the end at the logical level. If there is a conflict that occurs, we can abort and roll back. So we're not going to be unrecoverable.
>>: And in the other case, I might have a conflict, but there's a 0.001 percent probability of it.
>> Andy Pavlo: Yes.
>>: You can go ahead anyway and abort it?
>> Andy Pavlo: Yes. And this is actually sort of related to his question: when it finishes, we won't commit it right away; we'll put its results in a side buffer because we need to wait to learn whether the distributed transaction actually finishes. Yes.
>>: So I think I'm missing something. But if you know there’s no conflicts, then you can commit
the single partition ones right away, couldn’t you?
>> Andy Pavlo: The single partition transaction could have read something written by the distributed transaction, and therefore you have to hold it because you can't, you know, release them by-
>>: [inaudible]?
>> Andy Pavlo: It's not a conflict in the sense of, I read something, I try to read it twice, and I [inaudible] result. It's more that you don't want to release any changes from the distributed transaction until you know that the distributed transaction has committed successfully. For some cases, yes: if you read something that the distributed transaction hasn't touched, then it's okay to send back the result. Yeah.
>>: So the problem is that if I read something the distributed transaction-
>>: It depends on how finely you look-
>> Andy Pavlo: Correct. Yes. It depends on whether the distributed transaction has written anything before it got stalled. If it just read something, then it's okay. I'm just giving you an example of when you would have a read-write or write-write conflict.
So normally, and this is sort of related to Vivek's question in the beginning, this would be an unsafe thing to do, because if you're using these probabilistic models there's a bit of randomness involved in picking what path you're going to take and in trying to identify what data the transaction is going to end up reading and writing. But we can actually exploit a property of our stored procedures that makes these predictions more deterministic. So let's say a new transaction comes into the system and starts off at the beginning state, and again, we want to figure out what path we're going to take through this model, because that's what tells us what data we're going to read and write.
So normally you would say, all right, here are the two transitions I can make from where I'm at now, so roll the dice, and based on the distribution of the edge weights you'll go down one path or another. But in this case we see that there's only one query that the transaction could ever execute, this GetWarehouse query, and in the GetWarehouse query we see that there's an input parameter being passed in from the program logic of the stored procedure that's used for the Warehouse ID, which is the primary key for this table. And since we partitioned this table on the Warehouse ID, knowing the value of this input parameter tells us exactly what record we're going to access. So we can generate a statistical mapping from the input parameters that are passed in from the application into the transaction to the input parameters of the queries. That means that since we know the value of the transaction's input parameters, we know the value of this input parameter for the query. So that tells us exactly what path we're going to take, what transition we're going to take in this case. And we can continue to do this all the way down the line. So again, I'm not claiming that we can do this for all stored procedures, but for a lot of them in the applications we look at, we see this pattern and we can exploit it.
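A toy illustration of that parameter mapping, in Python. The procedure name, the mapping table, and the hash partitioning function are all assumptions made up for this sketch, not H-Store's actual structures; they just show how a query's target partition can be resolved from the stored procedure's input arguments without any probabilistic choice.

```python
# Hypothetical parameter-mapping lookup; names and values are illustrative.
NUM_PARTITIONS = 4

# Learned offline: (procedure, query, query-param index) -> which procedure
# argument the query parameter always copies.
PARAM_MAPPINGS = {
    ("neworder", "GetWarehouse", 0): ("proc_arg", 0),
}

def partition_for(warehouse_id: int) -> int:
    """Toy partitioning: the WAREHOUSE table is hash-partitioned on its key."""
    return warehouse_id % NUM_PARTITIONS

def predict_partition(proc_name, query_name, query_param_idx, proc_args):
    """If the query parameter is a direct copy of a procedure argument, the
    partition it will touch can be resolved before the query ever runs."""
    mapping = PARAM_MAPPINGS.get((proc_name, query_name, query_param_idx))
    if mapping is None:
        return None        # no deterministic mapping; fall back to edge weights
    _, arg_idx = mapping
    return partition_for(proc_args[arg_idx])

if __name__ == "__main__":
    # A neworder invocation with w_id=7: GetWarehouse must hit partition 3,
    # so there is no dice roll over the Markov model's outgoing edges.
    print(predict_partition("neworder", "GetWarehouse", 0, proc_args=(7, 42)))
```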
>>: Is this analysis done manually? Or do you have an automated-
>> Andy Pavlo: It’s automatic. Yeah.
>>: What is the [inaudible] for this? Given [inaudible], how do you go about it? So now you have the semantic model, the Warehouse ID is-
>> Andy Pavlo: Yes.
>>: You have semantics associated with the [inaudible] application, how do you-
>> Andy Pavlo: So we have these stored procedures and we have these workload traces. There are different ways you could do this, and you would end up with the same result: static code analysis could tell you this, taint checking is another way to do it, but we take a dynamic approach. We take these previously executed workload traces and we ask, for this input parameter, when we see this query executed within this transaction, how often does its value correspond to the input parameters of the stored procedure? If we see that there's a direct mapping, then we know that we can identify those. Okay?
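A small sketch of what that dynamic, trace-based inference could look like in Python. The trace format is invented; the idea is simply to count how often a query parameter's value matches one of the stored procedure's input arguments across previously executed invocations.

```python
# Hypothetical trace-based inference of the parameter mapping.
from collections import Counter

def infer_mapping(traces, query_name, query_param_idx, threshold=1.0):
    """traces: list of (proc_args, {query_name: [query_param_values]}) tuples.
    Returns the procedure-argument index the query parameter always copies,
    or None if no argument matches often enough."""
    matches = Counter()
    total = 0
    for proc_args, queries in traces:
        if query_name not in queries:
            continue
        total += 1
        qval = queries[query_name][query_param_idx]
        for arg_idx, arg_val in enumerate(proc_args):
            if arg_val == qval:
                matches[arg_idx] += 1
    for arg_idx, count in matches.items():
        if total and count / total >= threshold:
            return arg_idx
    return None

if __name__ == "__main__":
    traces = [
        ((3, 99), {"GetWarehouse": [3]}),
        ((7, 12), {"GetWarehouse": [7]}),
        ((5, 44), {"GetWarehouse": [5]}),
    ]
    # Query parameter 0 of GetWarehouse always equals procedure argument 0.
    print(infer_mapping(traces, "GetWarehouse", 0))   # -> 0
```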
So again, this makes our predictions more accurate, more deterministic. Now at runtime, when we want to commit a bunch of speculative transactions, instead of checking the read/write sets and the dependency graphs between them, we just need to verify that all our transactions, at the logical level, made the path transitions that we predicted they would. And because our precomputed rules scheduled them so they wouldn't conflict, and those rules are based on our initial predictions, as long as the predictions are accurate we know that there are no conflicts, we can commit everybody all at once, and we're happy.
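A minimal sketch of that commit-time verification, in Python. The transaction IDs and path representation are made up, and a real system may need to abort more than just the transactions that deviated; the point is that the check compares predicted and actual execution paths rather than intersecting read/write sets.

```python
# Hypothetical logical-level verification: commit only if every speculatively
# executed transaction followed its predicted Markov-model path.
def verify_and_commit(predicted_paths, actual_paths, commit, abort):
    """predicted_paths / actual_paths: dict of txn_id -> list of executed
    query names (the model transitions)."""
    deviated = [txn for txn, path in actual_paths.items()
                if predicted_paths.get(txn) != path]
    if not deviated:
        for txn in actual_paths:
            commit(txn)        # release the buffered results all at once
        return True
    for txn in deviated:
        abort(txn)             # roll back the ones that strayed (a real system
                               # might have to abort dependents as well)
    return False

if __name__ == "__main__":
    predicted = {"t1": ["GetWarehouse", "UpdateStock"], "t2": ["GetCustomer"]}
    actual    = {"t1": ["GetWarehouse", "UpdateStock"], "t2": ["GetCustomer"]}
    ok = verify_and_commit(predicted, actual,
                           commit=lambda t: print("commit", t),
                           abort=lambda t: print("abort", t))
    print("serializable outcome:", ok)
```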
>>: Can you say what’s the size of this set of rules? I mean, for every transaction type you see
how many different rules can come out and, you know. Like how big is this space of things
you’re pre-computing?
>> Andy Pavlo: Yeah. I don't have a good sense, I don't have a number.
>>: But in the tens, in the thousands, in the millions?
>> Andy Pavlo: In the hundreds. In the case of TPC-C, the low hundreds. Yeah.
>>: The other question that I have is about the case of TPC-C as a benchmark. There are some assumptions in this work that may make sense for TPC-C, and I just don't know if they make sense for the types of applications that require this sort of scaled-out lightweight transaction.
>> Andy Pavlo: I would say that TPC-C is actually very representative.
>>: Is it?
>> Andy Pavlo: Yes.
>>: Do they typically have those kinds of conflicts and things like that?
>> Andy Pavlo: In TPC-C, the workload, in terms of both the complexity and what the transactions actually do, is actually very representative of what I've seen in industry.
>>: I have to confess I'm a little bit surprised, because if I'm thinking about Amazon shopping baskets or something like that, it doesn't seem like there's likely to be a lot of conflict, because you'd have to have conflicts within the same shopping basket, and that doesn't seem likely to me. There may be shopping baskets that are hot, but not necessarily transactions that are going to conflict with each other over those shopping baskets.
>> Andy Pavlo: Shopping basket is a bad example because they don't actually use transactions
for that. I can't give you a number to say like 80 percent of the workloads we see have these
properties, but I would say in the things that I've looked at, it's fairly common. But I don't have
a way to quantify that just yet.
>>: [inaudible] selections [inaudible] workforce [inaudible] really work well in the system.
>> Andy Pavlo: Right. So again, I'm not trying to build a general purpose system; I'm very careful about saying that. Right? So, yes, there could be some things that you clearly would not want to use H-store for, because they don't have these properties. But I would say that there are a significant number that do.
>>: With the direction we want to go, do you get the impression that this research will apply to their customers?
>> Andy Pavlo: Yes. We'll get to that. Okay. So let's go back to that TPC-C benchmark a third time. Again, we're going to have the same number of distributed transactions as before, and we're going to scale up the number of machines, and this time, when we use H-store with HERMES, you can see that we get to about 60,000 transactions again across four machines with eight cores apiece, right? So now we see that, in terms of linear scalability, we're heading in the right direction. We're never going to be perfectly linearly scalable, because we're a bit conservative in our estimates: there could have been cases where we could have speculatively executed a transaction, but we didn't because we didn't want to cause any stalls.
Yes.
>>: Before, on the previous slide, you showed us TPC-C and you said something like 10 percent
distributed transactions. So what's the fraction [inaudible] versus a single partition here, and if
I change that, what happens to this [inaudible]?
>> Andy Pavlo: You're absolutely right. They're all executing the same absolute number of distributed transactions; the percentage varies because what we're doing is executing more single partition transactions. That's an astute observation. We're not actually doing anything to make each individual distributed transaction run faster; we're just able to do more work when we otherwise would be idle. It's future work to speed up each individual distributed transaction and reduce its latency.
>>: Right. So specifically, you're able to speculatively execute the single partition transactions.
>> Andy Pavlo: Yes.
>>: So what fraction of the transactions here are single partition? I guess I'm just wondering, because it seems like you need to have basically a nice queue of single partition work that's ready to [inaudible] to exploit-
>> Andy Pavlo: Right. Yes.
>>: So is that typical and is that a dial you can turn here to show me what happens to the
slope?
>> Andy Pavlo: Yes. So again, if you add more distributed transactions, yes, you're going to be down here again. Yeah.
>>: Okay.
>> Andy Pavlo: I'm not claiming, you're absolutely right. It's future work for us to figure out how to keep the slope going up with more distributed transactions.
>>: Well, specific on TPC-C, [inaudible] specific transaction makes [inaudible] right?
>> Andy Pavlo: It’s 10 percent, yes.
>>: Are you maintaining that-
>> Andy Pavlo: No. Again, it's the same number of distributed transactions for all three points within a single column.
>>: And is that, the fixed 10 percent or whatever, TPC-C [inaudible]?
>> Andy Pavlo: What's that?
>>: The point of 60,000 for thirty cores, is it running TPC-C with the fixed 10 percent [inaudible] transactions? Or by running-
>> Andy Pavlo: It's the absolute number that's fixed. So 10 percent of whatever this is, say this is 1,500; it's 1,500 or whatever it is for all of them, because we're executing more single partition transactions.
>>: So you're not running 10 percent distributed transactions anymore. You're now down to, say-
>> Andy Pavlo: But the absolute number is the same.
>>: I see what you're saying.
>>: 10,000; 10 percent of that is 1,000. You're up to 60,000, so [inaudible] 60.
>>: Can you always tell whether you have a distributed transaction versus a single partition
transaction?
>> Andy Pavlo: In the case of TPC-C, in these workloads, I think we're almost always accurate, like 98, 99 percent.
>>: TPC-C is very [inaudible].
>> Andy Pavlo: Well, the other workloads that we look at do the same thing as well. TM1 is a little bit different because you do this-
>>: You can't always tell. So when you can't tell, you have to run [inaudible].
>> Andy Pavlo: When we can't tell, it's the same thing as getting a prediction wrong. We assume it's single partition and we execute it as single partition, and then when it tries to touch something that we didn't know about, we just have to abort and restart it. So we actually compare what we're doing here against what I'll call the optimal case, or the unsafe case: this is where you assume that there are no conflicts and you blindly pull out whatever is first in the transaction queue when you want to speculatively execute something, and you don't care about serializability. That's the best you can do. So we're not that far off from the optimal case, and we are maintaining serializability. So that's good. Okay. Yes.
>>: So when you say [inaudible] distributed transactions. So do you have an idea whether any
of the [inaudible] distributed transactions are the same?
>> Andy Pavlo: Yes.
>>: But the transactions could start touching more partitions. I think in the case of TPC-C it's typically two partitions, right? [inaudible]-
>> Andy Pavlo: Yes.
>>: Distributed transactions. So say, for example, they start to use four-partition transactions or eight-partition transactions; do you expect [inaudible] or do you expect it to flatten out?
>> Andy Pavlo: I suspect the performance would get worse.
>>: Would get worse. Okay.
>> Andy Pavlo: Yeah.
>>: It's not only the [inaudible] distributed transactions, but how much-
>>: Is it? If you allow a single, the speculative one to get in, I mean it depends on how much you're touching, right?
>> Andy Pavlo: Yes. If you touch more partitions, for the HERMES case I think it would be the same trend, because we're just executing more single partition transactions. So it's okay.
>>: So the hidden assumption is that in the workload there are different guys, some of which are trying to go as fast as they can on single partition transactions and some others trying to go as fast as they can on the distributed ones.
>> Andy Pavlo: Yes.
>>: That is the case of, this holds. Otherwise, [inaudible] the ratio.
>> Andy Pavlo: Correct. Yes.
>>: So in this particular mix, you had 10 percent cross partitions and 90 percent within a
partition, right?
>> Andy Pavlo: Yes.
>>: So that means that if you just sort of did a very simple optimistic concurrency control thing
here, you would commit 90 percent of your transactions since you have serialized access for
those and they remain on the partition. Now, of course, you potentially could have a higher
percentage of aborts for the things that span multiple partitions. Do you have a sense of how this compares to that solution?
>> Andy Pavlo: So your question is-
>>: [inaudible] baseline support?
>> Andy Pavlo: So I think you'd just execute a single partition guy without any rewrites and just let it fly, right? And then if you have a distributed transaction and it tries to touch something it doesn't have a lock on, you have to abort and restart it, or move it to another node, because you don't know when it comes in which node you should actually run it on.
>>: And the next time around you're taking locks, right?
>> Andy Pavlo: Yes.
>>: Right. So I'm saying, suppose you just never do that. You just keep trying to rerun it, and maybe they even starve, right? You haven't said anything about an SLA on being able to actually commit a transaction that you try to do, right?
>> Andy Pavlo: Correct.
>>: So it seems to me that for those 10 percent of the transactions, if you run optimistically, you could rerun very fast, but you'll always be-
>>: The assumption-
>> Andy Pavlo: Let me-
>>: Let's [inaudible] check. How many minutes do you have left?
>> Andy Pavlo: I’ve got 10 minutes.
>>: You have 10 minutes?
>> Andy Pavlo: We can talk about this afterwards.
>>: Maybe let's hold the questions till the end.
>> Andy Pavlo: [inaudible] H-store has actually been commercialized as VoltDB; it's an open source project that's based on our code, and it's actually used in a couple hundred installations throughout the world today. So the Japanese American Idol example I talked about in the beginning, that's actually a real-world deployment of this system. And I actually found out a few weeks ago that we apparently power the Canadian version of American Idol, which I think is just called Canadian Idol. There are a lot of people using this system to do network traffic monitoring, in particular the government. The CIA and the NSA apparently love this system. They don't tell us what for, obviously, other than to say it's installed in every single telecommunication hub throughout the country. So that can make you feel warm and fuzzy at night.
>>: That’s the only thing that scared me the most in the last 10 years.
>> Andy Pavlo: Is what, those guys?
>>: You have your hands on-
>> Andy Pavlo: Apparently they run the same amount of data through VoltDB as they run through Accumulo, which is their version of BigTable with security stuff. Again, we don't know what they're actually doing with it. A lot of the high-frequency trading guys and the high finance guys are using this system because they like the low latency properties and the high throughput. The Malaysian derivatives market runs entirely off of the system. And then a lot of online companies, like AOL's games.com, use this for the [inaudible] runtime state of players. And my personal favorite is that there is a shady offshore gambling website that I can't mention that uses the system for all their bets and wagers. So that's nice.
So what did I talk about today? I initially started off by describing a system that's really optimized from the ground up to support these types of high throughput transactional workloads that we see a lot of. But then I showed how, if just a small fraction of the workload has to touch data in multiple partitions, the performance really breaks down. So then I showed you how to use probabilistic models to predict the behavior of transactions and do the correct optimizations to lock the minimum resources you need at the beginning. And then I showed how to extend these models further to allow you to safely interleave single partition transactions whenever an engine is blocked because of a distributed transaction. So it's no longer a question of whether you can scale up and get better performance in the system while maintaining transactions; what I showed today is exactly how to do that.
So where do we go from here? For future work, I can [inaudible] say that there are probably five or six new projects that I'll be working on over the next couple of years to extend the performance of H-store for a bunch of different things. I'll loosely categorize these into two groups. First, I want to look for ways to improve the scalability of the system. So I'm working with Aaron Elmore, who I think is coming here to do an internship with you guys this summer, a colleague at UCSB, and we're looking at ways to have elastic deployments of the system. Right now, if you have a partition that is a hotspot and all the transactions are getting queued up there, there's nothing we can do at this point to mitigate that. So we want to look at ways to identify that we're overloaded at a partition and maybe split it up into multiple pieces and migrate the data, or maybe coalesce partitions that aren't being actively used. We are also looking at ways to exploit new hardware that's coming out. And this is part of my involvement in the Big Data ISTC that's headquartered at MIT and sponsored by Intel. In particular, we are very interested in looking at what kind of database system we would want to build if you have some of the new non-volatile memory devices that are coming out: whether this is something like H-store or whether you want to try different concurrency control models. We want to explore that to figure out what's the best kind of system to have when you have a storage device that has the read and write speed of DRAM but the persistency of an SSD. And then we are also looking at how to exploit the new many-core architectures that are coming along.
And then the thing that I'm probably the most excited about that's coming out in the next year, which is sort of similar to your glacier project here on the Hekaton system, is a new system model that we've been working on called anti-caching. So with anti-caching, although I've just spent the last hour talking about a main memory database system, we're going to go ahead and add the disk back in. And we're going to use the anti-cache as a place to store cold data, right? Data we're not going to need anymore. So we're going to monitor how much memory is being used at a partition, and when we go above a certain threshold, we're going to evict data that has not been used recently or is unlikely to be used in the future, and we're going to store that in a block-storage hash table out on disk. And we still maintain index information and catalog information in main memory for all the data that we've evicted. So the main difference between a traditional database system and the anti-caching model is that in a traditional system a single record can exist both on disk and in memory at the same time, whereas in our model it's either one or the other.
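A toy sketch of the eviction side of anti-caching, in Python. The LRU bookkeeping, block format, sizes, and threshold are invented for illustration; H-Store's real anti-cache has its own storage manager, but the shape of the policy is the same: once memory use crosses a threshold, cold tuples move out to disk blocks while only the key-to-block index stays in memory.

```python
# Hypothetical anti-caching eviction policy; all sizes and formats are toy values.
from collections import OrderedDict

class Partition:
    def __init__(self, memory_limit_bytes: int):
        self.memory_limit = memory_limit_bytes
        self.memory_used = 0
        self.hot = OrderedDict()     # key -> (tuple, size), ordered by recency
        self.evicted_index = {}      # key -> disk block id (stays in memory)
        self.disk_blocks = {}        # stand-in for the on-disk block hash table
        self._next_block = 0

    def access(self, key):
        """Read a hot tuple and refresh its LRU position."""
        self.hot.move_to_end(key)
        return self.hot[key][0]

    def insert(self, key, value, size: int):
        self.hot[key] = (value, size)
        self.memory_used += size
        self._maybe_evict()

    def _maybe_evict(self):
        """Above the threshold, move the least recently used tuples to disk,
        keeping only the key -> block-id mapping in memory."""
        while self.memory_used > self.memory_limit and self.hot:
            key, (value, size) = self.hot.popitem(last=False)   # LRU end
            block_id = self._next_block
            self._next_block += 1
            self.disk_blocks[block_id] = {key: value}
            self.evicted_index[key] = block_id
            self.memory_used -= size

if __name__ == "__main__":
    p = Partition(memory_limit_bytes=200)
    for i in range(5):
        p.insert(f"cust-{i}", {"balance": i * 10}, size=64)
    print(sorted(p.hot))             # recently used tuples still in memory
    print(p.evicted_index)           # cold tuples tracked, data now on "disk"
```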
So now when a transaction request comes along that only needs to touch data that's entirely in main memory, we execute it just like before without any problems. But as soon as it tries to touch data that's been evicted, we're going to put it into a special monitoring mode where we keep track of all the records it tries to touch that are out in the anti-cache, and when it tries to actually do something with that data, either modify it or send it back to the application, we're going to abort it and roll back any changes that it had. Then, in a separate thread, we're going to asynchronously fetch the blocks that it needs to get the data from and merge that into main memory, and all the while we're still going to execute other transactions, not block them. And once we know that our data has been merged in, we update our indexes, and we can restart this transaction and it runs just like before.
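And a companion sketch of that abort-and-restart flow for a transaction that touches evicted data, again in Python with invented record names. For brevity this version fetches and merges the block synchronously, whereas the system described above does it asynchronously on a separate thread while other transactions keep running.

```python
# Hypothetical abort-and-restart flow for evicted data; layout is illustrative.
class EvictedAccess(Exception):
    def __init__(self, keys):
        self.keys = keys

class Store:
    def __init__(self):
        self.hot = {"acct-1": 100}          # tuples resident in main memory
        self.anti_cache = {"acct-2": 250}   # tuples evicted out to disk blocks

    def read(self, key):
        if key in self.hot:
            return self.hot[key]
        if key in self.anti_cache:
            # Monitoring mode: note what the transaction needed, then abort it.
            raise EvictedAccess([key])
        raise KeyError(key)

    def unevict(self, keys):
        """Fetch the needed blocks and merge the tuples back into memory."""
        for key in keys:
            self.hot[key] = self.anti_cache.pop(key)

def run_with_restart(store, txn_logic):
    """Abort on evicted access, pull the data in, then restart from the top."""
    while True:
        try:
            return txn_logic(store)
        except EvictedAccess as e:
            store.unevict(e.keys)

if __name__ == "__main__":
    def transfer(store):
        # Touches one hot tuple and one evicted tuple.
        return store.read("acct-1") + store.read("acct-2")

    s = Store()
    print(run_with_restart(s, transfer))   # -> 350, after one abort-and-restart
```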
So again, the key thing about this is that now we are able to support databases that are larger than the amount of memory on a single machine. We've done some initial experiments where we've taken TPC-C, this time with 100 percent single partition transactions on a 10 gigabyte database, and we vary how much memory we're going to give the system. We see that when we use MySQL, the performance isn't so great. Using MySQL with Memcached for certain read operations, the performance isn't that much better, because TPC-C really can't take advantage of the caching properties, the read/write properties, you get from Memcached. But when we use H-store, you see that performance does not degrade that much even for databases that are eight times the amount of memory that's available to the system at a single node. So this is pretty exciting work here.
The other thing we are looking to do is expand the types of workloads we can run on the H-store system. With a colleague at Brown, we are looking at how to integrate stream processing primitives, or continuous queries, as a first-class entity directly in the system, so that we have a mix of streaming data and transactional data and can intertwine them together. And then, related to what all the discussion was about for distributed transactions, we want to look at how we can improve the performance of the system for workloads that are not easily partitionable. As we add more distributed transactions, how can we make the system perform better? By doing things like batching and other techniques to amortize the cost of acquiring these locks at all these different partitions.
I'm also looking at getting back involved in working with scientists to apply database techniques to scientific workloads. This is possibly building a whole separate system that's really optimized for their types of workloads, because a lot of them are just running Fortran scripts or C scripts, or running MPI programs, that don't take advantage of the kinds of things that we know about in the database world.
And then lastly, the next major trend I see, which I'm sure you guys are aware of here, is adding support in these front end main memory systems for real-time analytical operations. We want to be able to do the longer running queries that have to touch more data without having to run them as distributed transactions that slow down the entire system. A lot of the game companies care about this kind of thing: the front end system is maintaining all the runtime data for players, and they want to be able to say, here are the top 10 players for a bunch of different metrics. So instead of shoving that off into your backend system or Hadoop or whatever, running the analytical query there, and then shoving it forward, we want to be able to do that directly in the front end. And to do this, we want to apply techniques like hybrid storage models, hybrid nodes, and relaxed consistency guarantees for these analytical queries. These aren't new techniques, but we want to come up with clever ways to combine them together and, again, leverage machine learning techniques to make the system do this automatically. That's really the main takeaway for what my research is all about: having the database system know as much as possible about what the application and the workload are trying to do, and having the system go faster and get better performance because of it. So I want to thank everyone for coming today; I'll be happy to answer any more questions, and there's also a website set up for the H-store system with more information and documentation about all the things that I talked about today. So thank you.
>>: Any questions?
>>: Have you measured TPC-E?
>> Andy Pavlo: So, yes. TPC-E won't work well on this. You're absolutely right. I will say, though, from the VoltDB guys, what they tell us is there are very, very few workloads that they see that are as complex as TPC-E. TPC-C is actually very representative of what's out there in terms of size and complexity. But you're right, TPC-E won't work well on this system. And that's one of the things we want to look at in the future. Yes.
>>: So typically, say, it was designed as a benchmark and it's got uniformly distributed data-
>> Andy Pavlo: Yes.
>>: Carefully designed, which is not true in most cases-
>> Andy Pavlo: Correct. Yes.
>>: And so partitioning will likely result in hot spots [inaudible].
>> Andy Pavlo: Right. Yes.
>>: And the other thing is that it was designed with Oracle in mind; it's also designed [inaudible] using snapshot isolation. So TPC-C is an interesting benchmark, but it certainly isn't conclusive in terms of anything.
>> Andy Pavlo: So we've done, and I didn't talk about this, other experiments. With Carlo, we have a whole benchmark suite with a bunch of different OLTP workloads, and we've tried them out. And again, TPC-E is an excellent example of something that won't work well on this. Anything that looks like a social graph, Twitter, Facebook, that stuff, won't work well on this. But again, I would say from what I've seen in industry, working with the VoltDB guys, TPC-C is actually very representative of a large number of applications that simply aren't getting the performance they want using TimesTen or other things, and they're switching to something like VoltDB. All right. Thanks, guys.