>> Vivek Narasayya: So it's my pleasure to introduce Andy Pavlo, who is a PhD student about to graduate from Brown University. And so today he's going to talk about his adventures at the dog track. >> Andy Pavlo: All right. Thanks, everybody, for coming today. So he's right. I am going to talk about my two main passions in life, and that's the science of database systems and gambling on greyhounds at the dog track. Now, I realize, I hear some snickers; for a lot of you these two things seem like they have nothing to do with each other whatsoever. But what I'm going to show today is that there have been specific research challenges or problems that we faced when trying to scale up database systems to support modern transaction processing workloads, where the answers to those problems have come directly from things I've seen or learned while at the dog track. So before I get into this, I want to give a quick overview of what the state of the database world is right now for people that want to run front-end applications, and I'll loosely categorize the types of systems that are out there today into three groups. So the first is what I'll call the traditional database systems. These are things like DB2, SQL Server, Oracle, MySQL, Postgres, and the key thing about these systems is that they make the same architectural design assumptions and hardware assumptions that were made in the 1970s when the original database systems, System R and Ingres, were invented. Obviously a lot of things have changed since then. The second group has come around in about the last decade or so, and these are colloquially referred to as NoSQL systems. So these are things like MongoDB, Cassandra, Riak, and these systems are really focused on being able to support a large number of concurrent users at the same time, because they want to be able to support web-based and internet-based applications. And so for them, these traditional database systems weren't able to scale up to support their needs. So my work, my research, is really focused on this emerging class of systems that has come about called NewSQL. And we are trying to have the best of both worlds. We're trying to maintain all the transactional guarantees that you would get in a traditional database system, but while still being able to scale up and support a large number of concurrent users in the same way that the NoSQL guys can. And so for this talk, I'm not really going to talk about the NoSQL guys other than to say that their work is complementary to ours. Right? There are certain applications where you'd want to use a NoSQL system and not a NewSQL system and vice versa. So now let's look at what one of these modern transactional workloads looks like. So here we have a workload that's derived from the real software system that powers the Japanese version of American Idol. I realize it's not called American Idol in Japan, but just go with me on this. And so in this application, what you have are people either calling in or using their laptops to go online and vote for contestants that they like on the show. And so when one of these requests comes in, one of these calls comes in, the application starts a transaction that will check in the database to see whether this person has voted before. And if they haven't, it will go ahead and create a new vote entry for them and update the number of votes that the contestant has gotten. All right? So this seems like a pretty simple application. 
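(For illustration, here is a minimal sketch of that three-step vote transaction, written as plain JDBC-style Java. The table and column names are assumptions for the example, not the actual voter benchmark schema.)

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Sketch only: hypothetical "votes" and "contestants" tables.
    public class VoteTransaction {
        // Returns true if the vote was accepted, false if this caller already voted.
        public boolean run(Connection conn, long phoneNumber, int contestantId) throws Exception {
            // Step 1: has this phone number voted before?
            PreparedStatement check = conn.prepareStatement(
                    "SELECT COUNT(*) FROM votes WHERE phone_number = ?");
            check.setLong(1, phoneNumber);
            ResultSet rs = check.executeQuery();
            rs.next();
            if (rs.getLong(1) > 0) {
                return false;                       // reject duplicate vote
            }
            // Step 2: record the new vote.
            PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO votes (phone_number, contestant_id) VALUES (?, ?)");
            insert.setLong(1, phoneNumber);
            insert.setInt(2, contestantId);
            insert.executeUpdate();
            // Step 3: bump the contestant's running total.
            PreparedStatement update = conn.prepareStatement(
                    "UPDATE contestants SET num_votes = num_votes + 1 WHERE contestant_id = ?");
            update.setInt(1, contestantId);
            update.executeUpdate();
            return true;
        }
    }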
It seems like a pretty simple workload that really any modern database system should be able to support without any problems. So to test this hypothesis, we took two open source traditional database systems, MySQL and Postgres, and we tuned them for these types of front-end transactional workloads. And we wanted to measure how well they can scale up and get better performance as we give them more resources. So on one machine we are going to run the database system, and we're going to scale up the number of CPU cores that we allocate to the system to process these transactions. And then on another node, we're going to simulate people calling in and voting for contestants. And what we found is that in both cases, both these systems can't break past 10,000 transactions a second. Right? In the case of Postgres, the performance actually gets worse as you give it more CPU cores. And so I'm not trying to pick on MySQL and Postgres here, but I'll say that these results are emblematic of performance results that we've seen in other traditional database systems. So we've done other experiments where we took Oracle and paid a very expensive DBA to come out and tune it for us, and we see the same kind of results: as you add more CPU cores you don't get better performance. So now the question is: what's going on? What is it about these traditional database systems that is causing them to not be able to scale up on what seems like a simple workload? So in another project in our research group, they took another open source traditional database system and they instrumented the code to allow them to measure how much time is spent, or how many CPU cycles are spent, in different components of the system when they run one of these transaction processing workloads. They found that about 30 percent of the time is spent in the buffer pool. So this is managing an in-memory cache of records that have been pulled from disk and running some eviction policy to decide, when we need more room, what to evict and what to write back out to disk. Another 30 percent of the time is spent in the locking mechanisms of the system. So because there's a disk, multiple transactions are run at the same time, and one of them could try to touch data that's not in the buffer pool and therefore has to get stalled while the record that it needs gets fetched into main memory. So this all accounts for about another 30 percent of overhead. Another 28 percent of the time is spent in the recovery mechanisms of the system. So because we have an in-memory buffer pool, there could be dirty records that have not been safely written to disk yet. So they have to use things like a redo log or write-ahead log or other mechanisms to make sure that for any transaction that gets committed, all of the changes are durable and persistent if there's a crash; you don't lose anything that's already been committed. So that leaves us a paltry 12 percent of time left over to actually do useful work for the transactions. So this is why these traditional systems are not scaling up, right? Because we simply have all this other overhead for all these other components. Right? And so where does that leave an application developer? Yes. >>: I don't quite see why that's scalability and not simply a characteristic of even a uniprocessor system. >> Andy Pavlo: So your question is- >>: What's the multicore angle on this? >> Andy Pavlo: So this is like, in a multicore system, right, I see what you're saying. 
The question is whether this is more emblematic of traditional systems rather than being a multicore issue. And I'll say, the idea is that, well, in these types of workloads you are usually CPU bound, right; for these experiments here the database fits in main memory, so there is no disk. So that's why we're showing where the CPU cycles go. So this is just showing that if you want to be able to scale your systems to run on modern hardware with a lot of cores, as you start to scale them up you'd think you'd get better performance, but you don't, because you're paying all this extra overhead to do all of this locking stuff. You're shaking your head no. Sorry. Yes. >>: That's not the point. One says that the overhead doesn't [inaudible] per se. >> Andy Pavlo: The question is- >>: So there's something else going on? >> Andy Pavlo: Yes. Correct. Well, no. I mean, I'll come to this, yes. There are other things. But the basic idea is that there is all this architectural baggage from the 1970s that all these three things are representative of. And if you take a new look at what these workloads look like and what you can do with the hardware, then maybe you don't need these three things. >>: Should I think of the gray parts as having lots of synchronization [inaudible] in them and the green parts being like- >> Andy Pavlo: Absolutely. So yes, in this case, especially in the lock manager, right? >>: Okay. >> Andy Pavlo: So if you have concurrent transactions- >>: That's kind of the answer. >> Andy Pavlo: Locks, latches, and mutexes and other things. >>: The other problem about the overhead is the synchronization that's done in all these- >>: But these types of operations, like maintaining a buffer pool, recovery, they may be inherently synchronization- >> Andy Pavlo: Yeah, because you have to pin pages, yes. >>: In order for this to prevent scalability, you would have to argue that the percentage of time spent in each of these increases as you increase the number of cores. It's not just a matter of having this overhead. If this overhead remained constant while you increased the number of cores, you'd get scalability. >> Andy Pavlo: Right. So let me go further, and if you have more questions when I say what our system is doing, we can talk about that. Okay. So where does that leave application developers? Well, up until a few years ago, like I said, there were really only two choices. You could go with a traditional database system. A lot of people do this because they provide the strong transactional guarantees, and it's actually easier to write programs when you have transactional semantics. But these systems are notoriously hard and notoriously expensive to scale up. So anecdotally I'll say I have a colleague that works at one of the big three database vendors, which isn't Microsoft, and he tells me that one of their largest customers is a major bank in the US that pays about half a billion dollars per year just to run their transaction processing system. And so for most companies, most organizations, that's simply infeasible. So, for a lot of people, in the last decade or so we've seen the rise of these NoSQL systems, because they're able to get the better performance that you want for internet and web-based applications when you have a large number of concurrent users. 
But these systems achieve this performance over the traditional database systems by forgoing all the transactional guarantees that the traditional systems provide. So in the application you have to write code to be able to reason about inconsistent, or eventually consistent, views of the database. So again, our focus is going to be on the NewSQL class of systems, where again, we're trying to have the best of both worlds. We're trying to be able to scale up and get better performance while still maintaining support for transactions. So the real research problem we are trying to solve here is: how do we actually do this? And what I'll say is we're not going to do this by being a general-purpose system. We're really going to focus on a certain class of applications that have key properties that we can exploit in our system. And we're not going to try to claim to be a one-size-fits-all database system for everyone. And so the first question is: well, what are these properties of the types of applications that we are going to focus on, the ones we want our system to be optimized for and that we are really going to take into consideration? And the answer to this first problem can actually be found at the dog track. So specifically, there are three important characteristics of greyhounds and dog racing and of the types of transactions and applications we want to support that are actually directly analogous to each other. So the first is that both of these things are very fast. In the case of greyhounds, they're one of the fastest animals on the planet. They can run almost 40 miles per hour. And similarly, our transactions are the fastest type of transactions you can have in database systems. So we're talking about transactions that can finish on the order of milliseconds rather than minutes or even seconds. We're not talking about long-running transactions. The second thing is that both these things are very repetitive. So in greyhound racing, the dog just runs around in a circle on the track and that's it. Right? There's nothing else that it actually can do, right? Similarly, in the applications we're going to focus on, the database system is going to be doing the same set of operations repeatedly, over and over again. All right? So if you remember back to our American Idol example, when someone calls in, there's only one transaction that's ever going to be invoked, and that one transaction only has three steps. So we're not going to focus on optimizing the system to allow people to write arbitrary transactions or open up a terminal and type in a random query. And the last thing is that both of these items are very small, or have a small footprint. So what I mean by that is greyhounds actually have a small footprint, or paw print, for a dog of their size or stature; and similarly, our transactions are going to have a small footprint in the overall database. So the data set itself could be quite large, but each individual transaction is only going to touch a small number of records at a time. So we're not talking about long-running queries that are doing full table scans or complex joins to compute aggregates and things like that. We are really talking about transactions that come in, use an index to do point queries to find the individual records they want to read, and only process those. 
So now, based on these three properties, we've designed a system called H-store that's optimized from the ground up to operate efficiently for transactions in these types of applications; and this is work that I've done as part of my dissertation, along with colleagues at Brown, MIT, Yale, and what was at the time Vertica Systems. And so in H-store, and this maybe gets to address some of your questions, there are three key design decisions that we're going to make that are a direct reaction to the bottlenecks that we saw in the traditional systems. So the traditional systems are inherently disk oriented. All that machinery that I talked about before in that pie chart, a lot of that you have to have because a transaction could try to touch data that's not in the buffer pool but is on disk. But for these modern workloads that we're looking at, in many cases the database can fit entirely in main memory. You can buy a small number of machines that have enough RAM to store the entire database in main memory. And so in H-store, we are going to have a main memory storage engine. We're going to assume that the database is small enough to fit in RAM. So we're talking about databases that, for these types of applications, are usually a couple hundred gigabytes. And the largest one that I know of is Zynga's, which is roughly about 10 terabytes. So again, it's perfectly feasible to buy enough machines with enough memory to do this. The second thing is that in a traditional system, because there's a disk, they have to allow transactions to run concurrently, because at any time one could stall because it tried to touch something that's not in the buffer pool. And so again, to do this you have to have a concurrency control scheme that's using locks and latches and mutexes and other synchronization methods to make sure that one running transaction does not violate the view seen by another transaction that's running at the same time. But now if everything is in main memory, you're never going to have those kinds of disk stalls. So maybe it does not make sense to have a lock manager and to have concurrent transactions anymore. So in H-store, we're going to have serial execution of transactions, meaning we're going to execute transactions one at a time on a single core. And this sort of makes sense, because if the cost of going and acquiring a lock in main memory is the same as actually just accessing the data in main memory, you might as well go access the data. And lastly, in a traditional system, they have to use a more heavyweight recovery mechanism to make sure that all changes are persistent and durable after a crash. So they use something where they record the individual changes that were made to each record that was read or written by a transaction. In H-store we're going to use a more compact logging scheme that's more lightweight and more efficient, where we only need to store what a transaction was rather than what it actually did. And I'll explain a little bit more about what I mean by that in a second. So now, the basic architecture of H-store is that the database can be split up into disjoint subsets called partitions that are stored entirely in main memory. 
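(For illustration, a minimal sketch of that serial-execution idea: one engine thread per partition draining a queue of transactions one at a time, so nothing else ever touches that partition's data. This is a toy, not H-store's actual code.)

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.function.Consumer;

    // Sketch only: one partition = one in-memory data store + one engine thread.
    public class PartitionEngine implements Runnable {
        // This partition's slice of the database, living entirely in main memory.
        private final Map<Object, Object> data = new HashMap<>();
        // Transactions wait here for their turn at this partition.
        private final BlockingQueue<Consumer<Map<Object, Object>>> queue =
                new LinkedBlockingQueue<>();

        public void submit(Consumer<Map<Object, Object>> txn) {
            queue.add(txn);                 // queue the request for this partition
        }

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    // One transaction at a time: since no other thread ever touches
                    // `data`, no latches or fine-grained locks are needed inside it,
                    // and the transaction runs from beginning to end without stalling.
                    queue.take().accept(data);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }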
So in this example here, say I have a single node with two partitions, and so my database is going to be split into disjoint subsets where one half of the database will be in one partition and the other half will be in the other partition. And each of these partitions is going to be assigned a single-threaded execution engine that has exclusive access to all the data at that partition. And what that means is, if any transaction needs to touch data at that partition, it first has to get queued up and then wait to be executed by that partition's engine. And because we have these [inaudible] locks at the partition level, when a transaction's running, since the engine is single-threaded, it knows that no other transaction is running at the same time, so we don't have to set any fine-grained locks or latches down in the underlying data structures within the partition. So these transactions run beginning to end without ever stalling. So now, to execute a transaction, the application comes along and it's going to pass in the name of the stored procedure that it wants to invoke and then the input parameters for that transaction. So in H-store, the primary execution API is going to be through stored procedures. And stored procedures are, essentially, in our world a Java class file where you have a bunch of predefined queries that each have a unique name, and then a run method that takes in the input parameters sent in by the application and invokes program logic that will make invocations of the predefined queries. And so we have a very important constraint on our stored procedures in H-store, and that is they have to be deterministic. And what I mean by that is they're not allowed to, you know, use a random number generator inside the run method, or go grab the current time, or make invocations over RPC to some outside system. All the information that the transaction needs in order to be processed has to be contained in what is passed in from the client. And this will be important later on for the recovery mechanisms. Yes. >>: Does that mean you can't depend upon the prior state of the database? You can't have conditional logic which- >> Andy Pavlo: That's okay. Yeah. So the determinism really has to be: if we re-execute this transaction at a later date in the same order that we processed it originally, we need to end up with the same ending state. It's perfectly fine to do a query, read back the state, and then, with an if branch, do something different. That's fine. Okay. So now the transaction request will be queued up at the partition that has the data that it needs, and once it reaches the front of that engine's queue, it will have the global lock for that partition and it's going to be allowed to start running. Now when it finishes, we're going to go commit its changes right away; but before we send the result back to the application, we have to write the same information the application sent us originally out to a command log on disk. Yes. >>: [inaudible] an individual transaction can't access data from multiple partitions? >> Andy Pavlo: No, it can. And we'll get to that too. That's later. Yes. And so I'll say this command log, this writing out, is done on separate threads; we are not blocking the main engine. Yes. >>: Does this mean as you go up in cores you have to do a finer-grain partitioning? >> Andy Pavlo: Yeah. So you could. I mean, for one core there's one partition. Finer grain in the sense of like- >>: You have more cores, you have more partitions. >> Andy Pavlo: Absolutely. 
Yes. So there is an upper limit, and that's sort of related to his question. There's an upper limit because as you partition more, you could end up with more multi-partition transactions. Yeah. And so we're going to batch these entries in the command log together and do a group commit, where it's just one [inaudible] to write them all at the same time, sort of amortizing the cost of doing that write across multiple transactions. So now once this entry is safely written and durable in the command log, it's safe for us to go ahead and send back the result to the application. Now there's also a replication scheme going on here, where we are doing active-active replication, where we can just forward these transaction requests from the application to our replica nodes and then process them in parallel. But I'm not going to talk about that right now, because it sort of complicates everything we'll talk about later on; if you want to know more about it, I'll be happy to talk about it afterwards. So now, while this is all going on, the database system in the background is taking asynchronous snapshots of the partitions in memory and then writing them out to disk as well. So we're going to use a copy-on-write mechanism so we don't slow down the main execution pipeline of the execution engines. And so now if there's a crash, all we need to do is load in the last checkpoint that we took, and then we can replay the command log to put us back into the same database state again. So this is why the transactions have to be deterministic: on recovery, we want to make sure we end up with the same result. >>: You need a transaction-consistent checkpoint, right? >> Andy Pavlo: Correct. Yes. You need to make sure all the nodes are doing the checkpoint at the same time, but then you use a copy-on-write mechanism to make sure that you don't block everybody else. But yes. Okay. So now if we go back to our Japanese American Idol workload we looked at before, this time we're going to run the same workload on the same hardware using H-store. You see that when we give H-store a single CPU core to process transactions, it can do over 20,000 transactions a second. But now the main difference is that as we scale up the number of cores that we give the system, we get better and better performance, up to a factor of about 25X over what we can get in the traditional systems on eight cores. And so now, immediately, every single one of you here, when you see a performance gain like this, should have a red light or siren going off in the back of your head telling you to be skeptical. And what I'll say is that this is not a parlor trick. All three systems are running the same workload, at the same serializable isolation level, and with the same durability and persistence guarantees. So if the database crashes, all three systems can recover any transactions that were committed. So we can actually look at some other workloads and see the same kind of performance results. So TPC-C, as I'm sure everyone is aware, is the canonical benchmark that everyone uses to measure the performance of these types of systems. And we see the same kind of performance results: as we give more cores to H-store we get better performance, whereas the traditional database systems simply flatline. 
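(Stepping back to the command log and checkpoints described a moment ago, a minimal sketch of the recovery procedure might look like the following. The types are illustrative placeholders; the point is that a command-log entry only records which procedure ran with which parameters, and replaying those entries on top of the last snapshot reproduces the state only because the procedures are deterministic.)

    import java.util.List;

    // Sketch only: placeholder types standing in for the engine's real ones.
    public class RecoverySketch {
        // A command-log entry stores *what* the transaction was, not what it did.
        static class LogEntry {
            final String procedureName;   // e.g. "Vote"
            final Object[] parameters;    // the client-supplied input parameters
            LogEntry(String name, Object[] params) { procedureName = name; parameters = params; }
        }
        interface Database {}
        interface Snapshot { Database load(); }                        // last consistent checkpoint
        interface StoredProcedure { void run(Database db, Object[] params); }
        interface ProcedureRegistry { StoredProcedure lookup(String name); }

        // Recovery: restore the last transaction-consistent snapshot, then
        // re-execute every logged request in its original order.
        public Database recover(Snapshot lastSnapshot, List<LogEntry> logAfterSnapshot,
                                ProcedureRegistry procedures) {
            Database db = lastSnapshot.load();
            for (LogEntry entry : logAfterSnapshot) {
                procedures.lookup(entry.procedureName).run(db, entry.parameters);
            }
            return db;
        }
    }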
Telecom One, or TM1, which is actually referred to as TATP now, is a workload from Ericsson that simulates someone driving down the highway with their cell phone, and the cell phone has to update the towers and say, if you need to call me, here's where to find me. And we see the same kind of result as well. As we give more cores to H-store, we do better, whereas with the traditional systems, the performance actually gets worse. So now you would look at these results and say, well, this is great. H-store does much better than the traditional systems; why would I ever want to use a traditional system today when a main memory architecture like H-store can get much better performance? And the answer should be quite obvious to everyone here, and that is the inherent problem of a main memory database system: you're limited to databases that can fit in main memory. But that's okay, because out of the box H-store supports multi-node deployments. So here we looked at the same three workloads that we started off with before, and this time we're going to scale up the number of nodes in our cluster, so we're going to go from one to two to four nodes, with eight cores per node, and we see the same kind of thing. As we add more hardware to the system, we are able to get better performance, and this checkered line here is marking where we ought to be in terms of achieving linear scalability, which is the gold standard, what you want to have in a distributed database. So as we double the number of nodes, we want to get double the performance. In the case of the voter benchmark, we're pretty close to achieving that. In the case of TPC-C and Telecom One, we're a little bit off because of the nature of the workload. And so again, now you look at this and say, well, this is great, because with H-store I can add more machines, I can store databases that are larger than the memory of a single machine, so why would I ever want to use a traditional database system when I can just buy more machines and scale out with H-store? And again, sort of related to the gentleman's question here: in these three workloads, all the transactions were single-partitioned, meaning- >>: It's not CPU cores, it's the number of machines? >> Andy Pavlo: CPU cores divided by eight. So it's one to two to four. Actually, I marked that. Sorry. So for these workloads here, all of the transactions only touch a single partition. So when they ran, they did not need to coordinate or synchronize with any other node in the cluster. And that's why we're able to get this really good performance. But now if you have a transaction that has to touch multiple partitions, you end up with what is known as a distributed transaction. And this has really been the main bottleneck, the main reason why a lot of the distributed databases from the 1980s did not really become popular, because this is why those systems aren't able to scale up. And although H-store is a NewSQL system, with its modern code base and modern hardware assumptions, it was not immune to this problem either. So now we go back to the TPC-C benchmark again, and this time we make 10 percent of the transactions distributed, meaning 10 percent of the transactions need to touch data at two or more partitions. Now we see that as we scale up the number of nodes, the performance is terrible. It's completely flat-lined, and we are nowhere near where we want to be in terms of linear scalability. 
So now you look at this and say this is god-awful; why would I ever want to use H-store when, as I try to scale up and have multiple nodes in my cluster, I'm paying for more hardware, more energy, and more maintenance for those machines, and I'm not getting better performance at all? I might as well go back to a traditional database system, where at least if I pay more money I can try to scale up and get better hardware that way. And this is a real problem, because not all workloads can be perfectly single-partitioned in the way that we assumed before. So in the early days of this project, we actually visited PayPal, and PayPal had this legal requirement where customers from different countries couldn't be on the same partitions. So if one person had an account in Italy and another had an account in the US and you wanted to send money between the two people, there were legal requirements that meant it had to be a distributed transaction. So an architecture like H-store simply would not work for them. In many cases, the application schema itself is not easily partitionable either. So you would end up with this bottleneck. So this is the main thing we're trying to solve here: how can we achieve linear scalability when we have distributed transactions? So to do this, the first thing we have to do is figure out, well, what's going on? What is it about these distributed transactions in a system like H-store that's causing this bottleneck? So now we want to look at an example where we have a multi-node cluster. Say we have an application that comes along and submits a transaction request to this system, and this time this transaction needs to touch data at these four partitions. So the way H-store's concurrency control protocol works is that we have to acquire the locks for these partitions first, before the transaction is allowed to start running. And the reason why we have to do it first is so we don't have to do deadlock detection, as we would with more fine-grained locking, because that would be expensive to do in a distributed environment, especially if you have transactions that are finishing on the order of milliseconds. But now the first problem we are going to hit is that we don't actually know what partitions this transaction needs before it starts running. So we have to lock the entire cluster even though we're never going to need most of those guys. So now once we do this, and the transaction is allowed to start running, it can issue query requests to these remote partitions to either access or modify data that's located at the other nodes, and we're going to see a second problem. That is, if we actually knew the number of queries this transaction needed to execute at each node, we would see that it needs to touch more data at the node at the bottom. So what we really wanted to do is, when the request came in, automatically redirect it to this node down here and run the stored procedure there, because that would result in fewer network messages; most of the data it needs would be local, so we wouldn't have to go over the network to send query requests to these remote nodes. But again, we don't know this information because we're dealing with arbitrary stored procedures. Right? 
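(To make the dilemma concrete, here is a minimal sketch: if we cannot tell in advance which partitions a transaction needs, we have to lock every partition up front; if we instead lock only a guessed subset, we need a fallback, which comes up later in the Q&A, that aborts and restarts the transaction when it touches a partition it does not hold. Names are illustrative, not H-store's real API.)

    import java.util.Set;

    // Sketch only: partition-level lock acquisition for a distributed transaction.
    public class PartitionLockSketch {
        interface Cluster {
            Set<Integer> allPartitions();
            void lock(Set<Integer> partitions);      // blocks until all are acquired
            void unlock(Set<Integer> partitions);
        }
        interface Txn {
            // Runs the stored procedure; throws if it touches a partition
            // outside the set it was granted.
            void run(Set<Integer> grantedPartitions) throws PartitionAccessViolation;
        }
        static class PartitionAccessViolation extends Exception {}

        // Naive scheme: no prediction, so every partition must be locked up front.
        void runLockingEverything(Cluster cluster, Txn txn) throws Exception {
            Set<Integer> all = cluster.allPartitions();
            cluster.lock(all);
            try { txn.run(all); } finally { cluster.unlock(all); }
        }

        // With a guessed partition set: lock only what we expect to need, and
        // fall back to a conservative retry if the guess turns out to be wrong.
        void runWithPrediction(Cluster cluster, Txn txn, Set<Integer> predicted) throws Exception {
            cluster.lock(predicted);
            try {
                txn.run(predicted);
            } catch (PartitionAccessViolation wrongGuess) {
                cluster.unlock(predicted);           // abort and roll back the attempt
                runLockingEverything(cluster, txn);  // restart with the locks it needs
                return;
            }
            cluster.unlock(predicted);
        }
    }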
So this is a difficult problem, and you could try to apply things like static code analysis and other techniques, but that would be too slow to do on every single transaction that comes in. So luckily for us, the solution to this problem can actually be found back at the dog track. Now, I wouldn't say that I was going all the time, I wouldn't go every day, but you know, you go a couple times a week, holidays, Fourth of July, Memorial Day, Mother's Day, stuff like that. And it was one of those things where, when you start going to the same place over and over again, you start to notice the same people and the same patterns, right, people doing the same thing every single time. And I met this guy named Fat-faced Rick; he first came to my attention because he was winning every single bet he was making with all the bookies at the track. He wasn't betting a large amount each time, but every time he made a bet, he was right almost 100 percent of the time. And it took me a while, but I finally figured out what he was doing. Every single morning before a race he would go down to the parking lot where all the trainers would bring in their dogs and check in for that night's race, and he would pretend to be a vet from the state gaming commission, and he would tell the trainers, I need to look at your dogs, make sure that they're up to regulation and they don't have any health code problems, right? But what he was really doing was checking them out to figure out which ones were in the best shape, which ones were the strongest, and which ones didn't have any injuries. And those are the ones he would go make his bets on. And that's why he was always winning. So this is the same thing we need to do in our database system. We need to know what transactions are going to do when they come in, before they start running. So to do this, we built a machine learning framework called Houdini that we've integrated into H-store and that allows us to predict the behavior of transactions right when the request comes in, without having to run them first. So we have a very important constraint in this work on how we make these predictions, and that is, we can't spend a lot of time figuring these things out. We can't spend 100 milliseconds figuring out what a transaction is going to do if that transaction is only going to run for five milliseconds. So the underlying component of how Houdini works is that we're going to create Markov models, or probabilistic models, of all the stored procedures that the application could execute. And we're going to build these from training sets of previously executed transactions. So for each model, we're going to have the starting and terminal states for the transactions, so the begin, commit, and abort states. And then we're going to have the various execution states that the transaction could be in at runtime. These execution states are represented by the name of the query being executed, how many times we've executed that query so far, and what partitions this query is going to touch. Now each of these states is going to be connected to others by edges that are weighted by the probability that, if a transaction is at one state, it will transition to another state. 
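(A minimal sketch of what such a per-procedure Markov model might look like is below. Each state records the query name, how many times that query has run so far in the transaction, and the partitions it touches; edge weights are just observed transition counts normalized into probabilities, and following the heaviest edges from the begin state gives a predicted set of partitions. This is an illustration of the idea, not Houdini's actual data structures.)

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Objects;
    import java.util.Set;

    // Sketch only: a per-stored-procedure Markov model built from traces of
    // previously executed transactions.
    public class TxnMarkovModel {
        static final class State {
            final String queryName;          // e.g. "GetWarehouse"; null for the begin state
            final int invocationCount;       // how many times this query has run so far
            final Set<Integer> partitions;   // partitions this invocation touches
            State(String q, int c, Set<Integer> p) {
                queryName = q; invocationCount = c; partitions = p;
            }
            @Override public boolean equals(Object o) {
                if (!(o instanceof State)) return false;
                State s = (State) o;
                return invocationCount == s.invocationCount
                        && Objects.equals(queryName, s.queryName)
                        && partitions.equals(s.partitions);
            }
            @Override public int hashCode() {
                return Objects.hash(queryName, invocationCount, partitions);
            }
        }

        static final State BEGIN = new State(null, 0, Set.of());

        // Edge counts: from-state -> (to-state -> times that transition was observed).
        private final Map<State, Map<State, Integer>> edges = new HashMap<>();

        // Called while processing the training set, and again at runtime to keep
        // the weights in sync with what transactions actually do.
        public void recordTransition(State from, State to) {
            edges.computeIfAbsent(from, k -> new HashMap<>()).merge(to, 1, Integer::sum);
        }

        // Greedily follow the highest-weight edge out of a state.
        public State mostLikelyNext(State from) {
            Map<State, Integer> out = edges.get(from);
            if (out == null || out.isEmpty()) return null;
            return out.entrySet().stream()
                      .max(Map.Entry.comparingByValue())
                      .get().getKey();
        }

        // Predict which partitions a new invocation will touch by walking the
        // most likely path from the begin state until the model runs out.
        public Set<Integer> predictPartitions() {
            Set<Integer> result = new HashSet<>();
            int guard = 0;   // safety bound for the sketch
            for (State s = mostLikelyNext(BEGIN); s != null && guard++ < 10_000; s = mostLikelyNext(s)) {
                result.addAll(s.partitions);
            }
            return result;
        }
    }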
So now at runtime, when a request comes in, Houdini will grab the right model for that unique transaction invocation, and it will estimate some path through this model, and based on the states that the transaction will visit as it traverses the model, that will tell us what optimizations we can apply at runtime. Now if we get our predictions wrong, it's okay, because at runtime we are actually going to follow along with the state transitions that the transaction actually does make. And so we're maintaining internal counters of how many times it goes across each path. So if we start noticing that our predictions are deviating from reality, from the actual runtime behavior of the transactions, we can just re-compute these edge weights really quickly online, with a cheap computation, to get us back in sync with what the application is actually doing. So how we're going to generate these models is that, again, we are going to have a training set of previously executed transactions. Yes. >>: So I can understand why you predict [inaudible] more frequent than aborts. But it seems to me what you really need to do is be able to predict which partition is being referenced. >> Andy Pavlo: Okay, so again, that's what we do. So the state has the name of the query being executed, how many times we've executed the query in the past, and what partitions this invocation of the query will go to. And because it's a Markov model, we have to encode all the history in any state, so we also have the history of all the partitions we touched in the past. >>: So your weightings there suggest an overwhelming [inaudible]. But you could easily end up with a path where there are three possible partitions and they're each around 33 percent. >> Andy Pavlo: Yes. Give me like two minutes and I'll solve your problem. Okay. Great [inaudible]. So we have a training set of previously executed transactions, right? These are all the queries and input parameters that were invoked in each transaction, and we're going to feed that first into a feature clusterer that's going to split them up based on the characteristics or attributes of the transactions' input parameters that create the most accurate models. So these can be things like the length of an array parameter or the hash value of another parameter. And now with these bucketed training sets, we're going to first feed them into our model generator that will create the Markov models for each bucket, and then we'll have a classifier create a decision tree that will split them up based on the features that we originally clustered them on. So now, one of the features could be: what's the hash value of some parameter? And that will tell us what partition we're going to execute this query on, what partition we'll execute this transaction on, and now it almost becomes like a linear state machine where we don't have that equal probability of which partition we take, because it's no longer a giant monolithic [inaudible] model; it's more individualized. Yes. >>: There are many models, many classifiers that you could have used. Why did you choose a decision tree? >> Andy Pavlo: We chose a decision tree because we wanted to be able to quickly traverse it at runtime- >>: Runtime speed? >> Andy Pavlo: Correct. So this whole top part here is actually what I was going to say next. This whole top part here we're doing off-line, so it could take a while; that's fine. 
But now at runtime, we can quickly traverse the decision tree and then quickly estimate some path through the model. So we can do this bottom part here in microseconds per transaction. >>: So just to make sure I understood. So the parameter value is being [inaudible]? >> Andy Pavlo: Yes. >>: From that you can exactly know, it's not a prediction, you know exactly which partition that's connected to [inaudible], or you don't know? >> Andy Pavlo: It tells us, it suggests to us, what partition to run the stored procedure at. But now within that stored procedure, it could touch any number of partitions. It just so happens in this case- >>: So then how does it help you with the locks? [inaudible] some sort of computed example. >> Andy Pavlo: Yes. >>: So let's say [inaudible]. How do you know which partitions to lock? >> Andy Pavlo: Yes. So when a request comes in, we grab the right model, do a little hand-waving magic with the machine learning, so we take the model that is most representative based on the decision tree, and we estimate what the path is through it. And so when we're trying to figure out what state transitions we're making, since we know what the tables are partitioned on, because we have to be told that ahead of time, and we know what the input parameters to that transaction are, that can tell us what partition the query is actually going to have to touch. Yes. >>: There might be some dependencies here that are actually stored in the database. >> Andy Pavlo: Correct. >>: So if I look up my main customer and the hash of my main customer now determines the partition. >> Andy Pavlo: Yes. >>: What is your boundary? So in many cases, this should work fine- >> Andy Pavlo: If you have to read the state of the database and, based on that, hash whatever comes back, where the output of one query is being used as input for another query, these models don't capture that information. >>: Right. >> Andy Pavlo: So we take care of that in other ways. Again, in the partitioning work, the automatic database [inaudible] we've done, we can figure out, hey, we see this pattern happen a lot, it's usually read-only or read-mostly, so we'll create secondary indexes that are replicated on every single node so that we can do that lookup, and that will direct us to the right locations. So there's other things beyond this. Right? >>: But you would know, right? So in these cases, if you have any mode of saying, well, for these transactions we can get a very accurate model, and for these cases we just don't know, so we'll back off from our prediction and do something more conservative, because your most likely path may not be very likely. >> Andy Pavlo: Yes. So your question is, is there a way for us to identify the different types of workloads that we see that have this dependency, where you're reading stuff in the database, and then maybe not apply these optimizations for those? >>: Or, just more generally, do you know when your model works well and when it doesn't? >> Andy Pavlo: So we have not done anything formal about that, but I can tell you, sort of off the cuff: again, doing that lookup and feeding the result of one query into another query, that won't work well with this. But again, we take care of that in other ways. 
If you have sort of large range queries that have to touch multiple partitions, and it's arbitrary which partitions you have to touch, that won't work well with this either; but I'll say that second type of query we don't often see in the type of applications we're focusing on. That's more getting into the real-time analytical stuff, which we haven't focused on yet, and I'll talk about that in future work. Any other questions? Yes. >>: So are you able to do optimizations for locking the partitions as well, based on your predictions of which partitions the transaction is going to access? >> Andy Pavlo: So we can do that; in this case here, we have our path, right? We know what partitions we think it's going to touch, so now we only lock the ones we need. >>: And what if the prediction is incorrect, and during the runtime you decide [inaudible]? >> Andy Pavlo: So we'd say, yeah. [inaudible]? If we fail to predict that we need a partition in the beginning and we don't lock it, then when the transaction tries to actually access that partition, we'll abort it, roll back any changes, restart it, and acquire the locks. >>: That's to prevent any deadlocks? >> Andy Pavlo: Correct. So then, yeah, you can't touch anything without having a lock beforehand. >>: So [inaudible] the partition is just the transaction that works [inaudible]? >> Andy Pavlo: No. There's other things as well. So, like, you try to lock something you don't actually end up needing, and now that partition is just sitting idle and can't do anything. I don't have a good sense of which is worse. >>: Okay. >> Andy Pavlo: I know that, I guess I'll just jump to it now. >>: My other question is how big is the difference; so if I remember correctly from our previous conversation, the difference between a transaction that is predicted correctly and runs completely locally, versus a transaction that needs to run locking everything, the gap between these is huge, right? The idea of running a transaction and aborting halfway through [inaudible]. The [inaudible] of that abort, if I understand correctly, is not that relevant. Is that a fair statement? >> Andy Pavlo: The overhead of aborting the transaction and restarting it- >>: Yeah. So you're trying to run the fast version of it- >> Andy Pavlo: Yes. >>: I fail to predict and I run the slow version- >> Andy Pavlo: When you say slow version, it's more like you restart it locking the full cluster. So this is the naive prediction scheme: this is when you assume that everything is single-partition, and if you get it wrong you abort, restart, and acquire the locks that you need, right? And the top line is when we actually use Houdini, when we are accurately predicting what we actually need. So it's about a 2X difference between the two. Right? [inaudible] I don't know whether it's better to lock something you don't end up needing or to miss a lock. My sense is it's roughly the same, but if you have to lock the entire cluster every single time, the performance is absolutely terrible. So this is what we can do if we use our model, sorry, yes. >>: So this model, how would it compare to, say, doing just a dirty run of the transaction- >> Andy Pavlo: To, like, simulate it? Yeah. >>: Use that as a model. >> Andy Pavlo: So, I mean, there's other techniques that you could use. 
You could simulate the transaction, you could use static code analysis, you could use [inaudible] checking. For the simulation one, I think the overhead of doing it might be an issue, and the problem with that is you miss things like reading back a value and doing an [inaudible] branch. >>: But the thing here is that the fast version is much faster than the slow version; you're doing an even faster fast version which doesn't acquire locks, just to see what you think it might touch. >> Andy Pavlo: But you still have to acquire the locks. Are you suggesting you just simulate the transaction, not in the engine, in a separate thread- >>: Yeah. >> Andy Pavlo: See what data it tries to touch- >>: Right. >> Andy Pavlo: And that tells us how to schedule it. >>: [inaudible] you almost at the same cost [inaudible] transaction? >>: Not acquiring the locks is a big delay though. Like you have to- >> Andy Pavlo: So for your case, you would still have to acquire the locks. >>: No. I'm saying, like, when you do the dirty simulation, you acquire no locks. So you use some IO, but- >> Andy Pavlo: So you're doing an optimistic concurrency control kind of thing. >>: [inaudible] transaction? On a separate thread, on the actual real data, pretending that no transaction exists- >> Andy Pavlo: Right. And that will tell you what partitions the transaction needs to lock. But now when you run it for real, you have to do the concurrency control scheme and acquire the real locks. >>: That's right. I'm just wondering, like, this dirty simulation is also a model of what could be locked. >> Andy Pavlo: Correct. Yes. >>: So you have sort of a fancy machine learning static analysis. I'm just wondering what's the strawman, I guess, is what I'm trying to get at. Like, how should I think of this model as being better or worse than other attempts to guess at what to lock? >> Andy Pavlo: Right. I got that. I have not done that simulation experiment. That is something I have to do for my dissertation work. There's other things that we can do with these models that I'm not really going to talk about today; we can do things like identify when we're done with a partition, so we can go ahead and send the early two-phase commit message. We can also, and I don't want to bring it up because Phil's here, but you can do things like, if you know with absolute certainty there's never going to be a user abort, maybe you don't need your undo logging, and you can get about a 10 percent speedup as well. But I don't have a good answer for how much you get just using simulation. There's other things I'll talk about later on, how we can leverage these models for [inaudible] execution and other things like that. We can talk afterwards about whether we could still do simulation for those. But we're doing more than just figuring out where we should send it and what we should lock, and that's my hunch on why just doing a quick and dirty simulation might be insufficient. Any other questions? Okay. So this is what we get: about a 2X improvement over that naive prediction scheme. But we also compare how well we did versus what I'll call the optimal case, where you have an oracle that knew exactly what every single transaction was going to do. This is the best performance that you can get. We're about 98, 99 percent accurate. So there are some cases where we lock something we didn't end up using, or we don't lock something that we do need later on. And so we're not that far off from where we want to be in that case. 
And actually, if we run this even longer over time, we can learn more, our models get improved, and we get closer and closer. Now, a 2X improvement is always a welcome- >>: You say optimal, you mean perfect prediction? >> Andy Pavlo: Yes. So I hardcoded something in the system that said, here's a request, here's what it's going to do. >>: [inaudible] how these models can be so accurate? [inaudible]. Let's go back to the example I had before. Let's say that our transactions [inaudible]. >> Andy Pavlo: Yes. >>: So you have a history of that parameter value showing up; a specific value, right? >> Andy Pavlo: So it's more like- >>: But imagine per second. >> Andy Pavlo: Yes. >>: When you see a parameter value you've never seen before. >> Andy Pavlo: Yes. >>: How [inaudible]? >> Andy Pavlo: Again, we are doing- >>: [inaudible] I would imagine there would be a very long tail of parameters you've never seen before, even though the stored procedure is exactly the same. >> Andy Pavlo: Yes. Correct. So, let me hold your question, and I have a slide later on that I'll show you. Because again, we are doing hash partitioning. We don't have to encode, here's Andy, here's Vivek, individual records and how they map to partitions. >>: [inaudible] learning based on the result of the hash function? >> Andy Pavlo: Correct. Yes. But I'll show how we can be more deterministic in our selection later on. So a 2X improvement is always pretty good. It's always a welcome improvement when you're doing database research. But the problem is, if we change the graph to reflect where we ought to be in terms of linear scalability, we are still not heading in the right direction. The absolute numbers have improved, but the trend is still not where we want to be. So this doesn't help us. Again, we are adding more machines and we are not getting the better performance that we want. So the question is: what's going on? What's causing us not to be able to scale up? It has to do with the inherent nature of our concurrency control model. Because we have these locks at partition granularity, when a distributed transaction is running and it holds the locks at remote nodes, those remote nodes, the engines for them, are idle doing nothing, because they have to wait until the distributed transaction sends a message over the network to tell them to execute a query and send back the result, or to start that two-phase commit process to finish the transaction. So they're essentially doing nothing. So now, when the stored procedure does send a request to these remote nodes, they have something to do; they can process the query. But now we sort of flip, and the guy at the bottom is idle, because he has to wait for the results to come back before he can make forward progress. And once they do come back, again, the remote guys go back to being idle. Right? And this is because we are optimizing our system for these single-partition transactions, which are the majority of the workload that we see for these types of applications. But because we lock at such a coarse granularity and we have these long wait times, our system is just being completely slowed down. So once again, one last time, the answer to how to solve this problem can be found back at the dog track. So I first met these guys, they were ex-taxi drivers from Argentina, and they were kind of a seedy group. 
They didn't really talk to anybody else; they were always talking among themselves. And I first noticed them because they were always running around looking very, very busy. Most people go to the dog track to relax, at least that's what I do, because there's actually not a lot to do while you're at the track; there's only about 14 to 15 races per night, and each race is about 50 seconds, so it's over pretty quickly. So there's a lot of time where you're just sitting, eating food, waiting for the next thing to happen. But these guys weren't there to relax. They were at the track to make money. And so what they would do is go down to the payphones by the bathrooms, and whenever we were in between races at the track we were at, they would call their bookies at tracks in the next county over so they could make more bets and make more money. So again, they're doing something useful when everyone else is sitting around idle. And this is the same thing we want to do in our database system. We want to be able to do some kind of useful work whenever we know that the engine is blocked waiting on the network. So to do this, we developed a new protocol that allows us to [inaudible] execute single-partition transactions at the execution engine for a partition that is blocked because of a distributed transaction. And the key thing about this is that we have to make sure we maintain the serializability of the database system. So my apologies, Phil, that you have to sit through this. For a serializable database, what we want is to have the end state of the database be the same as if we executed the transactions sequentially, one after the other. And this is essentially what we're doing now. So we have a distributed transaction and two single-partition transactions, and each of those single-partition transactions does not start executing until it knows that the previous one has committed successfully. What we see in the case of the distributed transaction is that we have this huge block of time here where we're idle, where we're waiting for something to come over the network to tell us to do work, or waiting for the result to come back before we can continue forward in our transaction. So what we're going to try to do here is find a schedule where we can interleave these single-partition transactions during this time when we are blocked, and then we're going to hold on to the results until the end, and we're going to do a verification process to check that no one created any conflicts and that the end database state is still the same as if we had executed them sequentially, so that, you know, we have a consistent view at all times. So this sort of looks like optimistic concurrency control, right? But in that original paper from Professor Kung in 1981, they assume that conflicts are very rare, so the number of transactions you have to abort in the verification step because there was a conflict is small. But in a lot of the applications that we've looked at, there usually is a lot of skew, either temporal skew or popularity skew. So conflicts are not rare, and you end up having to abort a lot of things over and over. So what we're going to do instead is use pre-computed rules that allow us to identify whether two transactions conflict, and if they don't, we know that it's safe to interleave them. 
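(A minimal sketch of that scheduling decision is below, assuming, as described next, that a speculated transaction must both be predicted to finish before the distributed transaction resumes and pass the pre-computed conflict rules. The types are placeholders, not the actual HERMES implementation.)

    import java.util.Iterator;
    import java.util.Queue;

    // Sketch only: while a partition is stalled by a distributed transaction,
    // scan that partition's queue for a safe single-partition transaction to run.
    public class SpeculativeScheduler {
        interface Txn {
            long estimatedRuntimeMillis();           // from the extended Markov models
            boolean conflictsWith(Txn other);        // pre-computed rule over tables read/written
        }

        public Txn pickCandidate(Queue<Txn> partitionQueue, Txn stalledDistributedTxn,
                                 long estimatedStallMillis) {
            Iterator<Txn> it = partitionQueue.iterator();
            while (it.hasNext()) {
                Txn candidate = it.next();
                if (candidate.estimatedRuntimeMillis() > estimatedStallMillis) {
                    continue;   // would not finish before the distributed txn resumes
                }
                if (candidate.conflictsWith(stalledDistributedTxn)) {
                    continue;   // e.g. it writes a record the stalled txn has read
                }
                it.remove();    // pull it out of the queue and run it speculatively
                return candidate;
            }
            return null;        // nothing safe to interleave right now
        }
    }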
And now at the verification step, because we scheduled them based on the predictions that we generate from our Markov models, and we selected them so that they wouldn't conflict, as long as they did what we thought they were going to do, we know there aren't any conflicts and we can commit everybody safely at the end. So now let's look at an example here. Let's say we have a distributed transaction that needs to touch data at these two partitions, so the stored procedure is going to run at the top, and it acquires the lock for the partition at the bottom. When it starts running, at some point it's going to issue a query request to this partition at the bottom, and that means the stored procedure will get blocked and sit idle because it has to wait for the result to come back. So now when this occurs, HERMES will kick in and is going to look at this engine's transaction queue to try to find single-partition transactions that it can interleave. We have two important requirements for how we're going to do this scheduling. The first is that we have to make sure that the single-partition transactions will finish before the distributed transaction needs to resume. So we've extended the models that we used in Houdini to now include the estimated runtime in between state transitions. >>: Why does that matter? >> Andy Pavlo: Because cascading [inaudible]. Because you're holding these locks, you want to finish up and have everything commit as soon as possible. So let's say we have a distributed transaction, we have the prediction that we generated from the model in the beginning, it just executed this query here, and we anticipate that it's going to execute this next query here. Now our models include the estimated elapsed time in between these transitions at this partition. So when we go look in our queue to try to figure out which guy we want to execute, we want to make sure that it will finish within this time. But the second requirement is that we need to make sure we don't have any read-write or write-write conflicts. So let's say, for this speculation candidate, if we scheduled it, we would have this problem where the distributed transaction just read the value of this record X from the database at this partition. And then it gets blocked because it has to wait to execute some query at a remote node. But now if we execute this candidate here, it's going to write something that changes the value of that record X, the same one that the distributed transaction read. Normally this would be okay, except that when the distributed transaction resumes, it's going to try to read that value back again, and this is going to be a phantom read. It's going to be an inconsistent result. So this is a conflict that we can't allow to occur. So we're not going to choose to speculatively execute this guy; we're going to skip it and go on to the next one. This next guy will finish in time, and it doesn't have any conflicts, so we know it's safe to execute. So we're going to pull it out of the queue and execute it directly on top of the distributed transaction at that partition, as if it had gone through the normal locking process. >>: Can you prove that it's never ever going to have any conflict, or are you saying it's very unlikely and it will abort if [inaudible]? >> Andy Pavlo: I'm not claiming that we can do this for all stored procedures. For some things it's too complex to try to figure it out. 
But some basic things like: this transaction reads table foo, that one reads and writes table bar, so there's no conflict. So it's okay to do that. >>: Okay. >> Andy Pavlo: It's rules like that. Heuristics. >>: But my question is, so imagine there are two versions of these, you can imagine. >> Andy Pavlo: Yes. >>: So one is like, it's absolutely guaranteed by construction of these two transactions. They will not interfere. >> Andy Pavlo: Yes. >>: So again, to do, I potentially cannot even check afterwards what happened because>> Andy Pavlo: So we still can; we'll check at the end at the logical level. If a conflict occurs, we can abort and roll back. So we're never going to be unrecoverable. >>: And in the other cases, I might have a conflict, but there's a 0.001 percent probability of it. >> Andy Pavlo: Yes. >>: You can go ahead anyway and abort it? >> Andy Pavlo: Yes. And, sort of related to his question, when it finishes we will commit it, but not right away; we'll put its results in a side buffer because we need to wait to learn whether the distributed transaction actually finishes. Yes. >>: So I think I'm missing something. But if you know there are no conflicts, then you could commit the single partition ones right away, couldn't you? >> Andy Pavlo: The single partition transaction could have read something written by the distributed transaction, and therefore you have to hold it because you can't, you know, release them by>>: [inaudible]? >> Andy Pavlo: It's not a conflict in the sense of, like, I read something, and I try to read it twice, and I [inaudible] result. It's more that you don't want to release any of the distributed transaction's changes to the state of the database until you know that the distributed transaction has committed successfully. For some cases, yes: if you read something that the distributed transaction hasn't touched, then it's okay to send back the result. Yeah. >>: So the problem is that if I read something the distributed transaction>>: It depends on how finely you look>> Andy Pavlo: Correct. Yes. It depends on whether the distributed transaction has written anything before it got stalled. If it just read something, then it's okay. I'm just giving you an example of when you would get a read-write or write-write conflict. So normally, and this is related to Vivek's question in the beginning, this would be an unsafe thing to do, because if you're using these probabilistic models, there's a bit of randomness involved in picking what path you're going to take and trying to identify what data the transaction's going to end up reading and writing. But we can actually exploit a property of our stored procedures that makes these predictions more deterministic. So let's say a new transaction comes into the system and starts off at the beginning state, and again, we want to figure out what path we're going to take through this model, because that's what tells us what data we're going to read and write. So normally you would say, all right, here are the two transitions I can make from where I'm at now, so you roll the dice and, based on the distribution of the edge weights, go down one path or another.
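As a toy illustration of that "roll the dice" step, the sketch below samples the next query by edge weight; the states, query names, and weights are invented for illustration and are not from H-Store. The parameter mapping described next is what lets the system skip this sampling when the procedure's input parameters already determine which edge will be taken.

```python
import random

# Hypothetical fragment of a transaction's Markov model: from the current state,
# past executions took one of two next queries with these observed frequencies.
TRANSITIONS = {
    "begin": [("GetWarehouse", 0.8), ("GetDistrict", 0.2)],
}

def predict_next_query(state: str, rng: random.Random) -> str:
    """Probabilistic prediction: sample the next query by edge weight."""
    queries, weights = zip(*TRANSITIONS[state])
    return rng.choices(queries, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(42)
    print(predict_next_query("begin", rng))   # usually "GetWarehouse"
```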
But in this case here, we see that there's only one query that the transaction could ever execute, this GetWarehouse query. And in the GetWarehouse query, we see that there's an input parameter being passed in from the program logic of the stored procedure that's going to be used as the Warehouse ID, which is the primary key for this table. And since we partitioned this table on the Warehouse ID, if we know the value of this input parameter, we know exactly what record we are going to access. So we can generate a statistical mapping between the input parameters that are passed in from the application to the transaction and the input parameters of the queries. That means that since we know the value of the transaction's input parameters, we know the value of this input parameter for the query. And that tells us exactly what path we're going to take, what transition we're going to make, in this case here. And we can continue to do this all down the line. So again, I'm not claiming that we can do this for all stored procedures, but for a lot of them in the applications we look at, we see this pattern and we can exploit it. >>: Is this analysis done manually? Or do you have an automated- >> Andy Pavlo: It's automatic. Yeah. >>: What is the [inaudible] for this? Given [inaudible], how do you go about, so now you have the semantic model, the Warehouse ID is>> Andy Pavlo: Yes. >>: You have a semantics associated [inaudible] application, how do you>> Andy Pavlo: So we have these stored procedures and we have these workload traces, and there are different ways you could do this; you would end up with the same result. Static code analysis could tell you this, taint checking is another example of how to do it; we take a dynamic approach. We take previously executed workload traces and we say, all right, for this query executed within this transaction, how often does its input parameter value correspond to the input parameters of the stored procedure? If we see that there's a direct mapping, then we know we can identify those. Okay? So now, again, this makes our predictions more accurate, more deterministic. So now at runtime, when we want to commit a bunch of speculative transactions, instead of checking the read-write sets and the dependency graphs between them, we just need to verify that all our transactions, at the logical level, made the path transitions that we predicted they were going to make. And because our precomputed rules scheduled them so as not to conflict, and those rules are based on our initial predictions, then as long as our predictions are accurate we know there are no conflicts, and we can commit everybody all at once and we're happy. >>: Can you say what's the size of this set of rules? I mean, for every transaction type, how many different rules can come out, you know. Like how big is this space of things you're pre-computing? >> Andy Pavlo: Yeah. I don't have a good sense, I don't have a number. >>: But in the tens, in the thousands, in the millions? >> Andy Pavlo: In the hundreds. In the case of TPC-C, the low hundreds. Yeah. >>: The other question that I have is about the case of TPC-C as a benchmark. There are some assumptions in this work that may make sense in TPC-C, and I just don't know if they make sense for the types of applications that require this sort of scaled out lightweight transactions.
>> Andy Pavlo: I would say that TPC-C is actually very representative. >>: Is it? >> Andy Pavlo: Yes. >>: Do they typically have those kinds of conflicts and things like that? >> Andy Pavlo: In TPC-C, the workload, both in terms of the complexity and what the transactions actually do, is actually very representative of what I've seen in the industry. >>: I have to confess I'm a little bit surprised, because I'm thinking about Amazon shopping baskets or something like that, and it doesn't seem like there are likely to be a lot of conflicts, because you'd have to have conflicts within the same shopping basket. And that doesn't seem likely to me. There may be shopping baskets that are hot, but not necessarily transactions that are going to conflict with each other over those shopping baskets. >> Andy Pavlo: The shopping basket is a bad example because they don't actually use transactions for that. I can't give you a number to say that 80 percent of the workloads we see have these properties, but I would say that in the things that I've looked at, it's fairly common. But I don't have a way to quantify that just yet. >>: [inaudible] selections [inaudible] workloads [inaudible] really work well in the system. >> Andy Pavlo: Right. So again, I'm not trying to build a general purpose system. I'm very careful about saying that. Right? So, yes, there could be some things that you clearly would not want to use H-store for because they don't have these properties. But I would say that there are a significant number that do. >>: With the direction we want to be, do you get the impression that this research will apply to their customers? >> Andy Pavlo: Yes. We'll get to that. Okay. So let's go back to that TPC-C benchmark a third time. Again, we're going to have the same number of distributed transactions as before, and we're going to scale up the number of machines, and this time when we use H-store with HERMES, you can see that we get almost 60,000 transactions a second across four machines with eight cores apiece, right? So now, in terms of linear scalability, we're headed in the right direction. We're never going to be perfectly linearly scalable, because we're a bit conservative in our estimates: there could have been cases where we could have speculatively executed a transaction but we didn't, because we didn't want to cause any stalls. Yes. >>: Before, on the previous slide, you showed us TPC-C and you said something like 10 percent distributed transactions. So what's the fraction [inaudible] versus single partition here, and if I change that, what happens to this [inaudible]? >> Andy Pavlo: You're absolutely right. They're all executing the same absolute number of distributed transactions; the percentage varies because we're executing more single partition transactions. That's an astute observation. We're not actually doing anything to make each individual distributed transaction run faster. We're just able to do more work when we otherwise would be idle. It's future work to speed up each individual distributed transaction and reduce its latency. >>: Right. So specifically, you're able to speculatively execute the single partition transactions. >> Andy Pavlo: Yes. >>: So what fraction of the transactions here are single partition? I guess I'm just wondering.
Because it seems like you need to have basically a nice queue of single partition work that's ready to [inaudible] to exploit>> Andy Pavlo: Right. Yes. >>: So is that typical, and is that a dial you can turn here to show me what happens to the slope? >> Andy Pavlo: Yes. So again, if you add more distributed transactions, yes, you're going to be down here again. Yeah. >>: Okay. >> Andy Pavlo: You're absolutely right; I'm not claiming otherwise. It's future work for us to figure out how to keep the slope going up as you get more distributed transactions. >>: Well, specifically on TPC-C, [inaudible] specific transaction mix [inaudible] right? >> Andy Pavlo: It's 10 percent, yes. >>: Are you maintaining that>> Andy Pavlo: No. Again, it's the same number of distributed transactions for all three points within a single column. >>: And is that, the fixed 10 percent or whatever, TPC-C [inaudible]? >> Andy Pavlo: What's that? >>: The point of 60,000 for thirty cores, is it running TPC-C with the fixed 10 percent [inaudible] transactions? Or by running>> Andy Pavlo: It's the absolute number that's fixed. So say 10 percent of whatever this is, say that's 1,500; it's 1,500 or whatever it is for all of them, because we're executing more single partition transactions. >>: So you're not running 10 percent distributed transactions anymore. You're now down to, say>> Andy Pavlo: But the absolute number is the same. >>: I see what you're saying. >>: 10,000, and 10 percent of that is 1,000. You're up to 60,000. So [inaudible] 60. >>: Can you always tell whether you have a distributed transaction versus a single partition transaction? >> Andy Pavlo: In the case of TPC-C, in these workloads, I think we're almost always accurate, like 98, 99 percent. >>: TPC-C is very [inaudible]. >> Andy Pavlo: Well, for the other workloads that we look at, we do the same thing as well. TM1 is a little bit different because you do this>>: You can't always tell. So when you can't tell, you have to run [inaudible]. >> Andy Pavlo: When we can't tell, it's the same thing as getting a prediction wrong. We assume it's single partition, and we execute it as a single partition transaction. And then when it tries to touch something that we didn't know about, we just have to abort and restart it. So we actually compare what we're doing here versus what I'll call the optimal case, or the unsafe case. This is where you assume that there are no conflicts and you blindly pull out whatever is first in the transaction queue when you want to speculatively execute something, and you don't care about serializability; this is the best you can do. So we're not that far off from the optimal case, and we are maintaining serializability. So that's good. Okay. Yes. >>: So when you say [inaudible] distributed transactions, do you have an idea whether any of the [inaudible] distributed transactions are the same? >> Andy Pavlo: Yes. >>: But the transactions start touching more partitions. But I think in the case of TPC-C, it's typically two partitions, right? [inaudible]>> Andy Pavlo: Yes. >>: Distributed transactions. So say, for example, they start to use four-partition transactions or eight-partition transactions; do you expect [inaudible] or do you expect it to flatten out? >> Andy Pavlo: I suspect the performance would get worse. >>: Would get worse. Okay. >> Andy Pavlo: Yeah. >>: It's not only the [inaudible] distributed transactions, but how much>>: Is it?
If you allow a single one, the speculative one, to get in, I mean, depending on how much you're touching, right? >> Andy Pavlo: Yes. If you touch more partitions, for the HERMES case, I think it would be the same trend, because we're just executing more single partition transactions. So it's okay. >>: So the hidden assumption is that the workload is a sort of, there are different guys, some of which are trying to go as fast as they can on a single partition and some other guys trying to go as fast as they can on the distributed ones. >> Andy Pavlo: Yes. >>: If that is the case, this holds. Otherwise, [inaudible] the ratio. >> Andy Pavlo: Correct. Yes. >>: So in this particular mix, you had 10 percent cross-partition and 90 percent within a partition, right? >> Andy Pavlo: Yes. >>: So that means that if you just did a very simple optimistic concurrency control thing here, you would commit 90 percent of your transactions, since you have serialized access for those and they remain on the partition. Now, of course, you potentially could have a higher percentage of aborts for the things that span multiple partitions. Do you have a sense of how this compares to that solution? >> Andy Pavlo: So your question is>>: [inaudible] baseline support? >> Andy Pavlo: So I think you just execute the single partition transaction without any locks and just let it fly, right? And then if you have a distributed transaction that tries to touch something it doesn't have a lock for, you have to abort and restart it, or move it to another node, because you don't know when it comes in what node you should actually run it on. >>: And the next time around you're taking locks, right? >> Andy Pavlo: Yes. >>: Right. So I'm saying, suppose you just never do that. You just keep trying to rerun it, and maybe they even starve. Right? You haven't said anything about an SLA on being able to actually commit a transaction that you try to do, right? >> Andy Pavlo: Correct. >>: So it seems to me that for the 10 percent of the transactions, I think you're suggesting that if you run optimistically, you could rerun very fast, but you'll always be>>: The assumption>> Andy Pavlo: Let me>>: Let's [inaudible] check. So how many minutes do you have left? >> Andy Pavlo: I've got 10 minutes. >>: You have 10 minutes? >> Andy Pavlo: We can talk about this afterwards. >>: Maybe let's hold the questions till the end. >> Andy Pavlo: [inaudible] H-store has actually been commercialized as VoltDB; it's an open source project that's based on our code, and it's used in a couple hundred installations throughout the world today. So the Japanese American Idol example I talked about in the beginning, that's actually a real-world deployment of this system. And I found out a few weeks ago that we apparently power the Canadian version of American Idol, which I think is just called Canadian Idol. There are a lot of people using this system to do network traffic monitoring, in particular the government. The CIA and the NSA apparently love this system. They don't tell us what for, obviously, other than to say it's installed in every single telecommunications hub throughout the country. So that can make you feel warm and fuzzy at night. >>: That's the thing that scared me the most in the last 10 years. >> Andy Pavlo: Is what, those guys?
>>: You have your hands on>> Andy Pavlo: Apparently they run the same amount of data through VoltDB as they run through Accumulo, which is their version of Bigtable with the security stuff. Again, we don't know what they're actually doing with it. A lot of the high-frequency trading and high finance guys are using this system because they like the low latency and the high throughput. The Malaysian derivatives market runs entirely off of the system. And then a lot of online companies, like AOL's games.com, use this to [inaudible] the runtime state of players. And my personal favorite is that there is a shady offshore gambling website that I can't mention that uses the system for all their bets and wagers. So that's nice. So what did I talk about today? I started off by describing a system that's really optimized from the ground up to support these types of high throughput transactional workloads that we see a lot of. But then I showed how, if even a small fraction of the workload has to touch data in multiple partitions, the performance really breaks down. So then I showed you how to use probabilistic models to predict the behavior of transactions and do the correct optimizations to lock only the minimum resources you need in the beginning. And then I showed how to extend these models further to allow you to safely interleave single partition transactions whenever an engine is blocked because of a distributed transaction. So it's no longer a question of whether you can scale up and get better performance while maintaining transactions; what I showed today is exactly how to do that. So where do we go from here? For future work, I can [inaudible] say that there are probably five or six new projects that I'll be working on over the next couple of years to extend the performance of H-store for a bunch of different things. I'll loosely categorize these into two groups. First, I want to look for ways to improve the scalability of the system. I'm working with Aaron Elmore, who I think is coming here to do an internship with you guys this summer; he's a colleague at UCSB. We're looking at ways to have elastic deployments of the system. Right now, if you have a partition that's a hotspot and all the transactions are getting queued up there, there's nothing we can do to mitigate that. So we want to look at ways to identify that we're overloaded at a partition and maybe split it up into multiple pieces and migrate the data, or maybe coalesce partitions that aren't being actively used; there's a small sketch of that idea below. We are also looking at ways to exploit new hardware that's coming out. This is part of my involvement in the big data ISTC that's headquartered at MIT and sponsored by Intel. In particular, we are very interested in what kind of database system you want to build if you have some of the new non-volatile memory devices that are coming out, whether that's something like H-store or whether you want to try different concurrency control models; we want to explore that to figure out what's the best kind of system to have when you have a storage device that has the read and write speed of DRAM but the persistence of an SSD. And then we are also looking at how to exploit the new many-core architectures that are coming along.
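To give a flavor of the elastic deployment idea mentioned above, here is a minimal, hypothetical sketch of a threshold-based controller that decides when to split a hot partition or coalesce cold ones. The thresholds, metrics, and action names are invented for illustration; they are not part of H-Store.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PartitionStats:
    partition_id: int
    queue_depth: int          # transactions currently waiting at this partition
    accesses_per_sec: float   # recent access rate

# Hypothetical thresholds; a real controller would derive these from measured capacity.
SPLIT_QUEUE_DEPTH = 1000
COALESCE_ACCESS_RATE = 50.0

def rebalance_actions(stats: List[PartitionStats]) -> List[Tuple[str, object]]:
    """Split hot partitions, and pair up cold partitions for coalescing."""
    actions: List[Tuple[str, object]] = []
    cold: List[int] = []
    for s in stats:
        if s.queue_depth > SPLIT_QUEUE_DEPTH:
            actions.append(("split_and_migrate", s.partition_id))
        elif s.accesses_per_sec < COALESCE_ACCESS_RATE:
            cold.append(s.partition_id)
    # Merge cold partitions two at a time onto fewer nodes.
    for a, b in zip(cold[::2], cold[1::2]):
        actions.append(("coalesce", (a, b)))
    return actions
```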
And then the thing that I'm probably the most excited about that's coming out in the next year, which is sort of similar to your glacier project here on the Hekaton system, is a new system model that we've been working on called anti-caching. So with anti-caching, although I've just spent the last hour talking about a main memory database system, we're going to go ahead and add the disk back in. And we're going to use the anti-cache as a place to store cold data, right? Data we're probably not going to need any time soon. So we're going to monitor how much memory is being used at a partition, and when we go above a certain threshold, we're going to evict data that has not been used recently or is unlikely to be used in the future, and we're going to store that in a block-storage hash table out on disk. And we still have to maintain index information and catalog information in main memory for all the data that we've evicted. So the main difference between a traditional database system and the anti-caching model is that in a traditional system, a single record can exist both on disk and in memory at the same time, whereas in our model it's either one or the other. So now, when a transaction request comes along that only needs to touch data that's entirely in main memory, we execute it just like before without any problems. But as soon as it tries to touch data that's been evicted, we put it into a special monitoring mode where we keep track of all the records it tries to touch that are out in the anti-cache, and when it tries to actually do something with that data, either modify it or send it back to the application, we're going to abort it and roll back any changes that it made. Then, in a separate thread, we asynchronously go fetch the blocks that it needs, merge that data into main memory, and all the while we can still execute other transactions and not block them. Once we know that the data has been merged in, we update our indexes and we restart this transaction, and it runs just like before. So again, the key thing is that we are now able to support databases that are larger than the amount of memory on a single machine. We've done some initial experiments where we've taken TPC-C, this time with 100 percent single partition transactions and a 10 gigabyte database, and we vary how much memory we give the system. We see that when we use MySQL, the performance isn't so great. Using MySQL with Memcached for certain read operations, the performance isn't that much better, because TPC-C really can't take advantage of the caching properties, the read/write properties, that you get from Memcached. But when we use H-store, you see that performance does not degrade that much, even for databases that are eight times the amount of memory available to the system at a single node. So this is pretty exciting work. The other thing we are looking to do is expand the types of workloads we can run on the H-store system. With a colleague at Brown, we are looking at how to integrate stream processing primitives, or continuous queries, as a first-class entity directly in the system, so that we have a mix of streaming data and transactional data and can sort of intertwine them together.
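Below is a rough, hypothetical sketch of the anti-caching control flow just described: evict least-recently-used tuples to an on-disk block table once memory crosses a threshold, and abort and restart a transaction that touches evicted data after its tuples have been merged back in. The class and field names are illustrative only, not H-Store's implementation.

```python
from collections import OrderedDict

class AntiCachePartition:
    """Toy model of one partition: hot tuples live in memory, cold tuples in an
    on-disk block table (modeled here as a plain dict)."""

    def __init__(self, memory_limit: int):
        self.memory_limit = memory_limit
        self.in_memory = OrderedDict()   # key -> tuple, maintained in LRU order
        self.evicted = {}                # key -> tuple, stand-in for the disk blocks

    def put(self, key, value):
        self.in_memory[key] = value
        self.in_memory.move_to_end(key)
        self._maybe_evict()

    def _maybe_evict(self):
        # Evict least-recently-used tuples once we exceed the memory threshold;
        # the in-memory index/catalog still knows the key, so we can find them later.
        while len(self.in_memory) > self.memory_limit:
            cold_key, cold_val = self.in_memory.popitem(last=False)
            self.evicted[cold_key] = cold_val

    def read(self, key):
        if key in self.in_memory:
            self.in_memory.move_to_end(key)
            return self.in_memory[key]
        if key in self.evicted:
            # A real engine would abort the transaction, fetch the block
            # asynchronously, merge it into memory, and then restart the txn.
            self.in_memory[key] = self.evicted.pop(key)
            self._maybe_evict()
            raise RuntimeError("restart transaction: tuple was in the anti-cache")
        raise KeyError(key)
```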
And then, related to all the earlier discussion about distributed transactions, we want to look at how we can improve the performance of the system for workloads that are not easily partitionable. As we add more distributed transactions, how can we make the system perform better, by doing things like batching and other techniques to amortize the cost of acquiring these locks at all the different partitions? I'm also looking at getting back involved in working with scientists to apply database techniques to scientific workloads. This is possibly building a whole separate system that's really optimized for their types of workloads, because a lot of them are just running Fortran scripts or C scripts, or running MPI programs, that don't take advantage of the kinds of things that we know about in the database world. And then lastly, the next major trend I see, which I'm sure you guys are aware of here, is adding support in these front end main memory systems for real-time analytical operations. So we want to be able to do the longer running queries that have to touch more data without having to run them as distributed transactions that slow down the entire system. A lot of the game companies care about this kind of thing: the front end system is maintaining all the runtime data for players, and they want to be able to say, here are the top 10 players for a bunch of different metrics. So instead of shoving that off into your backend system or Hadoop or whatever, running the analytical query there and then shoving the result forward, we want to be able to do that directly in the front end. And to do this, we want to apply techniques like hybrid storage models, hybrid nodes, and relaxed consistency guarantees for these analytical queries. These aren't new techniques, but we want to come up with clever ways to combine them together and, again, leverage other machine learning techniques to make the system do this automatically. That's really the main takeaway for what my research is all about: having the database system know as much as possible about what the application and the workload are trying to do, and having the system go faster and get better performance because of it. So I want to thank everyone for coming today; I'll be happy to answer any more questions, and there's also a website set up for the H-store system with more information and documentation about all the things that I talked about today. So thank you. >>: Any questions? >>: Have you measured TPC-E? >> Andy Pavlo: So, yes. TPC-E won't work well on this. You're absolutely right. I will say, though, that from what the VoltDB guys tell us, there are very, very few workloads that they see that are as complex as TPC-E. TPC-C is actually very representative of what's out there in terms of size and complexity. But you're right, TPC-E won't work well on this system. And that's one of the things we want to look at in the future. Yes. >>: So typically, say, it was designed as a benchmark and it's got uniformly distributed data>> Andy Pavlo: Yes. >>: Carefully designed, which is not true in most cases>> Andy Pavlo: Correct. Yes. >>: And so the partitioning will likely result in hot spots [inaudible]. >> Andy Pavlo: Right. Yes.
>>: And the other thing is it was designed with Oracle in mind; it's also designed [inaudible] using snapshot isolation. So TPC-C is an interesting benchmark, but it certainly isn't conclusive in terms of anything. >> Andy Pavlo: So, and I didn't talk about this, we've done other experiments: Carlo and I have a whole benchmark suite with a bunch of different OLTP workloads, and we've tried them out. And again, TPC-E is an excellent example of something that won't work well on this. Anything that looks like a social graph, Twitter, Facebook, that stuff, won't work well on this. But again, I would say from what I've seen in industry, working with the VoltDB guys, TPC-C is actually very representative of a large number of applications where they're simply not getting the performance they want using TimesTen or other things, and they're switching to something like VoltDB. All right. Thanks, guys.