>> Phil Long: Welcome everybody. It is my pleasure to introduce Ashaf Aboulnaga who was visiting here from University of Waterloo. I think a lot of people here know Ashaf for a variety of reasons. He is in a pretty broad range of work in the database field and the last few years he has moved into new areas focusing on cloud computing and data integration of web data. In particular, part of the work he is going to describe today on cloud computing was in a PVLDB paper just a couple of months ago that won best paper award there, which by the way, was incorrectly listed as PVLDB 2012 in the advertising that went around. Maybe he'll have one there too; who knows? We'll see. [laughter]. But anyway, without further ado let me turn it over to Ashaf. >> Ashaf Aboulnaga: Thank you Phil. Good morning everybody and thank you for coming to my talk. So the title of my talk is High-Availability for Database Systems in Cloud Computing Environments. More generally it is about providing a database service in the cloud. Before I start the technical material, let me start with some thanks that are due. This is joint work with my colleague Professor Ken Salem at the University of Waterloo. The first part of the work is done by my student Umar Farooq Minhas who was an intern at Microsoft Research a couple of times before, and he is graduating next year and needs a job. So if you guys have jobs to offer [laughter] Umar is your guy. The second part of the talk is the work of Rui Liu, who is a postdoc jointly supervised by Ken and me and he is also finishing in 2012 and also needs a job [laughter]. The first part of the talk is in collaboration with people from the University of British Columbia, Professor Andrew Warfield and his graduate students. So as I said this talk is about, or my interest in this area is about using cloud technologies plus SQL database systems to build scalable and highly available database service in the cloud. 
Another interest I have in this area of cloud computing is: what if I don't care about a scalable, highly available database service? What if I have my own instance of a database system and I want to deploy it in the cloud or I want to run it on a virtual machine? How can we improve the way database systems interact with these virtual machines, and also how can we improve the underlying cloud technology? Can we add APIs? Can we add new features to the cloud technologies to better support database systems? So why database systems? I mean why am I focusing on this class of application? Well they are important; databases are important. And looking at the people in the crowd, I guess we all agree that this is not something that I have to spend too much time convincing you guys of. But the question is why do I think that we can solve this problem? Why do I think that focusing on database systems can get us somewhere? One reason is that database systems have a narrow interface with the user. They speak SQL. So when we need to reason about a database system we don't need to reason about general programs in a language like Java or C#. We need to reason about SQL. And secondly, database systems have transactions that offer very well-defined semantics for consistency and durability. And thirdly, database systems have these very well-defined structures, very well-defined operators. So regardless of the database system you can always count on it having a buffer pool. You can always count on it executing queries as trees of operators like hash join, sort-merge join and so on. And finally database systems are very introspective. They have accurate performance models of their query execution costs. They have accurate performance models that tell you what the marginal benefit of some extra memory will be, and if you can expose these models to the underlying cloud technology, we might get some benefit from there. 
So for all of these reasons I think it is promising to look at database systems as a specific class of applications and tune specifically for them. In this talk you will see that I rely heavily on the semantics of transactions to decide when things have to be consistent and when they don't need to be consistent. And also that I rely on the fact that we have these well-defined structures and present optimizations that are tuned towards things like the buffer pool. So I will talk about two projects. The first is RemusDB which is the PVLDB paper that Phil was talking about. And RemusDB aims to provide database high-availability using virtualization. In the second part of the talk I will talk about a new project called DBECS which also aims to provide database high-availability but also scalability and elasticity by relying on eventually consistent cloud storage. So let's start with RemusDB and high-availability using virtualization. So first of all what do I mean by high-availability in this context? What I mean is resilience to hardware failure. I know that often people who talk about high-availability talk about scheduled downtime and unscheduled downtime. Here we are focused more on the unscheduled downtime. We focus on resilience to hardware failure. And high-availability is becoming an important requirement for all kinds of applications. It is not just high-end enterprise mission critical applications that have to be highly available these days. Everything has to be highly available. Every Facebook game has to be highly available. Nobody expects anything--nobody is willing to accept anything less than 24-7 uptime. And when we talk about high-availability we have to talk about several issues, so in the context of databases we need to maintain database consistency in the face of failure. We need to minimize the impact on performance of the high-availability solution both during normal operation and also during failover. 
And finally, we have to reduce the complexity and the administrative overhead of high-availability. Now high-availability has been around for a while. There are many high-availability solutions out there, and one common way to do high-availability is to do active standby replication. So with active standby replication you run two copies of your database, one on a primary server and one on a backup server. And the primary is the active server. It accepts user requests and performs queries. The backup is a standby server, and the primary ships changes to the database to the backup by propagating the transaction log, and the backup is busy applying these changes, so that when the primary fails, the backup can take over as the primary and it takes over with a consistent database state. If solutions like this exist, why are we still working in this area? We are still working in this area because active standby replication is complex to implement in the DBMS and also complex to administer. You need to worry about things like how to propagate the transaction log in a transaction-consistent way. You need to worry about the [inaudible] of handover from primary to backup. You need to worry about redirecting client requests. When the primary fails, the backup takes over as primary. The database is consistent, but you need to tell the clients that now instead of talking to this guy, they need to talk to this guy. And you need to minimize the impact on performance. For example, when the failover happens, you want the failover to happen to a backup with a warmed up buffer pool. So the solutions are complex. And what we are aiming to do in this work is to push this complexity out of the database system and into the virtualization layer. If I wanted to invent a buzzword for this, I would call it high-availability as a service, in which we can make any database system highly available as long as we're running it in a virtual machine. 
We want to do this with as little performance overhead as possible. So the idea is still to use active standby replication. We still have a primary server and a backup server, but now the primary server is running a virtual machine and the database system is inside the virtual machine. We still propagate changes from the primary to the backup, but we don't just propagate changes in the database state; we propagate changes in the entire virtual machine state, which includes the buffer pool, which includes client connection state, so that when the primary fails, the backup can take over as primary. But now it takes over with a warmed buffer pool and clients get failed over automatically to the backup, and this all happens with no code changes to the database system. Yes? >>: [inaudible]? >> Ashaf Aboulnaga: Yes. >>: [inaudible] performance [inaudible]? >> Ashaf Aboulnaga: There is a performance penalty for running the database in the virtual machine but we didn't really study that penalty in this work. We assume that you are willing to run your database system in the virtual machine. There are people working on--there is a continuous process of reducing the penalty of running databases in the virtual machine, but this is part of the equation. It is not the part of the equation that we focused on in this work; what we focused on is making this virtual machine highly available. Yes? >>: Question on the [inaudible] how we [inaudible] the plans? >> Ashaf Aboulnaga: We will talk about that next. RemusDB is built on Remus which is a project that was developed at the University of British Columbia and now is part of the Xen hypervisor. Remus does this picture. It maintains two copies of the virtual machine, one on a primary server and one on the backup server, and it periodically replicates the state changes from the primary virtual machine to the backup virtual machine using whole machine checkpointing. 
And this whole machine checkpointing extends live virtual machine migration. So things like failing over the clients from the primary to the backup are handled by live VM migration. The transparency of failover is handled by live VM migration. So Remus offers transparent failover with only seconds of downtime. So did you have a question? >>: Yes. You're talking about a backup to the [inaudible] all of the pages in this cache are also in this cache over there? >> Ashaf Aboulnaga: In our case this is what it will achieve. The two virtual machines will be exact replicas of each other. >>: And that goes for all of the… >> Ashaf Aboulnaga: The whole virtual machine state. So how does Remus do this whole machine checkpointing? How does Remus achieve high availability? So Remus divides time into epochs. The epoch length is a tunable parameter, but think of it as 25 milliseconds. So Remus lets the primary run uninterrupted for 25 milliseconds. There is no lockstep execution between primary and backup here. And at the end of this epoch, Remus performs a checkpoint in which it suspends the primary virtual machine and copies state changes from the primary virtual machine to domain zero. For those of you who are not familiar with Xen terminology, domain zero is a privileged virtual machine that exists on any physical machine running Xen and it is the administrative domain. It is in the state of the virtual machine. So you copy the state changes to domain zero and after this copy is done, the primary virtual machine can be resumed. And then asynchronously domain zero copies the checkpoint to the backup server where the backup server applies it to its state. Here is an example of Remus checkpointing. This is an example showing three epochs, A, B and C and at the end of every epoch there is a checkpoint that is taken. In this example the primary machine fails during epoch C. So the primary machine fails. 
The backup takes over and it resumes execution from the latest checkpoint. So work that the primary did in epoch C is lost, and it is okay to lose this work as long as you don't expose output to the user, because we don't want to expose unsafe output. So the way Remus handles that is that it buffers any output that is exposed to the user until the end of an epoch. So let's focus, for example, on network packets. Whenever the machine that is protected by Remus wants to send a network packet, that network packet is buffered until the checkpoint at the end of the epoch happens and then the state becomes safe. At that point the packet is released to its user. >>: [inaudible] every little packet, released it [inaudible] >> Ashaf Aboulnaga: On average 12 1/2 milliseconds, yes. Exactly. And that is actually part of the reason we looked at these, because Remus is there and you can run a database system inside Remus, but then if you look at what Remus does to database workloads. Without Remus protection you have a database server and the client sends a query to the server. The server processes the query and returns a response that is unprotected and you have some response time. Now if you enable Remus protection, making checkpoints adds a certain overhead and network buffering adds even more overhead. So you end up with a protected server but you get a much bigger response time. And the overhead of protection we measured it to be up to 32% in some cases. So our goal in this work is to implement optimizations that are database inspired to reduce this overhead and in the end we were able to achieve less than 3% overhead and recovery within 3 seconds. So what I want to talk about now is the optimizations we do for Remus. Yes? >>: Once you reach a checkpoint, you wait until the backup applies before proceeding? >> Ashaf Aboulnaga: You wait until the backup acknowledges that it received the checkpoint. So as long as the checkpoint is here you can proceed. 
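The 12 1/2 millisecond figure follows from the epoch length: a packet produced at a random point in a 25 millisecond epoch waits, on average, half an epoch before the checkpoint releases it. The following Python sketch simulates that under simplifying assumptions that are not from the talk: packets are generated uniformly at random in time, and the checkpoint copy and the backup's acknowledgment take zero time. The function names are illustrative and are not part of Remus.

```python
import random

EPOCH_MS = 25.0  # epoch length quoted in the talk


def release_time(send_time_ms, epoch_ms=EPOCH_MS):
    """Time at which a buffered packet is released: the end of the epoch
    in which it was generated (checkpoint cost is ignored here)."""
    epoch_index = int(send_time_ms // epoch_ms)
    return (epoch_index + 1) * epoch_ms


def average_buffering_delay(n_packets=100_000, seed=0):
    """Average wait of packets generated uniformly over one second."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_packets):
        t = rng.uniform(0.0, 1000.0)
        total += release_time(t) - t
    return total / n_packets


if __name__ == "__main__":
    print(f"average added latency: {average_buffering_delay():.2f} ms")
```

In a real deployment the added latency would be somewhat higher than this idealized half epoch, because the packet is only released once the backup has acknowledged the checkpoint.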
>>: If you don't wait until the backup is applied to the secondary, then there is a possibility that if both machines go down you lose state? >>: Yes. So here we are tolerating one failure. >>: Is it expensive? >> Ashaf Aboulnaga: In theory it is, but this is not something that either we or the original authors of Remus have actually explored yet. So let me talk about the optimizations that we implement. So when we look at Remus and why it is slow for database workloads, we see that it is slow for two reasons. One is that database systems are heavy users of memory, so we implemented some optimizations that reduce the overhead of checkpointing virtual machines where the memory is being heavily used. And I am going to talk about two optimizations, asynchronous checkpoint compression and disk read tracking. Asynchronous checkpoint compression aims at sending less data during checkpoints and disk read tracking aims at protecting less memory. Another reason why Remus is slow for database workloads is this 12 1/2 milliseconds delay that is added to every network packet. Some database workloads, in particular transactional workloads, where there is a lot of back-and-forth between the client and the server, are very sensitive to this network latency. The last optimization implemented for Remus is to exploit the semantics of database transactions to avoid this overhead when we can. So let me talk about the memory optimizations first. If you look at the way database systems use memory, you see that there are large sets of frequently changing pages of memory. One example of that is the buffer pool. But you can also think of the memory where the connection state of the client is stored. And if you look at the way the database system uses that memory, in many cases it is modifying a small portion of the page. So if you look at the buffer pool in particular, you see that the database system will often modify a few records in a buffer pool page. 
There is a lot of memory that is being checkpointed, which results in a lot of replication traffic between the primary and backup. And there is redundancy in this replication traffic because you are modifying the same buffer pool pages over and over again and every time there is a modification in a small part of the page. When we looked at this we said well, instead of sending these redundant pages over and over again, send the delta of the pages and send them compressed. And the way we implement that is we maintain a cache in domain zero that contains the most recently seen dirty pages we get from the protected virtual machine. So as part of checkpointing, the protected virtual machine sends the dirty pages to domain zero and domain zero looks in the cache. If these pages are found in the cache, then it does a delta between the original page and this page, compresses the deltas and sends them over to the backup compressed. If the page is not found in the cache, it is sent as a whole page, without the delta. And this cache is maintained as an LRU cache, so the most recently used pages stay in the cache and the least recently used pages are kicked out. Robbie? >>: What kind of [inaudible] you are running the [inaudible] milliseconds? >> Ashaf Aboulnaga: Yes. >>: So you are [inaudible] keep the cache for all the pages through the delta and [inaudible]? >> Ashaf Aboulnaga: So what we do is in our implementation we took 10% of the memory that is available for domain zero and devoted it to this cache. >>: [inaudible]. >> Ashaf Aboulnaga: It is. And one important thing to note about this delta, about this compression is that this compression is done asynchronously in domain zero. So it is asynchronous checkpoint compression. So there may be overhead by doing that, but it is overhead that is incurred by domain zero. It is not on the critical path of the protected database system. >>: [inaudible] until it sends out and gets an acknowledgment for the… >> Ashaf Aboulnaga: No. 
It will--as soon as the pages are here, the virtual machine can continue. >>: I see. >> Ashaf Aboulnaga: Buffered network packets are not released until the checkpoint is sent to the backup, but the protected virtual machine can continue executing. Now computational overhead in our implementation whenever it could offload work from the protected VM to domain zero we consider that as a win. So we were not too worried about domain zero spending time taking compressing these pages and sending them over to the backup because we assumed that there is sufficient CPU capacity for domain zero. Yes? >>: [inaudible] this optimization, does it have any thing to do with [inaudible] per se or is it…? >> Ashaf Aboulnaga: It is inspired by the buffer pool, but it is applicable to any workload where you have a redundancy in the replication stream. >>: So in some respects, Remus running the database, is that unchanged through Remus running arbitrary applications? So all of the changes occur in domain zero, is that what you're saying? >> Ashaf Aboulnaga: In domain zero and there is a little bit of change in the virtual [inaudible], but the database system should not be changed all. >>: And the VMM that is running the [inaudible] is not changed? >> Ashaf Aboulnaga: Is not changed. >>: [inaudible] virtual machine? >> Ashaf Aboulnaga: It can. It is not something that we studied. Yes? >>: So the assumption that you have is that to begin with there is no competition between the two virtual machines, which means that you can count anything that this one does as a win for them? >> Ashaf Aboulnaga: Yes. >>: Does it hold usually in systems? >> Ashaf Aboulnaga: If you have a system with a sufficient CPU--if you have enough CPUs that it can dedicate CPUs to domain zero that assumption would hold. >>: When you [inaudible] you need to get a snap shot of the virtual impression [inaudible]. You need to delay [inaudible]? >> Ashaf Aboulnaga: Yes. 
So while this stuff is being taken, the virtual machine is suspended. That is the synchronization that we do and when the data is copied, the virtual machine is [inaudible]. And then the rest of the checkpointing can happen asynchronously. >>: So how long does it take to finish this? >> Ashaf Aboulnaga: How long is the duration of the virtual machine being suspended? I don't have that number off the top of my head, but it is a small number of milliseconds. So the second optimization implemented is again an optimization that is inspired by the buffer pool. So if you look at the way that database systems read data from the database, you have an active virtual machine and a standby virtual machine and both of them have their own disks and there is a copy of the database on each disk. When the database system loads a page from disk into its buffer pool, it looks clean to the database system, but it is dirty to Remus. So Remus will synchronize these dirty buffer pool pages to the backup with every checkpoint. And disk read tracking is based on the fact that synchronizing these clean buffer pool pages is not necessary because the backup can always read them from its copy of the database. So what we are doing in this optimization is we track--for any disk read--again, this is not specific to the buffer pool. It is inspired by the buffer pool, but what we do is for any disk read we track the memory pages into which the read data is copied, and we don't mark them as dirty. We don't send them in checkpoints. What we do is we send an annotation in the replication stream telling the backup you should read these pages from your copy of the database and put them in your memory to reconstruct these pages. And the backup can do this read lazily, so we only need to do this read from disk when a failover happens, but in RemusDB what we do is we read this data periodically so that we can shorten failover time. 
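The two memory optimizations described so far, checkpoint compression against an LRU cache of recently seen pages and disk read tracking, can be sketched together in a few lines of Python. This is a toy model under stated assumptions, not the RemusDB implementation: the delta is an XOR of the old and new page compressed with zlib, pages are 4 KB, and the names `DeltaCache`, `note_disk_read` and `apply_message` are hypothetical. The talk does not say how the delta is computed or how read annotations interact with the cache; here a tracked disk read simply invalidates the cached copy so that a later delta is never taken against stale contents.

```python
import zlib
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size


class DeltaCache:
    """LRU cache of the last contents sent for each page (the domain zero side)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page number -> last contents sent

    def encode(self, page_no, data):
        """Turn a dirty page into a message for the backup."""
        previous = self.pages.get(page_no)
        if previous is None:
            message = ("full", page_no, data)  # cache miss: send whole page
        else:
            delta = bytes(a ^ b for a, b in zip(previous, data))
            message = ("delta", page_no, zlib.compress(delta))
        self.pages[page_no] = data
        self.pages.move_to_end(page_no)      # mark as most recently used
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict least recently used
        return message

    def note_disk_read(self, page_no, disk_block):
        """Disk read tracking: send an annotation instead of the page."""
        self.pages.pop(page_no, None)        # cached copy is now stale
        return ("read", page_no, disk_block)


def apply_message(memory, disk, message):
    """Backup side: rebuild the page in memory from one message."""
    kind, page_no, payload = message
    if kind == "full":
        memory[page_no] = payload
    elif kind == "delta":
        delta = zlib.decompress(payload)
        memory[page_no] = bytes(a ^ b for a, b in zip(memory[page_no], delta))
    else:  # "read": fetch the block from the backup's own copy of the disk
        memory[page_no] = disk[payload]
```

A page that changes in only a few bytes produces a delta that is almost all zeros, which is why it compresses so well; a page that was merely read from disk costs only a small annotation.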
So these are--yes? >>: [inaudible] cannot be transparent, right? That particular optimization needs the protected virtual machine to be able to talk to Remus somehow and tell it to mark these… >> Ashaf Aboulnaga: So the protected virtual machine does a read, and Remus figures that out. Remus gives the data back to the virtual machine and at the same time Remus sends over to the other Remus on the backup virtual machine an annotation saying you should read these pages and put them in your memory. >>: How does it know the database… >> Ashaf Aboulnaga: It doesn't care. It is a disk read that was read from disk and put in some page of memory. >>: [inaudible]. >> Ashaf Aboulnaga: Okay. >>: So the assumption, even in the [inaudible] reaches this one at all, the virtual machine has an exact replica, the same copy of the data on database. >> Ashaf Aboulnaga: Yes. >>: Picks up the data on the database… >> Ashaf Aboulnaga: Yes. Primary and the backup are replicas of each other including the local disks. >>: So are these pages sent to domain zero and then domain zero does the detection or is there something else going on? >> Ashaf Aboulnaga: Yes. That is the way Xen does reads. Domain zero is involved in reads and it does the detection. >>: So again, then it is not necessary--so when you are doing checkpointing at the 25 millisecond interval and you are suspending the machine briefly, shipping over the stuff, how in that process do you distinguish the pages that are changed via a disk read from those that changed for updates, or don't you? >> Ashaf Aboulnaga: You don't. The way Remus originally worked was that it marked all pages as read only and the first time the protected virtual machine modifies a page, it raises an exception, and Remus would detect that and say oh, this page now needs to be copied over. 
So what we do with this optimization is that if a page has been modified because data has been read into it from disk, we don't mark it read only. We don't--we basically just send an annotation that here is a new page that has been read and the backup should read it from its copy of the disk. >>: So I think I was just misunderstanding, so you have the database in a virtual machine and then you're going to go through the checkpoint process. But the checkpoint process isn't part of the virtual machine, it's part of the system underlying the… >> Ashaf Aboulnaga: It is part of this. I keep going back to this drawing. It is part of this here, part of domain zero and the hyper [inaudible]. I promise never to go back to this machine again [laughter]. These two optimizations that I described reduce the overhead of checkpointing memory and they are completely transparent to the database system. Now when you look at the network, we saw that there is an opportunity for optimizing the way Remus deals with network packets and we are left with an optimization that is not transparent to the database system. So if you look at the way Remus handles network packets, Remus buffers every outgoing network packet, which ensures that clients never see unsafe execution. But it adds up to three orders of magnitude to the latency of every network packet, because for Remus we are assuming that the primary and the backup are on the same network, so 12 and a half milliseconds of average latency is really high. And this is the largest source of overhead for [inaudible] and in particular transactional workloads. And our observation is that this is unnecessarily conservative for databases, because database systems have their own transactions with clear consistency and durability semantics, so we don't need these TCP level per checkpoint transactions that Remus has. 
So what we did is we added to Remus an interface that allows the database system to say that these packets need to be protected, that is, buffered until the next checkpoint, and these packets don't need to be protected. And this is exposed to the application via a Linux setsockopt option. So you have a socket and there is a flag that is associated with every socket that says this socket is protected or it is unprotected. And the way the database system uses this switch is that it only protects transaction control packets: begin transaction, commit, abort. These have to be protected. All other packets are sent unprotected, which means that the client sees unsafe state. So if a client sees unsafe state, what happens when the primary fails? After failover, a failover handler runs on the backup virtual machine in a failover handler thread, and that failover handler aborts all in-flight transactions where the connection to the client was not in protected state. So database systems are allowed to abort transactions, so to get a significant boost in performance during normal operation we pay the small cost of aborting extra transactions on failover. This is not transparent to the database system. We need to toggle this socket between protected and unprotected state and we need to do the aborting of in-flight transactions after failover, and we actually implemented this in both PostgreSQL and MySQL and ended up having to modify maybe 100 lines of code in each system. So let me show you how this works. So this is our experimental setup. It is exactly the picture I was showing. We have a primary server, a backup server connected via a high-speed network and we are running MySQL and PostgreSQL. Yes? >>: [inaudible] does this modification include the failover handler that [inaudible]? >> Ashaf Aboulnaga: Yes. This one hundred lines of code includes everything. 
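The commit protection rule just described can be modeled in a few lines. The Python sketch below is a toy model and not the PostgreSQL or MySQL patch: it does not call setsockopt (the talk does not name the socket option), and the `Session` class, the `verb` helper and the rule that a session starts in protected state are assumptions made for illustration. It only captures the two pieces of logic from the talk: replies to transaction control statements are buffered until the next checkpoint while all other replies go out immediately, and after failover any in-flight transaction whose connection was in unprotected state is aborted.

```python
# Statements whose replies must be buffered until the next checkpoint.
# The talk names begin, commit and abort; ROLLBACK is included here as the
# common SQL spelling of abort (an assumption).
TRANSACTION_CONTROL = {"BEGIN", "COMMIT", "ABORT", "ROLLBACK"}


def verb(statement):
    """First keyword of a statement, upper-cased ('' if the statement is empty)."""
    words = statement.strip().rstrip(";").split()
    return words[0].upper() if words else ""


def needs_protection(statement):
    return verb(statement) in TRANSACTION_CONTROL


class Session:
    """One client connection as seen by the database server."""

    def __init__(self, name):
        self.name = name
        self.in_transaction = False
        self.protected = True  # assumed initial state of the socket

    def reply_to(self, statement):
        """Toggle the socket state for this reply and say how it is sent."""
        v = verb(statement)
        self.protected = needs_protection(statement)
        if v == "BEGIN":
            self.in_transaction = True
        elif v in TRANSACTION_CONTROL:  # commit or abort ends the transaction
            self.in_transaction = False
        return "buffered" if self.protected else "immediate"


def sessions_to_abort(sessions):
    """Failover handler: in-flight transactions on unprotected connections."""
    return [s.name for s in sessions if s.in_transaction and not s.protected]
```

For example, a session that has received an unbuffered reply to a query inside an open transaction would be aborted on failover, while a session whose last reply was the buffered acknowledgment of BEGIN or COMMIT would not.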
So first of all, can we do failover? Here I am showing you TPC-C on MySQL and on the x-axis I am showing time and on the y-axis I am showing throughput, transactions per minute C. The green line is an unprotected virtual machine. In our setting we can get sustained throughput of around 400 transactions per minute. If we run unmodified Remus we get this red line. So there is a significant performance overhead. If we run our modified version, RemusDB, we get this blue line. We significantly reduce overhead. Now in this experiment, halfway through the experiment we failed the primary server. We actually pulled the plug on the primary server. And the unprotected virtual machine can't proceed beyond that; throughput drops to zero. It is not available anymore. Both protected virtual machines proceed with very little downtime and they actually achieve peak performance, the same performance as the unprotected virtual machine. Why did the performance jump this way? Because now the backup has taken over as the primary and it is not protected anymore. So we are not advocating that you fail your machine to get better performance, but [laughter] what we are seeing here is that if you are not protected, you don't pay the cost of protection. So this is failover. Now let's look at overhead here at [inaudible] operation. So this is TPC-C on PostgreSQL and here if we run unmodified Remus we see this 32% overhead so our performance is .68 of an unprotected virtual machine. If we use RemusDB's transparent memory optimizations, we recover most of that performance. But if we also use the non-transparent commit protection we get back to almost the same performance as an unprotected virtual machine. We get back to 97% of the performance of the unprotected virtual machine. Now the picture is a little bit different with TPC-H. TPC-H doesn't have this back and forth between client and server, so it doesn't benefit that much from commit protection. 
It doesn't benefit that much from the non-transparent optimization, but we do get a significant performance boost using the memory optimizations. So that is RemusDB. What we achieved is high availability for [inaudible] system with no code changes or with very little code changes if we use these non-transparent optimizations, and we now have automatic and fully transparent failover to a warmed up system. Now the next steps in this project are we are looking at re-protecting a virtual machine after it fails, so we can tolerate one failure at a time, but once we are in the state where the virtual machine is unprotected, we can quickly go back to a state where it is protected. And one of the nice things about Remus is that the backup doesn't have to do a lot of work during normal operation. It is just applying checkpoints. So one possibility we could explore is to have one server serve as the backup for multiple primary servers. And finally there are some administrative questions that arise when you protect a database system with RemusDB. For example, how much network bandwidth do we need between the primary and the backup? And we are looking at answering some of these questions in our current work. So Dave? >>: So if you are say a cloud service provider and you are deploying this system in the cloud you've got a bunch of servers and you are trying to get as much out of your system as possible. You are going to end up being very concerned about performance of domain zero and so the question to ask is have you measured what the overhead is on domain zero? >> Ashaf Aboulnaga: No. In our experiments, on the physical host we were running one protected virtual machine and one domain zero. So the overhead that one virtual machine places on domain zero is very low. 
The situation would be different if we had 10 virtual machines on the physical server and that is not something in our experiment, so it wouldn't have been meaningful for us to measure the CPU utilization for example on domain zero because it would have been very low. Phil? >>: Is domain zero single threaded? >> Ashaf Aboulnaga: No it is not. >> Phil Long: So on a multicore system it could potentially be running multiple cores concurrently? >> Ashaf Aboulnaga: Yes. And that is part of the [inaudible]. >> Phil Long: So that would scale up. You would keep it [inaudible] even if he gets [inaudible] over time you would see more cores, but cores are cheap. >> Ashaf Aboulnaga: Yes. This is all predicated on cores are cheap and plentiful. But I think some of the concerns that were raised were if you are running in a cloud environment, you don't want to be wasting cores like this. We can't avoid the fact that we are saving time in the protected virtual machine by doing more work in domain zero. >>: [inaudible] interested in what it is that you are paying. So I can tell if I should be concerned. >> Ashaf Aboulnaga: I see. We didn't measure because we were paying CPU cost in domain zero and cores are cheap and plentiful and so we are willing to pay CPU cost in domain zero. >>: But what about communication cost? >> Ashaf Aboulnaga: So that was very low in our setting. That, we did measure and we reported in the paper and it was quite low in our setting, especially with our optimizations that we reduced communications. >>: I think added to the [inaudible] is basically the low the assumption that the reason for contention and if you have a virtual machine that runs one, that has only, a machine that runs only one virtual machine that is dedicated, I might as well just give it the core and use another system that doesn't waste that core for domain zero and does the individual application. Now usually I do it because I can run multiple systems on it. 
And the question that I had in my mind was: for one virtual machine, yes, you have some extra leverage to give some time to do the work in domain zero. But if you do that for five or six, you might be paying a penalty for each machine that is in aggregate more than if you just left it alone. >> Ashaf Aboulnaga: Left it alone unprotected, you mean. >>: Yes. Or did it with some other technique which would actually… >> Ashaf Aboulnaga: So this is what we want to achieve: protection with no code changes to the database system, and what you are saying is that we pay a cost for that. And my answer is yes, we do. The cost we pay is that we need to provide sufficient capacity for domain zero. Now can you do this by doing the log shipping and the data [inaudible]? Yes, you can. One last question before I move on. >>: So what is the behavior when the backup actually fails? Because if the [inaudible] fails, the backup becomes the primary and it is unprotected, so of course [inaudible] the same as the unprotected one. But once the backup fails, the primary still has to send all of those checkpoints? >> Ashaf Aboulnaga: The primary will figure out that there is no one receiving the checkpoints at the other site, and so it would stop sending them. So it will behave in the same way as if the primary failed. So now I want to switch gears a little bit and talk about another project, which is about building a database service by running database systems on eventually consistent cloud storage. So Phil, do I have to stop at 11:30 or can I go to 11:40…? >> Phil Bernstein: No. You can go. >> Ashaf Aboulnaga: So probably I will try to stop by 11:40 or thereabouts. You guys are already using your question time in the middle of my talk, so I will probably stop at 11:40 and not have too much time for questions. So this work hasn't been published yet, and here we are relying on cloud storage. So what is this cloud storage?
Many systems these days are developed for the cloud, things like Amazon S3, things like HBase, things like Cassandra. And these systems are all storage systems that are very scalable, distributed and fault-tolerant, but they provide a very simple interface to the user. They are all key-value stores, where the basic operations are write a row with a specific key or read the row with a specific key. And they provide atomicity only for single-row operations. They don't provide multi-row atomic transactions and they don't provide the richness of SQL, which is why they are called NoSQL systems. In this work what we are asking is: if we have one of these scalable cloud storage systems that is running, hopefully, in multiple data centers to provide disaster tolerance, can we build a multi-tenant database service in this setting by running independent database systems on this cloud storage system? So here we are not interested in scaling an individual database tenant. We assume that one machine has sufficient CPU, has sufficient capacity for one tenant. What we are interested in is scaling the storage capacity and bandwidth available for each tenant, and scaling the number of tenants. So we want to build a scalable, elastic, highly available multi-tenant database service that supports SQL and ACID transactions. And the idea is that the cloud storage system would provide the scalability, elasticity and availability, and these database systems will provide SQL and ACID transactions. So we implemented a prototype of a system like this, and in our prototype what we wanted to look at is: if we are building this service on top of an eventually consistent storage system, can we take advantage of the relaxed consistency of the storage system to give us better performance? So our prototype is called DBECS, which stands for databases on eventually consistent stores. And in our prototype we use MySQL and its InnoDB storage engine.
Actually nothing in what we do is specific to MySQL, so we could have replaced MySQL with any other database system, but the storage system that we used is Cassandra, and we do rely on Cassandra because we want eventual consistency. So I'll talk more about the system, but you had a question? >>: [inaudible] model, go back to that [inaudible] did you say DBECS, did you describe that as one database server? >> Ashaf Aboulnaga: This is one instance of MySQL with its own clients and its own databases, and it is independent of all the other instances of MySQL that are running. >>: And [inaudible] machine. >> Ashaf Aboulnaga: Different machines, different databases. >>: But if they share the storage of the… >> Ashaf Aboulnaga: They share the storage subsystem but they each have their own blocks within the storage system. So basically what we want is to add more and more tenants, not to grow one tenant. Okay? So why Cassandra? As I said, Cassandra uses eventual consistency. So by relaxing consistency, it reduces the latency of writes and it is partition tolerant. It can run in multiple data centers. So let me spend a few minutes talking about Cassandra. Cassandra stores data as semi-structured rows that are grouped into column families. So think of each column family as a table, and within a table or a column family you have semi-structured rows. These rows are accessed by a key. So every row has a key, and rows are replicated and distributed by hashing these keys. One of the nice things about Cassandra is that it uses multi-master replication for each row. Many of these cloud storage systems try to guarantee consistency and they do that by having a single master. Cassandra doesn't do that. Cassandra has multiple masters for each row, which enables Cassandra to run in multiple data centers and gives us partition tolerance. And we rely on that for disaster tolerance in our DBECS system.
Another nice thing about Cassandra is that a client controls the consistency of each write. So there is this consistency versus latency trade-off, and Cassandra allows the client to manage this with every read or write operation. So in Cassandra a read or write operation specifies a consistency level, and the client can say write one or read one, which means get me any copy of the data or write any copy of the data. And that is fast but not necessarily consistent. There is also write all and read all, which means read all copies of the data and get me the most recent, or write all copies of the data. And that is consistent but may be slow. So Cassandra allows the client to control the latency versus consistency trade-off, and in this work we posit that database systems can control this trade-off very well. >>: [inaudible] illustrates [inaudible] write all read one? >> Ashaf Aboulnaga: Yes, you can. Yes. >>: [inaudible]. >> Ashaf Aboulnaga: Yes. There is actually an explicit read quorum, and in my next slide I will talk more about these consistency levels, but now I have to give a broader overview of Cassandra. Now another nice feature in Cassandra is that Cassandra uses timestamps with each write. And these timestamps are provided by the client. So the client controls the serialization order of updates. And this is important for us in our system. Cassandra also is scalable, elastic and highly available, but so are many other storage systems. So we didn't use Cassandra because it is scalable, elastic or highly available. We chose Cassandra because of this. So let's talk a little bit more about this consistency versus latency trade-off. In Cassandra, when you specify a read, you specify a consistency level. The basic operation is to read the value of a specific column in a row with a specific key. So you give the column name, you give the key, and then you specify the consistency level.
You can say read one, which means that Cassandra will send the read request to all replicas of this row, and it finds the replicas by hashing on the key, and it returns a value to the client as soon as one replica responds. So it is fast, but may return stale data. Now there is also read all, where Cassandra sends the read request to all replicas of the row and it waits for all of them to respond and then it returns the latest copy. So this is consistent, but it is as slow as the slowest replica. There is also write one versus write all, so Cassandra sends the write request to all replicas and either acknowledges as soon as one replica responds or when all replicas respond. And there are also other consistency levels. There is a read quorum. There are also the data center aware consistency levels. So there is this trade-off. Now let's quantify this trade-off. Consistency is possible, but it is expensive, so how expensive? So we ran some experiments, and here we were running Cassandra in the Amazon EC2 cloud, and we have a system with four Cassandra nodes, so a small Cassandra cluster, and we are running this benchmark called the Yahoo! Cloud Serving Benchmark, which is a very simple benchmark that does reads and writes. So there is no fancy SQL here. And here I am showing the latency of writes and reads. Blue is write all and read all, red is write one and read one, and here all of the four Cassandra nodes are in the same EC2 availability zone. Think of it as the same data center. So you see that there is a factor of two penalty between read one and read all, but if we move two of the Cassandra nodes to two EC2 availability zones within the same geographic region the penalty becomes bigger, and if we move the two Cassandra nodes to a different geographic region, in two different data centers, one on the US East Coast and one on the US West Coast, the penalty becomes huge.
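As a rough illustration of why the penalty grows with distance, the consistency levels described above can be modeled by how many replica responses the coordinator waits for. This is only a sketch under stated assumptions: the function name `ack_latency` and all latency numbers are made up for illustration and are not measurements from the talk.

```python
def ack_latency(replica_latencies_ms, level):
    """Time until the client gets an answer, given each replica's response time.

    The request goes to every replica; the level only decides how many
    responses must arrive before the client is answered.
    """
    ordered = sorted(replica_latencies_ms)
    n = len(ordered)
    if level == "ONE":
        needed = 1               # first replica to respond
    elif level == "QUORUM":
        needed = n // 2 + 1      # strict majority
    elif level == "ALL":
        needed = n               # as slow as the slowest replica
    else:
        raise ValueError(f"unknown consistency level: {level}")
    return ordered[needed - 1]


# Hypothetical response times (ms) for four replicas.
same_zone = [1.0, 1.1, 1.2, 1.3]
two_regions = [1.0, 1.2, 80.0, 82.0]   # two replicas on the other coast

for name, latencies in [("same zone", same_zone), ("two regions", two_regions)]:
    for level in ("ONE", "QUORUM", "ALL"):
        print(name, level, ack_latency(latencies, level))
```

With replicas split evenly across two sites, a majority still needs at least one remote response, so in this model a quorum read is about as slow as read all.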
The difference in performance between the consistent read and write and the potentially inconsistent read and write becomes very big. So the message from these experiments is that there is a significant cost to be paid if we use read all and write all, especially if we want multi-data center operation. >>: [inaudible]. >> Ashaf Aboulnaga: Four, all. >>: So it doesn't do any [inaudible] to say two or three agreed that this is the latest one? All means all of them? >> Ashaf Aboulnaga: All means all, because you have to wait for everybody to respond to decide which is the latest one. >>: But it doesn't manage its own consistencies. It relies on the majority being [inaudible]. >> Ashaf Aboulnaga: You can say read quorum and write quorum if you want. But still your quorum here is divided among the data centers, so you still have to wait for somebody from the other data center to respond. So this is an overview of Cassandra. So what did we want to do with Cassandra in this project? We wanted Cassandra to look like a disk to the database system, so a scalable, elastic, highly available transcontinental disk. So what we are going to store on Cassandra is disk blocks. Cassandra stores rows with keys and values. In our case keys are disk block IDs, and because we have different database tenants, we append the database system ID to that. Values are the contents of the block, and we don't do anything fancy with Cassandra's columns or column families. We just have one column family with one column containing all our data. We have this layer, this client library, that intercepts read and write requests from the database system, in our case from MySQL's InnoDB storage engine, and it converts them to read and write requests in Cassandra. The question here is what consistency level should the Cassandra I/O layer use? If it uses write one and read one, it is fast but might return stale data and it provides no durability guarantees. Reading stale data makes database systems very unhappy.
So this is not something that will work for a database system. Now one way to get consistency is to use write all read one, which is what Dave was mentioning. And that returns no stale data and guarantees durability, but writes are slow. So our goal in this project is to try to achieve the performance of write one and read one while maintaining consistency, durability and availability, and we do that by making two changes to Cassandra and to the way we use Cassandra. The first is something we call optimistic I/O, in which we say well, if write one and read one are cheap, let's just use write one and read one, and if we happen to get stale data, let's detect that and recover from it. The second change we made is something we call client-controlled synchronization. When we use write one and read one we lose durability. Client-controlled synchronization gives us back durability; it makes database updates safe in the face of failures. So let me spend a few minutes talking about each of these optimizations, starting with optimistic I/O. A key observation we had here is that even if we use write one and read one, most of the time reads will not return stale data. Why is that? Are we just looking at the world with pink glasses? No. There are reasons for that. First of all we have a single writer for each database block. We have many different database blocks that belong to many different database tenants, but we have a single writer for each database block. And secondly, because these clients are database systems, they have a buffer pool. So when a database system writes a block, it is unlikely to immediately turn around and read the same block. The block is in the buffer pool, so there will likely be a period of time between the write and the next read of this block.
And in that time, even if Cassandra uses write one, the update will have propagated to all of the replicas because, remember, with write one Cassandra sends the write to all replicas and just responds as soon as one of them acknowledges. And finally there is a factor that is not really important but is also part of the picture, which is that because of network topology, the first Cassandra node to acknowledge a write is likely to be the same node to acknowledge a read. So because of all of these reasons, if we use write one and read one, most of the time we will not see stale data. But there is still a likelihood of seeing stale data, so what should we do? What we do is detect stale data and recover from it. So how do we detect stale data? There is this Cassandra I/O layer, the client, and the Cassandra I/O layer stores a version number with every block in Cassandra and it remembers the most recent version number of every database block. When we use read one, we check the version number that is returned by the read one against the most recent version number. If it is the most recent, we are fine. Our optimism was warranted. If we detect stale data, then we have to recover and retry the read, and when we retry the read we can use read all so that we are guaranteed to have the most recent copy. An interesting observation is that even if we retry using read one we are likely to get the most recent copy. Why is that? Because when Cassandra detects staleness, it initiates a repair process whereby it brings all the replicas up-to-date, so if we retry the read one, we are likely to see that Cassandra has already repaired this stale row. Remembering the version number of each block seems expensive. Do we really need the version number of each block? The answer is no. What we do is remember the most recent version number of only the most frequently or most recently touched disk blocks.
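The detect-and-retry idea described above can be sketched as follows. This is a minimal illustration, not the DBECS implementation: `FakeReplicatedStore` and `OptimisticBlockIO` are hypothetical names, the store is a toy in-memory stand-in rather than a Cassandra client, and the version numbering is simplified.

```python
class FakeReplicatedStore:
    """Toy storage tier: each replica maps block key -> (version, data)."""

    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]
        self.read_one_from = 0   # replica that answers a level-ONE read first
        self.lagging = set()     # replicas that have not applied new writes yet

    def write(self, key, version, data, level="ONE"):
        # The toy ignores `level`; lagging replicas simply miss the write.
        for i, replica in enumerate(self.replicas):
            if i not in self.lagging:
                replica[key] = (version, data)

    def read(self, key, level):
        if level == "ONE":
            return self.replicas[self.read_one_from].get(key, (0, None))
        # ALL: ask every replica and keep the copy with the newest version.
        copies = [r.get(key, (0, None)) for r in self.replicas]
        return max(copies, key=lambda copy: copy[0])


class OptimisticBlockIO:
    """Reads at level ONE, checks the version, falls back to ALL when stale."""

    def __init__(self, store):
        self.store = store
        self.latest = {}         # block key -> newest version this client knows
        self.stale_reads = 0

    def write(self, key, data):
        # Simplification: assumes this client has seen every earlier version.
        version = self.latest.get(key, 0) + 1
        self.store.write(key, version, data, level="ONE")
        self.latest[key] = version

    def read(self, key):
        expected = self.latest.get(key)
        if expected is None:
            # No remembered version for this block: read all, then remember.
            version, data = self.store.read(key, level="ALL")
            self.latest[key] = version
            return data
        version, data = self.store.read(key, level="ONE")
        if version < expected:
            # Optimism was not warranted: recover with a consistent read.
            self.stale_reads += 1
            version, data = self.store.read(key, level="ALL")
        return data


store = FakeReplicatedStore()
io = OptimisticBlockIO(store)
io.write("tenant1:block7", b"v1")
store.lagging = {2}              # replica 2 misses the next write
io.write("tenant1:block7", b"v2")
store.read_one_from = 2          # the stale replica happens to answer first
print(io.read("tenant1:block7"), io.stale_reads)
```

The point of the design is that the common case pays only for the cheap read; the expensive read is issued only when the version check fails.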
So our Cassandra I/O layer considers a block to be in one of three states: either unknown, inconsistent or consistent. Unknown means I don't know the most recent version number and I have no information about this block. This is the state all the blocks are in when the database first starts. So what do I do if I am reading a block that is in the unknown state? I use read all. Once I do a read all, I get the most recent version number and I can store that. And if I write a block and I know the most recent version number of it, then I know that this is a block that might be inconsistent. So I can use read one for that block, and I have to compare the version number from Cassandra with the most recent version which I know of. Now we have also modified Cassandra so that whenever it responds with a version number it gives us not just the newest version number but also the oldest version number. As we are interacting with Cassandra we are always comparing the newest and oldest version numbers of different disk blocks, and if the newest and oldest are the same, then we know that the block is consistent. What does consistent mean? It means that I can use read one and I don't even need to check version numbers. I maintain a bounded-size list of consistent blocks and a bounded-size list of inconsistent blocks, and if either of these lists outgrows its bound, I just return the least recently used blocks to the unknown state. I can always rely on read all if I don't know about the state of a block. Dave? >> Dave: [inaudible] delegate to Cassandra? >> Ashaf Aboulnaga: Remembering the most--how would we delegate the most recent version number to Cassandra? >>: By having it--I mean you would have to change Cassandra to do this, but alternatively you could pass the version to Cassandra and say I will take the block which matches, just as soon as you find it.
>> Ashaf Aboulnaga: That is an interesting possibility. We haven't looked into that. So basically, one of the things about Cassandra is that the client might connect to different Cassandra nodes at different times. So if you pass the version number to Cassandra and tell it, give me the version that is this or higher, we could do that. But right now what we are doing is keeping this outside of Cassandra. >>: So it depends on which one you want to change first. If you don't want to [inaudible]. >> Ashaf Aboulnaga: Yeah. And also any solution that we come up with has to work on the assumption that the client will connect to different Cassandra nodes at different [inaudible]. So by maintaining the version in our client library, maybe we avoid having problems with connecting to different nodes at the [inaudible]. >>: Cassandra could use the same strategy, right? >> Ashaf Aboulnaga: Yes. I guess there is no reason why this has to be in the client library and not in the [inaudible]. So now we are able to use read one and write one for most of our requests, but our data is not safe; writes are not safe. And also, if I use read all this will block if any replica is down. How can we deal with failures? One naïve solution would be to say I am going to use write all and read quorum, and if a node is down the write will block. But what we did in our system is, again, observe that because database systems have their well-defined transaction semantics, they know precisely when data must be safe and when data can be unsafe. So for example, write-ahead logging tells us that you have to write a log record before you write the corresponding data page. You have to flush the log [inaudible] to commit. If you are reclaiming records from the log, you have to write the corresponding data pages. There is a well-defined set of points where data has to be made safe.
And database systems are used to dealing with file systems that don't always guarantee safety, so they are used to calling fsync or something like fsync when they want data to be safe. So if we can add something like fsync for Cassandra, then we can afford to keep data unsafe until fsync is called. What happens when a failure happens and we lose unsafe data? It is exactly what will happen if the database system loses unsafe data. You abort transactions. So what we did is implement a new type of write in Cassandra called write CSYNC. And CSYNC stands for client-controlled synchronization. This is a new consistency level for writes in Cassandra. Write CSYNC behaves like write one, so it acknowledges the client as soon as one copy of the data is written, but it keeps the key of this page on an in-memory list called the sync pending list. So these are blocks that need to be synchronized when the client issues a CSYNC. And we also added a new call into the Cassandra client called CSYNC, or Cassandra sync, and whenever the database system says fsync, the Cassandra I/O layer, which is the layer between the database system and Cassandra, translates the fsync into a CSYNC. So basically we are making data safe only when the database system needs it to be safe. Data that is written remains unsafe until the database system explicitly requests for the data to be safe. And any period of time between the write and the CSYNC is an opportunity for latency hiding. Cassandra can be propagating the data while the client doesn't have to be waiting for it. What about reads? We use read quorum to deal with the possibility of Cassandra nodes being down. >>: [inaudible] for all of the pending ones, write them all?
>> Ashaf Aboulnaga: CSYNC is something that needed extra [inaudible] inside Cassandra, so basically when we write with this new CSYNC synchronization level, Cassandra accumulates keys on this sync pending list, and when we send a CSYNC to the Cassandra node what we're saying is do a write all for every key on this sync pending list. >>: So inside the [inaudible] know…? >> Ashaf Aboulnaga: Inside Cassandra, yes. It becomes a write all. >>: [inaudible] haven't been flushed yet [inaudible] call in then… >> Ashaf Aboulnaga: There is more. Basically we can make this CSYNC, instead, do a write to multiple data centers. So we don't have to write all… >>: So the read quorum actually can be--you're trying to say that it will be faster for the next read quorum and we don't have to wait for all? >> Ashaf Aboulnaga: I am not sure I understand what you are saying. >>: The idea is that instead of saying write all, make sure that you write at least two, for disaster recovery, so that you are protected at least… >> Ashaf Aboulnaga: Yes. What I am saying is we can do these kinds of things when we implement the CSYNC call. Dave? >> Dave: I thought a write one actually wrote to everything, but acks as soon as one came back--so if you could track how many came back, you could discharge your CSYNC list as things go on and not have to do it on the write all, right? >> Ashaf Aboulnaga: Yes. There is an opportunity to clean the sync pending list without waiting for CSYNC, but I am not sure if we actually do that or not in the [inaudible]. >>: [inaudible]. I am having a Rick Perry moment here [laughter]. >> Ashaf Aboulnaga: So one of the goals we wanted to achieve when we started was to deal with failures. So what are examples of failures that we can deal with? What happens when we lose a Cassandra node? If we lose a Cassandra node, that is completely transparent to the database system. It is handled by Cassandra.
Cassandra detects when the node is down and it takes it off the replication ring, and when the node comes back up it catches it up with all of the data. So that is one advantage of using a system like Cassandra. What about if we lose the primary data center? Here we don't have a fully good story, just a partial story. If we are running Cassandra in multiple data centers and we have the data stored in multiple data centers, which we can do with our existing implementation, we can restart the database system in a backup data center. So we don't have an always-on database system, but we can restart the database system in the backup data center and apply the standard log-based recovery to bring the database up to date. We can count on a transaction-consistent view of the log being there in the backup data center, because of the way we did writes in the system I/O and because of CSYNC. Let's see how this works. I will show you results of--yes? >>: Are you using Cassandra for the log also? >> Ashaf Aboulnaga: Yes. >>: So if you are only doing write one, you wouldn't necessarily know, until you did a sync, that your log was at the disaster center, right? >> Ashaf Aboulnaga: That is true, yes. >>: So you are risking losing some of the end of the log in this. >> Ashaf Aboulnaga: Yes. If the database system doesn't do an fsync. So we are risking losing the end of the log whenever the database system would have been willing to risk losing the end of the log. Whenever the database system says I want to make the data safe by issuing an fsync, we make the database [inaudible] center. So three or four experiments and then I will wrap up. Here we are running TPC-C on MySQL and Cassandra in Amazon EC2. We have a small Cassandra cluster with six nodes. And I am showing results for three situations. The first is when I have all six Cassandra nodes in one EC2 availability zone. The yellow bar shows the baseline, which is write all read one.
The blue bar shows what we can achieve with optimistic I/O. So you can see that you can get a significant performance boost with optimistic I/O. But optimistic I/O is not safe. Now if we bring back safety by using CSYNC, we pay a little bit of a performance penalty, but this performance penalty is not that high. Now the difference between the yellow bar and the red bar becomes bigger if our six Cassandra nodes are divided among two availability zones in the same geographic region, and it becomes even bigger when we are replicating in two geographic regions, US East and US West. So basically one way to look at this work is that we are enabling you to run MySQL on Cassandra and get the red performance instead of the yellow performance if you are running in two regions. Yes, Phil? >> Phil Long: Doesn't the performance of the CSYNC depend on how frequently you do CSYNC? >> Ashaf Aboulnaga: Yes. This is… >> Phil Long: And so how frequently in this graph? >> Ashaf Aboulnaga: That is not something we measured. So with TPC-C, we do CSYNC whenever MySQL does fsync. How frequently does MySQL do fsync? It does fsync with every commit. It does fsync whenever it needs to make the data safe. >> Phil Long: Well that is normally with every commit, isn't it? >> Ashaf Aboulnaga: Yes. So definitely every commit, but I think there are also other situations. For example, whenever it is reclaiming a page in the buffer pool it issues an fsync on the log before it obtains the page. I mean there are points where data must be made safe, and the only way that InnoDB knows how to make the data safe is to do fsync. You have this worried look on your face and I am not sure what you are worried about. >> Phil Long: I am just surprised because that is very frequent handshaking across the wire.
I mean, the answer to Dave's scenario before you are saying that every transaction commit you are going to make sure that all of, you are going to eagerly push all of the data out to the replicas. >> Ashaf Aboulnaga: Yes. >>: I think what may be useful for, you probably don't have it in this [inaudible]. How does [inaudible] local disk? >> Ashaf Aboulnaga: Actually we measured that and one of the problems we are seeing now is that there is a significant overhead compared to local disk, because every I/O on the local disk is translated into [inaudible]. >>: Some of Phil's concern could be alleviated by noting that if you do group [inaudible] as almost everybody does and it puts an fsync to the log versus an fsync to the database disk then you could separate out those overheads also. >> Ashaf Aboulnaga: Yes. As far as I know, the way INNODB uses that fsync is that whenever there is a commit, you fsync the log. And if the fsync catches a number of commits, you have avoided some fsyncs, and you only fsync the data when you are doing a hard checkpoint. But I would have to go back and look, actually your question does support--we really need to investigate how frequently does INNODB do fsyncs. >>: But it only acknowledges and commits to the client after an fsync? >> Ashaf Aboulnaga: Yes. >>: I think the way to do it right is to have UL propagate all the writes to all of the nodes. That [inaudible] CSYNC concern is that that is been done. >> Ashaf Aboulnaga: The CSYNC propagates all the writes to all the nodes, but that is only done when the database system requests it. >>: Yes. When you [inaudible] so the CSYNC [inaudible]. So that is maybe the reason why you frequently call in a CSYNC still. Because the CSYNC [inaudible]. >>: Well, that is not what he said last time. He said he wasn't sure whether the code was actually checking to see whether the stuff was done and it actually [inaudible]. >> Ashaf Aboulnaga: Yes. [multiple speakers] [inaudible]. 
>>: Remind me [inaudible] it might be that the [inaudible] flush the sync list but the write all request would, from the example there, return, I would assume, much faster if the [inaudible] doesn't really happen. >> Ashaf Aboulnaga: So let me move on. I want to show you--the message of this graph is you get to use MySQL on top of Cassandra with the red performance instead of the yellow performance. Now let's look at the goals that we have set out to achieve, which are scalability and availability. So do we get scalability? Here we are adding more and more database tenants. They are all running TPC-C. They are independent databases running independent copies of TPC-C, and we are proportionately increasing the number of Cassandra nodes. So we start with two tenants on one database system and three Cassandra nodes. Then we go to six tenants and nine Cassandra nodes and so on. We proportionately increase the number of tenants and the number of Cassandra nodes. And what we get is a linear scale-up in the total tpmC that we see across all of these tenants. So we can scale in the number of tenants. Let's look at some availability results. Here we are running our MySQL in a primary data center, and there are three Cassandra nodes in this primary center and three Cassandra nodes in the secondary center, and we fail one node in the primary center at 300 seconds. When we do that there is a drop in performance until the other Cassandra nodes realize that this node has failed and stop sending requests to it, and then we recover our performance. Now this Cassandra node comes back up at 500 seconds and it takes a while to catch up this node with the other Cassandra nodes, so performance gradually rises until we get to the original performance. Yes? >>: So you say that for every tenant you add one node? >> Ashaf Aboulnaga: No.
No, we saw in our setting how much load a node can sustain, and it turns out that every one of the virtual machines that we are using can sustain two instances of MySQL, and these two instances of MySQL need three Cassandra nodes to sync. >>: So what you are saying is that by adding scalability we are adding more nodes? >> Ashaf Aboulnaga: That is the threshold of scalability. There is no magic. If your system is overloaded--what I can tell you is that every one of these points represents a high-load system. So here it is a high-load system with some number of nodes, and here it is a high-load system with many more nodes. >>: So the idea is that this is not doing like this. It is basically… >> Ashaf Aboulnaga: The idea is that this is basically linear, linear scale. It is not doing like this. >>: So you could imagine running TPC-C a couple of different ways. One is each of your systems runs the TPC-C benchmark, and so every time you add one in there you are adding another one that runs the TPC-C benchmark. Is that what you did? >> Ashaf Aboulnaga: Yes. >>: So that, if you will, would be perfectly partitionable because those are all partitions. >> Ashaf Aboulnaga: Yes. >>: So what you are not showing is scalability where you just sort of stretched out the size of the cluster of machines that is running a single list of--so your scalability is like where you added a new client doing independent things and this is what we get. >> Ashaf Aboulnaga: So basically the storage system is able to support more and more independent clients, yes, absolutely. So availability: I showed you what happens when a Cassandra node in the primary center fails. This is what happens when a Cassandra node in the secondary data center fails. You don't see as much of a performance dip because the node in the secondary center is not in the critical path most of the time. Here is what happens when we completely lose the primary data center.
And here the story is not as nice. It is still okay. So what happens is that you have some performance and then at 300 seconds we completely fail the primary data center. Now what happens is that we start a new database instance in the secondary data center and this instance does traditional log-based recovery. The log is there in Cassandra so it is actually consistent. After it is done with this recovery it comes back up and starts executing queries. And the reason why the performance after recovery is lower than the performance before the recovery is that here it is the U.S. east coast, and here it is the U.S. west coast, and east is closer to Waterloo than west, or actually east is closer to the primary database instance than west. >>: I don't understand, why did you need to do database recovery? Database is [inaudible]? >> Ashaf Aboulnaga: Yes. So we lost database data center with Cassandra and with other databases. >>: [inaudible] standby database actually running you can reduce that time? >> Ashaf Aboulnaga: We can, but right now we don't have standby databases. And standby database systems would open up a whole bunch of interesting issues. If you want to do standby replication completely ignoring the fact that you have shared storage, then you can do it without any problems, but what would be interesting is to see if we can exploit the fact that we have shared storage to make the standby faster. So what do we have so far? We have scalability and elasticity in storage capacity and storage bandwidth. We have scalability in the number of tenants and we have a highly available and [inaudible] storage tier. We have SQL and ACID transactions for the tenants. So there is this question about whether we can scale consistency. Can consistency scale? And what we have now is, in my view, an interesting point in the spectrum of possible answers to this question. One thing that we don't have is what Dave was talking about, scaling an individual database system.
That is not something we have looked at yet. And we don't have always-on tenants. So when a tenant fails, we have the advantage that when we restart the tenant in another data center, that newly started tenant can find a copy of the database and the log and do a log-based recovery, but we still have to incur downtime. We don't have this automatic and transparent fail [inaudible]. So let me conclude and I can take any other questions offline. So in this talk, what I argued was that high availability and also scalability for database systems can be provided by the cloud infrastructure, and this is not something that is new. Many people are working on different projects that aim to achieve this goal. But what I tried to show in this talk is that if we take advantage of the well-known characteristics and semantics of database systems, we can greatly improve our solutions. And I presented two examples of this. One is RemusDB, which is high availability in the virtualization layer, and the other is DBECS, which is scalability and high availability by running on eventually consistent cloud storage. Thank you, and sorry for running over time. [applause].