>> Phil Long: Welcome everybody. It is my pleasure to introduce Ashaf Aboulnaga
who was visiting here from University of Waterloo. I think a lot of people here know
Ashaf for a variety of reasons. He has done a pretty broad range of work in the database field
and the last few years he has moved into new areas focusing on cloud computing and data
integration of web data. In particular, part of the work he is going to describe today on
cloud computing was in a PVLDB paper just a couple of months ago that won best paper
award there, which by the way, was incorrectly listed as PVLDB 2012 in the advertising
that went around. Maybe he'll have one there too; who knows? We'll see. [laughter].
But anyway, without further ado let me turn it over to Ashaf.
>> Ashaf Aboulnaga: Thank you Phil. Good morning everybody and thank you for
coming to my talk. So the title of my talk is High-Availability for Database Systems in
Cloud Computing Environments. More generally it is about providing a database service
in the cloud. Before I start the technical material, let me start with some thanks that are
due. This is joint work with my colleague Professor Ken Salem at the University of
Waterloo. The first part of the work is done by my student Umar Farooq Minhas who
was an intern at Microsoft Research a couple of times before, and he is graduating next
year and needs a job. So if you guys have jobs to offer [laughter] Umar is your guy. The
second part of the talk is the work of Rui Liu, who is a postdoc jointly supervised by Ken
and me and he is also finishing in 2012 and also needs a job [laughter]. The first part of
the talk is in collaboration with people from the University of British Columbia,
Professor Andrew Warfield and his graduate students.
So as I said this talk is about, or my interest in this area is about using cloud technologies
plus SQL database systems to build scalable and highly available database service in the
cloud. Another interest I have in this area of cloud computing is what if I don't care
about a scalable highly available database service? What if I have my own instance of
my database system and I want to deploy it in the cloud or run it on a virtual
machine? How can we improve the way database systems interact with these virtual
machines, and also how can we improve the underlying cloud technology? Can we add
APIs? Can we add new features to the cloud technologies to better support database
systems?
So why database systems? I mean why am I focusing on this class of application? Well
they are important; databases are important. And looking at the people in the crowd, I
guess we all agree that this is not something that I have to spend too much time
convincing you guys of. But the question is why do I think that we can solve this
problem? Why do I think that folks in database systems can get us somewhere? One
reason is that database systems have a narrow interface with the user. They speak SQL.
So when we need to reason with a database system we don't need to reason about general
programs in a language like Java or C sharp. We need to reason about SQL. And
secondly, database systems have transactions that offer very well defined semantics for
consistency and durability. And thirdly, database systems have these very well-defined
structures, very well-defined operators. So regardless of the database system you can
always count on it having a buffer pool. You can always count on it executing queries as
trees of operators like hash join, sort-merge join, and so on.
And finally database systems are very introspective. They have accurate performance
models of their query execution costs. They have accurate performance models that tell
you what the marginal benefit of extra memory will be, and if you can expose these
models to the underlying cloud technology, we might get some benefit from there. So for
all of these reasons I think it is promising to look at database systems as a specific class
of applications and tune specifically for them. In this talk you will see that I rely heavily
on the semantics of transactions to decide when things have to be consistent and when
they don't need to be consistent, and also that I rely on the fact that we have these well-defined structures and present optimizations that are tuned towards things like the buffer
pool.
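The "trees of operators" the talk mentions can be made concrete with a tiny sketch of a query plan. This is an illustrative toy, not any real database engine's code; the class names are invented:

```python
# Tiny sketch of a query plan as a tree of operators, the well-defined
# structure the talk mentions. Operator names are illustrative.

class Scan:
    def __init__(self, rows):
        self.rows = rows

    def execute(self):
        yield from self.rows

class HashJoin:
    def __init__(self, left, right, key):
        self.left, self.right, self.key = left, right, key

    def execute(self):
        # Build a hash table on the left input, then probe with the right.
        table = {}
        for row in self.left.execute():
            table.setdefault(row[self.key], []).append(row)
        for row in self.right.execute():
            for match in table.get(row[self.key], []):
                yield {**match, **row}

# A two-level plan: hash join over two table scans.
plan = HashJoin(
    Scan([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]),
    Scan([{"id": 1, "qty": 5}]),
    key="id",
)
result = list(plan.execute())
assert result == [{"id": 1, "name": "a", "qty": 5}]
```

Because every engine exposes this same shape — a tree of operators pulling rows from children — infrastructure below the database can reason about it generically.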
So I will talk about two projects. The first is RemusDB which is the PVLDB paper that
Phil was talking about. RemusDB aims to provide database high-availability using
virtualization. The second part of the talk I will talk about a new project called DBECS
which also aims to provide database high-availability but also scalability and elasticity by
relying on eventually consistent cloud storage. So let's start with RemusDB and high-availability using virtualization.
So first of all what do I mean by high-availability in this context? What I mean is
resilience to hardware failure. I know that people who talk about high-availability often
talk about scheduled downtime and unscheduled downtime. Here we are focused more on
the unscheduled downtime, that is, on resilience to hardware failure. And high-availability
is becoming an important requirement for all kinds of applications. It is not
just high-end enterprise mission critical applications that have to be highly available these
days. Everything has to be highly available. Every Facebook game has to be highly
available. Nobody is willing to accept anything less than 24-7
uptime.
And when we talk about high-availability we have to talk about several issues, so in the
context of databases we need to maintain database consistency in the face of failure. We
need to minimize the impact on performance of the high-availability solution both during
normal operation and also during failover. And finally, we have to reduce the complexity
and the administrative overhead of high-availability. Now high-availability has been
around for a while. There are many high-availability solutions out there, and one
common way to do high-availability is active standby replication. With active standby
replication you run two copies of your database, one on a primary
server and one on a backup server. And the primary is the active server. It accepts user
requests and performs queries. The backup is a standby server, and the primary ships
changes to the database to the backup by propagating the transaction log, and the
backup is busy applying these changes, so that when the primary fails, the backup can
take over as the primary and it takes over with a consistent database state.
If solutions like this exist, why are we still working in this area? We are still working in
this area because active standby replication is complex to implement in the DBMS
system and also complex to administer. You need to worry about things like how to
propagate the transaction log in a transaction-consistent way. You need to worry about
the [inaudible] of handover from primary to backup. You need to worry about
redirecting client requests. When the primary fails, the backup takes over as primary.
The database is consistent, but you need to tell the clients that now instead of talking to
this guy, they need to talk to this guy. And you need to minimize the impact on
performance. For example, when the failover happens, you want the failover to happen
to a backup with a warmed up buffer pool.
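The active-standby pattern just described can be sketched as a simple log-shipping loop. This is a minimal illustrative sketch, not any real DBMS's replication code; all class and method names are invented:

```python
# Minimal sketch of active-standby replication via log shipping.
# Names are illustrative, not from any real DBMS.

class Backup:
    def __init__(self):
        self.db = {}

    def apply(self, record):
        # The standby continuously replays shipped log records, so it
        # holds a consistent copy and can take over when the primary fails.
        _, writes = record
        self.db.update(writes)

class Primary:
    def __init__(self, backup):
        self.db = {}          # in-memory stand-in for the database
        self.log = []         # transaction log
        self.backup = backup

    def commit(self, txn_id, writes):
        # Apply writes locally, append a log record, and ship it to the
        # standby before acknowledging the commit.
        self.db.update(writes)
        record = (txn_id, writes)
        self.log.append(record)
        self.backup.apply(record)
        return "committed"

backup = Backup()
primary = Primary(backup)
primary.commit(1, {"x": 10})
primary.commit(2, {"y": 20})
assert backup.db == primary.db  # standby mirrors the primary
```

The complexity the talk goes on to describe — consistent log shipping, client redirection, warming the standby's buffer pool — is exactly what this toy omits.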
So the solutions are complex. And what we are aiming to do in this work is to push this
complexity out of the database system and into the virtualization layer. If I want to
invent the buzzword for this, I would call it high-availability as a service, in which we
can make any database system highly available simply by running it in a virtual
machine. We want to do this with as little performance overhead as possible. So the idea
is still to use active standby replication. We still have a primary server and a backup
server, but now the primary server is running a virtual machine and the database system
is inside the virtual machine. The changes, we still propagate changes from the primary
to the backup, but we don't just propagate changes in the database state; we propagate
changes in the entire virtual machine state, which includes buffer pool, which includes
client connection state, so that when the primary fails, the backup can take over as
primary. But now it takes over with a warmed buffer pool and clients get failed over
automatically to the backup, and this all happens with no code changes through the
database system. Yes?
>>: [inaudible]?
>> Ashaf Aboulnaga: Yes.
>>: [inaudible] performance [inaudible]?
>> Ashaf Aboulnaga: There is a performance penalty for running the database in the
virtual machine but we didn't really study that penalty in this work. We assume that you
are willing to run your database system in the virtual machine. There are people working
on--this is a continuous process of--there is a continuous process of reducing the penalty
of running databases in the virtual machine, but this is part of the equation. It is not part
of the equation that we focused on in this work, what we focused on is making this virtual
machine highly available. Yes?
>>: Question on the [inaudible] how we [inaudible] the plans?
>> Ashaf Aboulnaga: We will talk about that next. RemusDB is built on Remus which
is a project that was developed at the University of British Columbia and is now part of
the Xen hypervisor. Remus implements this picture. It maintains two copies of the virtual machine, one on a
primary server and one on the backup server, and it periodically replicates the state
changes from the primary virtual machine to the backup virtual machine using whole
machine checkpointing. And this whole machine checkpointing extends live virtual
machine migration. So things like failing over the clients from the primary to the backup
are handled by live VM migration. The transparency of failover is handled by live VM
migration. So Remus offers transparent failover with only seconds of downtime. So did
you have a question?
>>: Yes. You're talking about a backup to the [inaudible] all of the pages in this cache
are also in this cache over there?
>> Ashaf Aboulnaga: In our case this is what it will achieve. The two virtual machines
will be exact replicas of each other.
>>: And that goes for all of the…
>> Ashaf Aboulnaga: The whole virtual machine state. So how does Remus do this
whole machine checkpointing? How does Remus achieve high availability? So Remus
divides time into epochs. The epoch length is a tunable parameter, but think of it as
25 milliseconds. So Remus lets the primary run uninterrupted for 25 milliseconds. There
is no lockstep execution between primary and backup here. And at the end of this epoch,
Remus performs a checkpoint in which it suspends the primary virtual machine, copies
state changes from the primary virtual machine to domain zero, and for those of you who
are not familiar with Xen terminology, domain zero is a privileged virtual machine that
exists on any physical machine running Xen; it is the administrative domain, and it
manages the state of the virtual machines.
So you copy the state changes to domain zero and after this copy is done, the primary
virtual machine can be resumed. And then asynchronously domain zero copies the
checkpoint to the backup server where the backup server applies it to its state. Here is an
example of Remus checkpointing. This is an example showing three epochs, A, B and C
and at the end of every epoch there is a checkpoint that is taken. In this example the
primary machine fails during epoch C. So the primary machine fails. The backup takes
over and it resumes execution from the latest checkpoint. So work that the primary did in
epoch C is lost, and it is okay to lose this work as long as you don't expose output to the
user, because we don't want to expose unsafe output. So the way Remus handles that is
that it buffers any output that is exposed to the user until the end of an epoch.
So let's focus, for example, on network packets. Whenever the machine that is protected
by Remus wants to send a network packet, that network packet is buffered until the
checkpoint at the end of the epoch happens and then the state becomes safe. At that point
the packet is released to the user.
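The epoch, checkpoint, and output-buffering cycle just described can be captured in a toy simulation. This is an illustrative sketch of the mechanism only; real Remus checkpoints whole-VM state, and the names here are invented:

```python
# Toy simulation of Remus-style epoch checkpointing with output
# buffering. Structure is illustrative; real Remus replicates whole-VM
# state between physical machines.
import copy

class ProtectedVM:
    def __init__(self):
        self.state = {}
        self.out_buffer = []      # packets held until the next checkpoint
        self.released = []        # packets actually delivered to clients
        self.backup_state = {}    # last checkpoint applied at the backup

    def run_epoch(self, updates, packets):
        # The primary runs uninterrupted for one epoch (e.g. 25 ms);
        # outgoing packets are buffered, not sent.
        self.state.update(updates)
        self.out_buffer.extend(packets)

    def checkpoint(self):
        # Suspend, copy state changes to the backup, then release output:
        # clients only ever see output whose state is safely replicated.
        self.backup_state = copy.deepcopy(self.state)
        self.released.extend(self.out_buffer)
        self.out_buffer.clear()

    def fail_over(self):
        # Work since the last checkpoint is lost, but no unsafe output
        # was ever exposed, so clients see a consistent history.
        self.state = copy.deepcopy(self.backup_state)
        self.out_buffer.clear()

vm = ProtectedVM()
vm.run_epoch({"a": 1}, ["reply-1"])
vm.checkpoint()                      # "reply-1" released, state safe
vm.run_epoch({"a": 2}, ["reply-2"])  # epoch C...
vm.fail_over()                       # ...primary fails mid-epoch
assert vm.state == {"a": 1}          # resumed from last checkpoint
assert vm.released == ["reply-1"]    # "reply-2" was never exposed
```

The buffering step is what adds the average half-epoch latency to every packet that the rest of the talk works to eliminate.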
>>: [inaudible] every little packet, released it [inaudible]
>> Ashaf Aboulnaga: On average 12 1/2 milliseconds, yes. Exactly. And that is
actually part of the reason we looked at this, because Remus is there and you can run a
database system inside Remus, but then look at what Remus does to database
workloads. Without Remus protection you have a database server and the client sends a
query to the server. The server processes the query and returns a response that is
unprotected and you have some response time. Now if you enable Remus protection,
making checkpoints adds a certain overhead and network buffering adds even more
overhead. So you end up with a protected server but you get a much bigger response
time. We measured the overhead of protection to be up to 32% in some cases. So
our goal in this work is to implement optimizations that are database inspired to reduce
this overhead and in the end we were able to achieve less than 3% overhead and recovery
within 3 seconds.
So what I want to talk about now is the optimizations we do for Remus. Yes?
>>: Once you reach a checkpoint, you wait until the backup applies before proceeding?
>> Ashaf Aboulnaga: You wait until the backup acknowledges that it received the
checkpoint. So as long as the checkpoint is here you can proceed.
>>: If you don't wait until the backup is applied to the secondary, then there is a
possibility that both machines go down you lose state?
>> Ashaf Aboulnaga: Yes. So here we are tolerating one failure.
>>: Is it expensive?
>> Ashaf Aboulnaga: In theory it is, but this is not something that either we or the
original authors of Remus have actually explored yet. So let me talk about the
optimizations that we implement. So when we look at Remus and why it is slow for
database workloads, we see that it is slow for two reasons. One is that database systems
are heavy users of memory, so we implemented some optimizations that reduce the overhead
of checkpointing virtual machines whose memory is being heavily used. I
am going to talk about two optimizations, asynchronous checkpoint compression and disk
read tracking.
Asynchronous checkpoint compression aims at sending less data during checkpoints and
disk read tracking aims at protecting less memory. Another reason why Remus is slow
for database workloads is this 12 1/2 milliseconds delay that is added to every network
packet. Some database workloads, in particular transactional workloads, where there is a
lot of back-and-forth between the client and the server, are very sensitive to this network
latency. The last optimization we implemented is to exploit the semantics of
database transactions to avoid this overhead when we can.
So let me talk about the memory optimizations first. If you look at the way database
systems use memory, you see that there are large sets of frequently changing pages of
memory. One example of that is the buffer pool. But you can also think of the memory
where the connection state of the client is stored. And if you look at the way the database
system uses that memory, in many ways it is modifying a small portion of the page. So if
you look at the buffer pool in particular, you see that the database system will often
modify a few records in a buffer pool page. There is a lot of memory
that is being checkpointed, which results in a lot of replication traffic between the
primary and backup. And there is redundancy in this replication traffic because you are
modifying the same buffer pool pages over and over again and every time there is a
modification in a small part of the page.
When we looked at this we said well, instead of sending these redundant pages over and
over again, send the delta of the pages and send them compressed. And the way we
implement that is we maintain a cache in domain zero that contains the most recently
seen dirty pages we get from the protected virtual machine. So as part of checkpointing,
the protected virtual machine sends the dirty pages to domain zero and domain zero looks
in the cache. If these pages are found in the cache, then it does a delta between the
original page and this page, compresses the deltas and sends them over to the backup
compressed. If the page is not found in the cache, it is sent whole. This
cache is maintained as an LRU cache, so the most recently seen pages stay in the cache
and the least recently used pages are kicked out. Robbie?
>>: What kind of [inaudible] you are running the [inaudible] milliseconds?
>> Ashaf Aboulnaga: Yes.
>>: So you are [inaudible] keep the cache for all the pages through the Delta and
[inaudible]?
>> Ashaf Aboulnaga: So what we do in our implementation is we took 10% of the
memory that is available to domain zero and devoted it to this cache.
>>: [inaudible].
>> Ashaf Aboulnaga: It is. And one important thing to note about this
compression is that it is done asynchronously in domain zero. So it is
asynchronous checkpoint compression. There may be overhead in doing that, but it is
overhead that is incurred by domain zero. It is not on the critical path of the protected
database system.
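The asynchronous checkpoint compression just described — an LRU cache of recently seen dirty pages, with deltas compressed and whole pages sent on a miss — can be sketched as follows. This is a hedged illustration of the idea, not RemusDB's actual code; the cache size and encoding are invented stand-ins:

```python
# Sketch of checkpoint delta compression in domain zero: keep an LRU
# cache of recently seen dirty pages; if a page is in the cache, ship a
# compressed XOR delta, otherwise ship the whole page. Illustrative only.
import zlib
from collections import OrderedDict

PAGE_SIZE = 4096
CACHE_PAGES = 256  # stand-in for "10% of domain-zero memory"

cache = OrderedDict()  # page_number -> last-seen contents, in LRU order

def encode_page(page_no, page):
    old = cache.get(page_no)
    if old is not None:
        cache.move_to_end(page_no)           # refresh LRU position
        delta = bytes(a ^ b for a, b in zip(old, page))
        payload = ("delta", zlib.compress(delta))
    else:
        payload = ("full", page)             # cache miss: send whole page
    cache[page_no] = page
    if len(cache) > CACHE_PAGES:
        cache.popitem(last=False)            # evict least recently used
    return payload

# A page modified in a small region compresses extremely well as a delta.
base = bytes(PAGE_SIZE)
encode_page(7, base)                          # first sighting: sent whole
modified = bytearray(base)
modified[100:110] = b"new record"             # small in-place update
kind, blob = encode_page(7, bytes(modified))
assert kind == "delta" and len(blob) < 200    # far smaller than 4 KB
```

Because the XOR delta of a page with a small modification is almost all zeros, it compresses to a tiny fraction of the page size, which is exactly the redundancy in the replication stream the optimization exploits.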
>>: [inaudible] until it sends out and gets an acknowledgment for the…
>> Ashaf Aboulnaga: No. It will--as soon as the pages are here, the virtual machine can
continue.
>>: I see.
>> Ashaf Aboulnaga: Buffered network packets are not released until the checkpoint is
sent to the backup, but the protected virtual machine can continue executing. As for
computational overhead: in our implementation, whenever we could offload work from the
protected VM to domain zero, we considered that a win. So we were not too worried
about domain zero spending time compressing these pages and sending them over
to the backup, because we assumed that there is sufficient CPU capacity for domain zero.
Yes?
>>: [inaudible] this optimization, does it have any thing to do with [inaudible] per se or
is it…?
>> Ashaf Aboulnaga: It is inspired by the buffer pool, but it is applicable to any
workload where you have a redundancy in the replication stream.
>>: So in some respects, Remus running the database, is that unchanged through Remus
running arbitrary applications? So all of the changes occur in domain zero, is that what
you're saying?
>> Ashaf Aboulnaga: In domain zero and there is a little bit of change in the virtual
[inaudible], but the database system should not be changed at all.
>>: And the VMM that is running the [inaudible] is not changed?
>> Ashaf Aboulnaga: Is not changed.
>>: [inaudible] virtual machine?
>> Ashaf Aboulnaga: It can. It is not something that we studied. Yes?
>>: So the assumption that you have is that to begin with there is no competition
between the two virtual machines, which means that you can count anything that this one
does as a win for them?
>> Ashaf Aboulnaga: Yes.
>>: Does it hold usually in systems?
>> Ashaf Aboulnaga: If you have a system with enough
CPUs that it can dedicate CPUs to domain zero, that assumption would hold.
>>: When you [inaudible] you need to get a snap shot of the virtual impression
[inaudible]. You need to delay [inaudible]?
>> Ashaf Aboulnaga: Yes. So while the snapshot is being taken, the virtual machine is
suspended. That is the synchronization that we do and when the data is copied, the
virtual machine is [inaudible]. And then the rest of the checkpointing can happen
asynchronously.
>>: So how long does it take to finish this?
>> Ashaf Aboulnaga: How long is the duration of the virtual machine being suspended?
I don't have that number off the top of my head, but it is a small number of milliseconds.
So the second optimization we implemented is again an optimization that is inspired by the
buffer pool. If you look at the way database systems read data from disk, you
have an active virtual machine and a standby virtual machine and both of them have their
own disks and there is a copy of the database on each disk. When the database system
loads a page from disk into its buffer pool, it looks clean to the database system, but it is
dirty to Remus. So Remus will synchronize these dirty buffer pool pages to the backup
with every checkpoint. And disk read tracking is based on the fact that synchronizing
these clean buffer pool pages is not necessary, because the backup can always read
them from its copy of the database.
So what we do in this optimization is we track--again, this is not specific to the buffer
pool; it is inspired by the buffer pool--for any disk read, the memory pages into
which the read data is copied, and we don't mark them as dirty. We don't send them in
checkpoints. What we do is we send an annotation in the replication stream telling the
backup: you should read these pages from your copy of the database and put them in your
memory to reconstruct these pages. And the backup can do this read lazily; strictly we
only need to do this read when a failover happens, but in practice we read this data
periodically so that we can shorten failover time. So these are--yes?
>>: [inaudible] cannot be transparent, right? That particular optimization needs the
protected virtual machine to be able to talk to Remus somehow and tell it to mark
these…
>> Ashaf Aboulnaga: So the protected virtual machine just does the read, and Remus figures
it out. Remus gives the data back to the virtual machine and at the same time Remus sends
over to the other Remus on the backup virtual machine an annotation saying you should
read these pages and put them in your memory.
>>: How does it know the database…
>> Ashaf Aboulnaga: It doesn't care. It is a disk read: data was read from disk and put in
some page of memory.
>>: [inaudible].
>> Ashaf Aboulnaga: Okay.
>>: So the assumption is that, even before the [inaudible] reaches this one at all, the backup
virtual machine has an exact replica, the same copy of the database.
>> Ashaf Aboulnaga: Yes.
>>: Picks up the data on the database…
>> Ashaf Aboulnaga: Yes. Primary and the backup are replicas of each other including
the local disks.
>>: So are these pages sent to domain zero and then domain zero does the detection or is
there something else going on?
>> Ashaf Aboulnaga: Yes. That is the way Xen does reads. Domain zero is involved in
reads and it does the detection.
>>: So again--when you are doing checkpointing at the 25
millisecond interval and you are suspending the machine briefly, shipping over the changes,
how in that process do you distinguish the pages that changed via a disk read
from those that changed via updates, or don't you?
>> Ashaf Aboulnaga: You don't. The way Remus originally worked was that it marked
all pages as read only, and the first time the protected virtual machine modifies a page, it
raises an exception, and Remus would detect that and say, this page now
needs to be copied over. What we do with this optimization is that if a page has been
modified because data has been read into it from disk, we don't mark it read only.
We basically just send an annotation that here is a new page that has been read and
the backup should read it from its copy of the disk.
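The disk read tracking mechanism just explained can be sketched as a small bookkeeping class: pages filled by disk reads are excluded from the dirty set and replaced by an annotation telling the backup where to re-read the same bytes from its own disk replica. This is an illustrative sketch; the names are invented, not RemusDB's actual interfaces:

```python
# Sketch of disk read tracking: pages changed only by disk reads are not
# shipped in checkpoints; the backup instead gets an annotation telling
# it to re-read those sectors from its own replica of the disk.

class ReadTracker:
    def __init__(self):
        self.dirty = set()         # pages that must be shipped in checkpoints
        self.annotations = []      # (mem_page, disk_sector) hints for the backup

    def on_write(self, mem_page):
        # An ordinary write dirties the page; it goes in the next checkpoint.
        self.dirty.add(mem_page)

    def on_disk_read(self, mem_page, disk_sector):
        # The page changed, but only because clean data was read into it.
        # Don't ship it; tell the backup where to find the same bytes.
        self.dirty.discard(mem_page)
        self.annotations.append((mem_page, disk_sector))

tracker = ReadTracker()
tracker.on_write(3)                        # real modification: must replicate
tracker.on_disk_read(9, disk_sector=42)    # buffer-pool load: annotate only
assert tracker.dirty == {3}
assert tracker.annotations == [(9, 42)]
```

For a buffer-pool-heavy workload, most "dirty" pages are really clean reads, so this keeps the bulk of the buffer pool out of the replication stream.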
>>: So I think I was just misunderstanding, so you have the database in a virtual
machine and then you're going to go through the checkpoint process. But the checkpoint
process isn't part of the virtual machine, it's part of the system underlying the…
>> Ashaf Aboulnaga: It is part of this. I keep going back to this drawing. It is part of
this here, part of domain zero and the hyper [inaudible]. I promise never to go back to
this machine again [laughter].
These two optimizations that I described reduce the overhead of checkpointing memory
and they are completely transparent to the database system. Now when we looked at the
network, we saw that there is an opportunity for optimizing the way Remus deals with
network packets, and here we are left with an optimization that is not transparent to the
database system. So if you look at the way Remus handles network packets, Remus
buffers every outgoing network packet, which ensures that clients never see unsafe
execution.
But it adds up to three orders of magnitude to the latency of every network packet,
because for Remus we are assuming that the primary and the backup are on the same
network, so 12 and a half milliseconds of average latency is really high. And this is the
largest source of overhead for [inaudible] and in particular transaction workloads. And
our observation is that this is unnecessarily conservative for databases, because database
systems have their own transactions with clear consistency and durability semantics, so
we don't need the TCP-level per-checkpoint protection that Remus provides. So what we
did is we added to Remus an interface that allows the application, in this case the database
system, to say that these packets need to be protected, that is, buffered until the next
checkpoint, and these packets don't need to be protected. This is exposed to the application
via a Linux setsockopt option. So you have a socket and there is a switch associated with
every socket that says whether the socket is protected or unprotected. And the
way the database system uses this switch is that it only protects transaction control
packets, begin transaction, commit, abort. These have to be protected. All other packets
are sent unprotected, which means that the client may see unsafe state. So if a client sees
unsafe state, what happens when the primary fails? After failover, a
failover handler runs in the backup virtual machine in its own thread, and that
failover handler aborts all in-flight transactions whose
connection to the client was not in the protected state.
So database systems are allowed to abort transactions; to get a significant
boost in performance during normal operation, we pay the small
cost of aborting extra transactions on failover. This is not transparent to the database
system. We need to toggle this socket between protected and unprotected state and we
need to abort in-flight transactions after failover, and we actually implemented this
in both PostgreSQL and MySQL and ended up having to modify maybe 100 lines of code
in each system.
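The commit protection scheme just described — protect only transaction control packets, send everything else unprotected — can be sketched as follows. The socket class and the `SO_REMUS_PROTECT` constant are hypothetical stand-ins for RemusDB's setsockopt extension, not a real Linux option:

```python
# Sketch of commit protection: only transaction-control messages travel
# in "protected" mode (buffered until the next checkpoint); all other
# results go out unprotected. SO_REMUS_PROTECT is a hypothetical option
# standing in for RemusDB's setsockopt extension.
SO_REMUS_PROTECT = 0x9999  # illustrative constant, not a real Linux flag

class Socket:
    def __init__(self):
        self.protected = False
        self.sent = []

    def setsockopt(self, opt, value):
        if opt == SO_REMUS_PROTECT:
            self.protected = bool(value)

    def send(self, msg):
        # In real RemusDB, protected sends are held until the checkpoint;
        # here we just record which mode each message used.
        mode = "protected" if self.protected else "unprotected"
        self.sent.append((msg, mode))

def run_transaction(sock, rows):
    sock.setsockopt(SO_REMUS_PROTECT, True)
    sock.send("BEGIN")                       # control packet: protected
    sock.setsockopt(SO_REMUS_PROTECT, False)
    for r in rows:
        sock.send(r)                         # result rows: unprotected
    sock.setsockopt(SO_REMUS_PROTECT, True)
    sock.send("COMMIT")                      # control packet: protected

s = Socket()
run_transaction(s, ["row1", "row2"])
assert s.sent[0] == ("BEGIN", "protected")
assert s.sent[1] == ("row1", "unprotected")
assert s.sent[-1] == ("COMMIT", "protected")
```

Since commits are rare relative to result rows, almost all traffic escapes the half-epoch buffering delay, while the failover handler can still abort any transaction whose connection was caught in the unprotected state.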
So let me show you how this works. So this is our experimental setup. It is exactly the
picture I was showing. We have a primary server and a backup server connected via a high-speed network, and we are running MySQL and PostgreSQL. Yes?
>>: [inaudible] does this modification include the failover handler that [inaudible]?
>> Ashaf Aboulnaga: Yes. This one hundred lines of code includes everything. So first
of all, can we do failover? Here I am showing you TPC C on MySQL and on the x-axis I
am showing time and on the y-axis I am showing throughput in transactions per minute C
(tpmC). The green line is an unprotected virtual machine. In our setting we can get sustained
throughput of around 400 transactions per minute. If we run unmodified Remus we get
this red line. So there is a significant performance overhead. If we run our
modified RemusDB, we get this blue line. We significantly reduce the overhead. Now in this
experiment, halfway through, we failed the primary server. We actually pulled the
plug on the primary server. And the unprotected virtual machine can't proceed beyond
that; throughput drops to zero. It is not available anymore. Both protected virtual
machines proceed with very little downtime and they actually achieve peak performance,
achieved the same performance as the unprotected virtual machine. Why did the
performance jump this way? Because now the backup has taken over as the primary and
it is not protected anymore. So we are not advocating that you fail your machine to get
better performance, but [laughter] what we are seeing here is that if you are not protected,
you don't pay the cost of protection.
So this is failover. Now let's look at overhead here at [inaudible] operation. So this is
TPC C on PostgreSQL and here if we run unmodified Remus we see this 32% overhead,
so our performance is .68 of an unprotected virtual machine. If we use RemusDB’s
transparent memory optimizations, we recover most of that performance. But if we also
use the non-transparent commit protection we get back to almost the same performance
as an unprotected virtual machine. We get back to 97% of the performance of the
unprotected virtual machine.
Now the picture is a little bit different with TPC H. TPC H doesn't have this back and
forth between client and server, so it doesn't benefit that much from commit protection.
It doesn't benefit that much from the non-transparent optimization, but we do get a
significant performance boost using the memory optimizations. So that is RemusDB.
What we achieved is high availability for a [inaudible] system with no code changes, or
with very few code changes if we use the non-transparent optimizations, and we now have
automatic and fully transparent failover to a warmed-up system. As for the next steps in
this project: we are looking at re-protecting a virtual machine after it fails. We can
tolerate one failure at a time, but once the virtual machine is in an unprotected state, we
want to be able to quickly go back to a state where it is protected. And one of the nice
things about Remus is that the backup doesn't
have to do a lot of work during normal operation. It is just applying checkpoints. So one
possibility we could explore is to have one server serve as the backup for multiple
primaries.
And finally there are some administrative questions that arise when you protect a
database system with RemusDB. For example, how much network bandwidth do we
need between the primary and the backup? And we are looking at answering some of
these questions in our current work. So, Dave?
>>: So if you are, say, a cloud service provider and you are deploying this system in the
cloud, you've got a bunch of servers and you are trying to get as much out of your system
as possible. You are going to end up being very concerned about the performance of
domain zero, and so the question I ask is: have you measured what the overhead of
domain zero is?
>> Ashaf Aboulnaga: No. In our experiments, on the
physical host we were running one protected virtual machine and one domain zero. So
the overhead that one virtual machine places on domain zero is very low. The situation
would be different if we had 10 virtual machines on the physical server, but that is not
something we did in our experiments, so it wouldn't have been meaningful for us to measure the
CPU utilization for example on domain zero because it would have been very low. Phil?
>>: Is domain zero single threaded?
>> Ashaf Aboulnaga: No it is not.
>> Phil Long: So on a multicore system it could potentially be running on multiple cores
concurrently?
>> Ashaf Aboulnaga: Yes. And that is part of the [inaudible].
>> Phil Long: So that would scale up. You would keep it [inaudible] even if it gets
[inaudible] over time you would see more cores, but cores are cheap.
>> Ashaf Aboulnaga: Yes. This is all predicated on cores being cheap and plentiful. But I
think some of the concerns that were raised were that if you are running in a cloud
environment, you don't want to be wasting cores like this. We can't avoid the fact that we
are saving time in the protected virtual machine by doing more work in domain zero.
>>: [inaudible] interested in what it is that you are paying. So I can tell if I should be
concerned.
>> Ashaf Aboulnaga: I see. We didn't measure because we were paying CPU cost in
domain zero and cores are cheap and plentiful and so we are willing to pay CPU cost in
domain zero.
>>: But what about communication cost?
>> Ashaf Aboulnaga: So that was very low in our setting. That, we did measure and we
reported in the paper and it was quite low in our setting, especially with our optimizations
that we reduced communications.
>>: I think implicit in the [inaudible] is the assumption that cores are the source of
contention. If I have a machine that runs only one dedicated virtual machine, I might as
well just give it the core and use another system that doesn't waste that core on domain
zero and runs the application directly. Usually I virtualize because I can run multiple
systems on a machine. And the question that I had in my mind was: for one virtual
machine, yes, you have some extra leverage to give some time to the work in domain zero.
But if you do that for five or six, you might actually be paying a penalty that is in
aggregate more than if you just left it alone.
>> Ashaf Aboulnaga: Left it alone unprotected, you mean.
>>: Yes. Or did it with some other technique which would actually…
>> Ashaf Aboulnaga: So this is what we want to achieve: protection with no code
changes to the database system, and what you are saying is that we pay a cost for that.
And my answer is yes, we do. But the cost we pay is what we need to provide
sufficient capacity for domain zero. Now, can you do this with log shipping and
the data [inaudible]? Yes, you can. One last question before I move on.
>>: So what is the behavior when the backup actually fails? Because if the [inaudible]
fails, the backup becomes the primary and it is unprotected, so of course [inaudible] the
same as the unprotected one. But once the backup fails, does the primary still have to send
all of those checkpoints?
>> Ashaf Aboulnaga: The primary will figure out that there is no one receiving the
checkpoints at the other site, and so it will stop sending them. So it will behave in the
same way as when the primary fails.
So now I want to switch gears a little bit and talk about another project, which is about
building a database service by running database systems on eventually consistent cloud
storage. So Phil, do I have to stop at 11:30 or can I go to 11:40…?
>> Phil Bernstein: No. You can go.
>> Ashaf Aboulnaga: So probably I will try to stop by 11:40 or thereabouts. You guys
are already using your question time in the middle of my talk, so I will probably stop at
11:40 and not have too much time for questions. So this work hasn't been published yet,
and here we are relying on cloud storage. So what is this cloud storage? Many
systems these days are developed for the cloud: things like Amazon S3, things like
HBase, things like Cassandra. And these systems are all storage systems that are very
scalable, distributed and fault-tolerant, but they provide a very simple interface to the
user. They are all key-value stores, where the basic operations are write a row with a
specific key or read the row with a specific key. And they provide atomicity only for
single-row operations. They don't provide multirow atomic transactions and they don't
provide the richness of SQL, which is why they are called NoSQL systems.
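The key-value interface described above can be sketched as follows; the class and method names here are illustrative, not the API of any particular system:

```python
# Minimal sketch of a cloud key-value store interface: put/get a row by
# key, atomic only per single row, with no SQL and no multi-row
# transactions. All names are illustrative, not a real client API.
class KeyValueStore:
    def __init__(self):
        self._rows = {}

    def put(self, key, row):
        # Atomic for this one row only; no cross-row transactions.
        self._rows[key] = dict(row)

    def get(self, key):
        row = self._rows.get(key)
        return dict(row) if row is not None else None

store = KeyValueStore()
store.put("user:42", {"name": "alice", "city": "seattle"})
print(store.get("user:42"))  # {'name': 'alice', 'city': 'seattle'}
```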
In this work, what we are asking is: if we have one of these scalable cloud storage systems
running, hopefully in multiple data centers to provide disaster tolerance, can we
build a multi-tenant database service in this setting by running independent database
systems on this cloud storage system? So here we are not interested in scaling an
individual database tenant; we assume that one machine has sufficient CPU, has
sufficient capacity for one tenant. What we are interested in is scaling the storage capacity
and bandwidth available for each tenant, and scaling the number of tenants. So we
want to build a scalable, elastic, highly available multitenant database service that supports
SQL and ACID transactions. And the idea is that the cloud storage system provides the
scalability, elasticity and availability, and the database systems provide SQL
and ACID transactions.
So we implemented a prototype of a system like this, and in our prototype what we
wanted to look at is: if we are building this service on top of an eventually consistent
storage system, can we take advantage of the relaxed consistency of the storage system to
get better performance? Our prototype is called DBECS, which stands for
databases on eventually consistent stores. In our prototype we use MySQL and its
InnoDB storage engine. Actually, nothing in what we do is specific to MySQL, so we
could have replaced MySQL with any other database system. But the storage system that
we used is Cassandra, and we do rely on Cassandra, because we want eventual consistency.
So I'll talk more about the system, but you had a question?
>>: [inaudible] model, go back to that [inaudible]. Did you say DBECS? Did you describe
that as one database server?
>> Ashaf Aboulnaga: This is one instance of MySQL with its own client and its own
databases and it is independent of all the other instances of MySQL that are running.
>>: And [inaudible] machine.
>> Ashaf Aboulnaga: Different machines, different databases.
>>: But if they share the storage of the…
>> Ashaf Aboulnaga: They share the storage subsystem but they each have their own
blocks within the storage system. So basically what we want is to add more and more
tenants, not to grow one tenant. Okay?
So why Cassandra? As I said, Cassandra uses eventual consistency. By relaxing
consistency, it reduces the latency of writes, and it is partition tolerant: it
can run in multiple data centers. So let me spend a few minutes talking about Cassandra.
Cassandra stores data as semi-structured rows that are grouped into column families. So
think of each column family as a table, and within a table, or column family, you have
semi-structured rows. These rows are accessed by a key: every row has a key, and
rows are replicated and distributed by hashing these keys. One of the nice things about
Cassandra is that it uses multimaster replication for each row. Many of these cloud
storage systems try to guarantee consistency, and they do that by having a single master.
Cassandra doesn't do that. Cassandra has multiple masters for each row, which enables
Cassandra to run in multiple data centers and gives us partition tolerance. And we rely on
that for disaster tolerance in our DBECS system.
Another nice thing about Cassandra is that a client controls the consistency of each write.
So there is this consistency versus latency trade-off, and Cassandra allows the client to
manage it with every read or write operation. In Cassandra, a read or write operation
specifies a consistency level, and the client can say write one or read one, which means
write any copy of the data or get me any copy of the data. That is fast, but not
necessarily consistent. There are also write all and read all, which mean write all copies
of the data, or read all copies of the data and get me the most recent. And that is
consistent but may be slow.
So Cassandra allows the client to control the latency versus consistency trade-off, and in
this work we propose that database systems can control this trade-off very well.
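The ONE-versus-ALL trade-off can be illustrated with a toy in-memory simulation; this is not Cassandra's actual client API, just a sketch of the semantics:

```python
# Toy simulation of per-operation consistency levels (ONE vs ALL) over
# replicas that may lag. Illustrative only; the real Cassandra client
# API differs.
class ReplicatedRow:
    def __init__(self, n_replicas=3):
        # Each replica holds a (timestamp, value) pair; replicas may lag.
        self.replicas = [(0, None)] * n_replicas

    def write(self, value, ts, level="ONE"):
        if level == "ALL":
            # Acknowledged only after every replica has the write.
            self.replicas = [(ts, value)] * len(self.replicas)
        else:
            # "ONE": ack after one replica; the others lag behind.
            self.replicas[0] = (ts, value)

    def read(self, level="ONE"):
        if level == "ALL":
            # Wait for all replicas and return the most recent value.
            return max(self.replicas)[1]
        # "ONE": return whichever replica answers first (may be stale).
        return self.replicas[-1][1]

row = ReplicatedRow()
row.write("v1", ts=1, level="ALL")
row.write("v2", ts=2, level="ONE")  # only one replica updated so far
print(row.read("ONE"))              # "v1": a lagging replica answered
print(row.read("ALL"))              # "v2": latest across all replicas
```

The write timestamps here mirror Cassandra's client-supplied timestamps, which are what decide which copy is "most recent".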
>>: [inaudible] illustrates [inaudible] write all read one?
>> Ashaf Aboulnaga: Yes, you can. Yes.
>>: [inaudible].
>> Ashaf Aboulnaga: Yes. There is actually an explicit read quorum, and in my next slide I
will talk more about these consistency levels, but for now let me give a broader overview
of Cassandra. Now, another nice feature of Cassandra is that Cassandra uses timestamps
with each write, and these timestamps are provided by the client. So the client controls
the serialization order of updates, and this is important for us in our system. Cassandra
is also scalable, elastic and highly available, but so are many other storage systems. So
we didn't choose Cassandra because it is scalable, elastic or highly available. We chose
Cassandra because of this.
So let's talk a little bit more about this consistency versus latency trade-off. In Cassandra,
when you issue a read, you specify a consistency level. The basic operation is to read the
value of a specific column in the row with a specific key: you give the column name, you
give the key, and then you specify the consistency level. You can say read one, which
means that Cassandra will send the read request to all replicas of this row (it finds the
replicas by hashing on the key) and return a value to the client as soon as one replica
responds. So it is fast, but may return stale data. There is also read all, where Cassandra
sends the read request to all replicas of the row, waits for all of them to respond, and then
returns the latest copy. So this is consistent, but it is as slow as the slowest replica.
There is also write one versus write all: Cassandra sends the write request to all replicas,
and it acknowledges either as soon as one replica responds or when all replicas respond.
And there are other consistency levels: there is read quorum, and there are the
data-center-aware consistency levels. So there is this trade-off. Now let's quantify this
trade-off: consistency is possible, but it is expensive. How expensive? We ran some
experiments, running Cassandra in the Amazon EC2 cloud with four Cassandra nodes, a
small Cassandra cluster, and running the Yahoo! Cloud Serving Benchmark, which is a
very simple benchmark that does reads and writes. So there is no fancy SQL here.
Here I am showing the latency of writes and reads. Blue is write all and read all; red
is write one and read one. Here all four Cassandra nodes are in the same EC2
availability zone; think of it as the same data center. You see that there is a factor-of-two
penalty between read one and read all. But if we move two of the Cassandra nodes
to two EC2 availability zones within the same geographic region, the penalty becomes
bigger, and if we move the two Cassandra nodes to a different geographic region, in two
different data centers, one on the US East Coast and one on the US West Coast, the
penalty becomes huge. The difference in performance between the consistent read and
write and the potentially inconsistent read and write becomes very big.
So the message from these experiments is that there is a significant cost to be paid if we
use read all and write all, especially if we want multi-data center operation.
>>: [inaudible].
>> Ashaf Aboulnaga: Four, all.
>>: So it doesn't do any [inaudible] to say two or three agreed that this is the latest one?
All means all of them?
>> Ashaf Aboulnaga: All means all, because you have to wait for everybody to respond
to decide which is the latest one.
>>: But it doesn't manage its own consistencies. It relies on the majority being
[inaudible].
>> Ashaf Aboulnaga: You can say read quorum if you want. But still, your quorum here is
spread among the data centers, so you still have to wait for somebody from the other
data center to respond.
So this is an overview of Cassandra. So what did we want to do with Cassandra in this
project? We wanted Cassandra to look like a disk to the database system: a scalable,
elastic, highly available, transcontinental disk. So what we are going to store in
Cassandra is disk blocks. Cassandra stores rows with keys and values. In our case,
keys are disk block IDs, and because we have different database tenants, we append the
database system ID to that. Values are the contents of the block, and we don't do anything
fancy with Cassandra's columns or column families. We just have one column family
with one column containing all our data.
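Assuming that layout, the mapping from (tenant, block) to a Cassandra row key might look like the following sketch; the key format and helper names are made up for illustration:

```python
# Sketch of DBECS-style block storage: one row per disk block, keyed by
# the database system (tenant) ID appended to the block ID. The key
# format and the plain-dict "store" are assumptions for illustration.
def block_key(tenant_id, block_id):
    return f"{tenant_id}:{block_id}"

def write_block(store, tenant_id, block_id, data):
    store[block_key(tenant_id, block_id)] = data

def read_block(store, tenant_id, block_id):
    return store.get(block_key(tenant_id, block_id))

store = {}
# A 4 KB page for tenant "tenant7"; the page size is an assumption.
write_block(store, "tenant7", 12345, b"\x00" * 4096)
print(read_block(store, "tenant7", 12345) == b"\x00" * 4096)  # True
```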
We have a layer, a client library, that intercepts read and write requests from the database
system, in our case from MySQL's InnoDB storage engine, and converts them into
read and write requests to Cassandra. The question here is: what consistency level should
this Cassandra I/O layer use? If it uses write one and read one, it is fast, but it might return
stale data, and it provides no durability guarantees. Reading stale data makes database
systems very unhappy, so this is not something that will work for a database system.
Now, one way to get consistency is to use write all and read one, which is what
Dave was mentioning. That returns no stale data and guarantees durability, but
writes are slow. So our goal in this project is to try to achieve the performance of write
one and read one while maintaining consistency, durability and availability, and we do
that by making two changes to Cassandra and to the way we use Cassandra.
The first is something we call optimistic I/O, in which we say: well, if write one and read
one are cheap, let's just use write one and read one, and if we happen to get stale data,
let's detect that and recover from it. The second change we made is something we call
client-controlled synchronization. When we use write one and read one, we lose durability.
Client-controlled synchronization gives us back durability; it makes database updates safe
in the face of failures. So let me spend a few minutes talking about each of these
optimizations. First, optimistic I/O. A key observation we had here is that even if we use
write one and read one, most of the time reads will not return stale data. Why is that?
Are we just looking at the world through rose-colored glasses? No, there are reasons for
it. First of all, we have a single writer for each database block. We have many different
database blocks that belong to many different database tenants, but we have a single
writer for each database block. Secondly, because these clients are database systems, they
have a buffer pool. So when a database system writes a block, it is unlikely to immediately
turn around and read the same block: the block is in the buffer pool, so there will
likely be a period of time between the write and the next read of this block. And in that
time, even if Cassandra uses write one, the update will have propagated to all of the
replicas, because remember, with write one, Cassandra sends the write to all replicas and
just responds as soon as one of them acknowledges.
And finally, there is a factor that is not really important but is also part of the picture:
because of network topology, the first Cassandra node to acknowledge a write is
likely to be the same node that first answers a read. So for all of these reasons, if
we use write one and read one, most of the time we will not see stale data. But there is
still a chance of seeing stale data, so what should we do? What we do is detect stale
data and recover from it. So how do we detect stale data? There is this Cassandra I/O
layer, the client, and the Cassandra I/O layer stores a version number with every block in
Cassandra, and it remembers the most recent version number of every database block.
When we use read one, we check the version number that is returned by the read one
against the most recent version number. If it is the most recent, we are fine; our optimism
was warranted. If we detect stale data, then we have to recover and retry the read. When
we retry the read, we can use read all, so that we are guaranteed to have the most
recent copy. An interesting observation is that even if we retry using read one, we
are likely to get the most recent copy. Why is that? Because when Cassandra detects
staleness, it initiates a repair process whereby it brings all the replicas up to date, so if we
do retry with read one, we are likely to see that Cassandra has already repaired the stale row.
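The optimistic read path just described can be sketched as follows; the store interface and names are hypothetical stand-ins for the Cassandra I/O layer:

```python
# Sketch of optimistic I/O: read at level ONE, compare the returned
# version with the latest version the client remembers for that block,
# and fall back to a read at level ALL only when the copy is stale.
class VersionedStore:
    """Toy replicated store: ONE may hit a lagging replica, ALL the newest."""
    def __init__(self):
        self.newest = {}    # block_id -> (version, data), fully propagated
        self.lagging = {}   # block_id -> (version, data), a stale replica

    def write(self, block_id, version, data):
        self.newest[block_id] = (version, data)
        # Simulate propagation delay: the lagging replica is not updated.
        self.lagging.setdefault(block_id, (0, None))

    def read(self, block_id, level):
        if level == "ALL":
            return self.newest[block_id]
        return self.lagging.get(block_id, self.newest[block_id])

def optimistic_read(store, block_id, latest_versions):
    version, data = store.read(block_id, level="ONE")
    expected = latest_versions.get(block_id)
    if expected is None:
        # Unknown block: no version info, so we must read at ALL.
        return store.read(block_id, level="ALL")[1]
    if version >= expected:
        return data  # optimism was warranted
    # Stale copy detected: retry at ALL, guaranteed most recent.
    return store.read(block_id, level="ALL")[1]

store = VersionedStore()
latest = {}
store.write("b1", 2, "new-data")
latest["b1"] = 2  # the client wrote this block, so it knows version 2
print(optimistic_read(store, "b1", latest))  # "new-data", via the retry
```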
Remembering the version number of each block seems expensive. Do we really need the
version number of each block? The answer is no. What we do is remember the most
recent version number of only the most frequently or most recently touched disk blocks.
So our Cassandra I/O layer considers a block to be in one of three states: unknown,
inconsistent or consistent. Unknown means I don't know the most recent version number
and I have no information about this block; this is the state all blocks are in when the
database first starts. So what do I do if I am reading a block that is in the unknown state?
I use read all. Once I do a read all, I get the most recent version number and I can store it.
And if I write a block, I know its most recent version number, but I know that this is a
block that might be inconsistent. So I can use read one for that block, but I have to
compare the version number from Cassandra with the most recent version number which I
know of.
Now, we have also modified Cassandra so that whenever it responds with a version
number, it gives us not just the newest version number but also the oldest version number.
As we are interacting with Cassandra, we are always comparing the newest and oldest
version numbers of different disk blocks, and if the newest and the oldest are the same,
then we know that the block is consistent. What does consistent mean? It means that I can
use read one and I don't even need to check version numbers. And I maintain a
bounded-size list of consistent blocks and a bounded-size list of inconsistent blocks, and if
either of these lists outgrows its bounded size, I just return the least recently used blocks
to the unknown state. I can always fall back on read all if I don't know the state of a
block. Dave?
>> Dave: [inaudible] delegate to Cassandra?
>> Ashaf Aboulnaga: Remembering the most recent version number--how would we
delegate that to Cassandra?
>>: By having it--I mean, you would have to change Cassandra to do this, but
alternatively you could pass the version to Cassandra and say: I will take the block which
matches, as soon as you find it.
>> Ashaf Aboulnaga: That is an interesting possibility; we haven't looked into that.
Basically, one of the things about Cassandra is that the client might connect to different
Cassandra nodes at different times. So if you pass the version number to Cassandra and
tell it, give me this version or higher, we could do that. But right now what we are doing
is keeping this outside of Cassandra.
>>: So it depends on which one you want to change first. If you don't want to
[inaudible].
>> Ashaf Aboulnaga: Yeah. And also, any solution that we come up with has to
work under the assumption that the client will connect to different Cassandra nodes at
different [inaudible]. So by maintaining the version in our client library, maybe we avoid
the problems of connecting to different nodes at the [inaudible].
>>: Cassandra could use the same strategy, right?
>> Ashaf Aboulnaga: Yes. I guess there is no reason why this has to be in the client
library and not in the [inaudible].
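The three-state bookkeeping described above (unknown, inconsistent, consistent, with bounded LRU-evicted lists) might be sketched like this; the sizes and names are illustrative:

```python
from collections import OrderedDict

# Sketch of the Cassandra I/O layer's per-block state: a bounded,
# LRU-evicted map from block ID to state. Evicted blocks fall back to
# "unknown", which forces a read at level ALL. Details are illustrative.
UNKNOWN, INCONSISTENT, CONSISTENT = "unknown", "inconsistent", "consistent"

class BlockStateTracker:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.states = OrderedDict()  # block_id -> state, in LRU order

    def state(self, block_id):
        if block_id not in self.states:
            return UNKNOWN           # no info about this block
        self.states.move_to_end(block_id)
        return self.states[block_id]

    def set_state(self, block_id, state):
        self.states[block_id] = state
        self.states.move_to_end(block_id)
        while len(self.states) > self.capacity:
            # Evict the least recently used block back to "unknown".
            self.states.popitem(last=False)

    def read_level(self, block_id):
        # Unknown blocks need a read at ALL; tracked blocks can use ONE
        # (consistent ones can even skip the version-number check).
        return "ALL" if self.state(block_id) == UNKNOWN else "ONE"

tracker = BlockStateTracker(capacity=2)
tracker.set_state("b1", CONSISTENT)
tracker.set_state("b2", INCONSISTENT)
tracker.set_state("b3", CONSISTENT)  # capacity exceeded: "b1" evicted
print(tracker.state("b1"))           # "unknown"
print(tracker.read_level("b3"))      # "ONE"
```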
So now we are able to use read one and write one for most of our requests, but our
data is not safe; writes are not safe. And also, if I use read all, it will block if any
replica is down. How can we deal with failures? One naïve solution would be to say: I
am going to use write all and read quorum, and if a node is down, the write will block. But
what we did in our system is, again, observe that because database systems have their
well-defined transaction semantics, they know precisely when data must be safe and
when data can be unsafe. So, for example, write-ahead logging tells us that you have to
write a log record before you write the corresponding data page. You have to flush the log
[inaudible] a commit; if you are reclaiming records from the log, you have to write the
corresponding data pages. There is a well-defined set of points where data has to be
made safe. And database systems are used to dealing with file systems that don't always
guarantee safety, so they are used to issuing fsync, or something like fsync, when they want
data to be safe. So if we can add something like fsync to Cassandra, then we can afford
to keep data unsafe until fsync is called. What happens when a failure happens and we
lose unsafe data? Exactly what would happen if the database system lost unsafe data:
you abort transactions.
So what we did is implement a new type of write in Cassandra called write CSYNC, where
CSYNC stands for client-controlled synchronization. This is a new consistency level for
writing to Cassandra. Write CSYNC behaves like write one, so it acknowledges the
client as soon as one copy of the data is written, but it keeps the key of the page on an
in-memory list called the sync-pending list. These are blocks that need to be synchronized
when the client issues a sync. And we also added a new call to the Cassandra client
called CSYNC, or Cassandra sync, and whenever the database system issues an fsync, the
Cassandra I/O layer, which is the layer between the database system and Cassandra,
translates the fsync into a CSYNC.
So basically we are making data safe only when the database system needs it to be safe.
Data that is written remains unsafe until the database system explicitly requests for the
data to be made safe, and any period of time between the write and the CSYNC is an
opportunity for latency hiding: Cassandra can be propagating the data in the background,
and the client doesn't have to wait for it. What about reads? We use read quorum to deal
with the possibility of Cassandra nodes being down.
>>: [inaudible] for all of the pending ones, write them all?
>> Ashaf Aboulnaga: CSYNC is something that needed extra [inaudible] inside
Cassandra. Basically, when we write with this new CSYNC consistency level,
Cassandra accumulates keys on this sync-pending list, and when we send a
CSYNC to the Cassandra node, what we're saying is: do a write all for every key on
this sync-pending list.
>>: So inside the [inaudible] know…?
>> Ashaf Aboulnaga: Inside Cassandra, yes. It becomes a write all.
>>: [inaudible] haven't been flushed yet [inaudible] call in then…
>> Ashaf Aboulnaga: There is more: basically, we can make this CSYNC write to
multiple data centers instead of doing a full write all…
>>: So the read quorum actually can be--you're trying to say that it will be faster for the
next read quorum and we don't have to wait for all?
>> Ashaf Aboulnaga: I am not sure I understand what you are saying.
>>: The idea is that instead of saying write all, you make sure that you write to at least two,
for disaster recovery, so that you are protected at least…
>> Ashaf Aboulnaga: Yes. What I am saying is we can do these kinds of things when
we implement the CSYNC call. Dave?
>> Dave: I thought a write one actually wrote to everything, but acked as soon as one
write came back saying so. So if you could track how many came back, you could
discharge your CSYNC list as things go on and not have to do the write all, right?
>> Ashaf Aboulnaga: Yes. There is an opportunity to clean the CSYNC pending list
without a full CSYNC, but I am not sure if we actually do that or not in the [inaudible].
>>: [inaudible]. I am having a Rick Perry moment here [laughter].
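The CSYNC mechanism discussed above can be sketched as follows; the store class and method names are hypothetical, not Cassandra's actual interface:

```python
# Sketch of client-controlled synchronization (CSYNC): a write is
# acknowledged after one replica (fast, but unsafe), and its key goes on
# an in-memory sync-pending list; csync() then upgrades every pending
# key to a fully replicated write, the way fsync forces data to disk.
class CsyncStore:
    def __init__(self, n_replicas=3):
        self.replicas = [dict() for _ in range(n_replicas)]
        self.sync_pending = set()

    def write_csync(self, key, value):
        # Fast path: one replica has the data; durability is deferred.
        self.replicas[0][key] = value
        self.sync_pending.add(key)

    def csync(self):
        # Make every pending write safe: a write all per pending key.
        for key in self.sync_pending:
            value = self.replicas[0][key]
            for replica in self.replicas:
                replica[key] = value
        self.sync_pending.clear()

store = CsyncStore()
store.write_csync("log:100", "commit-record")
print(len(store.sync_pending))                      # 1: still unsafe
store.csync()                                       # the fsync moment
print(all("log:100" in r for r in store.replicas))  # True: now durable
```

In the real system, the window between the write and the CSYNC is the latency-hiding opportunity: propagation can proceed in the background while the client keeps working.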
>> Ashaf Aboulnaga: So one of the goals we wanted to achieve when we started was to
deal with failures. So what are examples of failures that we can deal with? What
happens when we lose a Cassandra node? Losing a Cassandra node is completely
transparent to the database system; it is handled by Cassandra. Cassandra detects when
the node is down and takes it off the replication ring, and when the node comes back
up, it catches it up with all of the data. So that is one advantage of using a system like
Cassandra. What about if we lose the primary data center? Here we don't have a fully
good story, just a partial story. If we are running Cassandra in multiple data centers and
we have the data stored in multiple data centers, which we can do with our existing
implementation, we can restart the database system in a backup data center. So we don't
have an always-on database system, but we can restart the database system in the backup
data center and apply standard log-based recovery to bring the database up to date. We
can count on a transaction-consistent view of the log being there in the backup data
center, because of the way we did writes in optimistic I/O and because of CSYNC.
Let's see how this works. I will show you results of--yes?
>>: Are you using Cassandra for the log also?
>> Ashaf Aboulnaga: Yes.
>>: So if you are only doing the write one, you wouldn't necessarily know, until you did a
sync, that your log was at the disaster-recovery center, right?
>> Ashaf Aboulnaga: That is true, yes.
>>: So you are risking losing some of the end of the log in this.
>> Ashaf Aboulnaga: Yes, if the database system doesn't do an fsync. So we are risking
losing the end of the log whenever the database system would have been willing to risk
losing the end of the log. Whenever the database system says I want to make the data
safe by issuing an fsync, we make the data safe at the [inaudible] center. So, three or four
experiments and then I will wrap up. Here we are running TPC-C on MySQL and
Cassandra in Amazon EC2. We have a small Cassandra cluster with six nodes, and I
am showing results for three situations. The first is when I have all six Cassandra nodes
in one EC2 availability zone. The yellow bar shows the baseline, which is write all and
read one. The blue bar shows what we can achieve with optimistic I/O, so you can see that
you get a significant performance boost with optimistic I/O. But optimistic I/O is not
safe. Now, if we bring back safety by using CSYNC, we pay a little bit of a performance
penalty, but this penalty is not that high.
Now, the difference between the yellow bar and the red bar becomes bigger if our six
Cassandra nodes are divided between two availability zones in the same geographic
region, and it becomes even bigger when we are replicating in two geographic regions,
US East and US West. So basically, one way to look at this work is that we are enabling
you to run MySQL on Cassandra and get the red performance instead of the yellow
performance if you are running in two regions. Yes, Phil?
>> Phil Long: Doesn't the performance of the CSYNC depend on how frequently you do
CSYNC?
>> Ashaf Aboulnaga: Yes. This is…
>> Phil Long: And so how frequently in this graph?
>> Ashaf Aboulnaga: That is not something we measured. This is TPC-C, and we do a
CSYNC whenever MySQL does an fsync. How frequently does MySQL do an fsync? It
does an fsync with every commit; it does an fsync whenever it needs to make the data safe.
>> Phil Long: Well that is normally with every commit, isn't it?
>> Ashaf Aboulnaga: Yes, so definitely every commit, but I think there are also other
situations. For example, whenever it is reclaiming a page in the buffer pool, it does an
fsync of the log before it reclaims the page. I mean, there are points where data must be
made safe, and the only way that InnoDB knows how to make the data safe is to do an
fsync. You have this worried look on your face and I am not sure what you are worried about.
>> Phil Long: I am just surprised, because that is very frequent handshaking across the
wire. I mean, the answer to Dave's scenario before is that you are saying that on every
transaction commit you are going to eagerly push all of the data out to the replicas.
>> Ashaf Aboulnaga: Yes.
>>: I think what may be useful, and you probably don't have it in this [inaudible], is: how
does [inaudible] compare to local disk?
>> Ashaf Aboulnaga: Actually, we measured that, and one of the problems we are seeing
now is that there is a significant overhead compared to local disk, because every I/O on
the local disk is translated into [inaudible].
>>: Some of Phil's concern could be alleviated by noting that if you do group [inaudible],
as almost everybody does, and you distinguish an fsync to the log versus an fsync to the
database disk, then you could separate out those overheads also.
>> Ashaf Aboulnaga: Yes. As far as I know, the way InnoDB uses fsync is that
whenever there is a commit, you fsync the log, and if the fsync catches a number of
commits, you have avoided some fsyncs. You only fsync the data when you are doing
a hard checkpoint. But I would have to go back and look; actually, your question does
suggest that we really need to investigate how frequently InnoDB does fsyncs.
>>: But it only acknowledges and commits to the client after an fsync?
>> Ashaf Aboulnaga: Yes.
>>: I think the way to do it right is to have it eagerly propagate all the writes to all of the
nodes. Then [inaudible] CSYNC's concern is just that that has been done.
>> Ashaf Aboulnaga: The CSYNC propagates all the writes to all the nodes, but that is
only done when the database system requests it.
>>: Yes. When you [inaudible] so the CSYNC [inaudible]. So that is maybe the reason
why you frequently call in a CSYNC still. Because the CSYNC [inaudible].
>>: Well, that is not what he said last time. He said he wasn't sure whether the code was
actually checking to see whether the stuff was done and it actually [inaudible].
>> Ashaf Aboulnaga: Yes.
[multiple speakers] [inaudible].
>>: Remind me [inaudible]. It might be that the [inaudible] flushes the sync list, but the
write all request from the example there would then return, I would assume, much
faster, if the [inaudible] doesn't really happen.
>> Ashaf Aboulnaga: So let me move on. The message of this graph is: you get to use
MySQL on top of Cassandra with the red performance instead of the yellow performance.
Now let's look at the goals that we set out to achieve, which are scalability and
availability. So do we get scalability? Here we are adding more and more database
tenants. They are all running TPC-C, and they are independent: they are databases
running independent copies of TPC-C, and we are proportionately increasing the number
of Cassandra nodes. So we start with two tenants on one database system and three
Cassandra nodes; then we go to six tenants and nine Cassandra nodes, and so on. We
proportionately increase the number of tenants and the number of Cassandra nodes. And
what we get is a linear scale-up in the total tpmC that we see across all of these tenants.
So we can scale in the number of tenants.
Let's look at some availability results. Here we are running our MySQL in a primary data
center, and there are three Cassandra nodes in the primary center and three Cassandra
nodes in the secondary center. We fail one Cassandra node in the primary center at 300
seconds, and when we do that, there is a drop in performance until the other Cassandra
nodes realize that this node has failed and stop sending requests to it, and then we recover
our performance. Now, this Cassandra node comes back up at 500 seconds, and it takes a
while to catch this node up with the other Cassandra nodes, so performance gradually
rises until we get back to the original performance. Yes?
>>: So you say that for every tenant you add one node?
>> Ashaf Aboulnaga: No. We measured in our setting how much load a node can sustain,
and it turns out that every one of the virtual machines that we are using can sustain two
instances of MySQL, and these two instances of MySQL need three Cassandra nodes to
serve them.
>>: So what you are saying is that by adding scalability we are adding more nodes?
>> Ashaf Aboulnaga: That is the definition of scalability. There is no magic. If your
system is overloaded--what I can tell you is that every one of these points represents a
highly loaded system. So here it is a highly loaded system with some number of nodes,
and here it is a highly loaded system with many more nodes.
>>: So the idea is that this is not doing like this. It is basically…
>> Ashaf Aboulnaga: The idea is that this is basically linear, linear scale. It is not doing
like this.
>>: So you could imagine running TPC-C a couple of different ways. One is each of
your systems runs the TPC-C benchmark, and so every time you add a tenant you are
adding another one that runs the TPC-C benchmark. Is that what you did?
>> Ashaf Aboulnaga: Yes.
>>: So that, if you will, would be perfectly partitionable because those are all partitions.
>> Ashaf Aboulnaga: Yes.
>>: So what you are not showing is scalability where you just sort of stretched out the
size of the cluster of machines that is running a single--so your scalability is where you
added a new client doing independent things and this is what we get.
>> Ashaf Aboulnaga: So basically the storage system is able to support more and more
independent clients, yes, absolutely. So, availability: I showed you what happens when
a Cassandra node in the primary center fails. This is what happens when a Cassandra
node in the secondary data center fails. You don't see as much of a performance dip
because the node in the secondary center is not in the critical path most of the time.
Here is what happens when we completely lose the primary data center. And here the
story is not as nice, but it is still okay. What happens is that you have some performance,
and then at 300 seconds we completely fail the primary data center. Now what happens
is that we start a new database instance in the secondary data center, and this instance
does traditional log-based recovery. The log is there in Cassandra, so it is actually
consistent. After it is done with this recovery it comes back up and starts executing
queries. And the reason why the performance after recovery is lower than the
performance before recovery is that here is the U.S. east coast and here is the U.S. west
coast, and east is closer to Waterloo than west--or actually, east is closer to the primary
database instance than west.
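The failover sequence just described -- lose the primary data center, start a fresh database instance in the secondary, and let it run traditional log-based recovery against the log stored in Cassandra -- can be illustrated with a small simulation. The classes below are stand-ins for the components named in the talk, not the actual DBECS implementation:

```python
# Illustrative simulation of the failover path described above; the classes
# are stand-ins, not the real DBECS code.

class CassandraLog:
    """Stands in for the write-ahead log durably stored in Cassandra."""
    def __init__(self):
        self.records = []     # (txn_id, key, value) redo records
        self.committed = set()

    def append(self, txn_id, key, value):
        self.records.append((txn_id, key, value))

    def commit(self, txn_id):
        self.committed.add(txn_id)

class DatabaseInstance:
    """A fresh database instance started in the secondary data center."""
    def __init__(self):
        self.data = {}
        self.ready = False

    def recover(self, log):
        # Traditional log-based recovery: redo the updates of committed
        # transactions; updates of transactions that never committed before
        # the failure are simply not applied.
        for txn_id, key, value in log.records:
            if txn_id in log.committed:
                self.data[key] = value
        # Only now does the instance start executing queries, which is why
        # the tenant sees downtime until recovery completes.
        self.ready = True

def failover_to_secondary(log):
    db = DatabaseInstance()
    db.recover(log)
    return db
```

In this simulation a transaction that was in flight when the primary data center failed leaves no trace after recovery, which reflects the point in the talk that the log in Cassandra is consistent and so the recovered instance is too.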
>>: I don't understand. Why did you need to do database recovery? The database is
[inaudible]?
>> Ashaf Aboulnaga: Yes. So we lost the data center, with Cassandra and with the
databases.
>>: [inaudible] standby database actually running you can reduce that time?
>> Ashaf Aboulnaga: We can, but right now we don't have standby databases. And
standby database systems would open up a whole bunch of interesting issues. If you
want to do standby replication completely ignoring the fact that you have shared
storage, then you can do it without any problems, but what would be interesting is to see
if we can exploit the fact that we have shared storage to make the standby faster.
So what do we have so far? We have scalability and elasticity in storage capacity and
storage bandwidth. We have scalability in the number of tenants, and we have a highly
available and [inaudible] storage tier. We have SQL and ACID transactions for the
tenants. So there is this question about whether we can scale consistency. Can
consistency scale? And what we have now is, in my view, an interesting point in the
spectrum of possible answers to this question. One thing that we don't have is what Dave
was talking about: scaling an individual database system. That is not something we have
looked at yet. And we don't have always-on tenants. So when a tenant fails, we have
the advantage that when we restart the tenant in another data center, that newly started
tenant can find a copy of the database and the log and do log-based recovery, but we still
have to incur downtime. We don't have this automatic and transparent fail [inaudible].
So let me conclude, and I can take any other questions offline. In this talk, what I
argued is that high availability and also scalability for database systems can be
provided by the cloud infrastructure, and this is not something that is new. Many people
are working on different projects that aim to achieve this goal. But what I tried to show
in this talk is that if we take advantage of the well-known characteristics and
semantics of database systems, we can greatly improve our solutions. And I presented
two examples of this. One is RemusDB, which provides high availability in the
virtualization layer, and the other is DBECS, which provides scalability and high
availability by running on eventually consistent cloud storage. Thank you, and sorry for
running overtime.
[applause].