>> Phil Bernstein: So it's a pleasure to be welcoming back Sudipto Das, who
was here as an intern working with us a year ago. And since then he's been a
busy guy. He's got papers published in all the top database conferences and
several others. He was a co-winner of the Best Paper Award at CIDR and a best paper
runner-up at the Mobile Data Management conference, and he has more work coming out in the
pipeline. He certainly isn't going to have time to talk about all of that today, but
he's going to tell us about some of his work on transactional record management
for database systems.
>> Sudipto Das: Thank you very much, Phil, for the nice introduction. And thank
you all for coming. So as Phil pointed out, this is going to be just a sliver of the
work I've done in the broad area of scalable data management. This talk focuses
on scalable, consistent, and elastic database systems, and the context I'm
setting is cloud platforms.
So as many of us are aware, over the past few years more and more applications
are being delivered over the web, over the network. Not only has the front end
changed, this has also resulted in a change in the back-end infrastructure, what
we often call cloud computing.
So in its simplest form, cloud computing is essentially infrastructure, services,
or solutions provided as a service. It's already become a pretty big business,
and it's growing. Some of the key factors behind this success are the economies
of scale and the notion of elasticity, or pay-per-use pricing.
Even though almost every aspect of computing can be provided as a service,
three paradigms have become popular for providing cloud services, namely
infrastructure as a service, platform as a service, all the way up to software as
a service, where your entire software comes as a package delivered over the
cloud.
Irrespective of which layer or which abstraction of the cloud you are using, data
is a central concept, and DBMSs form a mission-critical component of the cloud
software stack. They manage petabytes of data and, more often than not, they
drive the revenue of the company as well. And because of the wide variety of
applications that are deployed in the cloud, these databases often have to deal
with a wide variety of applications themselves, a property we often refer to as
multitenancy.
If you consider the data needs of these applications, you can broadly divide
them into two categories. On one hand we have the OLTP systems, which are
there to serve data through small read-write transactions, and on the other hand
we have the data analysis systems that allow for decision support and business
intelligence. This is obviously a very simplified view of the world.
In this talk, I'll be focusing on the transaction processing systems or the
transaction processing aspect of these databases.
So what does the application landscape for these OLTP databases look like? It
ranges all the way from social gaming to rich content to managed applications.
And we have the cloud application platforms as well that are growing in
popularity, like Windows Azure or Google App Engine, and they have an OLTP
database sitting behind the scenes serving all the applications. So as you can
see, it's pretty rich and diverse.
There are a large number of challenges that need to be solved when designing
such OLTP databases. In this talk, I'll focus on three specific challenges. As we
all know, the amount of data and the number of applications being served is
growing every day, so scale is definitely a big problem: these systems must be
scalable. But because they are OLTP systems, they must also ensure that they
are executing transactions efficiently.
Elasticity is a big thing in the cloud, which allows the infrastructure to be
provisioned on demand. We want the databases that are deployed in the cloud
to be elastic as well; that is, to have the ability to scale on demand in a live
system.
And last, but not least, when you have a big system, you want it to be
self-manageable, to reduce the number of dollars you spend on administrators
for the system. So you want intelligence without a human controller. I'll get into
the details of each of these challenges.
So if you consider the challenge of scalability, there are classically two different
approaches. One approach is scale-up, where you throw more powerful or
higher capacity hardware at the problem. This is typically used in the classical
enterprise setting, where it was more convenient to scale up the databases.
Relational databases are a popular example of systems that scale up: because
of the rich functionality they support, it's easier to scale them up than to scale
them out.
And the key idea here is that even when you have bigger or more expensive
hardware, you still limit access to a single node, which is the key to efficiency
and good performance as well.
Now, obviously this is not a viable solution for the cloud, where you want to
leverage commodity hardware and the economies of scale that come with it. In
that setting you want to use a cluster of commodity servers, in what we call
scale-out. The idea here is that you somehow partition your database, divide it
up into smaller granules, and then distribute it across a cluster of servers. One
class of systems that has taken this to the extreme is what we call the key-value
stores. They broke the database down into the smallest possible granules,
single key-value pairs or rows, and distributed them across clusters of thousands
of servers, or even across geographies in many cases.
But in order to do that, to ensure scalability and to preserve the property that
transactions execute at a single node, they have limited the functionality and
guarantees that are supported. There are a lot of limitations that are enforced.
For this talk, I'll just focus on the fact that these key-value stores do not provide
support for multi-row or multi-step transactions.
Now, why are transactions a big deal? I think many of us in this room already
know that, but just to put you in context, why do we actually care about
transactions? Think of a very simple application like a social network, or any
social application, where a friend request is accepted. This results in updating
the friend lists of the two individual users.
If you were in a world where the database system supported transactions, this is
the code you would write as the application developer. The key idea here is
simplicity, and this is one of the main reasons why databases have been so
popular over the last two or three decades. On the other hand, if you were
writing it on a key-value store with limited guarantees, this is just a fragment of
the code which you'd end up writing. Don't even bother reading it, because there
are a lot of corner cases that have been left out here. And this is what the
application developer has to reimplement for every application they write.
So in summary, it makes life harder and harder for the application developer to
build on these kinds of key-value stores with reduced consistency guarantees.
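To make the contrast concrete, here is a minimal sketch of the transactional version, assuming a relational store with transactions; the table and column names are hypothetical illustrations, not the schema from the talk:

```python
import sqlite3

def accept_friend_request(conn: sqlite3.Connection, user_a: str, user_b: str) -> None:
    """Update both friend lists atomically: either both rows change or neither does."""
    with conn:  # the connection context manager commits on success, rolls back on error
        conn.execute(
            "UPDATE friend_lists SET friends = friends || ',' || ? WHERE user_id = ?",
            (user_b, user_a),
        )
        conn.execute(
            "UPDATE friend_lists SET friends = friends || ',' || ? WHERE user_id = ?",
            (user_a, user_b),
        )
```

Without transactions, the application itself has to handle a partial failure between the two updates, retries, and concurrent writers touching the same rows, which is exactly the corner-case code the slide elides.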
So if you view it along two axes, with scale-out on the vertical axis and ACID
transactions on the horizontal axis, on one hand we have the relational
databases that give rich functionality and strong ACID transactions but are not
very amenable to scale-out; they provide only limited scale-out.
On the other hand we have key-value stores that give you scale-out to probably
thousands of nodes. So the challenge I want to address is bridging the gap
between these two classes of systems, because there is a lot of potential to be
exploited in the middle space by providing transactions, probably not at the
scale of thousands of nodes but at the scale of tens of nodes, which covers a lot
of different types of applications. And it becomes even more critical for cloud
platforms that often cater to a wide variety of applications.
As I've already mentioned, elasticity is a key feature of the cloud. Compared to
the classical enterprise setting, where you have statically allocated capacity, the
cloud allows you to provision your systems on demand; the underlying
infrastructure can scale on demand. We want the database systems to have this
ability as well: to be elastic, and to be lightweight in providing that elasticity by
not introducing a lot of overhead.
And last, but not least, managing these systems is often a pain. Why is that?
Because as you go to scale, failures become the norm rather than the exception.
So you have to detect and recover from failures, coordinate and synchronize
between a cluster of nodes, provision these systems, and do capacity planning,
and the laundry list of what you want automated essentially continues.
There is a quote from a famous open source system called Zookeeper that says
a large distributed system is essentially a zoo, and that's why you need a
zookeeper to automate a lot of this coordination.
Now, to add to it, cloud platforms are inherently multitenant. So there is a
conflict between the goals of the service provider, which is trying to minimize its
operating cost, and the performance guarantees that are given to the
applications being served. And the challenge is how to design these
self-managing systems while minimizing the need for a human controller.
To this end, my dissertation makes the following contributions. To provide
transactions at scale, I've designed two different systems that allow you to scale
out on a cluster of commodity nodes while providing transactional access. One
system, called ElasTraS, uses a static partitioning technique, while another
system, called G-Store, allows you to form the partitions dynamically on demand.
To provide lightweight elasticity, I've proposed two different designs for two
common database architectures. One design, called Albatross, provides
lightweight elasticity in a shared-storage or decoupled-storage architecture.
Zephyr, on the other hand, provides lightweight elasticity in the classical
shared-nothing database cluster. And on the self-manageability front, I'm
currently working on a design called Pythia that addresses how to automate
things like workload characterization and tenant placement in large database
systems.
In the interest of time, in this talk I'll just delve deeper into two of these systems,
and obviously we can talk offline about the rest of the papers.
But before getting into the depth of those two papers, I'd like to spend a couple
of minutes giving an overview of the kind of work I have done. As I've already
said, this talk and my dissertation focus on transaction processing. On the
analytics side, I've also worked on a number of projects to support different and
richer kinds of analytics for different types of data needs. In one project, called
Ricardo, done as an intern at IBM, I worked on allowing statistical models to be
built on terabytes or petabytes of data. Essentially it's an integration between R,
the statistics software, and Hadoop as the data management platform.
I've also worked on a project for multi-dimensional data analysis, providing a
scalable multi-dimensional indexed database system to support location-based
services. Essentially this is an architecture that allows you to ingest a lot of
location updates coming from mobile devices, as well as do analytics online in
such a system, so that you can build rich applications like recommendation
systems on top of it.
And in a somewhat different project, I've worked on social network
anonymization: if you want to anonymize the edge weights in a social graph,
how do you anonymize the graph while preserving some of its properties? On
the other hand, I've also worked on projects that try to leverage novel hardware
and see how we can use that new infrastructure to come up with better and
more efficient database architectures.
As an intern last year at MSR, I was working with Phil on the project Hyder,
which gives a scale-out database architecture leveraging large amounts of flash,
low-latency data center networks, and the large amounts of RAM that are
available.
In a different context, that of data streaming applications, you have long-running,
continuous queries. In that work we were exploring how we can leverage
multicore architectures, or the parallelism inherent in multicore architectures, to
efficiently parallelize these continuous queries.
And we also looked at the same problem on a different piece of hardware,
ternary content addressable memory, which is essentially a hardware hash table.
So this is the kind of work I've done as a PhD student with different
collaborations.
Now, getting back to the main focus of the talk: how do we provide transactions
while scaling out? As I've already said, when you want to scale out to a large
number of nodes, you have to somehow partition the database and then
distribute the partitions across a cluster of nodes. I see a very quiet audience; if
there are questions, please interrupt me. I want this to be more interactive.
Okay, so getting back to partitioning. There has to be a mechanism for statically
partitioning the databases. Classically what we use is table-level partitioning,
where you partition every table individually, independent of the others; the
typical techniques are range-based or hash-based partitioning. This makes
system management pretty easy. But the challenge that arises is that because
the data is not partitioned the way it is accessed, a lot of transactions end up
accessing data from different partitions, often resulting in distributed
transactions, which we all know are pretty expensive.
So a recent trend is to leverage the data access patterns to partition the
database: to partition the database schema itself, or groups of tables, based on
the access patterns. The goal here is to co-locate data items that are frequently
accessed together within the same transaction.
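As a concrete illustration of this idea, here is a minimal sketch, not from the talk, where every table carries a root entity's key and the partition is chosen from that key alone, so a transaction touching only one root entity's data lands on a single node:

```python
import hashlib

NUM_PARTITIONS = 16  # hypothetical cluster size

def partition_for(root_key: str) -> int:
    """Route any row to a partition based solely on its root entity's key."""
    digest = hashlib.sha1(root_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# A user's profile row, posts, and friend edges all carry the same root key,
# so they land on the same partition and a transaction on that user's data is single-node.
print(partition_for("user:42") == partition_for("user:42"))  # True
```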
Here I've shown two different approaches. One is where you limit the schema
by providing a specific schema structure; the other is where you exploit the
workload patterns to drive the partitioning. This one is work done at MIT, and
this one is work done by me in one of our systems. A similar kind of schema
pattern is also supported in Cloud SQL Server as well as Megastore, which are
other commercial systems.
So if you consider this problem of scaling out where you have statically
partitioned the database so that most of your transactions are within a single
partition, what you have essentially done is made your transaction processing
easier. But when you are scaling out to a big system, you still have to deal with
the whole set of distributed systems challenges that were listed in one of the
earlier slides.
I've proposed, designed, and implemented a system called ElasTraS that
provides one way of solving these challenges. This is a talk in itself, so
unfortunately I won't be getting into the details of the system, but we can
definitely talk offline about it.
There are a number of other systems that were developed concurrently as well:
Cloud SQL Server, which is the system that supports SQL Azure and Microsoft
hosted services such as Exchange Hosted Archive, and was done at Microsoft;
Google Megastore, which powers Google App Engine; as well as an academic
prototype from MIT called Relational Cloud.
For this talk, I'll focus on a somewhat different question: instead of viewing the
partitions as statically formed, what happens if you form the partitions
dynamically? Static partitioning leverages the idea that the access patterns
partition statically. What if the access patterns change, and often rapidly?
There are a bunch of applications where we observe this pattern: for example,
online gaming applications, collaboration-based applications, and recently we
also came across scientific computing applications with these kinds of access
patterns. I'll get into the details of one of these applications later in the slides.
As you can see, the access patterns are evolving, so this is obviously not
amenable to static partitioning: we lose the benefit of statically putting data
together to limit most transactions to a single node. Because the access
patterns are changing, you end up doing a lot of distributed transactions. So the
question we wanted to answer is: how do we get the benefit of partitioning when
accesses do not statically partition? And we propose a solution that allows that.
So let us take the example of an online multi-player game. We have a statically
partitioned database; it doesn't matter how it was partitioned. Let us assume that
each row corresponds to a player profile: a player ID, the player's name, some
dollar amount associated with the player, et cetera.
Now, we have a bunch of players spread across the static partitions, and these
players want to come together and play an online game. While the game is in
progress, you want to execute transactions on this set of player profiles that are
part of the game. So ideally you would want to co-locate these data items at
one node so that your transactions are local. But there is a problem: players
move from one game to another. They want to play with one set of friends and
then move to another set of friends. So the data items on which you want
transactions change with time.
Similarly, players can try playing different games. There are a lot of games on a
social platform like Facebook, and players want to move around between
games. So essentially these groups, or partitions, are dynamic.
>>: Do players ever want to play two games simultaneously [inaudible].
>> Sudipto Das: I'll get to that. That's a very good question, actually. Dave
knows how to trick me.
But in addition, if your game becomes popular, you also have to deal with the
challenge of hundreds of thousands of these concurrent game instances, or
groups, being formed.
So how do you deal with the scale problem in addition to the dynamicity? To
restate the problem: we want to have transactional access over groups of data
items, and we want to avoid distributed transactions in doing that.
This is a pretty hard problem, because the application is not trying to help me.
So what I suggest is to expose an abstraction to the application so it can help
me out. What I want is for the application to declare to me the set of data items
on which it wants transactional access. We call this the Key Group abstraction.
I want the groups to be small; I'll get into why. I want each group to execute a
non-trivial number of transactions as well; again, I'll get into why. And as I said,
these groups can be dynamic and formed on demand, so the application can
create a group as well as delete a group.
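A minimal sketch of what such a Key Group interface could look like from the application's side; the class and method names here are hypothetical, not the actual G-Store API:

```python
from dataclasses import dataclass, field

@dataclass
class KeyGroup:
    """A dynamically formed group of keys with one designated leader key."""
    group_id: str
    leader_key: str
    follower_keys: set = field(default_factory=set)

class GroupingLayer:
    """Illustrative application-facing interface: groups are created and deleted on demand."""

    def create_group(self, group_id: str, keys: list) -> KeyGroup:
        leader, *followers = keys  # leader selection can be arbitrary or strategic
        # ... run the group formation protocol: followers transfer ownership to the leader's node
        return KeyGroup(group_id, leader, set(followers))

    def execute(self, group: KeyGroup, txn_fn):
        """Route txn_fn to the leader node, which now owns every key in the group."""
        raise NotImplementedError  # single-node transaction execution at the leader

    def delete_group(self, group: KeyGroup) -> None:
        """Hand ownership back to the followers and dissolve the group."""
        raise NotImplementedError  # run the group deletion handshake
```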
And if you want to stretch your imagination to multitenant systems, you can view
the groups as dynamically formed tenant databases, where your tenant data
lives in a shared-table kind of layout.
Now, how are we going to do it? As I've said, because the application comes in
and declares an arbitrary set of data items, those items can be distributed. So
as a first step I'm going to select one of these data items as the leader. The
leader selection can be arbitrary, or it can be a strategic decision. Once a leader
is selected, the rest of the keys in the group, which we call the followers, transfer
their ownership to the leader node, so that all read-write access to the group's
data items is co-located at one node.
The key idea here is that, again, we are limiting the group's accesses to a single
node so that transactions can execute efficiently. So, yes?
>>: The data [inaudible].
>> Sudipto Das: Conceptually, only the ownership, the read-write access, is
moved. The data is actually not moved; I'll get into the details. And to answer
Dave's question: because we are moving access to one node, what happens if
groups overlap? If the overlaps are small, the groups can be co-located at one
node, but if the overlaps become adversarial, obviously you end up doing a lot of
distributed transactions.
I'm moving the ownership to a single node; that's why I want the groups to be
small, so that I can serve them from a single node. And in addition, because I'm
paying a cost up front for the ownership movement, I want a non-trivial number
of transactions to execute, so that I can amortize that cost over the execution of
those transactions.
I've again made my life easier by putting things together at one node; the
transaction execution becomes easier. But what I've added is dynamics to the
system: this handshake between the followers and the leader now has to
guarantee correctness in the presence of the different types of failures that can
happen.
>>: [inaudible].
>> Sudipto Das: Just one transaction.
>>: Just one?
>> Sudipto Das: Yes.
>>: Okay.
>> Sudipto Das: That's similar to a distributed transaction.
So essentially we execute what we call a group formation protocol, which is
similar to a distributed transaction, to do this in a fault-tolerant way. Now, what
are the challenges? The challenge here is to guarantee that the contract
between the leader and the followers is met in the presence of either the leader
or the followers failing; in the presence of lost, duplicated, or reordered
messages arising from network failures; and in the presence of the dynamics of
the underlying system. Because you have a statically partitioned system sitting
underneath that can do its own set of things in funny ways, you still want to
guarantee correctness in the presence of that.
And now that I have brought things together at one node, how do I efficiently
execute ACID transactions on these dynamically formed groups?
So let's take these one at a time. We'll deal with the first challenge, the grouping
protocol; essentially I'll give a very high-level overview of how we do it. If you
consider the timeline, this is how the leader's time progresses, and this is how a
follower's time progresses. At some point, the application comes and says, hey,
I want to form a group, I want transactional access to this group. That is when
the leader executes a handshake with all the follower nodes by exchanging
these messages.
Once this handshake completes, the ownership is at one node, so all group
operations are now local and therefore efficient. At some point the application
says, I'm done with it, I don't really care about this group anymore, and a delete
request comes in, at which point there is another handshake that guarantees
that ownership is given back to the followers and the keys are freed from the
group.
Now, what I've abstracted away here is that any of these messages can fail;
messages can be lost. So we use mechanisms such as timers and
retransmissions to deal with failed messages. Messages can also get reordered
or duplicated, or be delivered after a long period of time, so we use a per-group
concept [inaudible] to detect stale or reordered messages. I'm not getting into
the details of these; the paper has all the details.
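Here is a rough sketch of the leader's side of such a handshake, assuming a simple request/ack message pattern with timers and retransmission; the message names, timeout value, and helper functions are all hypothetical:

```python
import time

JOIN_TIMEOUT = 0.5  # seconds before unacknowledged join requests are retransmitted
MAX_ROUNDS = 10

def form_group(group_id, leader_key, follower_keys, send, recv_acks, log):
    """Leader side: log intent, request ownership from every follower, retransmit
    until all acks arrive, then log that the group is formed."""
    log("CREATING", group_id, follower_keys)    # persisted so a restarted leader can resume
    pending = set(follower_keys)
    for _ in range(MAX_ROUNDS):
        for key in pending:
            send(key, ("JOIN_REQUEST", group_id, leader_key))
        time.sleep(JOIN_TIMEOUT)
        pending -= recv_acks(group_id)           # acks name the keys whose ownership moved
        if not pending:
            log("CREATED", group_id)             # all ownership is now at the leader node
            return True
    return False                                 # give up; the application can retry or abort
```

The delete handshake would be symmetric, handing ownership back to the followers and logging the group's dissolution.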
In addition, these nodes can fail as well. What you might have noticed here is
that there are a bunch of logging operations happening for all the group
operations and the messages being exchanged. This logging happens at both
the leader and the followers; it persistently stores the group information and
allows us to recreate the group state after a failure.
So if the node fails somewhere here, I don't terminate the group formation; it
resumes after the failure.
>>: I forgot to tell you [inaudible].
>> Sudipto Das: Yes, a crash failure. Not a malicious one.
>>: No. Well, what if there's a permanent failure?
>> Sudipto Das: Yes, there can be a permanent failure as well. In that case, the
log has to persist across the failure. So what I need is a failure model where I
still have access to the log. One idea is to replicate the log itself by putting it in
replicated storage; that way you can deal with single-node failures and still have
the log.
For folks who are familiar with database architectures, this is conceptually
similar to locking. The difference here is that instead of locks being held by a
transaction, the locks are held by the group for the lifetime of the group.
Now, how do I efficiently execute transactions? Once everything is at one node,
this essentially boils down to an architecture like this: every node has a
transaction manager that executes transactions on the group, and because the
leader has unique access to the data items, you can aggressively cache the data
items at the leader. So there is a cache manager that caches all the data; to
answer the earlier question, this is just a cache of the data, and the actual data
is still with the followers. All the transactional updates are local to the cache.
So how do I guarantee persistence of these updates? I use logging at the
transaction manager, which logs all the transactional updates so that I can deal
with failures of the transaction manager as well and recover from the log.
The cache is asynchronously propagated to the followers, so the followers
eventually get all the updates, and there is a guarantee that before a group is
deleted, all updates have been propagated back. So by paying the cost of one
distributed transaction at the start of the group, the rest of your life becomes
easy and efficient. Essentially, after executing a few transactions you break
even and start collecting the profit.
In terms of implementation, how can we implement it? As a proof of concept, I
implemented it on top of a key-value store. I chose HBase, which is an open
source variant of Bigtable. So here we have the key-value store logic executing
on a cluster of servers, and on top of it I added a grouping middleware that has
a grouping layer, which executes group formation and deletion, as well as the
transaction manager, which executes transactions on the groups.
And to answer your question, you can put the log in the distributed storage,
which allows the log to persist across failures of individual nodes.
So how did we do in terms of performance? Our evaluation was done using this
prototype implementation, which is about 10,000 lines of code added in the
middleware layer, and I experimented on Amazon EC2 to do some scale-out
experiments. I used a benchmark modeling an online multi-player game, and on
a modest cluster of 10 nodes we were able to serve about a billion rows, which
is about a terabyte of data. With groups of about a hundred keys, the group
creation latency was somewhere between 10 and 100 milliseconds, depending
on how you select the groups.
And on this cluster of 10 nodes, about 10,000 groups could be served
concurrently. Obviously this is just a snapshot of the experiments; the paper gets
into the details of how the numbers vary with the different parameters. And this
is just one of the same set of experiments, which shows how the group creation
latency and the group creation throughput vary depending on how you
implement the middleware layer, whether it sits within the key-value store or
outside it, and depending on different distributions of key selection.
As you can see, with a distribution where keys are contiguous, my
implementation can efficiently batch the group formation to give you very low
group formation latency, while an adversarial distribution can obviously make it
worse.
Now, I've shown you a mechanism for executing transactions, and I've briefly
discussed the mechanism that allows you to execute transactions in the
scale-out setting.
>>: [inaudible]. So what did you use for partitioning, just round-robin partitioning
or hash partitioning?
>> Sudipto Das: So this is the range partitioning here.
>>: So in terms of transactions per second, how much better are you --
>> Sudipto Das: So --
>>: [inaudible].
>> Sudipto Das: Yeah. So the thing is that we don't have an experimental
evaluation for that; we are currently working on it, because there is no distributed
transaction implementation on HBase, so I'm currently implementing one. But
the back-of-the-envelope calculation is that after you have formed the group and
done two or three transactions, you have broken even on the cost of group
formation, and anything you do after that is profit. That's back of the envelope;
in practice it might vary.
>>: That's because the group formation is essentially like running a transaction
[inaudible] you're paying for one distributed transaction and then [inaudible].
>> Sudipto Das: Yeah. So I need one for formation and one for deletion.
Yeah?
>>: So deciding when to form a group and the size of the group, you're
expecting the application to handle all this for you?
>> Sudipto Das: Yeah. In the current work, yes. But in the future we would like
to automate it based on workload patterns, or by exposing some form of higher
level semantics so that the applications can do it declaratively. But in the current
setting, yes. Yes?
>>: So do you rely on shared storage or --
>> Sudipto Das: No, we don't. This implementation happened to use it, but we
don't require anything in shared storage except for the log, for high availability.
The key idea here is that we are decoupling the transaction execution from the
actual data storage and allowing you to do a lightweight reorganization. You can
view it as shared storage because the underlying storage itself is shared across
multiple groups, but nothing is inherent in the design.
>>: If groups overlap, do you still see performance benefits compared to not
using groups at all?
>> Sudipto Das: If the group overlaps are small, then yes; you can stick multiple
groups together at one node. But if they overlap arbitrarily, then obviously this is
not a good solution, and you probably need something different.
Okay.
So now, assuming that we can execute transactions somehow, and two different
approaches have been shown, how do you make the design elastic so that we
have the property we wanted? What exactly does elasticity mean in the
database tier? I'll give a very high-level motivating example. Let's take a
simplified view of the world: there is a set of application clients accessing the
service through a load balancer, there is a tier of application and web servers,
and the database is sitting at the bottom.
This is the case I'll consider. I'll motivate it more from the multitenancy aspect,
but it can be applied to the database partitioning based approach as well. I use
a color coding where the clients have a color corresponding to their database
partition, here called a tenant; the tenants are color coded.
Now, I designed this application and put it up on Facebook, and it becomes
extremely popular. One of my applications becomes extremely popular, and
there is a surge in load.
If the infrastructure is deployed in a cloud, the application server tier can easily
scale out, because very little state is shared across those servers. I would want
the same property in the database tier, which is typically not provided: you
would want to add a new node to the system and migrate over parts of the
database, in this case some of the tenants' databases, so that you can
redistribute or balance the load across the different set of servers.
I would also want the reverse: when the load decreases, I want the ability to
consolidate back as well, since consolidation is critical to optimizing the
operating cost in a pay-per-use infrastructure.
So as you have seen, elasticity in the database tier essentially boils down to
migrating a database partition, or a tenant if you will, in a live system without
introducing any downtime. This allows you to optimize the operating cost, as I've
said, and in a multitenant system where multiple tenants are sharing the
system's resources, it is an effective tool for doing online resource orchestration
on demand as the resource requirements change.
Obviously, migration is a loaded term, because migration can also mean
migrating data between versions as the database software evolves, or migrating
data as the database schema evolves. My use of the term migration in this
setting is primarily for elasticity and is different from those contexts.
One of the simple, straightforward solutions people can easily come up with is:
why don't you use VM migration for database elasticity? How can we do it? One
approach is to give every tenant its own database running within a VM, and then
there is a hypervisor that multiplexes these VMs on a single node.
This is a valid design supported by the current state of the art, and you can now
use VM migration to migrate things on the fly. However, as many of you know,
databases weren't designed for this kind of operation, and if you are running
multiple uncoordinated databases at the same node, God be with you in terms of
performance.
And there is a recent paper that shows the overhead can be as much as an
order of magnitude, both in terms of performance and in terms of consolidation.
So what you would want is multiple database partitions resident within the same
database process. That gives you somewhat better performance. Ideally I
wouldn't want the VM to be sitting here at all, but let us consider this scenario for
now. Now, you can still use the VM to migrate the database, but what you have
lost is the ability to do fine-grained load balancing. Was that a question?
>>: [inaudible].
>> Sudipto Das: Okay. So what you would ideally want is a world where you
have only the database process running on bare metal, with a bunch of tenants
or database partitions sharing the same database process, a model which is
called shared-process multitenancy in the database literature, and you would
want to migrate individual partitions on demand in a live system.
So what I'm essentially saying is that what VMs allow you to do for operating
systems, I'm going to allow you to do in the database tier. Essentially you can
view this as virtualization in the database tier itself.
Again, there is another straightforward solution, because databases were
designed to be fault tolerant: you can stop the database at the source, copy it
over to the destination, and then start serving it at the destination. I call this the
stop-and-copy technique. And again, this can be done.
However, it is expensive. Why is it expensive? Because it results in an
unavailability window, and I want minimal unavailability during migration, ideally
no unavailability at all; I want to minimize this metric. In addition, I want to
minimize any impact on the tenant while I'm doing migration. Migration is done
for system management; the tenant should not be aware of it. So I want to
minimize the number of failed requests as well as have minimal impact on the
performance of the transactions that are executing.
And in addition, from the system's perspective, I also want to minimize the
amount of data that is transferred as a result of migration: there is some amount
of data that needs to be migrated, and this is the overhead on top of that.
As I've said, there are two different approaches in which databases are
designed. One approach, which we call decoupled storage, is where the
transaction execution logic is decoupled from the storage logic. There are
several examples: the system ElasTraS which I designed, and G-Store happens
to be a similar design as well; project Deuteronomy at Microsoft Research as
well as Google Megastore fall into this category.
Now, because your persistent data is stored in network-attached storage, you
don't need to migrate the data when you migrate a partition; it essentially boils
down to migrating the execution state of the database, that is, migrating the
active transactions as well as the database cache. I proposed a technique
called Albatross that does this, implemented on top of ElasTraS. This is a paper
that will be presented at the upcoming VLDB in a couple of weeks.
There is also another way of designing databases, the standard shared-nothing
design, where the persistent data is stored in locally attached storage. Here,
when you migrate, it's a harder problem, because now you have to move large
amounts of data as well. So how do you guarantee that you incur minimal cost
during such a migration?
Common examples of this architecture are SQL Azure; Relational Cloud, which
is again the prototype from MIT; and MySQL, which also has a cluster offering
that is a similar shared-nothing cluster. For this setting I proposed a technique
called Zephyr, which was presented recently at SIGMOD in June.
In this talk I'll just focus on Zephyr. You are welcome to come to VLDB to get
the details of --
>>: Just a comment. In the shared-nothing architectures, take SQL Azure, for
example, they already have certain availability guarantees, and for that they
already replicate the data, right?
>> Sudipto Das: Yes.
>>: Not necessarily need to [inaudible].
>> Sudipto Das: Yes.
>>: [inaudible].
>> Sudipto Das: That's a very good point. And as I'll show, we can leverage that
replication. But when you are doing elasticity, you are trying to add a new node,
and you don't always have a replica running at that node. So I want a technique
that allows you to migrate even in that setting. But yes, as you said, replication
can be a benefit.
So why is this a hard problem? I've already said that multiple times, so let's get
into the actual points. We want to migrate the persistent image of the database,
which can be on the order of gigabytes. So how do you guarantee no downtime
while you are migrating such large amounts of data? You have to execute
transactions while the data is being migrated.
Now, because it's not an instantaneous process, again, there can be failures.
Nodes can fail during migration, both the source and the destination. So how do
you guarantee correctness in the presence of failures, especially transaction
atomicity and durability? How do you guarantee that the executing transactions
keep those properties, and that you can recover the state of migration if there is
a failure in the middle, so that you don't leave the system in an inconsistent
state?
In addition, because you don't want any downtime, transactions will be executing
during migration. So how do you guarantee serializability of these transactions
while you're migrating things on the fly, so that from the tenant's perspective it is
business as usual, as if nothing happened?
Our approach is, instead of viewing migration as one chunk being migrated, to
break migration down into a collection of phases. Migration starts with the
transfer of minimal information from the source to the destination, which we call
the wireframe. This minimal information consists of the database schema, user
authentication information, and another thing called the index wireframe, which
I'll get into.
Again, instead of viewing the entire database as something to be migrated as a
whole, we view the database as a collection of database pages, which is often
the case anyway, and we use the concept of unique page ownership to migrate
database pages on demand from the source to the destination.
To allow for zero downtime, there is a phase in migration where we allow both
the source and the destination to concurrently execute transactions, and we
show how you can have minimal transaction synchronization and still guarantee
serializability for such transactions. And we use mechanisms of logging and
handshaking to guarantee fault tolerance during the migration as well.
In this talk, I'll make some simplifying assumptions to limit the scope. I'll assume
that transactions execute at a single node. I do not leverage replication for this
technique, but I can definitely do that; the paper shows how. And I'll assume
that there are some indices used to keep track of pages, and I don't allow any
structural changes to the indices during migration.
The paper relaxes all of these assumptions and gives an extended design that is
more flexible than what I'll present in this talk.
>>: Question.
>> Sudipto Das: Yeah.
>>: What [inaudible] update? [inaudible].
>> Sudipto Das: I'll get into that. That's a very good point actually.
>>: There are ways of doing replication which don't take the original source
offline while you're doing the catch-up: you sort of scan and copy the data and
then use the log to bring it up to date.
>> Sudipto Das: Yeah, yeah.
>>: Which greatly shortens or perhaps eliminates the service interruption. You
seem to paint -- you painted a picture before of doing the replication.
>> Sudipto Das: Yes, yes. [inaudible].
[brief talking over].
>> Sudipto Das: Yeah, that's a very good point. There are two reasons why we
don't use that here. One is that for setting up a new replica, you incur a lot of
checkpointing overhead, that is, a lot of I/O, and if the source is already
overloaded you don't want to add load to the source.
And the second is that, as you'll see here, the destination starts executing
transactions as soon as you start migration, so you are able to offload some of
the load to the destination immediately, which gives you better performance.
But obviously the exact numbers would vary depending on the workload.
So my view of the database is a collection of pages. There is a set of active
transactions that are executing, and there is an index that keeps track of what
exactly is in the pages. I'll use this index to tag along additional information,
which I call the page ownership information.
The convention in this talk is that if a page is white, this node owns the page,
that is, it has unique read-write access to it; and if it's grayed out, the node has
information about the page but doesn't have the data. When migration starts,
you freeze the index wireframe -- I'll get into what freezing means -- and migrate
over the index wireframe.
So what exactly is an index wireframe? To take the specific case of a B+ tree
index, the index wireframe is the internal nodes of the B+ tree, and your actual
data resides in the database pages. What I mean by freezing the index is that I
don't allow any structural changes to this metadata. I still allow updates to the
individual pages, but if there is an update or an insert that results in a new page
being allocated, which would be a change to the index wireframe, that is not
allowed during migration.
>>: [inaudible].
>> Sudipto Das: It will be aborted in this setting. But there are simple extensions
that can be done to deal with that problem. So essentially, when I migrate the
index wireframe, this is what the destination has: it has information that there are
some database pages, but it doesn't have the database pages themselves.
So this is the state of the destination in what we call the dual mode. In this
mode, the --
>>: Let me back up just to make sure I understood that. So you're basically
saying no page splits, is that --
>> Sudipto Das: No page splits during migration.
>>: During migration.
>> Sudipto Das: Yes.
>>: So you can do inserts, but as soon as you hit the page limit you get [inaudible].
>> Sudipto Das: Yeah.
>>: [inaudible] transaction.
>> Sudipto Das: Yes. That's right.
>>: Would using a [inaudible] make you relax that requirement?
>> Sudipto Das: Yes, it does. And you can also use, you know, overflow
buckets, which are similar to [inaudible].
Okay. So at the start of the dual mode, the destination just has the meta
information. At this point I allow new transactions to go to the destination, while
the source is still completing the transactions that are still active or that are still
arriving due to stale metadata at the router.
Now, because transactions start executing at the destination without the data,
the data is pulled in on demand, using the index structure to keep track of the
ownership information. So let's say, for example, that page P3 is being accessed
by a transaction at the destination. At this point a request is sent to the source,
and the source does some synchronization to ensure that page P3 is not
currently being accessed by any other transaction; if that is true, it changes the
ownership information and migrates the page over to the destination.
This is the only point in time where the two nodes synchronize when executing
transactions, and the concept of unique page ownership is what allows us to use
this mechanism. Now, as soon as the source completes all its transactions, it
figures out which pages have still not been migrated and asynchronously pushes
them to the destination, and the destination keeps receiving that information.
But while transactions are executing at the destination, pages can still be pulled
on demand, and we show how the index metadata can be used to detect
duplicates in this setting and guarantee that you don't mess up the database
state.
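Here is a minimal sketch of that on-demand pull with unique page ownership, under the single-node and frozen-index assumptions above; all names are illustrative:

```python
import threading

class SourceNode:
    """Holds the data pages and gives up ownership of a page only when no local
    transaction is currently using it."""

    def __init__(self, pages):
        self.pages = pages            # page_id -> page contents
        self.owned = set(pages)       # pages this node still owns
        self.in_use = set()           # pages pinned by active local transactions
        self.lock = threading.Lock()

    def request_page(self, page_id):
        with self.lock:               # the single synchronization point between the two nodes
            if page_id not in self.owned or page_id in self.in_use:
                return None           # already migrated (duplicate request) or busy
            self.owned.discard(page_id)      # transfer unique ownership to the destination
            return self.pages[page_id]

class DestinationNode:
    """Pulls a page the first time one of its transactions touches it."""

    def __init__(self, source, wireframe_page_ids):
        self.source = source
        self.known = set(wireframe_page_ids)  # learned from the index wireframe
        self.pages = {}                       # starts empty

    def read_page(self, page_id):
        if page_id not in self.pages:
            data = self.source.request_page(page_id)
            if data is None:
                raise RuntimeError("page busy at the source; retry or abort the transaction")
            self.pages[page_id] = data
        return self.pages[page_id]
```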
And once all the pages have been migrated, the source can give up all its
resources; the paper again shows how to get rid of the logs as well, so that the
source is completely free and the destination is the sole owner.
Now, because of the simplifications which I made for the sake of this talk, there
are some resulting artifacts. I migrate pages only once: once a page is migrated
from the source to the destination, it is never pulled back. This allows for forward
progress and quick migration, but the implication is that any transaction at the
source that accesses a page that has already been migrated must be aborted.
Remember, this is for any access, because I want to give serializability; if I only
needed snapshot isolation, I could allow reads.
As I've said, there are no structural changes to the index, so any transaction, at
either the source or the destination, that results in a structural change to the
index would be aborted as well. And because the destination pulls pages on
demand from the source, there is a higher latency for some of the transactions
going to the destination. Most of the time the page is pulled from the source's
cache, so it's not such a big latency, but it is somewhat higher because of the
network traffic.
>>: Why do you need that restriction on structural changes? I mean, if you're
executing transactions at the destination [inaudible], so what?
>> Sudipto Das: I don't strictly need it; this is just for simplicity, to make merging
the indices easier. In the paper we actually talk about an extension where we
don't need that. So I think I'm probably short on time, so I'll just skim over some
of the serializability details -- is that okay? Okay.
So essentially what I've done is use a simple synchronization mechanism. So
how am I going to guarantee serializability during transaction execution? As you
can see, the dual mode is the only concern, because only in the dual mode are
the two nodes executing transactions concurrently.
In the paper we show that you can use local predicate locking at the index level
and exclusive page ownership at the leaf level to ensure that there are no
phantoms during migration. We use strict two-phase locking during normal
transaction execution to guarantee that transactions are locally serializable.
And because we use this migrate-only-once property for the database pages, we
can show that any transaction at the destination is ordered after any conflicting
transaction at the source. So there is a strict ordering that is enforced, which
prevents cycles in the serialization graph and thus gives you guaranteed
serializability.
Now, recovery becomes complicated as well, because there are two nodes
executing transactions concurrently. But again, I use this causal ordering
property on the pages: if there are two transactions that conflict on a page, I just
want a strict ordering between them; I don't care about the other transactions.
So when I'm moving pages over from the source to the destination, I also carry
over the log sequence number, so that all transactions at the destination are
ordered after the source's transactions, even in the recovery log. And during
recovery we just replay while maintaining this order, so the conflict order is
preserved during recovery as well.
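A small sketch of that log sequence number carry-over, under the assumptions above; the field names are hypothetical:

```python
def receive_page(destination, page):
    """On page arrival, advance the destination's next LSN past the page's last LSN,
    so every later transaction at the destination logs after the source's conflicting ones."""
    destination.next_lsn = max(destination.next_lsn, page.last_lsn + 1)
    destination.pages[page.page_id] = page

def replay(log_records):
    """Recovery replays records in LSN order, which preserves the cross-node conflict order."""
    for record in sorted(log_records, key=lambda r: r.lsn):
        record.apply()
```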
That was transaction recovery; how do you ensure migration recovery? Because
the two nodes are transitioning between different migration states, you have to
guarantee that they are always in consistent states, and there is no confusion
about that.
In the paper we show how you can atomically transition from one phase of
migration to another; essentially we use logging and handshake protocols for
doing this atomic transition. In addition, every page always has a unique owner
node, and you can use bookkeeping at the index level to keep track of this
ownership information, even after a failure.
In a simple design, you would always log the migration of a page, but that
introduces a lot of I/O as well. So in the paper we show how you can instead
rely on the transaction semantics to capture this migration information, make it
persistent, and be able to recover it. Essentially we show that in the presence of
arbitrary repeated failures, we can guarantee that updates made to database
pages are consistent, that a failure does not leave a page without an owner, and
that both the source and the destination are in the same migration mode. This
essentially extends to the correctness proof.
And we also show how you can guarantee termination and starvation freedom in
the presence of arbitrary failures as well.
>>: So why isn't this simply a special case of a data sharing system?
>> Sudipto Das: Actually, it is, and the extension, which I didn't talk about, relies
on data sharing and on global and local lock managers to exchange the pages.
But yes, this becomes a data sharing system only during migration.
In terms of implementation, the design was implemented in an open source
OLTP database called H2, which provides all the bells and whistles of a classical
OLTP database. To implement this, we added support for freezing the indices as
well as keeping track of ownership information; this was about 6,000 lines of
code changed in the database engine.
We used an open source SQL router to migrate connections from the source to
the destination as a result of migration.
How did we do in terms of performance? We evaluated it using an open source
microbenchmark: we adapted the Yahoo cloud serving benchmark to add
transactions and to vary the different parameters of the workload. Depending on
the database size and the workload that is executing, the [inaudible] technique,
which is stop and copy, results in 3 to 8 seconds during which the database is
unavailable. This is for a fairly small database, about 200 megabytes or so.
As you make the database bigger, this unavailability window becomes longer
and longer. And note that during this period you can only run the database in
read-only mode, so any update transaction has to abort during migration. On
the other hand, Zephyr does not result in any downtime, because at any point in
time either the source or the destination is executing transactions.
In terms of failed operations, because stop and copy has to fail all updates, it
results in hundreds to thousands of operations failing during migration; again,
these numbers vary depending on the workload. On the other hand, even the
simple prototype of Zephyr results in orders of magnitude fewer failed operations.
In the paper we show how you can guarantee zero transaction loss, but we don't
have an implementation of that yet.
So even the simple implementation is orders of magnitude better in that setting.
>>: So where do the failures [inaudible] come from?
>> Sudipto Das: So the failures come from the inserts. We ran an adversarial
workload with a lot of inserts, and whenever there is a change in the index
structure, that results in a failure.
>>: Also, some transactions may just get sent to the wrong machine.
>> Sudipto Das: At the source, yes. But as long as the source is still active, it
can still serve those transactions. But if they get --
>>: I see. So --
>> Sudipto Das: Yeah.
>>: Aborting the transaction and then redirecting it to the target is not
considered a failed transaction?
>> Sudipto Das: No. In this number, no.
>>: Okay.
>>: So a transaction requests two pages, one is at the source, the other has
been migrated. So will that transaction --
>> Sudipto Das: That transaction will fail.
>>: It will fail. So you won't do a distributed --
>> Sudipto Das: No, I won't. So across my work the idea is to avoid distributed
transactions where possible. But there is an extension that does that, actually;
it uses a shared lock manager for getting the pages.
>>: Wouldn't the benchmark do that --
>> Sudipto Das: So this is the Yahoo cloud serving benchmark modified to
support transactions, multi-table transactions, and client sessions. The Yahoo
cloud serving benchmark was a benchmark for key-value stores,
which obviously doesn't have any transactions.
>>: So what fraction of the transactions had multi-page -- [inaudible] multi-page
access --
>> Sudipto Das: Every transaction required a multi-page access. And every
transaction was multi-operation. By default it was 10 operations. You can vary the number of
operations within a transaction. We also looked at 25 or 30 operation
transactions as well. So all of these parameters were varied during the
experiments.
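Just to illustrate the shape of that workload, here is a small sketch of a transaction generator along the lines described; the names and parameters are hypothetical, and this is not the modified YCSB code itself.

```python
import random

# Illustrative generator: each transaction groups several read/write operations
# (10 by default in these experiments; 25-30 operations in other runs).
def make_transaction(num_keys, ops_per_txn=10, write_fraction=0.5):
    ops = []
    for _ in range(ops_per_txn):
        key = random.randint(0, num_keys - 1)
        kind = "UPDATE" if random.random() < write_fraction else "SELECT"
        ops.append((kind, key))
    return ops   # executed as a single multi-operation transaction

# Example: a 25-operation transaction over a table of 100,000 rows.
txn = make_transaction(num_keys=100_000, ops_per_txn=25)
```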
>>: So if you do stop and copy, you could do an IO efficient copy, right?
>> Sudipto Das: Yes.
>>: [inaudible].
>> Sudipto Das: Yes.
>>: But you think it's going to pull pages on demand?
>> Sudipto Das: Yeah. Essentially --
>>: That could be a huge, huge --
>> Sudipto Das: In theory, yes. But in practice, we didn't observe that. Because
what happens is that most of the pages that are accessed at the destination
during the pull phase are often in the cache, just because of locality. And then
the final phase is just a copy through the disk. So in theory, it can be bad, but
not in practice -- at least for the workloads we ran.
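A minimal sketch of that dual-mode page access, assuming hypothetical `cache`, `source`, and disk objects (this illustrates the idea, not the H2 prototype code):

```python
# During the pull phase the destination serves hot pages from its own cache
# and pulls cold pages from the source on demand.
def read_page(page_id, cache, source):
    if page_id in cache:                    # common case, thanks to locality
        return cache[page_id]
    page = source.fetch_page(page_id)       # on-demand pull over the network
    cache[page_id] = page
    return page

# Final phase: whatever was never pulled is copied straight through the disk.
def finish_migration(all_page_ids, cache, source, destination_disk):
    for page_id in all_page_ids:
        if page_id not in cache:
            destination_disk.write(page_id, source.fetch_page(page_id))
```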
So in terms of operational overhead, we show that the operational overhead
resulting from this migration is very low: between 10 and 15 percent
increased latency during migration.
So this is a graph that shows you how the number of failed
operations increases as you increase the load.
>>: [inaudible].
>> Sudipto Das: Yeah?
>>: [inaudible] so if you actually take the load [inaudible] the system is already
loaded because, you know, you actually trying to [inaudible] so I think -- so I think
on that context any increase in the [inaudible], I mean, the angle of the migration
is kind of critical. So have you done any comparative studies --
>> Sudipto Das: So the thing is, I agree with that, but if you hadn't
migrated, you would do worse because your source is already overloaded. So
that's the argument against it. And that is one of the reasons we use this
technique: immediately you offload the load to the destination, the
destination starts catching up, and whatever load remains on the source is just for fetching the
pages on demand.
>>: But if the source is already overloaded, the source takes care of the
migration, because if you have [inaudible] page from the destination, then you are
too overloaded to fulfill that request --
>> Sudipto Das: Yes, so that's a good point as well. So the idea here is that I
haven't talked about the controller level. So there is a controller that is sitting on
top of these things. So it is the responsibility of the controller to start migration,
at least when there is some room left to migrate.
If it's already too late and you are a hundred percent overloaded, tough luck --
you have already screwed up your system, and migration wouldn't help
you. So typically, when you're getting close to being, you know, fully booked,
that's when you initiate migration. That's part of the controller's responsibility.
But as I said, it's a very low, you know, overhead on the source. It's just fetching
pages and not executing the transactions. If it were executing transactions, it
would have been even more in the [inaudible].
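To sketch what such a controller policy might look like (the names and the threshold below are purely hypothetical; the talk does not prescribe a specific policy):

```python
# Hypothetical threshold-based policy: start live migration while there is
# still headroom, rather than after the source is fully saturated.
MIGRATION_THRESHOLD = 0.8   # fraction of capacity; illustrative value

def maybe_migrate(node, controller):
    utilization = node.current_load() / node.capacity()
    if utilization > MIGRATION_THRESHOLD:
        tenant = controller.pick_hottest_tenant(node)
        destination = controller.pick_lightly_loaded_node()
        controller.start_live_migration(tenant, source=node, destination=destination)
```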
So in terms of failed operations here, as you are increasing the load on the
system, as you can see, with the stop and copy technique the rate of increase is
much higher compared to the Zephyr technique. Not only is Zephyr an order of
magnitude better, its rate of increase is also much lower. Just to give you an
example, the slope of the Zephyr curve is 0.48, whereas that for stop and copy is
8.4, which shows that this technique is more robust to different
variations of the load and allows you to do migration even when the load is
higher on the source.
>>: [inaudible] so that your [inaudible].
>> Sudipto Das: Yes. That is true as well. But then they also have an impact on
the latency of the executing transactions as well. Yeah. Yes. But you still have
to handle the transactions that are active at the source.
>>: I don't understand what you just told me --
>> Sudipto Das: So this is the slope of a line -- if you put a line here on the
graph as well as on this graph, so this is the slope of that line.
>>: [inaudible] is it the slope of the line across the tops of those yellow bars is
8.4?
>> Sudipto Das: Yes.
>>: Like when you double the number of transactions you less than double the
height?
>> Sudipto Das: Yes.
>>: That seems to me to be a slope less than one.
>> Sudipto Das: So this less than one is -- so this is basically the slope
measure for -- okay. So this is the measure of the angle which is being
projected here, in radians.
>>: [inaudible] I don't understand what --
>> Sudipto Das: Okay. So the thing is that what --
>>: [inaudible].
>> Sudipto Das: Sorry. No. I'm sorry. It is -- sorry, sorry. I messed it up. It's -- I
think it is the -- I'm sorry. I don't remember the exact measure. But I think it's
either the angle or the tangent of the angle which is reported here. But I can look
back into the paper and --
>>: Ordinarily wouldn't it just be the difference between the two ends -- the rise
and the run?
>> Sudipto Das: Yeah.
>>: Yeah. Okay. I mean, it's really hard to read that from the crowd.
>> Sudipto Das: Yeah. That's why we added this information into this graph
itself.
>>: Yeah. Except I don't understand the --
>> Sudipto Das: So I'll have to get back to the paper to figure out what is -- what is the exact measure we used.
>>: I mean you're an order of magnitude better.
>> Sudipto Das: Yeah. So you see that the increase is much lower here
compared to the increase which is here. So this is the angle of the line which is
drawn. Our [inaudible] line. [laughter].
>>: [inaudible]. [laughter].
>>: It's hard to read the Zephyr one, but the stop and copy, it's going from 600 to
a thousand over a spread of 40 transactions per second. So it's going to --
>>: You double the number of transactions and you less than double the number
of failed transactions, so that's a slope less than one. Anyway --
>> Sudipto Das: Sorry about that. Yeah.
>>: Okay.
>>: I think it's just like 600 over 40 [inaudible] measure.
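[For what it is worth, one consistent reading of the numbers quoted in this exchange, assuming the reported value is simply rise over run on the failed-operations axis (the exact metric is left open above), is

$$\text{slope} \;=\; \frac{\Delta(\text{failed operations})}{\Delta(\text{offered load})} \;\approx\; \frac{1000 - 600}{40\ \text{tps}} \;=\; 10,$$

which is in the same ballpark as the 8.4 reported for stop and copy, while the 0.48 for Zephyr corresponds to a far flatter increase.]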
>> Sudipto Das: So stepping back. In terms of the overall vision, we want -- we
started off with a system where we wanted to have scale out while executing
transactions. We wanted to have elasticity as well while executing transactions.
So my dissertation proposes major enabling technologies to solve these specific
challenges. Specifically, I propose a design for a scalable distributed database
infrastructure. There are a bunch of systems that were designed concurrently,
but this is one of the first few techniques in this area.
I've shown how to execute transactions efficiently for partitions that are
dynamically formed. And this is, again, the only technique which I know of that
has been published that allows that.
And I've also shown two different techniques to
allow for live database migration for lightweight elasticity. Again, to the best of my
knowledge, these are the only two published solutions. I know there are some
things cooking in other groups, but they haven't been published yet.
And what I would also like to point out is that all of these designs are
implemented in real systems as well as evaluated to show effectiveness.
In terms of related work, I've already covered most of it scattered throughout the
presentation. For transactions and scale-out, I [inaudible] from a lot of work over
the last 30 years or so, which I didn't put up here. In terms of the systems that
have been developed concurrently, there are Cloud SQL Server, which was from
Microsoft, Megastore, Deuteronomy, and Relational Cloud. Percolator was a
system from Google, which forms the basis for their new indexing
mechanism. And this is a system that does distributed transactions actually, but
it's a different application domain, not high performance transactions.
And obviously we have Hyder as well from Microsoft Research. And again, this is
an incomplete list, just a snapshot.
In terms of elasticity and migration, we just have VM migration, which is currently
known. Nothing published -- nothing in the published literature.
In terms of future directions, again, the space is very rich. And this is definitely
not the end of the story. I've tried to list some of the problems where I have the
background and which I feel are relevant in the upcoming years. I've
just scratched the surface of a self-managing controller for multitenant
systems. And I believe it's a very important area to pursue, given that there is
enough scope there: you have a large distributed system, you have no
idea what it's doing, and you are trying to get a good understanding of it and to
automate the management of that system, such as placement of tenants,
resource orchestration, online profiling, and how you would update your models
online as well. This is a very important area of research which I want to
pursue as well.
Another is data management architectures that exploit new hardware. On one end we have
the hardware, which is continuously improving: we have phase change memories
coming in, we have FPGAs, and we have GPUs. How can we bring
this hardware into the database architecture and come up with more efficient
implementations or more efficient database architectures?
And another thing which I also want to point out is the need for
convergence of transaction systems and analytics systems -- not a data warehouse,
but just the ability to provide realtime intelligence in a system that is getting all
the updates. This is extremely critical for a lot of the applications that are coming
in, where as soon as you see a change in the behavior or a change in the
updates, you want to react fast, in realtime. So what
are the right architectures in this model, and how can you build such
architectures? This is another area of interest for me.
And another thing which has been getting a lot of popularity is what we call
crowdsourcing, or putting a human in the loop. I don't want to propose new
crowdsourcing solutions, but for a lot of these problems we can leverage
existing crowdsourcing solutions as well. For example, when we are dealing with
convergence of multiple sources of data, entity resolution becomes a big issue,
and data integration also becomes a big issue. How can we leverage cheap human
labor to help us solve some of these hard problems which we encounter in these
systems?
So this brings us to the end of my talk. I'd like to thank you, everyone, for
attending this talk. And I'd also like to thank all my collaborators. I've had the
wonderful fortune to collaborate with a large number of great researchers.
And specifically I'd like to thank my advisors Divy Agrawal and Amr El Abbadi, without
whose contribution I wouldn't be standing here.
And I'd like to open it up for more questions.
[applause].
>>: [inaudible] along the way, any lingering questions we want to pursue?
>>: [inaudible] collaboration, essentially it's -- it's -- compared to the transition of
migration services, it's just working at a different granularity [inaudible]. It also
means that it requires a [inaudible] destination and a source on the same physical
infrastructure.
>> Sudipto Das: Yes, that is correct.
>>: It also means that it has more complications. Have you given any consideration to
the [inaudible] changes [inaudible]? If you consider those factors it will get
much more complicated.
>> Sudipto Das: Yeah, I agree with that. And this is one of the disclaimers I put up
front, that I'm not considering it for this paper.
>>: [inaudible] it's working as a [inaudible]. It's already very complicated.
>> Sudipto Das: Yeah. Obviously, the more features you want to
support, the more it increases the complexity as well. I don't know if that was a
question or a comment, but I agree with your comment. Yes.
>>: [inaudible].
>> Sudipto Das: Yes, I agree to it.
>>: There's a [inaudible].
>> Sudipto Das: Yes. So the thing that we are trying to solve is more of a
research question: how do you guarantee no downtime? The log shipping
protocols actually result in some period of unavailability. And actually, I didn't talk about Albatross, which uses a
variant of this log shipping protocol to migrate your database cache as well as
the state of active transactions.
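As a rough, hypothetical sketch of the kind of copy-then-handover such a protocol performs (the actual Albatross design is more involved, and every method name below is made up for illustration):

```python
# Illustrative sketch: iteratively ship cache changes, then do a short final
# handover that also transfers the state of active transactions.
def migrate_with_state(source, destination, max_rounds=5, small_delta=100):
    destination.install_cache(source.cache_snapshot())       # initial full copy
    for _ in range(max_rounds):
        delta = source.cache_changes_since_last_copy()
        if len(delta) < small_delta:
            break
        destination.apply_cache_delta(delta)
    source.pause()                                            # short handover step
    destination.apply_cache_delta(source.cache_changes_since_last_copy())
    destination.install_transaction_state(source.active_transaction_state())
    destination.resume()
```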
>>: This is just a migration scenario, though, not a replication scenario. If you're
thinking -- when you start talking about data -- you know, about database
mirroring, now you can't just blithely say there's going to be no schema changes
or no other structural changes because you're going to be running this for a long
period. Here he's only doing the migration for a relatively short period, so shutting
down schema modifications during migration seems like it would not be a big inconvenience.
>>: In my [inaudible] one page, it may touch many other pages.
>>: Yeah.
>>: [inaudible].
>> Sudipto Das: We still support those kinds of transactions. The thing is that,
as Phil pointed out, only during the short duration of migration do
some of these operations become expensive. And what I would claim here is that you
pay a cost in terms of latency but you don't incur any unavailability. And that is
what it is. Yeah?
>>: Did you check on actually the difference between [inaudible] because
sometimes [inaudible] and I split, they [inaudible] the difference because if
[inaudible] request a page and then you go [inaudible] the page to the source, if
you fetch everything at once, the next query [inaudible] it is already there.
Otherwise every time there's a request, they do a request.
>> Sudipto Das: It's not all the time. It's just for that window. And what I'm trying
to say is, in some of the experiments I've shown, it's on the order of a few seconds.
That's the window while the source is finishing up the active transactions and the
destination has started executing transactions. I'm not running in this mode
forever for the rest of my life. It's just for that small window.
The benefit here is that obviously I can definitely keep on adding features but that
complicates the design. I wanted to show that using a very simple design you
can still guarantee a bunch of properties. That was the main thing which we
wanted to show here. And the period of restriction which we impose is also very
short, on the order of even milliseconds in some of the cases. I've just given one
snapshot of the total migration time.
>>: [inaudible] other indexes are transferred [inaudible]. So you're actually trying
to transfer the logical structure --
>> Sudipto Das: Yes.
>>: That when you're inserting the page, how are you inserting -- I mean the
[inaudible] index.
>> Sudipto Das: So for the index, the way it is, when the wireframe is copied you take
an intention lock on the root of the index, and that kind of prevents any update
locks from being taken, any changes being done in the internal nodes. But
the index wireframe is still there, at both the source and the destination.
Whenever there is an insert at the source or the destination, it goes through all
the index structures and figures out what is the page where that insert would
be. If it figures out that there is going to be a page split, that is where it's
aborted. If the insert can get into the page, it's allowed.
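A minimal sketch of that insert rule, with hypothetical classes standing in for the frozen index wireframe (this is not the actual H2 patch):

```python
class TransactionAbort(Exception):
    pass

# With the wireframe frozen, an insert that fits into an existing leaf page is
# allowed; one that would force a page split is aborted during migration.
class FrozenIndex:
    def __init__(self, index):
        self.index = index                 # structure is fixed for the migration window

    def insert(self, key, row):
        leaf = self.index.find_leaf(key)   # traversal still works on the frozen wireframe
        if leaf.has_room(row):
            leaf.insert(key, row)          # in-page insert: allowed
            return True
        raise TransactionAbort("insert would split a page while the index is frozen")
```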
>>: So basically, I mean, if I understand correctly, you were trying to recreate the
index in the order of each page, [inaudible] because at the destination you are
trying to insert one page and you are actually trying to recreate your index
[inaudible].
>> Sudipto Das: I'm not recreating my index. It is just copied once, at
the start of migration. And the freezing of the index is mainly to make
my bookkeeping easier. I could allow changes to the index, but then merging
the indices of the source and the destination becomes expensive. So this is more
of a design trade-off rather than a performance thing or
something limiting which [inaudible]. Yes?
>>: The [inaudible] transaction of the [inaudible] on the migration. You talk
about [inaudible] queries.
>> Sudipto Das: That's a very good question. I think I have it right. That's the
short answer. But it's harder for that because --
>>: It's easy [inaudible] [laughter].
>> Sudipto Das: The thing is that a lot of transaction workloads often fit in
memory. So when you are fetching pages at the destination,
most of the time the page happens to be in memory. But with OLAP or
other analytics, where it's often a lot of scans, that further increases the
overhead of migration. But definitely that's an area of future work. It's an
important area as well, because we want to handle that diversity too. Okay.
[applause].