>> Phil Bernstein: So it's a pleasure to be welcoming back Sudipto Das, who was here as an intern working with us a year ago. And since then he's been a busy guy. He's got papers published in all the top database conferences and several others, is co-winner of the Best Paper Award at CIDR and the Best Paper Runner-up at the Mobile Data Management conference, and has more work coming out in the pipeline. He certainly isn't going to have time to talk about all of that today, but he's going to tell us about some of his work on transactional record management for database systems.
>> Sudipto Das: Thank you very much, Phil, for the nice introduction, and thank you all for coming. As Phil pointed out, this is going to be just a sliver of the work I've done in the broad area of scalable data management. This talk focuses on scalable, consistent, and elastic database systems, and the context I'm setting it in is cloud platforms. As many of us are aware, over the past few years more and more applications are being delivered over the web, over the network. Not only has the front end changed, this has also resulted in a change in the back-end infrastructure, what we often call cloud computing. In its simplest form, cloud computing is essentially computing infrastructure, services, or solutions provided as a service. It's already become a pretty big business, and it's growing. Some of the key factors in its success are the economies of scale and the notion of elasticity, or pay-per-use pricing. Even though almost every aspect of computing can be provided as a service, three paradigms have become popular for providing cloud services, namely infrastructure as a service, platform as a service, all the way up to software as a service, where your entire software comes as a package delivered over the cloud. Irrespective of which layer or abstraction of the cloud you are using, data is a central concept, and DBMSs form a mission-critical component of the cloud software stack. They manage petabytes of data and, more often than not, they drive the revenue of the company as well. And because of the wide variety of applications that are deployed in the cloud, they often have to deal with a wide variety of applications themselves, which is what we refer to as multitenancy. If you consider the data needs of these applications, you can broadly divide them into two categories. On one hand we have the OLTP systems, which serve small read-write transactions, and on the other hand we have the data analysis systems that allow for decision support and intelligence. This is obviously a very simplified view of the world. In this talk, I'll be focusing on the transaction processing systems, or the transaction processing aspect of these databases. So what does the application landscape for these OLTP databases look like? It ranges all the way from social gaming to rich content to managed applications. We also have the cloud application platforms that are growing in popularity, like Windows Azure or Google App Engine, and they have an OLTP database sitting behind the scenes serving all the applications. So as you can see, it's pretty rich and diverse. There are a large number of challenges that need to be solved for designing such OLTP databases. In this talk, I just focus on three specific challenges.
As we all know, the amount of data and the number of applications being served is growing every day, so scale is definitely a big problem: these systems must be scalable. But because they are OLTP systems, they must also ensure that they are executing transactions efficiently. Elasticity is a big thing in the cloud, where the infrastructure can be provisioned on demand, and we want the databases deployed in the cloud to be elastic as well; that is, to have the ability to scale on demand in a live system. And last, but not least, when you have a big system, you want it to be self-manageable, to reduce the number of dollars you spend on administrators for the system. So you want intelligence without a human controller. I'll get into the details of each of these challenges. If you consider the challenge of scalability, there are classically two different approaches. One approach is scale-up, where you throw more powerful, higher-capacity hardware at the problem. This is typically used in the classical enterprise setting, where it was more convenient to scale up the databases, and relational databases are a popular example of systems that scale up: because of the rich functionality they support, it's easier to scale them up than out. The key idea is that even with bigger or more expensive hardware, you limit access to a single node, which is the key to efficiency and good performance. Now, obviously this is not a viable solution for the cloud, where you want to leverage commodity hardware and the economies of scale it offers. In that setting you want to use a cluster of commodity servers, which is what we call scale-out. The idea here is that you somehow partition your database, divide it into smaller granules, and then spread it across a cluster of servers. One class of systems that has taken this to the extreme is what we call the key-value stores. They broke the database down into the smallest possible granules, single key-value pairs or rows, and distributed them across clusters of thousands of servers, or even across geographies in many cases. But in order to do that, to ensure scalability and to preserve the property that your transactions execute at a single node, they have limited the functionality and guarantees that are supported; a lot of limitations are enforced. For this talk, I'll just focus on the fact that these key-value stores do not provide support for multi-row or multi-step transactions. Now, why are transactions a big deal? I think many of us in this room already know that, but just to put you in context, why do we actually care about transactions? Think of a very simple social application where a friend request is accepted. This results in updating the friend lists of the two individual users. If you were in a world where the database system supported transactions, this is the code you would write as the application developer. The key idea here is the simplicity, and this is one of the main reasons why databases have been so popular over the last two or three decades.
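To make the contrast concrete, here is a minimal sketch of what that transactional friend-accept code might look like; the slide itself is not reproduced in the transcript. It is written in Java with plain JDBC, and the `FriendService` class and the `friendlist(user_id, friend_id)` table are hypothetical illustrations, not taken from the talk.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

public class FriendService {
    private final DataSource ds;

    public FriendService(DataSource ds) {
        this.ds = ds;
    }

    /** Accept a friend request: both friend-list updates commit together, or neither does. */
    public void acceptFriendRequest(long userA, long userB) throws SQLException {
        try (Connection conn = ds.getConnection()) {
            conn.setAutoCommit(false);                       // start a transaction
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO friendlist (user_id, friend_id) VALUES (?, ?)")) {
                ps.setLong(1, userA); ps.setLong(2, userB); ps.executeUpdate();
                ps.setLong(1, userB); ps.setLong(2, userA); ps.executeUpdate();
                conn.commit();                               // atomic: both rows or none
            } catch (SQLException e) {
                conn.rollback();                             // undo any partial update
                throw e;
            }
        }
    }
}
```

The point is that the atomicity and failure handling are hidden behind `commit()` and `rollback()`, which is the simplicity the talk refers to.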
On the other hand, if you were writing it on a key-value store with limited guarantees, this is just a fragment of the code you'd end up writing. Don't even bother reading it, because a lot of corner cases have been left out here. And this is what the application developer reimplements for every application they write. So in summary, it makes life harder and harder for the application developer to build on these kinds of key-value stores with reduced consistency guarantees. If you view it along two axes, with scale-out on the vertical axis and ACID transactions on the horizontal axis, on one hand we have the relational databases that give you rich functionality and strong ACID transactions but are not very amenable to scale-out; they do provide limited scale-out. On the other hand we have key-value stores that give you scale-out to probably thousands of nodes. The challenge I want to address is bridging the gap between these two kinds of systems, because there is a lot of potential to be exploited in the middle of that space by providing transactions, not at the scale of thousands of nodes but at the scale of tens of nodes, which covers a lot of different types of applications. And it becomes even more critical for cloud platforms, which often cater to a wide variety of applications. As I've already mentioned, elasticity is a key thing in the cloud. Compared to the classical enterprise setting, where you have statically allocated capacity, the cloud allows you to provision your systems on demand; the underlying infrastructure can scale on demand. We want the database systems to have this ability as well: to be elastic, and to provide that elasticity in a lightweight way, without introducing a lot of overhead. And last, but not least, managing these systems is often a pain. Why is that? Because as you go to scale, failures become the norm rather than the exception. So detecting and recovering from failures, coordination and synchronization across a cluster of nodes, provisioning, capacity planning -- the laundry list of things you want automated goes on. There is a quote from a well-known open source system called ZooKeeper that says a large distributed system is essentially a zoo, and that's why you need a zookeeper to automate a lot of these tasks. Now, to add to it, cloud platforms are inherently multitenant, so there is a conflict between the goals of the service provider, which is trying to minimize its operating cost, and the performance guarantees given to the applications it serves. The challenge is how to design self-managing systems that minimize the need for a human controller. To this end my dissertation makes the following contributions. To provide transactions at scale, I've designed two different systems that allow you to scale out on a cluster of commodity nodes while providing transactional access. One system, called ElasTraS, uses a static partitioning technique, while another system, called G-Store, allows you to form the partitions dynamically on demand. To provide lightweight elasticity, I've proposed two different designs for two common database architectures.
One design, called Albatross, provides lightweight elasticity in a shared storage, or decoupled storage, architecture. Zephyr, on the other hand, provides lightweight elasticity in the classical shared-nothing database cluster. And on the self-manageability front, I'm currently working on a design called Pythia that does workload characterization, tenant placement, and so on -- automating these kinds of tasks in large database systems. In the interest of time, in this talk I'll just delve deeper into two of these systems, and obviously we can talk offline about the rest of the papers. But before getting into the depth of the two papers, I'd like to spend a couple of minutes giving an overview of the kind of work I have done. As I've already said, this talk and my dissertation focus on transaction processing. On the analytics side, I've also worked on a number of projects to support richer analytics for different kinds of data needs. In one project, called Ricardo, as an intern at IBM I worked on allowing statistical models to be built on terabytes or petabytes of data; essentially it's an integration of R, a statistics package, with Hadoop as the data management layer. I've also worked on a project for multi-dimensional data analysis: a scalable multi-dimensional indexed database system to support location-based services. Essentially this is an architecture that allows you to ingest a lot of location updates coming from mobile devices, as well as do analytics online in such a system, so that you can build rich applications like recommendation systems on top. And in a somewhat different project I've worked on social network anonymization: if you want to anonymize the edge weights in a social graph, how do you anonymize the graph while preserving some of its properties? On the other hand, I've also worked on projects that try to leverage novel hardware and see how we can use that new infrastructure to come up with better and more efficient database architectures. As an intern last year at MSR, I worked with Phil on the Hyder project, which gives a scale-out database architecture leveraging flash, low-latency data center networks, and the large amounts of RAM that are available. In a different context, that of data streaming applications, you have long-running continuous queries, and in that work we explored how to leverage the parallelism inherent in multicore architectures to efficiently parallelize these continuous queries. We also looked at the same problem on different hardware, ternary content addressable memory, which is essentially a hardware hash table. So this is the kind of work I've done as a PhD student with different collaborations. Now, getting back to the main focus of the talk: how do we provide transactions while scaling out? As I've already said, when you want to scale out to a large number of nodes, you have to somehow partition the database and then distribute the partitions across a cluster of nodes. I see a very quiet audience. If there are questions, please interrupt me.
So I want it to be more interactive. Okay. Getting back to partitioning: there has to be a mechanism for statically partitioning the database. Classically what we use is table-level partitioning, where you partition every table individually, independent of the others; typical techniques are range-based or hash-based partitioning. This makes system management pretty easy. But the challenge that arises is that because the data is not partitioned the way it is accessed, a lot of transactions end up accessing data from different partitions, often resulting in distributed transactions, which we all know are pretty expensive. So a recent trend is to leverage the data access patterns to partition the database -- to partition the database schema itself, or groups of tables, based on the access patterns. The goal here is to co-locate data items that are frequently accessed together within the same transaction. Here I've shown two different approaches: one where you limit the schema by providing a specific schema structure, and another where you exploit the workload patterns to derive the partitioning. One is work done at MIT; the other is work done by me. One of our systems uses this, and a similar kind of schema pattern is also supported in Cloud SQL Server as well as Megastore, which are commercial systems. So if you consider this problem of scaling out, where you have statically partitioned the database somehow so that most of your transactions are within a single partition, what you have done is make your transaction processing easier. But when you are scaling out to a big system, you have to deal with all the distributed systems challenges that were listed on one of the earlier slides. I've proposed, designed, and implemented a system called ElasTraS that provides one way of solving these challenges. This is a talk in itself, but unfortunately I won't be getting into the details of the system; we can definitely talk offline about it. There are a number of other systems that were developed concurrently as well: Cloud SQL Server, a similar system that supports SQL Azure and Microsoft hosted services such as Exchange Hosted Archive, done at Microsoft; a project from Google, Megastore, that powers Google App Engine; and an academic prototype from MIT called Relational Cloud. For this talk, I'll focus on a somewhat different question: instead of viewing the partitions as statically formed, what happens if you form the partitions dynamically? Let's make this concrete. Static partitioning leverages the idea that the access patterns themselves partition statically. What if the access patterns change, and often rapidly? There are a bunch of applications where we observe this pattern: online gaming applications, collaboration-based applications, and recently we also came across scientific computing applications with these kinds of access patterns. I'll get into the details of one of these applications later in the slides. As you can see, the access patterns are evolving. Obviously this is not amenable to static partitioning; we lose the benefit of statically putting data together to limit most transactions to a single node.
Because the access patterns are changing, you end up doing a lot of distributed transactions. So the question we wanted to answer is: how do we get the benefit of partitioning when accesses do not statically partition? We proposed a solution, one of the few solutions that allow that. Let us take this example of an online multi-player game. We have a statically partitioned database somehow; it doesn't matter how. Let's assume the data items happen to be rows corresponding to player profiles: here we have a player ID, the player's name, some amount of dollars associated with the player, et cetera. Now, we have a bunch of players spread across the static partitions, and these players come together to play an online game. While the game is in progress, you want to execute transactions on the profiles of the players who are part of this game. Ideally you would want to co-locate these data items at one node so that your transactions are local. But there is a problem: players move from one game to another. They want to play with one set of friends and then move to another set of friends. So the data items on which you want transactions change with time. Similarly, players can try playing different games; there are a lot of games that come up on a social platform like Facebook, and players move around between them. So essentially these groups, or partitions, are dynamic.
>>: Do players ever want to play two games simultaneously [inaudible]?
>> Sudipto Das: I'll get to that. That's a very good question, actually. Dave knows how to trick me. But in addition, if your game becomes popular, you also have to deal with the challenge of hundreds of thousands of these concurrent game instances, or groups, being formed. So how do you deal with the scale problem in addition to the dynamicity? To restate the problem: we want transactional access over groups of data items, and we want to avoid distributed transactions in doing that. This is a pretty hard problem because the application is not trying to help me. So what I suggest is exposing an abstraction to the application so it can help me out. I want the application to declare to me the data items on which it wants transactional access. We call this the key group abstraction. I want the groups to be small -- I'll get into why -- and I want the groups to execute a non-trivial number of transactions as well; again, I'll get into why. And as I said, these groups can be dynamic and formed on demand, so the application can create a group as well as delete a group. If you want to stretch your imagination to multitenant systems, you can view the groups as dynamically formed tenant databases, where your tenant data is distributed in a shared-table fashion. Now, how are we going to do it? As I've said, the application comes in and specifies an arbitrary set of data items, so they can be distributed. As a first step, I'm going to select one of these data items as a leader. The leader selection can be arbitrary, or it can be a strategic decision as well.
Once a leader is selected, the rest of the keys in the group, which are called the followers, transfer their ownership to the leader node, so that read-write access to all the data items in the group is co-located at one node. The key idea, again, is that we are now limiting accesses to the group to a single node, so that transactions can execute efficiently. So, yes?
>>: The data [inaudible].
>> Sudipto Das: Conceptually only the ownership is moved, the read-write access; the data is actually not moved. I'll get into the details. And to answer Dave's question: because we are moving access to one node, what happens if groups have overlaps? If the overlaps are small, the groups can be co-located at one node, but if the overlaps become adversarial, obviously you end up doing a lot of distributed transactions. I'm moving ownership to a single node -- that's why I want the groups to be small, so that I can serve them from a single node. And in addition, because I'm paying a cost up front for doing the movement, I want a non-trivial number of transactions to execute, so that I can amortize the cost over the execution of those transactions. I've again made my life easier by putting things together at one node; the transactions become easier. But what I've added is dynamism to the system: this handshake between the followers and the leader now has to be guaranteed correct in the presence of the different types of failures that can happen.
>>: [inaudible].
>> Sudipto Das: Just one transaction.
>>: Just one?
>> Sudipto Das: Yes.
>>: Okay.
>> Sudipto Das: It's similar to a distributed transaction. Essentially we execute what we call a group formation protocol, which is similar to a distributed transaction, to do this in a fault-tolerant way. Now, as I said, what are the challenges? The challenge here is to guarantee that the contract between the leader and the followers is met in the presence of either the leader or the followers failing, in the presence of lost, duplicated, or reordered messages arising from network failures, and in the presence of the dynamics of the underlying system. Because you have a statically partitioned system sitting underneath that can do its own set of things in funny ways, you still want to guarantee correctness in the presence of that. And now that I have brought things together at one node, how do I efficiently execute ACID transactions on these dynamically formed groups? Let's take these one at a time; we'll deal with the first challenge, the grouping protocol, first. Essentially I'll give a very high-level overview of how we do it. If you consider the timeline, this is how the leader's time is progressing, and this is how the follower's time is progressing. At some point in time, the application comes and says, hey, I want to form a group, or I want transactional access to this group. That is when the leader executes a handshake with all the follower nodes by exchanging these messages. Once this handshake completes, the ownership is at one node, so all the group operations are now local, and therefore efficient.
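To make the group formation protocol concrete, here is a simplified sketch of the leader-follower ownership handshake in Java. This is not the actual G-Store implementation: the `Log` and `Messenger` abstractions, the message formats, and the class names are all hypothetical, and the detection of duplicate or reordered messages discussed later is omitted.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class GroupFormation {
    enum GroupState { FORMING, ACTIVE }

    static class Group {
        final String groupId;
        final String leaderKey;
        final Set<String> followerKeys;
        final Set<String> joined = ConcurrentHashMap.newKeySet();  // followers that have acked
        volatile GroupState state = GroupState.FORMING;
        Group(String groupId, String leaderKey, Set<String> followerKeys) {
            this.groupId = groupId; this.leaderKey = leaderKey; this.followerKeys = followerKeys;
        }
    }

    interface Log { void append(String record); }                    // durable (ideally replicated) log
    interface Messenger { void send(String nodeOrKey, String msg); } // unreliable network transport

    private final Log log;
    private final Messenger net;
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    GroupFormation(Log log, Messenger net) { this.log = log; this.net = net; }

    /** Leader side: log the intent, then request ownership of every follower key. */
    Group createGroup(String groupId, String leaderKey, Set<String> followerKeys) {
        Group g = new Group(groupId, leaderKey, followerKeys);
        log.append("CREATING " + groupId);                      // survives a leader crash
        for (String k : followerKeys) net.send(k, "JOIN_REQUEST " + groupId);
        // Retransmit to followers that have not acknowledged yet (messages may be lost).
        timer.scheduleAtFixedRate(() -> {
            if (g.state != GroupState.FORMING) return;
            for (String k : followerKeys)
                if (!g.joined.contains(k)) net.send(k, "JOIN_REQUEST " + groupId);
        }, 100, 100, TimeUnit.MILLISECONDS);
        return g;
    }

    /** Leader side: a follower acknowledged; once all have, the group becomes ACTIVE. */
    void onJoinAck(Group g, String followerKey) {
        g.joined.add(followerKey);
        if (g.joined.containsAll(g.followerKeys)) {
            log.append("ACTIVE " + g.groupId);                  // all ownership now at the leader
            g.state = GroupState.ACTIVE;
        }
    }

    /** Follower side: durably record the ownership transfer, then acknowledge the leader. */
    void onJoinRequest(String groupId, String myKey, String leaderNode) {
        log.append("JOINED " + groupId + " key=" + myKey);      // key is now held by the group
        net.send(leaderNode, "JOIN_ACK " + groupId + " " + myKey);
    }
}
```

The essential points it illustrates are the ones from the talk: both sides log each step of the handshake so that a crashed node can resume the group formation after recovery, and lost messages are handled with timers and retransmission.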
At some point the application says, I'm done with it, I don't really care about this group anymore, and a delete request comes in, at which point there is another handshake that guarantees ownership is given back to the followers and the keys are freed from the group. Now, what I've abstracted away here is that any of these messages can fail; messages can be lost. So we use mechanisms such as timers and retransmissions to deal with failed messages. Messages can also get reordered or duplicated, or be delivered after a long period of time, so we use a concept of per-group [inaudible] to detect stale or reordered messages. I'm not getting into the details of these; the paper has all the details. In addition, the nodes can fail as well. What you might have noticed here is that there are a bunch of logging operations happening for all the group operations and the messages being exchanged. This log, at both the leader and the follower, persistently stores the group information and allows us to recreate it after a failure. So if a node fails somewhere here, I don't terminate the group formation; it resumes after the failure.
>>: I forgot to tell you [inaudible].
>> Sudipto Das: Yes, a crash failure. Not a malicious one.
>>: No. Well, what about a permanent failure?
>> Sudipto Das: Yes, there can be a permanent failure as well. In that case, the log has to persist across the failure. What I rely on is that I still have access to the log after the failure. One idea is to replicate the log itself by putting it in replicated storage; that way you can deal with single-node failures and still have the log. For folks who are familiar with database architectures, this is conceptually similar to locking. The difference here is that instead of locks being held by a transaction, the locks are held by the group for the lifetime of the group. Now, how do I efficiently execute transactions? Once everything is at one node, this essentially boils down to an architecture something like this: every node has a transaction manager that executes transactions on the group, and because the leader has unique access to the data items, you can aggressively cache the data items at the leader. So there is a cache manager that caches all the data -- answering the earlier question, it is just a cache of the data; the actual data is here with the followers. All the transaction updates are local to the cache. So how do I guarantee persistence of these updates? I use a log at the transaction manager that records all the transactional updates, so that I can deal with failures of the transaction manager as well and recover from the log. The cache is asynchronously propagated to the followers, so the followers eventually get all the updates, and there is a guarantee that before a group is deleted, all updates have propagated to the followers. So by paying the cost of one distributed transaction at the start of the group, the rest of your life becomes easy and efficient; essentially you amortize that cost, and after executing a few transactions you break even and start getting the profits. In terms of implementation, as a proof of concept I implemented it on top of a key-value store. I chose HBase, which is an open source variant of Bigtable.
So here we have the key-value store logic executing on a cluster of servers, and what I added is a grouping middleware on top of it: a grouping layer that executes the group formation and deletion, as well as a transaction manager that executes transactions on the group. And to answer the earlier question, you can put the log in the distributed storage, which allows the log to persist across failures of individual nodes. So how did we do in terms of performance? Our evaluation was done using this prototype implementation, which is about 10,000 lines of code added in the middleware layer, and I experimented on Amazon EC2 to do some scale-out experiments. I used a benchmark modeling an online multi-player game, and on a modest cluster of 10 nodes we were able to serve about a billion rows, which is about a terabyte of data. With groups of around a hundred keys, the group creation latency was somewhere between 10 and a hundred milliseconds, depending on how you select the groups, and on this cluster of 10 nodes about 10,000 groups were being served concurrently. Obviously this is just a snapshot of the experiments; the paper gets into the details of how the numbers vary with the different parameters. And here are a few more from the same set of experiments, showing how the group creation latency and group creation throughput vary depending on how you end up implementing the middleware layer -- implemented within the key-value store or sitting outside it -- and on different distributions of key selection. As you can see, this is a distribution where keys are contiguous, so my implementation can efficiently batch the group formation and give you very fast group formation, while if you come up with an adversarial distribution it can obviously get worse. Now, I've shown you a mechanism for executing transactions, and I've briefly discussed the mechanisms that allow you to execute transactions in the scale-out setting.
>>: [inaudible]. So did you compare it with just round-robin partitioning or hash partitioning?
>> Sudipto Das: So this is range partitioning here.
>>: So in terms of transactions per second, how much -- how much better are you --
>> Sudipto Das: So --
>>: [inaudible].
>> Sudipto Das: Yeah. The thing is that we don't have an experimental evaluation for that. We are currently working on it, because there is no distributed transaction implementation on HBase, so I'm currently implementing one. But the back-of-the-envelope calculation is that after you have formed the group and done two or three transactions, you have broken even, because you have covered the cost of group formation, and anything you do after that is profit. That's back of the envelope; in practice it might vary.
>>: That's because the group formation is essentially like running a transaction [inaudible] you're paying for one distributed transaction and then [inaudible].
>> Sudipto Das: Yeah -- one for formation and one for deletion. Yeah?
>>: So deciding when to form a group and the size of the group, you're expecting the application to handle all this for you?
>> Sudipto Das: Yeah. In the current work, yes.
But in the future we would like to try to automate it based on workload patterns, or expose some form of higher-level semantics so that the applications can do it declaratively. But in the current setting, yes. Yes?
>>: So do you rely on shared storage or --
>> Sudipto Das: No, we don't. This implementation happened to use it, but we don't keep anything in the shared storage except for the log, for high availability. The key idea here is that we are decoupling the transaction execution from the actual data storage and allowing you to do a lightweight reorganization. You can view it as shared storage because the underlying storage itself is shared across multiple groups, but nothing is inherent to the design.
>>: If groups overlap, do you still see performance benefits compared to not using groups at all?
>> Sudipto Das: If the group overlaps are small, then yes; you can stick multiple groups together at one node. But if they overlap arbitrarily, then obviously this is not a good solution; you probably need something different. Okay. So now, assuming that we can execute transactions somehow -- two different approaches have been shown -- how do you make the design elastic, so that we have the property we wanted to have? What exactly does elasticity mean in the database tier? I'll give a very high-level motivating example. Let's take a simplified view of the world: a set of application clients accessing the service through a load balancer, a tier of application and web servers, and the database sitting at the bottom. This is the tier I'll consider here. I'll motivate this more from the multitenancy aspect, but it can be applied to the database partitioning based approach as well. I use a color coding where the clients have a color corresponding to their database partition, here called a tenant; the tenants are color coded. Now, I design this application and put it up on Facebook, and it becomes extremely popular. One of my applications becomes extremely popular, and there is a surge in load. If the infrastructure is deployed in a cloud, the application server tier can easily scale out, because very little state is shared across those servers. I would want the same property in the database tier, which is typically not provided: you want to add a new node to the system and migrate over parts of the database, in this case the tenants' databases, so that you can redistribute or balance the load across the set of servers. I would also want to do the reverse: when the load decreases, I want the ability to consolidate back as well, since consolidation is critical to optimizing the operating cost in a pay-per-use infrastructure. So as you have seen, elasticity in the database tier essentially boils down to migrating a database partition, or tenant if you will, in a live system without introducing any downtime. This allows you to optimize the operating cost, as I've said, and in a multitenant system, where multiple tenants are sharing the system's resources, it is an effective tool for doing online resource orchestration on demand as the resource requirements change. Obviously, as you can see, migration is a loaded term.
Migration can also be used as the database software evolves -- how do you migrate data between different software versions? -- or as the database schema evolves. My use of the term migration in this setting is primarily for elasticity and is different from those contexts. One of the simple, straightforward solutions people can easily come up with is: why don't you use VM migration for database elasticity? How can we do it? One approach is to give every tenant its own database, running within a VM, and then a hypervisor shares these VMs at a single node. This is a valid design supported by the current state of the art, and you can now use VM migration to migrate things on the fly. However, as many of you know, databases weren't designed for this kind of operation, and if you are running multiple uncoordinated databases at the same node, God be with you in terms of performance. There is a recent paper showing that the overhead can be as much as an order of magnitude, both in terms of performance as well as consolidation. So what you would want is multiple database partitions resident within the same database process. That gives you somewhat better performance -- ideally I wouldn't want the VM to be sitting here either, but let us consider this scenario. Now, you can again use the VM to migrate the database, but what you have lost is the ability to do fine-grained load balancing. Was that a question?
>>: [inaudible].
>> Sudipto Das: Okay. So what you would ideally want is a world where you have only the database process running on bare metal, a bunch of tenants or database partitions sharing the same database process -- a model called shared process multitenancy in the database literature -- and you want to migrate individual partitions on demand in a live system. So essentially what I'm saying is that what VMs allow you to do for operating systems, I'm going to allow you to do in the database tier. You can view this as virtualization in the database tier itself. Again, there is another straightforward solution, because databases were designed to be fault tolerant: you can stop the database at the source, migrate it over to the destination, and then start serving it at the destination. I call this the stop and copy technique. And again, this can be done. However, it is expensive. Why is it expensive? Because it results in an unavailability window, and I want minimal unavailability during migration -- if possible, no unavailability. I want to minimize this metric. In addition, I want to minimize any impact on the tenant while I'm doing migration. Migration is done for system management; the tenant should not be aware of it. So I want to minimize the number of failed requests as well as have minimal impact on the performance of the transactions that are executing. And from the system's perspective, I also want to minimize the amount of data that is transferred as a result of migration. There is some amount of data that needs to be migrated anyway; I want to minimize what is transferred on top of that. As I've said, there are two different approaches in which these databases are designed.
One approach is what we call decoupled storage, where the transaction execution logic is decoupled from the storage logic. There are different examples, like the system ElasTraS which I designed; G-Store happens to be a similar design as well; and the Deuteronomy project at Microsoft Research as well as Google Megastore fall into this category. Now, because your persistent data is stored in network-attached storage, you don't need to migrate the data while you are migrating a partition. So it essentially boils down to migrating the execution state of the database: migrate the transactions as well as the database cache. I proposed this technique, Albatross, and implemented it on top of ElasTraS; this is a paper that will be presented at the upcoming VLDB in a couple of weeks. There is also another way of designing databases, the standard shared-nothing design, where the persistent data is stored in locally attached storage. Here migration is a harder problem, because now you have to move large amounts of data as well. So how do you guarantee that you incur minimal cost during such a migration? Common examples of this architecture are SQL Azure; Relational Cloud, which is again the prototype from MIT; and MySQL's cluster offering, which is a similar shared-nothing cluster. I proposed a technique called Zephyr, which was presented recently at SIGMOD in June. In this talk I'll just focus on Zephyr; you are welcome to come to VLDB to get the details of --
>>: Just a comment. In the shared-nothing architectures, take SQL Azure, for example, they already have certain availability guarantees, and for that they already replicate the data, right?
>> Sudipto Das: Yes.
>>: Not necessarily need to [inaudible].
>> Sudipto Das: Yes.
>>: [inaudible].
>> Sudipto Das: That's a very good point. And as I'll show, we can leverage that replication. But when you are doing elasticity, you are trying to add a new node, and you don't always have a replica running at that node. So I want a technique that allows you to migrate even in that setting. But yes, as you said, replication can be of benefit. So why is this a hard problem? I've already claimed that multiple times, so let's get into the actual points. We want to migrate the persistent image of the database, which can be on the order of gigabytes. How do you guarantee no downtime while you are migrating such large amounts of data? You have to execute transactions while the data is being migrated. And because it's not an instantaneous process, there can again be failures: nodes can fail during migration, both the source and the destination. So how do you guarantee correctness in the presence of failures -- especially transaction atomicity and durability for the transactions that are executing -- and how do you recover the state of the migration itself if there is a failure in the middle, so that you don't leave the system in an inconsistent state? In addition, because you don't want any downtime, transactions will be executing during migration. So how do you guarantee serializability of these transactions while you're migrating things on the fly, so that from the tenant's perspective it is business as usual, as if nothing happened?
Our approach is that instead of viewing the migration as one chunk being moved, we break migration down into a collection of phases. Migration starts with the transfer of minimal information from the source to the destination, which we call the wireframe. This minimal information consists of the database schema, user authentication information, and another thing called the index wireframe, which I'll get into. Again, instead of migrating the entire database as a whole, we view the database as a collection of database pages, which is often how it is organized, and we use the concepts of unique page ownership and on-demand migration of database pages from the source to the destination. To allow for zero downtime, there is a phase in the migration where we allow both the source and the destination to concurrently execute transactions, and we show how you can have minimal transaction synchronization and still guarantee serializability for such transactions. And we use logging and handshaking mechanisms to guarantee fault tolerance during the migration as well. In this talk, I'll make some simplifying assumptions to limit the scope. I'll assume that transactions execute at a single node. I do not leverage replication for this technique, but I can definitely do that; the paper shows how. And I'll assume that there are some indices used to keep track of pages, and I don't allow any structural changes to the indices during migration. The paper relaxes all these assumptions and gives an extended design that is more flexible compared to what I'll present in this talk.
>>: Question.
>> Sudipto Das: Yeah.
>>: What [inaudible] update? [inaudible].
>> Sudipto Das: I'll get into that. That's a very good point actually.
>>: There are ways of doing replication which don't take the original source offline while you're doing the copy, where you sort of scan the replica, copy the data, and then use the log to bring it up to date.
>> Sudipto Das: Yeah, yeah.
>>: Which greatly shortens or perhaps eliminates the service interruption. You painted a picture before of doing the replication.
>> Sudipto Das: Yes, yes. [inaudible]. [brief talking over]
>> Sudipto Das: Yeah, that's a very good point. There are two reasons why we don't use it here. One is that for setting up a new replica, you incur a lot of checkpointing overhead, that is, a lot of [inaudible], and if the source is already overloaded you don't want to add load to the source. The second is that, as you'll see here, the destination starts executing transactions as soon as you start migration, so you are able to offload some of the load to the destination immediately, which gives you better performance. But obviously the exact numbers would vary depending on the workload. So my view of the database is a collection of pages. There is a set of active transactions executing, and there is an index that keeps track of what exactly is in the pages. I'll use this index to tag along additional information, which I call the page ownership information. The convention in this talk is that if a page is white, this node owns the page, that is, it has unique read-write access to the page.
And if it's grayed out, the node has information about the page but doesn't have the data. When migration starts, you freeze the index wireframe -- I'll get into what I mean by freezing -- and migrate over the index wireframe. So what exactly is an index wireframe? To take the specific case of a B+ tree index, the index wireframe is the internal nodes of the B+ tree, and your actual data resides in the database pages. What I mean by freezing the index is that I don't allow any structural changes to this metadata. I still allow updates to the individual pages, but if there is an update or an insert that results in a new page being created, which would be a change to the index wireframe, that is not allowed during migration.
>>: [inaudible].
>> Sudipto Das: It will be aborted in this setting. But there are simple extensions that can be done to deal with that problem. So when I migrate the index wireframe, this is what the destination has: it has the information that there are some database pages, but it doesn't have the database pages themselves. This is the state of the destination in what we call the dual mode. In this mode, the --
>>: Let me back up just to make sure I understood that. So you're basically saying no page splits, is that --
>> Sudipto Das: No page splits during migration.
>>: During migration.
>> Sudipto Das: Yes.
>>: So you can do inserts, but as soon as you hit the page limit you get [inaudible].
>> Sudipto Das: Yeah.
>>: [inaudible] transaction.
>> Sudipto Das: Yes. That's right.
>>: Would using a [inaudible] let you relax that requirement?
>> Sudipto Das: Yes, it does. And you can also use overflow buckets, which are similar to [inaudible]. Okay. So at the start of the dual mode, the destination just has the meta information. At this point I allow new transactions to go to the destination, while the source is still completing transactions that are still active or that are arriving due to stale metadata at the router. Now, because transactions start executing at the destination without the data, the data is pulled in on demand, [inaudible] the index structure to keep track of ownership information. So let's say, for example, that page P3 is being accessed by a transaction at the destination. At this point a request is sent to the source, and the source does some synchronization to ensure that page P3 is not currently being accessed by any other transaction; if that is true, it changes the ownership information and migrates the page over to the destination. This is the only point in time where the two nodes synchronize while executing transactions, and the concept of unique page ownership is what allows us to use this mechanism. Now, as soon as the source completes all its transactions, it figures out which pages have still not been migrated and then asynchronously pushes them to the destination, and the destination keeps updating its ownership information. But while transactions are executing at the destination, pages can still be pulled on demand. And we show how the index metadata can be used to detect duplicates in this setting and guarantee that you don't mess up the database state.
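Here is a simplified sketch of that dual-mode page pull, to make the unique-ownership idea concrete. This is not the Zephyr/H2 code: the class names are hypothetical, the wait for conflicting local transactions is only indicated by a comment, and the duplicate detection through the index metadata is omitted.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class MigrationSource {
    enum Ownership { OWNED_HERE, MIGRATED }

    static class Page {
        final byte[] data;
        final long lastLsn;            // LSN of the source's last committed write to this page
        Page(byte[] data, long lastLsn) { this.data = data; this.lastLsn = lastLsn; }
    }

    private final Map<Long, Page> pages = new ConcurrentHashMap<>();      // pageId -> page contents
    private final Map<Long, Ownership> owner = new ConcurrentHashMap<>(); // pageId -> current owner
    private final Map<Long, Object> latches = new ConcurrentHashMap<>();  // per-page latch

    /** Handle a pull request from the destination during the dual mode. */
    Page pullPage(long pageId) {
        Object latch = latches.computeIfAbsent(pageId, id -> new Object());
        synchronized (latch) {                       // synchronize only on this one page
            if (owner.get(pageId) == Ownership.MIGRATED) {
                return null;                         // already shipped; never ship a page twice
            }
            // In the real system the source first waits until no active local
            // transaction holds a lock on this page before giving it up.
            owner.put(pageId, Ownership.MIGRATED);   // unique ownership moves to the destination
            return pages.remove(pageId);
        }
    }

    /** Local transactions check ownership before touching a page. */
    Page readForLocalTransaction(long pageId) {
        if (owner.get(pageId) == Ownership.MIGRATED) {
            // Pages are migrated only once, so a source transaction that touches a
            // migrated page must abort (as discussed in the talk).
            throw new IllegalStateException("page " + pageId + " already migrated; abort");
        }
        return pages.get(pageId);
    }
}
```

The per-page synchronization in `pullPage` is the only point where the two nodes coordinate, which is what the talk means by minimal transaction synchronization during the dual mode.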
And once all the pages have been migrated, the source can get rid of all the resources -- the paper also shows how to get rid of the logs -- so that the source is completely free and the destination is the sole owner. Now, because of the simplifications I made for the sake of the talk, there are some resulting artifacts. I migrate pages only once: once a page is migrated from the source to the destination, it is never pulled back. This allows for forward progress and quick migration, but the implication is that any transaction at the source that accesses a page that has already been migrated must be aborted. Remember, any access, because I want to give serializability; if I only had to give snapshot isolation, then I could allow reads. As I've said, there are no structural changes to the index, so any transaction at either the source or the destination that results in a structural change to the index is aborted as well. And because the destination pulls pages on demand from the source, there is a higher latency for some of the transactions going to the destination. Most of the time it's pulling from the source's cache, so the latency is not that big, but it is somewhat higher because of the network traffic.
>>: Why do you need that restriction on structural changes? I mean, if you're executing transactions at the destination [inaudible] so what?
>> Sudipto Das: I don't need it; this is just for simplicity, to make merging the indices easier. In the paper we actually talk about an extension where we don't need that. Actually we don't need that. I think I'm probably short on time, so I'll go quickly through some of the serializability material -- is that okay? Okay. So essentially what I've done is use a simple synchronization mechanism. How am I going to guarantee serializability during transaction execution? As you can see, the dual mode is the only concern, because only in the dual mode are the two nodes executing transactions concurrently. In the paper we show that you can use local predicate locking at the index level and exclusive page ownership at the leaf level to ensure that there are no phantoms during migration. We use strict two-phase locking during normal transaction execution to guarantee that transactions are locally serializable. And because we migrate each database page only once, any transaction at the destination is ordered after any conflicting transaction at the source. So there is a strict ordering that is enforced, which prevents cycles in the serialization graph, and that gives you guaranteed serializability. Now, recovery becomes complicated as well, because there are two nodes executing transactions concurrently. But again, I use this causal ordering property on the pages: if there are two transactions that conflict on a page, I just want a strict ordering between them; I don't care about other transactions. So when I move a page from the source to the destination, I also carry over the log sequence number, so that all transactions at the destination are ordered after those at the source, even in the recovery log. And during recovery we just replay maintaining this order, so you preserve the conflict order through recovery as well.
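As a small sketch of that ordering rule, continuing the hypothetical classes from the previous sketch: when a pulled page is installed, the destination advances its log sequence number past the source's last write to that page, so any destination transaction that later touches the page is logged, and replayed during recovery, after the conflicting source transactions.

```java
class MigrationDestination {
    private long nextLsn = 0;   // next log sequence number handed out to local transactions

    /** Install a pulled page, given the LSN of the source's last committed write to it. */
    synchronized void installPage(long pageId, byte[] pageData, long sourceLastLsn) {
        if (sourceLastLsn >= nextLsn) {
            nextLsn = sourceLastLsn + 1;     // destination writes to this page get larger LSNs
        }
        // ... insert pageData into the local buffer pool and index leaf (not shown)
    }

    /** Called by the local transaction manager when logging an update. */
    synchronized long assignLsn() {
        return nextLsn++;
    }
}
```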
And how do you recover -- so that was transaction recovery, by the way. How do you ensure migration recovery? Because there are two nodes transitioning between different states, you have to guarantee that they are always in the same state and there is no confusion about that. In the paper we show how you can atomically transition from one phase of migration to another, and essentially we use logging and handshake protocols for doing this atomic transition. In addition, every page always has a unique owner node, and you can use bookkeeping at the index level to keep track of this ownership information, even after a failure. The simple way would be to always log the migration of a page, but that introduces a lot of I/O as well. So in the paper we show how you can rely on the transaction semantics to capture this migration information, make it persistent, and be able to recover it as well. So essentially what we show is that in the presence of arbitrary repeated failures, we can guarantee that updates made to database pages are consistent, that a failure does not leave a page without an owner, and that both the source and the destination are in the same migration mode. This essentially extends to the correctness proof. And we also show how you can guarantee termination and starvation freedom in the presence of arbitrary failures as well.
>>: So why isn't this simply a special case of a data sharing system?
>> Sudipto Das: Actually, it is, and the extension, which I didn't talk about, relies on data sharing and on global and local lock managers to exchange the pages. But yes, this becomes a data sharing system only during migration. In terms of implementation, the design was implemented in an open source OLTP database called H2, which provides all the bells and whistles of a classical OLTP database. To implement this, we added support for freezing the indices as well as keeping track of ownership information; this was about 6,000 lines of code added to the database engine. We used an open source SQL router to migrate connections from the source to the destination as a result of migration. How did we do in terms of performance? We evaluated it using an open source microbenchmark: we adapted the Yahoo! Cloud Serving Benchmark to add transactions and to vary the different parameters of the workload. Depending on the database size and the workload that is executing, the [inaudible] technique, which is stop and copy, results in 3 to 8 seconds during which the database is unavailable. This is for a very small database, about 200 megabytes or so; as you make the database bigger, this unavailability window becomes longer and longer. What you have to notice is that during this period you can only run the database in read-only mode, so any update transaction has to abort during migration. On the other hand, Zephyr does not result in any downtime, because at any point in time either the source or the destination is executing transactions. In terms of failed operations, because stop and copy has to fail all updates, it results in hundreds to thousands of operations failing during migration; again, these numbers vary depending on the workload. On the other hand, even the simple prototype of Zephyr results in an order of magnitude fewer failed operations. And in the paper we show how you can guarantee zero transaction loss.
But we don't have an implementation for that yet. So even the simple implementation is an order of magnitude better in that setting.
>>: So where does [inaudible] come from?
>> Sudipto Das: The failures come from the inserts. We ran an adversarial workload with a lot of inserts, and whenever there is a change in the index structure, that results in a failure.
>>: Also some transactions may just get sent to the wrong machine.
>> Sudipto Das: At the source, yes. But as long as the source is still active, it can still serve those transactions. But if they get --
>>: I see. So --
>> Sudipto Das: Yeah.
>>: Aborting the transaction and then redirecting to the target is not considered a failed transaction?
>> Sudipto Das: No. Not in this number, no.
>>: Okay.
>>: So a transaction requests two pages, one at the source, the other already migrated. Will that transaction --
>> Sudipto Das: That transaction will fail.
>>: It will fail. So you won't do a distributed --
>> Sudipto Das: No, I won't. Across my work the idea is to avoid distributed transactions wherever possible. But there is an extension that does that, actually; it uses a shared lock manager for getting the pages.
>>: Wouldn't the benchmark do that --
>> Sudipto Das: So this is the Yahoo! Cloud Serving Benchmark modified to support transactions, multi-table transactions, and client sessions. The Yahoo! Cloud Serving Benchmark was a benchmark for key-value stores, which obviously don't have any transactions.
>>: So what fraction of the transactions had multi-page -- [inaudible] multi-page access --
>> Sudipto Das: Every transaction required multi-page access, and every transaction was multi-operation. The default was 10; you can vary the number of operations within a transaction, and we also looked at 25 or 30 operation transactions as well. All of these parameters are varied during the experiments.
>>: So if you do stop and copy, you could do an I/O efficient copy, right?
>> Sudipto Das: Yes.
>>: [inaudible].
>> Sudipto Das: Yes.
>>: But you think it's going to pull pages on demand?
>> Sudipto Das: Yeah. Essentially --
>>: That could be a huge, huge --
>> Sudipto Das: In theory, yes. But in practice, we didn't observe that, because most of the pages accessed at the destination during the pull phase are often in the source's cache, simply due to locality, and then the final phase is just a copy through the disk. So in theory it can be bad, but not in practice, at least for the workloads we ran. In terms of operational overhead, we show that the overhead resulting from this migration is very low, between 10 and 15 percent increased latency during migration. And this is a graph that shows how the number of failed operations increases as you increase the load.
>>: [inaudible].
>> Sudipto Das: Yeah?
>>: [inaudible] so if you actually take the load [inaudible] the system is already loaded because, you know, you are actually trying to [inaudible], so I think in that context any increase in the [inaudible] during the migration is kind of critical. So have you done any comparative studies --
>> Sudipto Das: The thing is, I agree with that, but if you hadn't migrated you would do worse, because your source is already overloaded. So that's the argument. And it is one of the reasons we used this technique: you immediately offload load to the destination.
And the destination starts catching up. And whatever load you put on the source is just for fetching the pages on demand. >>: But if the source is already overloaded, the source still has to take care of the migration, because if you have to [inaudible] a page from the destination, then you are too overloaded to fill that request -- >> Sudipto Das: Yes, that's a good point as well. I haven't talked about the controller level; there is a controller sitting on top of these things, and it is the responsibility of the controller to start migration while there is still some room left to migrate. If it's already too late and you are a hundred percent overloaded, tough luck -- you have already screwed up your system, and migration wouldn't help you. So typically, when you're getting close to being, you know, fully booked, that's when you initiate migration. That's part of the controller's responsibility. But as I said, it's a very low overhead on the source; it's just fetching pages and not executing the transactions. If it were executing transactions, it would have been even more in the [inaudible]. So in terms of failed operations here, as you increase the load on the system, you can see that the rate of increase for the stop and copy technique is much higher compared to the Zephyr technique. Not only is Zephyr an order of magnitude better, its rate of increase is also much lower. Just to give you an example, the slope of the Zephyr curve is 0.48, whereas that of stop and copy is 8.4, which shows that this technique is more robust to variations in the load and allows you to do migration even when the load on the source is higher. >>: [inaudible] so that your [inaudible]. >> Sudipto Das: Yes, that is true as well. But then they also have an impact on the latency of the executing transactions. Yeah. Yes. But you still have to abort transactions that are active at the source. >>: I don't understand what you just told me -- >> Sudipto Das: So this is the slope of a line -- if you fit a line here on the graph as well as on this graph, this is the slope of that line. >>: [inaudible] is it the slope of the line across the tops of those yellow bars is 8.4? >> Sudipto Das: Yes. >>: Like when you double the number of transactions you less than double the height? >> Sudipto Das: Yes. >>: That seems to me to be a slope less than one. >> Sudipto Das: So this less than one is -- okay. So this is the measure of the angle which is being projected here, in radians. >>: [inaudible] I don't understand what -- >> Sudipto Das: Okay. So the thing is that what -- >>: [inaudible]. >> Sudipto Das: Sorry, no, I'm sorry -- I messed it up. I don't remember the exact measure, but I think it's either the angle or the tangent of the angle which is reported here. But I can look back into the paper and -- >>: Ordinarily it's the difference between the two ends, the rise over the run? >> Sudipto Das: Yeah. >>: Yeah. Okay. I mean, it's really hard to read that from the crowd. >> Sudipto Das: Yeah. That's why we added this information into the graph itself. >>: Yeah. Except I don't understand the -- >> Sudipto Das: So I'll have to go back to the paper to figure out what the exact measure is. >>: I mean you're an order of magnitude better. >> Sudipto Das: Yeah.
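[Editor's note: the slope exchange above mixes three different measures -- the rise-over-run slope of the fitted line, the factor by which failures grow when the load doubles, and the angle of the line. The small worked example below uses made-up numbers (not values read off the actual graph) to show that a line can have a slope well above one even though doubling the load less than doubles the failures.]

import math

# Two illustrative points on a "failed operations vs. offered load" line.
x1, y1 = 40, 600     # at 40 transactions/second, 600 failed operations
x2, y2 = 80, 1000    # at 80 transactions/second, 1000 failed operations

slope = (y2 - y1) / (x2 - x1)                   # rise over run = 10.0
doubling_ratio = y2 / y1                        # ~1.67: doubling x less than doubles y
angle_degrees = math.degrees(math.atan(slope))  # ~84.3 degrees (atan(slope) ~1.47 rad)

print(slope, doubling_ratio, angle_degrees)
# The line here is y = 200 + 10 * x: because of the large intercept,
# "doubling the load less than doubles the failures" is perfectly consistent
# with a rise-over-run slope well above one. Slope, doubling ratio, and the
# angle of the line are three different quantities.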
So you see that the increase is much lower here compared to the increase here. So this is the angle of the line which is drawn. Our [inaudible] line. [laughter]. >>: [inaudible]. [laughter]. >>: It's hard to read the Zephyr one, but the stop and copy one is going from 600 to a thousand over a spread of 40 transactions per second. So it's going to -- >>: You double the number of transactions and you less than double the number of failed transactions, so that's a slope less than one. Anyway -- >> Sudipto Das: Sorry about that. Yeah. >>: Okay. >>: I think it's just like 600 over 40 [inaudible] measure. >> Sudipto Das: So stepping back, in terms of the overall vision, we started off wanting a system that could scale out while executing transactions, and that would be elastic as well while executing transactions. My dissertation proposes major enabling technologies to solve these specific challenges. Specifically, I propose a design for a scalable distributed database infrastructure. There are a bunch of systems that were designed concurrently, but this is one of the first few techniques in this area. I've shown how to execute transactions efficiently for partitions that are dynamically formed, and this is, again, the only published technique I know of that allows that. And I've shown two different techniques to allow for live database migration, or lightweight elasticity. Again, to the best of my knowledge, these are the only two published solutions; I know there are some things cooking in other groups, but they haven't been published yet. And what I would also like to point out is that all of these designs are implemented in real systems and evaluated to show their effectiveness. In terms of related work, I've already covered most of it scattered throughout the presentation. For transactions and scale-out, I [inaudible] from a lot of work over the last 30 years or so, which I didn't put up here. In terms of systems that have been developed concurrently, there is Cloud SQL Server, which was from Microsoft, Megastore, Deuteronomy, and Relational Cloud. Percolator was a system from Google, which forms the basis for their new indexing mechanism; it is a system that does distributed transactions, actually, but it's a different application domain, not high performance transactions. And obviously we have Hyder as well, from Microsoft Research. Again, this is an incomplete list, just a snapshot. In terms of elasticity and migration, we just have VM migration, which is the currently known technique -- nothing in the published literature. In terms of future directions, again, the space is very rich, and this is definitely not the end of the story. I've tried to list some of the problems for which I have the background and which I feel are relevant in the upcoming years. I have just scratched the surface of a self-managing controller for a multitenant system, and I believe it's a very important area to pursue, given that there is enough scope there: you have a large distributed system, you have no idea what it's doing, and you are trying to get a good understanding of it and to automate the management of that system -- placement of tenants, resource orchestration, online profiling, and how you would update your models online as well.
This is a very important area of research which I want to pursue as well. Another is data management architectures for emerging hardware. On one end we have the hardware, which is continuously improving: we have phase change memories coming in, we have FPGAs, and we have GPUs. How can we bring this hardware into the database architecture and come up with more efficient implementations or more efficient database architectures? Another thing which I also want to point out is the need for a convergence of transaction systems and analytics systems -- not a warehouse, but just the ability to provide realtime intelligence in a system that is receiving all the updates. This is extremely critical for a lot of the applications that are coming in, where as soon as you see a change in the behavior or a change in the updates, you want to react fast, in realtime. What are the right architectures in this model, and how can you build such architectures? This is another area of interest for me. And another thing which has been getting a lot of popularity is what we call crowdsourcing, or putting the human in the loop. I don't want to propose new crowdsourcing solutions, but for a lot of these problems I can leverage existing crowdsourcing solutions as well. For example, when we are dealing with the convergence of multiple sources of data, entity resolution becomes a big issue, and data integration also becomes a big issue. How can we leverage cheap human labor to help us solve some of these hard problems which we encounter in these systems? So this brings us to the end of my talk. I'd like to thank you, everyone, for attending this talk. And I'd also like to thank all my collaborators; I've had the wonderful fortune to collaborate with a large number of great researchers. And I'd specifically like to thank my advisors Divy Agrawal and Amr El Abbadi, without whose contribution I wouldn't be standing here. And I'd like to open it up for more questions. [applause]. >>: [inaudible] along the way -- any lingering questions we want to pursue? >>: [inaudible] collaboration, essentially it's -- compared to the transition of migration services, it's just working at a different granularity [inaudible]. It also means that it requires the [inaudible] destination and the source to be on the same physical infrastructure. >> Sudipto Das: Yes, that is correct. >>: It also means that it has more complications. Have you given any consideration to the [inaudible] changes [inaudible]? If you consider those factors it will get much more complicated. >> Sudipto Das: Yeah, I agree with that. And this is one of the disclaimers I put up front, that I'm not considering it for this paper. >>: [inaudible] it's working as a [inaudible]. It's already very complicated. >> Sudipto Das: Yeah. Obviously, the more features you want to support, the more it increases the complexity as well. I don't know if that was a question or a comment; I agree with your comment. Yes. >>: [inaudible]. >> Sudipto Das: Yes, I agree with that. >>: There's a [inaudible]. >> Sudipto Das: Yes. So what we are trying to solve is more of a research question: how do you guarantee no downtime? The log shipping protocols actually result in some period of unavailability.
So actually -- I didn't talk about Albatross, which uses a variant of this log shipping protocol to migrate your database cache as well as the state of active transactions. >>: This is just a migration scenario, though, not a replication scenario. When you start talking about database mirroring, you can't just blithely say there are going to be no schema changes or other structural changes, because you're going to be running it for a long period. Here he's only doing the migration for a relatively short period, so shutting down schema modifications during migration seems like it would not be a big inconvenience. >>: In my [inaudible] one page, it may touch many other pages. >>: Yeah. >>: [inaudible]. >> Sudipto Das: We still support those kinds of transactions. The thing is that, as Phil pointed out, only during the short duration of migration do some of these operations become expensive. And what I would claim here is that you pay a cost in terms of latency, but you don't incur any unavailability. That is what it is. Yeah? >>: Did you check on actually the difference between [inaudible] because sometimes [inaudible] and I split, they [inaudible] the difference, because if [inaudible] requests a page and then you go [inaudible] the page to the source, if you fetch everything at once, the next query [inaudible] it is already there. Otherwise every time I do a request, they do a request. >> Sudipto Das: It's not all the time, it's just for the window. And as I'm trying to say, in some of the experiments I've shown it's on the order of a few seconds. That's the window while the source is finishing up the active transactions and the destination has started executing transactions. I'm not running in this mode forever; it's just for that small window. Obviously I can keep on adding features, but that complicates the design. I wanted to show that using a very simple design you can still guarantee a bunch of properties; that was the main thing we wanted to show here. And the period of restriction which we impose is also very short, on the order of even milliseconds in some of the cases. I've just given one snapshot of the total migration time. >>: [inaudible] other indexes are transferred [inaudible]. So you're actually trying to transfer the logical structure -- >> Sudipto Das: Yes. >>: Then when you're inserting the page, how are you inserting -- I mean the [inaudible] index. >> Sudipto Das: So the way the index wireframe copy works is that you take an intention lock on the root of the index, and that prevents any update locks from being taken, any changes being made to the internal nodes. But the index wireframe is still there, at both the source and the destination. Whenever there is an insert at the source or the destination, it goes through the index structure and figures out which page the insert would go to. If it figures out that there is going to be a page split, the insert is aborted; if the insert can fit into the page, it's allowed. >>: So basically, if I understand correctly, you are trying to recreate the index in the order of each page, [inaudible] because at the destination you are trying to insert one page and you are actually trying to recreate your index [inaudible]. >> Sudipto Das: I'm not recreating my index. It is just copied once, at the start of migration.
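[Editor's note: a small illustrative sketch of the insert rule just described for the frozen index wireframe -- an insert is admitted only if it fits in the existing leaf page, and anything that would force a page split (a structural change to the frozen internal nodes) is rejected. The class names and the page capacity are hypothetical, not taken from the H2 changes.]

import bisect

PAGE_CAPACITY = 4   # illustrative number of entries per leaf page

class FrozenLeaf:
    def __init__(self, low_key, high_key, entries):
        self.low_key, self.high_key = low_key, high_key
        self.entries = sorted(entries)    # keys currently in the page

class FrozenIndex:
    def __init__(self, leaves):
        # The wireframe (leaf boundaries, internal structure) never changes
        # while the index is frozen during migration.
        self.leaves = sorted(leaves, key=lambda leaf: leaf.low_key)

    def _find_leaf(self, key):
        for leaf in self.leaves:
            if leaf.low_key <= key < leaf.high_key:
                return leaf
        raise KeyError(key)

    def insert(self, key):
        leaf = self._find_leaf(key)
        if len(leaf.entries) >= PAGE_CAPACITY:
            # Would require a page split, i.e. a structural change:
            # abort the insert while the index is frozen.
            raise RuntimeError("insert would split a page during migration")
        bisect.insort(leaf.entries, key)
        return True

idx = FrozenIndex([FrozenLeaf(0, 100, [10, 20]), FrozenLeaf(100, 200, [110])])
idx.insert(30)                          # fits in the first leaf: allowed
try:
    idx.insert(40); idx.insert(50)      # the fifth entry would force a split
except RuntimeError as err:
    print("rejected:", err)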
And the rest -- the freezing of the index -- is mainly to make my bookkeeping easier. I could allow changes to the index, but then merging the indices of the source and the destination becomes expensive. So this is a trade-off, more of a design trade-off rather than a performance thing or something limiting which [inaudible]. Yes? >>: The [inaudible] transaction of the [inaudible] on the migration. You talk about [inaudible] queries. >> Sudipto Das: That's a very good question. I think I have it right -- that's the short answer. But it's harder for that, because -- >>: It's easy [inaudible] [laughter]. >> Sudipto Das: The thing is that a lot of transaction workloads often fit in memory, so when you are fetching pages at the destination, most of the time they happen to be in memory. But with OLAP or other analytics there are often a lot of scans, so that further increases the overhead of migration. But definitely that's an area of future work. It's an important area as well, because we want to handle a diversity of workloads. Okay. [applause].