>> John Douceur: So good morning, all. It's my pleasure to introduce Diwaker Gupta. He's a student of Amin Vahdat at UCSD. He'll be graduating real soon now and will be looking for a job. He's done work in distributed systems, network emulation, scaling and management of virtual machines and virtualized system infrastructure. He's got publications in NSDI and OSDI, including the most recent NSDI and the upcoming OSDI, and he's going to be telling us about some of that work this morning. >> Diwaker Gupta: Thanks a lot, John. Good morning, everyone. Thank you all for coming. It's a pleasure to be here. So I want to start by describing what I think are two compelling problems, and in this talk I'll show you how to address them using virtual machines. So the first one is the problem of protocol evaluation. So we were just talking about TCP, and one of the problems with TCP is that on extremely high bandwidth networks it didn't perform very well. Now, in the literature several variants and enhancements to TCP have been proposed to address this problem. So let's say you wanted to test two different variants of TCP on, let's say, a hundred gigabit per second link to see which one does better. How do you go about doing this? Now, traditionally you could approach this problem in three different ways. You could try to get access to a real world hundred gigabit per second link if such a link exists, but even if such a link exists it probably is not going to be easily accessible to everyone out there. You could try to do a complete software simulation such as in NS-2, but then you lose realism because you're not using real operating systems and real unmodified applications. And finally you could use an emulation environment such as ModelNet. So a network emulator allows you to experiment with arbitrary network topologies with unmodified applications and standard operating system networking stacks. But even in a network emulation environment you are fundamentally limited by the bandwidth of the underlying network. So in this talk, I'm going to show you how to do such experiments while preserving realism, right in the comfort of your lab. So here's another problem. How do you test large systems? Here's a quote by Werner Vogels, who is the CTO of Amazon.com, and what he's pointing out is that in order to test a large system like Amazon, ideally you want to create a replica of the entire system and test on that replica. But what if you don't have the capacity to duplicate all the infrastructure? So typically we resort to doing small scale tests on a small number of machines and extrapolate from there. But the same problem exists in other domains as well. You don't have to be as large as Amazon. Let's say you're a company that makes a high performance file system and you are shipping this file system to clients who will deploy it in various kinds of deployment scenarios, so they might have different hardware and software configurations and so on. So ideally you want to make sure that the system is tested in all these possible deployment scenarios. Unfortunately you have a limited testing infrastructure at your disposal. And furthermore, this testing infrastructure might be shared internally among several different teams who are doing development, so each individual team gets even fewer machines to work with. And in particular it might be impossible to test the system at the scale at which it will be deployed using the limited number of machines you have.
So in this talk I'm going to show you how you can use a small infrastructure to accurately test and replicate a much larger system, an order of magnitude larger system in fact. So these are sort of the two problems that I'll be talking about in this talk. But where do virtual machines fit into the picture? Before I start talking about that, I just want to clarify the terminology that I'm going to use in this talk, because virtual machines mean different things to different people. So on a regular physical machine, we have various resources such as the CPU, the disk, the memory and so on. And we have the operating system which is multiplexing these resources amongst the different applications. In the virtual machine environment, instead of the operating system, we have a thin layer called the virtual machine monitor or the hypervisor which is responsible for managing the system resources among several different virtual machines. A virtual machine is basically a software abstraction of a physical machine and each virtual machine can have its own operating system and application stack. We call the operating system that runs within the virtual machine the guest operating system, and I'm also going to use the term domain to denote a virtual machine in this talk. The virtual machine monitor exposes a hardware like interface to the virtual machine, so the guest operating system thinks that it's actually running on real hardware. So that's sort of the model that we're working on. The concept of virtualization itself is quite old, going back almost 50 years now, but in the last couple of years interest in virtualization has picked up significantly. In fact, a recent IDC report estimates that virtualization was already a 5.5 billion dollar industry in 2006 and it's projected to grow strongly in the near future. So it looks like this market is growing rapidly. But what are some of the things that are driving the adoption of virtualization? What are people using virtualization for? So here's a recent survey done by cio.com where they asked people what they are using virtualization for in production IT environments. And by far the most popular use is for server consolidation. So by putting services on a smaller number of physical machines you can cut down costs and increase utilization. So that's one of the biggest driving factors. But the following couple of reasons I clumped them together and called them novel applications. And these are I think the more compelling reasons that will drive virtualization moving forward, and we'll talk briefly about both of these in the next couple of slides. So of course, in the past couple of years server consolidation has become increasingly important. Over the past decade or so the infrastructure costs of a data center have remained largely stable, but the administrative costs and the power and cooling costs have been increasing at an alarming rate, and so by consolidating their infrastructure organizations can cut down both these expenses. And in fact [inaudible] reports that typical server utilization in data centers is in the five to 15 percent range. And using virtualization they can bring it up to 60 or 80 percent, which translates to tremendous dollar savings for organizations. But as I said earlier, I think virtualization is becoming increasingly more compelling for the new and unique applications that it enables.
For example, because the virtual machine monitor can export different hardware interfaces, you can actually support legacy software and legacy hardware even if the real hardware doesn't exist anymore. Go ahead. >>: Does server consolidation address the biggest costs that you were talking about? >> Diwaker Gupta: A little bit. >>: You still have to, you know, manage each of the virtual machines as much as [inaudible]. >> Diwaker Gupta: Absolutely. So there is some maintenance cost that is, you know, constant overhead per physical machine and there is some maintenance cost that is constant per virtual machine. But the real story is that, you know, with these virtual machines you get management tools to manage this virtualized infrastructure, and things like keeping the virtual machines updated, patching them, become easier if you are operating in this virtual machine framework than if you were working with physical machines themselves. So while there is still some cost associated with managing virtual machines, the overall cost decreases because the dominant cost is, you know, the per physical machine cost that you have to manage in that infrastructure. We can talk about that more later on. There are several other interesting applications in security, for example you can do intrusion detection or full system logging and replay. Virtualization is extremely useful for development and testing, so if you have -- if you're testing a patch for Internet Explorer, instead of testing on several different versions of Windows and IE and different combinations, you can quickly pre-create virtual machine images with all these different software combinations and use a single machine for testing. And finally, virtual machines can be created on demand and migrated from one physical machine to another, so this flexibility in how you provision resources allows you to build more agile infrastructures, and companies like Amazon and [inaudible] are using virtualization to build out their cloud computing infrastructures. The work that I've been doing the past couple of years has been mostly around virtualization and spans both of these domains. The overarching theme in my research has been to allow you to do more with virtual machines by making virtualization more scalable. So for example by allowing more aggressive server consolidation or by addressing some of the applications that I mentioned earlier. So in this talk I'll talk mostly about novel applications that virtualization enables, two examples of which I gave earlier. I've also done a bunch of work in server consolidation. So for example by improving the way virtual machine monitors manage memory you can actually increase the number of virtual machines that can be supported on a single physical machine, allowing for more aggressive server consolidation. I have worked in performance isolation, where you are trying to make sure that a virtual machine doesn't negatively impact the performance of other virtual machines that are running on the same physical machine. And finally I've worked on infrastructure management and an extensible framework for managing virtual machines and the resources allocated to them. And so as I said, in this talk I'll primarily focus on the novel applications and briefly talk about the memory management work, but I'm more than happy to discuss the rest of the projects offline. So with this background and motivation here is the outline for the rest of my talk.
I'll first present a technique called time dilation that allows us to go beyond the capacity of the underlying hardware, and show its application in protocol evaluation. I'll then present DieCast, which is a framework that leverages time dilation to do large scale testing using a much smaller infrastructure. Next I'll briefly discuss a system called the Difference Engine that allows for more efficient memory management in virtual machines, and I'll then touch upon some related work before concluding. Before I dive into details, I want to take a moment to quickly introduce Xen. So while most of the ideas that I'll present here are applicable to virtual machine monitors in general, the implementation is largely based on Xen, which is an open source virtual machine monitor out of the University of Cambridge. When Xen boots, it starts an initial virtual machine called domain zero; domain zero is sort of the control entry point for the system where you can create and manage other virtual machines. There are two different kinds of virtual machines that are supported in Xen. Paravirtualized virtual machines require modifications to the guest operating system to make them virtualization aware, allowing for some performance optimizations. But the flip side is that you can't support all operating systems; in particular, operating systems for which the source code is not available cannot be supported. Xen also supports fully virtualized VMs, which allow running unmodified guest operating systems, so you could run Windows, Solaris, what have you, but they require hardware support such as Intel VT processors. All right. So let's talk about time dilation. Consider this physical machine here with the given configuration, and this physical machine potentially hosts several virtual machines. An obvious consequence of doing this kind of multiplexing is that these resources will get split up among these virtual machines, so each virtual machine individually has access to only a fraction of the underlying resources. So in particular in this example, let's say you create five identical VMs, and these VMs have equal access to all the resources and they have equal weights as well, then each virtual machine roughly sees the equivalent of a fifth of the resources. So it will see a fifth of the CPU, it will see one fifth of the network connectivity and it will have one fifth of the main memory on the machine. The goal with time dilation is to somehow increase the amount of resources a virtual machine thinks it has. And we'll see how it can be useful as we go along. So here's the idea. What we do is we take time and trade it off for other resources in the system. So consider this example. Over a period of one second, let's say the operating system receives 10 megabits of data. If you then ask the operating system what is the bandwidth that it perceived, it's going to say 10 megabits per second. In time dilation we're going to slow down the passage of time inside the operating system. So let's say we repeat the same experiment, but this time we convince the operating system that instead of one second only a hundred milliseconds have passed. But it still sees the same amount of data. So if you now ask the operating system what is the bandwidth that it perceives, it's going to say a hundred megabits per second. So it thinks that it has a faster network than it actually does.
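A minimal sketch of the arithmetic behind this bandwidth example (not from the talk; the function name is just illustrative): a guest whose clock runs slower by the time dilation factor attributes the same amount of data to less elapsed time, and so measures a proportionally higher rate.

```python
# Minimal sketch (not from the talk): a dilated guest's perceived rate.
def perceived_rate_bps(bits_received, real_seconds, tdf):
    guest_seconds = real_seconds / tdf      # the guest believes less time has passed
    return bits_received / guest_seconds

print(perceived_rate_bps(10e6, 1.0, tdf=1))    # 10 Mbps as seen by an undilated guest
print(perceived_rate_bps(10e6, 1.0, tdf=10))   # 100 Mbps as perceived under a TDF of 10
```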
And so by slowing down the passage of time inside the operating system, we can make all these time based resources such as the CPU, the network bandwidth and the disk bandwidth all appear faster. Of course now your experiments will take much longer to complete, because for each second of realtime only 100 milliseconds are passing inside the operating system's timeframe. So we call this ratio of realtime to virtual time the time dilation factor, which in this case is 10. So let's revisit our multiplexing example with time dilation. So earlier we saw that if we have five identical VMs, each of them roughly sees a fifth of the underlying resources. Now if you take each of these virtual machines and run them under a time dilation factor of five, all of these VMs will roughly see five times the resources that they had. But note that time dilation only impacts time based resources, temporal resources. It doesn't impact static resources such as main memory capacity or secondary storage capacity, so each of the virtual machines still sees 400 megabytes of RAM. And I will come to this point later again. But note that there's really nothing special about the number five here. We could have picked a different time dilation factor and the perceived resource capacity would change accordingly. So if you pick a time dilation factor of four, the VMs would perceive a different resource capacity. Why did we have to create five virtual machines? There's no strong reason to create five virtual machines; we could [inaudible] and have a single virtual machine on this physical machine. And so let's say you have some physical machine with a gigabit per second network and you create a single virtual machine on this physical machine, so originally this virtual machine has access to the entire gigabit per second of bandwidth that's available from this machine. Now, if we run this VM under a time dilation factor of 10, we can actually make it believe that it has access to a 10 gigabit network. And this is precisely the kind of thing that we need to do the scalable network emulation that I gave an example of earlier in the talk. Are there any questions at this point? All right. So how do we go about implementing time dilation? The general strategy for implementation is that you have an operating system and that operating system relies on a variety of time sources to infer some value of time. So the most common one is the timer interrupts that the operating system is receiving, but there can be several other time sources as well, right? You could have specialized counters such as the TSC on x86 platforms. There are other hardware timers such as the high precision event timer or the programmable interval timer. And there might be some information maintained in the BIOS as well. And so the challenge here is to basically figure out all the different time sources that an operating system might use. So the operating system takes all these time sources and infers some value of time. So the key point is that the operating system doesn't have any notion of absolute time or realtime. And so to implement time dilation, what we need to do is interpose between all these different time sources and scale them appropriately by the given time dilation factor. So it might involve reducing the frequency of the timer interrupts or manipulating the value of the TSC that's visible to the operating system.
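Here is a minimal sketch of the kind of scaling just described, assuming a time dilation factor of 10. This is hypothetical illustration, not actual Xen code; the function names and the specific time sources shown are assumptions.

```python
# Minimal sketch (hypothetical, not real Xen interfaces): how a hypervisor
# could scale each time source a dilated guest might consult.

TDF = 10  # time dilation factor: real time / guest-perceived time

def guest_visible_tsc(real_tsc_cycles):
    # The cycle counter exposed to the guest advances TDF times slower.
    return real_tsc_cycles // TDF

def real_interval_for_guest_tick(guest_tick_ms):
    # A timer tick the guest expects every guest_tick_ms of its own time
    # must be delivered every guest_tick_ms * TDF of real time.
    return guest_tick_ms * TDF

print(guest_visible_tsc(3_000_000_000))   # guest sees ~300M cycles for 3G real cycles
print(real_interval_for_guest_tick(10))   # a 10 ms guest tick arrives every 100 ms of real time
```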
And if we can interpose between all the different time sources, the operating system will automatically perceive a different value of time. Now, the challenge here is that the implementation should be both transparent and pervasive. And by transparent, what I mean is that we shouldn't have to modify the operating system or the applications to support time dilation, because we want to support unmodified operating systems. The implementation should be pervasive in the sense that the operating system and the applications should have no way of figuring out that they're running in this dilated timeframe; if realtime somehow leaks into the system then we will see different behavior than expected. Yes? >>: When I'm running in a virtual machine that's getting a tenth of the CPU, I don't see my CPU clock going 10 times slower, I just see sudden bursts of it. So [inaudible] unless you're scaling based on the number of CPU cycles I'm getting, this isn't going to be transparent, right? >> Diwaker Gupta: So you raise a great point. Transparency, absolutely, transparency is hard to achieve. What we are trying to say here is that depending on the application that you're running, it might not matter if the exact distribution of CPU cycles is preserved or not, right? And so as we'll see later, for most applications it doesn't really make a difference if the number of -- all we are saying is that, you know, in one second -- so let me rephrase it. So if you did some CPU intensive task in a regular operating system and you timed it, you will measure it took some amount of time. All we are saying is that under time dilation if you repeated the same experiment, it will take a tenth -- if you were using a time dilation factor of 10, it would take a tenth of the time simply because that's how you're measuring. So in that sense the CPU appears faster. Does that make sense? Yes? >>: I think we're very curious about how you deal with first the effects of the virtual machine and [inaudible]. >> Diwaker Gupta: I'll come to some of it later as I go along. All right. So the implementation that we have for Xen, you know, does roughly the same things that I just described. We scale all of the different time sources by the given time dilation factor. In fact, we allow each virtual machine to be run with a different time dilation factor, which is extremely useful. Our implementation supports both paravirtualized and fully virtualized VMs. >>: [Inaudible] how do you get [inaudible]? >> Diwaker Gupta: For paravirtualized -- I mean, for paravirtualized VMs the argument is that because paravirtualized operating systems anyway require modification, for our paravirtualized implementation we actually just modify the operating system itself. So if you're going through the operating system to access the TSC, then we will trap it and return the right value. But in paravirtualized operating systems, the applications might still be able to access the TSC directly by executing some instructions on the hardware. So you're right, for those applications you have to do some kind of trap and emulate, so you have to watch the instruction stream that's emanating from the VMs sort of right there. But right now we don't handle that case. >>: [Inaudible]. >> Diwaker Gupta: No. Any other questions? Okay.
I do want to mention here that you could implement time dilation directly inside the operating system without having to go through this virtual machine interface, but implementing it inside the virtual machine monitor makes this kind of scaling very easy, because you already have interfaces for exposing the different time sources through the virtual machine monitor. So this makes it much easier. But the real issue, as some of you have brought up, is do we believe time dilation, in particular for the bursty kind of applications, and what is the impact of multiplexing? So how do we test if time dilation really works? Our general methodology for validation is that we're going to pick some baseline that we can currently attain and validate, and we're going to -- we're then going to scale this baseline down. And this scaledown can be either in the form of picking less capable hardware or artificially constraining the resources available to a physical system. So once we have the scaled down system, we are going to use time dilation to scale it back up to get a perceived configuration. And then the goal is to compare the performance of the baseline and the perceived configuration. And if the amount of resources in the perceived configuration is similar to the amount of resources in the baseline, then the expectation is that they should behave similarly. So the invariant you want to maintain is that of resource equivalence. So basically what I mean by that is that after time dilation the perceived configuration should have the same resource capacity as the baseline system. So how do we get the scaled configuration? Let's start with a simple example here. Here I have a single link which is running at 10 megabits per second and 20 milliseconds of roundtrip time, and these end points might be running under time dilation. And the invariant that we want to preserve in this case is that, as perceived by these end points, the link should have the same network characteristics all the time. So in the baseline system, which is equivalent to having a time dilation factor of one, the real and the perceived configuration are exactly identical. Now, if you were to run the end points at a time dilation factor of 10 without doing anything else, the end points are going to perceive a much shorter latency on the link, in particular two milliseconds, and a correspondingly faster bandwidth, so the link will appear to be a 100 megabit per second, two millisecond link. So what we need to do is artificially scale down the capacity of the link and make it slower so that after time dilation the perceived configuration matches the original configuration. And in particular we want to manipulate the link so that it has an actual bandwidth of one megabit per second and a latency of 200 milliseconds. Now, we can easily do this kind of manipulation of the link configuration by using a traffic shaping tool at the end point or using a network emulation environment to manipulate the characteristics. So this also goes back to the pervasiveness argument that I was talking about earlier: to preserve this illusion of time dilation we want to make sure that all the components involved in this system -- the end points, each of the links, any appliances you might have in between, routers and so on -- should all be scaled uniformly, otherwise we will get unexpected behavior. So next we come to the actual experiments that we do for validation.
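To make the resource equivalence rule for a link concrete, here is a minimal sketch of the scaling arithmetic (an assumed illustration, not the authors' tooling): the real emulated link is scaled down so that, after dilation, the endpoints perceive the intended baseline characteristics.

```python
# Minimal sketch (assumed arithmetic): scale the real link so the dilated
# endpoints perceive the baseline configuration.
def real_link_config(perceived_bw_mbps, perceived_rtt_ms, tdf):
    real_bw_mbps = perceived_bw_mbps / tdf   # dilated guests overestimate rates by tdf
    real_rtt_ms = perceived_rtt_ms * tdf     # and underestimate delays by tdf
    return real_bw_mbps, real_rtt_ms

# Baseline: 10 Mbps, 20 ms RTT. Under a TDF of 10 the emulator should be set
# to 1 Mbps and 200 ms so the perceived link characteristics are unchanged.
print(real_link_config(10, 20, tdf=10))   # (1.0, 200)
```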
We have validated time dilation for a wide variety of configurations, but it's really easy to match coarse grain metrics like aggregate throughput over some period of time. And what we really want to see is how does time dilation behave at very, very fine granularities. So let's start with a simple experiment here. We have a single TCP flow, and we inject one percent deterministic losses in this TCP flow. It's going across a 100 megabit per second link with a roundtrip time of 20 milliseconds, and we're looking at the first second of the trace. On the X axis you have time since the beginning of the experiment, on the Y axis you have the sequence numbers. Basically this is showing a packet sequence diagram of the first second of the trace. What we want to do is repeat this experiment under different time dilation factors, making sure that the perceived link characteristics remain the same, and then we want to see if we can preserve this low level packet behavior. All right. So here is the baseline configuration again at the very top. We repeat this experiment at a TDF, time dilation factor, of 10, again making sure that the perceived configuration is the same, and at a time dilation factor of 100, and visually you can see that the traces look very, very similar. Of course we have also done statistical comparison by using the distribution of the inter-packet arrival times and so on. And even the distributions match. And so for this very simple case time dilation is able to preserve low level packet behavior. But this is, again, a simple case with a single flow; we have also validated time dilation in more complicated scenarios where we have multiple flows, varying bandwidths, varying latencies and so on. And I'm not going to present the details here, but for all of these validation experiments, we show that time dilation can actually preserve the baseline configuration. And coming back to the point you had about burstiness behavior, again I think it goes back to how -- what is the fidelity of your implementation? If the time sources that are being used by the application, by the operating system, are all time sources that you have already interposed on, then no matter what granularity the application tries to get the value of time at, it will get the scaled value of time. So the only case where burstiness actually will show through is when the application is using a time source that we cannot capture, which is what Ed was pointing out. So in the paravirtualized case, if you have applications that are directly issuing assembly instructions to read the TSC value, that's something that we do not handle right now. Yes? >>: [Inaudible] the first example, if the two slowed down machines run differently [inaudible] then if they're on the same physical machine one of them might get to run for some long quanta, and not see a response from the other one during that quanta, and that's something [inaudible] they're not slowing down. >> Diwaker Gupta: That's a great point. So I think what you're pointing out is that there is some overhead involved in doing this kind of multiplexing. >>: I'm not talking about overhead, I'm talking about the burstiness and the quantization of the CPUs. For example, say you do time dilation wherein you gave each VM a 10 second quanta, and the first VM would spend its 10 seconds perceiving it to be 100 seconds and wondering where in the heck that other [inaudible] and then we switch to the other guy and [inaudible].
>> Diwaker Gupta: So again, for most of the workloads that you see here, I mean for network I/O it's not going to happen, right, because you're going to send some packet, you're going to wait for some response, you're going to block and then the context switch will happen. Right? And so you're not ever going to run for 10 seconds straight. >>: My point is if your infrastructure were very foolish in allocating quanta then you absolutely would see and could encounter this thing [inaudible] how enough for fidelity -- and your machine worked allocating these 10 second quanta [inaudible] so how do we know yours is small? >> Diwaker Gupta: Right. So one of the things we do is we modify the scheduler that the virtual machine monitor uses to use much smaller time quanta. So you can -- you know, you make sure that no virtual machine is running for too long. That's part of the solution. But again -- >>: How do you know, I guess, is the question? >> Diwaker Gupta: So -- that's a great question, and our metric has been to try different kinds of applications that we care about and see if we can preserve their behavior, you know, for all the application specific performance metrics that you might want to measure. And in -- for the applications, I mean this is just one class of applications that I'm talking about, but later on I'll talk about several other classes of applications. And for all those it hasn't mattered. And so you're right that there might be some artifacts that we're not able to capture, but if the applications that you're using are not concerned about them, then how do they make a difference? >>: So what observation would it [inaudible] fact that you're showing us these graphs that visually match. Did you have -- what's your test for not matching enough? If I built a competing system that would also attempt to achieve the same goal [inaudible] some egregious error, what would be a characteristic of my graph? Did you ever see a graph which you looked at and said, oh, that isn't faithful enough? >> Diwaker Gupta: I'll come to that later. So there are artifacts in the system where, you know, you have to do additional work to make sure that the graphs do match. Basically our metric has always been: take an application, whatever application, we don't care, and look at the metrics that you normally use to measure the performance of that application. And then run it under time dilation and see if you can match the performance. And if there are finer grained metrics that you wanted to use, you look at those and see if you can match them. >>: [Inaudible]. >> Diwaker Gupta: Okay. So as I said earlier, in a distributed system you want to make sure that all -- >>: [Inaudible] but if you run in bursts, say your clock goes fast and then stops and fast and then stops, NTP is not going to work -- >> Diwaker Gupta: So, so, in this frame -- in this environment, basically if you're using NTP, the NTP server has to be running in the dilated timeframe as well. Because -- >>: Of course it has to be running in the dilated timeframe, but if time goes fast and then stops and fast and then stops and they're not in lock step across all the machines, then NTP isn't going to converge properly. I know I've actually run it on machines that have clocks that did that as the clock was paused, and it didn't work. >> Diwaker Gupta: So I've never experimented with NTP, so I really don't know the answer there. Yeah. >>: So [inaudible]?
>> Diwaker Gupta: We have. I'll come to that. All right. So let me -- let me move on, and I want to talk about one of the applications of time dilation. So again, coming back to the protocol evaluation example that I had earlier. So in the Linux 2.6 kernel the default flavor of TCP is TCP New Reno, but Linux also ships with another variant called TCP BIC, which stands for Binary Increase Congestion control, and this variant features an enhanced congestion control protocol designed specifically for high bandwidth networks. And what we want to do is compare New Reno versus BIC on high bandwidth links and see which one does better. So the goal here is to treat these protocols as black boxes; we're not trying to reason about the behavior that we see, we're just trying to see can we use time dilation to push beyond the hardware limit and, you know, uncover interesting behavior, in particular does BIC really outperform New Reno as you increase the bandwidth. And so here's the experiment setup. We have a single bottleneck link, and we have 50 TCP flows that are going across this bottleneck. We're going to fix the round trip time of the bottleneck link to 80 milliseconds and we're going to vary the bandwidth on this link. And for each value of bandwidth, we want to measure the per flow throughput that we see. So in the first sort of phase of the experiment, on the X axis I have the bandwidth of the bottleneck link going from zero to one gigabit per second, on the Y axis I have the per flow throughput in megabits per second. The first thing you notice is that in this range, because we have gigabit hardware in our clusters, we can actually simply run the real operating system without using time dilation and measure what the performance looks like. And so the blue dots here are the performance with regular Linux, no time dilation. Then we can repeat the same experiment just to validate if our setup is correct. So we repeat the same experiment using a time dilation factor of 10, again making sure that the perceived bandwidth remains the same in each case. And the green squares -- triangles, sorry, green triangles are the performance under time dilation. And in this range you can see that, you know, they match, they're within the variance, the performance is very close to what we've seen in the baseline. The red squares are the performance we get with TCP BIC, and in this range there's no clear difference between the performance of these two different TCP variants. Because we are using a time dilation factor of 10 and we have gigabit hardware at our disposal, we can actually push further and test up to 10 gigabits per second. >>: [Inaudible] with the TDF of [inaudible]? >> Diwaker Gupta: We have, but it's not on this graph. >>: What does it look like roughly? >> Diwaker Gupta: It's the same. >>: [Inaudible]. >> Diwaker Gupta: Yes. So in this range we are pushing up to 10 gigabits per second; again on the Y axis we have the per flow throughput, and here we see that as we increase the bandwidth of the bottleneck link, TCP BIC begins to outperform TCP New Reno. So definitely there seems to be some advantage that BIC brings to the table. But does it continue to outperform New Reno if you push even further? So we can actually use a higher time dilation factor to test at even higher bandwidths. And so we move to a time dilation factor of a hundred, which allows us to go up to a hundred gigabits per second on the bottleneck link. And there are two interesting things in this graph.
The first is that the performance of both TCP BIC and New Reno sort of flattens out, so they don't increase in terms of the bandwidth that they can extract on this link, and secondly the performance differential between these two variants seems to shrink in this range. And so the claim here is that TCP BIC is good for high bandwidth networks but only up to a certain range, and after that we hit diminishing returns where TCP BIC doesn't prove to be as beneficial. >>: [Inaudible]. >> Diwaker Gupta: Yeah? >>: From the same series of data I can also draw a completely different conclusion [inaudible]. I can back up two slides and say that at the small range we've verified that through the time dilation filter these things look the same, and we can see on the next slide at some point the time dilation filter seems to see something different, but then on the third slide it looks like you can't tell the difference anymore. But we can't actually tell whether the phenomenon that we see past the validated area is a function of the variance between -- the difference between [inaudible] or a function of a limitation in the fidelity of this. I mean, it seems like you're trying to simultaneously validate the mechanism and draw a conclusion from it. It can't be both at the same time. >> Diwaker Gupta: That's a great point. And I mean the whole point of -- we spent a lot of time in this project just trying to validate, validate, validate, and we can only validate, you know, up to sort of the hardware capacity that we have. And so in this case we had gigabit hardware, so we validate up to gigabit, and beyond that, it is our hope that, you know, once we can actually test on a hundred gigabit per second network it will look something like this. But we have no way of finding out right now. >>: [Inaudible] did the results look like yours? >> Diwaker Gupta: So we did look at NS results for some of the things, but not this particular experiment, because I didn't have an implementation of TCP BIC and so on. So I mean, the problem with, you know, doing apples-to-apples comparisons with an experiment like this in NS-2 is that you don't have the real operating system stack. I mean, there are so many other artifacts that are different there and -- >>: [Inaudible] results that look like what NS did qualitatively then that would be -- >> Diwaker Gupta: So [inaudible] some other experiments, but not this one. >>: It may be that they're wrong or it may be that you're wrong or it may be that [inaudible] are wrong. >> Diwaker Gupta: Sure. So we haven't looked at NS-2 results for this experiment, but we have for other experiments. Yeah. >>: As you're scaling up the bottleneck bandwidth, are you scaling up anything else? I'm wondering if the bottleneck remains the bottleneck [inaudible] the bandwidth. >> Diwaker Gupta: So I'm not sure what you mean by that. >>: [Inaudible]. >> Diwaker Gupta: Yes. So the end points are running at a time dilation factor of 10, so they are -- they do have more CPU in that sense, yeah. >>: So [inaudible] real big here, I assume that's because it takes so long to run the experiment, you can only run it one and a half times instead of 10? >> Diwaker Gupta: Well, these are across the 50 flows, they are not across different runs. So we're plotting the mean and the standard deviation across those 50 flows. Yeah.
>>: Do you suppose those error bars [inaudible] are getting bigger because of the protocol, or are they getting bigger because burstiness is [inaudible]? >> Diwaker Gupta: Again, I don't know for sure. I mean, in the range that we validated it didn't look like we were adding something to the error bars. But in this range, I don't know for sure. All right. So I'm going to move on at this point. And the takeaway here is that we think time dilation can be used as an effective tool to push beyond hardware limitations, to do this kind of protocol evaluation. And I think compared to the current state of the art, in terms of the techniques and the tools that you have available, time dilation gives you more realism and accuracy at the same time. So that's just one application. Now, there are several other applications you could use time dilation for. You can use it to predict how the performance of your cluster will change if you upgrade some piece of equipment; you could also explore how the bottlenecks in your application evolve as you give it more and more resources. But there are certain limitations to be kept in mind when you're using time dilation. First of all, time dilation doesn't scale static memory capacity, as I mentioned earlier, and our work on Difference Engine addresses some of the issues here; I will talk about it towards the end of the talk if I have time. But remember that for pervasiveness we want everything in a distributed system to be running under the same time dilation factor, and if you have some specialized hardware appliances that either cannot be virtualized for some reason, or if the [inaudible] on the appliance is not accessible for a direct implementation, then these need to be dealt with differently as a special case. And finally time dilation cannot capture radical hardware changes, so if hundred gigabit hardware looks fundamentally different from how we construct gigabit hardware, then the accuracy of the predictions that time dilation makes will degrade. And so we cannot capture all technological evolution. Yeah? >>: [Inaudible] sort of higher level [inaudible] one of the motivations is you want to simulate Amazon or something like that, right? Well, like in Bill's example with NTP there's actually sort of interesting higher level synchronization your system doesn't necessarily capture, right. So the evaluation you showed us here was basically bulk TCP throughput, right, which is fairly straightforward. But I mean at what point do you have to say okay, the system we're trying to emulate is so complex we have to somehow intuit about global synchronization behaviors or things like that. >> Diwaker Gupta: Right. So in this work that I'm going to talk about, DieCast, I'll present evaluations for much more complex distributed systems. So hopefully that will address your point, yes. >>: Does TCP look at the time, say the [inaudible] 10 milliseconds [inaudible] differently or [inaudible] millisecond? >> Diwaker Gupta: Excuse me? I didn't get that. >>: Does TCP really take latency into account? >> Diwaker Gupta: Yes. >>: [Inaudible] actual number. >> Diwaker Gupta: Yes. >>: [Inaudible] different [inaudible]. >> Diwaker Gupta: Yeah. All right. So next I want to talk about DieCast. And just to refresh, the motivation here is to use a small infrastructure to test a much larger system. And we want to do this kind of replication and testing with some goals in mind.
The first thing we want is fidelity, right, we want to make sure that we can replicate the original system to as great a fidelity as we can. We also want reproducibility, in the sense that we want to be able to do controlled experiments for performance isolation, for performance debugging and so on. And finally we want to be efficient, in the sense that we want to use as few resources as possible to do this replication and testing. And as I said before, I'll show that DieCast can scale a given test infrastructure by an order of magnitude or more. So let me walk you through the approach that we take with an example. So here we have a typical three tier web service. We have some web servers, we have application servers, we have database servers. They're all connected to -- with some -- through some high speed fabric. And then we have a load balancer in front of the system. And the goal now is to replicate and test the system using a fewer number of machines. And note that if you were to make a complete exact copy of the system, you would get fidelity, because it's basically the same system all over again, and you might also get the reproducibility, but by our definition of efficiency, because we are using double the resources, the system is not efficient. So in order to bring efficiency back, what we do is we're going to encapsulate each of these physical machines into virtual machines and then we're going to consolidate these virtual machines on a fewer number of physical machines. So we do some consolidation that gives us fewer physical machines, but now we've lost the original topology of the network. So we're also going to put in a network emulation environment that can recreate the original topology. But now we have the problem that each of these virtual machines only has access to a fraction of the resources of the underlying test machine. And furthermore, the machines in the test harness might be completely different than the machines in the -- the physical machines in the original system. So we have lost fidelity. So how do we restore fidelity? So let's see what we want to do here. So we have some machine in the test harness and here we have some machine in the -- some physical machine in the original system. Now, so far what we have done is we have created some number of virtual machines on this test machine, and what we want to do is take one of these VMs and make it look like the corresponding physical machine in the original system. So obviously one of the things that we need to do is use time dilation to scale up the resource capacity that's visible to this VM, but the other thing we need to do is take into account the heterogeneity between the machines in the test harness and the machines in the physical system, and also, because each of these virtual machines might not be identical, they might be configured differently, we have to be able to precisely control the amount of resources that are available to each virtual machine, in particular the amount of CPU that's available to it, the amount of network bandwidth that's available to it, and the amount of disk I/O that's available to it. So for network scaling, we use traffic shaping tools in combination with network emulators, as I described earlier. For CPU scaling we leverage CPU schedulers inside the virtual machine monitor, so we can say things like this virtual machine should get 10 percent of the CPU. There are some subtleties involved in the CPU scheduler, but I don't have time to cover them now.
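As an illustration of how the CPU and network knobs combine with time dilation, here is a minimal sketch (an assumed calculation, not DieCast's actual configuration code; all names are hypothetical) that derives per-VM caps on a test machine so that, once scaled up by the time dilation factor, each VM perceives the resources of the original physical machine.

```python
# Minimal sketch (illustrative assumption): per-VM caps on the test machine.
def per_vm_caps(orig_cpu_ghz, orig_net_mbps, test_cpu_ghz, test_net_mbps, tdf):
    cpu_share = orig_cpu_ghz / (test_cpu_ghz * tdf)          # fraction given to the VMM scheduler
    net_cap_mbps = min(orig_net_mbps / tdf, test_net_mbps)   # cap enforced by the traffic shaper
    return cpu_share, net_cap_mbps

# Original machine: 2 GHz CPU, 1 Gbps NIC. Test machine: 2 GHz, 1 Gbps, TDF 10.
# Each VM gets a 10% CPU share and a 100 Mbps cap, yet perceives a full machine.
print(per_vm_caps(2.0, 1000, 2.0, 1000, tdf=10))   # (0.1, 100.0)
```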
But for the most part, we just leverage existing CPU schedulers. But what about disk I/O? Under time dilation the disk will appear to be much faster, and in particular it can be perceived as a completely different disk than the disk in the original system. And we want to make sure that the VM sees the same disk as the physical machine sees in the original system. So how do we deal with disk scaling? So before I describe that, let's first look at how this happens in Xen. So again, here we have implementations for both paravirtualized and fully virtualized VMs, but in this talk I'm only going to talk about the fully virtualized implementation. So here we have an unmodified operating system. It thinks it's using a real disk drive. So the device driver is unaware that a real disk doesn't exist. Now, in practice the file system of the VM might be backed by a disk image or a partition in domain zero. There is a user space process in domain zero called the I/O EMU which is responsible for doing the I/O emulation for this particular VM. So any disk requests that originate inside the virtual machine will be intercepted by the I/O emulator process, which will then do the operation on the virtual machine's behalf, so it acts as the hardware, does the reads and writes and so on. So this is the model that we're working on. So what is the goal? Again, the goal is that we want to preserve the perceived disk characteristics, so everything from seek times to disk throughput and so on. And because the disk will appear faster under time dilation, what we want to do is take each request and slow it down by some amount to make sure that inside the virtual machine the perceived characteristics remain the same. Now, the challenge is that a lot of the low level functionality for disks resides in the firmware, and so the firmware might do batching and reordering of requests for efficiency, masking some of the delays. But the bigger problem is that the test harness may have a completely different hard drive than a physical machine in the original system. And so how do we reconcile these two completely different pieces of hardware? So our approach to addressing both these issues is to use DiskSim. DiskSim is a high fidelity disk simulator out of CMU. So what DiskSim does is that you give it a model of a disk to simulate, and then for each request it will report how long the request would take to service in that particular disk. So we have DiskSim running as a separate user process, and what we do is for each request that comes to I/O EMU we forward the request to DiskSim, so DiskSim knows things like which sector number is being accessed, what is the size of the request, what is the type of the request -- read, write -- and so on. And after that, it will return the disk service time in the simulated disk. Do you have a question? >>: I have a question [inaudible]. My understanding is [inaudible] you can divide the CPU [inaudible] bandwidth even [inaudible] okay, but in [inaudible], right, therefore everyone shares the same queue. It's hard to make sure everyone gets, like, the distribution [inaudible] how do you solve that problem? >> Diwaker Gupta: So that's a great point. Which is precisely the reason we want to control the time it takes for each request to service, right? So in particular DiskSim is going to return the service time in the simulated disk. And we know the time dilation factor. We also know how long the request actually took to service inside I/O EMU.
Based on these we can figure out how much to delay each request by before returning to the virtual machine. And so you can precisely configure, precisely control the time that the virtual machine thinks the request took to service. And so the illusion that we are trying to preserve is that the virtual machine is actually talking with the simulated disk and not the real disk. Yeah? >>: [Inaudible]? >> Diwaker Gupta: We are for some of the experiments, yeah. >>: [Inaudible]. >> Diwaker Gupta: Uh-huh. >>: Because the disks you still [inaudible] disks [inaudible]. >> Diwaker Gupta: So, so -- >>: [Inaudible] confident DiskSim is emulating a single machine to a single disk and how are you [inaudible] the [inaudible] -- >> Diwaker Gupta: So, so -- >>: Are you trying to emulate everything [inaudible] disk in the individual machine? >> Diwaker Gupta: So part of the reason why it works is that under time dilation each request actually has more realtime to finish. So even if there's some overhead in running multiple DiskSim processes and so on, because we have more time to finish, we can still do this kind of emulation without breaking fidelity. >>: So one thing that could happen [inaudible] is that you could have all of your multiplexed processes doing sequential I/O and you're starting out [inaudible] you might see the slowdown if the quanta are small, which we [inaudible] we might see what looks like random I/O to the real disk, and that can slow down much more than an order of magnitude. So do you actually validate that -- I mean DiskSim says I want this request to run at this time. Do you validate that you don't [inaudible]? >> Diwaker Gupta: So for the DiskSim evaluation that we've done we used some standard file system benchmarks like dbench and IOzone and whatever access patterns they use. We didn't specifically try a benchmark that only had random I/Os. But the -- >>: [Inaudible] misunderstanding. If you had 10 virtualized machines, 10 virtual machines, all of which were doing sequential I/O, but together [inaudible]. >> Diwaker Gupta: So we haven't done that exact experiment, but I will talk about experiments where we have a bunch of virtual machines doing a bunch of disk I/O. We're not sure that's exactly sequential, but for those experiments it hasn't been an issue. But I can imagine that you could -- >>: I have another question. If I were to have built this I would have had an assert running all the time, saying assert that time [inaudible] is after [inaudible]. >> Diwaker Gupta: Oh, oh. >>: [Inaudible]. >> Diwaker Gupta: So you're saying what if the amount of time we actually want to delay is, like, negative, right, that the disk has already taken longer. You're right. So -- >>: No, the actual disk has taken longer -- >> Diwaker Gupta: Yeah, yeah, yeah. Yes, yes. So in this system we actually have checks for that. And in the experiments we have, it doesn't happen. >>: [Inaudible] at all. >> Diwaker Gupta: No, it didn't happen often, so it happens, like, very, very rarely. I don't have the exact numbers for it, but we do -- but we do have checks for that. And we don't explicitly deal with it. And so one of the things we tried doing was to accumulate this negative time and incorporate it in a later request where we have enough buffer. So we played around with some of those, but in the experiments so far it hasn't mattered. All right. So the impact of all this is that the virtual machine thinks that it's interacting with the simulated disk, which we can choose to our own liking.
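A minimal sketch of the delay calculation just described (assumed logic, not the actual I/O EMU and DiskSim integration; the function name is illustrative): hold a completed request long enough that, in the guest's dilated timeframe, it appears to take the DiskSim-predicted service time.

```python
# Minimal sketch (assumed logic): extra real time to hold a disk reply.
def extra_hold_time_s(disksim_service_s, real_service_s, tdf):
    # The guest should perceive disksim_service_s; in real time that is
    # tdf * disksim_service_s. Subtract what the real disk already spent.
    delay = tdf * disksim_service_s - real_service_s
    return max(delay, 0.0)   # clamp the rare negative case discussed above

# DiskSim predicts 5 ms for the modeled disk; the real request took 2 ms; TDF 10.
print(extra_hold_time_s(0.005, 0.002, tdf=10))   # hold the reply ~0.048 s of real time
```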
So by doing this, we can preserve, you know, the disk I/O characteristics as well. So so far the approach has been that we're going to multiplex VMs for efficiency, we're going to use time dilation to scale up the capacity of each VM, and we're going to use independent knobs on the CPU, the disk and the network to make sure that each VM has resources exactly equal to the corresponding physical machine in the original system. And the claim after doing all this is that the scaled system almost looks like the original system. I say almost because we still don't deal with things like the main memory capacity. And I'll talk about it later. So how do we validate DieCast? Well, the same methodology as before. We set up a baseline system, and in the experiments that I'm going to talk about, the baseline is 40 physical machines running a single VM each, and for the DieCast scaled configurations we have four physical machines running 10 virtual machines each with a time dilation factor of 10, and we want to compare the performance of these two systems. And the questions we want to ask are, A, can we match application specific performance metrics, and, B, can we match low level system behavior like the CPU utilization profile as a function of time. So we validated DieCast on a variety of systems. In this talk I'm going to talk about one of the systems. This is a service called RUBiS. It's an eCommerce system modelled after eBay. So you have some web servers and database servers. In this experiment we are modelling two geographically distinct sites, so they're connected by a [inaudible] link. And half the servers in each site talk to the database servers on the other site. The nice thing about RUBiS is it comes with a configurable workload generator, so you tell the workload generator how many user sessions to simulate. Yes? >>: [Inaudible]. >> Diwaker Gupta: I'm just -- just something we came up with. There's no strong reason. >>: [Inaudible]. >> Diwaker Gupta: We have other -- for RUBiS, in the paper this is the only one we have. But we have different topologies for BitTorrent, we have different topologies for the third system that we tried out. >>: [Inaudible]. >> Diwaker Gupta: Yeah, no, no. I mean, we just wanted to make sure there is more -- you know, there is reasonable complexity, that, you know, communication is happening both within each site and across sites, so we are exercising local area links, wide area links and so on. So it's just something that we came up with. So we have one workload generator for each web server and we increase the workload in the system and see how it behaves. So the workload generator that RUBiS comes with reports a bunch of metrics at the end of the run. And these are, again, application specific metrics that the workload generator reports. We do nothing to them. And so the first thing we want to do is compare the metrics in the DieCast system versus the baseline. So on the X axis I have the total system load in terms of the number of simulated user sessions across all the machines. On the Y axis, I have the aggregate throughput of the system in terms of requests per minute. The solid red line is the performance of the baseline configuration and the dashed [inaudible] line is the performance of DieCast, and you can see that at least for this metric they almost perfectly overlap.
I'm also showing a third line here which is labeled as no DieCast, and this basically shows that if you were to just multiplex VMs without doing time dilation or any other kind of scaling, this is the performance you would get. And as you can see, if you're not using DieCast the system diverges significantly from the baseline, and in fact the difference is even more prominent if you look at response time. So on the Y axis here, I have the average response time taken across all the requests. And again, we match the baseline configuration pretty closely, but if you are not using DieCast, the response time degrades significantly because now the virtual machines are running out of resources. Yes? >>: Is the Y axis being measured in realtime versus DieCast time? >> Diwaker Gupta: No. No. So we just take the numbers that the workload generator reports to us. The workload generator is also running in this dilated timeframe. >>: [Inaudible] so this says that when you use a tenth of the machines with no DieCast you get somewhere near half the performance. >> Diwaker Gupta: Uh-huh. >>: So that suggests that you didn't really push this far enough, right, because I'd expect to see something like a tenth of -- >> Diwaker Gupta: It depends on the workload, right. Not always -- >>: It depends on the workload. I'm suggesting that you probably should have pushed the workload by a factor of five farther out, because you didn't really stress the difference between -- >> Diwaker Gupta: Probably -- I mean for this one, this is as far as we went. But there are other workloads where -- >>: Maybe I'm not making my point [inaudible] enough. You're telling me, look, these two lines coincide so we've done a good job preserving [inaudible] but that's because the machine -- one possible explanation is we didn't come anywhere close to [inaudible] the machines, and the point of having 40 machines is we're going to load them. And the machines are underloaded by a factor of five. >> Diwaker Gupta: So as I said, we have done the experiment where we do actually push the machines, just not in this experiment. So rephrasing your question, I think what you are saying is that for this application, if we're not using time dilation, we're not so terribly off from the [inaudible] line, right -- >>: I'm not saying that [inaudible] DieCast is bad, I'm saying the fact that the two lines agree tells us almost nothing, because we have not -- you're not measuring the interesting part of the graph. >>: Right. >>: Because there's a lot of [inaudible]. >> Diwaker Gupta: So I think I have the graphs, and so the -- but there are -- so we -- one of the problems we had was that the system was configurable only in certain aspects, so we actually wrote a third service -- let me -- so we wrote a service where you could configure the amount of computation, communication and I/O overhead on a per request basis, and there we basically try to push the system in each of the three different dimensions and see where it breaks down. And those graphs are in the paper. All right. So this was the application specific performance metrics. But what about the resource utilizations? So in this system we had three different types of machines, the web server, the database server and the workload generator.
And so what we did was we randomly picked one machine of each type and looked at the CPU utilization. On the X axis you have the time since the beginning of the experiment, on the Y axis you have the percentage CPU utilization, and we compared each physical machine's utilization with that of the corresponding virtual machine. The takeaway from this graph is that the utilization profiles are similar. I mean, obviously we can't match the instantaneous profiles exactly, but they display similar behavior. We do the same thing for memory. I just want to point out that even though we don't scale memory, in this experiment we had set things up such that the baseline machines had the same amount of memory that each of our virtual machines would have, so that we can actually compare the memory profiles. And we also looked at the network. So we looked at each hop in the topology and measured how much data was transferred on that hop; of course this is a fairly coarse-grained metric for this experiment. We sorted the hops by the amount of data that was transferred on them and then considered the hops in the same order in the DieCast topology, and again the amount of data transferred seems to match. So as I was saying earlier -- >>: [Inaudible] so from the fact that each of these graphs [inaudible] fidelity? >> Diwaker Grupta: Yes. For this experiment. >>: Except that memory was the thing where you didn't even try to preserve fidelity, so you've got -- I mean -- >> Diwaker Grupta: No, so fidelity here is in the sense that if your baseline had the same amount of memory as your virtual machine, does it get utilized in the same way? Right. That's all we are saying. We are not trying to address the fact here that the baseline could have had a lot more memory. That's another problem that I'll talk about later. >>: Because you downgraded the baseline, right? >> Diwaker Grupta: Yes. Yes. All right. So we have tested DieCast on, you know, BitTorrent and on this custom service where we exercise different dimensions of the system, and, you know, for example, when we have extremely high CPU load we do start to see some divergence from the baseline. But for the most part we can match both the application-specific performance metrics and the resource utilization profiles. But, you know, while these experiments were encouraging and interesting in their own right, we really wanted to see if we could use DieCast on a real-world system. Yeah? >>: So how important was it that you used DiskSim, this super accurate thing? Could you have just written a [inaudible] script that kind of, oh, yeah, [inaudible] -- how well would that have done? >> Diwaker Grupta: So we do have something like that for the paravirtualized case. There we don't have, you know, such a high fidelity handle on how each request is handled, so we basically modify the device driver to, you know, interpose some delay, and that works reasonably well. But we use DiskSim because we want to use the highest fidelity model that we have. Where again the argument is -- >>: I'm wondering where is the [inaudible]. >> Diwaker Grupta: Oh, so [inaudible]. >>: [Inaudible]. >> Diwaker Grupta: So it depends on the workload. You can get workloads where the model starts to show its limits. But we can talk about it later. All right. So we were really fortunate to be able to work with this company, Panasas. They build high performance file systems.
It's a company based out of Pittsburgh; Garth Gibson from CMU is involved in the company. And this is exactly the motivation that I was presenting earlier. They have this problem that they ship their file system to clients who have thousands of machines, but they don't have the infrastructure to test at that scale. And so we wanted to see if we could use DieCast to alleviate some of the testing problems. So their typical testing infrastructure looks something like this. They have a storage cluster which serves the file system, they have some clients that are generating the workload, and these are connected by some network. In order to run DieCast on this system, the clients are fairly easy to deal with because they just run regular Linux, and so we could use our current DieCast implementation to scale the clients. But the storage cluster was more of a problem, because they run their own custom operating system on the storage cluster and it's tightly integrated with their hardware, and the upshot of all that is that it's not virtualizable. So we weren't able to use our Xen-based implementation for the storage cluster; we had to do a direct implementation inside their operating system. Excuse me. And while it was a great learning experience, it would have been much easier if the system had been virtualizable, and then we wouldn't have had to do much. For the network scaling we use a standard traffic shaper called dummynet that their system ships with. All right. So the first thing we did was, again, validation: we set up a storage cluster with 10 clients generating the workload. For the DieCast scaled system we set up a storage cluster with only 10 percent of the resources, and then we have a single physical machine running 10 virtual machines generating the workload. >>: This whole [inaudible]. >> Diwaker Grupta: I don't recall what protocol their system uses. I don't remember. And then we picked two standard benchmarks from their regular test suite, IOzone and MPI-IO, and for each of these benchmarks we ran, you know, their test suite for varying block sizes and we looked at the metrics that they normally look at, which is I think the read and write throughputs. And for each of these benchmarks we were able to match the performance metrics. But the more interesting part from our perspective is that we were able to use, you know, a hundred machines in their infrastructure to scale up to 1,000 clients, and before DieCast Panasas had no way to test things at that scale because they just wouldn't have 1,000 machines to work with. >>: So [inaudible] usually have only one server at all these sites? >> Diwaker Grupta: Oh, so the storage cluster, I mean the server, has many servers internally, so it actually has, you know, I think 10 or 11 blades running inside, and then they scale it as the demand increases. >>: And they couldn't scale to a thousand clients by [inaudible] workload generators and [inaudible]. >> Diwaker Grupta: No, that's what I'm saying. They didn't have a thousand machines to run the workload on. >>: [Inaudible] machines just to generate your cluster network, right? >> Diwaker Grupta: So the way their system is set up, they did require, for each workload -- >>: [Inaudible] block requests. >> Diwaker Grupta: No, that's fine. But I'm just saying that -- >>: [Inaudible]. [brief talking over]. >>: [Inaudible]. >> Diwaker Grupta: Excuse me? >>: [Inaudible]. >> Diwaker Grupta: So we do scale the server down before testing it for the validation.
And for this one, we do scale the server up, if that's your question. >>: All right. [Inaudible]. >> Diwaker Grupta: Okay. I'm going to quickly try to talk about some of the interesting aspects of Difference Engine. So memory was a big issue here in trying to create more and more virtual machines on the same physical machine. And this is important, you know, for server consolidation as well, but even otherwise, if you could do the same number of things with less memory, you have an incentive to do so, because memory is expensive, it's difficult to upgrade and provision a system with more memory, and memory consumes a lot of power. And as we move towards multicore systems, the primary bottleneck in creating more and more VMs on the same physical machine is going to be memory, and so we wanted to consider this problem. Go ahead. >>: Well, is memory capacity not scaling the same way transistor capacity is for multicore? >> Diwaker Grupta: It might be, but, for example, you have a limited number of slots in your motherboard, and while the CPU can be multiplexed among several of the VMs, memory can't be multiplexed that way, so it is simply a fact that it's a static resource and not a time-shared resource. >>: Okay. So [inaudible]. >> Diwaker Grupta: Yes. Yes. >>: Okay. >> Diwaker Grupta: And actually there are, you know, more compelling reasons here, because if you can, for example, have the same number of VMs consuming less memory, you can selectively turn off, you know, one particular DIMM or something, and so you can save more power that way. >>: Okay. >> Diwaker Grupta: So there's incentive there as well. The state of the art in virtual machine memory management is what is called content-based page sharing. This is done by VMware ESX Server and potentially other systems, and the idea is that you're going to walk through memory across all the virtual machines, identify all the pages that are exactly the same, and then for those pages you can do copy-on-write sharing, which gets you some savings. But the key premise of our work is that there is significant potential for savings beyond whole pages: if you look at subpage granularity, you can actually extract a lot more savings, which is what we wanted to do in this work. So just to highlight some of the mechanisms we use, here's an example. We have two virtual machines with a total of five pages that are initially backed by five physical pages. Now, two of these pages are exactly identical, and two of the pages are similar but not quite identical, so there are small differences. So the first thing Difference Engine does is it shares the identical pages, exactly like, you know, how VMware and other systems do, and that saves us one physical page. The next thing we do is we identify pages that are similar but not quite identical, and for these similar pages we can store the page as a diff, or a patch, against a base page. Now, the interesting thing to note here is that patched pages have to be reconstructed even on a read access, which is not the case for copy-on-write pages, where reads are essentially free. The third thing we do is we identify pages that are not being accessed frequently -- so let's say the blue page in this example is not being accessed frequently -- and we're going to compress this page and store it compressed in memory. As before, compressed pages have to be, you know, uncompressed on each access. And so by using these three mechanisms we can save a lot more memory than the current state of the art.
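A rough sketch of the three mechanisms just described: whole-page sharing, patching of nearly-identical pages, and compression of rarely-accessed pages. This is not the actual Difference Engine implementation, which runs inside the hypervisor with its own similarity hashing, patch format and structure-aware compressor; the chunk-hash similarity key, difflib-based patches and zlib compression below are simplified stand-ins chosen for readability.

```python
import difflib
import hashlib
import zlib

PAGE_SIZE = 4096

def page_hash(page: bytes) -> str:
    """Hash of the full page, used to find exactly identical pages."""
    return hashlib.sha1(page).hexdigest()

def similarity_key(page: bytes) -> str:
    """Crude stand-in for similarity detection: hash a few fixed chunks,
    so pages that differ only in small regions often collide."""
    chunks = b"".join(page[i:i + 64] for i in (0, 1024, 2048, 3072))
    return hashlib.sha1(chunks).hexdigest()

def classify_pages(pages):
    """pages: list of (page_bytes, is_cold). Returns one record per page:
    shared and patched pages free their backing frame; cold pages are
    compressed; everything else stays resident."""
    by_hash, by_sim, out = {}, {}, []
    for idx, (page, is_cold) in enumerate(pages):
        h = page_hash(page)
        if h in by_hash:                    # identical page: share copy-on-write
            out.append(("shared", by_hash[h]))
            continue
        by_hash[h] = idx

        s = similarity_key(page)
        if s in by_sim:                     # similar page: store a patch vs. base
            base = pages[by_sim[s]][0]      # note: patched pages must be
            ops = difflib.SequenceMatcher(None, base, page).get_opcodes()
            out.append(("patched", by_sim[s], ops))   # rebuilt even on reads
            continue
        by_sim[s] = idx

        if is_cold:                         # rarely accessed: compress in place
            out.append(("compressed", zlib.compress(page)))
        else:
            out.append(("resident", page))
    return out
```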
And there are several engineering challenges involved in making the system work efficiently. In particular, how do you pick which pages to patch, which pages to compress, which pages to share? How do you identify similar pages efficiently, while also accommodating the fact that page contents might be evolving over time? There is an issue with Xen in particular that the size of the heap available to the hypervisor is fairly small, and so we have to be careful with the data structures we use and the bookkeeping we do. And finally, the whole point of doing this memory saving is that you want to create additional VMs, and so at some point you might just run out of physical memory, so we need some kind of memory overcommitment support for demand paging things out to disk, and we had to build this as well for Xen. I'd love to talk about the implementation later on; for now I just want to summarize the results of our evaluation. We demonstrate significant savings not just for homogeneous workloads, where every VM is running the same operating system and the same applications; we also demonstrate significant savings for heterogeneous workloads, where you have different operating systems and applications. We can extract up to twice the savings of VMware ESX Server in head-to-head comparisons, and in all cases we see less than seven percent degradation in application performance. And we actually show that we can use the additional virtual machines to increase the aggregate system performance. Now, let me quickly wrap up here with some related work. The literature is rich in all these areas, so I'm just going to highlight a few things. As I mentioned before, simulators such as NS 2 are great for extrapolating beyond, you know, your hardware capacity, but at the cost of realism. On the other hand, emulators such as ModelNet give you realism, but you are fundamentally limited by the capacity of the underlying network. Time dilation sort of allows you to get the best of both worlds. With respect to testing large scale systems, the SHRiNK work in INFOCOM 2003, I think, is closest in spirit to DieCast, and their scaling hypothesis is that under certain assumptions you can take a sample of the input traffic and use it to extrapolate the behavior of a bigger network. But SHRiNK only captures the network portion of the system, whereas DieCast is able to capture end-to-end characteristics of a distributed system. You could also use testbeds such as PlanetLab or Emulab to do large scale testing, but again you're still fundamentally limited by the number of machines that you have in the system. And finally, in terms of memory management for virtual machines, ESX Server does a great first step by using content-based page sharing, but we demonstrate that there's actually a lot more potential for savings if you look at subpage granularities. And we leverage work in compression algorithms for main memory: when we compress pages we don't use general purpose compression algorithms like Lempel-Ziv; there are specialized compression algorithms that exploit the fact that memory has some structure to it -- typically you'd have numbers stored in memory at, for example, four-byte granularity -- and they exploit that kind of structure to do better compression, and we leverage previous work in this area. So just to briefly mention some future work: a current limitation is that DieCast doesn't scale low-level subsystems.
So for example, under time dilation, things like the PCI bus or the memory bus will still appear faster. Now, for most applications such low-level subsystems don't make a difference, but if you have an application that is extremely sensitive to memory access latency, and in particular changes its behavior when that latency appears to decrease, then this will be a problem. Interposing on each memory access in software is prohibitively expensive, but the hope is that, you know, with better hardware support for virtualization this is something that we can deal with moving forward. I've also done work on infrastructure management of virtual machines, and the initial placement and subsequent migration of virtual machines is a topic that I'm interested in. In particular I'm looking at building [inaudible] for placement and migration without the need for expensive instrumentation and sensors in the hardware, and also, for a system like Difference Engine, we want to be able to place virtual machines that are similar in the contents of their memory on the same physical machine, so I'm looking into this as well. The challenge here is to come up with compact representations of the contents of virtual machines' memory and to compare these representations efficiently, and so we're building some algorithms based on min-wise hashing to do this. And finally, another way that we might address memory capacity under time dilation is to use fast secondary storage devices like SSDs to supplement main memory. Under a high enough time dilation factor, really fast SSDs can actually appear as fast as main memory, and this is something that I'm just starting to look at right now. So in conclusion, the general theme of my research has been, as I said before, to make virtualization more scalable, both for supporting existing applications like server consolidation and in terms of enabling new classes of applications by pushing beyond the limits of the hardware. This kind of scalable multiplexing poses several challenges, and I've addressed some of them in my work, but it also opens up several opportunities, and I gave you two examples of the kinds of problems we can use virtual machines for. The source code for all of these projects is available online, linked from my web page. And thank you for listening. I'm happy to take questions at this point. [applause]. >>: So I was wondering if you have specific examples of distributed systems where you think this isn't necessarily the best approach. Because what if I propose that maybe this system is good for systems which are mainly TCP based, fairly well delineated between uploaders and downloaders, right, and so you've shown that you can preserve certain TCP semantics, for instance. But I mean, in what type of scenario would this just blow up? So, you know, what about BFT, what about like an overlay or something like that? I mean, I'm wondering to what extent, you know, there's a little bit of overhead or fuzz in terms of scheduling, things like that. What you've shown is that, okay, for some TCP scenarios that's fine. But under what timing or synchronization conditions do you think that would fall apart? >> Diwaker Grupta: Oh, I think this is certainly one of the biggest concerns that we have. The workloads that we have looked at so far have been okay, but we haven't really stressed the system in terms of, for example, every VM running a database intensive benchmark like TPC-C or TPC-W or something.
And I think that for those kinds of workloads we will start to see some deviation from the baseline. So I think disk is sort of the thing -- I'm not that concerned about the network, because I think the network is much easier to scale than the disk is. You know, because again, realize that under time dilation you have more time to finish an operation, and with virtual machines in particular, and with Xen even more so, the overhead of multiplexing things when the virtual machines are doing disk I/O is much higher than when they're doing network I/O. And so I think that is where the system will start to break down. >>: I'm just saying that just emulating, you know, the network may not be enough if there are complicated timing things going on. So you can accurately model, say, a bunch of TCP [inaudible], but if, you know, some of these scheduling issues that came up become more important, then that doesn't make as much of a difference, right? So like if host availability fluctuates a lot or there's some type of complex message-passing protocol, then it seems like there might be [inaudible] that you haven't explored yet. >> Diwaker Grupta: So we've looked at, you know, peer-to-peer networks like BitTorrent, and we've set up our own topologies, and for those it seemed to have worked. I really don't know what kind of network configuration would lead to a scenario where this wouldn't work. >>: [Inaudible] VM schedule with a [inaudible] VM there's [inaudible], and to make the time dilation accurate you want to make sure [inaudible] doesn't run so long, right? Then that means [inaudible] gets bigger. And then if every VM is running a really CPU intensive application, then after time dilation you won't get 10 times what a single machine would get, because there's no [inaudible]. >> Diwaker Grupta: So that's a great point. And one of the nice things about time dilation is that you can tweak it until you get the right results, right? So for example, if you were seeing that the overhead is high enough that you are not able to cope with it at the current time dilation factor, you can increase the time dilation factor so that you have more real time to accomplish the same amount of work. And so for example, if you were running at TDF 10 and you were seeing that you're not getting the expected amount of CPU inside the virtual machine, you can increase the time dilation factor [inaudible] to 20 and restrict the amount of CPU that the virtual machine gets, such that after time dilation it will still get a maximum of, you know, whatever the CPU in the original system was, but now you have more time to actually finish those operations. So we can accommodate some of the overheads in that way. >> John Douceur: Anyone else? All right. Let's thank the speaker again. >> Diwaker Grupta: Thank you. [applause]
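A small worked sketch of the adjustment described in that last answer: if hypervisor multiplexing overhead keeps each guest from perceiving its full target CPU at one time dilation factor, the factor can be raised while the guest's CPU cap is lowered, so the perceived capacity stays at the target but more real time is available to absorb the overhead. The 15 percent overhead figure and the linear model are illustrative assumptions, not measurements from the talk.

```python
# Illustrative arithmetic only: a fixed overhead fraction and a simple
# linear model of how time dilation scales perceived CPU capacity.

def perceived_cpu_ghz(host_ghz, n_vms, tdf, overhead_frac, target_ghz):
    usable = host_ghz * (1.0 - overhead_frac)   # CPU left after VMM overhead
    real_share = usable / n_vms                 # real CPU each guest receives
    perceived = real_share * tdf                # dilation multiplies perceived speed
    return min(perceived, target_ghz)           # cap so a guest never sees *more*
                                                # than the original machine had

HOST, N_VMS, TARGET = 3.0, 10, 3.0   # 3 GHz host, 10 guests, 3 GHz target each

for tdf in (10, 20):
    got = perceived_cpu_ghz(HOST, N_VMS, tdf, overhead_frac=0.15, target_ghz=TARGET)
    print(f"TDF {tdf:2d}: each guest perceives about {got:.2f} GHz "
          f"(target {TARGET} GHz)")
# TDF 10 falls short of the target because of the 15% overhead;
# TDF 20 (with the cap) leaves headroom to absorb it.
```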