>> John Douceur: So good morning, all. It's my pleasure to introduce
Diwaker Grupta. He's a student of Amin Vahdat at UCSD. He'll be
graduating real soon now and will be looking for a job. He's done work
in distributed systems, network emulation, scaling and management of
virtual machines and virtualized system infrastructure. He's got
publications in NSDI and OSDI, including the most recent NSDI and the
upcoming OSDI, and he's going to be telling us about some of that work
this morning.
>> Diwaker Grupta: Thanks a lot, John. Good morning, everyone. Thank
you all for coming. It's a pleasure to be here. So I want to start by
describing what I think are two compelling problems, and in this talk
I'll show you how to address them using virtual machines.
So the first one is the problem of protocol evaluation, so we were just
talking about TCP, and one of the problems with TCP is that on
extremely high bandwidth networks it doesn't perform very well. Now, in
the literature several variants and enhancements to TCP have been
proposed to address this problem.
So let's say you wanted to test two different variants of TCP on let's
say a hundred gigabit per second wire link to see which one does
better. How do you go about doing this? Now, traditionally you could
approach this problem in three different ways. You could try to get
access to a real world hundred gigabit per second link if such a link
exists, but even if such a link exists it probably is not going to be
easily accessible to everyone out there. You could try to do a
complete software simulation such as ns-2, but then you lose realism
because you're not using real operating systems or real unmodified
applications.
And finally you could use an emulation environment such as ModelNet.
So a network emulator allows you to experiment with arbitrary network
topologies with unmodified applications and standard operating system
networking stacks.
But even in a network emulation environment you are fundamentally
limited by the bandwidth of the underlying network. So in this talk,
I'm going to show you how to do such an experiment while preserving
realism right in the comfort of your lab.
So here's another problem. How do you test large systems? Here's the
quote from Werner Vogels, who is the CTO of Amazon.com, and what he's
pointing out is that in order to test a large system like Amazon
ideally you want to create a replica of the entire system and test on
that replica.
But what if you don't have the capacity to duplicate all the
infrastructure? So typically we resort to doing small scale tests on a
small number of machines and extrapolate from there. But the same
problem exists in other domains as well. You don't have to be as large
as Amazon. Let's say you're a company that makes a high performance
file system and you are shipping this file system to clients who will
deploy it in various kinds of deployment scenarios, so they might have
different hardware, software configurations and so on.
So ideally you want to make sure that the system is tested in all these
possible deployment scenarios. Unfortunately you have a limited
testing infrastructure at your disposal. And furthermore, this testing
infrastructure might be shared internally among several different teams
who are doing development, so each individual team gets even fewer
machines to work with.
And in particular it might be impossible to test the system at the scale
at which it will be deployed using the limited number of machines
you have. So in this talk I'm going to show you how you can use a
small infrastructure to accurately test and replicate a much
larger system, an order of magnitude larger system in fact.
So these are sort of the two problems that I'll be talking about in
this talk. But where do virtual machines fit into the picture? Before
I start talking about that, I just want to clarify the terminology that
I'm going to use in this talk because virtual machines mean different
things to different people. So on a regular physical machine, we have
various resources such as the CPU, the disk, the memory and so on. And
we have the operating system which is multiplexing these resources
amongst the different applications.
In the virtual machine environment instead of the operating system, we
have a thin layer called the virtual machine monitor or the hypervisor
which is responsible for managing the system resources among several
different virtual machines. A virtual machine is basically a software
abstraction of a physical machine and each virtual machine can have its
own operating system and application stack. We call the operating
system that runs within the virtual machine the guest operating system,
and I'm also going to use the term domain to denote the virtual machine
in this talk.
The virtual machine monitor exposes a hardware-like interface to the
virtual machine, so the guest operating system thinks that it's actually
running on real hardware. So that's sort of the model that we're
working on. The concept of virtualization itself is quite old, going
back almost 50 years now, but in the last couple of years interest in
virtualization has picked up significantly. In fact, a recent IDC report
estimates that virtualization was already a 5.5 billion dollar industry
in 2006 and it's projected to grow strongly in the near future.
So it looks like this market is growing rapidly. But what are some of
the things that are driving the adoption of virtualization? What are
people using virtualization for? So here's a recent survey done by
cio.com where they asked people what they are using virtualization for
in production IT environments. And by far the most popular use is for
server consolidation. So by putting services on fewer physical
machines you can cut down costs and increase utilization. So
that's one of the biggest driving factors.
But the next couple of reasons I've clumped together and called
novel applications. And these are I think the more compelling
reasons that will drive virtualization moving forward and we'll talk
briefly about both of these in the next couple of slides.
So of course in the past couple of years server consolidation has
become increasingly important. Over the past decade or so the
infrastructure costs of a data center have remained largely stable, but
the administrative costs and the power and cooling costs have been
increasing at an alarming rate, and so by consolidating their
infrastructure, organizations can cut down both these expenses. And in
fact [inaudible] reports that typical server utilization in data
centers is in the five to 15 percent range.
And using virtualization they can bring it up to 60 or 80 percent which
translates to a tremendous dollar value savings for organizations.
But as I said earlier, I think virtualization is becoming increasingly
more compelling for the new and unique applications that it enables.
For example because the virtual machine monitor exports different
hardware interfaces you can actually support legacy software, legacy
hardware even if the real hardware doesn't exist anymore. Go ahead.
>>: Does server consolidation address the biggest costs that you were
talking about?
>> Diwaker Grupta:
A little bit.
>>: You still have to, you know, manage each of the virtual machines
as much as [inaudible].
>> Diwaker Grupta: Absolutely. So there is some maintenance cost that
is, you know, constant overhead per physical machine and there is some
maintenance cost that is constant per virtual machine. But the real
story is that, you know, with these virtual machines you get management
tools to manage this virtualized infrastructure, and things like keeping
the virtual machines updated and patching them become easier if you are
operating in this virtual machine framework than if you were working with
the physical machines themselves.
So while there is still some cost associated with managing virtual
machines, the overall cost decreases because the dominant cost is, you
know, the per physical machine cost that you have to manage in the
infrastructure. We can talk about that more later on.
There are several other interesting applications: in security, for
example, you can do intrusion detection or full system logging and
replay. Virtualization is extremely useful for development and testing,
so if you have -- if you're testing a patch for Internet Explorer,
instead of testing on several different versions of Windows and IE
and different combinations, you can quickly pre-create virtual
machine images with all these different software combinations and use a
single machine for testing.
And finally, virtual machines can be created on demand and migrated
from one physical machine to another, so this flexibility in how you
provision resources allows you to build more agile infrastructures;
companies like Amazon and [inaudible] are using virtualization to
build out their cloud computing infrastructures.
The work that I've been doing the past couple of years has been mostly
around virtualization and spans both of these domains. The
overarching theme in my research has been to allow you to do more with
virtual machines by making them more scalable. So for example by
allowing more aggressive server consolidation or by addressing some of
the applications that I mentioned earlier.
So in this talk I'll talk about mostly novel applications that
virtualization enables, two examples of which I gave earlier. I've
also done a bunch of work in server consolidation. So for example by
improving the way virtual machine monitors manage memory, you can
actually increase the number of virtual machines
that can be supported on a single physical machine, allowing for more
aggressive server consolidation. I have worked on performance
isolation where you are trying to make sure that a virtual machine
doesn't negatively impact the performance of other virtual machines
that are running on the same physical machine. And finally I've worked
on infrastructure management and a sensible framework for managing
virtual machines and the resources allocated to them. And so as I said
in this talk, I'll primarily focus on the novel applications and
briefly talk about the memory management work but I'm more than happy
to discuss the rest of the project offline.
So with this background and motivation here is the outline for the rest
of my talk. I'll first present a technique called time dilation that
allows us to go beyond the capacity of the underlying hardware and show
its application in protocol evaluation. I'll then present DieCast
which is a framework that leverages time dilation to do large scale
testing using a much smaller infrastructure.
Next I'll briefly discuss a system called the Difference Engine that
allows for more efficient memory management in virtual machines, and I'll
then touch upon some related work before concluding.
Before I dive into details, I want to take a moment to quickly
introduce Xen. So while most of the ideas that I'll present here are
applicable to virtual machine monitors in general, the implementation
is largely based on Xen, which is an open source virtual machine monitor
out of the University of Cambridge. When Xen boots, it starts an
initial virtual machine called domain zero; domain zero is sort of the
control entry point for the system where you can create and manage
other virtual machines. There are two different kinds of virtual
machines that are supported in Xen.
Paravirtualized virtual machines require modifications to the guest
operating system to make them virtualization aware, allowing for some
performance optimizations. But the flip side is that you can't support
all operating systems; in particular, operating systems whose source
code is not available cannot be supported.
Xen also supports fully virtualized VMs which allow running unmodified
guest operating systems, so you could run Windows, Solaris, whatever
have you, but they require hardware support such as the Intel VT
processors.
All right. So let's talk about time dilation. Consider this physical
machine here with the given configuration, and this physical machine
potentially hosts several virtual machines. An obvious consequence of
doing this kind of multiplexing is that these resources will get split
up among these virtual machines so each virtual machine individually
has access to only a fraction of the underlying resources.
So in particular in this example, let's say you create five identical
VMs, and these VMs have equal access to all the resources and they have
equal workloads as well; then each virtual machine roughly sees the
equivalent of a fifth of the resources. So it will see a fifth of the
CPU, it will see one fifth of the network connectivity and it will have one
fifth of the main memory on the machine.
The goal with time dilation is to somehow increase the amount of
resources a virtual machine thinks it has. And we'll see how it can be
useful as we go along. So here's the idea. What we do is we take time
and trade it off for other resources in the system. So consider this
example. Over a period of one second, let's say the operating system
receives 10 megabits of data. If you then ask the operating system
what is the bandwidth that it perceived, it's going to say 10 megabits
per second. In time dilation we're going to slow down the passage of
time inside the operating system.
So let's say we repeat the same experiment, but this time we convince
the operating system that instead of one second only a hundred
milliseconds have passed. But it still sees the same amount of data.
So if you now ask the operating system what is the bandwidth that it
perceives, it's going to say a hundred megabits per second. So it thinks
that it has a faster network than it actually does. And so by
slowing down the passage of time inside the operating system, we can
make all these time based resources such as the CPU, the network
bandwidth and the disk bandwidth all appear faster.
Of course now your experiments will take much longer to complete
because for each second of realtime only 100 milliseconds are passing
inside the operating system's timeframe. So we call this ratio of the
realtime to virtual time, the time dilation factor which in this case
is 10.
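As a rough illustration of this bookkeeping, here is a minimal sketch in Python, not taken from the talk, of how a dilated guest's perceived bandwidth follows from the time dilation factor:

```python
# Minimal sketch of the time dilation arithmetic (illustrative only, not the
# actual Xen implementation).  TDF = real time / virtual time seen by the guest.

def perceived_bandwidth(bits_received: float, real_seconds: float, tdf: float) -> float:
    """Bandwidth the dilated guest infers, in bits per second."""
    virtual_seconds = real_seconds / tdf      # the guest sees less time pass
    return bits_received / virtual_seconds    # same data over less time => faster

# The example from the talk: 10 megabits arriving over 1 real second.
print(perceived_bandwidth(10e6, 1.0, tdf=1))    # 10 Mbps without dilation
print(perceived_bandwidth(10e6, 1.0, tdf=10))   # 100 Mbps perceived at TDF 10
```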
So let's revisit our multiplexing example with time dilation. So
earlier we saw that if we have five identical VMs, each of them roughly
sees a fifth of the underlying resources. Now if you take each of
these virtual machines and run them under a time dilation factor of
five, all of these VMs will roughly see five times the resources that
they had. But note that time dilation only impacts time based
resources, temporal resources. It doesn't impact static resources such
as main memory capacity or secondary storage capacity, so each of the
virtual machines still sees 400 megabytes of RAM. And I will come to
this point later again.
But note that there's really nothing special about the number five
here. We could have picked a different time dilation factor and the
perceived resource capacity would change accordingly. So if you pick
the time dilation factor of four, the VMs would perceive a different
resource capacity.
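To make the distinction between temporal and static resources concrete, here is a small sketch of my own; the host's CPU and network numbers are placeholders I picked, while the 2 GB of RAM follows from the 400 megabytes per VM mentioned above.

```python
# Sketch: perceived capacity of each of n identical VMs on one physical host.
# Temporal resources (CPU cycles/sec, network bits/sec) are multiplied back up
# by the TDF; static resources (RAM, disk capacity) are only divided.

def perceived_share(total: float, n_vms: int, tdf: float, temporal: bool) -> float:
    share = total / n_vms                     # fair split across the VMs
    return share * tdf if temporal else share

# Five VMs at TDF 5 on a hypothetical 2 GHz / 1 Gbps / 2 GB host.
print(perceived_share(2e9, 5, 5, temporal=True))     # CPU: looks like the full 2 GHz
print(perceived_share(1e9, 5, 5, temporal=True))     # network: looks like 1 Gbps
print(perceived_share(2048, 5, 5, temporal=False))   # RAM in MB: still only ~400 MB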
Why did we have to create five virtual machines? There's no strong
reason to create five virtual machines [inaudible] and have a single
virtual machine on this physical machine. And so let's say you have
some physical machine with a gigabit per second network and you create
a single virtual machine on this physical machine, so originally this
virtual machine has access to the entire gigabit per second of bandwidth
that's available on this machine. Now, if we run this VM under a
time dilation factor of 10, we can actually make it believe that it has
access to a 10 gigabit network. And this is precisely the kind of
thing that we need for the scalable network evaluation that I gave
an example of earlier in the talk.
Are there any questions at this point? All right. So how do we go
about implementing time dilation? The general strategy for
implementation is that you have an operating system and that operating
system relies on a variety of time sources to infer some value of time.
So the most common one is the timer interrupt that the operating system is
receiving, but there can be several other time sources as well, right?
You could have specialized counters such as the TSC on x86
platforms. There could be other hardware timers such as the high
precision event timer or the programmable interval timer. And there might be
some information maintained in the BIOS as well.
And so the challenge here is to basically figure out all the different
time sources that an operating system might use. So the operating
system takes all these time sources and infers some value of time. So
the key point is that the operating system doesn't have any notion of
absolute time or realtime.
And so to implement time dilation what we need to do is we need to
interpose between all these different time sources and scale them
appropriately by the given time dilation factor. So it might involve
reducing the frequency of the timer interrupts or manipulating the
value of the TSC that's visible to the operating system. And if we can
interpose between all the different time sources, the operating system
will automatically perceive a different value of time.
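As a very rough model of that interposition (my own sketch; the real changes live in the hypervisor's time handling, and all the names here are made up), the TSC values and timer interrupt rate exposed to the guest are simply divided by the dilation factor:

```python
# Illustrative model of interposing on a guest's time sources (not Xen code).

class DilatedTimeSource:
    def __init__(self, tdf: int, host_timer_hz: int = 1000):
        self.tdf = tdf
        self.host_timer_hz = host_timer_hz
        self._tsc_base = None

    def guest_tsc(self, host_tsc: int) -> int:
        """TSC exposed to the guest: it advances 1/tdf as fast as the host's."""
        if self._tsc_base is None:
            self._tsc_base = host_tsc
        return self._tsc_base + (host_tsc - self._tsc_base) // self.tdf

    def guest_timer_hz(self) -> float:
        """Deliver timer interrupts to the guest at 1/tdf of the host rate."""
        return self.host_timer_hz / self.tdf
```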
Now, the challenge here is that the implementation should be both
transparent and pervasive. And by transparent, what I mean is that we
shouldn't have to modify the operating system or the applications to
support time dilation because we want to support unmodified operating
systems. The implementation should be pervasive in the sense that the
operating system and the applications should
have no way of figuring out that they're running in this dilated
timeframe; if realtime somehow leaks into the system, then we will
see different behavior than expected. Yes?
>>: When I'm running in a virtual machine, I don't -- that's getting a
10th of the CPU, I don't see my CPU clock going 10 times slower, I
guess sudden bursts of it. So if [inaudible] simple way this could
become unless you're scaling based on the number of CPU cycles I'm
getting this isn't going to be transparent, right?
>> Diwaker Grupta: So you raise a great point. Transparency,
absolutely transparency is hard to achieve. What we are trying to say
here is that depending on the application that you're running, it might
not matter if the exact distribution of CPU cycles is preserved or not,
right? And so as we'll see later for most applications it doesn't
really make a difference if the number of -- all we are saying is that,
you know, in one second -- so let me rephrase it.
So if you did some CPU intensive task in a regular operating system and
you timed it, you will measure it took some amount of time. All we are
saying is that under time dilation, if you repeated the same experiment
using a time dilation factor of 10,
it would take a 10th of the time simply because that's how you're
measuring. Does that make sense? So in that sense the CPU appears
faster. Yes?
>>: I think we're very curious about how you deal with first the
effects of the virtual machine and [inaudible].
>> Diwaker Grupta:
I'll come to some of it later as I go along.
All right. So the implementation that we have for Xen, you know, does
roughly the same things that I just described. We scale all of the
different time sources by the given time dilation factor. In fact,
we allow each virtual machine to be run with a different time dilation
factor, which is extremely useful. Our implementation supports both
paravirtualized and fully virtualized VMs.
>>:
[Inaudible] how do you get [inaudible]?
>> Diwaker Grupta: For paravirtualized -- I mean, for paravirtualized
VMs the argument is, because paravirtualized operating
systems anyway require modification, for our paravirtualized
implementation we actually just modify the operating system itself. So
if you're going through the operating system to access the TSC, then
we will trap it and return the right value. But in paravirtualized
operating systems, the applications might still be able to access the
TSC directly by executing some instructions on the hardware.
So you're right, for those applications you have to do some kind of
trap and emulate, so you have to watch the instruction stream that's
emanating from the VMs sort of right there. But right now we don't
handle that case.
>>:
[Inaudible].
>> Diwaker Grupta: No. Any other questions? Okay. I do want to
mention here that you could implement time dilation directly inside
the operating system without having to go through this virtual machine
interface, but implementing inside the virtual machine monitor makes
this kind of scaling very easy because you already have interfaces for
exposing the different time sources through the virtual machine
monitor. So this makes it much easier.
But the real issue, as some of you have brought up, is: do we
believe time dilation, in particular for the bursty kind of applications,
and what is the impact of multiplexing? So how do we test if time
dilation really works? Our general methodology for validation is that
we're going to pick some baseline that we can currently attain and
validate, we're going to -- we're then going to scale this baseline
down. And this scaledown can be either in the form of picking a less
capable hardware or artificially restraining the resources available to
a physical system.
So once we have the scaledown system, we are going to use dilation to
scale it back up to get a perceived configuration. And then the goal
is to compare the performance of the baseline and the perceived
configuration. And if the amount of resources in the perceived
configuration are similar to the amount of resources in the baseline,
then the expectation is that they should behave similarly.
So the invariant we want to maintain is that of resource
equivalence. So basically what I mean by that is that after time
dilation the perceived configuration should have the same resource
capacity as the baseline system. So how do we get the scaled
configuration? Let's start with a simple example here. Here I have a
single link which is running at 10 megabits per second and 20
milliseconds of roundtrip time, and these end points might be running
under time dilation. And the invariant that we want to preserve in
this case is that as perceived by these end points the link should have
the same network characteristics all the time.
So in the baseline system which is equivalent to having a time dilation
factor of one, the real and the perceived configuration are exactly
identical. Now, if you were to run the end points at a time dilation
factor of 10 without doing anything else the end points are going to
perceive a much shorter latency on the link, in particular two
milliseconds, and a correspondingly faster bandwidth, so the link will
appear to be a 100 megabit per second, two millisecond link. So
what we need to do is we need to artificially scale down the capacity
of the link and make it slower so that after time dilation the
perceived configuration matches the real configuration. And in
particular we want to manipulate the link so that it has an actual
bandwidth of one megabit per second and a latency of 200 milliseconds.
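The scaledown arithmetic, sketched here in Python purely as an illustration (it matches the numbers in the example above), just applies the resource equivalence invariant to the link:

```python
# Sketch: configure the *real* emulated link so the *perceived* link matches
# the baseline when the end points run at the given time dilation factor.

def real_link_config(perceived_mbps: float, perceived_rtt_ms: float, tdf: float):
    """Return (bandwidth_mbps, rtt_ms) to actually configure on the link."""
    return perceived_mbps / tdf, perceived_rtt_ms * tdf

# The example above: a 10 Mbps / 20 ms link with end points at TDF 10
# must actually be emulated as a 1 Mbps / 200 ms link.
print(real_link_config(10.0, 20.0, 10))   # (1.0, 200.0)
```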
Now, we can easily do this kind of manipulation of the link
configuration by using a traffic shaping tool at the end point or using
a network emulation environment to manipulate the characteristics. So
this also goes back to the pervasiveness argument that I was talking
about earlier to preserve this illusion of time dilation we want to
make sure that all the components involved in this system on the end
points, each of the links, any appliances you might have in between,
routers and so on, they should all be scaled uniformly, otherwise we
will get unexpected behavior.
So next we come to the actual experiments that we do for validation.
We have validated time dilation for a wide variety of configurations,
but it's really easy to match coarse grain metrics like aggregate
throughput over some period of time. And what we really want to see is
how does time dilation behave on very, very fine granularities.
So let's start with a simple experiment here. We have a single TCP
flow, and we inject one percent deterministic losses in this TCP flow.
It's going across a 100 megabit per second link with a roundtrip time of
20 milliseconds, and we're looking at the first second of the trace. On the X
axis you have time since the beginning of the experiment, on the Y axis
you have the sequence numbers. Basically this is showing a packet
sequence diagram of the first second of the trace. What we want to do is
we want to repeat this experiment under different time dilation
factors, making sure that the perceived link characteristics remain the
same and then we want to see if we can preserve this low level packet
behavior. All right. So here is the baseline configuration again
at the very top. We repeat this experiment at a TDF, time dilation
factor, of 10, again making sure that the perceived configuration is the
same, and at a time dilation factor of 100, and visually you can see that the
traces look very, very similar. Of course we have also done a
statistical comparison by using the distribution of the inter-packet
arrival times and so on.
And even the distributions match. And so for this very simple case time
dilation is able to preserve low level packet behavior. But this is,
again, a simple case with a single flow; we have also validated time
dilation in more complicated scenarios where we have multiple flows,
varying bandwidths, varying latencies and so on. And I'm not going to
present the details here, but for all of these validation experiments,
we show that time dilation can actually preserve the baseline
configuration.
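One way such a statistical comparison could be done (this is my sketch, not necessarily the exact test used in the work; it only assumes you have per-packet timestamps from each trace) is a two-sample test on the inter-packet arrival time distributions:

```python
# Sketch: compare inter-packet arrival time distributions from two traces.
# Each trace is assumed to be a sorted list of packet timestamps in (virtual) seconds.
import numpy as np
from scipy.stats import ks_2samp

def interarrivals(timestamps):
    return np.diff(np.asarray(timestamps, dtype=float))

def traces_consistent(baseline_ts, dilated_ts, alpha=0.05) -> bool:
    """Two-sample Kolmogorov-Smirnov test; failing to reject suggests the
    dilated trace preserves the baseline's inter-arrival distribution."""
    _stat, p_value = ks_2samp(interarrivals(baseline_ts), interarrivals(dilated_ts))
    return p_value > alpha
```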
And coming back to the point you had about burstiness behavior, again I
think it goes back to how -- what is the fidelity of your
implementation? If the time sources that are being used by the
application, by the operating system are all the time sources that you
already interposed on, then no matter what granularity the application
tries to get the value of time at, it will get the scaled value of time.
So the only case where burstiness actually will show through is when
the application is using a time source that we cannot capture, which is
what Ed was pointing out. So in the paravirtualized case, if you
have applications that are directly issuing assembly instructions to
read the TSC value, that's something that we do not handle right
now. Yes?
>>: [Inaudible] the first example if the two slow down machines run
differently [inaudible] then if they're on the same virtual machine one
of them might get to run for some long quanta, and not see response for
the other ones going into that quanta, and that's something [inaudible]
they're not slowing down.
>> Diwaker Grupta: That's a great point. So I think what you're
pointing out is that there is some overhead involved in doing this kind
of multiplexing.
>>: I'm not talking about overhead, I'm talking about the bursty and
the quantization of the CPUs. For example, you do time dilation
wherein you gave each VM a 10 second quantum and the first VM would spend
its 10 seconds perceiving it to be 100 seconds and wondering where in the
heck that other [inaudible] and then we switch to the other guy and
[inaudible].
>> Diwaker Grupta: So again, so most of the virtuals that you can see
here, I mean for network I/O it's not going to happen, right, because
you're going to send some packet, you're going to wait for some
response, you're going to block and then the context which will happen.
Right? And so you're not ever going to run for 10 seconds straight.
>>: My point is if you had enough to do and your machine worked and
your infrastructure were very foolish in allocating these 10 second
quanta then you absolutely would see and second [inaudible] so you
could counter this thing [inaudible] how do we know yours is small
enough or fidelity --
>> Diwaker Grupta: Right. So one of the things we do is we modify
the scheduler that the virtual machine monitor uses to use much smaller
time quanta. So you can -- you know, you make sure that no virtual
machine is running for too long and breaking the illusion. But
again --
>>: How do you know, I guess, is the question?
>> Diwaker Grupta: So -- that's a great question, and our metric has
been to try different kinds of applications that we care about and see
if we can preserve their behavior, you know, for all the application
specific performance metrics that you might
want to measure. And in -- for the applications, I mean this is just
one class of applications that I'm talking about, but later on I'll
talk about several other classes of applications. And for all those it
hasn't mattered. And so you're right that there might be some
artifacts that we're not able to capture, but if the applications that
you're using are not concerned about them, then how
do they make a difference?
>>: So what observation would it [inaudible] fact that you're showing
us these graphs that visually match. Did you have -- what's your test
for not matching enough? If I built a competing system that would also
attempt to achieve the same goal [inaudible] some egregious error, what
would be a characteristic of my graph? Did you ever see a graph that
you looked at and said, oh, that isn't faithful enough?
>> Diwaker Grupta: I'll come to that later. So there are artifacts in
the system where, you know, you have to do additional work to make sure
that the graphs do match. Basically our metric has always been with an
application, whatever application. We don't care. And look at the
metrics that you normally use to measure the performance of that
application. And then run it under time dilation and see if you can
match the performance. And if you want -- if there are finer grained
metrics that you wanted to use, you look at those, and see if you can
match them.
>>:
[Inaudible].
>> Diwaker Grupta: Okay. So as I said earlier, in a distributed system
you want to make sure that all --
>>: [Inaudible] but if you run in bursts, say your clock goes fast and
then stops and fast and then stops, NTP is not going to work --
>> Diwaker Grupta: So, so, in the frame -- in this environment,
basically if you use -- if you're using NTP, the NTP server has to be
running in the dilated timeframe as well. Because --
>>: Of course it has to be running in the dilated timeframe, but if time
goes fast and then stops and fast and then stops and they're not in
lock step across all the machines, then NTP isn't going to converge
properly. I know I've actually run it on machines that have clocks
that did that as the clock was paused, and it didn't work.
>> Diwaker Grupta: So I've never experimented with NTPs, so I really
don't know the answer there. Yeah.
>>:
So [inaudible]?
>> Diwaker Grupta: We have. I'll come to that. All right. So let me
-- let me move on and I want to talk about one of the examples of time
dilation. So again, coming back to the protocol evaluation and example
that I had earlier.
So in the Linux 2.6 kernel the default flavor of TCP is TCP New
Reno, but Linux also ships with another variant called TCP BIC, which
stands for binary increase congestion control, and this variant
features an enhanced congestion control protocol designed
specifically for high bandwidth networks.
And what we want to do is compare New Reno versus BIC on high bandwidth
links and see which one does better. So the goal here is to treat these
protocols as black boxes; we're not trying to reason about the behavior
that we see, we're just trying to see can we use time dilation to push
beyond the hardware limit and, you know, uncover interesting behavior,
in particular does BIC really outperform Reno as you increase the
bandwidth.
And so here's the experiment setup. We have a single bottleneck link,
and we have 50 TCP flows that are going across this bottleneck. We're
going to fix the round trip time of the bottleneck link to 80
milliseconds and we're going to vary the bandwidth on this link. And
for each value of bandwidth, we want to measure the per flow
throughput that we see. So in the first sort of phase of the
experiment, on the X axis I have the bandwidth of the bottleneck link going
from zero to one gigabit per second, on the Y axis I have the per flow
throughput in megabits per second.
The first thing you notice is that in this range, because we have
gigabit hardware in our clusters we can actually simply run the real
operating system without using time dilation and measure what the
performance looks like. And so the blue dots here are the performance
with the regular Linux no time dilation. Then we can repeat the same
experiment just to validate if our setup is correct.
So we repeat the same experiment using a time dilation factor of 10,
again making sure that the perceived bandwidth remains the same in each
case. And the green square -- triangle, sorry, green triangles are the
performance under time dilation. And in this range you can see that,
you know, they match, they're within the variants, the performance is
very close to what we've seen in the baseline.
The red squares are the performance we get with TCP BIC, and in this
range there's no clear difference between the performance of these two
different TCP variants. Because we are using a time dilation factor of
10 and we have gigabit hardware at our disposal, we can actually push
further and test up to 10 gigabits per second.
>>: [Inaudible] with the TDF of [inaudible]?
>> Diwaker Grupta: Yes. We have, but it's not on this graph.
>>: What's it look like roughly?
>> Diwaker Grupta: It's the same.
>>: [Inaudible].
>> Diwaker Grupta: Yes. So in this range we are pushing up to 10
gigabits per second; again on the Y axis we have the per flow throughput,
and here we see that as we increase the bandwidth of the bottleneck
link, TCP BIC begins to outperform TCP New Reno. So definitely there seems
to be some advantage that BIC brings to the table.
But does it continue to outperform New Reno if you push even further,
so we can actually use a higher time dilation factor to test at even
higher bandwidths. And so we move to a time dilation factor of a hundred
that allows us to go up to a hundred gigabits per second on the
bottleneck link.
And there are two interesting things in this graph. The first is that
the performance of both TCP BIC and New Reno sort of flatten out so
they don't increase in terms of the bandwidth that they can extract on
this link and secondly the performance differential between these two
variants seems to shrink in this range. And so the claim here is that
TCP BIC is good for high bandwidth networks but only up to a certain
range, and after that we hit diminishing returns where TCP BIC
doesn't prove to be as beneficial.
>>:
[Inaudible].
>> Diwaker Grupta:
Yeah?
>>: The same series of data I can also draw a completely different
conclusion [inaudible]. I can back up two slides and say the -- at the
small range we would verify that through the time dilation filter these
things look the same, and we can see on the next slide at some point
the time dilation filter seems to see something different but then on
the third slide it looks like you can't tell the difference anymore.
But we can't actually tell whether the phenomenon that we see past the
validated area is a function of the variance between -- the difference
between [inaudible] or a function of limitations in the fidelity of
this. I mean, it seems like you're trying to simultaneously validate
the mechanism and draw a conclusion from it. It can't be both at the
same time.
>> Diwaker Grupta: That's a great point. And I mean the whole point
of -- we spent a lot of time in this project just trying to validate,
validate, validate, and we can only validate, you know, up to the sort
of the hardware capacity that we have. And so in this case we had
gigabit hardware so we validate up to a gigabit hardware and beyond
that, it is our hope that, you know, once we can actually test on
hundred gigabit per second network it will look something like this.
But we have no way of finding out right now.
>>:
[Inaudible] did the results look like yours?
>> Diwaker Grupta: The -- so we did look at NS results for some of the
things but not this particular experiment because I didn't have an
implementation of TCP BIC and so on. So I mean, the problem with, you
know, doing apples-to-apples comparisons with an experiment like this
in NS2 is that I mean you don't have the real operating system stack.
I mean, there are so many other artifacts that are different there and
--
>>: [Inaudible] results that look like what NS did qualitatively then
that would be --
>> Diwaker Grupta: So [inaudible] some other experiments, but not this
one.
>>: It may be that they're wrong or it may be that you're wrong or it
may be that [inaudible] are wrong.
>> Diwaker Grupta: Sure. So we haven't looked at NS 2 results for
this experiment, but we have for other experiments. Yeah.
>>: As you're scaling up the bottleneck bandwidth are you scaling up
anything else? I'm wondering if the bottleneck remains the bottleneck
[inaudible] the bandwidth.
>> Diwaker Grupta: So I'm not sure what you mean by that. So
[inaudible].
>>: [Inaudible].
>> Diwaker Grupta: Yes. So the end points are running at a time dilation
factor of 10, so they are -- they do have more CPU in that sense, yeah.
>>: So [inaudible] real big here I assume that's because it takes so
long to run the experiment, you can only run it one and a half times
instead of 10?
>> Diwaker Grupta: Well, these are across the 50 flows, they are not
across different runs. So we're plotting the mean and the standard
deviation across those 50 flows. Yeah.
>>: Do you suppose those errors and those [inaudible] are getting
bigger because they can't get in the protocol or are they getting
bigger because burstiness is [inaudible].
>> Diwaker Grupta: Again, I don't know for sure. I mean, in the range
that we validated it didn't look like we were adding something to the
error bars. But in this range, I don't know for sure.
All right. So I'm going to move on at this point. And the take away
here is that we think time dilation can be used as an effective tool to
push beyond hardware limitations, to do this kind of protocol
evaluation. And I think compared to the current state of the art, in
terms of the techniques and the tools that you have available, I
think time dilation gives you more realism and accuracy at the same
time.
So that's just one application. Now, there are several other
applications you could use time dilation for. You can use it to
predict how the performance of your cluster will change if you upgrade some
piece of equipment; you could also explore how the bottlenecks in your
application evolve as you give it more and more resources. But there
are certain limitations to be kept in mind when you're using time
dilation.
First of all, time dilation doesn't scale static memory capacity, as I
mentioned earlier, and our work on Difference Engine addresses some of
the issues here; I will talk about it towards the end of the talk if I
have time. But remember that for pervasiveness we want everything
in a distributed system to be running under the same time dilation factor, and
if you have some specialized hardware appliances that either cannot be
virtualized for some reason or if the [inaudible] on the appliance is
not accessible for direct instrumentation, then these need to be dealt
with differently as a special case.
And finally, time dilation cannot capture radical hardware changes, so if
hundred gigabit hardware looks fundamentally different than how we
construct gigabit hardware, then the accuracy of the predictions that
time dilation makes will degrade. And so we can't capture all
technological evolution. Yeah?
>>: [Inaudible] sort of higher level [inaudible] one of the
motivations is you want to simulate Amazon or something like that,
right? Well like in Bill's example NTP there's actually sort of
interesting higher level synchronization. Your system doesn't
necessarily capture, right. So the evaluation you showed us here was
basically bulk TCP throughput, right, which is fairly straightforward.
But I mean at what point do you have to say okay, the system we're
trying to emulate is so complex we have to somehow intuit about global
synchronization behaviors or things like that.
>> Diwaker Grupta: Right. So in this work that I'm going to talk
about, DieCast, I'll present evaluations for much more complex
distributed systems. So hopefully that will address your point, yes.
>>: Does TCP look at the time say the [inaudible] 10 milliseconds
[inaudible] differently or [inaudible] millisecond?
>> Diwaker Grupta: Excuse me? I didn't get that.
>>: Does TCP really take latency into account?
>>: Yes.
>> Diwaker Grupta: Yes.
>>: [Inaudible] actual number.
>> Diwaker Grupta: Yeah.
>>: [Inaudible]. [Inaudible] different [inaudible].
>> Diwaker Grupta: All right. So next I want to talk about DieCast.
And just to refresh the motivation here is to use a small
infrastructure to test a much larger system. And we want to do this
kind of replication and testing with some goals in mind. The first
thing we want is fidelity, right, we want to make sure that we can
replicate the original system to as great a fidelity as we can.
We also want reproducibility in the sense that we want to be able to do
controlled experiments for performance isolation, for performance
debugging and so on. And finally we want to be efficient in the sense
that we want to use as few resources as possible to do this replication
and testing. And as I said before, I'll show that DieCast can scale a
given test infrastructure by an order of magnitude or more.
So let me walk you through the approach that we take with an example.
So here we have a typical three tier web service. We have some web
servers, we have application servers, we have database servers.
They're all connected to -- with some -- through some high speed
fabric. And then we have a load balancer in front of the system. And
the goal now is to replicate and test the system using a smaller number
of machines.
And note that if you were to make a complete exact copy of the system,
you would get fidelity because it's basically the same system all over
again, and you might also get reproducibility, but by our definition
of efficiency, because we are using double the resources the system is
not efficient. So in order to bring efficiency back, what we do is
we're going to encapsulate each of these physical machines into virtual
machines and then we're going to consolidate these virtual machines on
a fewer number of physical machines.
So we do some consolidation that gives us a smaller number of physical
machines, but now we've lost the original topology of the network. So
we're also going to put in a network emulation environment that can
recreate the original topology. But now we have the problem that each
of these virtual machines only has access to a fraction of the
resources of the underlying test machine. And furthermore, the
machines in the test harness might be completely different than the
machines in the -- the physical machines in the original system.
So we have lost fidelity. So how do we recover fidelity? So let's
see what we want to do here. So we have some machine in the test
harness and here we have some machine in the -- some physical machine
in the original system. Now, so far what we have done is we have
created some number of virtual machines on this test machine, and what
we want to do is we want to take one of these VMs and make it look like
the corresponding physical machine in the original system.
So obviously one of the things that we need to do is we're going to use
time dilation to scale up the resource capacity that's visible to this
VM, but the other thing we need to do is to take into account the
heterogeneity between the machines in the test harness and the machines
in the physical system and also because each of these virtual machines
might not be identical, they might be configured differently, we have to
be able to precisely control the amount of resources that are available
to this virtual machine, and particularly the amount of CPU that's
available to it, the amount of network bandwidth that's available to it, and
the amount of disk that's available to it.
So for network scaling, we use traffic shaping tools in combination with
network emulators, as I described earlier. For CPU scaling we leverage
the CPU schedulers inside the virtual machine monitor, so we can say things
like this virtual machine should get 10 percent of the CPU. There are
some subtleties involved in the CPU scheduler, but I don't have
time to cover them now. But for the most part, we just leverage
existing CPU schedulers.
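As a concrete illustration of how such a scheduler knob combines with dilation (my own sketch, not DieCast's actual configuration interface), the CPU cap for a dilated VM is the original machine's share of the test CPU divided by the TDF:

```python
# Sketch: CPU cap (percent of one physical core) for a dilated VM so that,
# after dilation, it perceives the CPU of the original physical machine.
# Hypothetical helper -- not DieCast's real configuration code.

def cpu_cap_percent(original_ghz: float, test_ghz: float, tdf: float) -> float:
    share_of_test_cpu = original_ghz / test_ghz   # adjust for heterogeneous hardware
    return 100.0 * share_of_test_cpu / tdf        # dilation multiplies it back by tdf

# Example: replicate a 2.0 GHz machine on a 2.0 GHz test host at TDF 10
# => cap the VM at 10 percent of the physical CPU.
print(cpu_cap_percent(2.0, 2.0, 10))   # 10.0
```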
But what about disk I/O? Under time dilation the disk will appear to
be much faster and in particular it can be -- it can be perceived as a
completely different disk than the disk in the original system. And we
want to make sure that the VM sees the same disk as the physical
machine sees in the system.
So how do we deal with disk scaling? So before I describe that, let's
just first look at how disk I/O happens in Xen. So again, here we have
implementations for both paravirtualized and fully virtualized VMs,
but in this talk I'm only going to talk about the fully virtualized
implementation.
So here we have an unmodified operating system. It thinks it's using a
real disk drive. So the device driver is unaware that a real disk
doesn't exist. Now, in practice the file system of the VM might be backed
by a disk image or a partition in domain zero. There is a user space
process in domain zero called ioemu which is responsible for
doing the I/O emulation for this particular VM. So any disk requests
that originate inside the virtual machine will be intercepted by the I/O
emulator process, which will then do the operation on the virtual
machine's behalf, so it acts as the hardware, does the read or write and
so on.
So this is the model that we're working on. So what is the goal?
Again, so the goal is that we want to preserve the perceived disk
characteristics, so everything from seek times to disk throughput and so
on. And because the disk will appear faster under time dilation, what
we want to do is take each request and slow it down by some amount to
make sure that inside the virtual machine the perceived characteristics
remain the same. Now, the challenge is that a lot of the low level
functionality for disks resides in the firmware, and so the firmware
might do batching and reordering of requests for efficiency, masking some
of the delays.
But the bigger problem is that the test harness may have a completely
different hard drive than a physical machine in the original system.
And so how do we reconcile these two completely different pieces of
hardware? So our approach to addressing both these issues is to use
DiskSim. DiskSim is a high fidelity disk simulator out of CMU.
So what DiskSim does is that you give it a model of a disk to simulate
and then for each request it will report how long would the request
take to service in that particular disk.
So we have DiskSim running as a separate user process, and what we do
is for each request that comes to I/O EMU we forward the request to
DiskSim so DiskSim knows things like which sector number is being
accessed, what is the size of the request, what is the type of the
request, read, write, and so on. And after that, it will return the disk
service time on the simulated disk. Do you have a question?
>>: I have a question [inaudible]. My understanding is [inaudible]
you can divide the CPU [inaudible] bandwidth even [inaudible] okay but
in [inaudible], right, therefore everyone share the same queue. It's
hard to make sure everyone get to like the distribution [inaudible] how
do you solve that problem?
>> Diwaker Grupta: So that's a great point. Which is precisely the
reason we want to control the time it takes for each request to
service, right? So in particular DiskSim is going to return the
service time on the simulated disk. And we know the time dilation
factor. We also know how long the request actually took to service
inside ioemu. Based on these we can figure out how much to delay each
request by before we return to the virtual machine. And so you can
precisely configure, precisely control the time that the virtual
machine thinks the request took to service.
And so the illusion that we are trying to preserve is that the virtual
machine is actually talking with the simulated disk and not the real
disk. Yeah?
>>: [Inaudible]?
>> Diwaker Grupta: We are for some of the experiments, yeah.
>>: [Inaudible].
>> Diwaker Grupta: Uh-huh.
>>: Because the disks you still [inaudible] disks [inaudible].
>> Diwaker Grupta: So, so --
>>: [Inaudible] confident DiskSim is emulating a single machine to a
single disk and how are you [inaudible] the [inaudible] --
>> Diwaker Grupta: So, so.
>>: Are you trying to emulate everything [inaudible] disk in the
individual machine?
>> Diwaker Grupta: So part of the reason why it works is that under
time dilation each request actually has more realtime to finish. So
even if there's some overhead in running multiple DiskSim processes
and so on, because we have more time to finish, we can still do this
kind of emulation without breaking fidelity.
>>: So one thing that could happen [inaudible] is that you could have
all of your multiplexed processes doing sequential I/O and you're
starting out [inaudible] you might see the slowdown if the quanta are
small, which we [inaudible] we might see what looks like random I/O
to the real disk, and that can slow down much more than an order of
magnitude. So do you actually validate that -- I mean DiskSim says I
want this request to run at this time. Do you validate that you don't
[inaudible]?
>> Diwaker Grupta: So for the DiskSim evaluation that we've done, we used
some standard file system benchmarks like DBench and Iozone and
whatever the access patterns they use are. We didn't specifically try
a benchmark that only had random I/Os. But the --
>>: [Inaudible] misunderstanding. If you had 10 virtualized machines,
10 virtual machines, all of which were doing sequential I/O, but
together [inaudible].
>> Diwaker Grupta: So we haven't done that exact experiment but I will
talk about experiments where we have a bunch of virtual machines where
we're doing a bunch of disk I/O. I'm not sure that's exactly
sequential, but for those experiments it hasn't been an issue. But I
can imagine that you could --
>>: I have another question. You could leave running all the time if
I were to have built this I would have said assert that time
[inaudible] is after [inaudible].
>> Diwaker Grupta: Oh, oh.
>>: [Inaudible].
>> Diwaker Grupta: So you're saying what if the amount of time we
actually want to delay by is negative, right, that the system has
already taken longer. You're right. So --
>>:
No, the actual disk has taken longer --
>> Diwaker Grupta: Yeah, yeah, yeah. Yes, yes. So in this system we
actually have checks for that. And in the experiments we have, it
doesn't happen.
>>:
[Inaudible] at all.
>> Diwaker Grupta: No, it doesn't happen often, so it happens like
very, very rarely. I don't have the exact numbers for it, but we do --
but we do have checks for that. And we don't explicitly deal with it.
And so one of the things we tried doing was accumulate this negative
time and incorporate it in a later request where we have enough buffer.
So we played around with some of those, but in the experiments so far
it hasn't mattered.
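Putting that answer together with the delay calculation described earlier, here is a sketch of mine (not the DieCast source) of the per-request bookkeeping, with the rare negative case simply clamped to zero:

```python
# Sketch of the per-request disk delay computation (illustrative only).
# disksim_service_time is in the VM's (virtual) timeframe; actual_service_time
# is the real time ioemu spent completing the request on the physical disk.

def extra_delay(disksim_service_time: float, actual_service_time: float, tdf: float) -> float:
    """Real-time delay to add so the VM perceives the simulated disk's latency."""
    target_real_time = disksim_service_time * tdf   # one virtual second costs tdf real seconds
    delay = target_real_time - actual_service_time
    return max(delay, 0.0)   # the negative case (real disk slower than target) is clamped

# Example: DiskSim reports 5 ms, the real disk took 8 ms, TDF 10 => delay another 42 ms.
print(round(extra_delay(0.005, 0.008, 10), 3))   # 0.042
```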
All right. So the impact of all this is that the virtual machine
thinks that it's interacting with the simulated disk which we can
choose to our own liking. So by doing this, we can preserve, you know,
the disk I/O characteristics as well. So so far the approach has been
that we're going to multiplex VMs for efficiency, we're going to use
time dilation to scale up the capacity of each VM and we're going to
use independent knobs on the CPU, the disk and the network to make sure
that each VM has resources exactly equal to the corresponding physical
machine in the original system.
And the claim after doing all this is that the scaled system looks almost
like the original system. I say almost because we still don't deal
with things like the main memory capacity. And I'll talk about it
later. So how do we validate DieCast? Well, the same methodology as
before. We set up a baseline system, and in the experiments that I'm
going to talk about, the baseline is 40 physical machines running a
single VM each, and for the DieCast scale configurations we have four
physical machines running 10 virtual machines each with a time dilation
factor of 10, and we want to compare the performance of these two
systems.
And the questions we want to ask is A, can we match application
specific performance metrics, and, B, can we match low level system
behavior like the CPU utilization profile as a function of time. So we
validate DieCast on a variety of systems. In this talk I'm going to
talk about one of the systems. This is a service called RUBiS. It's
an eCommerce system modelled after eBay. So you have some web servers
and database servers. In this experiment we are modelling two
geographically distinct sites, so they're connected by a [inaudible]
link. And the servers, half the servers in each site talk to the
database servers on the other site.
The nice thing about RUBiS is it comes with a configurable workload
generator, so you tell the workload generator how many user sessions to
simulate. Yes?
>>:
[Inaudible].
>> Diwaker Grupta: I'm just -- just something we came up with.
There's no strong reason.
>>:
[Inaudible].
>> Diwaker Grupta: We have other -- for RUBiS in the paper this is the
only one we have. But we have different topologies for BitTorrent, we
have different topologies for the third system that we tried out.
>>:
[Inaudible].
>> Diwaker Grupta: Yeah, no, no. I mean, we just wanted to make sure
there is more -- you know, there is reasonable complexity that, you
know, communication is happening both within each site and across sites; we
are exercising local area links, wide area links and so on. So it's
just something that we came up with.
So we have one workload generator for each web server and we increase
the workload in the system and see how it behaves. So the workload
generator that comes with RUBiS reports a bunch of metrics at the
end of the run. And these are again application specific metrics that
the workload generator reports. We do nothing to them. And so the
first thing we want to do is we want to compare the metrics in the
DieCast system versus the baseline. So on the X axis I have the total
system load in terms of the number of simulated user sessions across
all the machines. On the Y axis, I have the aggregate throughput of
the system in terms of requests per minute.
The solid red line is the performance of the baseline configuration and
the dashed [inaudible] line is the performance of DieCast and you can
see that at least for this metric they almost perfectly overlap. I'm
also showing a third line here which is labeled as no DieCast, and this
basically shows that if you were to just multiplex VMs without doing time
dilation or any other kind of scaling, this is the performance you would
get. And as you can see, you know, if you're not using DieCast
the system diverges significantly from the
baseline, and in fact the difference is even more prominent if you look at response time.
So on the Y axis here, I have the average response time taken across
all the requests. And again, we match the baseline configuration
pretty closely, but if you are not using DieCast, the response time
degrades significantly because now the virtual machines are running out
of resources. Yes?
>>:
The Y axis are being measured in realtime versus DieCast?
>> Diwaker Grupta: No. No. So we just take the numbers that the
workload generator reports to us. The workload generator is also
running in this dilated timeframe.
>>: [Inaudible] so this says that when you use a 10th of the
machines with no DieCast you get somewhere near half the performance.
>> Diwaker Grupta:
Uh-huh.
>>: So that suggests that you didn't really push this far enough,
right, because I'd expect to see something like a 10th of --
>> Diwaker Grupta: It depends on the workload, right. Not always --
>>: It depends on the workload. I'm suggesting that you probably
should have pushed the workload by a factor of five farther out because
you didn't really stress the difference between --
>> Diwaker Grupta: Probably -- I mean for this one, this is as far as
we went. But there are other workloads where --
>>: Maybe I'm not making my point [inaudible] enough. You're telling
me, look these two lines coincide so we've done a good job preserving
[inaudible] but that's because the machine -- one possible explanation
is we didn't come anywhere close to [inaudible] the machines and the
point of having 40 machines is we're going to load them. And the
machines are underloaded by a factor of five.
>> Diwaker Grupta: So as I said, we have done the experiment where we
do actually push the machines, just not in this experiment. So
rephrasing your question, I think what you are saying is that for this
application, even if we're not using time dilation, we're not so
terribly off from the [inaudible] line, right --
>>: I'm not saying that [inaudible] DieCast is bad, I'm saying the
fact that the dashed line and the solid line agree tells us almost
nothing, because we have not -- you're not measuring the interesting
part of the graph.
>>: Right.
>>: Because there's a lot of [inaudible].
>> Diwaker Grupta: So I think I have the graphs, and so the -- but
there are -- so we -- one of the problems we had was that the system
was configurable only in certain aspects, so we actually wrote a third
service -- let me -- so we wrote a service where you could configure
the amount of computation, communication and I/O overhead on a per
request basis, and there we basically try to push the system in each of
the three different dimensions and see where it breaks down. And those
graphs are in the paper.
All right. So those were the application specific performance metrics.
But what about the resource utilization? So in this system we had
three different types of machines: the web server, the database server
and the workload generator. And so what we did was we randomly picked
one machine of each type and looked at the CPU utilization. So on the X
axis you have the time since the beginning of the experiment, on the Y
axis you have the percentage CPU utilization, and we compared the CPU
utilization of the baseline machine with that of the corresponding
virtual machine. The takeaway from this graph is that the utilization
profiles are similar. I mean, obviously we can't match the
instantaneous profiles exactly, but they display similar behavior.
We do the same thing for memory. I just want to point out that even
though we don't scale memory in this experiment, we had set things up
such that the baseline machine had the same amount of memory that each
of our virtual machines would have, so that we can actually look at the
memory profiles. And we also looked at the network. So we looked at
each hop in the topology and measured how much data was transferred on
that hop. Of course, this is a fairly coarse-grained metric for this
experiment. We sorted the hops by the amount of data that was
transferred on them and then considered the hops in the same order in
the DieCast topology, and again the amount of data transferred seems to
match.
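A rough sketch of the per-hop comparison just described, assuming the per-hop byte counts have already been collected separately for the two topologies (the hop names and values below are made up):

    # Placeholder per-hop byte counts for the baseline and DieCast topologies.
    baseline_hops = {"core": 9.1e9, "site-A-lan": 4.2e9, "wan-link": 2.8e9}
    diecast_hops  = {"core": 8.9e9, "site-A-lan": 4.3e9, "wan-link": 2.7e9}

    # Sort hops by bytes transferred and compare rank by rank.
    b_sorted = sorted(baseline_hops.values(), reverse=True)
    d_sorted = sorted(diecast_hops.values(), reverse=True)
    for rank, (b, d) in enumerate(zip(b_sorted, d_sorted), start=1):
        print(f"hop rank {rank}: baseline={b:.2e} B, DieCast={d:.2e} B, "
              f"ratio={d / b:.2f}")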
So as I was saying earlier --
>>: [Inaudible] so from the fact that each of these graphs [inaudible]
fidelity?
>> Diwaker Grupta: Yes. For this experiment.
>>: Except that memory was the thing where you didn't even try to
preserve fidelity, so you've got -- I mean --
>> Diwaker Grupta: No, so fidelity here is in the sense that if your
baseline had the same amount of memory as your virtual machine, does it
get utilized in the same way? Right. That's all we are saying. We
are not trying to address the fact here that the baseline could have
had a lot more memory. That's another problem that I'll talk about
later.
>>: Because you downgraded the --
>> Diwaker Grupta: Yes.
>>: Baseline, right.
>> Diwaker Grupta: Yes. All right. So we have tested DieCast on, you
know, BitTorrent and the service I just discussed, where we exercise
different dimensions of the system, and, you know, for example, when we
have extremely high CPU load we do start to see some divergence from the
baseline. But for the most part we can match both the application
specific performance metrics and the resource utilization profiles.
But you know, again, these experiments were encouraging and interesting
in their own right, but we really wanted to see if we could use DieCast
on a real world system. Yeah?
>>: So how important was it that you used DiskSim, this super accurate
thing? Could you have just written a little [inaudible] script that
kind of, oh, yeah, [inaudible] -- how well would that have done?
>> Diwaker Grupta: So we do have something like that for the
paravirtualized case. We don't have, you know, such a high fidelity
handle on how each request is going, so there we basically modify the
device driver to, you know, interpose some delay to make sure it works
reasonably well. But, you know, we use DiskSim because we want to use
the highest fidelity model that we have. Where again the argument is --
>>: I'm wondering where is the [inaudible].
>> Diwaker Grupta: Oh, so [inaudible].
>>: [Inaudible].
>> Diwaker Grupta: So it depends on the workload. You can get
workloads where the model starts to show its limits. But we can talk
about it later.
All right. So we were really fortunate to be able to work with this
company, Panasas. They build high performance file systems. It's a
company based out of Pittsburgh; Garth Gibson from CMU is sort of
involved in the company. And this is exactly the motivation that I was
presenting earlier. They have this problem that they ship their file
system to clients which have thousands of machines, but they don't have
the infrastructure to test at that scale. And so we wanted to see if
we could use DieCast to alleviate some of the testing problems.
So their typical testing infrastructure looks something like this.
They have a storage cluster which serves the file system. They have
some clients that are generating the workload. And they're, you know,
connected by some network. So in order to test or run DieCast on the
system, the clients are fairly easy to deal with because they just run
regular Linux, and so we could use our current DieCast implementation
to scale the clients.
But the storage cluster was more of a problem, because they run their
own custom operating system on the storage cluster and it's tightly
integrated with their hardware, and the upshot of all that is that it's
not virtualizable. And so we weren't able to use our Xen-based
implementation for the storage cluster; we had to do a direct
implementation inside their operating system. Excuse me. And while it
was a great learning experience, it would have been much easier if the
system had been virtualized, and then we wouldn't have had to do much.
For the network scaling we use a standard traffic shaper called
dummynet that their system ships with. All right. So the first thing
we did was, again, validation: the baseline sets up a storage cluster
with 10 clients generating the workload. To test the system, our
DieCast-scaled setup has a storage cluster with only 10 percent of the
resources and a single physical machine running 10 virtual machines
generating the workload.
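As a rough illustration of the scaling arithmetic involved (this is not the actual DieCast implementation, which works inside Xen and dummynet; the function and parameter names are assumptions), the per-VM resource caps for a given time dilation factor might be derived like this:

    def scaled_caps(tdf, phys_cpu_cores, phys_net_mbps, phys_disk_mbps):
        """Per-VM caps so that, after dilation by `tdf`, each VM perceives
        roughly the resources of one original physical machine."""
        return {
            "cpu_cores": phys_cpu_cores / tdf,   # CPU scheduler cap
            "net_mbps":  phys_net_mbps / tdf,    # traffic-shaper rate limit
            "disk_mbps": phys_disk_mbps / tdf,   # disk throughput cap
        }

    # Example: TDF 10, so 10 VMs share one physical machine.
    print(scaled_caps(tdf=10, phys_cpu_cores=2, phys_net_mbps=1000,
                      phys_disk_mbps=200))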
>>: This whole [inaudible].
>> Diwaker Grupta: I don't recall what protocol their system uses. I
don't remember. And then we picked two standard benchmarks from their
regular test suite, Iozone and MPI I/O, and for each of these
benchmarks we ran, you know, their test suite for varying block sizes
and we looked at the metrics that they normally look at, which is I
think the read and write throughputs. And for each of these benchmarks
we were able to match the performance metrics.
But the more interesting part from our perspective is that we were able
to use, you know, a hundred machines in the infrastructure to scale up
to 1,000 clients, and before DieCast Panasas had no way to test things
at that scale because they just wouldn't have 1,000 machines to work
with.
>>: So [inaudible] usually have only one server at all these sites?
>> Diwaker Grupta: Oh, so the storage cluster, I mean the server has
many servers internally, so it actually has, you know, I think 10 or 11
blades running inside and then they scale it as the demand increases.
>>: And they couldn't scale to a thousand clients by [inaudible]
workflow generators and [inaudible].
>> Diwaker Grupta: No, that's what I'm saying. They didn't have a
thousand machines to run the workload on.
>>: [Inaudible] machines just to generate your cluster network, right?
>> Diwaker Grupta: So the way their system is set up, they did require
each workload --
>>: [Inaudible] block requests.
>> Diwaker Grupta: No, that's fine. But I'm just saying that --
>>: [Inaudible].
[brief talking over].
>>: [Inaudible].
>> Diwaker Grupta: Excuse me?
>>: [Inaudible].
>> Diwaker Grupta: So we will scale the server down before testing it
for the validation. And for this one, we do scale the server up if
that's your question.
>>: All right. [Inaudible].
>> Diwaker Grupta: Okay. I'm going to quickly try to talk about some
of the interesting aspects of Difference Engine. So memory was a big
issue here in trying to create more and more virtual machines on the
same physical machine. And this is important, you know, for server
consolidation as well, but even otherwise, if you could do the same
number of things with less memory, then you have an incentive to do so,
because memory is expensive, and it's difficult to upgrade and provision
a system with more memory. And memory consumes a lot of power. And
as, you know, we move towards multicore systems, the primary bottleneck
in creating more and more VMs on the same physical machine is going to
be memory, and so we wanted to consider this problem. Go ahead.
>>: Well, is memory capacity not scaling the same way as transistor
capacity as we go multicore?
>> Diwaker Grupta: It might be, but the -- for example, you have a
limited number of slots in your motherboard, and the CPU can be
multiplexed among several of the VMs. Memory can't be multiplexed that
way, so it is simply a fact that it's a static resource and not a
time-shared resource.
>>: Okay. So [inaudible].
>> Diwaker Grupta: Yes. Yes.
>>: Okay.
>> Diwaker Grupta: And actually there are, you know, more compelling
reasons here, because if you can, for example, have the same number of
VMs consuming less memory, you can actually selectively turn off, you
know, one particular DIMM or something, and so you can save more power
that way.
>>: Okay.
>> Diwaker Grupta: So there's incentive there as well. The state of
the art in virtual machine memory management is what is called content
based page sharing. This is done by VMware ESX Server and potentially
other systems, and the idea is that you're going to walk through memory
across all the virtual machines, identify all the pages that are
exactly the same, and then for those pages you can do copy-on-write
sharing, which gets you some savings. But the key premise here is that
there is a significant potential for savings beyond whole pages. So if
you look at subpage granularity, you can actually extract a lot more
savings, which is what we wanted to do in this work.
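As an illustration of content based page sharing, here is a small Python sketch -- not the VMware or Difference Engine code -- that hashes pages and backs identical ones by a single copy-on-write copy:

    import hashlib
    from collections import defaultdict

    def share_identical(pages):
        """pages: dict of page id -> page contents (bytes).
        Returns a map from page id to the id of the physical copy backing it."""
        by_hash = defaultdict(list)
        for pid, data in pages.items():
            by_hash[hashlib.sha1(data).digest()].append(pid)

        backing = {}
        for group in by_hash.values():
            canonical = group[0]            # keep a single physical copy
            for pid in group:
                backing[pid] = canonical    # others become copy-on-write refs
        return backing

    pages = {"vm1:0": b"A" * 4096, "vm2:3": b"A" * 4096, "vm2:7": b"B" * 4096}
    print(share_identical(pages))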
So just to highlight some of the mechanisms we use, here's an example.
We have two virtual machines with a total of five pages that are
initially backed by five physical pages. Now, two of these pages are
exactly identical, and two of the pages are similar but not quite
identical, so there are small differences. So the first thing
Difference Engine does is it shares the identical pages, exactly like,
you know, VMware and other systems do, and that saves us one physical
page.
The next thing we do is we identify pages that are similar but not
quite identical, and for these similar pages we can store the page as a
delta, or a patch, against the base page. Now, the interesting thing to
note here is that patched pages have to be reconstructed even on a read
access, which is not the case for copy-on-write pages, where reads are
essentially free.
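A simple sketch of the patching idea, assuming a byte-level diff rather than the binary delta encoder a real system would use; note that the patched page has to be rebuilt even for a read:

    def make_patch(base, page):
        """Record the positions and bytes where `page` differs from `base`."""
        return [(i, page[i]) for i in range(len(page)) if page[i] != base[i]]

    def apply_patch(base, patch):
        """Rebuild the page; this reconstruction happens even on a read."""
        page = bytearray(base)
        for offset, value in patch:
            page[offset] = value
        return bytes(page)

    base = bytes(4096)
    similar = bytearray(base)
    similar[10], similar[100] = 0x42, 0x7F
    patch = make_patch(base, bytes(similar))
    assert apply_patch(base, patch) == bytes(similar)
    print(f"patch holds {len(patch)} byte differences instead of a full page")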
The third thing we do is we identify pages that are not being accessed
frequently, and so let's say the blue page in this example is not being
accessed frequently: we're going to compress this page and store it
compressed in memory. As before, compressed pages have to be, you
know, uncompressed on each access.
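A minimal sketch of this third mechanism, with zlib standing in for the specialized memory compressor discussed later in the talk:

    import zlib

    compressed_store = {}   # page id -> compressed contents

    def evict_cold_page(pid, data):
        """Compress a rarely accessed page and keep only the compressed form."""
        compressed_store[pid] = zlib.compress(data)

    def access_page(pid):
        # Unlike a shared copy-on-write page, a compressed page must be
        # decompressed before it can be used, even for a read.
        return zlib.decompress(compressed_store.pop(pid))

    evict_cold_page("vm1:42", b"\x00" * 4096)
    assert access_page("vm1:42") == b"\x00" * 4096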
And so by using these three mechanisms we can save a lot more memory
than the current state of the art. Now, there are several engineering
challenges involved in making the system work efficiently. In
particular, how do you pick which pages to patch, which pages to
compress, which pages to share? How do you identify similar pages
efficiently, while also accommodating the fact that page contents
might be evolving over time?
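One plausible way to find candidate similar pages cheaply, loosely in the spirit of what the talk describes (the block size and sample offsets here are assumptions, not the actual Difference Engine parameters): hash a few fixed blocks of each page and treat pages that collide on any block hash as patching candidates.

    import hashlib
    from collections import defaultdict

    BLOCK = 64
    OFFSETS = (0, 1024, 2048)        # assumed fixed sample offsets in the page

    def block_signature(data):
        return tuple(hashlib.sha1(data[o:o + BLOCK]).digest() for o in OFFSETS)

    def candidate_pairs(pages):
        """pages: dict of page id -> bytes. Pages sharing any block hash are
        considered candidates for patching against each other."""
        index = defaultdict(list)
        for pid, data in pages.items():
            for h in block_signature(data):
                index[h].append(pid)
        pairs = set()
        for pids in index.values():
            for i in range(len(pids)):
                for j in range(i + 1, len(pids)):
                    pairs.add((pids[i], pids[j]))
        return pairs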
There is an issue with Xen in particular, that the size of the heap
that's available to the hypervisor is fairly small, and so we have to
be careful about the data structures we use and the bookkeeping that we
do. And finally, the whole point of doing this memory saving is that
you want to create additional VMs, and so at some point you might just
run out of physical memory, so we need some kind of memory
overcommitment support for demand paging things out to disk. And so we
had to build this as well for Xen.
And I'd love to talk about the implementation later on. For now I
just want to summarize the results of our evaluation. We demonstrate
significant savings not just for homogeneous workloads, where every VM
is running the same operating system and the same applications; we also
demonstrate significant savings for heterogeneous workloads, where you
have different operating systems and applications. And we can extract
up to twice the savings of VMware ESX Server in head-to-head
comparisons, and in all the cases we see less than seven percent
degradation in application performance.
And we actually show that we can use the additional virtual machines to
increase the aggregate system performance.
Now, let me quickly wrap up here with some related work. The
literature is rich in all these areas, so I'm just going to highlight a
few of the things. As I mentioned before, simulators such as NS-2 are
great for extrapolating beyond, you know, your hardware capacity, but
at the cost of realism. And on the other hand, emulators such as
ModelNet give you realism, but you are fundamentally limited by the
capacity of the underlying network. And time dilation sort of allows
you to get the best of both worlds.
With respect to testing large scale systems, the SHRiNK work in INFOCOM
2003, I think, is closest in spirit to DieCast, and their scaling
hypothesis is that under certain assumptions you can take a sample of
the incoming traffic and use it to extrapolate the traffic in a bigger
network. But SHRiNK only captures the behavior inside the network,
whereas DieCast is able to capture end-to-end characteristics of a
distributed system.
You could also use testbeds such as PlanetLab or Emulab to do large
scale testing, but again you're still fundamentally limited by the
number of machines that you have in the system.
And finally, in terms of memory management for virtual machines, ESX
Server does a great first step by using content based page sharing, but
we demonstrate that there's actually a lot more potential for savings
if you look at subpage granularities. And we leverage work in
compression algorithms for main memory. So when we compress pages we
don't use general purpose compression algorithms like Lempel-Ziv;
there are specialized compression algorithms that exploit the fact that
memory has some structure to it -- typically you'd have numbers stored
in memory, for example, so you have some four byte granularity -- and
they exploit that kind of structure to do better compression, and we
leverage previous work in this area.
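A greatly simplified sketch of that style of compressor (the dictionary size and hash here are assumptions, not the algorithm actually used): scan 4-byte words and classify each as zero, an exact dictionary hit, a partial hit where only the low bits differ, or a miss. Mostly-regular pages end up dominated by the cheap classes.

    import struct

    def classify_words(page):
        """Classify each little-endian 32-bit word of a page. High counts of
        zero / exact / partial suggest the page will compress well."""
        dictionary = [0] * 16                 # tiny direct-mapped word dictionary
        counts = {"zero": 0, "exact": 0, "partial": 0, "miss": 0}
        for (word,) in struct.iter_unpack("<I", page):
            slot = (word >> 10) % 16          # simple hash on the high bits
            if word == 0:
                counts["zero"] += 1           # encoded with a short tag only
            elif word == dictionary[slot]:
                counts["exact"] += 1          # tag + dictionary index
            elif (word >> 10) == (dictionary[slot] >> 10):
                counts["partial"] += 1        # tag + index + low 10 bits
            else:
                counts["miss"] += 1           # tag + full 32-bit word
            dictionary[slot] = word
        return counts

    print(classify_words(bytes(4096)))        # an all-zero page: 1024 "zero" words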
So just to briefly mention some future work: a current limitation of
DieCast is that it doesn't scale low level subsystems. So for example,
under time dilation, things like the PCI bus or the memory bus will
still appear faster. Now, for most applications such low level
subsystems don't make much of a difference, but if you have an
application that is extremely sensitive to memory access latency -- in
particular, if the access latency decreases and the application's
performance changes as a result -- then this will be a problem.
Interposing on each memory access in software is prohibitive, but the
hope is that, you know, with better hardware support for virtualization
this is something that we can deal with moving forward. I've also done
work on infrastructure management of virtual machines, and the initial
placement and subsequent migration of virtual machines is a topic that
I'm interested in. In particular, I'm looking at building [inaudible]
for placement and migration without the need for expensive
instrumentation and sensors in the hardware, and also, for a system like
Difference Engine, we want to be able to place virtual machines that are
similar in the contents of their memory on the same physical machine.
And so I'm looking into this as well. The challenge here is to come up
with compact representations of the contents of virtual machines'
memory and to efficiently compare these representations. And so we're
building some algorithms based on min-wise hashing to do this.
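A small sketch of the min-wise hashing idea: summarize each VM's memory as a short signature over its page hashes and compare signatures instead of full contents. The signature length and hashing scheme here are illustrative assumptions, not the algorithms from this work.

    import hashlib

    NUM_HASHES = 64

    def minhash_signature(page_hashes):
        """page_hashes: iterable of bytes, one content hash per page of a VM."""
        page_hashes = list(page_hashes)
        return [min(hashlib.sha1(bytes([i]) + h).digest() for h in page_hashes)
                for i in range(NUM_HASHES)]

    def estimated_similarity(sig_a, sig_b):
        """Fraction of matching minima approximates the Jaccard similarity of
        the two VMs' sets of page contents."""
        return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES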
And finally, another way that we might address memory capacity under
time dilation is to use fast secondary storage devices like SSDs to
supplement main memory. So under a high enough time dilation factor,
really fast SSDs can actually appear as fast as main memory. And this
is something that I'm just starting to look at right now.
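A back-of-the-envelope illustration of that point, with assumed latencies rather than measured ones: under a time dilation factor tdf, a device latency of L seconds of real time is perceived as roughly L / tdf in virtual time.

    ssd_read_latency = 50e-6     # assumed: 50 microseconds per SSD read
    dram_latency = 100e-9        # assumed: roughly 100 nanoseconds for DRAM

    for tdf in (10, 100, 500):
        perceived = ssd_read_latency / tdf
        print(f"TDF {tdf}: SSD read appears as {perceived * 1e9:.0f} ns "
              f"(DRAM is about {dram_latency * 1e9:.0f} ns)")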
So in conclusion, the general theme of my research has been, as I said
before, to make virtualization more scalable, both for supporting
existing applications like server consolidation and in terms of
enabling new classes of applications by pushing beyond the limits of
the hardware. So this kind of scalable multiplexing poses several
challenges, and I've addressed some of them in my work. But it also
opens up several opportunities, and I gave you two examples of the
kinds of problems we can use virtual machines for.
The source code for all of these projects is available online, linked
from my web page. And thank you for listening. I'm happy to take
questions at this point.
[applause].
>>: So I was wondering if you have specific examples of distributed
systems where you think this isn't necessarily the best approach.
Because what if I propose that maybe this system is good for systems
which are mainly TCP based, fairly well delineated between uploaders and
downloaders, right, and so you've shown that you can preserve certain
TCP semantics, for instance. But I mean, in what type of scenario would
this just blow up? So you know, what about BFT, what about like an
overlay or something like that?
I mean, I'm wondering to what extent, you know, there's a little bit of
overhead or fuzz in terms of scheduling things like that. What you've
shown is that, okay, for some TCP scenarios that's fine. But under
what timing instances or synchronizations do you think that would fall
apart?
>> Diwaker Grupta: I think this is certainly one of the biggest
concerns that we have. The workloads that we have looked at so far
have been okay, but we haven't really stressed the system in terms of,
for example, every VM running a database intensive benchmark like
TPC-C or TPC-W or something. And I think that for those kinds of
workloads we will start to see some deviation from the baseline. So I
think disk is sort of the thing -- I'm not that concerned about the
network, because I think the network is much easier to scale than the
disk is. You know, because again, realize that under time dilation you
have more time to finish an operation, and with virtual machines in
particular -- and with Xen even more so -- the overhead of multiplexing
things when the virtual machines are doing disk I/O is much higher
than when they're doing network I/O. And so I think that is where the
system will start to break down.
>>: I'm just saying that just emulating, you know, the network may not
be enough if there are complicated timing things going on. So you can
accurately model, say, a bunch of TCP [inaudible], but you know, if
some of these scheduling issues that came up become more important,
then that doesn't make as much of a difference, right? So like if host
availability fluctuates a lot, or there's some type of complex message
passing protocol, then it seems like there might be [inaudible] that
you haven't explored yet.
>> Diwaker Grupta: So we've looked at, you know, peer-to-peer networks
like BitTorrent, and we've set up our own topologies. So for those it
seems to have worked. I really don't know what kind of network
configuration would lead to a scenario where this wouldn't work.
>>: [Inaudible] VM schedule with a [inaudible] VM there's [inaudible],
and to make the time dilation accurate you want to make sure
[inaudible] doesn't run so long, right? Then that means [inaudible]
gets bigger. And then if every VM is running on CPU intensive, really
CPU intensive application, then after time dilation you won't get 10
times as single machine would get because there's no [inaudible].
>> Diwaker Grupta: So that's a great point. And one of the nice
things about time dilation is that you can tweak it until you get the
right results, right? So for example, if you were seeing that the
overhead is high enough that you are not able to cope with it at the
current time dilation factor, you can increase the time dilation
factor so that you have more real time to accomplish the same amount of
work. And so for example, if you were running at a TDF of 10 and you
were seeing that you're not getting the same amount of CPU inside the
virtual machine, you can increase the time dilation factor [inaudible]
to 20, and restrict the amount of CPU that the virtual machine gets
such that after time dilation it will still see a maximum of, you know,
whatever the CPU in the original system was. But now you have more
time to actually finish those operations. So we can accommodate some
of the overheads that way.
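A small numerical illustration of that adjustment (the numbers are assumptions): doubling the TDF while halving the VM's CPU cap leaves the perceived CPU unchanged, but gives twice as much real time to absorb multiplexing overhead.

    def perceived_cpu(cpu_cap_fraction, tdf):
        """CPU the VM appears to have in its own dilated time frame."""
        return cpu_cap_fraction * tdf

    print(perceived_cpu(cpu_cap_fraction=0.10, tdf=10))  # 1.0 perceived CPU
    print(perceived_cpu(cpu_cap_fraction=0.05, tdf=20))  # still 1.0, more real time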
>> John Douceur: Anyone else? All right. Let's thank the speaker
again.
>> Diwaker Grupta: Thank you.
[applause]