>> Shobana Balakrishnan: Welcome to the Microsoft MSR lecture series, I guess. I’d like to introduce
Glauber Costa. He’s the lead engineer for Cloudius, but has been an open source Linux hacker-developer
for the last ten-plus years, so…
>>: Ten years.
>> Shobana Balakrishnan: Ten years, okay, so he’s—you know—been in the Linux world for a while. He
worked at IBM, Red Hat, and then Parallels, and now he’s onto this new gig—I’ll call it Cloudius— and
we’ll hear about… I guess based on the KVM kernel? KVM or…?
>> Glauber Costa: I’m gonna tell more about that later, so…
>> Shobana Balakrishnan: Yeah, you’ll learn more. So with that, I’ll hand it off to you, Glauber, and feel
free to also interrupt with questions, but he’ll be there after the talk, so you can ask more questions too.
Thank you.
>> Glauber Costa: Thank you. So I can put this down, right?
>> Shobana Balakrishnan: Yes.
>> Glauber Costa: Again, just feel free to interrupt me at any time. I would like to start by saying that,
as Shobana was saying, I’ve been working with Linux for the past ten years—in the Linux kernel. This is
the last place on earth I would ever imagine I would be, so… [laughter]. I am very happy to be here. We
were discussing this over lunch; I mean, cloud computing is something that changed the game—in my
opinion—completely. Nobody can afford to be too centered on any one thing these days. So
yeah, I am indeed quite happy to be here. But most of my friends on Facebook—when I checked
in—they were extremely surprised. [laughter] So I’m working for Cloudius Systems…
>>: How do you keynote an open computer? [laughter]
>>: Three weeks ago, everybody was surprised we were there too…
>> Glauber Costa: Yeah, but it’s becoming normal. I mean I got the invitation to come here from Kewai,
and I met him in a Linux conference, so now I’m here. We’re all friends now and it’s a nice world to be
in.
So yes… who are we? We have this operating system that is—according to our presentation
materials—the first operating system designed for the cloud. Every time I say that, people start saying
this is not true, because well—you’re from Microsoft, for example, I know you have Drawbridge, which
is kind of a virtualized operating system, something that is put together to run on cloud infrastructures
or hypervisor infrastructures. The people from Xen Source, they have a lot of library OSes that they
work with as well, but we consider ourselves to be, like, the first operating system that is completely
written from scratch to run in the cloud. So that’s the gig here. I mean, we’re not… so that’s how I usually
open. We’re not a fork of Linux. So we are… most of us worked on Linux before. The startup that I work
with, Cloudius, was founded by Avi Kivity, who was the creator of the KVM hypervisor—the
technical creator, not the CEO of the company or anything. And most of us… like we have with us one
person who maintained, for a very long while, the memory management subsystem in Linux. I worked with
Xen, KVM, the Linux kernel, and all—so we do come from this background, but despite that we don’t
actually share any code with the Linux kernel at all, and we don’t intend to. Our design decisions are
very different. I’m gonna talk about some of them today, and our license doesn’t allow us to share any
code with Linux, which is to an extent by design. So we’re not a clone of Linux; we’re not a fork of Linux;
we are a completely new operating system.
We are drawing a lot from existing open source code. You can see that we’re quite a small
startup. What I’m talking about here is, again, a new thing. Jeffrey heard about it a while ago, when we—
probably when we first announced it—but we don’t even have an Alpha yet. We’re supposed to have
an Alpha in around three weeks, so I mean, we’re very close now to the point that we’re going to have a
product, but we still… we’re still not at the stage of having a product with customer service and
support, sales, and all that. It’s all engineers at this moment. We do have like fifteen people in—
pretty much—ten countries, so we’re distributed… most of the people are distributed… Yeah?
>>: I see a lot of tech companies that are run in Israel, but their headquarters sit in Silicon Valley so
people can phone without dialing a foreign area code…
>> Glauber Costa: Yeah.
>>: So how did it work for you to be officially…
>> Glauber Costa: We do… we have… two months ago or one month ago, actually, we finished the
paperwork to have a presence in the Silicon Valley, so we will have—we don’t have anybody in there
yet, but we will have an office in Silicon Valley eventually. But we are like… the chairman is one of those
Israeli investors that… they seem to do it for fun—like, earning money is something they do very well. So…
but they… we have an office in Israel, but people—as you see here—are really, really distributed. We
have people—I’m in Russia… I’m not Russian, but I live in Russia. We have people in Finland, Sweden,
and Brazil—I’m Brazilian, but there’s another guy in Brazil… really all over. Yeah?
>>: You’re really in Russia? [laughter]
>> Glauber Costa: Yeah, I’m going to tell the story later [laughter] of how I ended up in Russia. So what
we want to do is—that’s our mission statement—we want to build the default operating system for
public and private clouds running Unix-compatible applications. This is not set in stone, so I mean the
mission is something around that. If you go to the website, you’re not going to find precisely these
words—I mean, we’re still working on that, but that’s about what we want to do. This means that we’re
not in the application business at all, so we’re not doing any kind of applications. I don’t expect anybody
to say, “Yeah, I have this application here that runs in Windows, and I’m going to switch to OSv because
OSv is better.” I mean, I don’t even think about it. I don’t even consider this or any other operating
system, for that matter. We do implement, though, most—not all, but most—of the Linux API, and
what we want to do is: if you would be running Linux in the cloud and you run OSv
instead, your application is likely to run—like, very likely… I’m gonna detail what kind of stuff an application does that
prevents it from running on OSv—and it’s going to run faster, better, and using fewer resources.
So here’s how it comes to be: if you look at the typical cloud stack these days—I mean, you’ve probably
not only seen diagrams like that, you’ve probably made diagrams like that when you’re presenting, ‘cause
it’s all very, very traditional. You have the hardware, you have the hypervisor, and then you have the
operating system—Windows in… most of the time for you guys and Linux for other people. Most of the
time, you do have a runtime running—say the JVM, but it could be the .NET platform, it could be
Ruby, it could be any kind of thing. At least in the JVM world, it’s very common to have an application
server like JBoss, Tomcat, whatever, and you put your applications on top of that. Now when you look at
this picture, the thing that became clear for us and the dream that has driven this initiative is that a lot
of those layers, and in particular those three layers here—like the hypervisor, the operating system, and
the runtime, they’re really doing… the intersection between what they do is very big—I mean, they’re
not doing exactly the same thing, but they’re doing a lot of repeated tasks, and by doing those repeated
tasks you’re wasting a lot of resources. So… first of all, you’re wasting a lot of resources, but also you’re
binding yourself to some design decisions, because you need to be—in the case of Linux, for example—
you need to be able to run in this layer and in this layer, and the layer above cannot assume certain
things because it could run everywhere.
So what we want to do is we want to cut down that, we want to merge whatever runs the runtime,
whatever runs in user space—in our case—with the kernel. So there will be no more… and again, this is
something that Drawbridge does in the same way, as far as I understand. Whenever I talk about
Drawbridge, please correct me if I’m wrong. I read the paper, but that’s about it. We run our kernel in
user space, and we do that because we believe that the whole protection thing—which is something
that the operating system spends a lot of time and resources managing and doing—it’s gonna be
handled by the hypervisor. So we believe—and that’s the core of what we’re doing—most people—not
every deployment, but most people in the Cloud—they will be running one application per virtual
machine, and then you have a lot of virtual machines running a lot of applications. The reasons for that
are twofold. So first of all, it’s just simpler to do that. Most of the time in the cloud—we believe—for
example, if you want to reboot your machine, if you want to do an upgrade, it’s a lot easier to just shut
down your VM and come up with a new VM; and if you have two applications running in the same VM, they’re
not going to scale independently—it makes your management more complicated. So it’s not something
that we are proposing people should do—I mean, I’m not coming here to say, “yeah, we believe that
people should run one application per VM.” It’s that we look at the market, we look at
the figures and how things are deployed, and we notice that most of the time—again, there are
always the special case that can be handled separately—but most of the time, people are running one
application, one VM. And if you’re doing that, most of the separation, most of the user administration
privileges, all sorts of ring zero, ring three on Intel processors, et cetera—this is completely useless. In
particular, if you’re running only one application, and your application goes to the kernel—and you
manage to escalate to the kernel—it now has control of the machine. If your application is the only
application, the question is: so what? I mean, if you crash the machine, you crash the machine. You
would have crashed your application—doesn’t matter. So we’re cutting down those layers. And in the
case of Linux—and, to the best of my knowledge, Hyper-V is the same thing; again, I can be wrong about those things—it’s not an external hypervisor like Xen; it’s integrated into the kernel. So it can
run as the hypervisor, and it can run as the guest operating system. And when it runs as the guest, it’s basically wasting
resources, because the same operating system that is designed to be the hypervisor has all those
protection facilities, and it’s all being replicated in the layer above.
So what we’re targeting right now is this thing here—merging the runtime, or the actual application. For
a lot of applications, that’s where your stack ends—that’s your application and that’s the end of it. For
the case of a runtime, application server, et cetera, we’re just crushing the stack, making it a little
bit smaller, more efficient. And if we’re talking about big deployments, any ten percent that we gain—
over one million machines—is a lot.
Alright. The way we do that—so how do we propose to make this integration, to make those things
faster, more integrated, using less resources—is by drawing from this concept of a library OS—library
operating system. Again, I believe you’re all familiar with the concept itself, Microsoft Drawbridge is a
library OS as well. It’s something that you… the way I view OSv sometimes is: if you don’t want to talk
about an operating system—that’s why I like the name library OS, because it conveys both messages—if
you don’t want to talk about an operating system, you can talk about a library that you link your
application against, and that library allows you to boot your application directly on a hypervisor. So it
provides all the abstraction layers that you need—the very basic hardware
communication, boot-up sequence, all of that—that allow your C application—the application that you
just run, your .NET application, your Java application, whatever—to, instead of being run in an operating
system, boot directly on the hypervisor.
As I said, the Xen project has two operating systems that follow that logic: one of them is Mirage, and
the other one is Erlang-on-Xen. They are very different, though, because they’re very Xen-centric—they only run on Xen in paravirtual mode. Paravirtual mode—you probably know—is when
you need to change the operating system; in their case, they just wrote it all again. But they’re not even
general operating systems. Mirage, for example, is almost an OCaml compiler, so it only runs OCaml. What it
does is that you write your application in OCaml, and then you compile your application using Mirage,
and it’s going to put the… all of… if you happen to write to a file, it’s going to link a file system block into
your application—it’s going to link all the things you need to boot on Xen—and then it boots your Erlang
application directly on Xen—sorry, your OCaml application. Erlang-on-Xen does the same thing but for
Erlang. So what we believe we’re doing here is pretty much something between this and Drawbridge. I
mean, we do want to run all kind… all sorts of applications, but on the other hand, I mean—and we
don’t want to be like the Mirage and Erlang-on-Xen; they’re tied to a hypervisor—we want to run
everywhere. I mean, we want to run… if there is a hypervisor, we want to be there. But—and again,
that’s from my readings, correct me if I’m wrong—Drawbridge is actually Microsoft Windows modified
to cut a lot of pieces, right? We are—in that aspect—we are a lot more like Mirage and Erlang-on-Xen:
we just rewrote everything, but not to a specific language. We rewrote everything to try to run as many
applications and on as many hypervisors as we can.
The reason we can do that is… we believe, again, that the rise of the hypervisor is a game changer in this
aspect. Writing an operating system—so to begin with, it’s just fun, right? It’s nice; it’s nice work; it’s
probably one of the most fun things I’ve ever done in my life. Just let’s write everything—but the
[indiscernible] to scan the PCI bus—I mean, this is something that you don’t do because it’s already
done in every operating system.
>>: [indiscernible]
>> Glauber Costa: You do—cool. Isn’t it fun? It is fun, so [laughs]. But the thing is that, these days,
most people believe—I mean—there is no point in writing a new operating system. One of the
reasons for that—and you are all familiar with this—is that if you go look at any operating system, eighty
percent of the code is just drivers—boring drivers for all kinds of network cards, for all kinds of
video adapters. So if you write a new operating system, or a new hypervisor, or a new anything that
talks directly to the machine, you need to either rewrite all this stuff—and it’s not humanly
possible to do that these days, I mean, to support everything—or you run only on a single platform, or you try
somehow to reuse the drivers of an existing operating system. That’s what Xen did, for example, with the
Linux drivers. Xen has the hypervisor, and then has a special Linux guest that has special access so you
can reuse the Linux drivers from—they call it Dom zero—the privileged domain.
For us, restricting ourselves to running on a hypervisor is a game changer, because then you can just write a
new operating system. I mean, there is no… right now we’re running on KVM, Xen, VirtualBox, and
VMware. And even the last two are, like, almost there. So… but in the beginning, we were just running
on top of KVM. So the hardware is always the same: you have to support one network card, one block
driver, and that’s it. You can focus most of your effort in just writing the actual core operating system,
which is what we’re doing. Even if you want—as we want—to run in every possible hypervisor in the
medium, short, long term—I mean, whatever happens, but definitely in the long term, we want to be
running everywhere—there are like five hypervisors. Even if people keep coming up with new
hypervisors, there are never going to be one thousand hypervisors, we expect. So the amount of work
we need to do for drivers for compatibility and for things like that just makes the work manageable. So
this is, again, one of the main reasons why we felt that: yes, it is possible to just rewrite the operating
system, to just start using new concepts, to just start writing out… forget about compatibility, forget
about using whatever hardware is there, ‘cause we’re going to run only on top of hypervisors. So no, we
do not run in any kind of physical machine, and we may if there is a very, very specialized-use case that
requires… but this is not anything we want for the short run.
So OSv itself is written from scratch in C++11. So we’re, like, really trying to be on the edge of everything
here. Most of the time, we still spend some time fighting the compiler, ‘cause there are some weird bugs
that appear here and there. I mean, everything is very new… especially the compiler we use on
Linux, GCC—I mean, it doesn’t support C++11 in its fullest. I don’t think anything does
these days, but we’re using it. So we wrote everything, but we took some parts from FreeBSD. And those
are parts, again, in which we didn’t believe there was immediate value in rewriting. The
biggest example is the file system—like, we really didn’t want to write a new file system, for two reasons.
First of all, it’s not that important. Again—we believe—not all, but most applications in the cloud
won’t be talking to the local disk; they’ll be talking to some kind of shared storage, because if you don’t
do that—I mean—it’s impossible to be stateless; you’re not going to do it. However, some
applications do require a file system. One big example—something for which we’re still implementing some
of the functionality it needs—is the Cassandra database. So Cassandra does not write directly
to the disk; it writes to a file system, so it expects a file system. It takes what, like five to six years to
stabilize a file system, to have a reliable file system? I mean, we’re out of this business—it’s not
something that we want to do. So we just took the ZFS file system from OpenSolaris. The code actually
came from FreeBSD, so yeah… it’s the OpenSolaris code that was moved to
FreeBSD—and we took it from FreeBSD—just like a layer of indirection. The network stack we also took
from FreeBSD—I think we were not the only ones doing that, right? So… but we heavily modified that.
We just didn’t see value in rewriting the TCP protocol and all that. We took it from BSD, but
it’s already heavily modified—the file system, we would rather leave alone.
>>: So… before you… I’m not quite getting yet whether you were motivated mostly by license issues or
technical issues that you thought [indiscernible] involved.
>> Glauber Costa: A little bit of both, a little bit… so again, I have to admit that I am biased in
that aspect, because I’ve been working on Linux for almost a decade, but I do believe that the network
stack of Linux is way superior to the network stack of FreeBSD—that’s my… I mean, everybody always
disagrees about that, but then I have a lot of people on my side. It’s an eternal fight. From a technical
point of view, if I were to make this choice, I mean, I would just take the Linux network stack.
So ZFS is a little bit a technical choice, a little bit a license choice—technical in the sense that most of the
Linux file systems are open source, but they’re GPL-licensed. They are very tied to the Linux VFS, and
ZFS is a project that is more independent. It already runs on BSD, it already runs on Solaris, and they’re
actually trying to come up with a project called OpenZFS, which is the core of ZFS turned into a library—
sort of thing like that—so it can run with the same code on all those different operating systems, even on
Linux if you want to. The license, actually, is not compatible with the GPL, but you can run it on Linux in
user space, using FUSE—file system in user space. But we felt that for us, from a technical point of view—
not in terms of speed, performance, et cetera, but manageability—it would be a better choice to use ZFS,
because it’s more self-contained—it’s less dependent on an infrastructure. Once we made this choice,
though—I mean, the ZFS license is not compatible with the GPL—so we don’t… we just can’t use the
GPL, we just can’t use code from Linux. But it is a little bit of both, because if you really wanted, then we
could make the effort and say let’s use the Linux network stack, and let’s just then get one of the file
systems from Linux. So it’s a little bit here, a little bit there. But as you pointed out—I mean—it does…
we are—I mean—all doing this open source, but we did have a slight preference towards the BSD
license, a little bit from the business point of view as well. So that… we can talk more about that later,
but for example, with the GPL we would always be required to give the sources for everything. So if you
look at what Red Hat does, for example, with their business, you have the binary. Once you ship the
binary, you’re mandated by the license to ship the source for that binary. They cannot have, for
example, a stable version that is kept in house and is shipped just as a binary. With the BSD license we
can. We’re not sure yet that we want to do that, but we want to leave the doors open. So it was just a
friendlier license from a business perspective as well.
So when I say that we want to run Linux applications, what I mean by that is that we implement the
POSIX API that Linux uses. So, Linux doesn’t implement POSIX exactly. It implements a different
version, a completely Linux-centric one with a lot of extensions. And as Linux became more important in the
Unix world it became like the de facto POSIX specification, and that’s what we implement. So every
system call that is available on Linux we implement, except for creation of new processes and I’m going
to detail why and how in a while. For a variety of reasons, it’s better if you have a runtime because if
you have a runtime, your API now is the API of the runtime. For example, if you’re running Java, you’re
not talking to the operating system, and that also gives us more flexibility, because if you want to
change… so, as I said, there are some system calls we don’t support, there are some system calls that
we’d like to introduce, and still we would like the application developers not to have to care about that.
So we can make the changes, and we do make the changes, to the runtime, to the JVM. So we run a
modified version of the OpenJDK JVM, and all the modifications we have go into the runtime. The
application stays unchanged. So the runtime is yet another layer that can serve as a middleman here to
allow us to, at the same time, maintain compatibility and go crazy in the operating system world. Now,
that’s the most important thing in our design. Because we believe that most people are
going to be running one application per virtual machine, we don’t support running more than one
application per virtual machine. And once we do that, we can actually simplify the design of the
operating system by quite a lot. So, that’s actually the main technical point of the OSv architecture and
that’s, that’s one of the things that… Please?
>>: Do you support multiple threads?
>> Glauber Costa: Yes, threads, yes. So, good question about threads. One of the things by which we gain a
lot of performance is that we never, ever, ever switch the control register. Well—only if
you invalidate one mapping, yes, then we do. So we do support, for example, the Linux
mmap system call that creates a new mapping. But if we switch from one thread to another, we
never, ever, ever flush the TLB. And because we don’t…
>>: One address space
>> Glauber Costa: Huh? It’s one address space. Exactly. So, we can… we have multiple entities running in
parallel, but all of those entities share the same address space—all of them, which makes the
implementation of shared memory extremely easy as well. [laughter] But if you look at some of the
micro benchmarks, our context switch time is really fast. Because there… it’s just
really switching registers, I mean, there’s nothing else to do. You don’t have cache misses post
context switch, you don’t have anything. We have no users, we have no admin and user this, user that,
and roles and privileges. So again, that allows us to completely simplify the architecture of the operating
system. The way this is important, in my view, is twofold. So, first of all, there is a direct performance benefit.
We’re just faster because we’re not flushing the TLB at any time—
only when you specifically want to invalidate a mapping. So, it just makes it faster. But most of the
gains don’t come from the fact that you’re doing that operation faster. They come from the fact that
once you do that, you’re just simpler. And if you’re simpler, I mean, you’re more free to make certain kinds of
design decisions, and through both of those we achieve better performance. So one of the goals of OSv is to have
better manageability, as I was talking about earlier with Jeffrey, but also better performance—both in
terms of less resource usage and less latency and all kinds of things that the applications
may want. We do run user space as well—I mean, as I said, it’s kind of obvious: we don’t have any kind
of user space/kernel space separation. So those system calls that we implement from
Linux—they’re not really system calls, they’re just library calls. You just call a function and come back,
with no mode switch and nothing fancy going on. It also allows—and that’s something
that Drawbridge could do as well—it also allows the application to use the hardware directly. So, if the
application, for example—and we have some examples of applications that do that—if the application
wants to track dirty memory, the way it usually does that is by scanning. Like the JVM does that for garbage
collection. It needs to scan—or, in the case of the JVM, it inserts marks in HotSpot, in the JIT compiler,
to note “I’m changing this memory access,” and then it needs to mark and sweep. If you have
access to the hardware directly… it’s not that we’re running the kernel in user space;
we’re actually running the application in kernel space. It’s the other way around. So you have direct
access to the hardware. This is one of the projects we’re working on. It’s not complete. It is not even
close to completed. But the JVM can actually just… instead of going and sweeping the memory, you just
let the hardware do that for you. I mean, you just have the page tables at your disposal.
Whenever somebody writes to that area, the processor is going to flip a bit for you, you win. Automatic.
>>: So you’re redoing Brian’s work … [indiscernible]
>> Glauber Costa: Uh… yeah. [laughter] In a sense, in a sense. But again, the main difference is that
now we don’t have any user space. Like if you crash, that’s the whole idea. If you crash, you crash. You
would have crashed in user space, now you crash in kernel space. Doesn’t matter. You’re not crashing
anybody else. The whole user space, kernel space separation—the problem with that is that you crash
yourself, you crash somebody else. And here, you’re free.
>>: If you read the fine print, haven’t you just reinvented VM—CMS?
>> Glauber Costa: We… we have reinvented so many things, like… [laughter]. But it’s like that, I mean
we reinvent stuff but then…
>>: [indiscernible]
>> Glauber Costa: Yeah. [laughter] I do read Wikipedia, so… [laughter] But for example, on the
very first day that we announced OSv, one of the Linux journals that I always read
published an article about us, and one of the guys commenting said, “yeah, so you’re just reinventing
MS-DOS.” Yes and no—I mean, there are some new concepts… but one of the things that I love about
computing is exactly that. I mean, we always bring this stuff back from the past to the future again,
rereading it with the lenses we have now. It happens. But, yes, there are a lot of
concepts that… they’re not exactly new; they’re just rebranded for a new kind of world.
We are a very new startup. We are a very new project. We are about one year old, so it’s impressive
that we could just write a new operating system in less than one year. So, in less than one year we were
already running a large collection of applications. One of the reasons we can do that, of course, is just
because the whole design is so much simpler. I mean… right now we’re actually moving a lot
slower than in the beginning, just because, I mean, we picked all the low-hanging fruit—that
was enough to run the full JVM, by the way. Now we’re moving forward. I mean, we don’t want to run
just the JVM, we want to run the whole POSIX API. So, we’re moving a little bit slower now just because
the problems are more complicated, but they are still a lot simpler than they are in a normal operating
system. We are—again, it should be obvious by now—completely virtualization-oriented.
We don’t intend to run on hardware. We may in the future, if the need arises, for
something really, really, really specific. But one of the things that we take into account—I’m not sure if
you’re familiar with the lockholder preemption problem. It’s a problem very typical of virtualization.
So, if you’re holding a spinlock and you’re running on hardware—you’re running on a laptop—that means that very
soon you’re gonna release that spinlock. So, that’s why you busy-wait. You busy-wait because it’s
not gonna be for a very long time. However, in virtual machines there is this
problem… again, lockholder preemption… imagine you’re holding a spinlock—the
hypervisor may take your virtual CPU off the physical CPU, so that virtual CPU is no longer running. But then it
schedules another virtual CPU that is now waiting for the virtual CPU that is holding the spinlock. And all
the time this waiting virtual CPU is running is just wasted, because the holder is not going to release the spinlock—
it’s not even running on a CPU. It’s off the CPU. And the guest operating system has no
control over that whatsoever.
>>: I mean, I understand that you have the opportunity to avoid spinlocks here, but you don’t find that
paravirtualization handles that correctly?
>> Glauber Costa: Oh, yeah. So, no, paravirtualization is one option to do that. So how Linux tries to
solve that problem, for example, is by doing paravirtualized spinlocks. It tries to communicate with the
hypervisor. The hypervisor tries to tell you, “yeah, your virtual CPU is not running, so you
shouldn’t spin, you should go to sleep,” for example, “you should do something.” We could have
done that, but this is a lot more like a legacy style: you already have the spinlocks, and it’s very complicated
to run everything without a spinlock. We had the opportunity—we’re designing everything from
scratch—so we just decided we’re not going to use any spinlocks, at all. So, there is no spinlock in OSv,
at all. And this is, this is… I’m a very lazy person; that’s also something about me. I mean, I hate
working. Like, if it were up to me I would be drinking whiskey on the beach every day. So…
>>: In Russia?
>> Glauber Costa: Yeah. [laughter] They have the Black Sea and now they have a larger piece of the
Black Sea so… [laughter] So, it’s cool. But, one of my preferred ways of solving problems is getting rid of
the problem. So we could have done, like Linux does, paravirtualized
spinlocks. But again, paravirtualization is completely dependent on the hypervisor. Not all hypervisors
support it. Not all hypervisors support it in the same way, and if you want to support it across the board,
we need to write one paravirtualization scheme for KVM, for Xen, Hyper-V—I don’t even know if he has.
Like we know nearly nothing about Hyper-V.
>>: That part’s published—some of it.
>> Glauber Costa: No—yeah. But again, it’s not about being published or not. I mean, we need to go
there and take a look, but whatever scheme you have, it needs to be done, it needs to be coded. If I
just write the operating system without using any spinlocks, I don’t have the problem anymore. That’s
the perfect solution: get rid of the problem, right? So we really have no spinlocks. The easiest way to
get rid of spinlocks completely is, of course, by using mutexes, because then you just go to sleep every
time you contend on the lock. But the mutex itself… most mutex implementations use a spinlock to
protect the queue. You have a queue of waiters, and then you have a spinlock to control and protect
that queue of waiters. So, we are using… I wasn’t the one that implemented this. It was a colleague;
he’s a mathematician, so he knows all sorts of crazy stuff, and he implemented most of our scheduler as
well. There’s fairly recent work, from around 2008, where a group designed a lock-free mutex: a mutex
that doesn’t use any kind of lock to protect its internal queue. We’re using that one. Again, it may be
too complicated, too cumbersome to retrofit into an existing operating system, but we are doing this
from day one. That’s just the kind of decision we have been making. If we’re going to run virtualized,
we’re not going to do spinlocks—and we don’t.
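The core trick of such a lock-free mutex, keeping the waiter queue consistent with atomic compare-and-swap rather than an internal spinlock, can be illustrated with a minimal Treiber-style lock-free stack. This shows the flavor of the technique only; it is not the published algorithm OSv actually uses.

```cpp
#include <atomic>

// A lock-free stack: the kind of structure a lockless mutex can use to
// hold its blocked waiters without taking any spinlock itself.
template <typename T>
class lockfree_stack {
    struct node { T value; node* next; };
    std::atomic<node*> head_{nullptr};
public:
    void push(T v) {
        node* n = new node{v, head_.load(std::memory_order_relaxed)};
        // CAS loop: on failure, n->next is reloaded with the current head.
        while (!head_.compare_exchange_weak(n->next, n,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {}
    }
    bool pop(T& out) {
        node* n = head_.load(std::memory_order_acquire);
        while (n && !head_.compare_exchange_weak(n, n->next,
                                                 std::memory_order_acquire)) {}
        if (!n) return false;
        out = n->value;
        delete n;  // real code needs safe memory reclamation (ABA, hazard pointers)
        return true;
    }
};
```

A production mutex additionally has to hand the lock off to a waiter and deal with memory reclamation, which is where the real complexity lives.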
We have no complicated hardware model. So, for example, our PCI discovery… again, this is more of an
anecdotal example, because you don’t do PCI discovery all the time…
>>: Why do you ever do it?
>> Glauber Costa: Huh? When you boot.
>>: But you’re running in the DL.
>> Glauber Costa: Yeah, but again, I don’t know about Hyper-V, but with KVM or Xen or VirtualBox…
we’re not running, for example, PV Xen; we run HVM Xen. So it provides you with the view of a
machine, and that’s how you can run everywhere. So, you have a PCI device—it’s there. And each of
the virtio devices, for example, that KVM uses for networking and block, they’re all exposed as PCI
devices.
>>: Yeah, we can do that.
>> Glauber Costa: Yeah, so, but again, I’m interested in knowing more about how Hyper-V exposes the
devices, but even Xen, in HVM mode, can expose a PCI bus, just to hook some devices over
there. So, again, we do PCI discovery, but we don’t do the most complicated PCI discovery, handling
every case possible on earth. How does KVM implement it? That’s what we handle. So our
hardware model is a lot simpler. We don’t support multiple types of busses, multiple types of
interrupts; it’s what we have in the virtual world and that’s it. So, it’s a lot simpler in this way.
We also try to take into account things that are traditionally more expensive in hypervisors. Setting
timers is a lot more expensive in hypervisors than it is on physical machines. IPIs too, because you have
a lot of exits when you try to communicate with other CPUs. So we have a polling
scheme, for example: before we go idle, we always keep polling other CPUs for more work, to avoid
IPI communication. We get some performance benefits from that as well. And these are all things that,
if you have something that needs to run everywhere, might not be the
best thing. Linux, for example, is trying to implement the very same polling
mechanism that we have, but they need to make it coexist with all the other mechanisms. For us
it’s just a lot simpler. We do things thinking about virtual machines, mostly HVM machines. So, as I said,
we don’t even support Xen paravirtualization. We’re targeting the kind of
clouds that are gonna be available in two years, three years, and that’s what we want to do, as
simply as we can. Simplicity is a feature for us. [pause] When I met KY in Edinburgh,
at the Linux conference in Edinburgh, instead of giving T-shirts away we were giving boxer
shorts with the motto “less is more.” [laughter] So simplicity is a feature; we’re trying to be as
simple as we can, and we believe we’re gonna harvest the benefits of that. Again, this
is not VM-specific, but we also have a completely fair scheduler. So, again, we have
everything that an operating system would have. That’s why “library OS,” as I said, is a good term: it’s both
a library and an operating system. I don’t know about Windows, but on Linux, for example, for each page
you have metadata about that page, a 64-byte structure that describes each page in the
system. We don’t have that. Also, we run only on 64-bit; we don’t run on 32-bit machines at all.
So we made this decision; you can see it’s a pattern. We don’t want to support
legacy. We don’t want to do legacy. We’re writing something from scratch for the
virtual machines that are going to be there in two, three years. So, we have…
>>: 64 byte structure, not 64 bit?
>> Glauber Costa: No, the structure for each page is 64 bytes, or 32 bytes depending on… it’s
between 32 and 64 bytes…
>>: Sorry. I’m learning more about Linux in your talk than I am about your OS. This is really interesting.
>> Glauber Costa: Yeah. [laughter]
>>: The problems you’re working around are the most interesting part. [laughter]
>> Glauber Costa: Yeah, yeah, of course.
>>: [indiscernible]
>> Glauber Costa: I don’t think any of those solutions are incredibly complicated and
ingenious. It’s just that once you decide not to handle all sorts of stuff, and you say “I’m only going to do
this,” you can simplify. Again, I have no idea how Windows handles page metadata
or anything like that, but in Linux you have this structure. It’s even aligned in a special way, in groups
of two words. So you have, in some variants, up to 64 bytes; it depends on the configuration options that
you have, but at the very minimum 32 bytes. You have flags, you have the virtual address, you have
all sorts of information about the page in there, and that describes each page in the system. And if
you have quite a large machine, that’s quite a bunch of memory that goes only into metadata for pages.
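To put a number on that: with a 64-byte descriptor per 4 KiB page, the metadata alone is 64/4096, about 1.6% of RAM, which on a 64 GiB machine is a full gigabyte. A back-of-the-envelope helper, using the page and descriptor sizes mentioned in the talk:

```cpp
#include <cstdint>

// Bytes of per-page metadata for a machine with ram_bytes of memory,
// assuming one desc_size-byte descriptor per page_size-byte page.
constexpr std::uint64_t page_metadata_bytes(std::uint64_t ram_bytes,
                                            std::uint64_t page_size = 4096,
                                            std::uint64_t desc_size = 64) {
    return (ram_bytes / page_size) * desc_size;
}

// 64 GiB of RAM -> 16 Mi pages -> 1 GiB of struct-page metadata.
static_assert(page_metadata_bytes(64ULL << 30) == (1ULL << 30),
              "64-byte descriptors cost 1/64 of RAM");
```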
Again, we’re not doing that. If you talk about page tables, the amount of memory that we use for page
tables is also a lot smaller, because we only have one set of page tables, not many
sets. All of those decisions allow us to be more resource-efficient. And
again, we try not to compare ourselves to any other operating system other than Linux. The way I
view it, for example—this was something for the end of the talk, but I’ll say it now—is that we don’t
really want, and there isn’t even any sense in, OSv being the new operating system for everyone,
especially because if you wrote your application for Windows, I find it highly unlikely that
you’re just gonna rewrite it. If you haven’t rewritten it by now to run on Linux, you’re not gonna rewrite
it now to run on OSv, even if you would harvest the benefits of running a specialized, virtualized
operating system. Right? Right.
So, this is for whoever is running Linux, to run a better Linux. We’re trying to be a better Linux than
Linux by focusing on a niche, the cloud niche, that we believe is gonna be quite important in the
upcoming years, and I’m fairly sure you agree.
Security: So there is the obvious thing, of course, that you have nobody to attack,
because you’re running alone on the machine, and if you crash, you crash. We trust the hypervisor to
isolate the virtual machines. That’s one of the questions I get asked the most: “But what if somebody can
breach into the hypervisor?” Then it’s not my problem. Call support for the hypervisor
vendor—call Red Hat, call Microsoft, call whoever—because they’re doing a clumsy job. But
as long as the hypervisor maintains isolation, there is nobody you can attack. I mean, you can attack the
application, yes. You can DoS an application, you can shut an application down, but you
don’t have isolation problems. Not at all, because, again, we’re being lazy and just getting rid of the problem.
You have fewer components than a traditional operating system. OSv is just a monolithic library that
you link against, and everything is inside that library. You have fewer moving pieces, fewer moving parts;
again, it’s just simplicity. And a much smaller attack surface, of course. We don’t have a very,
very complicated kernel. So, I think the summary for this is: we don’t have any special
security model. It’s not our focus; that’s why I don’t spend too much time on this slide. But even
though it’s not our focus, there are still some benefits to be gained from the fact that our simplicity
results in a smaller attack surface, et cetera.
As I said, you have no users at all, so there’s no privilege escalation, ‘cause there’s
nowhere to escalate to, and no untrusted processes. Everything there is running inside the VM, and we
assume it’s a trusted entity. If you’re doing something that is untrusted, it’s going to be running
in a different VM. And we are writing C++11, and as long as the compiler does a good job in translating
that into actual code… I had no C++ experience before this at all. I mean, Linux is written in
C, so I hadn’t, pretty much… Linus, as you may know, hates C++, so we all kind of went along…
Please?
Please?
>>: Granted the compiler does a good job, but is there any difference in terms of security between
writing C++11 or writing Fortran? You write a language, the machine does what you tell it to.
>> Glauber Costa: Yes, yes, but keep in mind, for example, that this is also the first
time I’m talking to a non-Linux crowd. When we talk about Linux: Linux is completely written in
C, and C is a language in which you need to write out everything you want to do yourself. C++
does a lot of that automatically for you, and especially C++11 does it in a very efficient way. The
reason Linus never liked C++ to begin with is that it has so many abstractions that you lose control of
what is going on, you lose performance… according to Linus, it’s just not the
language to write an operating system in, because you lose a lot of control. C++11 is allowing us to have a
very decent level of control. The primitives are really efficient. You do understand what’s going on.
But, for example, a common problem in Linux: you take a mutex and you forget to release the mutex.
That’s something that never happens for us, because we use the resource-acquisition-is-initialization
(RAII) idiom: whenever you have a block, you lock your mutex and write whatever you want
inside that block, and it doesn’t matter where you leave that block from, the mutex is going to be
automatically released. Likewise, we are using a lot of shared pointers, so reference counting is
done automatically. Whenever we want to do reference counting, we never get it
wrong, because we have the compiler do it for us. And again, I have no idea what kind of
tools and processes you have in Windows, or even the language you code in, but
compared to Linux there’s a lot less opportunity for mistakes. We spend a lot less time thinking
about buffer overflows and dangling pointers and memory leaks, because they simply don’t happen.
And that’s without sacrificing performance, because we understand exactly
what’s happening. C++11 introduces a lot of new, interesting stuff to prevent those problems
from going on.
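The RAII point can be shown in a few lines. This uses the standard `std::mutex` and `std::lock_guard`; OSv has its own mutex type, but the idiom is the same: the lock is released on every exit path from the scope, including an exception.

```cpp
#include <mutex>
#include <stdexcept>

std::mutex m;
int protected_value = 0;

// The guard locks m on construction and unlocks it when the scope is left,
// no matter how the scope is left: normal return or thrown exception.
int bump(bool fail) {
    std::lock_guard<std::mutex> guard(m);
    ++protected_value;
    if (fail)
        throw std::runtime_error("leaving early");
    return protected_value;
}  // guard's destructor releases the mutex here
```

Because the unlock lives in the destructor, "forgot to release the mutex" is not a bug you can write; the same mechanism gives `std::shared_ptr` its automatic reference counting.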
>>: Do you use exceptions?
>> Glauber Costa: What?
>>: Do you use exceptions?
>> Glauber Costa: We support them, of course, because at some point you’re going to run an
application, but we don’t rely on them in kernel code. To tell you that we never use them
would be a lie. In some very special code that we took from somewhere, for
example the parsing code for the command line, we handle exceptions. But most of the actual stuff,
I mean, we don’t use exceptions at all. If an exception happens…
>>: I was just curious, because that’s a common difference between C and C++…
>> Glauber Costa: Yes. No, so, every time we code, we code to be exception-safe, but we
don’t throw; none of the classes we write throws an exception, none of that. Alright.
So, coverage: where do we run? Again, keep in mind that this is not final; we’ve been at it for around
one year. It’s obvious that we run very well on KVM, and most of our good results, most of our benchmarks, are
produced on KVM. This is because Avi started this—Avi created KVM—so there’s possibly nobody
in this world that understands KVM better than Avi. He knows how to get all the performance from it,
all the good things. I was a KVM developer for five years. Another guy, who has been working with us from
the early days as well, was the person who implemented nested virtualization inside KVM. We have
another guy that worked on KVM as well. So about half of our crew worked on KVM; it’s just the thing
we are most used to. We also run on Xen. The reason we targeted Xen so early is
because we wanted to be present on Amazon, on AWS. And right now we run on VirtualBox and
VMware. The reason we picked VMware is that we have a partnership with VMware; we found a
contact there who said “yeah, we’re gonna help you with some code, we’re gonna help you with
questions,” et cetera. VirtualBox was just something that would allow us
to run, at least for testing purposes, on Windows, Mac, Linux, whatever, on the same hypervisor. So, the
question I was asked, for example, by KY when we met was, “Do you plan to run on Hyper-V?”
Yes, we do. We would love to run on Azure. Right now we’re running on Amazon and Google
Compute Engine. Google Compute Engine is not on-line yet, but when it is, we already
have a partnership with Google: we’re gonna have an OSv image from day one, so people can pick an
OSv image from the list and just run it. We’re partnering with them even before the platform
goes live to have an OSv image from day one. Why Amazon? It’s because…
>>: When you say partnering with them, does it just mean the image is available or…?
>> Glauber Costa: It depends. With Google it means one thing, with VMware it means another thing,
with Amazon it means another thing. It depends on the deals we manage to cut. For Amazon, we
basically have an engineer there who helps us: answers some questions about their future directions,
helps us plan a little bit. They don’t actually contribute anything. There is a guy from Google that
contributes some code, so we have a stronger partnership with Google in that respect. And the
reason we’re running on Amazon is because, of course—I know you’re trying to change that and
Google’s trying to change that; I’m gonna love to see the fight—Amazon is still the biggest public
cloud. So, that’s something we just have to run on. And Google Compute Engine is running on KVM. So,
for us it’s just a no-brainer: it’s running on KVM, we run on KVM, let’s run it. We also have a very good
relationship with Google because of people that we know and who worked with us on KVM
as well. So I would say that of all those companies, the strongest relationship we have so far is with
Google. But we would love to have more at some point, depending on our bandwidth; again, we are
around fifteen people at this time, expanding, but not… I would like to double every day, but it’s a…
[laughs]
>>: So [indiscernible] would mean flagging drivers…
>>: Yeah, I mean it’s just a question of whether they can pull those out of Linux or not. We gave
them to the Linux kernel. Do you think you can pull the concepts out, or do you take the code?
>> Glauber Costa: So, we definitely can’t take code from Linux. That’s a license issue.
>>: So we haven’t documented our protocols. What we have done is give code to the Linux project.
>> Glauber Costa: If you’re willing to do that for us as well, I mean, one of the things I can guarantee is
that it’s going to be simpler. Yeah?
>>: Right. So, I mean, I can’t sign people up to do work there, that’s not my job. But [pause] what you
said earlier is actually kind of interesting. You said that you don’t want to use paravirtualization, at least
for the whole process. I would have expected you to say the exact opposite.
>> Glauber Costa: So, let’s just… that’s also a personal pet peeve of mine. I sometimes use the term
paravirtualization a little bit differently, because I come from the Xen era. I contributed to Xen very
early. And when Xen came out with this paravirtualization thing, what
paravirtualization meant at the time was rewriting the whole operating system to run on top of a
hypervisor. Which meant: I’m not gonna write to the CR3 register, I’m gonna bypass that; I’m gonna
do memory management my own way. So in my mind this is paravirtualization. What
you call paravirtualization is probably what I call paravirtualized drivers. Which is just… Yeah?
>>: [indiscernible]…that’s somewhere in between the two…
>> Glauber Costa: And then of course there’s the gray area. So, for KVM, we do run paravirtualized in
that definition, because we run virtio drivers. We do support IDE drivers now, because of
VirtualBox, but in the beginning, for example, we didn’t; we just
had virtio network, virtio block, and the virtio PCI device, whatever it is. But those are paravirtualized
drivers. What we don’t do… huh?
>>: [indiscernible]… block and network.
>> Glauber Costa: It’s basically block and network, and whatever else you have that you want to expose.
KVM, for example, has a balloon driver that allows you to share memory between the operating system
and the host; we don’t implement it. We might; we haven’t because it’s not a top
priority. When I say we don’t support paravirtualization, it’s on Xen, specifically, which is pretty much
the only hypervisor that relies heavily on modifying the whole operating system. When we run on Xen,
we only run in HVM mode. So we don’t do the Xen hypercalls, and the way we set up
page tables on Xen is to write to CR3, which basically means Xen HVM mode. So we…
>>: [indiscernible]… bringing a lot of other stuff with that.
>> Glauber Costa: Yeah, I mean, we personally consider paravirtualization, in the sense of
paravirtualizing everything, to be legacy. We live in a post-VMX era, so the hardware is gonna do
everything for you, and whatever the hardware does, we tag along. For drivers, we
do paravirtualization, yes. And that’s what we’re basically interested in: when we talk about
supporting multiple hypervisors, what we have in mind is really getting the drivers, which is a block
driver and a network driver for Xen, KVM, VirtualBox, VMware. So the work on VMware that we’re doing
now is basically the VMware network driver, and that’s it. And if your hypervisor has some facility that
you…
>>: [indiscernible]…follow up with KY cause…
>> Glauber Costa: Yeah.
>>: [indiscernible]… shouldn’t be an issue, because we made the drivers available for BSD as well.
They’re our drivers, so… When you put them in Linux you have to put them under [indiscernible] license,
but we can put them under a different one.
>> Glauber Costa: Yeah.
>>: But let me follow up with KY.
>> Glauber Costa: Sure.
>>: Thank you.
>> Glauber Costa: But this is as far as the hypervisor goes, as far as the cloud computing
platform goes. We’re also interested in drawing partnerships on that front. I mean, we’re not
fully set. Shobana was asking me about that earlier: we’re not completely set on the business
model. We believe it’s premature to have a full plan for how it’s gonna be. But most of
our ideas revolve around per-hour paid support: say the cloud charges thirteen
cents per hour for the CPU, and we charge an extra three cents to run a supported version of OSv that
comes with a support plan, whatever. So, the way we’re going to present it in each of the clouds is
going to be different, depending on what we can arrange. For Amazon, right now we’ll just drop an
image in there. With Google we have a stronger relationship, so long term we would like to
be the go-to image for people running Linux applications. But in each of the platforms it will
depend a little bit on…
>>: Are you gonna present numbers in terms of size and friction… manage to…
>> Glauber Costa: Yeah, next slide.
>>: Okay.
>> Glauber Costa: Yeah. [laughter]
>>: Sorry.
>> Glauber Costa: So, yeah. Resource usage: the base image that we have is
pretty much just a kernel and a small hello-world application that you run and forget
about. That’s about seventeen megabytes of disc image. We can do better, but we’re just not
prematurely optimizing anything; it’s around seventeen megabytes. That gives you the full
libc-compatible POSIX layer that we have so far. So with this image you can drop in a Linux-compatible
application that doesn’t come with a bunch of libraries. Again, if there are a lot of
libraries you depend on, you need to move them into the image as well. But that’s the smallest it
gets at this point: seventeen megabytes. In terms of memory
usage it’s slightly bigger than that, but it’s not gonna be twice as much. It’s basically just loading that into
memory and initializing a couple of data structures. If it’s twice that, it’s [indiscernible].
Obviously that image doesn’t do anything very interesting. What we try to do is bundle stuff
together. For example, mruby… it doesn’t implement the full set of Ruby, but it’s
still like a mini Ruby that implements most of the things that Ruby does, and we do that in twenty-five
megabytes. So in twenty-five megabytes you have a Ruby shell in which you can run Ruby applications,
just as you would run them with mruby on Linux, for example. We didn’t write this mruby thing;
we just ported it from Linux to OSv. And some people are already using it. If you want to run, for
example, embedded Ruby… don’t ask me why, but some people do it. That actually came from an
external contributor, so we just said: hey, cool.
The main image we’re targeting is, of course, the Java image, at least so far. It fits
in more-or-less 250 megs, and this is just because we want to ship the full JVM so you don’t run into
missing pieces. We see a lot of value not only in resource usage… it is resource usage, but the most
important and most expensive resource is usually human beings. So, we would like in the future
to try to pinpoint which libraries of the runtime environment you actually need, but right now
we just copy the entire runtime environment into the image. The OpenJDK is
around two hundred megs itself, plus some dependencies: two hundred fifty. And we do run on Amazon,
and that matters as well, because we’re targeting the clouds. You can tell me about
how you size those things, but for Amazon the smallest instance they have is six hundred
megabytes of memory. So, what’s the smallest you have… is it?
>>: I think there’s a… five twelve, five twelve to your mini…
>> Glauber Costa: So five twelve is quite close to six hundred…
>>: Yeah.
>> Glauber Costa: …more-or-less, yes. So we also don’t see a lot of value in having specialized
images that run in ten megabytes, ‘cause that’s not what you’re going to find in the
clouds out there. It’s just that we’re not wasting expensive human resources to save ten
cheap megabytes. But those are pretty much the base numbers.
>>: Seven sixty-eight, seven sixty-eight.
>> Glauber Costa: Seven sixty-eight?
>>: Close.
>> Glauber Costa: Yeah.
>>: So, how many of those pages are even…?
>> Glauber Costa: Coming back—pretty much just the current ones. So again, everything you map here is
read-only, because it’s just the disc image, right? You’re gonna pretty much copy this
image into memory, so this memory is gonna be mostly read-only, and then you have your data. As I
said, a fully booted OSv image with all data structures initialized is not gonna be
twice that size. So we can fully run, for example, in less than six hundred megabytes on
Amazon with the JVM. The actual image that we run, in terms of disc size, is even bigger, because we
have the OpenJDK and a shell and an example shell so you can just get acquainted. That
image is around four to five hundred megabytes, and it fits in the six-hundred-megabyte
instance. So, it’s around that.
Performance: So, all the numbers I have here are compared to Fedora Linux. Again, for me, it’s one
hundred percent pointless to compare to anything else, like comparing to Windows, et cetera. We
may do it for fun in the future, to see how it compares, even to see possibilities
for improvement and to get ideas. But since we are implementing the Linux API, the POSIX
API, Linux is just really the thing we’re comparing to. We’re also not going to benchmark
against every single Linux version out there, because otherwise we wouldn’t do anything else in our lives.
But this is Fedora. Some of those images are running Fedora nineteen, some of them Fedora
twenty. So we’re talking about, at most, a year-old Linux kernel; between six and eight
months, something like that. So, system calls: as should be obvious from the design explanation at
the beginning, they’re pretty much free. It’s just the cost of saving a bunch of registers, the
user-space registers, and restoring them again. There is the FPU as well… oh,
another thing: we use the FPU in the scheduler as well, ‘cause we have to support the FPU
anyway; the applications are going to use it. The applications expect an FPU. I don’t know a
single processor from the last ten years that doesn’t have an FPU, so, I mean, we just use it.
I mean, our scheduler does floating-point calculations all the time, since for us user space and kernel
space are the same thing. Saving the FPU is the most expensive thing that we do
in the context switch. We don’t do it all the time, though. That’s pretty
common; even Linux does that. We make informed guesses, of course,
conservative guesses, about when you need to save the state and when you don’t.
We don’t switch the FPU all the time, but when we do, that’s the most expensive operation in the context-switch
path. We are around four times faster in context switches than Linux, the Linux kernel running
Fedora—Fedora running, of course, the Linux kernel. [laugh] That’s a micro-benchmark. You’re not
going to be running context switches all the time, but from the micro-benchmark numbers you
try to get a grasp of the actual final numbers.
For networking we’ve been running constant tests on netperf. Again, comparing with Fedora Linux,
we’re getting around twenty percent better on TCP workloads. So, just networking. Part of that is
because we copy fewer buffers: we don’t have any kernel-space/user-space separation, so we have no
need to copy the buffers across it. We do still copy; we’re not zero-copy, because the API doesn’t let you
be zero-copy. The Linux API for network transmission is pretty much like the file API: you just
pass a buffer to the socket, and until you get the acknowledgment that the TCP transmission
finished, you need to keep the buffer around. It’s really hard to do full zero-copy, but we don’t need
kernel-to-user-space copies of any kind.
For UDP we can be a lot more aggressive, so our numbers on UDP are a lot higher. We can go up to fifty
percent better in terms of transmission. And, we don’t do it yet, but it’s also a lot more
manageable to do zero-copy on UDP, because you don’t have to keep the
buffer around after transmission, so you can transmit the buffer directly. The receive side is trickier, but
we can do it as well.
For Spec—SpecJVM—we’re not actually expecting a lot of performance improvement, because it’s
mostly a benchmark designed to test the JVM itself. So most of the time we expect it to be doing
primitive operations. They have a bunch of crypto programs; there are like
twenty or thirty different programs. They do call eventually into the operating system, and those
are the opportunities we have to improve. We lose in some benchmarks,
we win in others, and the aggregate value is around three to five percent in our favor.
Memcached: it’s one of the things we’re doing best at this point. We’re around forty
percent faster. Most of those gains come first from networking, because we just have faster
networking to begin with. Also, the memory allocation is more efficient. This is a UP benchmark, so
it’s not SMP. With SMP we still have some problems; most of the
problems just come from our load balancer not being mature enough, so we don’t have the level of
CPU utilization that we would expect. For networking, we were suffering from the same thing. Those
results are all UP. When we go SMP, sometimes we start to lose a little bit; we don’t scale that
well yet. We don’t believe that this is something from the architecture. We believe this is just something
that we need to put effort into. Another reason we’re targeting UP is that, in the
beginning, I don’t expect, for next year, for example, anybody to be seriously deploying OSv on all their
servers. So, most people are going to be trying it, and on the small instances they’re likely to be single-
CPU. So, that’s where we’re focusing. And of course, we boot the whole machine in less than one
second, which…
>>: When you say single CPU you mean…
>> Glauber Costa: Virtual CPU
>>: Single-core logical CPU
>> Glauber Costa: It’s a single virtual CPU. So, whatever: one run queue, one execution entity,
whatever it is. The hypervisor at the end of the line can be as big as you want, but what matters is
how many virtual CPUs you assign to the operating system. So, we’re designing OSv to go up to sixty-four
CPUs. We didn’t see a lot of value in extending that. Maybe there are some people in
the cloud running really big VMs, but mostly you buy a big machine to run smaller virtual
machines. That’s the model of the cloud. If you want to run one hundred and twenty-eight CPUs, you’re
probably going to buy a physical machine anyway. So we do want to do SMP. Two CPUs is really normal;
I mean, four CPUs is manageable. More than that, we support just because the work of supporting four
CPUs and sixty-four CPUs is the same, but after that you need more… So, sixty-four CPUs is basically because
we have a bitmap [laughter] and we want to use only one word for it. We boot in less than one
second, and two hundred milliseconds of that comes from the actual file-system mount. Some numbers
about this boot: we spend around one hundred and fifty milliseconds reading the image, in real mode,
from disc to memory. We’re slightly better now that we compress the image; it’s now
about half of that. But it’s around one hundred milliseconds and something to read the image, two
hundred milliseconds to mount the image. So it’s all disc operations, right? And aside from that it’s all
really, really fast.
We don’t expect this to be a killer feature, because in the cloud it usually takes a lot of time just to boot
the whole management around the application. For Azure I’m not really sure, but Amazon takes a lot
of time just to come up with all the networking, all this stuff. But it can be an interesting thing if you’re
booting… The guys from Mirage on Xen, they brag a lot about that. They’re way faster than we are in
that aspect. I just clap and say, “congratulations.” But that’s because they’re really almost like an
OCaml compiler; they just want to run the application. They don’t do a lot of boot-up stuff. For us, we
do have this one-second cap. Whenever we see that the boot time is starting to get bigger than one
second, we stop and try to see how we can optimize it. But as long as it is below one second we are
fine. There are probably some applications that are just going to want to spin up virtual machines
really fast; it’s a possibility. But for now it’s what we have.
Alright. So, Future Performance: this was what we have today, so, of course, I don’t have numbers for
future performance. [laughter] But, again, most of this is likely research oriented; it’s future, as I said.
The applications can use specialized APIs to access the virtual hardware. One example that I gave is
the MMU. You can just use the MMU because you are in kernel space. You’re not in the real kernel
space, because that’s where the hypervisor is, but you have the full view of the kernel space; you can
do whatever you want and the processor is going to do its magic. I don’t even care, I don’t even want
to know. I think I sleep better at night by not knowing how the processor… I mean, I know the
instruction set, but not too much about the internals. The MMU is one example; we see some
opportunities in the JVM, but there are also other kinds of opportunities to use the MMU directly.
Even more interesting is access to the network buffers. So if you want to do raw socket networking
with your own specialized protocols, and even to some extent using TCP or UDP as well, you can use
specialized APIs that we plan to provide (we don’t provide them yet). So you can read directly from
the network buffers; you don’t need to go through the whole socket API. If you access the hardware
buffer directly you can achieve full zero-copy networking and really get impressive numbers. We
recently merged something called Van Jacobson network channels, but its performance is still lacking a
little bit, so it needs more work. Van Jacobson is the guy who wrote the congestion control algorithm
for TCP, and he had this proposal that instead of using sockets, let’s use this channel thing. He had a
proof of concept for Linux; this was in 2006. And he got really beautiful numbers. But implementing
something in Linux is one thing; merging it into Linux is a different matter. And this is one of the things
at which we intend to excel as well. I mean, we designed the whole thing for the cloud. We don’t
support legacy; we’re not going to support any of the weird network protocols. It’s just the basics, so
we can be more free in that aspect. The basic idea of Van Jacobson is that you do very few things in
the network driver, so you don’t really have any lock contention. You just have a small, very fast
classifier that tells you where the packet belongs, and just that. And then you move it up the layers.
You don’t even allocate memory up front; you only allocate memory by the time you’re going to use it.
And if you do that, you have better cache line utilization, you have less lock contention, you don’t
spend time in system processing, all your time is spent in user space, and you can just parallelize
better. So his numbers for Linux were really good. For OSv, we went from thirty-six gigabits… this is in
memory, right? So it’s not over the wire; it’s host to guest, passing through memory, really just so we
can see how fast we can go. We went from around thirty-six gigabits to forty-seven. So, we did that,
but we lost performance in a lot of other places, so it’s still experimental. But it basically allows you to
transmit more. The reason it’s so difficult to implement Van Jacobson in Linux is that you need your
network stack to be exposed to user space, and doing that on Linux is very complicated. For us, it’s
hard to expose stuff to user space, but it’s really trivial to get your user space into kernel space. So,
that’s what we do. Alright, so, that’s it…
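The net-channel idea described above can be sketched roughly as follows. This is a toy Python illustration, not OSv code; the names `NetChannel` and `classify` are invented for the example. The point it shows is the shape of the design: the driver does only a cheap 4-tuple lookup to pick a per-connection queue, defers all protocol processing, and the consumer gets a zero-copy view of the raw frame in its own context.

```python
# Toy sketch of a Van Jacobson-style net channel: the "driver" only
# classifies packets into per-connection queues; all protocol work is
# deferred until the consumer actually reads, in its own context.
from collections import deque


class NetChannel:
    """One queue per connection endpoint."""

    def __init__(self):
        self.queue = deque()

    def push(self, frame: bytes) -> None:
        # Driver side: no parsing, no allocation beyond the queue entry.
        self.queue.append(frame)

    def pop(self) -> memoryview:
        # Consumer side: return a zero-copy view; the payload is parsed
        # only now, on the consumer's CPU, with warm caches.
        return memoryview(self.queue.popleft())


channels = {}


def classify(src: str, sport: int, dst: str, dport: int) -> NetChannel:
    """The only work done in driver context: a cheap 4-tuple lookup."""
    key = (src, sport, dst, dport)
    if key not in channels:
        channels[key] = NetChannel()
    return channels[key]


# Driver receives two frames for the same flow:
classify("10.0.0.1", 12345, "10.0.0.2", 80).push(b"GET / HTTP/1.1\r\n")
classify("10.0.0.1", 12345, "10.0.0.2", 80).push(b"Host: example\r\n")

# The application drains its own channel, copy-free:
ch = classify("10.0.0.1", 12345, "10.0.0.2", 80)
first = ch.pop()
print(bytes(first[:3]))  # protocol parsing happens only at this point
```

Because each flow has its own queue and processing happens entirely in the consumer's context, there is no shared protocol state to lock in the driver, which is where the cache and contention wins he describes come from.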
If we want to do really, really fast boots… as I said, we’re booting in less than one second. We don’t
care too much about getting into the one-hundred-millisecond realm. But if you really want to do that,
we can, by just creating a new mode of operation in which you run without a file system at all. Nothing
of that kind; no disk reads; we just move the image directly to memory. If we do that, you can pretty
much come up with an OSv instance in two hundred milliseconds, three hundred milliseconds. It might
be interesting for some cloud workloads, probably not too many. And that’s the part Jeffrey likes, I
guess. I talk a lot about performance because, I mean, I’m an engineer from the performance realm; I
always like to do that kind of stuff. But if you really look at how things are done, the real savings don’t
come from machines. Machines are cheap; human brains are a lot more expensive.
[indiscernible] are expensive. So we’re trying to go on a different route here as well. We have no
command line—that’s a lie. We have a command line for compatibility. Again, so you… we expose a
command line just because most of our user comes from Linux, they expect it to be there, so. But the
most of our management, we want all of our management should be basically a fully automatable rest
API. So we’re using rest because it’s more normal for in the Linux world. It’s just [indiscernible]
requests. We do consider, but it’s not final, to use other kinds of management on top of that . One
matter for us is that I want a single admin to manage one thousand OSv machines, not twenty as people
usually do these days. Again, no graphical interface either, no command line, everything that you do on
OSv should be rest or other automatable management. So we also want to obsolete all the kinds of
tools that people are using on Linux, like Puppet, Chef, and… Most, as we were talking earlier, most of
the time those tools they’re just really parsing output. I mean, you create an output and then somebody
goes and try and use a different locale and it breaks you completely, or change a comma over there and
breaks your parser, et cetera. We wanna be like fully automatable. We also want to be like almost zero
configuration. It’s… I would like to be like one hundred percent no-conf, it may not be possible, but
even when we do configuration and we’re trying to do that in a way you can automate, in a way you can
replicate, in a way you can be stateless and don’t depend on things like on your local disk. For Amazon,
for example, you’re using cloud in it. You just probe, probe the cloud provider, it said this is how you
should run. You run it and if you want to modify the configuration, reboot your machine and probe a
new configuration. That’s it. And we, as much as in the rest of the system, we don’t want to run… we
don’t want to have any significant portion of Legacy here. So, I don’t care how things were done in
Linux, but we gonna do it differently, I mean, we gonna do, at least for that part, fully automatable.
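A management workflow against a REST API of this kind could look like the sketch below. This is purely illustrative Python using only the standard library; the endpoint path `/os/version` and the JSON shape are invented for the example and are not OSv's actual API. The stub server stands in for the managed machine; the client side is the point: a script queries structured JSON rather than parsing human-oriented command output, so a loop over a thousand base URLs scales where screen-scraping does not.

```python
# Illustrative sketch: a tiny REST "management" stub and a client that
# automates against it, instead of parsing command-line output.
# The endpoint and JSON shape are invented for this example.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen


class MgmtHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/os/version":
            body = json.dumps({"version": "0.1-alpha"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass


server = HTTPServer(("127.0.0.1", 0), MgmtHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "admin" side: a script, not a human at a prompt. Managing a
# thousand machines is just this loop over a thousand base URLs.
url = f"http://127.0.0.1:{server.server_port}/os/version"
with urlopen(url) as resp:
    info = json.load(resp)
print(info["version"])
server.shutdown()
```

The response is machine-readable by construction, so there is no locale, column width, or comma placement for a tool like Puppet or Chef to mis-parse.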
So, there is a project called CoreOS that some people are using as well on hypervisors. When I talk to a
Linux crowd a lot of people ask me, “so how do you compare to CoreOS?” because CoreOS is basically
just Linux with a lot of things you don’t need, that you don’t want, stripped down. They don’t have to
modify the kernel, because that would be really insane; I mean, modifying Linux without merging into
Linux is a suicide note, pretty much. It just moves way too fast. But if you just boot Linux, I mean a
normal Linux image, you’re going to have around two hundred configuration files, most of them text
based. You’re not going to mess with them all, but they’re there. And most of the time you’re going to
use tools to do that. The problem is that those tools are always error prone. I mean, things happen.
And here, pretty much like in the spinlock thing, we’re trying to handle the problem by not having the
problem in the first place. We’re going to be fully automatable. Not even something like the Windows
Registry; it’s all REST. Maybe we’re going to do some other kind of RPC mechanism, but so far, for our
alpha, it’s going to be the REST API.
Another thing for the future as well, and some of it is more-or-less already in the present: another area
in which we believe the model of the library OS can excel is integrating with the runtime. So if you
remember that figure, you want to merge the operating system with the application. In the case that
you have a runtime, that really means merging the guest operating system with the runtime. And
again, savings come here not only in performance but also for people, [indiscernible]. A lot of people
running Java spend a lot of time benchmarking: what is the best heap size to use for my application?
And what we want to do is automatic heap size determination. That’s actually the project I’m working
on personally. It’s a lot more complicated than it seems on these slides. But the basic idea is just that
you give all the memory to Java, and when the operating system needs memory, I allocate a Java
object inside Java and use that memory for the operating system.
>>: Don’t all these runtimes also have their own configurations and their own…
>> Glauber Costa: Yes.
>>: So, do you have to then go tweak all of them to get them running on your system?
>> Glauber Costa: Yes and no. So there are two kinds of configuration you need to do: operating
system configuration and application configuration, and the runtime is in the middle, right? But I’m
calling this application. Operating system configuration is the thing we want to get rid of, for one
reason. If you are an application developer, I am fairly sure you know about your application. You
know how it works. You’d better know, right? You know how it behaves. You know how you should
treat things. You’re likely, not necessarily, but you’re likely to know as well how the runtime behaves,
because this is the thing you interact with. So you know a lot about the application, a little bit about
the runtime, and almost nothing about the operating system. You usually need a specialized
[indiscernible] for that. So we are trying to move bottom-up in this aspect: we want to get rid of the
operating system configuration and manageability and all that. It’s got to be fully automated, fully
REST, et cetera. Our goal is that you shouldn’t even know about it. The runtime is the next victim.
Right now you should know how to tweak your JVM, which parameters to use and all that. In the
future, so this is still in the future, we would like to get rid of some of that as well. This is one of the
reasons why we’re focusing on Java: it’s just a runtime that is widely available…
>>: If, say, a runtime has a set of commands that need to be run in processes… you said it only runs
one process, though.
>> Glauber Costa: We run one address space, but you can run a process provided it doesn’t rely on
memory isolation. So far, again, it might be that there is one runtime out there that needs something
we can’t support, and… I’m going to leave it at that. So far, for Java, we just run everything. Most of
the configuration I’m talking about here is parameters. How big should the stack size be? How big
should the heap size be? How big should the minimal… which garbage collector should you use, and
what’s the size of the young generation? What’s the size of the old generation, and all that? And we
intend to make those changes both in the operating system and in the runtime. So this technique only
changes the operating system: OSv has some specialized facilities just for Java, and it allows you to
forget about determining the heap size, which is one of the things most Java people just say, “yeah, it’s
too boring, we never get it right and it’s very complicated.” So what we do is give all the memory in
the machine to the Java virtual machine. The reason you need to determine heap size is that there is a
balance between Java memory and operating system memory. You have Java memory, but as soon as
you open a file, that file is going to use file system caches. And if you give all the memory in the box to
Java, there is no memory for file system caches. So what we’re doing is we have this specialized
communication with Java: through JNI we allocate a special object in the Java heap. That object has a
bunch of memory. We release that memory to the operating system and we allocate our file system
caches, for example, in that memory. So you don’t need to go playing with what the heap size should
be. You just use whatever you have.
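The balloon trick he describes can be illustrated with a small simulation. This is not OSv's or the JVM's code, and real ballooning goes through JNI; the sketch below only models the accounting: all machine memory belongs to the heap up front, and when the OS needs pages for, say, the file system cache, it "inflates" a balloon object inside the heap and lends that object's backing memory out.

```python
# Toy model of JVM heap ballooning: the OS reclaims memory from the Java
# heap by allocating a dummy object inside it, instead of partitioning
# memory up front. Sizes are in MB; this is accounting only, not real JNI.
class BalloonedHeap:
    def __init__(self, total_mb: int):
        self.total = total_mb  # all machine memory goes to the "JVM"
        self.balloons = []     # memory currently lent back to the OS

    def os_borrow(self, mb: int) -> None:
        """OS needs memory: inflate a balloon object inside the heap."""
        if mb > self.java_usable():
            raise MemoryError("heap too full to inflate balloon")
        self.balloons.append(mb)

    def os_return(self) -> None:
        """OS pressure dropped: deflate the most recent balloon."""
        if self.balloons:
            self.balloons.pop()

    def java_usable(self) -> int:
        """Heap memory the Java application can actually use right now."""
        return self.total - sum(self.balloons)


heap = BalloonedHeap(total_mb=4096)  # give the whole box to "Java"
heap.os_borrow(512)                  # file system cache wants 512 MB
print(heap.java_usable())            # 3584
heap.os_return()                     # cache shrinks, heap grows back
print(heap.java_usable())            # 4096
```

The point of the design is that the split between heap and OS memory is renegotiated continuously at run time, so nobody has to guess a fixed `-Xmx` value up front.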
>>: I understand why you’re—you’ve spent a lot of focus on the JVM and that it’s not your only focus
ultimately, but if you’re going to carry these ideas to their logical conclusion why do I need OSv at all?
Why don’t I just change the JVM so it would survive…[indiscernible]
>> Glauber Costa: You could. How can you do that? You can do that by linking it against a library that
allows you to… that library’s called OSv. As I said, I mean, the reason I like the library…
>>: That’s the answer I expected. I just would have assumed that somebody had already short-cutted
that to some degree. No?
>> Glauber Costa: So, there’s a… I’ll come back to that but let me hear what you…
>>: This is a Rip Van Winkle question: Unix shipped outside Ma Bell for sixty-four-K machines.
>> Glauber Costa: Uh huh.
>>: What happened to cause this thousandfold increase in size over these forty years?
>> Glauber Costa: I…[laughter].
>>: Especially if they don’t have legacy code.
>> Glauber Costa: No, yeah, I mean, the thing with configuration files in Linux is just that… if you have
a Linux box these days, it’s not just the kernel; you have a lot of services running there as well, and
each of those services needs to be configured. And it just explodes. It’s pretty much the same with the
Windows Registry, I believe. In the beginning you need to tweak this and that; ten years later, you
need to tweak the world. But about your question: if you change the JVM to boot directly onto a…
you can, it’s perfectly possible. But then, if you want to run the same thing for .NET, for example, or
Mono, which would be more natural for us, or Go, or Ruby, et cetera, you need to go and modify all of
them to boot directly on the hypervisor. The usual way to do this is by having a library that has an API,
and that’s OSv. So as I said, the reason I love the term library OS is that it conveys the message that it
is both a library and an OS. It has everything an OS has: a scheduler, memory management, a file
system, et cetera. And it’s also a library.
>>: …representing the same interface more-or-less as Linux, you’ve just rolled all those up.
>> Glauber Costa: Yeah. It’s all a question of where do I put my abstraction layer? And that’s it, like I
said.
So hardware-assisted garbage collection is for the future, but it’s also something that we want to
integrate, since we have the runtime.
If you want to know more, we have our website and code on GitHub; it’s all public. You can follow us
on Twitter or write to our mailing list, and I put my personal e-mail address at the beginning. It’s
glommer@cloudius-systems.com.
>>: You mentioned compact? Or, actually, you didn’t mention compact. You mentioned
memcached…
>> Glauber Costa: Memcached, yes.
>>: …and you mentioned Cassandra at one point. What are the sorts of things you run it with, and do
you have to go in and modify those systems and tweak ’em in order to get ’em to work?
>> Glauber Costa: So, right now, in theory, you could run any kind of Java application, right? Because
right now we fully support the Java runtime environment. So if you have your own jar, which, as an
application developer, you have, you can deploy it directly to OSv. In practice, theory is always
different, right? There is the JNI thing, where Java allows you to basically call directly into the
operating system. We aim to support the whole of POSIX, but we don’t, just because we didn’t sit
down and say let’s implement everything. We’re doing this more on an on-demand basis: there is this
guy that needs this system call, we go and implement it. So right now the two applications that we are
running and, like, certifying and testing are Cassandra and Tomcat. But one of the reasons for Tomcat
is that Tomcat is itself an application server, so people write stuff to deploy on top of Tomcat. If we
run Tomcat, we get more coverage. But if you have a well-behaved JVM application that doesn’t do
anything fancy in terms of calling back into the operating system, it’s just a matter of deploying it and
running.
>>: I’m gonna leave in a minute. You were gonna tell us how you came from Brazil to Russia.
>> Glauber Costa: Oh yeah. [laughter]
>>: Can’t leave without hearing that.
>> Glauber Costa: So I don’t like my country very much. The way I explain it is that I feel better in
Russia. That’s enough information for almost everybody. But I worked from home almost all my life,
and at some point I got this proposal from Parallels to go work on their containers model, which is
more-or-less the same thing we’re trying to do here but with a different… just the same kernel mode
[indiscernible] space. I like this idea a lot better than containers, personally. And then I moved to
Parallels, and my wife moved with me from Brazil and got a job there in Russia as well. And now she
doesn’t want to leave. She says, “I like my job, let’s stay in Russia.” It’s also convenient for me because
it’s close to Israel. From Brazil it’s like crossing the world to get to Israel; from Russia it’s like four
hours. So, frequently asked question: yes, I speak Russian [laugh], but poorly. It’s enough for getting
food.
>>: [indiscernible]
>> Glauber Costa: Yeah. [laugh] To be honest, I’m considering leaving now, after recent
developments. It’s getting a little bit scary for me to be there. But so far it’s just that, I mean. It’s
Brazil with snow, for everything. Sergey Brin from Google said Russia is Nigeria with snow. [laughter]
And when my friends back in Brazil ask me, “How’s Russia?” I just say, “Well, it’s Brazil with snow.” I
didn’t know about Sergey’s phrase; I developed that independently. But it really is what it is. It’s an
interesting experience.
>>: Well, thanks for coming.
>> Glauber Costa: I appreciate it. [applause] Thank you.