>> Shobana Balakrishnan: Welcome to the Microsoft MSR lecture series, I guess. I’d like to introduce
Glauber Costa. He’s the lead engineer for Cloudius, but has been an open source Linux hacker-developer
for the last ten-plus years, so…
>>: Ten years.
>> Shobana Balakrishnan: Ten years, okay, so he’s—you know—been in the Linux world for a while. He
worked at IBM, Red Hat, and then Parallels, and now he’s onto this new gig—I’ll call it Cloudius— and
we’ll hear about… I guess based on the KVM kernel? KVM or…?
>> Glauber Costa: I’m gonna tell more about that later, so…
>> Shobana Balakrishnan: Yeah, you’ll learn more. So with that, I’ll hand it off to you, Glauber, and feel
free to also interrupt with questions, but he’ll be there after the talk, so you can ask more questions too.
Thank you.
>> Glauber Costa: Thank you. So I can put this down, right?
>> Shobana Balakrishnan: Yes.
>> Glauber Costa: Again, just feel free to interrupt me at any time. I would like to start by saying that,
as Shobana was saying, I’ve been working with Linux for the past ten years—in the Linux kernel. This is
the last place on earth I would ever imagine I would be, so… [laughter]. I am very happy to be here. We
were discussing this over lunch; I mean, cloud computing is something that changed the game—in my
opinion—completely. Nobody can afford to be too centered on any one thing these days. So
yeah, I am indeed quite happy to be here. But most of my friends on Facebook—when I checked
in—they were extremely surprised. [laughter] So I’m working for Cloudius Systems…
>>: How do you keynote an open computer? [laughter]
>>: Three weeks ago, everybody was surprised we were there too…
>> Glauber Costa: Yeah, but it’s becoming normal. I mean I got the invitation to come here from Kewai,
and I met him in a Linux conference, so now I’m here. We’re all friends now and it’s a nice world to be
in.
So yes… who are we? We have this operating system that is—according to our presentation
materials—the first operating system designed for the cloud. Every time I say that, people start saying
this is not true, because well—you’re from Microsoft, for example, I know you have Drawbridge, which
is kind of a virtualized operating system, something that is put together to run on cloud infrastructures
or hypervisor infrastructures. The people from Xen Source, they have a lot of library OSes that they
work with as well, but we consider ourselves to be, like, the first operating system that is completely
written from scratch to run in the cloud. So that’s the gig here. I mean, we’re not… so that’s how I usually
open. We’re not a fork of Linux. So we are… most of us worked on Linux before. The startup that I work
with, Cloudius, was founded by Avi Kivity, who was the creator of the KVM hypervisor—the
technical creator, not the CEO of the company or anything. And most of us… like we have with us one
person who maintained, for a very long while, the memory management subsystem in Linux. I worked with
Xen, KVM, the Linux kernel, and all—so we do come from this background, but despite that we don’t
actually share any code with the Linux kernel at all, and we don’t intend to. Our design decisions are
very different. I’m gonna talk about some of them today, and our license doesn’t allow us to share any
code with Linux, which is to an extent by design. So we’re not a clone of Linux; we’re not a fork of Linux;
we are a completely new operating system.
We are drawing a lot from existing open source code. You can see that we’re quite a small
startup. What I’m talking about here is, again, a new thing. Jeffrey heard about it a while ago, when we—
probably when we first announced it—but we don’t even have an Alpha yet. We’re supposed to have
an Alpha in around three weeks, so I mean, we’re very close now to the point that we’re going to have a
product, but we still… we’re still not at the stage of having a product with customer service and
support, sales, and all that. It’s all engineers at this moment. We do have like fifteen people in—
pretty much—ten countries, so we’re distributed… most of the people are distributed… Yeah?
>>: I see a lot of tech companies that are run in Israel, but their headquarters sit in Silicon Valley so
people can phone without dialing a foreign area code…
>> Glauber Costa: Yeah.
>>: So how did it work for you to be officially…
>> Glauber Costa: We do… we have… two months ago or one month ago, actually, we finished the
paperwork to have a presence in the Silicon Valley, so we will have—we don’t have anybody in there
yet, but we will have an office in Silicon Valley eventually. But we are like… the chairman is one of those
Israeli investors that… they seem to do it for fun—like, earning money is something they do very well. So…
but they… we have an office in Israel, but people—as you see here—are really, really distributed. We
have people—I’m in Russia… I’m not Russian, but I live in Russia. We have people in Finland, Sweden,
and Brazil—I’m Brazilian, but there’s another guy in Brazil… really all over. Yeah?
>>: You’re really in Russia? [laughter]
>> Glauber Costa: Yeah, I’m going to tell the story later [laughter] of how I ended up in Russia. So what
we want to do is—that’s our mission statement—we want to build the default operating system for
public and private clouds running Unix-compatible applications. This is not set in stone, so I mean the
mission is something around that. If you go to the website, you’re not going to find precisely these
words—I mean, we’re still working on that, but that’s about what we want to do. This means that we’re
not in the application business at all, so we’re not doing any kind of applications. I don’t expect anybody
to say, “Yeah, I have this application here that runs in Windows, and I’m going to switch to OSv because
OSv is better.” I mean, I don’t even think about it. I don’t even consider this or any other operating
system, for that matter. We do implement, though, most—not all, but most—of the Linux API, and
what we want to do is: if you would be running Linux in the cloud and you run OSv
instead, your application is likely to run—like, very likely… I’m gonna detail what kind of stuff an application does that
prevents it from running on OSv—and it’s going to run faster, better, and using fewer resources.
So here’s how it comes to be: if you look at the typical cloud stack these days—I mean, you’ve probably
not only seen diagrams like that, you’ve probably made diagrams like that when you’re presenting, ‘cause
it’s all very, very traditional. You have the hardware, you have the hypervisor, and then you have the
operating system—Windows in… most of the time for you guys and Linux for other people. Most of the
time, you do have a runtime running—say the JVM, but it could be the .NET platform, it could be
Ruby, it could be any kind of thing. At least in the JVM world, it’s very common to have an application
server like JBoss, Tomcat, whatever, and you put your applications on top of that. Now when you look at
this picture, the thing that became clear for us and the dream that has driven this initiative is that a lot
of those layers, and in particular those three layers here—like the hypervisor, the operating system, and
the runtime, they’re really doing… the intersection between what they do is very big—I mean, they’re
not doing exactly the same thing, but they’re doing a lot of repeated tasks, and by doing those repeated
tasks you’re wasting a lot of resources. So… first of all, you’re wasting a lot of resources, but also you’re
binding yourself to some design decisions, because you need to be—in the case of Linux, for example—
you need to be able to run in this layer and in this layer, and the layer above cannot assume certain
things because it could run everywhere.
So what we want to do is we want to cut down that, we want to merge whatever runs the runtime,
whatever runs in user space—in our case—with the kernel. So there will be no more… and again, this is
something that Drawbridge does in the same way, as far as I understand. Whenever I talk about
Drawbridge, please correct me if I’m wrong. I read the paper, but that’s about it. We run our kernel in
user space, and we do that because we believe that the whole protection thing—which is something
that the operating system spends a lot of time and resources managing and doing—it’s gonna be
handled by the hypervisor. So we believe—and that’s the core of what we’re doing—most people—not
every deployment, but most people in the Cloud—they will be running one application per virtual
machine, and then you have a lot of virtual machines running a lot of applications. The reasons for that
are twofold. So first of all, it’s just simpler to do that. Most of the time in the cloud—we believe—for
example, if you want to reboot your machine, if you want to do an upgrade, it’s a lot easier to just shut
down your VM and come up with a new VM; and if you have two applications running in the same VM, they’re
not going to scale independently—it makes your management more complicated. So it’s not something
that we are proposing people should do—I mean, I’m not coming here to say, “yeah, we believe that
people should run one application per VM.” It’s that we look at the market, we look at
the figures and how things are deployed, and we notice that most of the time—again, there are
always the special case that can be handled separately—but most of the time, people are running one
application, one VM. And if you’re doing that, most of the separation, most of the user administration
privileges, all sorts of ring zero, ring three on Intel processors, et cetera—this is completely useless. In
particular, if you’re running only one application, and your application goes to the kernel—and you
manage to escalate to the kernel—it now has control of the machine. If your application is the only
application, the question is: so what? I mean, if you crash the machine, you crash the machine. You
would have crashed your application—doesn’t matter. So we’re cutting down those layers. And in the
case of Linux—and, to the best of my knowledge, Hyper-V is the same thing; again, I can be wrong about those things—it’s not an external hypervisor like Xen; it’s integrated into the kernel. So it can
run as the hypervisor, and it can run as the guest operating system. And when it runs as the guest, it’s basically wasting
resources, because the same operating system that is designed to be the hypervisor has all those
protection facilities, and it’s all being replicated in the layer above.
So what we’re targeting right now is this thing here—merging the runtime, or the actual application. For
a lot of applications, that’s where your stack ends—that’s your application and that’s the end of it. For
the case of a runtime, application server, et cetera, we’re just crushing the stack, making it a little
bit smaller, more efficient. And if we’re talking about big deployments, any ten percent that we gain—
over one million machines—is a lot.
Alright. The way we do that—so how do we propose to make this integration, to make those things
faster, more integrated, using less resources—is by drawing from this concept of a library OS—library
operating system. Again, I believe you’re all familiar with the concept itself, Microsoft Drawbridge is a
library OS as well. It’s something that you… the way I view OSv sometimes is: if you don’t want to talk
about an operating system—that’s why I like the name library OS, because it conveys both messages—if
you don’t want to talk about an operating system, you can talk about a library that you link your
application against, and that library allows you to boot your application directly on a hypervisor. So it
provides all the abstraction layers that you need—the very basic hardware
communication, boot-up sequence, all of that—that allow your C application—the application that you
just run, your .NET application, your Java application, whatever—to, instead of being run in an operating
system, boot directly on the hypervisor.
As I said, the Xen project has two operating systems that follow that logic: one of them is Mirage, and
the other one is Erlang-on-Xen. They are very different, though, because they’re very Xen-centric—they only run on Xen in paravirtual mode. Paravirtual mode—you probably know—is when
you need to change the operating system; in their case, they just wrote it all again. But they’re not even
general operating systems. Mirage, for example, is almost an OCaml compiler, so it only runs OCaml. What it
does is that you write your application in OCaml, and then you compile your application using Mirage,
and it’s going to put the… all of… if you happen to write to a file, it’s going to link a file system block into
your application—it’s going to link all the things you need to boot on Xen—and then it boots your Erlang
application directly on Xen—sorry, your OCaml application. Erlang-on-Xen does the same thing but for
Erlang. So what we believe we’re doing here is pretty much something between this and Drawbridge. I
mean, we do want to run all kind… all sorts of applications, but on the other hand, I mean—and we
don’t want to be like the Mirage and Erlang-on-Xen; they’re tied to a hypervisor—we want to run
everywhere. I mean, we want to run… if there is a hypervisor, we want to be there. But—and again,
that’s from my readings, correct me if I’m wrong—Drawbridge is actually Microsoft Windows modified
to cut a lot of pieces, right? We are—in that aspect—we are a lot more like Mirage and Erlang-on-Xen:
we just rewrote everything, but not to a specific language. We rewrote everything to try to run as many
applications and on as many hypervisors as we can.
The reason we can do that is… we believe, again, that the rise of the hypervisor is a game changer in this
aspect. Writing an operating system—so to begin with, it’s just fun, right? It’s nice; it’s nice work; it’s
probably one of the most fun things I’ve ever done in my life. Just let’s write everything—but the
[indiscernible] to scan the PCI bus—I mean, this is something that you don’t do because it’s already
done in every operating system.
>>: [indiscernible]
>> Glauber Costa: You do—cool. Isn’t it fun? It is fun, so [laughs]. But the thing is that, these days,
most people believe—I mean—there is no point in writing a new operating system. One of the
reasons for that—and you are all familiar with this—is that if you go look at any operating system, eighty
percent of the code is just drivers—boring drivers for all kinds of network cards, for all kinds of
video adapters. So if you write a new operating system, or a new hypervisor, or a new anything that
talks directly to the machine, you need to either rewrite all this stuff—and it’s not humanly
possible to do that these days, I mean, to support everything—or you run only on a single platform, or you try
somehow to reuse the drivers of an existing operating system. That’s what Xen did, for example, with the
Linux drivers. Xen has the hypervisor, and then has a special Linux guest that has special access so you
can reuse the Linux drivers from—they call it Dom zero—the privileged domain.
For us, restricting ourselves to running on a hypervisor is a game changer, because then you can just write a
new operating system. I mean, there is no… right now we’re running on KVM, Xen, VirtualBox, and
VMware. And even the last two are, like, almost there. So… but in the beginning, we were just running
on top of KVM. So the hardware is always the same: you have to support one network card, one block
driver, and that’s it. You can focus most of your effort in just writing the actual core operating system,
which is what we’re doing. Even if you want—as we want—to run in every possible hypervisor in the
medium, short, long term—I mean, whatever happens, but definitely in the long term, we want to be
running everywhere—there are like five hypervisors. Even if people keep coming up with new
hypervisors, there are never going to be one thousand hypervisors, we expect. So the amount of work
we need to do for drivers for compatibility and for things like that just makes the work manageable. So
this is, again, one of the main reasons why we felt that: yes, it is possible to just rewrite the operating
system, to just start using new concepts, to just start writing out… forget about compatibility, forget
about using whatever hardware is there, ‘cause we’re going to run only on top of hypervisors. So no, we
do not run in any kind of physical machine, and we may if there is a very, very specialized-use case that
requires… but this is not anything we want for the short run.
So OSv itself is written from scratch in C++11. So we’re, like, really trying to be on the edge of everything
here. Most of the time, we still spend some time fighting the compiler, ‘cause there are some weird bugs
that appear here and there. I mean, everything is very new… especially the compiler we use on
Linux, GCC—I mean, it doesn’t support C++11 in its fullest. I don’t think anything does
these days, but we’re using it. So we wrote everything, but we took some parts from FreeBSD. And those
are parts, again, in which we didn’t believe there was immediate value in rewriting. The
biggest example is the file system—like, we really didn’t want to write a new file system, for two reasons.
First of all, it’s not that important. Again—we believe—not all, but most applications in the cloud
won’t be talking to the local disk; they’ll be talking to some kind of shared storage, because if you don’t
do that—I mean—it’s impossible to be stateless; you’re not going to do it. However, some
applications do require a file system. One big example—something for which we’re still implementing some
of the functionality it needs—is the Cassandra database. So Cassandra does not write directly
to the disk; it writes to a file system, so it expects a file system. It takes what, like five to six years to
stabilize a file system, to have a reliable file system? I mean, we’re out of this business—it’s not
something that we want to do. So we just took the ZFS file system from OpenSolaris. The code actually
came from FreeBSD, so yeah… it’s the OpenSolaris code that was moved to
FreeBSD—and we took it from FreeBSD—just like a layer of indirection. The network stack we also took
from FreeBSD—I think we were not the only ones doing that, right? So… but we heavily modified that.
We just didn’t see value in rewriting the TCP protocol and all that. We took it from BSD, but
it’s already heavily modified—the file system, we would rather leave alone.
>>: So… before you… I’m not quite getting yet whether you were motivated mostly by license issues or
technical issues that you thought [indiscernible] involved.
>> Glauber Costa: A little bit of both, a little bit… so again, I have to admit that I am biased in
that aspect, because I’ve been working on Linux for almost a decade, but I do believe that the network
stack of Linux is way superior to the network stack of FreeBSD—that’s my… I mean, everybody always
disagrees about that, but then I have a lot of people on my side. It’s an eternal fight. From a technical
point of view, if I were to make this choice, I mean, I would just take the Linux network stack.
So ZFS is a little bit a technical choice, a little bit a license choice—technical in the sense that most of the
Linux file systems are open source, but they’re GPL-licensed. They are very tied to the Linux VFS, and
ZFS is a project that is more independent. It already runs on BSD, it already runs on Solaris, and they’re
actually trying to come up with a project called OpenZFS, which is the core of ZFS turned into a library—
sort of thing like that—so it can run with the same code on all those different operating systems, even on
Linux if you want to. The license, actually, is not compatible with the GPL, but you can run it on Linux in
user space, using FUSE—file system in user space. But we felt that for us, from a technical point of view—
not in terms of speed, performance, et cetera, but manageability—it would be a better choice to use ZFS,
because it’s more self-contained—it’s less dependent on an infrastructure. Once we made this choice,
though—I mean, the ZFS license is not compatible with the GPL—so we don’t… we just can’t use the
GPL, we just can’t use code from Linux. But it is a little bit of both, because if you really wanted, then we
could make the effort and say let’s use the Linux network stack, and let’s just then get one of the file
systems from Linux. So it’s a little bit here, a little bit there. But as you pointed out—I mean—it does…
we are—I mean—all doing this open source, but we did have a slight preference towards the BSD
license, a little bit from the business point of view as well. So that… we can talk more about that later,
but for example, with the GPL we would always be required to give the sources for everything. So if you
look at what Red Hat does, for example, with their business, you have the binary. Once you ship the
binary, you’re mandated by the license to ship the source for that binary. They cannot have, for
example, a stable version that is kept in house and is shipped just as a binary. With the BSD license we
can. We’re not sure yet that we want to do that, but we want to leave the doors open. So it was just a
friendlier license from a business perspective as well.
So when I say that we want to run Linux applications, what I mean by that is that we implement the
POSIX API that Linux uses. So, Linux doesn’t implement POSIX exactly. It implements a different
version, a completely Linux-centric one with a lot of extensions. And as Linux became more important in the
Unix world it became like the de facto POSIX specification, and that’s what we implement. So every
system call that is available on Linux we implement, except for creation of new processes and I’m going
to detail why and how in a while. For a variety of reasons, it’s better if you have a runtime because if
you have a runtime, your API now is the API of the runtime. For example, if you’re running Java, you’re
not talking to the operating system, and that also gives us more flexibility, because if you want to
change… so, as I said, there are some system calls we don’t support, there are some system calls that
we’d like to introduce, and still we would like the application developers not to have to care about that.
So we can make the changes, and we do make the changes, to the runtime, to the JVM. So we run a
modified version of the OpenJDK JVM, and all the modifications we have go into the runtime. The
application stays unchanged. So the runtime is yet another layer that can serve as a middleman here to
allow us to, at the same time, maintain compatibility and go crazy in the operating system world. Now,
that’s the most important thing in our design. Because we believe that most people are
going to be running one application per virtual machine, we don’t support running more than one
application per virtual machine. And once we do that, we can actually simplify the design of the
operating system by quite a lot. So, that’s actually the main technical point of the OSv architecture and
that’s, that’s one of the things that… Please?
>>: Do you support multiple threads?
>> Glauber Costa: Yes, threads, yes. So, good question about threads. One of the things by which we gain a
lot of performance is that we never, ever, ever switch the control register. Well—only if
you invalidate one mapping, yes, then we do. So we do support, for example, the Linux
mmap system call that creates a new mapping. But if we switch from one thread to another, we
never, ever, ever flush the TLB. And because we don’t…
>>: One address space
>> Glauber Costa: Huh? It’s one address space. Exactly. So, we can… we have multiple entities running in
parallel, but all of those entities share the same address space—all of them, which makes the
implementation of shared memory extremely easy as well. [laughter] But if you look at some of the
micro benchmarks, our context switch time is really fast. Because there… it’s just
really switching registers, I mean, there’s nothing else to do. You don’t have cache misses post
context switch, you don’t have anything. We have no users, we have no admin and user this, user that,
and roles and privileges. So again, that allows us to completely simplify the architecture of the operating
system. The way this is important, in my view, is twofold. So, first of all, there is a direct performance benefit.
We’re just faster because we’re not flushing the TLB at any time—
only when you specifically want to invalidate a mapping. So, it just makes it faster. But most of the
gains don’t come from the fact that you’re doing that operation faster. They come from the fact that
once you do that, you’re just simpler. And if you’re simpler, I mean, you’re more free to make certain kinds of
design decisions, and through both of those we achieve better performance. So one of the goals of OSv is to have
better manageability, as I was talking about earlier with Jeffrey, but also better performance—both in
terms of less resource usage and less latency and all kinds of things that the applications
may want. We do run user space as well—I mean, as I said, it’s kind of obvious: we don’t have any kind
of user space/kernel space separation. So those system calls that we implement from
Linux—they’re not really system calls, they’re just library calls. You just call a function and come back,
with no mode switch and nothing fancy going on. It also allows—and that’s something
that Drawbridge could do as well—it also allows the application to use the hardware directly. So, if the
application, for example—and we have some examples of applications that do that—if the application
wants to track dirty memory, the way it usually does that is by scanning. Like the JVM does that for garbage
collection. It needs to scan—or, in the case of the JVM, it inserts marks in HotSpot, in the JIT compiler,
to note “I’m changing this memory access,” and then it needs to mark and sweep. If you have
access to the hardware directly… it’s not that we’re running the kernel in user space;
we’re actually running the application in kernel space. It’s the other way around. So you have direct
access to the hardware. This is one of the projects we’re working on. It’s not complete. It is not even
close to completed. But the JVM can actually just… instead of going and sweeping the memory, you just
let the hardware do that for you. I mean, you just have the page tables at your disposal.
Whenever somebody writes to that area, the processor is going to flip a bit for you, you win. Automatic.
>>: So you’re redoing Brian’s work … [indiscernible]
>> Glauber Costa: Uh… yeah. [laughter] In a sense, in a sense. But again, the main difference is that
now we don’t have any user space. Like if you crash, that’s the whole idea. If you crash, you crash. You
would have crashed in user space, now you crash in kernel space. Doesn’t matter. You’re not crashing
anybody else. The whole user space, kernel space separation—the problem with that is that you crash
yourself, you crash somebody else. And here, you’re free.
>>: If you read the fine print, haven’t you just reinvented VM—CMS?
>> Glauber Costa: We… we have reinvented so many things, like… [laughter]. But it’s like that, I mean
we reinvent stuff but then…
>>: [indiscernible]
>> Glauber Costa: Yeah. [laughter] I do read Wikipedia, so… [laughter] But for example, on the
very first day that we announced OSv, one of the Linux journals that I always read
published an article about us, and one of the guys commenting said, “yeah, so you’re just reinventing
MS-DOS.” Yes and no—I mean, there are some new concepts… but one of the things that I love about
computing is exactly that. I mean, we always bring this stuff back from the past to the future again,
rereading it with the lenses we have now. It happens. But, yes, there are a lot of
concepts that… they’re not exactly new; they’re just rebranded for a new kind of world.
We are a very new startup. We are a very new project. We are about one year old, so it’s impressive
that we could just write a new operating system in less than one year. So, in less than one year we were
already running a large collection of applications. One of the reasons we can do that, of course, is just
because the whole design is so much simpler. I mean… right now we’re actually moving a lot
slower than in the beginning, just because, I mean, we picked all the low-hanging fruit—that
was enough to run the full JVM, by the way. Now we’re moving forward. I mean, we don’t want to run
just the JVM, we want to run the whole POSIX API. So, we’re moving a little bit slower now just because
the problems are more complicated, but they are still a lot simpler than they are in a normal operating
system. We are—again, it should be obvious by now—completely virtualization-oriented.
We don’t intend to run on hardware. We may in the future, if the need arises, for
something really, really, really specific. But one of the things that we take into account—I’m not sure if
you’re familiar with the lockholder preemption problem. It’s a problem very typical of virtualization.
So, if you’re holding a spinlock and you’re running on hardware—you’re running on a laptop—that means that very
soon you’re gonna release that spinlock. So, that’s why you busy-wait. You busy-wait because it’s
not gonna be for a very long time. However, in virtual machines there is this
problem… again, lockholder preemption… imagine you’re holding a spinlock—the
hypervisor may take your virtual CPU off the physical CPU, so that virtual CPU is no longer running. But then it
schedules another virtual CPU that is now waiting for the virtual CPU that is holding the spinlock. And all
the time this waiting virtual CPU is running is just wasted, because the holder is not going to release the spinlock—
it’s not even running on a CPU. It’s off the CPU. And the guest operating system has no
control over that whatsoever.
>>: I mean, I understand that you have the opportunity to avoid spinlocks here, but you don’t find that
paravirtualization handles that correctly?
>> Glauber Costa: Oh, yeah. So, no, paravirtualization is one option to do that. So how Linux tries to
solve that problem, for example, is by doing paravirtualized spinlocks. It tries to communicate with the
hypervisor. The hypervisor tries to tell you, “yeah, your virtual CPU is not running, so you
shouldn’t spin, you should go to sleep,” for example, “you should do something.” We could have
done that, but this is a lot more like a legacy style: you already have the spinlocks, and it’s very complicated
to run everything without a spinlock. We had the opportunity—we’re designing everything from
scratch—so we just decided we’re not going to use any spinlocks, at all. So, there is no spinlock in OSv,
at all. And this is, this is… I’m a very lazy person; that’s also something about me. I mean, I hate
working. Like, if it were up to me I would be drinking whiskey on the beach every day. So…
>>: In Russia?
>> Glauber Costa: Yeah. [laughter] They have the Black Sea and now they have a larger piece of the
Black Sea so… [laughter] So, it’s cool. But, one of my preferred ways of solving problems is getting rid of
the problem. So we could have done, like Linux does, paravirtualized
spinlocks. But again, paravirtualization is completely dependent on the hypervisor. Not all hypervisors
support it. Not all hypervisors support it in the same way, and if you want to support it across the board,
we need to write one paravirtualization scheme for KVM, for Xen, Hyper-V—I don’t even know if he has.
Like we know nearly nothing about Hyper-V.
>>: That part’s published—some of it.
>> Glauber Costa: No—yeah. But again, it’s not about being published or not. I mean, we need to go
there and take a look, but whatever scheme you have, it needs to be done, it needs to be coded. If I
just write the operating system without using any spinlocks, I don’t have the problem anymore. That’s
the perfect solution: get rid of the problem, right? So we really have no spinlocks. The easiest way to
get rid of spinlocks completely is, of course, by using mutexes, because then you just go to sleep every
time you contend on the lock. But the mutex itself… most mutex implementations use a spinlock to
protect the queue. You have a queue of waiters, and then you have a spinlock to control and protect
that queue of waiters. So, we are using… I wasn’t the one that implemented this. It was a colleague;
he’s a mathematician, so he knows all sorts of crazy stuff, and he implemented most of our scheduler as
well. There’s fairly recent work, from around 2008, where a group designed a lock-free mutex: a mutex
that doesn’t use any kind of lock to protect its internal queue. We’re using that one. Again, it may be
too complicated, too cumbersome to retrofit into an existing operating system, but we are doing this
from day one. That’s just the kind of decision we have been making. If we’re going to run virtualized,
we’re not going to do spinlocks—and we don’t.
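The core trick of such a lock-free mutex, keeping the waiter queue consistent with atomic compare-and-swap rather than an internal spinlock, can be illustrated with a minimal Treiber-style lock-free stack. This shows the flavor of the technique only; it is not the published algorithm OSv actually uses.

```cpp
#include <atomic>

// A lock-free stack: the kind of structure a lockless mutex can use to
// hold its blocked waiters without taking any spinlock itself.
template <typename T>
class lockfree_stack {
    struct node { T value; node* next; };
    std::atomic<node*> head_{nullptr};
public:
    void push(T v) {
        node* n = new node{v, head_.load(std::memory_order_relaxed)};
        // CAS loop: on failure, n->next is reloaded with the current head.
        while (!head_.compare_exchange_weak(n->next, n,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {}
    }
    bool pop(T& out) {
        node* n = head_.load(std::memory_order_acquire);
        while (n && !head_.compare_exchange_weak(n, n->next,
                                                 std::memory_order_acquire)) {}
        if (!n) return false;
        out = n->value;
        delete n;  // real code needs safe memory reclamation (ABA, hazard pointers)
        return true;
    }
};
```

A production mutex additionally has to hand the lock off to a waiter and deal with memory reclamation, which is where the real complexity lives.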
We have no complicated hardware model. So, for example, our PCI discovery… again, this is more of an
anecdotal example, because you don’t do PCI discovery all the time…
>>: Why do you ever do it?
>> Glauber Costa: Huh? When you boot.
>>: But you’re running in the DL.
>> Glauber Costa: Yeah, but again, I don’t know about Hyper-V, but with KVM or Xen or VirtualBox…
we’re not running, for example, PV Xen; we run HVM Xen. So it provides you with the view of a
machine, and that’s how you can run everywhere. So, you have a PCI device—it’s there. And each of
the virtio devices, for example, that KVM uses for networking and block, they’re all exposed as PCI
devices.
>>: Yeah, we can do that.
>> Glauber Costa: Yeah, so, but again, I’m interested in knowing more about how Hyper-V exposes the
devices, but even Xen, in HVM mode, can expose a PCI bus, just to hook some devices over
there. So, again, we do PCI discovery, but we don’t do the most complicated PCI discovery, handling
every case possible on earth. How does KVM implement it? That’s what we handle. So our
hardware model is a lot simpler. We don’t support multiple types of busses, multiple types of
interrupts; it’s what we have in the virtual world and that’s it. So, it’s a lot simpler in this way.
We also try to take into account things that are traditionally more expensive in hypervisors. Setting
timers is a lot more expensive in hypervisors than it is on physical machines. IPIs too, because you have
a lot of exits when you try to communicate with other CPUs. So we have a polling
scheme, for example: before we go idle, we always keep polling other CPUs for more work, to avoid
IPI communication. We get some performance benefits from that as well. And these are all things that,
if you have something that needs to run everywhere, might not be the
best thing. Linux, for example, is trying to implement the very same polling
mechanism that we have, but they need to make it coexist with all the other mechanisms. For us
it’s just a lot simpler. We do things thinking about virtual machines, mostly HVM machines. So, as I said,
we don’t even support Xen paravirtualization. We’re targeting the kind of
clouds that are gonna be available in two years, three years, and that’s what we want to do, as
simply as we can. Simplicity is a feature for us. [pause] When I met KY in Edinburgh,
at the Linux conference in Edinburgh, instead of giving T-shirts away we were giving boxer
shorts with the motto “less is more.” [laughter] So simplicity is a feature; we’re trying to be as
simple as we can, and we believe we’re gonna harvest the benefits of that. Again, this
is not VM-specific, but we also have a completely fair scheduler. So, again, we have
everything that an operating system would have. That’s why “library OS,” as I said, is a good term: it’s both
a library and an operating system. I don’t know about Windows, but on Linux, for example, for each page
you have metadata about that page, a 64-byte structure that describes each page in the
system. We don’t have that. Also, we run only on 64-bit; we don’t run on 32-bit machines at all.
So we made this decision; you can see it’s a pattern. We don’t want to support
legacy. We don’t want to do legacy. We’re writing something from scratch for the
virtual machines that are going to be there in two, three years. So, we have…
>>: 64 byte structure, not 64 bit?
>> Glauber Costa: No, the structure for each page is 64 bytes, or 32 bytes depending on… it’s
between 32 and 64 bytes…
>>: Sorry. I’m learning more about Linux in your talk than I am about your OS. This is really interesting.
>> Glauber Costa: Yeah. [laughter]
>>: The problems you’re working around are the most interesting part. [laughter]
>> Glauber Costa: Yeah, yeah, of course.
>>: [indiscernible]
>> Glauber Costa: I don’t think any of those solutions are incredibly complicated and
ingenious. It’s just that once you decide not to handle all sorts of stuff, and you say “I’m only going to do
this,” you can simplify. Again, I have no idea how Windows handles page metadata
or anything like that, but in Linux you have this structure. It’s even aligned in a special way, in groups
of two words. So you have, in some variants, up to 64 bytes; it depends on the configuration options that
you have, but at the very minimum 32 bytes. You have flags, you have the virtual address, you have
all sorts of information about the page in there, and that describes each page in the system. And if
you have quite a large machine, that’s quite a bunch of memory that goes only into metadata for pages.
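To put a number on that: with a 64-byte descriptor per 4 KiB page, the metadata alone is 64/4096, about 1.6% of RAM, which on a 64 GiB machine is a full gigabyte. A back-of-the-envelope helper, using the page and descriptor sizes mentioned in the talk:

```cpp
#include <cstdint>

// Bytes of per-page metadata for a machine with ram_bytes of memory,
// assuming one desc_size-byte descriptor per page_size-byte page.
constexpr std::uint64_t page_metadata_bytes(std::uint64_t ram_bytes,
                                            std::uint64_t page_size = 4096,
                                            std::uint64_t desc_size = 64) {
    return (ram_bytes / page_size) * desc_size;
}

// 64 GiB of RAM -> 16 Mi pages -> 1 GiB of struct-page metadata.
static_assert(page_metadata_bytes(64ULL << 30) == (1ULL << 30),
              "64-byte descriptors cost 1/64 of RAM");
```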
Again, we’re not doing that. If you talk about page tables, the amount of memory that we use for page
tables is also a lot smaller, because we only have one set of page tables, not many
sets. All of those decisions allow us to be more resource-efficient. And
again, we try not to compare ourselves to any other operating system other than Linux. The way I
view it, for example—this was something for the end of the talk, but I’ll say it now—is that we don’t
really want, and there isn’t even any sense in, OSv being the new operating system for everyone,
especially because if you wrote your application for Windows, I find it highly unlikely that
you’re just gonna rewrite it. If you haven’t rewritten it by now to run on Linux, you’re not gonna rewrite
it now to run on OSv, even if you would harvest the benefits of running a specialized, virtualized
operating system. Right? Right.
So, this is for whoever is running Linux, to run a better Linux. We’re trying to be a better Linux than
Linux by focusing on a niche, the cloud niche, that we believe is gonna be quite important in the
upcoming years, and I’m fairly sure you agree.
Security: So there is the obvious thing, of course, that you have nobody to attack,
because you’re running alone on the machine, and if you crash, you crash. We trust the hypervisor to
isolate the virtual machines. That’s one of the questions I get asked the most: “But what if somebody can
breach into the hypervisor?” Then it’s not my problem. Call support for the hypervisor
vendor—call Red Hat, call Microsoft, call whoever—because they’re doing a clumsy job. But
as long as the hypervisor maintains isolation, there is nobody you can attack. I mean, you can attack the
application, yes. You can DoS an application, you can shut an application down, but you
don’t have isolation problems. Not at all, because, again, we’re being lazy and just getting rid of the problem.
You have fewer components than a traditional operating system. OSv is just a monolithic library that
you link against, and everything is inside that library. You have fewer moving pieces, fewer moving parts;
again, it’s just simplicity. And a much smaller attack surface, of course. We don’t have a very,
very complicated kernel. So, I think the summary for this is: we don’t have any special
security model. It’s not our focus; that’s why I don’t spend too much time on this slide. But even
though it’s not our focus, there are still some benefits to be gained from the fact that our simplicity
results in a smaller attack surface, et cetera.
As I said, you have no users at all, so there’s no privilege escalation, ‘cause there’s
nowhere to escalate to, and no untrusted processes. Everything there is running inside the VM, and we
assume it’s a trusted entity. If you’re doing something that is untrusted, it’s going to be running
in a different VM. And we are writing C++11, and as long as the compiler does a good job in translating
that into actual code… I had no C++ experience before this at all. I mean, Linux is written in
C, so I hadn’t, pretty much… Linus, as you may know, hates C++, so we all kind of went along…
Please?
Please?
>>: Granted the compiler does a good job, but is there any difference in terms of security between
writing C++11 or writing Fortran? You write a language, the machine does what you tell it to.
>> Glauber Costa: Yes, yes, but keep in mind, for example, that this is also the first
time I’m talking to a non-Linux crowd. When we talk about Linux: Linux is completely written in
C, and C is a language in which you need to write out everything you want to do yourself. C++
does a lot of that automatically for you, and especially C++11 does it in a very efficient way. The
reason Linus never liked C++ to begin with is that it has so many abstractions that you lose control of
what is going on, you lose performance… according to Linus, it’s just not the
language to write an operating system in, because you lose a lot of control. C++11 is allowing us to have a
very decent level of control. The primitives are really efficient. You do understand what’s going on.
But, for example, a common problem in Linux: you take a mutex and you forget to release the mutex.
That’s something that never happens for us, because we use the resource-acquisition-is-initialization
(RAII) idiom: whenever you have a block, you lock your mutex and write whatever you want
inside that block, and it doesn’t matter where you leave that block from, the mutex is going to be
automatically released. Likewise, we are using a lot of shared pointers, so reference counting is
done automatically. Whenever we want to do reference counting, we never get it
wrong, because we have the compiler do it for us. And again, I have no idea what kind of
tools and processes you have in Windows, or even the language you code in, but
compared to Linux there’s a lot less opportunity for mistakes. We spend a lot less time thinking
about buffer overflows and dangling pointers and memory leaks, because they simply don’t happen.
And that’s without sacrificing performance, because we understand exactly
what’s happening. C++11 introduces a lot of new, interesting stuff to prevent those problems
from going on.
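The RAII point can be shown in a few lines. This uses the standard `std::mutex` and `std::lock_guard`; OSv has its own mutex type, but the idiom is the same: the lock is released on every exit path from the scope, including an exception.

```cpp
#include <mutex>
#include <stdexcept>

std::mutex m;
int protected_value = 0;

// The guard locks m on construction and unlocks it when the scope is left,
// no matter how the scope is left: normal return or thrown exception.
int bump(bool fail) {
    std::lock_guard<std::mutex> guard(m);
    ++protected_value;
    if (fail)
        throw std::runtime_error("leaving early");
    return protected_value;
}  // guard's destructor releases the mutex here
```

Because the unlock lives in the destructor, "forgot to release the mutex" is not a bug you can write; the same mechanism gives `std::shared_ptr` its automatic reference counting.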
>>: Do you use exceptions?
>> Glauber Costa: What?
>>: Do you use exceptions?
>> Glauber Costa: We support them, of course, because at some point you’re going to run an
application, but we don’t rely on them in kernel code. To tell you that we never use them
would be a lie. In some very special code that we took from somewhere, for
example the parsing code for the command line, we handle exceptions. But most of the actual stuff,
I mean, we don’t use exceptions at all. If an exception happens…
>>: I was just curious, because that’s a common difference between C and C++…
>> Glauber Costa: Yes. No, so, every time we code, we code to be exception-safe, but we
don’t throw; none of the classes we write throws an exception, none of that. Alright.
So, coverage: where do we run? Again, keep in mind that this is not final; we’ve been at it for around
one year. It’s obvious that we run very well on KVM, and most of our good results, most of our benchmarks, are
produced on KVM. This is because Avi started this—Avi created KVM—so there’s possibly nobody
in this world that understands KVM better than Avi. He knows how to get all the performance from it,
all the good things. I was a KVM developer for five years. Another guy, who has been working with us from
the early days as well, was the person who implemented nested virtualization inside KVM. We have
another guy that worked on KVM as well. So about half of our crew worked on KVM; it’s just the thing
we are most used to. We also run on Xen. The reason we targeted Xen so early is
because we wanted to be present on Amazon, on AWS. And right now we run on VirtualBox and
VMware. The reason we picked VMware is that we have a partnership with VMware; we found a
contact there who said “yeah, we’re gonna help you with some code, we’re gonna help you with
questions,” et cetera. VirtualBox was just something that would allow us
to run, at least for testing purposes, on Windows, Mac, Linux, whatever, on the same hypervisor. So, the
question I was asked, for example, by KY when we met was, “Do you plan to run on Hyper-V?”
Yes, we do. We would love to run on Azure. Right now we’re running on Amazon and Google
Compute Engine. Google Compute Engine is not on-line yet, but when it is, we already
have a partnership with Google: we’re gonna have an OSv image from day one, so people can pick an
OSv image from the list and just run it. We’re partnering with them even before the platform
goes live to have an OSv image from day one. Why Amazon? It’s because…
>>: When you say partnering with them, does it just mean the image is available or…?
>> Glauber Costa: It depends. With Google it means one thing, with VMware it means another thing,
with Amazon it means another thing. It depends on the deals we manage to cut. For Amazon, we
basically have an engineer there who helps us: answers some questions about their future directions,
helps us plan a little bit. They don’t actually contribute anything. There is a guy from Google that
contributes some code, so we have a stronger partnership with Google in that respect. And the
reason we’re running on Amazon is because, of course—I know you’re trying to change that and
Google’s trying to change that; I’m gonna love to see the fight—Amazon is still the biggest public
cloud. So, that’s something we just have to run on. And Google Compute Engine is running on KVM. So,
for us it’s just a no-brainer: it’s running on KVM, we run on KVM, let’s run it. We also have a very good
relationship with Google because of people that we know and who worked with us on KVM
as well. So I would say that of all those companies, the strongest relationship we have so far is with
Google. But we would love to have more at some point, depending on our bandwidth; again, we are
around fifteen people at this time, expanding, but not… I would like to double every day, but it’s a…
[laughs]
>>: So [indiscernible] would mean flagging drivers…
>>: Yeah, I mean it’s just a question of whether they can pull those out of Linux or not. We gave
them to the Linux kernel. Do you think you can pull the concepts out, or do you take the code?
>> Glauber Costa: So, we definitely can’t take code from Linux. That’s a license issue.
>>: So we haven’t documented our protocols. What we have done is give code to the Linux project.
>> Glauber Costa: If you’re willing to do that for us as well, I mean, one of the things I can guarantee is
that it’s going to be simpler. Yeah?
>>: Right. So, I mean, I can’t sign people up to do work there, that’s not my job. But [pause] what you
said earlier is actually kind of interesting. You said that you don’t want to use paravirtualization, at least
for the whole process. I would have expected you to say the exact opposite.
>> Glauber Costa: So, let’s just… that’s also a personal pet peeve of mine. I sometimes use the term
paravirtualization a little bit differently, because I come from the Xen era. I contributed to Xen very
early. And when Xen came out with this paravirtualization thing, what
paravirtualization meant at the time was rewriting the whole operating system to run on top of a
hypervisor. Which meant: I’m not gonna write to the CR3 register, I’m gonna bypass that; I’m gonna
do memory management my own way. So in my mind this is paravirtualization. What
you call paravirtualization is probably what I call paravirtualized drivers. Which is just… Yeah?
>>: [indiscernible]…that’s somewhere in between the two…
>> Glauber Costa: And then of course there’s the gray area. So, for KVM, we do run paravirtualized in
that definition, because we run virtio drivers. We do support IDE drivers now, because of
VirtualBox, but in the beginning, for example, we didn’t; we just
had virtio network, virtio block, and the virtio PCI device, whatever it is. But those are paravirtualized
drivers. What we don’t do… huh?
>>: [indiscernible]… block and network.
>> Glauber Costa: It’s basically block and network, and whatever else you have that you want to expose.
KVM, for example, has a balloon driver that allows you to share memory between the operating system
and the host; we don’t implement it. We might; we haven’t because it’s not a top
priority. When I say we don’t support paravirtualization, it’s on Xen, specifically, which is pretty much
the only hypervisor that relies heavily on modifying the whole operating system. When we run on Xen,
we only run in HVM mode. So we don’t do the Xen hypercalls, and the way we set up
page tables on Xen is to write to CR3, which basically means Xen HVM mode. So we…
>>: [indiscernible]… bringing a lot of other stuff with that.
>> Glauber Costa: Yeah, I mean, we personally consider paravirtualization, in the sense of
paravirtualizing everything, to be legacy. We live in a post-VMX era, so the hardware is gonna do
everything for you, and whatever the hardware does, we tag along. For drivers, we
do paravirtualization, yes. And that’s what we’re basically interested in: when we talk about
supporting multiple hypervisors, what we have in mind is really getting the drivers, which is a block
driver and a network driver for Xen, KVM, VirtualBox, VMware. So the work on VMware that we’re doing
now is basically the VMware network driver, and that’s it. And if your hypervisor has some facility that
you…
>>: [indiscernible]…follow up with KY cause…
>> Glauber Costa: Yeah.
>>: [indiscernible]… shouldn’t be an issue, because we made the drivers available for BSD as well.
They’re our drivers, so… When you put them in Linux you have to put them under [indiscernible] license,
but we can put them under a different one.
>> Glauber Costa: Yeah.
>>: But let me follow up with KY.
>> Glauber Costa: Sure.
>>: Thank you.
>> Glauber Costa: But this is as far as the hypervisor goes, as far as the cloud computing
platform goes. We’re also interested in drawing partnerships on that front. I mean, we’re not
fully set. Shobana was asking me about that earlier: we’re not completely set on the business
model. We believe it’s premature to have a full plan for how it’s gonna be. But most of
our ideas revolve around per-hour paid support: say the cloud charges thirteen
cents per hour for the CPU, and we charge an extra three cents to run a supported version of OSv that
comes with a support plan, whatever. So, the way we’re going to present it in each of the clouds is
going to be different, depending on what we can arrange. For Amazon, right now we’ll just drop an
image in there. With Google we have a stronger relationship, so long term we would like to
be the go-to image for people running Linux applications. But in each of the platforms it will
depend a little bit on…
>>: Are you gonna present numbers in terms of size and friction… manage to…
>> Glauber Costa: Yeah, next slide.
>>: Okay.
>> Glauber Costa: Yeah. [laughter]
>>: Sorry.
>> Glauber Costa: So, yeah. Resource usage: the base image that we have is
pretty much just a kernel and a small hello-world application that you run and forget
about. That’s about seventeen megabytes of disc image. We can do better, but we’re just not
prematurely optimizing anything; it’s around seventeen megabytes. That gives you the full
libc-compatible POSIX layer that we have so far. So with this image you can drop in a Linux-compatible
application that doesn’t come with a bunch of libraries. Again, if there are a lot of
libraries you depend on, you need to move them into the image as well. But that’s the smallest it
gets at this point: seventeen megabytes. In terms of memory
usage it’s slightly bigger than that, but it’s not gonna be twice as much. It’s basically just loading that into
memory and initializing a couple of data structures. If it’s twice that, it’s [indiscernible].
Obviously that image doesn’t do anything very interesting. What we try to do is bundle stuff
together. For example, mruby… it doesn’t implement the full set of Ruby, but it’s
still like a mini Ruby that implements most of the things that Ruby does, and we do that in twenty-five
megabytes. So in twenty-five megabytes you have a Ruby shell in which you can run Ruby applications,
just as you would run them with mruby on Linux, for example. We didn’t write this mruby thing;
we just ported it from Linux to OSv. And some people are already using it. If you want to run, for
example, embedded Ruby… don’t ask me why, but some people do it. That actually came from an
external contributor, so we just said: hey, cool.
The main image we’re targeting is, of course, the Java image, at least so far. It fits
in more-or-less 250 megs, and this is just because we want to ship the full JVM so you don’t run into
missing pieces. We see a lot of value not only in resource usage… it is resource usage, but the most
important and most expensive resource is usually human beings. So, we would like in the future
to try to pinpoint which libraries of the runtime environment you actually need, but right now
we just copy the entire runtime environment into the image. The OpenJDK is
around two hundred megs itself, plus some dependencies: two hundred fifty. And we do run on Amazon,
and that matters as well, because we’re targeting the clouds. You can tell me about
how you size those things, but for Amazon the smallest instance they have is six hundred
megabytes of memory. So, what’s the smallest you have… is it?
>>: I think there’s a… five twelve, five twelve to your mini…
>> Glauber Costa: So five twelve is quite close to six hundred…
>>: Yeah.
>> Glauber Costa: …more-or-less, yes. So we also don’t see a lot of value in having specialized
images that run in ten megabytes, ‘cause that’s not what you’re going to find in the
clouds out there. It’s just that we’re not wasting expensive human resources to save ten
cheap megabytes. But those are pretty much the base numbers.
>>: Seven sixty-eight, seven sixty-eight.
>> Glauber Costa: Seven sixty-eight?
>>: Close.
>> Glauber Costa: Yeah.
>>: So, how many of those pages are even…?
>> Glauber Costa: Coming back—pretty much just the current ones. So again, everything you map here is
read-only, because it’s just the disc image, right? You’re gonna pretty much copy this
image into memory, so this memory is gonna be mostly read-only, and then you have your data. As I
said, a fully booted OSv image with all data structures initialized is not gonna be
twice that size. So we can fully run, for example, in less than six hundred megabytes on
Amazon with the JVM. The actual image that we run, in terms of disc size, is even bigger, because we
have the OpenJDK and a shell and an example shell so you can just get acquainted. That
image is around four to five hundred megabytes, and it fits in the six-hundred-megabyte
instance. So, it’s around that.
Performance: So, all the numbers I have here are compared to Fedora Linux. Again, for me, it’s one
hundred percent pointless to compare to anything else, like comparing to Windows, et cetera. We
may do it for fun in the future, to see how it compares, even to see possibilities
for improvement and to get ideas. But since we are implementing the Linux API, the POSIX
API, Linux is just really the thing we’re comparing to. We’re also not going to benchmark
against every single Linux version out there, because otherwise we wouldn’t do anything else in our lives.
But this is Fedora. Some of those images are running Fedora nineteen, some of them Fedora
twenty. So we’re talking about, at most, a year-old Linux kernel; between six and eight
months, something like that. So, system calls: as should be obvious from the design explanation at
the beginning, they’re pretty much free. It’s just the cost of saving a bunch of registers, the
user-space registers, and restoring them again. There is the FPU as well… oh,
another thing: we use the FPU in the scheduler as well, ‘cause we have to support the FPU
anyway; the applications are going to use it. The applications expect an FPU. I don’t know a
single processor from the last ten years that doesn’t have an FPU, so, I mean, we just use it.
I mean, our scheduler does floating-point calculations all the time, since for us user space and kernel
space are the same thing. Saving the FPU is the most expensive thing that we do
in the context switch. We don’t do it all the time, though. That’s pretty
common; even Linux does that. We make informed guesses, of course,
conservative guesses, about when you need to save the state and when you don’t.
We don’t switch the FPU all the time, but when we do, that’s the most expensive operation in the context-switch
path. We are around four times faster in context switches than Linux, the Linux kernel running
Fedora—Fedora running, of course, the Linux kernel. [laugh] That’s a micro-benchmark. You’re not
going to be running context switches all the time, but from the micro-benchmark numbers you
try to get a grasp of the actual final numbers.
For networking we’ve been running constant tests on netperf. Again, comparing with Fedora Linux,
we’re getting around twenty percent better on TCP workloads. So, just networking. Part of that is
because we copy fewer buffers: we don’t have any kernel-space/user-space separation, so we have no
need to copy the buffers across it. We do still copy; we’re not zero-copy, because the API doesn’t let you
be zero-copy. The Linux API for network transmission is pretty much like the file API: you just
pass a buffer to the socket, and until you get the acknowledgment that the TCP transmission
finished, you need to keep the buffer around. It’s really hard to do full zero-copy, but we don’t need
kernel-to-user-space copies of any kind.
For UDP we can be a lot more aggressive, so our numbers on UDP are a lot higher. We can go up to fifty
percent better in terms of transmission. And, we don’t do it yet, but it’s also a lot more
manageable to do zero-copy on UDP, because you don’t have to keep the
buffer around after transmission, so you can transmit the buffer directly. The receive side is trickier, but
we can do it as well.
For Spec—SpecJVM—we’re not actually expecting a lot of performance improvement, because it’s
mostly a benchmark designed to test the JVM itself. So most of the time we expect it to be doing
primitive operations. They have a bunch of crypto programs; there are like
twenty or thirty different programs. They do call eventually into the operating system, and those
are the opportunities we have to improve. We lose in some benchmarks,
we win in others, and the aggregate value is around three to five percent in our favor.
Memcached: it’s one of the things we’re doing best at this point. We’re around forty
percent faster. Most of those gains come first from networking, because we just have faster
networking to begin with. Also, the memory allocation is more efficient. This is a UP benchmark, so
it’s not SMP. With SMP we still have some problems; most of the
problems just come from our load balancer not being mature enough, so we don’t have the level of
CPU utilization that we would expect. For networking, we were suffering from the same thing. Those
results are all UP. When we go SMP, sometimes we start to lose a little bit; we don’t scale that
well yet. We don’t believe that this is something from the architecture. We believe this is just something
that we need to put effort into. Another reason we’re targeting UP is that, in the
beginning, I don’t expect, for next year, for example, anybody to be seriously deploying OSv on all their
servers. So, most people are going to be trying it, and on the small instances they’re likely to be single-
CPU. So, that’s where we’re focusing. And of course, we boot the whole machine in less than one
second, which…
>>: When you say single CPU you mean…
>> Glauber Costa: Virtual CPU
>>: Single-core logical CPU
>> Glauber Costa: It’s a single virtual CPU. So, whatever: one run queue, one execution entity,
whatever it is. The hypervisor at the end of the line can be as big as you want, but what matters is
how many virtual CPUs you assign to the operating system. So, we’re designing OSv to go up to sixty-four
CPUs. We didn’t see a lot of value in extending that. Maybe there are some people in
the cloud running really big VMs, but mostly you buy a big machine to run smaller virtual
machines. That’s the model of the cloud. If you want to run one hundred and twenty-eight CPUs, you’re
probably going to buy a physical machine anyway. So we do want to do SMP. Two CPUs is really normal;
I mean, four CPUs is manageable. More than that, we support just because the work of supporting four
CPUs and sixty-four CPUs is the same, but after that you need more… So, sixty-four CPUs is basically because
we have a bitmap [laughter] and we want to use only one word for it. We boot in less than one
second, and two hundred milliseconds of that comes from the actual file-system mount. Some numbers
about this boot: we spend around one hundred and fifty milliseconds reading the image, in real mode,
from disc to memory. We’re slightly better now that we compress the image; it’s now
about half of that. But it’s around one hundred milliseconds and something to read the image, two
hundred milliseconds to mount the image. So it’s all disc operations, right? And aside from that it’s all
really, really fast.
We don’t expect this to be a killer feature, because in the cloud it usually takes a lot of time just to boot
the whole management around the application. For Azure I’m not really sure, but Amazon takes a lot
of time just to come up with all the networking, all this stuff. But it can be an interesting thing if you’re
booting… The guys from Mirage on Xen, they brag a lot about that. They’re way faster than we are in
that aspect. I just clap and say, “congratulations.” But that’s because they’re really almost like an
OCaml compiler; they just want to run the application. They don’t do a lot of boot-up stuff. For us, we
do have this one-second cap. Whenever we see that the boot time is starting to get bigger than one
second, we stop and try to see how we can optimize it. But as long as it is below one second we are
fine. There are probably some applications that are just going to want to spin up virtual machines
really fast; it’s a possibility. But for now it’s what we have.
Alright. So, Future Performance: this was what we have today, so, of course, I don’t have numbers for
future performance. [laughter] But, again, most of this is likely research oriented; it’s future, as I said.
The applications can use specialized APIs to access the virtual hardware. One example that I gave is
the MMU. You can just use the MMU because you are in kernel space. You’re not in the real kernel
space, because that’s where the hypervisor is, but you have the full view of the kernel space; you can
do whatever you want and the processor is going to do its magic. I don’t even care, I don’t even want
to know. I think I sleep better at night by not knowing how the processor… I mean, I know the
instruction set, but not too much about the internals. The MMU is one example; we see some
opportunities in the JVM, but there are also other kinds of opportunities to use the MMU directly.
Even more interesting is access to the network buffers. So if you want to do raw socket networking
with your own specialized protocols, and even to some extent using TCP or UDP as well, you can use
specialized APIs that we plan to provide (we don’t provide them yet). So you can read directly from
the network buffers; you don’t need to go through the whole socket API. If you access the hardware
buffer directly you can achieve full zero-copy networking and really get impressive numbers. We
recently merged something called Van Jacobson network channels, but its performance is still lacking a
little bit, so it needs more work. Van Jacobson is the guy who wrote the congestion control algorithm
for TCP, and he had this proposal that instead of using sockets, let’s use this channel thing. He had a
proof of concept for Linux; this was in 2006. And he got really beautiful numbers. But implementing
something in Linux is one thing; merging it into Linux is a different matter. And this is one of the things
at which we intend to excel as well. I mean, we designed the whole thing for the cloud. We don’t
support legacy; we’re not going to support any of the weird network protocols. It’s just the basics, so
we can be more free in that aspect. The basic idea of Van Jacobson is that you do very few things in
the network driver, so you don’t really have any lock contention. You just have a small, very fast
classifier that tells you where the packet belongs, and just that. And then you move it up the layers.
You don’t even allocate memory up front; you only allocate memory by the time you’re going to use it.
And if you do that, you have better cache line utilization, you have less lock contention, you don’t
spend time in system processing, all your time is spent in user space, and you can just parallelize
better. So his numbers for Linux were really good. For OSv, we went from thirty-six gigabits… this is in
memory, right? So it’s not over the wire; it’s host to guest, passing through memory, really just so we
can see how fast we can go. We went from around thirty-six gigabits to forty-seven. So, we did that,
but we lost performance in a lot of other places, so it’s still experimental. But it basically allows you to
transmit more. The reason it’s so difficult to implement Van Jacobson in Linux is that you need your
network stack to be exposed to user space, and doing that on Linux is very complicated. For us, it’s
hard to expose stuff to user space, but it’s really trivial to get your user space into kernel space. So,
that’s what we do. Alright, so, that’s it…
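The net-channel idea described above can be sketched roughly as follows. This is a toy Python illustration, not OSv code; the names `NetChannel` and `classify` are invented for the example. The point it shows is the shape of the design: the driver does only a cheap 4-tuple lookup to pick a per-connection queue, defers all protocol processing, and the consumer gets a zero-copy view of the raw frame in its own context.

```python
# Toy sketch of a Van Jacobson-style net channel: the "driver" only
# classifies packets into per-connection queues; all protocol work is
# deferred until the consumer actually reads, in its own context.
from collections import deque


class NetChannel:
    """One queue per connection endpoint."""

    def __init__(self):
        self.queue = deque()

    def push(self, frame: bytes) -> None:
        # Driver side: no parsing, no allocation beyond the queue entry.
        self.queue.append(frame)

    def pop(self) -> memoryview:
        # Consumer side: return a zero-copy view; the payload is parsed
        # only now, on the consumer's CPU, with warm caches.
        return memoryview(self.queue.popleft())


channels = {}


def classify(src: str, sport: int, dst: str, dport: int) -> NetChannel:
    """The only work done in driver context: a cheap 4-tuple lookup."""
    key = (src, sport, dst, dport)
    if key not in channels:
        channels[key] = NetChannel()
    return channels[key]


# Driver receives two frames for the same flow:
classify("10.0.0.1", 12345, "10.0.0.2", 80).push(b"GET / HTTP/1.1\r\n")
classify("10.0.0.1", 12345, "10.0.0.2", 80).push(b"Host: example\r\n")

# The application drains its own channel, copy-free:
ch = classify("10.0.0.1", 12345, "10.0.0.2", 80)
first = ch.pop()
print(bytes(first[:3]))  # protocol parsing happens only at this point
```

Because each flow has its own queue and processing happens entirely in the consumer's context, there is no shared protocol state to lock in the driver, which is where the cache and contention wins he describes come from.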
If we want to do really, really fast boots… as I said, we’re booting in less than one second. We don’t
care too much about getting into the one-hundred-millisecond realm. But if you really want to do that,
we can, by just creating a new mode of operation in which you run without a file system at all. Nothing
of that kind; no disk reads; we just move the image directly to memory. If we do that, you can pretty
much come up with an OSv instance in two hundred milliseconds, three hundred milliseconds. It might
be interesting for some cloud workloads, probably not too many. And that’s the part Jeffrey likes, I
guess. I talk a lot about performance because, I mean, I’m an engineer from the performance realm; I
always like to do that kind of stuff. But if you really look at how things are done, the real savings don’t
come from machines. Machines are cheap; human brains are a lot more expensive.
[indiscernible] are expensive. So we’re trying to go on a different route here as well. We have no
command line—that’s a lie. We have a command line for compatibility. Again, so you… we expose a
command line just because most of our user comes from Linux, they expect it to be there, so. But the
most of our management, we want all of our management should be basically a fully automatable rest
API. So we’re using rest because it’s more normal for in the Linux world. It’s just [indiscernible]
requests. We do consider, but it’s not final, to use other kinds of management on top of that . One
matter for us is that I want a single admin to manage one thousand OSv machines, not twenty as people
usually do these days. Again, no graphical interface either, no command line, everything that you do on
OSv should be rest or other automatable management. So we also want to obsolete all the kinds of
tools that people are using on Linux, like Puppet, Chef, and… Most, as we were talking earlier, most of
the time those tools they’re just really parsing output. I mean, you create an output and then somebody
goes and try and use a different locale and it breaks you completely, or change a comma over there and
breaks your parser, et cetera. We wanna be like fully automatable. We also want to be like almost zero
configuration. It’s… I would like to be like one hundred percent no-conf, it may not be possible, but
even when we do configuration and we’re trying to do that in a way you can automate, in a way you can
replicate, in a way you can be stateless and don’t depend on things like on your local disk. For Amazon,
for example, you’re using cloud in it. You just probe, probe the cloud provider, it said this is how you
should run. You run it and if you want to modify the configuration, reboot your machine and probe a
new configuration. That’s it. And we, as much as in the rest of the system, we don’t want to run… we
don’t want to have any significant portion of Legacy here. So, I don’t care how things were done in
Linux, but we gonna do it differently, I mean, we gonna do, at least for that part, fully automatable.
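A management workflow against a REST API of this kind could look like the sketch below. This is purely illustrative Python using only the standard library; the endpoint path `/os/version` and the JSON shape are invented for the example and are not OSv's actual API. The stub server stands in for the managed machine; the client side is the point: a script queries structured JSON rather than parsing human-oriented command output, so a loop over a thousand base URLs scales where screen-scraping does not.

```python
# Illustrative sketch: a tiny REST "management" stub and a client that
# automates against it, instead of parsing command-line output.
# The endpoint and JSON shape are invented for this example.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen


class MgmtHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/os/version":
            body = json.dumps({"version": "0.1-alpha"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass


server = HTTPServer(("127.0.0.1", 0), MgmtHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "admin" side: a script, not a human at a prompt. Managing a
# thousand machines is just this loop over a thousand base URLs.
url = f"http://127.0.0.1:{server.server_port}/os/version"
with urlopen(url) as resp:
    info = json.load(resp)
print(info["version"])
server.shutdown()
```

The response is machine-readable by construction, so there is no locale, column width, or comma placement for a tool like Puppet or Chef to mis-parse.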
So, there is a project called CoreOS that some people are using as well on hypervisors. When I talk to a
Linux crowd a lot of people ask me, “so how do you compare to CoreOS?” because CoreOS is basically
just Linux with a lot of things you don’t need, that you don’t want, stripped down. They don’t have to
modify the kernel, because that would be really insane; I mean, modifying Linux without merging into
Linux is a suicide note, pretty much. It just moves way too fast. But if you just boot Linux, I mean a
normal Linux image, you’re going to have around two hundred configuration files, most of them text
based. You’re not going to mess with them all, but they’re there. And most of the time you’re going to
use tools to do that. The problem is that those tools are always error prone. I mean, things happen.
And here, pretty much like in the spinlock thing, we’re trying to handle the problem by not having the
problem in the first place. We’re going to be fully automatable. Not even something like the Windows
Registry; it’s all REST. Maybe we’re going to do some other kind of RPC mechanism, but so far, for our
alpha, it’s going to be the REST API.
Another thing for the future as well, and some of it is more-or-less already in the present: another area
in which we believe the model of the library OS can excel is integrating with the runtime. So if you
remember that figure, you want to merge the operating system with the application. In the case that
you have a runtime, that really means merging the guest operating system with the runtime. And
again, savings come here not only in performance but also for people, [indiscernible]. A lot of people
running Java spend a lot of time benchmarking: what is the best heap size to use for my application?
And what we want to do is automatic heap size determination. That’s actually the project I’m working
on personally. It’s a lot more complicated than it seems on these slides. But the basic idea is just that
you give all the memory to Java, and when the operating system needs memory, I allocate a Java
object inside Java and use that memory for the operating system.
>>: Don’t all these runtimes also have their own configurations and their own…
>> Glauber Costa: Yes.
>>: So, do you have to then go tweak all of them to get them running on your system?
>> Glauber Costa: Yes and no. So there are two kinds of configuration you need to do: operating
system configuration and application configuration, and the runtime is in the middle, right? But I’m
calling this application. Operating system configuration is the thing we want to get rid of, for one
reason. If you are an application developer, I am fairly sure you know about your application. You
know how it works. You’d better know, right? You know how it behaves. You know how you should
treat things. You’re likely, not necessarily, but you’re likely to know as well how the runtime behaves,
because this is the thing you interact with. So you know a lot about the application, a little bit about
the runtime, and almost nothing about the operating system. You usually need a specialized
[indiscernible] for that. So we are trying to move bottom-up in this aspect: we want to get rid of the
operating system configuration and manageability and all that. It’s got to be fully automated, fully
REST, et cetera. Our goal is that you shouldn’t even know about it. The runtime is the next victim.
Right now you should know how to tweak your JVM, which parameters to use and all that. In the
future, so this is still in the future, we would like to get rid of some of that as well. This is one of the
reasons why we’re focusing on Java: it’s just a runtime that is widely available…
>>: If, say, a runtime has a set of commands that need to be run in processes… you said it only runs
one process, though.
>> Glauber Costa: We run one address space, but you can run a process provided it doesn’t rely on
memory isolation. So far, again, it might be that there is one runtime out there that needs something
we can’t support, and… I’m going to leave it at that. So far, for Java, we just run everything. Most of
the configuration I’m talking about here is parameters. How big should the stack size be? How big
should the heap size be? How big should the minimal… which garbage collector should you use, and
what’s the size of the young generation? What’s the size of the old generation, and all that? And we
intend to make those changes both in the operating system and in the runtime. So this technique only
changes the operating system: OSv has some specialized facilities just for Java, and it allows you to
forget about determining the heap size, which is one of the things most Java people just say, “yeah, it’s
too boring, we never get it right and it’s very complicated.” So what we do is give all the memory in
the machine to the Java virtual machine. The reason you need to determine heap size is that there is a
balance between Java memory and operating system memory. You have Java memory, but as soon as
you open a file, that file is going to use file system caches. And if you give all the memory in the box to
Java, there is no memory for file system caches. So what we’re doing is we have this specialized
communication with Java: through JNI we allocate a special object in the Java heap. That object has a
bunch of memory. We release that memory to the operating system and we allocate our file system
caches, for example, in that memory. So you don’t need to go playing with what the heap size should
be. You just use whatever you have.
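The balloon trick he describes can be illustrated with a small simulation. This is not OSv's or the JVM's code, and real ballooning goes through JNI; the sketch below only models the accounting: all machine memory belongs to the heap up front, and when the OS needs pages for, say, the file system cache, it "inflates" a balloon object inside the heap and lends that object's backing memory out.

```python
# Toy model of JVM heap ballooning: the OS reclaims memory from the Java
# heap by allocating a dummy object inside it, instead of partitioning
# memory up front. Sizes are in MB; this is accounting only, not real JNI.
class BalloonedHeap:
    def __init__(self, total_mb: int):
        self.total = total_mb  # all machine memory goes to the "JVM"
        self.balloons = []     # memory currently lent back to the OS

    def os_borrow(self, mb: int) -> None:
        """OS needs memory: inflate a balloon object inside the heap."""
        if mb > self.java_usable():
            raise MemoryError("heap too full to inflate balloon")
        self.balloons.append(mb)

    def os_return(self) -> None:
        """OS pressure dropped: deflate the most recent balloon."""
        if self.balloons:
            self.balloons.pop()

    def java_usable(self) -> int:
        """Heap memory the Java application can actually use right now."""
        return self.total - sum(self.balloons)


heap = BalloonedHeap(total_mb=4096)  # give the whole box to "Java"
heap.os_borrow(512)                  # file system cache wants 512 MB
print(heap.java_usable())            # 3584
heap.os_return()                     # cache shrinks, heap grows back
print(heap.java_usable())            # 4096
```

The point of the design is that the split between heap and OS memory is renegotiated continuously at run time, so nobody has to guess a fixed `-Xmx` value up front.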
>>: I understand why you’re—you’ve spent a lot of focus on the JVM and that it’s not your only focus
ultimately, but if you’re going to carry these ideas to their logical conclusion why do I need OSv at all?
Why don’t I just change the JVM so it would survive…[indiscernible]
>> Glauber Costa: You could. How can you do that? You can do that by linking it against a library that
allows you to… that library’s called OSv. As I said, I mean, the reason I like the library…
>>: That’s the answer I expected. I just would have assumed that somebody had already short-cutted
that to some degree. No?
>> Glauber Costa: So, there’s a… I’ll come back to that but let me hear what you…
>>: This is a Rip Van Winkle question: Unix shipped outside Ma Bell for sixty-four-K machines.
>> Glauber Costa: Uh huh.
>>: What happened to cause this thousandfold increase in size over these forty years?
>> Glauber Costa: I…[laughter].
>>: Especially if they don’t have legacy code.
>> Glauber Costa: No, yeah, I mean, the thing with configuration files in Linux is just that… if you have
a Linux box these days, it’s not just the kernel; you have a lot of services running there as well, and
each of those services needs to be configured. And it just explodes. It’s pretty much the same with the
Windows Registry, I believe. In the beginning you need to tweak this and that; ten years later, you
need to tweak the world. But about your question: if you change the JVM to boot directly onto a…
you can, it’s perfectly possible. But then, if you want to run the same thing for .NET, for example, or
Mono, which would be more natural for us, or Go, or Ruby, et cetera, you need to go and modify all of
them to boot directly on the hypervisor. The usual way to do this is by having a library that has an API,
and that’s OSv. So as I said, the reason I love the term library OS is that it conveys the message that it
is both a library and an OS. It has everything an OS has: a scheduler, memory management, a file
system, et cetera. And it’s also a library.
>>: …representing the same interface more-or-less as Linux, you’ve just rolled all those up.
>> Glauber Costa: Yeah. It’s all a question of where do I put my abstraction layer? And that’s it, like I
said.
So hardware-assisted garbage collection is for the future, but it’s also something that we want to
integrate, since we have the runtime.
If you want to know more, we have our website and code on GitHub; it’s all public. You can follow us
on Twitter or write to our mailing list, and I put my personal e-mail address at the beginning. It’s
glommer@cloudius-systems.com.
>>: You mentioned compact? Or, actually, you didn’t mention compact. You mentioned
memcached…
>> Glauber Costa: Memcached, yes.
>>: …and you mentioned Cassandra at one point. What are the sorts of things you run it with, and do
you have to go in and modify those systems and tweak ’em in order to get ’em to work?
>> Glauber Costa: So, right now, in theory, you could run any kind of Java application, right? Because
right now we fully support the Java runtime environment. So if you have your own jar, which, as an
application developer, you have, you can deploy it directly to OSv. In practice, theory is always
different, right? There is the JNI thing, where Java allows you to basically call directly into the
operating system. We aim to support the whole of POSIX, but we don’t, just because we didn’t sit
down and say let’s implement everything. We’re doing this more on an on-demand basis: there is this
guy that needs this system call, we go and implement it. So right now the two applications that we are
running and, like, certifying and testing are Cassandra and Tomcat. But one of the reasons for Tomcat
is that Tomcat is itself an application server, so people write stuff to deploy on top of Tomcat. If we
run Tomcat, we get more coverage. But if you have a well-behaved JVM application that doesn’t do
anything fancy in terms of calling back into the operating system, it’s just a matter of deploying it and
running.
>>: I’m gonna leave in a minute. You were gonna tell us how you came from Brazil to Russia.
>> Glauber Costa: Oh yeah. [laughter]
>>: Can’t leave without hearing that.
>> Glauber Costa: So I don’t like my country very much. The way I explain it is that I feel better in
Russia. That’s enough information for almost everybody. But I worked from home almost all my life,
and at some point I got this proposal from Parallels to go work on their containers model, which is
more-or-less the same thing we’re trying to do here but with a different… just the same kernel mode
[indiscernible] space. I like this idea a lot better than containers, personally. And then I moved to
Parallels, and my wife moved with me from Brazil and got a job there in Russia as well. And now she
doesn’t want to leave. She says, “I like my job, let’s stay in Russia.” It’s also convenient for me because
it’s close to Israel. From Brazil it’s like crossing the world to get to Israel; from Russia it’s like four
hours. So, frequently asked question: yes, I speak Russian [laugh], but poorly. It’s enough for getting
food.
>>: [indiscernible]
>> Glauber Costa: Yeah. [laugh] To be honest, I’m considering leaving now, after recent
developments. It’s getting a little bit scary for me to be there. But so far it’s just that, I mean. It’s
Brazil with snow, for everything. Sergey Brin from Google said Russia is Nigeria with snow. [laughter]
And when my friends back in Brazil ask me, “How’s Russia?” I just say, “Well, it’s Brazil with snow.” I
didn’t know about Sergey’s phrase; I developed that independently. But it really is what it is. It’s an
interesting experience.
>>: Well, thanks for coming.
>> Glauber Costa: I appreciate it. [applause] Thank you.