>>: So I just got the signal we can go ahead and get started. It's my pleasure to introduce virtually Ken Birman from Cornell University who will be talking about consistency options for replicated storage in the cloud. Ken, it's all yours. Ken Birman: Thank you very much, Roger. And I want to apologize to the audience for not being there personally. I would've really enjoyed the conference, but this is the best I could do, as it turned out, given U.S. Air and the weather yesterday. So I want to talk about the challenge of replicating data in cloud settings, and in particular about a premise that's been pretty widely accepted, articulated by Eric Brewer, who argued that when we build systems that need to be highly available, there's a tradeoff between consistency guarantees and partitionability, which has been taken much more broadly as performance. And so the CAP theorem, as he named it, is basically the claim that we must abandon consistency in order to build large-scale cloud computing systems. I want to question that and argue that that might not really be the case, and maybe we've been a little too quick to abandon consistency. Before I do that, I'll say, though, that it's a widely accepted assumption about cloud computing. So, for example, if you attended LADIS, the cloud computing workshop that ran two years ago, the guy who built the architecture of the eBay system gave his five key arguments for how to guarantee scalability, and the fifth one is to embrace inconsistency. And that's actually the one he talked about most. Werner Vogels used to work with me here at Cornell, and when he went to Amazon, the first job he had was to clean up a huge scalability problem that the company was having with fluctuations in load within their cloud computing system. He tracked it down to a reliability mechanism in their publish-and-subscribe architecture that was being used to replicate data, and he basically stamped out reliability. He had to switch that to something slower that's not necessarily going to make such strong guarantees, and he solved the problem and later was quite proud. He said, look, the kind of reliability that was being guaranteed there -- and it wasn't a strong kind -- was nonetheless in the way of scalability, very much in the same sense as Eric's point. And James Hamilton, who was one of the architects of Azure, now at Amazon, but at the time was at Microsoft, gave a great talk on this. And he basically said that consistency is the biggest [inaudible] to scalability in cloud computing systems. He was talking mostly about database consistency, but he said that, as far as he was concerned, the right way to handle this is to create a committee and tell people who want to build a mechanism that involves consistency to first get approval from the committee. And here you've got a picture of the meeting the last time they met. And his sense was that as long as the next time they meet is about as far out as the last time, people would get the point and find some other way to build their apps. And people are doing that. So what I want to do is spend a minute now talking about what I mean by consistency, what is this term, and why it matters, and then ask whether we can actually have consistency and scalability too. So I'm going to use this term to refer to situations in which something is replicated but behaves as if it were not replicated.
So a consistent system for me looks like a single fast server running at Amazon or at Microsoft or whatever, but in reality, it's made of very large numbers of moving parts. And so we could draw pictures of that. And I'm going to do that in a second. Some examples of things that have consistency guarantees are transactions on replicated data -- so when we shard data in contemporary cloud computing databases, we're basically abandoning transactional properties at the same time. That's how these sharded systems are built. So transactions are an example of a consistency mechanism. Cloud versions of large-scale replicated data, by and large, avoid the use of transactions. Not always, of course. And don't take what I'm saying as an always story, but I mean most of the time. Atomic broadcast, locking -- and, of course, locking through Paxos [inaudible] -- is a major mechanism in cloud computing systems. Nonetheless, a huge effort is made to avoid using locking because it's a consistency mechanism of the type perceived as destabilizing. Here's the replicated data picture I was going to make before. You can think of it as locking if you prefer. So an example of a consistency property would be the following. Suppose that I told you that I built a data center in which it looks as if the patient's medical records are updated on a single server. And here's a picture of that happening. So the timeline is going from left to right and the little blue stars are places where updates occur. And then I told you actually I built it as a cloud computing system and I spread that service out on a couple of nodes, but although they're executing separately and in parallel, the execution really behaves just as if it had been the original reference execution. We call that a synchronous execution. You can see that I've come up now with timelines, multiple processes, five in this example, two of them fail along the way, but if you look at the actual events, everybody sees the same events as in the top picture, the state is the same, it's indistinguishable. Okay? Paxos works that way. And here's virtual synchrony. This is the model that I happen to be fond of because I invented it and it's fast and I've always been sort of a speed demon. Virtual synchrony gives executions which, if you look at them in a semantic sense, are indistinguishable from the synchronous ones, which in turn are indistinguishable from the reference runs. So virtual synchrony weakens ordering and weakens certain other properties, but preserves the guarantee that what happens in the distributed cloud system is indistinguishable from the reference execution. So my goal in the rest of this talk is to make it possible to use executions like this and to respond to the concern that this can't scale. Now, why do people fear consistency? Why do they think that consistency is such a dangerous thing to have? The main concern is that consistency is perceived as a root cause of the problems that places like Amazon had. And actually it goes way, way back. If you've been in the field for as long as I have, almost 30 years now, you know that banks even 15 or 20 years ago were afraid of this phenomenon, where they had trading floors that would be destabilized. There's a picture of the type of problem that Amazon was suffering from at the time that Werner went there. Not anymore.
And what you can see is that when they measured message rates in the background on their cloud platforms, they were oscillating between saturating the network -- that's at the top -- and dropping to zero over a significant period of time. And if you looked at exactly what was happening, they had decided to use a publish-subscribe message bus very aggressively, and this particular bus guaranteed delivery, and when the system loaded up enough, it started dropping packets, but because it had to guarantee delivery, it would retransmit those packets, which created additional load -- essentially putting reliability ahead of scalability. The extra load caused a complete collapse, and that's why it went down to zero. After about 90 seconds, that particular system would give up, which meant all the load drained away and the cycle would repeat. So you can understand why people would be afraid of this. If you imagine your, you know, data center the size of 12 soccer fields completely destabilized in this way, it's a pretty frightening prospect. Now, on the other hand, there are dangers of inconsistency. So if we start to move mission-critical applications to the cloud, that's going to include banking applications, medical care applications. Microsoft has a big commitment to moving in that direction. Google does as well. And if you take medical care records, those aren't going to be just your doctor's records, they're going to include real-time data coming from blood sugar measurements that are going to be turned around and used to adjust insulin pumps. That will happen for people who are at home and who aren't capable of doing it themselves. And because of the efficiencies of cloud computing, it'll be on a scale where people couldn't step in if it broke down, but obviously inconsistency is dangerous in such settings. So if we can't figure out how to reintroduce consistency in cloud environments, then what we're doing is we're saying that we can achieve scalability, but we can't run those types of apps. And I don't think we should accept this. And that's actually why I don't think CAP should be viewed as a theorem. It's more of a rule of thumb that's worked pretty well and gotten us pretty far. Now, to reintroduce consistency, what we're going to need are a few things. First of all, a scalable model. Many people would say, for example, that we should use what's called state machine replication -- Paxos. Isis, the virtual synchrony model, would be another option. And we need to convince ourselves that that model itself is compatible with scaling, and then we have to have an implementation of a platform that can scale massively in all sorts of dimensions. And so in the rest of the talk, what I want to do is explain why I think this is a solvable problem. So I'm known in my career for building the Isis system originally. It was used in things like -- the New York Stock Exchange ran on Isis for about a decade. During that whole period you never read about a trading disruption due to technology in the stock exchange. Yet that was a system with hundreds of machines, even in the early days. So why did it never fail? The answer is it experienced failures, but it was self-healing. It was using the software I'm going to reinvent, in some sense, for cloud computing now. The French air traffic control system continues to use that platform, as does the U.S. Navy Aegis, and there were a lot of other apps that used it as well.
It didn't make a lot of money other than for me, but it did make some money, and it certainly made some people happy, and in particular, stock exchange traders, for example. Now, the key to Isis was to support group communication like in the picture I showed you earlier, and what I'm going to do now is suggest that the way to think about this today is that these groups are really objects and that the programs I was showing, the little processes with the timelines, had imported the objects much as if you had opened a file. So if you have a group of five processes like I showed you earlier, what that really is, is an object -- in some sense, shared memory among five processes. They float in the network, they hold data when they're idle, and then you open them when you want to use them. And I'm going to reincarnate this now. We'll call it Isis 2, okay? So how would this look? Well, to the user, it will just be a library -- I'm just going to show you very, very quickly the kind of thing I have in mind. So you basically create the group, you give it a name, it looks like a file name, the file name is actually a real file name, and it's where the state of the group is kept when nobody's using the group. While it's active, the state of the group is in the applications using it. You can register handlers, and then you can send operations like an update to those handlers. It's polymorphic. This is all done in C#. The corresponding handler is called. I do type checking. You can't actually join a group if you don't have the right interfaces, for example. And what will happen now at run time is that you can use this kind of mechanism to program in what you'd call a state machine style with strong guarantees: even though you're not thinking much about it, fault tolerance and consistency are guaranteed by the platform. Security too. We'll talk about that in a second. So here, for example, is somebody who queries a group and he wants the group to do some work in parallel for him. And as you can see, he's asking for replies from all members. Isis knows how to handle that. The operation is to look up somebody's name. It's a pattern in this case. And when the replies come back, they're in an internal form, and what's done is we turn that into a callback here to a routine called lookup. And what happens on the server side is that the lookup routine is invoked, it does some calculation, and then each of the members replies. Here's a picture of how that might look. So here you've got a querying process -- I've turned the timelines top to bottom now. You've got a querying process on the left. It's talking to this group of five processes which have imported that object. They may have imported tens of thousands of other objects. It's not the only thing those programs are doing. But, in any case, this particular set of threads receives the request, four parallel executions occur, each of them computes maybe a fourth of the result, and then they send the replies and the replies are processed by callbacks. That's the idea. It's very easy to program this way. The popularity of Isis in the old days was really because it was so easy to use. You hardly needed any training to use this model. So a group is an object. The user doesn't experience all the mess, and it's a lot of mess, I can assure you, as you try to build these things, and the groups have replicas. Now, what model are we going to use? In what I'm going to describe, I'm actually merging the virtual synchrony model with the Paxos model.
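A minimal sketch of the call pattern just described -- create a group by name, register handlers, query and collect one reply per member. The real Isis 2 library is C# and actually replicates across machines; this is purely illustrative Python with hypothetical names, and nothing in it is distributed.

    # Toy, in-process mock of the group-as-object idea; all names are hypothetical.
    LOOKUP = 0                         # an operation code registered with the group

    class Group:
        def __init__(self, name):
            self.name = name           # in Isis 2 the name looks like (and is) a file name
            self.members = []          # each member: a dict mapping opcode -> handler

        def join(self, handlers):
            self.members.append(handlers)

        def send(self, opcode, *args):
            for handlers in self.members:      # an update, delivered to every member in order
                handlers[opcode](*args)

        def query(self, opcode, *args):
            return [handlers[opcode](*args) for handlers in self.members]   # replies from all members

    g = Group("/groups/directory")
    for shard in ({"amy": 1}, {"ben": 2}, {"carl": 3}, {"dana": 4}, {"ed": 5}):
        g.join({LOOKUP: lambda pattern, s=shard: [k for k in s if pattern in k]})

    print(g.query(LOOKUP, "a"))        # one reply per member: [['amy'], [], ['carl'], ['dana'], []]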
We did some work with Microsoft Research, Dahlia [inaudible], and found that in fact we can build a super model that subsumes the two and actually is faster than either of them and also cleans up some problems that both models had. I can say more about that in questions if people want to come back to it. But Paxos had some issues, especially when it's dynamically reconfigurable. We were able to fix those. Isis had some issues in the old days of virtual synchrony. We fixed those as well. We have a submission to PODC, a paper that I could share with people if they're interested. So here's the way the platform's going to look. It's going to have a basic layer that supports large numbers of these process groups, these objects. Applications will join them. The applications talk to a library. The library has various presentations: virtual synchrony, multicast. You can actually ask it to be Paxos if you want it to. It will support Gossip as well, so you can use pure Gossip mechanisms if you want and, on top of that, various high-level packages. Very fast pub/sub, a very, very fast data replication package, and other things could be put in there too. For example, database transactions, [inaudible] fault tolerance, overlays. I have quite a range that I'd like to put in. We'll see how far I get. Now, for security, I've actually decided that since people worry about that and I'm aiming at mission-critical apps, I should make that transparent. So simply by requesting that a group be secured, I will secure the group using keys that are generated dynamically, and only group members can make sense of the data that's transmitted. We do compression if messages get very large so that we minimize the load on the network. So, now, what's the core of my challenge? It comes back to the problems I talked about at the outset, James Hamilton and Jim Gray and the eBay people being afraid of instability. So why can I build a stable system if previous systems weren't stable? Now, the core of my challenge turns out to be this: I need to do better resource management than has traditionally been done. And I want to talk about just one example of a problem in this space. There are a couple, but we've solved many of them over the last few years in research. And that's the use of IP multicast as the fastest possible way to get replicas updated. I think everyone would agree that IP multicast -- one UDP packet that's received by several receivers -- is obviously the fastest option, the speed of light for replication in a data center. But we can't use it because it doesn't work well, and in particular, it's associated with the constant instabilities we saw earlier. But we did some studies. In fact, we worked with an IBM research group on that, and if you look at the top right here, you see a graph that's typical of what we came up with. We found that if you use IP multicast and send a constant data rate, a constant stream of messages, the rate is fixed, but you vary the number of multicast groups you're using to send it -- so nothing changes here except the number of IP multicast groups -- in fact, the hardware breaks. So you can see that happening. This is a loss rate graph, and you can see that when on average nodes in my data center are joining about 100 IP multicast groups -- and this is a perfectly plausible number -- suddenly loss rates spike and they go through the ceiling. They go up to 25 percent. This, by the way, is what happened with Amazon in the instability problem that they had a few years ago.
The pub/sub product that they were using accidentally wandered into this space, and with these huge loss rates, no surprise that as people use pub/sub heavily, it melts down. And you can see how insidious this is: because you like your pub/sub product, you roll it out on a larger scale, and then it melts down one day all by itself. So, now, how do we handle that? Well, what we're doing is we're managing the IP multicast abstraction. You see the new blue box below my other boxes. And here's how the management scheme works. It's an optimization scheme based on the kinds of results people are getting from social networking. And there's an optimization formula. If I had a little more time, I'd go through it carefully. But basically what we want to do is decide who really gets to use IP multicast addresses, the hardware ones that are seen by the data center, and we're going to do that in such a way that we never overload the hardware limits. Against that background, we're going to try to minimize the amount of extra work. In our case, not using IP multicast forces you to send point to point. So there's a cost for sending if I tell somebody that their particular multicast group has to send point to point. There's also a receiver cost if I use an IP multicast address for several groups, and some of the groups include people who didn't want some of the traffic. They're going to have to filter. And the way this works is actually kind of easy to understand. What I've done here is I've imagined groups as red dots in a kind of high-dimensional space. That's what's on the top. And subscribers and publishers are the people at the bottom, maybe grad students in the CS department. And they join some of the groups. So, for example, there is genuinely at Cornell a thank-goodness-it's-Friday beer group. They all go out together and drink. Some people drink, some don't. The 1s are the people who are members of the group that drinks beer. And if you think about the crowd that drinks beer and the crowd that wants free food, they may be a very similar crowd. And that's going to correspond to proximity of the corresponding membership groups -- the IP multicast groups -- in the high-dimensional space. And the idea now is just going to be to do clustering and assign one IP multicast address to some set of similar-looking groups. So the little X's represent the IP multicast addresses. So I'm going to do that transparently to you or to my apps. And so here we've mapped some large number of IP multicast groups down to three addresses, but it's at a cost, right? Because some of these groups didn't have exactly the right memberships, and beer people are getting messages about food, and maybe they're not hungry, and some food people don't drink beer and they're getting beer messages. The sending cost is minimal, though. So now what I can do is I can start to say, well, find somebody who is kind of an outlier, and switch him to using UDP for his group, not IP multicast. He thinks he's using IP multicast. He's actually forced to use UDP. He'll actually have a higher sending cost, but my filtering cost has gone down because his application isn't receiving undesired messages. And then I can repeat that until I no longer exceed my maximum for filtering costs, my sending cost hopefully is as low as I can keep it, and I'm definitely not exceeding the hardware limits for the data center routers and so forth, and so I'm definitely not provoking that massive loss phenomenon.
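To make that concrete, here is a deliberately simplified sketch of the greedy mapping idea: logical groups are membership sets, similar groups are merged onto one physical address while the address budget holds, and dissimilar outliers fall back to point-to-point sending. The real Dr. Multicast optimizer uses a richer cost model; the threshold and names below are illustrative assumptions.

    # Illustrative only: cluster logical groups (membership sets) onto a limited
    # budget of physical IP multicast addresses; outliers are sent point to point.
    def jaccard(a, b):
        return len(a & b) / len(a | b)             # similarity of two membership sets

    def map_groups(groups, max_addrs, min_similarity=0.6):
        addr_members = []      # union of members listening on each physical address
        assignments = []       # assignments[i]: logical groups sharing physical address i
        unicast = []           # logical groups told to use point-to-point UDP instead
        for g in sorted(groups, key=len, reverse=True):
            best, best_sim = None, 0.0
            for i, u in enumerate(addr_members):
                sim = jaccard(set(g), u)
                if sim > best_sim:
                    best, best_sim = i, sim
            if best is not None and best_sim >= min_similarity:
                addr_members[best] |= set(g)       # share the address; extra receivers filter
                assignments[best].append(g)
            elif len(addr_members) < max_addrs:
                addr_members.append(set(g))        # claim a fresh hardware address
                assignments.append([g])
            else:
                unicast.append(g)                  # outlier: higher sending cost, zero filtering cost
        return assignments, unicast

    beer = {"amy", "ben", "carl"}
    food = {"amy", "ben", "dana"}
    ops = {"ed"}
    print(map_groups([beer, food, ops], max_addrs=1))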
So this is an example of an optimization result that we're going to use in Isis 2 to map what people think of as IP multicast groups, or what my library thinks of as IP multicast groups, down to a small number of physical groups. I use ideas like this quite heavily. I'll just comment that we did a study looking at lots and lots of situations where people have subscription patterns, pub/sub kinds of patterns, and we found actually that most traffic and most groups have a very heavy-tailed distribution. You can see from the curves here on this graph on the right that relatively few groups account for most of the popularity, and actually an even smaller number of groups if you consider traffic. So a small percentage of the IP multicast use in a real data center covers most of the benefit. And so it's completely plausible that we can get what turned out to be 100 to 1 reductions sometimes in the number of IP multicast addresses required to fully satisfy our objectives in terms of speed. Now, we're doing more. I don't have time today to tell you all about the other things we're doing. I'll just say a few words about them. A second problem is reliability. I do have a reliability application and -- I'm not sure what that is. You can ignore the background noise. And for this reliability goal, you have to get acknowledgments. Well, if you send acknowledgments directly to the sender, you get an acknowledgment implosion problem. So we're using trees of token rings. You see a picture of that here. They have about 25 nodes each. And for very large groups, it turns out this works out beautifully. We have a paper on it. We have to do flow control. The background beeping was actually that I was running out of time, so I'm going to go through this real fast. Basically another optimization, similar in spirit to the first one, lets me keep the data rates for large numbers of groups below target thresholds. This particular picture is an early version. We actually tended to overshoot. Red is the traffic we generated, green is the traffic with our agile flow control scheme. It's a very similar idea to an optimization that says who can send at what rate. And you can see yellow was our target. Nowadays we are below our target. So to summarize, we can build a system that's drastically more scalable and in which most updates occur with a single IP multicast to the set of replicas. And we do this using various tricks, my last slide. And what it gives you, then, is multicast literally at the speed of light with consistency guarantees. And although I didn't have time to talk about it, with guarantees of stability as well -- theoretically rigorous guarantees of stability, theoretically rigorous guarantees of security -- and performance that actually, if you think about it, drastically exceeds what we get today in cloud computing systems. And, in fact, I'll end on the following point. The reason that cloud computing systems, I think, endorse inconsistency, embrace inconsistency, is that they don't know how to do replication at a high data rate. Since you replicate in slow motion, you'd better get used to stale data. And you replicate in slow motion because you're afraid of reliability mechanisms.
If I can fix that -- and I think we can -- we toss out lots of pages of code on my side, it turns out, to make things nice and simple, and as a matter of fact, you end up with the benefit that data is consistent and is being replicated at speeds maybe hundreds or thousands of times faster than what you're seeing in modern data centers, which are using things like TCP point to point to move data from a source to a set of replicas. For me it's going to be direct IP multicast. One replicated update becomes one multicast. And you can imagine the speed-up, and it's a speed-up that I'm already beginning to measure. I've got Isis 2 starting to run here at Cornell. Not quite demonstrable yet, but it will be soon. And with that, I'll stop and take questions. And please ask questions at the microphone because I'm not as close as some of your speakers are going to be today. >>: Thank you, Ken. Questions in the audience? We have a timid audience. Any questions? Yes, we do have one. Great. >>: Hi. So for doing this clustering to decide which multicast groups you will have, you have to do that in real-time, right? Ken Birman: That's right. Yes. We have a paper on this that we're presenting at EuroSys, and the really quick answer is that it's quite efficiently parallelizable. The paper is called Dr. Multicast. It's going to be presented next week by one of the main authors. And what we're able to do is to break up the very large structure. Obviously you get a very large structure. We break it into small pieces and we have an efficient greedy algorithm, and this can be done at very, very high data rates. And then you essentially elect a set of leaders which handle sort of subgroups. You can think of it as kind of a hierarchical version of the protocol. Large clusters within which we allocate some of the resource to each cluster and then within the cluster a suballocation to the actual groups that fall into that area. >>: Okay. So you don't -- Ken Birman: By the way, I'll mention that IBM and quite possibly Cisco are already adopting this scheme. There's no IP in the sense of patents involved, so anybody who wants to read the paper and steal the idea is welcome to. >>: Okay. Thanks. >>: Okay. In the interest of time, we should move on to our next speaker, and let's thank Ken once again for his talk. [applause] Ken Birman: Thank you everybody out there. Again, I'm really sorry I couldn't join you. >>: So our next speaker is Armando Fox from U.C. Berkeley RAD lab. Armando Fox: What an ego trip. I get to talk after Dave Patterson and Ken Birman and I filled the room, even though it's a very small room, but still. So I'm Armando Fox. I do work in the RAD lab, but I also spend some time in the parallel computing lab, and I wasn't sure which template to use for this set of slides because the ideas kind of came out of the par lab, but they've really crossed over quite a bit, and hopefully I'll be able to persuade you of that by the end of the talk. These are some ideas that we've had on how to make parallel programming more productive for people who don't think of themselves primarily as programmers. And although the ideas came out of parallel and multicore, we think there's important applicability to cloud computing as well. So I'm going to try to give you a little blend of both of those things. And as with all good systems work, this is a collaboration with many people, some of whom are listed here and some of whom I've no doubt forgotten.
So our goal in the par lab is high-productivity parallel programming that actually gets good performance and is sort of accessible to mere mortals, and we kind of begin with the observation that everybody knows that these very high level languages like Python, Ruby, I dare say MATLAB, although it's not my favorite language, people like to use them. Scientists, for example, like to use them because the abstractions that you can get in those languages are a good match for the kind of code that you're trying to write, and various studies, none of them done by us, have shown that you can get up to 5x faster development time and express the same ideas in 3 to 10x fewer lines of code. For most of us that probably means a 3 to 10x lower likelihood of bugs. And we're going to stipulate that more than 90 percent of programmers fall into this category. And that's probably a conservative estimate. In practice, I think it's probably even a greater fraction than that. At the other end of the language spectrum you have efficiency languages or efficiency-level languages -- we'll call them ELLs -- C and C++; when I learned C, which doesn't feel like that long ago, C was a high-level language. Today C is a systems language. CUDA, which is the language that's used to program NVIDIA GPUs, or OpenCL, which is a similar open language that's coming out now. These languages tend to take a lot longer to develop code in, but the payback is if you're a really good programmer, if you understand the hardware architecture and if you're willing to kind of work around the language's quirks, you potentially could get 2, 3, maybe more orders of magnitude in performance because you're using the language's ability to take advantage of the hardware model. So far fewer programmers, we're going to say far less than 10 percent, fit into this category. And the irony is that even though, in some sense, these guys are the scarce resource for getting the benefits of these languages, their work tends to be poorly reused. They will come and rewrite an app that someone did in MATLAB, they'll put a lot of work into it, they'll speed it up by three orders of magnitude, and then that code perhaps never gets used again. So we think it's possible to do better than that. Especially because just because you're spending the 5x more development time down here doesn't mean you will necessarily get the improvement. It just means you could get it. And, in fact, a lot of people don't. So can we raise the level of abstraction for everyone and still not have to sacrifice at least part of that performance gain? Traditionally the way that people have tried to do this is you code your algorithm, you do your prototype in something like MATLAB or Python, and then you find an efficiency programmer, somebody who's an expert at saying I know how to take this problem structure, sparse matrix-vector multiply, logistic regression, and make it run really well on this exotic type of parallel hardware. So a few examples. Stencil/SIMD codes. Great match for GPUs because of their natural multidimensional parallelism. Sparse matrix. There's been a lot of work on communication-avoiding algorithms, and that's a great fit for multicore where communication is expensive. At the cloud level, people who do things like big finance simulations that are Monte Carlo-like, some of those are a great fit for expressing with abstractions like MapReduce. Couldn't you use libraries to do this? Sure, you could.
But libraries matched to a particular language don't tend to raise the level of abstraction. They just save you from writing some lines of code in that language. C libraries don't work well, for example, with something like Python just because the abstractions that you can get in Python aren't expressible very well in C. So it's not really possible to create a library that expresses a higher level of abstraction than what the library itself is implemented in. So given that these efficiency programmers, the experts at doing this, are the scarce resource, can we make their efforts more accessible and sort of reusable by productivity programmers? Traditionally the way that people have approached this problem is, you know, like everything in computer science, we do layers, we do indirection, so we have a bunch of application domains in red, virtual worlds, robotics. We identify that there are also these domains of types of computations. So rendering, probabilistic algorithms. And what we'd like to do is have those different domains able to take advantage of all these different types of hardware, which today include the cloud platform. So the traditional solution is, I know, we'll define a runtime and some kind of intermediate language that is flexible enough to express everything up there, right? These are quite different abstractions from each other. So this language has to be general enough to satisfy all those communities. And it has to be general enough to map to these quite different architectures down here, all while getting good performance. Not surprisingly, this goal has been elusive. So here's a proposed new idea that violates layering. It's always fun to give a talk when you say we're going to violate some sacred cow of computer science because, if nothing else, you get questions about it. So here's another way to do it. A couple of years ago the group that started the par lab, which I am now part of, although at the time it didn't include me, identified that there's a handful, let's call them order of a dozen, a couple of tens of computation patterns that recur across many different domains. So our idea is instead of using strict layering to map code that expresses these patterns down to the hardware, let's punch through the layers selectively. So if we have a programmer who knows how to take a stencil code and make it run really well on an FPGA-type research platform, let's let them at it and they create this thing. They punch through all the layers. There's no assumption of a common intermediate language. [inaudible] came up with the name stovepipes, although we thought that that term has negative connotations in the IT industry, but whatever. Anything for a little bit of controversy. So this is our idea. We're selectively violating -- it's not all to all, right? We're not saying we can do all of these. We're not saying we can target every kind of hardware. But we're stipulating that when we can do it, we'd like to do it in a way that makes this work reusable. Oh, there we go. Trial balloon. That was weird. This is a point where people typically ask why is this any different from the arbitrarily intelligent compiler problem. This is the simple answer. We assume human beings do these. Each one of those is probably done by one or a small set of human beings. There's no implied commonality across here.
These two might be completely different individuals who develop them, so looking forward a little bit, we want to try to crowdsource the creation of these things. So we know these people are out there. They're not in the majority, but there's still a lot of them. So here's how we propose to do it. The name of our technique, which just rolls off the tongue, is selective embedded just-in-time specialization, SEJITS, and the idea is that we allow the productivity programmers to write in these high-level languages, but we have an infrastructure that will selectively specialize some of the computation patterns at run time -- and I'll get to selectively in a minute. Everything in italics means I'll explain it shortly. Specialization takes advantage of information available at run time to actually generate new source code and JIT-compile it to do just that computation targeted to that hardware. That's the selective part. And the embedded part, as we'll see, is that you could have done this trick using previous-generation scripting languages, but it would have been a pain because to do it you would have had to extend or modify the interpreter associated with the PLL. This is not true anymore. Because languages are now both widely used and tastefully designed, like Python and Ruby, we can actually do all of this stuff without leaving the PLL. And that turns out to be a boon in persuading people to contribute specializers. So let me give one more level of detail on how it works and then I'll show you a couple of examples, because it's actually much easier to see by example. Here's what happens. When my program is running, written in my high-level language, and a specializable or potentially specializable function is called, first I need to determine whether a specializer exists, for whatever platform I'm currently running on, that takes care of that computation pattern in that function. If the answer is no, that's fine. The program is in a high-level language. We can just continue executing in that high-level language. It will just be slow, but it will run. That's really important. And I'll come back to why. But if we do have a specializer, because these languages have nice features like full introspection, we can actually hand the specializer the AST of the function, and it can use that to do some code generation. And, remember, the specializers are written by human experts, so they can create snippets of source code templates that embody their human-level intelligence about specializing a pattern. What we're doing is source code generation -- syntax-directed source code generation based on templates provided by human experts, and then we dynamically link that code, after compiling it, to the PLL interpreter and we can hand the result back to the PLL. And all of this work can actually be done in the PLL itself. And as I said before, a reason that I think we wouldn't have thought of this maybe five, six, seven years ago is modern productivity languages actually have all the machinery that you need to do this inside the language. If you wanted to do this in a language like Perl -- well, there's many reasons not to use Perl, but if you wanted to, the amount of hacking that you have to do to the innards of the interpreter to get this to really work is daunting. With Python and Ruby, that's not the case. And we're betting on Python in particular for practical reasons that I can tell you about later. So here's an example of how it would work in real life.
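A minimal Python sketch of that control flow, with hypothetical names: the real infrastructure generates and caches source code (C, CUDA, Scala) and links the compiled result back in, whereas here the "specializer" just returns an ordinary callable so the interception logic stays visible.

    # Sketch only: a decorator intercepts calls, looks for a specializer for this
    # function on the current platform, hands it the AST once, caches the result,
    # and falls back to plain Python whenever any of that is missing or fails.
    import ast, inspect, platform

    SPECIALIZERS = {}    # (function name, platform) -> specializer
    CACHE = {}           # same key -> the already-"compiled" replacement callable

    def register_specializer(fname, plat):
        def reg(spec):
            SPECIALIZERS[(fname, plat)] = spec
            return spec
        return reg

    def specializable(fn):
        key = (fn.__name__, platform.machine())
        def wrapper(*args, **kwargs):
            spec = SPECIALIZERS.get(key)
            if spec is None:
                return fn(*args, **kwargs)               # no specializer: stay in the PLL
            if key not in CACHE:
                tree = ast.parse(inspect.getsource(fn))  # full introspection: hand over the AST
                CACHE[key] = spec(tree)                  # "generate + compile" once, then reuse
            try:
                return CACHE[key](*args, **kwargs)       # run the specialized version
            except Exception:
                return fn(*args, **kwargs)               # anything fails: fall back to Python
        return wrapper

    @specializable
    def scale_grid(grid, c):
        return [[c * x for x in row] for row in grid]

    print(scale_grid([[1, 2], [3, 4]], 10))              # no specializer registered: runs in plain Python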
We have a productivity app written in, let's call it Python. Here's the Python interpreter down here in yellow, and it's running on top of some OS and hardware -- we're not going to say what it is. Sometimes we'll call a function that doesn't actually do any computation for which specialization makes any sense, something where there's really no performance gain to be gotten by calling the function. That's fine. We let the interpreter run the function as it normally would. We might also call a function for which specialization is possible. I've used the at-sign notation. For those of you who know Python, it's suggestive of what Python calls a decorator. It's a way of signaling that this function is special and I'd like to be notified before the function's called. So in this case the SEJITS infrastructure will intercept the function call, but it turns out in this case we don't have a specializer for the function. Okay, we lose. That's fine. Keep executing in the PLL. But if we call a function for which there is a specializer, then the specializer can actually generate new code. For the sake of the example we'll assume that the specializer knows how to create a C version of whatever this pattern is. That will get run through the standard C compiler tool chain, the code gets cached so that in the future when you call the function again, you don't have to redo this step. The .so can be linked to Python on the fly. Again, you couldn't do this with scripting languages one generation ago, right? The fact that we can compile, create the .so on the fly and pull the symbols in, that's relatively new. But it means that we can actually now call the specialized version of the function and completely bypass the original PLL version. Come on. So this is why it's selective, right? Not all functions are necessarily specializable. It's embedded because we can use machinery already existing in languages like Python to actually do everything that I said. It's just in time because we're generating new source code. By the way, why not just generate a binary directly? Because that would be stupid. We have a good compiler, right? In fact, I've used .c with CC as an example, but I'm going to show you some real examples that are working today where this tool chain is a lot more sophisticated. If you've done any GPU programming with something like CUDA, the CUDA compiler is highly non-trivial, right? It takes these ungainly C++ templates and converts them into multithreaded CUDA code. That's a lot of post-processing. So this is a simple example. But in fact, a lot of work has also gone into this tool chain, and we're able to leverage that directly. And, of course, specialization means that instead of executing the function as originally written in the PLL, we're going to execute the specialized version. So since I said this was easier to see by example, here's a couple of examples. Don't worry if you don't know Ruby. These are examples that are working now. Here's an example that takes a stencil computation and attempts to run it on something like a GPU. This is straight Ruby code. If you run this, it will compute this two-dimensional stencil over the radius-1 neighbors of some grid. This is a really simple function. We're just multiplying each point by a constant. No big deal.
But when the function is called, this actually subclasses from a specializable class that one of our grad students created, and what will happen is the function will be handed the entire AST of this computation, from which it can pull out things like, one, the radius of neighbors you're doing the stencil over; it can also pull out the AST corresponding to this intermediate computation. Whatever functions you want it to do in here -- this is a really simple one, but you could have arbitrary code in there defining what stencil you want to run, and using that AST exactly the same way as a compiler would, you can actually emit code. In this case the code is emitted for OpenMP, and this is what it looks like. I've stylized it a little bit to make it easier to read, but it's basically cut and pasted. So you'll have to take my word for it that semantically this computes the same result as that. The only question for you is which one would you rather be writing. So we're able to get -- and, by the way, because we know things like the dimension of the stencil, this is information you don't know at compile time, right? This could be an arbitrary expression. The fact that we can pull that out at run time means that if we have different -- entirely different source code templates, for example, for compiling to the GPU differently based on that constant, we could pick the right variant at run time too. That's something you can't do at compile time. So the specializer emits OpenMP in this case. Not surprisingly, the compiled OpenMP code is about three orders of magnitude faster than doing it in Ruby, which is the entire point. And, again, remember, if any step along this way fails, if we have problems compiling, if there's something in this function that renders it possibly unsafe to specialize, for example, maybe I put something in here that can't be proven not to have a cross-loop dependency of some kind, fine. We throw up a flag, you run it in Ruby. Here's another example. This one was just -- we got this working, like, two days ago. So this is sparse matrix-vector multiply in Python. And the idea is I'm going to define a sparse matrix-vector multiply function where I have -- Ax is just an array of the non-zero elements of the matrix, and Aj is an array of the indices of the non-zero elements. Ax and Aj are what I'm multiplying. And the logic is really simple, right? I have a sub-function which takes a column and multiplies it by the vector. And the way you do that is you just map the multiplication operator over the column and the vector, and then I just run a map to do one for each column. So this is a pretty standard way that you would express in functional terms how to do a sparse matrix-vector multiply. It just says gather all the non-zeros, multiply each of them by the vector, and just repeat that thing for each column, right? What happens if you run this through the Copperhead specializer? So this specializer is actually smart enough to do the code analysis to figure out that the gather operation is actually supported by something in the CUDA libraries. It automatically generates C++ templates -- so if you're a C++ template fan, God bless you. If you're a scientist, you should run away screaming from this, because nobody really knows how to use C++ templates well. That's my theory. So what's happening here is it's actually generating code that's going to go into the CUDA compiler that uses C++ templates which the compiler will turn into the right CUDA machine code. So this is a good example of why you don't want to generate binaries directly.
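For readers without the slide, the Python-level logic just described looks roughly like this (plain Python, not the actual Copperhead API, and organized here by row of the sparse matrix):

    # Plain-Python rendering of the described logic: for each sparse row, gather
    # the needed entries of x, multiply them by the stored non-zeros, and sum;
    # then map that little function over every row.
    def spmv(Ax, Aj, x):
        # Ax[i]: non-zero values in row i; Aj[i]: their column indices
        def row_dot(vals, idxs):
            return sum(v * x[j] for v, j in zip(vals, idxs))
        return list(map(row_dot, Ax, Aj))

    # the 2x3 matrix [[5, 0, 2], [0, 3, 0]] times x = [1, 2, 3]
    print(spmv([[5, 2], [3]], [[0, 2], [1]], [1, 2, 3]))   # -> [11, 6]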
There's a lot of work involved even getting from this to CUDA, so we've actually raised the level of abstraction significantly by doing this. So kind of the message of this example -- in real life we'd probably just package up the entire sparse matrix-vector multiply as its own function, but the example is useful for showing that the ability to leverage downstream tool chains that do this could actually be significant leverage. One last example. So let's actually talk about the cloud now. All I did in this picture -- this is the same one I did before, but instead of C, I put Scala and the Scala compiler. Anybody familiar with Scala? More people than I thought. It's a very tasteful quasi-functional language that made the brilliant engineering decision to compile to JVM bytecode. So they can take advantage of all of the Java infrastructure for running Scala programs. One of our students has created a package called Spark, which is an extension of the Scala data-parallel API for doing MapReduce kinds of jobs, primarily targeted at machine learning. So all I've done is I've replaced the C compiler tool chain with the Scala compiler and this Spark framework that this student has developed, and it runs on top of a project called Nexus, which is a cloud OS that's being developed in the RAD lab. So what does Spark give you? Spark gives you cloud-distributed, persistent, fault-tolerant data structures. What this really means is if I want to run something that looks like a bunch of MapReduces, I don't have to do a disk write and a disk read between operations. And if I lose one of the nodes, Spark can reconstruct the lost data because it knows the provenance of the data. It's kind of a neat set of stuff. And it relies on Scala. So it's written in Scala, it relies on the Scala run time. And it relies on Nexus, which is this cloud resource manager. Okay. So why show all this stuff? Because we have another example. Here's logistic regression in the cloud. Here's, again, a Python logistic regression function. You see the Python at-sign syntax, which means intercept this function for possible specialization. Logistic regression is conceptually pretty simple. I have a bunch of points. I want to find a hyperplane that separates them. So I basically start with a random hyperplane, and at each iteration I compute a gradient and I move the plane a little bit so that the separation gets better. And I do that for some number of iterations. We're on the way to having a specializer that will take that and generate this. This is Scala. That's a pretty nice language. But what's interesting to notice here is that we can figure out, because the technology already exists in Copperhead to do this, that this operation is really just a reduction. It's the only thing inside the loop. We have a single initialization step and a single accumulation step. So this is really just kind of a MapReduce operation, right? And the amount of code analysis you need to do this is not that much, and it's already largely in Copperhead. So you might say, well, Scala, actually this is a pretty high level of abstraction. What's the benefit, really, of going from here to here? With the CUDA case, you could see why you'd rather write this. But with this case you could argue, well, maybe I just want to write it in Scala, right? Scala's not so bad.
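For reference, the Python-level loop being described might look roughly like the sketch below; the names are made up, and the Scala/Spark code it would specialize to is the usual map-and-reduce gradient step.

    # Rough plain-Python version of the loop described above (hypothetical names).
    # Start from a random hyperplane; each iteration computes a gradient over all
    # points -- the single reduction a specializer can recognize -- and nudges the plane.
    import math, random

    def logistic_regression(points, labels, iterations=50, step=0.1):
        dim = len(points[0])
        w = [random.uniform(-1, 1) for _ in range(dim)]         # random initial hyperplane
        for _ in range(iterations):
            grad = [0.0] * dim
            for x, y in zip(points, labels):                    # labels are +1 or -1
                margin = sum(wi * xi for wi, xi in zip(w, x))
                c = (1.0 / (1.0 + math.exp(-y * margin)) - 1.0) * y
                grad = [g + c * xi for g, xi in zip(grad, x)]   # accumulate the gradient
            w = [wi - step * g for wi, g in zip(w, grad)]       # move the plane a little bit
        return w

    print(logistic_regression([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0]], [1, 1, -1]))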
And this is true, but the difference is that once you generate the Scala code, now all this machinery for doing resilient cloud-distributed data structures, which was designed only to work with Scala, now works with Python. The other example that I gave before using the GPU was also in Python. So I could actually mix different kinds of hardware platforms into the same program, and in fact the program is agnostic as to which one it's using. Within a single Python program I have one chunk of stuff that will end up specializing to CUDA for the GPU, I have another chunk of stuff that may end up doing a cloud computation. And, by the way, if I'm just running this on a vanilla laptop that has neither cloud support nor a GPU, it still just runs in Python. And, by the way, just for MapReduce fans, just for completeness, I had to have this slide. I've now replaced Scala with Java and I've replaced Spark with Hadoop. This is probably a nicer way to run MapReduce jobs. Python has map, it has reduce, and you can run a little Python program on your laptop, and when you're ready to go to the cloud, you just run it on top of a SEJITS-enabled machine that actually knows how to talk to the Hadoop master, and map and reduce in Python become Hadoop map and Hadoop reduce. Yeah, nice animation. So why do we think this is an exciting idea for cloud computing? There's a few different reasons. But really the most exciting one is that you could plausibly have the same application in a language like Python that runs on your desktop, on exotic hardware like a manycore GPU, or in the cloud. If you're doing things like building clouds that have machines with GPU cards in them, there's now the opportunity to do two levels of specialization. You could do per-node specialization targeting the multicore GPU, but you could also identify computations that would benefit from being farmed out to the cloud. So you could emit JITable code for something like Spark, like I showed, or for MPI. And at the single-node level, you could take advantage of things like a GPU with CUDA or with OpenCL. So you're combining different abstractions targeted at different kinds of hardware, but you're doing it with a common notation, and you're doing it in the context of one app. And this is in red for a reason, right? If anything goes wrong, you still have working Python code that will run in a stock Python interpreter with no other libraries. So one of the big benefits we believe is that this gives you an incremental way to start testing out new research ideas. In fact, in the par lab we have an FPGA-based emulator called RAMP that is designed to be a research vehicle for testing out parallel hardware architecture ideas, and rather than having to, you know -- to be able to run a real benchmark, we don't need to get an entire compiler tool chain and everything else all up and running. All we need to do is get CC and .so loading running and then we can decide to create specializers that target the subset of the hardware whose features we want to benchmark. And, in fact, we've actually done this. We have a chunk of an image segmentation application that is running on our RAMP-enabled hardware, and it does it with specializers that emit a subset of SPARC V8 code for it. I think I wanted to leave a bunch -- no, wait, these are our questions. Let me forestall some questions you may have and then we'll take other questions. So questions that we've gotten -- this is, by the way, very early work.
And the examples that I showed, the first two examples work today. The third one is on its way to working. One question that we get is don't you need sort of an arbitrarily large number of specializers to do this? We believe in that par lab bet that a modest set of motifs actually applies to many applications; if that bet is correct, then the implication is that having tens of specializers will actually help a lot of people. So even that would be a useful contribution. Why is this better than something like libraries or frameworks? Well, we love frameworks, and we think it is complementary to frameworks, but as I said about libraries, if you have a library that's written to be linked against an ELL like C or C++, it's difficult to imagine that the library will export a higher level of abstraction than what is conveniently available in the ELL. So libraries may save you from writing code, but they don't save you from having to work at a lower level of abstraction than you might otherwise like. I think I already mentioned, why isn't this just as hard as the arbitrarily smart compiler problem? It's because it's the people who are arbitrarily smart. What we're trying to do -- I hate using terms like crowdsourcing, but someone suggested it and now it's kind of stuck. But we're trying to package the work that these experts can do in a way that makes it reusable. If we can get -- imagine the open source model where you've got different people contributing specialization modules to some online catalog and you can decide which ones you want to download. That's the direction that we hope and expect that this would take. Possibly a more interesting question is, you know, our target audience is largely programmers who today are using things like MATLAB, heaven forbid some of them are still using Fortran, and some of the examples that I showed definitely are functionally flavored or they use functional constructs, they use list comprehensions. Are these programmers really going to learn how to do that? I think that's an open question, but I also think that there's kind of a 20 percent of the effort in teaching people about the functional way of thinking about things that will take you 80 percent of the way, and you can ask me afterward about a program I have in mind called Functional Programming on the Toilet that I believe will be a fine educational campaign for this. If you've ever been to Google, they have this testing on the toilet -- ask me afterward. But we believe that a modest amount of education will go a long way. In fact, in the example that I showed in Python from Copperhead where we're specializing to a GPU, there's also a fair amount of code analysis there that does things like simple loop fusion. So, again, we don't want to go down the slippery slope of having to build the world's smartest compiler, but we think there's an amount of education that will make a huge difference here. And happily for us, we work at a university where education is supposedly one of our top priorities. So we have an opportunity to get this way of thinking ingrained at a relatively early stage in students' careers. So we believe that SEJITS will enable a code-generation-based strategy of specializing to different target hardware, not at the level of your application but at the level of each function, presumably.
We think that there's a possibility that this could be a uniform way to do programming with high productivity from the cloud through multicore and specialized parallel architectures like GPUs, in part because we can combine those multiple different frameworks into the same app. And as we've said, even in the par lab so far, it's been a research enabler because as we develop and deploy new hardware and new OS, we can incrementally develop specializers for specific things that we want to test and let the other stuff just run in the PLL because we don't really care too much about its performance. So the idea that you don't need a fully robust compiler and tool chain just to get research off the ground, we've already seen some benefits from it, and we think that there are probably more waiting down the line. So with that, I think I have timed it so that there's order of five minutes to discuss stuff. And thank you for bearing with my melange of different topics. [applause] Armando Fox: Don't applaud until after the questions have been answered. >>: You mentioned that you could write these kinds of specializers in the productivity language. Isn't the population of people you're targeting to write the specializers people who want to write in C rather than in the high-level languages? Armando Fox: Well, I said you could write them in the PLL. I said you don't have to. Having said that, the grad students who have been writing specializers have said that they much prefer writing the specializers in Python. Right. >>: [inaudible]. Armando Fox: No, no, the grad students are not typical. But I think, by definition, specializer writers are not typical. >>: [inaudible]. Armando Fox: Right. >>: So your specializers depend on the annotations that we saw there to know when -- Armando Fox: At the moment. >>: But the idea is ultimately you want to recognize patterns in the code or in the AST or something like that? Armando Fox: It's an open question how far the automatic recognition part will go. Right now we're relying on things like the fact that Python and Ruby are object-oriented, so if you subclass from a specializable class, you'll get that for free. But I don't know the answer to how far -- how automatic we'll be able to make it without annotations. We've been asked that. >>: [inaudible] invoking the specializer. Armando Fox: Right now you have to know to put the magic keyword to annotate your code, yes. >>: [inaudible] but I have a question. So actually this attempt at specializing [inaudible] the same layered architecture that you're using has been done with Temple in '97 and in the [inaudible], and the hard part -- there were two hard parts. The first one was debugging. Yeah. The second one was a problem you mentioned before. So how many specializers do you have to write? Armando Fox: Well, time will tell if we're right. But the debugging one is a very good question. So we have been working with [inaudible], who's on the languages and language engineering faculty at Berkeley. One of the approaches we're looking at is when you're writing the code in the productivity language, there's a certain amount of instrumentation or debugging or test coverage metrics that you would like to capture at that level, but once it's specialized, it's not clear that the correspondence -- you know, there may be no correspondence between abstractions up here and abstractions down here. So we're looking at whether there are AST transformation techniques that will be able to either preserve some of that information or give you better coverage.
So I know I have good test coverage in the PLL, but I want to make sure that when it runs through the CUDA specializer, I get the equivalent C0 coverage at the CUDA level. We believe that there may be automated techniques that can at least tell you what you're missing and help you generate those test cases. So it's not a complete answer, but we know that debugging and correctness when you go through this arbitrarily hard transformation is going to be a serious challenge. Did you have your hand up before for a while? >>: I didn't understand why you need to [inaudible]. Armando Fox: I always wish I had more slides to show, but there's a slide that would have covered kind of what you -- the extreme that you're talking about we think of as auto-tuning, where at compile and install time you've done a bunch of benchmarks, you've figured out code variants that work well on this machine, you have a relatively small number of them around, and then at run time it's just a matter of picking the right one. One of the things we showed -- let me go back to the example because I think it will answer the question pretty well. This one. So the student who actually wrote the stencil specializer -- it turns out that depending on what the value of this constant is, you might choose to tile the GPU completely differently. So it's not just a matter of a small change -- he would use a completely different source code template, he might use a different strategy, depending on the value of something that might not be known until run time. In this example I put the constant 1, but this could be an arbitrary expression. So I think in cases where you have enough information to use a pre-compiled, compile-time generated code variant, you should do so. And, in fact, people who do auto-tuning libraries do exactly that, and we are in fact using SEJITS as a delivery vehicle for making auto-tuned libraries available to high-level languages. But we believe there are enough cases where, by taking advantage of run time information, you'll be able to generate much better code. >>: [inaudible]. Armando Fox: So do you need K times N specializers in order to cover K patterns on N platforms? >>: [inaudible]. Armando Fox: Oh, I see. So if you have, you know, an integrated GPU and a separate on-card GPU, which one do you -- we don't know. Not yet. We would love to have that problem, because it means we have more specializers working, but we don't know yet. >>: We'll have to have other questions at the breaks. It's time for our next talk. Armando Fox: Okay. I'll be here at the breaks. So thank you for your attention. [applause]
Jan Rellermeyer: Okay. So I will talk a little bit about my ideas on elasticity through modularity and [inaudible]. I'm a Ph.D. student at the Systems Group, ETH Zurich, working with Gustavo Alonso and [inaudible]. So if we look at elasticity, that's maybe the key reason to go to the cloud, as we also heard in Dave Patterson's keynote today. Elasticity is the ability to acquire and, equally important, to release resources on demand, because once your peak load goes down, you don't want to be over-provisioned. So you have to have the ability to also scale down your resources. If you do this at an infrastructure level, well, we have solutions for this that work pretty well, like Amazon EC2. It's all based on virtualization, isn't it? But the problem of software, I think, is a lot harder, because commodity software is not really meant to be that elastic.
It's not written in a way that supports this idea of elasticity very well. Most of the time it's a chunk of software, and you can tailor it to a certain deployment, but that's about it. Okay. If we look at why this is the case, well, we come from a world where basically we have a single large system and the software is written for a single large system. On this system we have shared memory, we have complete cache coherence, so we can do all the magic that we want. Once your system -- or once the application -- has grown to a certain size, maybe it won't run on a single machine anymore. So we go to a distributed setup, and then we also have to make some design choices. So maybe we go to a three-tier architecture kind of style. We would choose the partitioning that supports this kind of hardware setup better than what we had. But that already restricts what we can do. It's not the same kind of magic anymore. If we then go to the cloud, where maybe for a programmer a cloud is like a worst-case distributed system where things can fail all the time, where you have to think of all these things that can happen, there are certain programming models that work well with elasticity -- for instance, Hadoop, MapReduce or Dryad. But it's not clear for a general purpose application how to reach the same level of elasticity out of the box. So what is elasticity about in software systems? Well, we would like it to be maybe as elastic as a fluid. You know, you put it into a certain glass and if you change the shape, it will just redistribute itself. There's nothing that you have to do about it. It will just distribute itself. At the same time, ideally it would be like delocalization: I don't even have to care where my software currently is. It's just there, and I can talk to it. And if anything bad happens, if I put some force on the system, well, the system will just redistribute itself and heal itself, and, you know, these are the kinds of properties that we would like. So I think the key to achieving something like this is actually modularity. Modularity has been discussed since the 1970s, mostly as an idea for structuring code. It really came from this thinking about how we can structure code in a better way so that we can reuse code and so on. More recently, I think the focus of modularity has moved to deployment time, so to say, to separate programming in the small from programming in the large. And this is exactly the kind of composition problem that was also mentioned in the previous talk. If you have certain things that are written in a domain-specific fashion, that's very good, but you have to somehow make them interact. And I think modularity has shown that this is possible. Most recently there were some things like inversion of control and plain old Java objects, which kind of implement the idea of leaving modules as vanilla as they can be and not tying them to a specific communication platform, not tying them to a specific implementation of anything, but using an intelligent run time or a container to inject these functionalities into the modules. The good thing about modularity is all the tradeoffs are very well understood. There is this basic idea of coupling versus cohesion: if you design a good module, you want it to be as loosely coupled as possible to the rest of your system, because then all your change management, all your code management, is always local and won't affect the rest of your system. That's the composition idea.
At the same time, we want it to be highly cohesive so that if you're looking for something, you know where to look. You don't want things to be coincidentally co-located in the same module if they are just not the same, if they are two different kinds of things. So if we go back to my previous example, I think this is exactly how modularity maps to it. Isn't this cohesion? You want these small functional units that can redistribute themselves, and for this you need this fine granularity. Otherwise it's not possible. And isn't this exactly low coupling? Because if things are loosely coupled, well, then I can change the structure of my system at run time, even without the application running on top of this layer noticing, because I can rebind these services and so on, I can rebind these modules in any possible way that I want. So I have taken this idea and tried to apply it to a complete module system. I've chosen OSGi for a couple of reasons, mainly because I'm most familiar with it. OSGi is the dynamic module system for the Java language. It's an open standard, pretty well supported in industry. For instance, the newest generation of application servers, some of them not even released yet, will -- most of them will be based on OSGi. There is the Eclipse IDE, which is just about the most widely used OSGi application. But also in embedded software or mobile phones, OSGi has gained quite some traction. In OSGi, modules are called bundles, but for practical purposes you can think of them as just Java JAR files, so a kind of deployment unit that you can take and deploy to a system and run. The important thing is they contain additional metadata, and this additional metadata can be a lot, but the important part is they explicitly declare their dependencies. So that's, on the one hand, kind of a penalty for over-sharing things: you have to declare them, so you will not couple your system too tightly, or you will at least notice. On the other hand, the run time system, which is called the framework in OSGi, has a lot of knowledge about how your modules actually are interconnected. And this knowledge was originally introduced because the first application for OSGi was the update problem on long-running embedded devices, like television set-top boxes: you buy them, and once in a while your operator has to update your software, and you want to do this in a way where not your entire system has to be rebooted. And knowing these kinds of dependencies permits the framework, or the run time, to do a selective update of those components that are actually affected and leave the rest of the system untouched. And that's one of the reasons why it has also become popular with large-scale enterprises. It's a solution to the update problem, and it's a solution to the extensibility problem, because you can incrementally add more modules while your system remains running, you can update them, you can remove them from the system, which in traditional Java is not possible. You can't remove something from a running system. You have to throw away the class loader. And that's what OSGi does: it uses a single class loader per bundle. And that's also the idea of isolation, because, of course, once you do this, essentially you cannot talk between bundles anymore unless you explicitly share your code, and this shared code is then handled by the framework in a way that delegates your class loading calls to the class loader that is [inaudible] for this bundle. But that's more internal.
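To make the bundle idea concrete, here is a minimal sketch of what such a module might look like in Java. The manifest headers (shown as comments) and all the names are illustrative placeholders of my own, not anything taken from the talk; the manifest is where the dependency metadata lives that lets the framework do selective updates.

// Illustrative OSGi bundle. The manifest metadata (shown here as comments)
// is what lets the framework track dependencies and swap the module out
// without restarting the rest of the system.
//
//   Bundle-SymbolicName: com.example.greeter          (hypothetical name)
//   Bundle-Version: 1.0.0
//   Import-Package: org.osgi.framework                (declared dependency)
//   Bundle-Activator: com.example.greeter.Activator

package com.example.greeter;

import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;

public class Activator implements BundleActivator {

    // Called by the framework when the bundle is started.
    public void start(BundleContext context) {
        System.out.println("greeter bundle started");
    }

    // Called when the bundle is stopped or updated, so this one module can
    // be replaced while the rest of the system keeps running.
    public void stop(BundleContext context) {
        System.out.println("greeter bundle stopped");
    }
}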
That's still tight coupling. If you structure a system like this, you will gain some flexibility, but that's not the kind of elasticity that I am targeting. What you can do, though, is introduce something that is a more loosely coupled way of structuring your system, and that's what OSGi achieves through services. Services are a lot different in OSGi than in big SOA stacks, because in OSGi they are actually very lightweight. A service can be an arbitrary Java object that you just register under one or more interfaces. And once you make the handshake with the run time system and you have acquired a service, what you get in return is just this Java object, so there's no penalty involved at that point. And that's why it became popular in embedded systems. There is some overhead involved in asking for the right kind of service, doing some filter matching to get the thing, but once you have a handle on it, it's just an object. Nothing more. And then, of course, since I told you you can update and remove things at run time, OSGi is a very dynamic platform. So in the application model of OSGi, this dynamism is built in, deeply built into the system. A regular OSGi application that wants to play well has to monitor the system's state and subscribe to certain events. For instance, once you hold a service, you should listen to events: if the service goes away, you should be prepared to just fail over or do whatever you can, but you have to listen to these kinds of things. So that's the running OSGi system. You can consider that to be an application consisting of two modules, your bundles, where the bundle on the right uses the bundle on the left through a service that has been registered with the framework's central service registry. That's a standard kind of SOA, I think. That's why Gartner calls OSGi the in-VM SOA. It's applying the same kind of thing to a small-scale virtual machine. The system is dynamic, as I told you, but it's still far away from anything that we could expect in a cloud environment. So why are software modules interesting for the cloud anyway? Well, first of all, the cloud has a problem of software component lifecycle, hasn't it? You want to provision your software at any time to the cloud, to your cloud application, and maybe you even have to update your cloud application on the fly. And that's something that the OSGi kind of thinking applies very well to. There is also the problem of composition. Once you have developed certain things that you want to reuse, well, you have to make them communicate, and services are a very intuitive way of designing interfaces between your components so that they can talk to each other. But now, I said OSGi works only on a single virtual machine. Well, that's the state of the art as it was. In the cloud, we typically have a big data center, a big distributed system where things just randomly fail. But for these kinds of purposes we earlier developed an extension of OSGi called R-OSGi, which can transparently turn any service invocation into a remote service invocation. And the nice feature about this is that since local OSGi applications are already prepared to handle the removal of a service gracefully, all that we have to do in, let's say, the case of a network failure is map it consistently to the kind of event that the application can already handle.
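As a rough sketch of the service mechanics being described, registering, looking up, and tracking a service in standard OSGi looks something like the following. The Greeter interface and its implementation are made-up examples, not anything from the talk.

import org.osgi.framework.BundleContext;
import org.osgi.framework.ServiceReference;
import org.osgi.util.tracker.ServiceTracker;

// A plain Java interface and object: no SOA stack, no heavyweight proxies.
interface Greeter {
    String greet(String name);
}

class GreeterImpl implements Greeter {
    public String greet(String name) { return "Hello, " + name; }
}

class ServiceDemo {

    // Provider side: register an arbitrary object under an interface name.
    static void provide(BundleContext context) {
        context.registerService(Greeter.class.getName(), new GreeterImpl(), null);
    }

    // Consumer side: after the handshake with the registry, what you hold
    // is just the Java object itself, so calls carry no extra cost.
    static void consume(BundleContext context) {
        ServiceReference ref = context.getServiceReference(Greeter.class.getName());
        if (ref != null) {
            Greeter greeter = (Greeter) context.getService(ref);
            System.out.println(greeter.greet("OSGi"));
            context.ungetService(ref);
        }
    }

    // Dynamism: a tracker (or a ServiceListener) tells you when the service
    // appears or goes away, which is exactly the event R-OSGi reuses to
    // signal a remote failure to the client.
    static ServiceTracker track(BundleContext context) {
        ServiceTracker tracker = new ServiceTracker(context, Greeter.class.getName(), null);
        tracker.open();
        return tracker;
    }
}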
So we can, for instance, just remove the service proxy, and that would be equivalent to an administrator who has just removed the service from the system, because that's what has happened from the perspective of the client. So that was the first step. But, of course, we want to go further. And going further means that we want to assimilate much of the complexity that arises from such a system, and from programming such a system, into a run time system so that the knowledge can be reused at any time. And that also means that we want to try to keep modules as plain and vanilla as they are and tie them as little as possible to a specific communication paradigm, to a specific consistency model, to anything that is really specific to one deployment, because one of the ideas in cloud computing is that your system will change. It will hopefully change from a small deployment to a large deployment over time, but that also means that you have to adjust the tradeoffs that you made at every step of this evolution. And there are a couple of not very highly cited but still interesting papers from practitioners at different cloud platform consumers that explain how their system setup changed when they went from a million users to 2 million users to 10 million users. The bottom line is they essentially rewrote half of the system every time they took one of these steps. And I think that's the challenge for software engineering: to avoid this kind of one-time programming and go to a more portable way of expressing things. So we have tried to implement this in a prototype system called Cirrostratus, which you could consider to be a run time system for elastic modules. It's built around the OSGi view of the world. It supports modules and services. But most importantly, it presents itself to a client as a single system image. So if you deploy Cirrostratus to a set of nodes, let's say on EC2, you can talk to any of these nodes and you will get a consistent picture of the system. In the same way, of course, you can add some new nodes and they will just join this overlay that the run time forms. The stack required is a JVM and then a local OSGi framework, because, of course, we wanted to leverage as much as possible of what a local OSGi framework can already do. What we put on top is essentially a virtualization layer that deals with all the distributed systems behavior that we need. By doing this, we end up having virtual modules and services, which is a lot different from a traditional application, and I will explain this later. But we also run into the problem of state that has to be handled in our system, and we have some thoughts about how to handle this. But in order to be really elastic, it's not enough just to deploy this thing and then hope that it will run well. We have to continue to monitor the system. We have to continuously see how the system performs and whether there are any new resources that can be acquired. And for this we are instrumenting the code a little bit and triggering redeployment whenever a controller tells us to do so. So that's the big overview. From an application perspective, this is the view that an application sees. So if this is your fancy application that is written in a certain modular way, let's say it consists of three modules, that's the expectation of a single system, isn't it? You want to see these three modules and they are, let's say, connected through services. But that's only the virtual deployment.
That's like the contract between the application and the platform, implicitly expressed. What we can do on the physical layer is transform this into any kind of physical deployment that fulfills this implicit contract. We can arbitrarily replicate modules as long as we do the bookkeeping in the back. We can arbitrarily choose to co-locate modules on the same machine: because here we have a heavy usage pattern, maybe here it makes sense to co-locate these two modules. If we have a module here that has some valuable data and we need high availability, of course we can create a lot of replicas of it and make sure that we never run, let's say, under five replicas. But the important message here is this can all change over time, because your application does not have a steady workload. If this goes down to a very low usage pattern, there's no reason not to move this module away to a different machine in order to maybe get a more scalable system. So these are all kinds of tricks that we can play, and it's transparent to the application because the application sees this virtual deployment, and what we do under the hood is the physical deployment that can react to all these network and platform and workload changes. So now, I told you the big problem here is, of course, state, because on the physical layer it of course makes a huge difference whether you have one service or ten services. If you just constantly invoke functions on one service, well, all the replicas will not see the same consistent picture. So you need something that deals with the state. And if we go back to the examples that I gave of scalable cloud applications or distributed systems, most of them are actually built around a very specific and tightly expressible model of state. I mean, in Hadoop and in Dryad it's very explicit: you can see these boxes and how state is passed through them. In three-tier, we try to be as stateless as possible. We externalize most of the state into the database and then we just keep a session. That's the model of the world in these kinds of things. And, of course, for our system, feel free to pick any way to express your state; you can write an adapter to make it accessible to the system and you are done. But I think there are also kinds of applications, maybe legacy applications, where you cannot or don't want to do this. And for these we have tried to take an extreme approach of inferring the state from the modules, just to see how far we can drive this vanilla-module idea. And what we do with these modules is an abstract interpretation based on the idea of symbolic execution. So once you load a module, there is a small interpreter that will operate on symbols, sweep through your code, and try to apply all your instructions to a symbolic stack and a symbolic heap. So what we get in the end is that at any point in this small little program we can see what kind of impact this instruction has, and we can filter out those that have an impact on state from those that haven't. And that's what's important, because we don't want to replicate the entire heap. That's much too expensive. We just want to replicate those operations that are actually dealing with distributed state. So that's a very simple example, and maybe here you can read that I'm accessing fields, fields of the service, and that is the object oriented way of expressing state, isn't it?
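The full Cirrostratus analysis is a symbolic execution, but the simplest flavor of the idea, finding the instructions in a module that touch fields (which is where object state lives), can be sketched with an ordinary bytecode visitor. The sketch below uses the ASM library purely as an illustration of my own choosing; the talk does not say which bytecode toolkit is actually used, and this is not the abstract interpreter itself.

import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

// Walks a class and reports every instruction that reads or writes a field,
// i.e. the candidate state-affecting operations the talk describes.
public class StateAccessScanner extends ClassVisitor {

    public StateAccessScanner() {
        super(Opcodes.ASM4);
    }

    @Override
    public MethodVisitor visitMethod(int access, String name, String desc,
                                     String signature, String[] exceptions) {
        final String methodName = name;
        return new MethodVisitor(Opcodes.ASM4) {
            @Override
            public void visitFieldInsn(int opcode, String owner, String field, String fieldDesc) {
                boolean write = opcode == Opcodes.PUTFIELD || opcode == Opcodes.PUTSTATIC;
                System.out.println(methodName + (write ? " writes " : " reads ")
                        + owner + "." + field);
            }
        };
    }

    public static void main(String[] args) throws Exception {
        // Scan a class from the classpath, e.g. a service implementation.
        new ClassReader(args[0]).accept(new StateAccessScanner(), 0);
    }
}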
In an object oriented language, much of the state at run time is kept in the fields of the services and in the transitive closure that is attached to all these fields. But this can get arbitrarily complicated because, I mean, it's a high-level language, so you can [inaudible] your state in any possible way, you can call into different methods. Symbolic execution is capable of filtering out much of this. There is a limit: if you're doing purely virtual calls, the interpreter will not be able to figure out what the target of your call will be at run time. So there are some tradeoffs involved, but in practice it works quite well for a lot of applications. And once we have gathered this insight into the structure of the program, well, that's exactly how the application was designed to behave if it were running on a single system. So we can transform it into an application that now runs on a distributed system by just instrumenting the code, rewriting the bytecode, and weaving in exactly the kind of state replication mechanism that you choose. But the important thing is you choose it at run time. So, for instance, you can make this entirely transactional, you could do a weakly consistent replication scheme, anything that you like. The knowledge about where the state is stored, that's the key thing that you need, and then you can rewrite the module. And since we are already rewriting it, it seems natural that we also introduce the performance probes at this point so that we can continuously monitor the performance. This is also very specific to the kind of replication that you use, because choosing one or the other replication scheme has some impact on, let's say, the network utilization and so on. So we can do this all in the same place. In the end, what you would like to have is a controller that reads all this performance data and then triggers some action. In our platform we have some actuators built into the system that a controller can call, but for practical purposes we are not saying we have the best possible controller. We believe that most of the time the controller should be something written specifically for an application, because that's actually the glue code that tells you how to handle the specific parts of your application and how to turn this into an elastic deployment. What we give you is the platform to do so and the promise that you do it once and you don't have to do it again if you change your deployment. So we have implemented some smaller use cases. Here's actually one of the larger ones, which is a bit interesting. We have taken an online game written in Java called Stendhal. That's a typical client-server kind of online game. So you have a big Java server and a client application, and -- well, you connect to the server, you have some interaction and some protocol going between the client and the server, and you can play this game. Well, we turned this into a modular system in the simplest way that you could imagine. It's trivial: we just took one module to be the client and one module to be the server. That's not really fine-grained, that's not what we envisioned, but it's kind of the baseline of what we can do. And, of course, we replaced all the explicit communication with just services, because that's the idea of loose coupling. You don't want to specifically use a protocol; just call the service and let the run time figure out how to map this call to the server. Okay.
So if you see, this is the setup running one client and one server, and if we measure -- I think this here is the latency and this is the traffic on the network -- well, that's the picture that you get when you run this as a client-server application. Now time moves on and a second player joins the game. While the latency doesn't really change a lot -- of course, the aggregate latency has doubled -- we see that we get some more network traffic, of course. Okay. Now, this application is very interesting for our purposes because the amount of state shared between these two clients is actually a function of the game. If these two players are pretty close to each other, there's a high chance that their actions will interfere with each other, that there are some state changes that the right client makes and the left client wants to see immediately. But if they are playing in completely different worlds of this game, there is almost no shared state at all. And we think that there are quite a lot of applications that inherently work with these kinds of operations. It's not a static thing, this shared state. It's really a function of the workload. So now, let's say we try to replicate the server to the left client, because the system has noticed that the potential traffic caused by replicating the items of the server is lower than the traffic that we currently have in the client-server fashion; then we get to this picture. So here we have increased the latency for this client because it now has a longer path to the server, but, of course, the latency for the left client has significantly dropped, and we get a different picture in terms of network utilization. And now, if we are in the situation that they actually don't share too much state, we can also completely leave out the server and just gradually transform the system into a peer-to-peer system. And in this example the players, as we did these measurements, were not very close to each other, so toward the end you see that the aggregate latency has dropped and that the aggregate network utilization has also dropped. So the system was able to optimize this particular deployment and gain a benefit. Yeah, so currently this is all done in Java, but, of course, we want to generalize many of the concepts beyond Java and OSGi. We have kind of done this for OSGi services and implemented them in, let's say, C and nesC for TinyOS and made them communicate with a traditional OSGi framework through the R-OSGi protocol, but that's just one side of the thing. What we would like to have is a kind of OSGi run time for completely different environments, and we are currently porting many of the ideas to the .NET CLR. And that's a project supported by the Microsoft Innovation Cluster for Embedded Software in Switzerland. The next thing is, of course, we want to build more interesting applications. That's why we're porting parts of the .NET Compact Framework to Lego Mindstorms, which we have previously used successfully with Java, and we want to build swarms of intelligent robots that interact through this kind of shared run time abstraction and see how we can simplify the development of applications for such a complicated distributed system through this. And, of course, my personal future work is to graduate this year, and I'm confident I'll do so. Okay. So this brings me to my final conclusion.
I hope I managed to -- well, to bring across the message that software elasticity is challenging. I think that modularity is actually the key to facilitating elastic deployments of software, especially elastic redeployment. And I hopefully have also shown you that we can actually mitigate some of the complexity problems and put them into an intelligent run time, as we have shown in Cirrostratus. Thank you. [applause] >>: We have time for a couple of questions. >>: [inaudible]. Jan Rellermeyer: I don't think that you can build software without implicitly modularizing your software, because what you're doing is you're still isolating functional units of your system and presenting them as a service. In OSGi there is a run time system behind it, and the reasons are that you want to do this in a flexible way, you want to be able to control the lifecycle of your deployment. You could live without this, but I think it has a lot of benefits for cloud environments to do so. But, yeah, the answer is: try SOA without modularization -- I wouldn't know how to do that. >>: [inaudible]. Jan Rellermeyer: Yeah. Okay, that's a different question. You can try to design your system entirely stateless. The question is whether this really simplifies your program development. I'm not always sure, because then you're pushing a lot of complexity into dealing, let's say, with the session and acquiring the right state from the -- I mean, the state hasn't disappeared. It's just externalized, you know? But as I said, I'm not advocating the state inference as the one and only solution. You could also employ your own way of describing or even avoiding state. Okay. [applause]
Rosa Badia: Thank you, Roger. So my proposal is to present a view of how we can program clouds. I'll start by explaining a bit what the star superscalar programming model is, then I'll more specifically move into the COMPS superscalar framework, which is kind of the [inaudible] of this programming model, and also present the EMOTIVE cloud, which is the software solution that we're developing at BSC for clouds, and then how we see [inaudible] the superscalar programming models evolving towards the use of clouds and service oriented architectures. So the idea of star superscalar comes from superscalar processors. Most of you already know about superscalar processors: although we have sequential code, at run time we have different functional units that will execute -- issue the instructions in parallel, with things like speculative execution, [inaudible] and many other optimizations. The important thing is that, in the end, the result of the application is the one that the programmer intended. So taking this idea into account, we are trying to apply it in the star superscalar programming model. What we have is sequential code, and our parallelization is based on tasks. So for this, right now, we need the selection of the tasks. The programmer will select from the sequential code which functions are tasks, and for these they will also give the direction of the parameters, whether it's an input or an output or a parameter that's written [inaudible]. From this, at run time, so at execution time, what we build is a task graph that takes into account the data dependencies between the different tasks.
This means that we use the static information about the direction of the parameters, together with the actual data that's accessed by the tasks, to derive the different data dependencies. This is done at run time. It's important to understand that this has to be done at run time because we are following the real data dependencies, the actual data. So here we have different instances of T1, and the data that this T1 writes is different from the data of this T1, okay? We also apply [inaudible] renaming, for example, to reduce the dependencies -- renaming is like replicating data. So what we do is eliminate the false data dependencies, plus other optimizations. Then, when we have this task graph -- and we don't wait to have the whole task graph; when we have a part of it, we start scheduling the different tasks on the parallel platform. This can be a multicore, it can be a cluster, it can be [inaudible]. We apply the same idea to the current parallel platforms. So we do the scheduling of the tasks. If necessary, we will also program the different data transfers. We will try to exploit things like data locality. We apply things like prescheduling -- prescheduling or prefetching of data in order to have the data there before the tasks really start -- and other things. So then we also need, whenever a task has finished, for this to be notified to the main program, to the run time, in order to change the state, update the data dependence graph, and continue the execution. Of course, we cannot start a task until the dependencies -- the data dependencies -- have been resolved. So we have different instances of this programming model. We started with the grid a few years ago, like six or seven, and we have stable versions of grid superscalar that are running on [inaudible], on our supercomputer in Barcelona. It's used for production runs of applications that are not inherently parallel; you can have workflows or [inaudible] parallel applications that a lot of our life science users have, and other users. COMPS superscalar is the evolution of this. We'll see what the future of COMPS superscalar is. And we have another family that is more based on -- that uses more compiler technology and is tailored for homogeneous multicores: SMP superscalar, Cell superscalar, or GPU superscalar, which are very specific to the Cell or to GPUs, and [inaudible]; for example, Nested superscalar is a hybrid approach that combines two of these families, SMP superscalar with Cell superscalar, having two levels of tasks. So we have smaller tasks inside bigger tasks; the smaller tasks run on the [inaudible] of the Cell and the bigger ones run on the SMP. So I will now explain COMPS superscalar. COMPS superscalar is a reengineering of grid superscalar that we did in the [inaudible] project in order to use the grid component model that was designed and meant to be implemented by this project. So we used the Java programming language; it's basically everything in Java for this version. We used ProActive as the underlying middleware to build the run time as a componentized application. And we used, for example, Java GAT or [inaudible] as the middleware for job submission and file transfer; in this case that would be for the grid, for example. So this is how it looks. For a COMPS superscalar application, it's important to note that for the code of the application itself, the objective is that we don't have to change it. It can remain regular Java code without any need to add calls to any API or make any specific change.
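To make the graph-building step concrete, here is a minimal sketch in plain Java of the bookkeeping just described: each intercepted task reports the direction of its parameters, a read takes a dependency on the last writer of that datum, and a new write simply installs a new last writer, which is the effect renaming has on false dependencies. The class and method names are mine for illustration only, not the actual COMPS superscalar run time.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy dependency tracker: true (read-after-write) dependencies become edges;
// write-after-write and write-after-read conflicts are absorbed by treating
// the new writer as a fresh version of the datum, i.e. renaming.
class TaskGraph {

    static class Task {
        final String name;
        final List<Task> predecessors = new ArrayList<Task>();
        Task(String name) { this.name = name; }
    }

    private final Map<String, Task> lastWriter = new HashMap<String, Task>();

    // Called by the run time each time an annotated task call is intercepted.
    Task addTask(String name, List<String> inputs, List<String> outputs) {
        Task task = new Task(name);
        for (String in : inputs) {
            Task producer = lastWriter.get(in);
            if (producer != null) {
                task.predecessors.add(producer);   // true data dependency
            }
        }
        for (String out : outputs) {
            lastWriter.put(out, task);             // this task now produces 'out'
        }
        return task;
    }
}

A task whose predecessor list is empty can be handed to the scheduler right away; the others wait until their producers report completion, which matches the notification step mentioned above.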
The constraint that we put on the application is that the pieces of the application that are to become tasks have to have a certain granularity, if we take into account that we want to run it on a distributed infrastructure and we need [inaudible] to transfer the data over the internet. So we need a given granularity. Then what we need from the programmer is the selection of the tasks, which is made with an [inaudible] annotated Java interface. Here we say, for example, in this case, that genRandom is one of the tasks of the application and it has an input parameter -- an output parameter, sorry -- f, which is a file. Okay? So this is what is used then. And we can also give constraints to a task. For example, this one means that this task needs to run on a platform that has Linux as the operating system, and we can put all types of constraints, not only on software; it can be on hardware, or on available memory, or on other things. So at run time what happens is that our Java application is loaded, custom loaded, and using the annotated interface, by means of Javassist, we can intercept the tasks that have been annotated in the Java interface. So when we intercept a task, we know that this is one of the tasks of the application, and instead of [inaudible] to the original code, it will insert a call into the COMPS superscalar run time and then it will -- I think there is an omission; here it is better. So whenever we intercept one of the tasks, the custom loader will put in a call to the API of the run time, and this will be the one that adds a node to the task graph for the [inaudible] dependencies. And whenever this task is ready, it will be submitted to the scheduler in order to see which available resources can run this task. And, actually, in the end the job manager will be the one that does the [inaudible]. We also have a component that is responsible for all the file management of the data that is necessary. This is used both by the task analyzer to detect the different data dependencies and also at execution time to know where the files are and whether there is a need for data movement or not. So this is already also using [inaudible]; for example, we can be using grids. And here I was showing an example of a real application that we developed and that has been used by MareNostrum users. Initially it's a sequential application, but it's very easy to parallelize, because basically what you have is a set of sequences and you want to run all the queries against a database, so it's not difficult to parallelize. In our case, what we did is structure it as a code where first we have a piece of code that splits the different sequences and databases, then we have a set of calls to HMMER, and then we have a set of calls to merge the results of the [inaudible] HMMERs. So here is the annotated Java interface. You see here, for the different tasks that we have selected, that you have the parameters, and then we identify the type of the parameter and the direction. So at run time what we will have is a graph like this one, and then the different tasks that can run concurrently will be scheduled and run in parallel. This is a chart that shows a bit of the actual performance in terms of speed-up. Although we would still like to push the line of the COMPS superscalar version a bit higher, it's good that -- for example, we have seen that it scales much better than the MPI version that was available on MareNostrum.
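Since the actual interface is only shown on the slide, here is a sketch of what such a task-selection interface for the HMMER example might look like. The annotation names (@Task, @Param, @Constraint) and the method signatures are hypothetical placeholders of my own, meant only to illustrate the idea of declaring tasks, parameter directions, and resource constraints separately from the unchanged application code.

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotations standing in for the real ones.
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
@interface Task {}

@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.PARAMETER)
@interface Param { String direction(); }

@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
@interface Constraint { String operatingSystem() default ""; }

// The application code itself stays plain Java; this separate interface
// tells the run time which methods are tasks, the direction of each
// parameter, and any constraints the scheduler must respect.
interface HmmerTasks {

    @Task
    void split(@Param(direction = "IN") String sequencesFile,
               @Param(direction = "OUT") String fragmentFile);

    @Task
    @Constraint(operatingSystem = "Linux")
    void hmmer(@Param(direction = "IN") String fragmentFile,
               @Param(direction = "IN") String databaseFile,
               @Param(direction = "OUT") String resultFile);

    @Task
    void merge(@Param(direction = "INOUT") String mergedResult,
               @Param(direction = "IN") String resultFile);
}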
The important thing is that this was then used by the EDI to run a real production run, with seven and a half million protein sequences and a big database that they come with. And it was used by these people without any problem and was very useful to them. So that was my first point. The second is that we also have this other project at BSC, the EMOTIVE cloud. The idea is to develop middleware for a cloud. It's an open source project, and the idea is that it can be used for our research but also for other people's research. The architecture of EMOTIVE is basically what you see in this picture. We have three different types of components. We have the data infrastructure, basically the [inaudible], like a file system, a distributed file system or something similar. Then we have the different components, basically a [inaudible] monitor and the resource manager, which take care of all the virtual machine lifecycle management. Basically, they take care of creating the VMs, monitoring the VMs, their destruction, and also the data management, and right now we also have support for migration of VMs in case of failure, or in case we are running a virtual machine and it happens that the user requires more computation, more memory, whatever, and the actual physical resource is not able to provide this; then the VM can be migrated to another physical resource. And we also support checkpointing. And there are right now three different schedulers implemented for this platform: SERA, ERA, and GCS. SERA I will explain a bit more. SERA is basically composed of two different components, SRLM and ERA. Both of them use agent technology and semantic technology. Was it me? >>: [inaudible]. Rosa Badia: [inaudible]. It wasn't me. >>: [inaudible]. Rosa Badia: Well, there are still 20 minutes for the [inaudible]. Okay. So now everybody is awake. Even me. So the important thing here is that we have a user that wants to submit a job to EMOTIVE. It will submit it [inaudible] to the SRLM. This user will have a set of requirements on the execution of this task. So we use semantic information about the resources together with the requirements of the user to see which resources can run this task. And then the ERA will be the one that actually submits this job to the cloud. This is the EMOTIVE cloud, so it will be the one responsible for asking the [inaudible], the resource manager, to start the virtual machine, to start the task, et cetera. So what is the goal? The goal is now to try to move COMPS superscalar on top of EMOTIVE. Initially this is very easy, because COMPS superscalar, when you start it, can run on top of any platform. You just need the IPs that describe the resources that you need to execute on. So if you have a virtual machine and you get its IP, you can put it into the description that is read, and it runs. So the first step is very easy. We just have to set up virtual machines on top of EMOTIVE, and then COMPS superscalar can run tasks on top of these virtual machines. This is already running and it works nicely. The next step will be, well, can we take advantage of the fact that EMOTIVE takes into account the elasticity of the virtual machines and can take into account the different resource requirements of the tasks? Of course we can. So what we want is that COMPS superscalar, when it starts execution of the application, interacts with SERA, the scheduler, and requests the type of virtual machines it needs.
Not virtual machines that have been set up beforehand; it will say, well, this task is a compute-heavy task, so we need a bigger virtual machine; or maybe I see that now my [inaudible] is very parallel, so I need more virtual machines, so on demand I would like to ask for more virtual machines on my platform, and so on; or I will see that I have a deadline or a timetable to meet. So this is what we are working on now, the next step. And this is our envisioned architecture, how we think we want to evolve COMPS superscalar. So the idea is that we want the different tasks to be able to run on the cloud. This is what we have already seen; it's already working. The other part is that we also want the COMPS superscalar run time to run on the cloud, and similarly to the tasks, where we said that you can have a different number of virtual machines running depending on the requirements of the application, we would also like to have more or fewer COMPS superscalar run times running depending on the number of applications that are running in the cloud. So for this, instead of having one COMPS superscalar run time per application, which is the case now, we will have a server COMPS superscalar run time. So a COMPS superscalar run time should be able to serve more than one application at a time, but, of course, if we have a lot of them, then maybe we would like to have more than one run time. This will be offered through a web service container. This will be like the service side. Then we'll have the application side. Initially this was still here; now we sort of have just COMPS superscalar applications running on top of this idea. However, our experience has been that the idea will be, well, we have core services that are simple services offered in the cloud, for example, already running, and then we have applications that compose these services and make them into a bigger application. But, also, these COMPS superscalar applications can themselves be offered as a service. This is one thing I forgot to say, by the way: right now the COMPS superscalar run time only has tasks; the work is the tasks of the application. The idea is that in this next evolution of the run time these tasks can also be services. So what we foresee is that for this we'll first need a run time that's able to compose web services. There is a lot of literature on this. Our objective is not to do new things here but to offer the different behavior, the dynamic behavior, of the COMPS superscalar run time that is able to build the task graph dynamically, and to exploit this idea together with the composition of web services. And I think this will be different from what has been done before. The other thing we need is a graphical interface to help the programmers with all this. Right now, basically, the development is writing the Java application in a regular environment. We would like to have tools to ease the deployment of the applications and also tools to ease the development of the applications in general. So, well, basically I went faster than I thought. To conclude a bit: COMPS superscalar is a platform that enables programming of applications on a wealth of underlying platforms. So below we can have a grid, or we can have a cluster, and hopefully we can also have a cloud. The idea is to evolve COMPS superscalar, by means of using the SERA scheduler of EMOTIVE, to be able to use these on top of federated clouds.
As an example, I want to mention that although we've seen EMOTIVE in a cloud environment, from SERA we can also [inaudible] tasks not only to virtual machines or [inaudible] but also to other types of clouds, as we demonstrated in the project. And, further, we want to evolve COMPS superscalar towards what we will probably call service superscalar, which enables this composition of services, by means of using this graphical interface to help the development of applications and also by evolving the run time to support all these new requirements on service composition and service invocation. An important thing is that this is open source. Right now we don't have COMPS superscalar available on the web yet, but we will soon. We have grid superscalar available, and the EMOTIVE cloud is also available. And the future evolution of all this will be open source. Thank you. [applause]. Rosa Badia: Yes? >>: Question. So when you're talking about deploying an application against some back-end run time, your system would actually choose the [inaudible]. Rosa Badia: Yeah. Initially. Right now it's in the application -- we have this interface for the constraints -- but the idea is that we can extend this, probably using SLAs. >>: But this is just articulating [inaudible]. Rosa Badia: No, no, [inaudible] this one. You also have constraints here, method constraints. The method constraints establish the requirements of the task. Right now the requirement can be on hardware or software, but the idea is that it could also be a service requirement. Also, [inaudible] as I said already, the task can have an SLA associated with it. This SLA can then be translated from a high-level SLA to a lower-level SLA, but then it's taken into account by the scheduler, SERA, which takes into account the description, the semantic description, of the resources. >>: Other questions? Okay. Let's thank the speaker. And that was the last talk. [applause]