>> Juan Vargas: Ding dong. Good morning, everybody. Welcome. As many of you know, Microsoft and Intel are funding something called the UPCRC, the Universal Parallel Computing Research Centers, at the University of California, Berkeley and at the University of Illinois at Urbana-Champaign. Today we have the great honor of having five visitors from Berkeley who will be visiting several teams in incubation, products, and research. And the visit starts with this presentation. The presentation will start with David Patterson giving us an introduction to the center, followed by Kurt, followed by John, followed by Ras.
>> David Patterson: Okay. Nice to see everybody again. I was here last month, I'll be here next month. I don't know about July; it seems like I'm probably going to be here. I'll give you an overview of this lab. It started two, two and a half years ago. We are really driven by applications, which is rare for a lot of computer science projects. What we pitched in the proposal that Intel and Microsoft selected were the applications in black. And we've enjoyed the applications' influence so much that we've expanded to the new ones shown in blue. Kurt's going to be talking about patterns in these applications right after me. I'm going to talk about some of our big bets. One of our big bets is that the way to make parallel software is to have a really good software architecture. In our old Berkeley View report we talked about these 12 or 13 dwarfs, or motifs, as we called them. We call those the computational patterns, which are shown on the right. And then in the process of doing the research, we decided that these programming patterns, which are good for any kind of program, will be the key to our being able to get parallel software; in particular, that software should have as its architecture a composition of the patterns on the left. This has been captured in something called Our Pattern Language, and Kurt's going to talk about this. This is happening with people here and at Illinois and at Intel. Our mantra has been, and other people say this now too, that we see the world as productivity-level programmers and efficiency-level programmers. Domain experts are examples of productivity-level programmers, programming in something like Python or Ruby. The efficiency-level programmers are a lot of people like the people here, you know, the C# and C++ types of programmers. They're aiming for bare-metal efficiency, where the domain experts would be happy with pretty good speedup and productivity. And so we struggled with this one. In fact, if you were to read our proposal, the productivity-level story was kind of a hole in it. We knew we should do something about it, but we didn't really know what to do. We were kind of afraid that we were going to have to invent our own programming language, or that that was the only viable way forward, and the chances of inventing a programming language that actually catches on seemed zero, so we didn't know what to do. But now we have a story, and I'll tell you about that. So one of the things we wanted to do was to make efficiency programmers productive. We and many people believe auto-tuning is a better way to do code generation than traditional static compiling. Looking at a specific example on the right, in this stacked bar graph you can see what you get by just increasing the number of threads; the auto-tuning part is that red part there.
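A minimal sketch of the auto-tuning idea described here, assuming nothing about the real ParLab auto-tuners: generate several candidate implementations of a kernel (here, a blocked matrix multiply with different block sizes), time each one on the machine at hand, and keep whichever runs fastest there. All names and parameter choices below are illustrative.

```python
import time

import numpy as np


def make_blocked_matmul(block):
    """Generate one candidate implementation: matrix multiply with a given block size."""
    def matmul(a, b):
        n = a.shape[0]
        c = np.zeros((n, n))
        for i in range(0, n, block):
            for j in range(0, n, block):
                for k in range(0, n, block):
                    c[i:i+block, j:j+block] += a[i:i+block, k:k+block] @ b[k:k+block, j:j+block]
        return c
    return matmul


def time_once(fn, a, b):
    start = time.perf_counter()
    fn(a, b)
    return time.perf_counter() - start


def autotune(candidates, a, b, trials=3):
    """Time every candidate on this machine and return the fastest one."""
    best, best_time = None, float("inf")
    for params, fn in candidates:
        elapsed = min(time_once(fn, a, b) for _ in range(trials))
        if elapsed < best_time:
            best, best_time = (params, fn), elapsed
    return best, best_time


if __name__ == "__main__":
    n = 256
    a, b = np.random.rand(n, n), np.random.rand(n, n)
    candidates = [(block, make_blocked_matmul(block)) for block in (16, 32, 64, 128)]
    (block, fn), t = autotune(candidates, a, b)
    print(f"best block size on this machine: {block} ({t:.3f}s per multiply)")
```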
So you get the significant speedups of auto-tuning by generating code and seeing what runs well on a particular computer, rather than picking it at compile time. The problem with that is it takes really smart people who know the architecture and the algorithms to pull that off, and it kind of uses up one grad student at a time. So we've been trying to see what we can do about that. I won't tell you about our successful use of machine learning there. There have been some breakthroughs by members of the team on how to avoid communication, and even for things that have been around forever, like dense matrices, we've made big improvements. But I will talk about SEJITS. SEJITS is a new idea that wasn't in the proposal. I think of it as making productivity programmers efficient. Or, if you know what auto-tuning is, you could think of it as auto-tuning with higher-order functions. So rather than invent a programming language to help productivity programmers, we pick one that's already productive, like Python, say. These scripting languages like Python really are sophisticated languages that have all the powerful features that you would like to see in a programming language. The acronym, selective embedded just-in-time specialization, comes from the fact that this infrastructure is going to specialize the computation, but it doesn't have to take on a whole Python program; it can selectively do one function at a time. We have the option of doing just-in-time compilation to create efficiency-level language code. And the embedded part is that you write this thing in the language itself, and you can use the standard interpreter; you don't have to modify it or anything like that. So try and get the name here. If we're writing in Python, here are these Python methods. A couple of them are marked as interesting to make go faster, the G and the H one. Because it's standard Python, if you don't have a specializer for one, it just gets interpreted as before. How do you mark them? Well, maybe the programmer marks them, or maybe we monitor the performance and see which ones take a lot of time. The H one does have the SEJITS mechanism, so it gets invoked: there's a specializer for that hardware, and that invokes the specialization machinery on the fly. The embedded part of the name means it's actually written in Python; those pieces are indicated in blue. You write them in Python, the just-in-time compilation generates what you see there, and the specialization gets you the speedup. We're not going to have time to talk about that, but Armando Fox right here is leading the charge on that effort, and you can ask him questions about it. We're pretty excited about it. And recently we've come to think it will help not only with multicore or manycore; we think it could help with cloud computing as well. Some of the same arguments apply there, and Armando can tell you about that. We have a big effort in correctness, debugging, and verification, as does Microsoft. This is led by Koushik Sen, who was an intern here many years ago. Koushik seems to be winning all the awards in the project as far as I can tell. He's won several best paper awards. One of his papers got selected as a Research Highlight in the Communications of the ACM, so that's almost like yet another award. He has these really interesting ideas at the intersection of testing and debugging, which people here have worked on as well: active testing.
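A toy sketch of the SEJITS flow described a moment ago, with entirely hypothetical names (this is not the actual ParLab infrastructure): a function written in plain Python is marked as specializable; if a specializer is registered for the current platform, it generates and runs efficiency-level code, otherwise the standard interpreter just runs the original method unchanged.

```python
SPECIALIZERS = {}  # (function name, platform) -> code generator


def detect_platform():
    return "cpu"                      # stand-in for real hardware detection


def specializable(fn):
    def wrapper(*args):
        gen = SPECIALIZERS.get((fn.__name__, detect_platform()))
        if gen is None:
            return fn(*args)          # no specializer: ordinary interpretation
        fast_fn = gen(fn)             # generate + compile efficiency-level code
        return fast_fn(*args)         # run the specialized version
    return wrapper


@specializable
def scale(xs, k):                     # productivity-level code, plain Python
    return [k * x for x in xs]


def numpy_scale_specializer(fn):      # pretend "code generator" for the CPU
    import numpy as np
    return lambda xs, k: (k * np.asarray(xs)).tolist()


SPECIALIZERS[("scale", "cpu")] = numpy_scale_specializer

print(scale([1, 2, 3], 10))           # takes the specialized path if one is registered
```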
So the idea at the productivity layer is that we're going to try not to let people write the kinds of problems that can be really hairy to debug. But at the efficiency layer we can't avoid that. And Koushik has developed techniques that have worked extremely well, even on popular open source software, uncovering previously unknown bugs, which has led to all of these awards. Ras can answer questions about correctness and debugging. Kubi will talk about our operating systems effort. A lot of the emphasis is on isolation and quality of service; that's what we're after. Oh, I left off Burton's piece. On my fall visit to Microsoft, Burton Smith said, hi, Dave, how are you doing, and by the way, I've solved the resource allocation problem. I can't tell you about it because I'm filing a patent, but I've solved it, okay. So okay, that's interesting; we'd like to hear it when we can. So he came this month. He came in April, he came on April Fools' Day, but it wasn't a joke. He actually claims to have made big progress on it, and we're implementing it, or going to implement it, to help us with quality of service. We've also got some ideas on how to make parallel libraries work well together. Often the libraries assume that no one else has got the machine but me; we have ideas on how they can be scheduled together, and our Lithe system does that, and Kubi can tell you about that as well. Closer to what I know about, we think one of the problems in architecture is that people use software simulators, and when you're simulating scores or hundreds of cores, it just takes forever, so you can't run very much. I think the architecture community is kind of bottlenecked in how much they can simulate to make progress on manycore. We and others, particularly Chuck Thacker here, are betting on FPGAs. Our version of it is a simulator that does 64 cores on a very inexpensive board, which was a big advantage for us: when the students were running out of time, they just used their credit cards and got some more boards. This version runs 250 times faster. So what does 250 times faster mean? It means students can get a result in an hour versus 10 days. And it makes it clear that research is latency oriented, right? You want to try an experiment to see if it works, whether the parameters are right; getting an answer back in an hour versus waiting 10 days is completely different. It changes the stress level and the rate at which we can do research. And we were very proud that last January at our research retreat we were able to demo all these pieces tied together. Summing up, this really is an integrated research project. It's often difficult, when you're not in academia, to tell the difference between a facade and a real integrated project. This is a real integrated project. All these pieces are working together. We all sit with each other, we all interact, and it's paying off. I'm surprised at how much progress we've made. This is a really hard problem, right? All these companies have gone out of business trying to solve it. We've made a lot of progress in two years, and I'm surprised. We've got some visibility besides the Communications of the ACM coverage. We do a boot camp; so far we're doubling every year, and so we're going to bigger rooms, where we invite everybody to come and hear our versions of the ideas. There's HotPar, a USENIX workshop that we're involved in, held at the Berkeley campus.
Kurt's been leading the way on the ParaPLoP patterns workshop. And there's a wide following for his pattern language. People really are coming, amazingly, from all over the world to say, I've got a problem, Kurt, show me how to parallelize it. And he does, and they get speedups. And we're even planning on testing this idea on undergraduates; Kurt's going to lead the class this fall. And that's the overview. I presume there are not going to be any questions. It would be great if there were questions, but I bet there aren't; it's too early in the morning. So Kurt, I will pass it on. I thought saying there wouldn't be any questions would get you guys to --
>>: Sometimes it works.
>> David Patterson: Sometimes it works.
>> Kurt Keutzer: Okay. Am I alive? No. Microphone?
>>: Did you turn it on?
>> Kurt Keutzer: I've got a little green light here. I'm on. Okay. Cool. Okay. So I run a group within ParLab called the PALLAS group, which has one of those kind of megalomaniacal acronyms: we're looking at applications, libraries, languages, algorithms and systems. And what we're motivated by are these new manycore chips. We don't see the Intel folks putting up the Larrabee slide as often as they used to, so that worries me a little. But you can go out and buy one of these today. So you've got, depending on how you count them, at least 16 or as many as 512 processors, depending on how you look at it. So we can build them, but the question is how do we program them; in particular, how do we program them to do stuff that we're really interested in, not just stuff to demo particular machines. And so my feeling is that the future of microprocessors will really be limited by what we can program. So I'm going to start out by telling you how I think you should not parallel program. You take your code, you profile it, you look at that performance profile, you scratch your head for a while, you go, oh, I can add some more threads here, and you iterate through this loop until you think it's fast enough, at least, and you ship it. And the problem with that approach — first of all, this is not something that academics just sit around and put up as a straw man to knock over for fun. My friends at Intel, if you buy enough parts, will ship an application engineer over to help you parallelize your code, and this is almost exactly what they'll tell you to do today. And the problem with that is not just that it's somehow aesthetically unpleasing; it's that there are lots of failures. There are lots and lots of times, for any of you who have actually tried this, when at the end the software running on N processor cores is actually slower than the software running on just one. So if we ask, well, why is that? Let's think about what this person is thinking about when they go through that inner loop and say, okay, I need to re-code this with more threads, right? There are books out there. They start looking at locks, threads, semaphores. They do UML sequence diagrams, try to puzzle through this for a while. And I don't know if any of you have tried that, but after a while you start looking something like this, that is, kind of anxious and depressed. And one of our colleagues at Berkeley, Edward Lee, has actually written a very good paper analyzing some of the problems with coding with a thread mentality. So what's the alternative?
Well, the methodology that we're pushing is a similar loop, and there's a human in the middle, so in that regard it's not so different. But what we're focusing on is actually architecting the software to identify parallelism, not merely jumping in to find threads. So we spend a lot of time on architecture, and then — today it's really a thought experiment — we ask, how does that software architecture map onto the hardware architecture? You write some code, you do some performance profiling, and you go through the same loop, but in the inner loop we're rethinking the architecture, we're not just trying to code in a few more threads. So why is that different? Well, what this person is thinking about when they re-architect with patterns is: today code is serial, that's a fact, but the world is parallel. So there's a sense that if we really dig deep into the application and really understand it, then we will find the parallelism, and then we just need to reflect that in the software. And so what that person is thinking about is the computational patterns I'll show you in a moment, and, as I guess Dave showed you, structural patterns and general software architecture. And you look at this individual — that looks like a well adjusted, happy individual to me, kind of contemplating his architecture. Okay. So Dave flashed a slide — I'm sure all of you have heard about patterns. Patterns have been around and popular for about 15 years now at least. What's different about our endeavor, called Our Pattern Language, which coordinates with Tim Mattson's work on a pattern language for parallel programming, is that we're trying to do an entire pattern language, which means we're trying to do a set of patterns that will take you all the way from an application down to a detailed implementation. Now, it's not that adding some patterns at any one layer of this is not a useful thing to do; it's just not quite as aggressive or as comprehensive as what we're trying to do, which is kind of soup to nuts: you start out with an application, and going through the pattern language you end up with a detailed implementation. So I don't have time to talk through all of this, but I did want to go into a little more detail at the top level here on these structural and computational patterns. The structural patterns — if you look at those you go, wait a minute, I've seen those before. Those are the Garlan and Shaw architectural styles, with maybe a few additions like MapReduce and Iterator and so forth. These basically define how you structure your software, but they don't tell you anything about what's actually computed. And although in teaching software engineering I've generally used the analogy that software is like a building, after a long time it occurred to me that software is actually a lot more like a factory, and in that regard these structural patterns are about how you actually lay out your factory plant. So complementing those are the key computations. Now, this I think is more uniquely the ParLab contribution: in ParLab, long before the Intel-Microsoft UPCRCs were even a gleam in anyone's eye, Dave, Krste Asanovic and I, Ras, Kubi and so forth would get together on Tuesday afternoons and we would look at different application areas.
Sometimes we'd look at individual applications, sometimes we would look very broadly across application areas like high performance computing, computer-aided design of integrated circuits, or embedded computing, and we would basically ask ourselves a question: what are the core computations that are performed in those application areas? So this wasn't exactly drafted overnight. We did get a jump-start from Phil Colella's observation that he had identified seven classes of computation which were broadly used in high performance computing. So red here means highly used, orange means somewhat used, green less, and blue means probably not evident at all. And literally week by week, as we looked at more application areas and slowly added more patterns, we came up with this list here. In general, my experience now — and I haven't challenged this audience yet, and I'll be here for two days, so if you bump into me in the hall and say, what about this computation, and it's not on here, then we'll add it — is that after three years things have really settled down quite a bit. So these describe computations, and it's interesting — I think you can maybe relate to this in a software audience like this — it's amazing how little we've actually talked about computation. I mean, we've talked to death about these architectural styles and object-oriented programming, data hiding, and this stuff. We don't really talk about the core computations as much when we talk about engineering [inaudible] software. And so to me it's as though we've been talking about the layout of the factories without actually talking about the machinery that goes in there, because these computations are, to me, the actual machinery of the factory. So when we put those together it's analogous to the entire manufacturing plant, and this is an entire, kind of anatomically correct, high-level software architecture of one of our applications, large vocabulary continuous speech recognition. Now, literally the afternoon that it went through my mind — wow, this is really more like a factory — I honest to God ran over to our engineering library and said, boy, there must be a lot known about this. And I can't say that in looking at the textbooks on factory optimization I was blown away by novel ideas, but I was very gratified that, wow, they are thinking about the same things, like scheduling, latency, throughput, workflow, resource management, capacity. Probably one of the most interesting chapters was what they call work-cell design, which essentially gets at what's the appropriate size of a little work cell to minimize traffic and so forth; there's a long discussion of tradeoffs like that. So I think this analogy is holding up well. At least it helps me think about the problems that we're facing here in software. So the formula that I'm going to show you in the next slides is: we identify particular applications, we work with domain experts — typically each domain expert had their state-of-the-art algorithm that they wanted to go faster — we took our tall, skinny parallel programmers, which means my graduate students, and we architected the software using the patterns I described and ran it on parallel hardware, and produced what I think you would agree are not modest improvements — yeah, heck of a job, good engineering work — but really game-changing speedups which had the potential to be scalable.
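A minimal sketch, in Python, of the structural/computational split Kurt describes: a pipe-and-filter skeleton (a structural pattern) that says only how stages connect, with toy stand-ins for two computational patterns (structured grid, dense linear algebra) plugged in as the filters. The particular computations are made up for illustration, not taken from the ParLab applications.

```python
from functools import reduce

import numpy as np


def pipe_and_filter(*filters):
    """Structural pattern: says how stages connect, not what they compute."""
    return lambda data: reduce(lambda x, f: f(x), filters, data)


def structured_grid(image):
    # computational pattern stand-in: a 5-point smoothing stencil over the interior
    out = image.copy()
    out[1:-1, 1:-1] = (image[:-2, 1:-1] + image[2:, 1:-1] +
                       image[1:-1, :-2] + image[1:-1, 2:] +
                       image[1:-1, 1:-1]) / 5.0
    return out


def dense_linear_algebra(image):
    # computational pattern stand-in: project rows onto a small random basis
    basis = np.random.rand(image.shape[1], 8)
    return image @ basis


pipeline = pipe_and_filter(structured_grid, dense_linear_algebra)
features = pipeline(np.random.rand(64, 64))
print(features.shape)   # (64, 8)
```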
So -- yes?
>>: Couldn't you put an application like Outlook, which seems to me to be constantly hanging, in your list of things? It would, you know, [inaudible] faster if you give it many cores.
>> Kurt Keutzer: Yeah. So Outlook is kind of an event-driven architecture, or you might put a model-view-controller around it, and then we would have to dig in and look at the core computations. But a lot of these kind of office automation applications are really, you know, various types of graph [inaudible] sitting there traversing lists and things like that.
>>: [inaudible] look at the list of applications that --
>> Kurt Keutzer: Oh, sorry.
>>: That by itself doesn't seem to match to one of the columns --
>> Kurt Keutzer: Yeah. That's --
>>: [inaudible] kind of application.
>> Kurt Keutzer: That's absolutely true. And I guess this is a good reason to come to Microsoft: there's no doubt that office automation type applications have been neglected in our focus. Part of that is because — it's not that we're really shying away from large bodies of code, but we are shying away from large bodies of code that we can't get down to some basic kernel, right? So, you know, we'll be around for two days, and you can help me understand, well, if we sped up these particular kernels by some --
>>: [inaudible] kernel there. There are a lot of small things here and there, but [inaudible] make use of those things.
>> Kurt Keutzer: Right.
>>: But many of the applications we have, like Visual Studio — Visual Studio has the same problem, which is that there isn't like one graph algorithm that runs for 20 hours.
>> Kurt Keutzer: Right.
>>: [inaudible].
>> Kurt Keutzer: Well, you know, I don't think this is something that I'm going to mastermind during my talk here. But I'm here for two days, I'm at your disposal, and I'm completely happy to sit down and take a look at those. And I will say something a little more, which is, you know, we haven't been chickening out yet. I mean, as soon as we published the view from Berkeley, folks at other universities went around giving talks on Amdahl's law, making remarks that some universities seem to have forgotten Amdahl's law and things like that.
>>: That was in our Berkeley View report.
>> Kurt Keutzer: Well, we mention Amdahl's law. But they weren't given to suddenly doing tutorials on Amdahl's law after 20 years just because they thought it was time to get out the tutorial material. People were standing up in front of audiences saying some universities have forgotten Amdahl's law, based on the fact that we put in a line or two. But I'm going to show you a bunch of applications where people thought we would be defeated by Amdahl's law and we were not. Yes, sir?
>>: [inaudible] seems like in general the focus has been downwards, looking at how to use parallelism in hardware. And another source of parallelism is in the environment. I mean, in finance you have streams of data coming in, and in robots you have sensors giving you streams of activity. In Outlook you get streams of requests from all sorts of sources. And it seems like there's that aspect of parallelism, which is dealing with data streams and asynchrony in the environment and being able to react to that. And in order to be able to react to it quickly you also need to take advantage of the parallelism in the underlying hardware. So I think a lot of the applications actually combine these two.
If I'm in games, I'm using GPUs and I'm using multicore to speed up things. But you've also got to have an architecture that is responsive to network events, to input with things like [inaudible]; I've got a lot of multimodal input. So it seems like you've sliced out a very important part and focused more on, like he was saying, sort of the kernel. And I think there's a whole set of applications which have this external environment and asynchrony that --
>> Kurt Keutzer: Yeah.
>>: Speech recognition a little.
>> Kurt Keutzer: Right. So first of all, briefly — and again, I'm here for two days to talk about the broad-ranging cure for cancer, whatever, you know, what about Zimbabwe, things like that. All that stuff I'm happy to talk about. But what I'd like to do is make sure I get through what we have done --
>>: Sorry, you're at Microsoft.
>>: You're going to get questions. So I'm sorry --
>> Kurt Keutzer: No, no --
>>: [inaudible] a very legitimate question.
>> Kurt Keutzer: So let me address your question.
>>: And I'm asking you a very specific technical point.
>> Kurt Keutzer: Okay. So all right, then I'll --
>>: So I'd like to hear your thoughts on that.
>> Kurt Keutzer: Okay. So first I think you have to understand the domain of the problem that we're trying to address, which is that for the first time in history inexpensive parallelism is being packaged in parts and put in people's laptops and desktops. And the question is what they're going to do with that. Now, I think that's a very relevant question for Microsoft.
>>: That's great. And I want to have a conversation, and we can have that later. But I think the other trends are that these things are networked, the computers are smaller, the disks are huge. So the fact that it's just more processors that we need to leverage is just a small component of a much larger ecosystem, and I think you [inaudible] you need to track all those things.
>> Kurt Keutzer: Well, you know --
>>: And take them all into [inaudible], not just --
>> Kurt Keutzer: So you said you're probably from Microsoft. Well, I'm from Bell Labs, right. And I'll stand up here for 24 hours and discuss this with you, if you want. In the meantime --
>>: No, we can take that offline. I'm just saying it's not just about trends in the hardware; it's disks, it's networking --
>> Kurt Keutzer: Yeah, but there's obviously a lot of system-level trends and there's a lot of software out there. But I think the nexus of that, in the world that I live in, is the desktop or laptop. And if that doesn't utilize the processors that are now economical, then all this other stuff is going to be held up by that, all right? Okay. Fair enough? Good enough. Okay. So we basically proceed this way in how we parallelize things. And so what I'd like to do is go into a little bit of detail on one application and then literally, I think at this point, flash at you a bunch of other applications. And part of this is — it's kind of a long-winded way of saying, trust me, we really did some of this; it's not just the professor talking, where we said we were going to architect these things and the students went off and hacked up code. And part of this is really to get some insight into how this proceeds.
So this was one of the first things we looked at. Berkeley is a hotbed of machine learning, and support vector machines are at the core of a lot of what we do. Support vector machines are basically two-class classifiers, which is great for, say, separating out the baby pictures from the flower pictures. And the basic approach here — the architecture that I have at the top level here — is a pipe-and-filter architecture. We train on some initial examples: you put in some seed photos of children, you put in some seed photos of flowers, you train your classifier; and then you get some new images, you extract the features, you exercise the classifier, and you get your results. So at the top level here we have a very simple pipe-and-filter architecture. As a matter of fact, at the top level of almost any software you see a series of filters connected by pipes. So we're going to go a little bit deeper in this example. Looking at this feature extractor: if we pop open this filter, what we see is another little mini pipe-and-filter architecture. And then if we pop open this filter, we see a little structured grid computation acting over the image. If we pop open this filter, what we'll see is another of what we call structural patterns, MapReduce, which is mapping computations over the image — and what kind of computation is it mapping? Dense linear algebra. And then the actual building up of the feature vector, the descriptor, is another MapReduce computation that maps a computation over there and gathers it all together, and what's being done in each of those maps is a structured grid. At this point we would say, yeah, that's a high-level architecture; you've got to turn it on its side, and you see a tree structure there. Okay. Then if we go into the train classifier, basically this is an iterative approach. What we're doing here, on our training data, is slowly piecing together a frontier that says basically these are the babies and these are the flowers. And this is essentially optimization: we're doing that by solving a quadratic programming problem to carve up the space, and we're iterating over that until all of the original training points are within a margin of error. If we pop open this filter we have an iterator; inside that iterator we have a simple pipe and filter; look inside those filters and again we have a MapReduce; look inside those maps and we'll see again just some dense linear algebra. So finally the third piece is this exercise classifier. After we've done the training and we actually have a new photo to classify — we built a frontier during training, and we've got to evaluate which side of this frontier it falls on: is this a flower or is this a baby? And so inside of this exercise classifier we just have a simple pipe and filter where we compute dot products against all the elements which create the frontier of the support vector machine — that's using dense linear algebra — and then sum up to say which side of this frontier it is on. Okay? So that's really it, nothing up either sleeve. And you can say, well, that's all pretty obvious, or not. But by systematically going through this, rather than just jumping in and seeing where the bottleneck is in the support vector machine code we downloaded and so forth, we feel we were able to get a lot more parallelism.
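A toy sketch of the "exercise classifier" step just walked through: dot products of a new feature vector against the support vectors that define the frontier (dense linear algebra), then a weighted sum to decide which side the photo falls on. A linear kernel and made-up model values are assumed for brevity; the actual ParLab work used more general kernels and ran on GPUs.

```python
import numpy as np


def classify(x, support_vectors, alphas, labels, bias):
    """Which side of the trained frontier does feature vector x fall on?"""
    # Dense linear algebra: dot products of x against every support vector.
    kernel_values = support_vectors @ x            # linear kernel for brevity
    # MapReduce-style weighted sum over the support vectors.
    score = np.sum(alphas * labels * kernel_values) + bias
    return "flower" if score > 0 else "baby"


# Tiny made-up model: 3 support vectors in a 4-dimensional feature space.
support_vectors = np.array([[0.2, 0.8, 0.1, 0.4],
                            [0.9, 0.1, 0.3, 0.7],
                            [0.5, 0.5, 0.6, 0.2]])
alphas = np.array([0.7, 1.2, 0.4])                 # learned multipliers
labels = np.array([+1.0, -1.0, +1.0])              # class of each support vector
bias = 0.05

print(classify(np.array([0.3, 0.6, 0.2, 0.5]), support_vectors, alphas, labels, bias))
```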
And so we published that work a couple of years ago now at the International Conference on Machine Learning, and we were able to get about 100X speedups. And if you want to download this and try it yourself, last time we looked there were about 900 downloads of this software. Okay. So I am literally — whoa, I think I'm down to like five minutes now to go through seven more applications. So in the interest of time — I presume there are people from different application groups here — I'm just going to flash and reiterate the method on some of these, and then if you want to talk about speech recognition or computational finance or object recognition, I'll be happy to talk more about that. So again, our approach is we find domain experts; Jitendra Malik, for example, has a strong computer vision group. We look at his state-of-the-art algorithm for detecting contours here. And this is a kind of classic example of the way in which we interacted with other faculty: great algorithm, best results in the world, runs really slowly. So we dug into the architecture of that. This is the top-level architecture — it always looks a little bit like pipe and filter — and the moral of the story is we got about a 130X speedup of this. We released the software; it's got about 490 downloads. And the 130X speedup — when I talk about these game-changing speedups — when Jitendra first came to us, his contour detector was so slow it was a non-starter for Adobe to even talk to him. By the time we ended, he's saying, okay, well, let's do video, right? At these speeds we can begin to talk about doing contour detection in video. MRI is very different. Professor [inaudible], a new professor at Berkeley, had a very fast algorithm for doing compressed sensing in MRI. He was able to gather the data very quickly, and using this compressed sensing, reduce the amount of MRI time by an estimated factor of four. That's great for anybody; for children it can be really critical, making the difference between needing an anesthetic or not. But the problem is the radiologist needs an image to decide, do we need to take some more images? So the reconstruction has to be in realtime. His reconstruction took hours, right? So it was basically a non-starter for clinical use. So we dug into the architecture of that — it's pipe and filter at the top here, a little easier to see — and we were able to find these nice large fork-join areas where we could do dense linear algebra across here, and fork-join here, data parallelism across these Fourier transforms and so forth. And we were able to speed that up by about a factor of 100, and so it went from being kind of an interesting academic exercise to something that could actually be employed in clinical use, and it's been used in over 200 trials now, producing images that radiologists really look at. So those are very dense computational-type problems. Speech recognition is very different. You have the basic idea: speech recognition, we have voice input, you want to get a word sequence out. So there's some signal processing up front — and there's lots of potential for data parallelism up there in the speech processing up front — but what we wanted to focus on is what we thought was actually the harder problem, which is the inference engine here. So here's a kind of approximately anatomically correct high-level architecture of the inference engine iterating through the individual phones.
Here's a little bit more detail on that. And to get some sense of why this is a lot harder: if you dug into those other computations that I flashed up, when we really went down to the details of the dense linear algebra you would see a matrix-matrix multiply, things like that. You're not shocked that we can make that run faster on a GPU, right, and so forth. Well, here, this is a weighted finite state transducer, and so we have the problem — which those of you who have tried to parallelize graph algorithms will recognize — that you never quite know where you're going next, even though you've got a lot of places that you need to go. And once you get there, somebody else may be racing to get there first. So there are lots of problems with how to parallelize these graph computations. But nevertheless we were able to speed it up by a factor of 11 on a manycore machine. And, you know, that level is a lot different than, say, 100, but the interesting fact is that we were able to get faster than realtime. So if you were trying to do a realtime [inaudible] on your laptop, you could envision actually being able to do speech recognition of, say, a meeting in realtime. Yes, sir?
>>: [inaudible].
>> Kurt Keutzer: I'm sorry?
>>: Are the speedups ever superlinear? Do you have more than 11 cores?
>> Kurt Keutzer: Oh, so it depends — it goes all the way back to that first slide, of whether you say a GPU is 16 processors or 512 or something like that. So if anybody shows you more than 512X, something funny is going on. But we have good scaling; that is to say, if you throw more cores at it, we do consistently go faster. But we're not getting more than 512X speedups. We are getting more than 16X speedups, as you've seen. Okay. Here's another one, computational finance: value at risk. Sounds like a good thing to do, particularly after October 2008 or so. So here's a very simple kind of top-level architecture of the four steps of Monte Carlo in finance. And we were able to run that 60X faster on a parallel processor which is still yet to be released from Intel, the Larrabee processor. I think we can say that. We're being recorded. I hope I can say that. This was done during a summer internship. So those are the applications that we have that are kind of tidied up, published, peer reviewed; people seem to be excited about them, people want to download the software. I'll just literally flash at you some other things we're doing, just in case you're interested. So, object recognition. Building on the earlier contour detection work which we described to you, we're looking at this basic problem: you've got some basic categories like swans, bottles, Apple logos and so forth that you've trained on, and then you have a new image and you want to be able to identify that object inside of a photo, say. And we've been able to speed up Jitendra Malik's algorithms pretty significantly, on both the training portion, 7X, and the classification portion there. So as I said, we got such speedups on still images that Jitendra said, well, why aren't we doing video? So we worked with one of his post-docs, Thomas Brox, here, who was looking at this whole issue of how do you follow motion in a video. So we have something like this: we'd like to actually follow the motion of this chameleon walking through here. And we architected that out, architected it down to where we got to some highly data-parallel computations.
And we were able to get a 35X speedup. And to give some tangible sense of what that really buys you: if you actually try to track the motion of somebody moving as fast as, say, a tennis player with that leg extended, other approaches just won't be able to capture and identify that that leg is actually moving continuously with the whole person. But we were able to do that in realtime with this. And then, last, I just want to clarify that this is not a Berkeley professor and his graduate student — I want to be clear about that, particularly since this is being recorded. I have a beard, no hair on top, not like that. So in poselet detection what you're trying to do is identify key features of human beings. Basically Lubomir, another post-doc of Jitendra Malik, has been working on how we can use unique features of human poses to identify humans quickly. And again, we were able to speed that up about 17X. And you can sense that our work is building on top of itself: the matrix multiply in the middle of the support vector machine slides that I showed you earlier is actually the bottleneck in this computation. Okay. So just to wrap up here: you can probably guess, in this formula that I showed you earlier, can you see what the bottleneck is? Pretty clear what's wrong, what's the big limiter in this picture? Yes, sir.
>>: You?
>> Kurt Keutzer: Well, in a roundabout way. But these tall, skinny programmers, right? Basically everything that I've shown you has required one to three tall, skinny programmers to dig in there and spend a lot of time. So what we really want to do — my students do these speedups and architect these applications as a kind of initiation fee, but what we're really doing is using these tall, skinny programmers to build application frameworks, so the domain experts can use these frameworks without having to have some expert programmer there side by side, going every step of the way, doing the implementation. And to give you in just a couple of slides what this looks like: I showed you earlier this recognition inference engine in the middle of this. And oftentimes if you really dig through the C++ code — hopefully not, but, you know, even MATLAB — to try and understand what domain experts are doing, they're often just doing a wide range of experiments choosing some very high-level parameters. What's the pruning threshold? How do we want to do the observation probability? How do we actually want to represent the words: do we want to use a weighted finite state transducer, or do we need a lexical tree, and so forth? And these very high-level decisions that domain experts want to make end up invoking an awful lot of code when you actually want to experiment with them, and what we're trying to do is package that up in a way that they don't have to do that. So we see three different varieties. One is kind of the radio button, bullet-point selection menu type, where it's pretty clear the few choices that you want. Another is where they might actually just go and code the key kernel. And another is where they might actually take, say, some series of filters in a pipe and filter and [inaudible] computation across a number of those, right? So that's just to give you some intuition. Yes?
>>: 43 out of 36 slides.
>> Kurt Keutzer: [inaudible]. So to conclude: we're pretty gung-ho about single-node parallelism.
And we believe that the key here is to start with software architecture, not just jump into coding. And I believe I have a lot more credibility saying this today than I had two years ago, because we've actually had some success doing this, in conjunction with domain experts. In particular, we've demonstrated this approach in a wide range of areas — I mean, office automation sounds pretty tough, but once we got past speech recognition, we weren't afraid of graphs and so forth. And the goal here is essentially to encapsulate what we've learned in frameworks that we can give to application developers. Okay. While the next person is setting up, I'll be happy to take some questions.
[applause].
>> Kurt Keutzer: Yes, sir?
>>: [inaudible].
>> Kurt Keutzer: Sorry. I have a little trouble hearing you.
>>: [inaudible].
>> Kurt Keutzer: Oh, great question. Great question. Yeah. Thanks — you know, you always love the questions that get you to talk about one more slide, even if I [inaudible], so yeah. As you look at [inaudible] some geometric decomposition and then you're going into some sort of [inaudible] implementation, that path is repeated in 70 percent of the applications that we see, and so we're building essentially not a high-level application framework but a programming framework that says, if you're doing MapReduce in this variety, if you're doing [inaudible] computations, then here's a programming framework in which you can do those [inaudible] language. And then that goes on to support all the computer vision applications like you described. [inaudible]. Anything else? Okay. Thank you.
>> John Kubiatowicz: Okay. Can you hear me? Can you hear me now? Testing. Oh, there we go. My name is John Kubiatowicz, although most people call me Kubi because they can't pronounce my last name, so that's fine. I'm going to talk a little bit about some of the operating systems work that we're doing there in the ParLab. And so you might legitimately ask the question: what actual operating system support do we need for manycore? And perhaps we could just take Windows or Linux or something, port it, and be done with it. You know, these are mature operating systems. There's a lot of functionality in there. They become very hard to experiment with, they're possibly fragile, and they may or may not be what we actually want. So the approach that we're taking in the ParLab is to say, well, suppose we start from scratch, what might we do? Okay. We can do this because we're not designing a product, right? We don't have to support everything right at the beginning. So, clearly, applications of interest — and what are applications of interest? We're really looking forward, asking ourselves what people are going to want going forward with manycore applications, possibly on the client. And clearly the whole point of this is that explicitly parallel components are key, okay? Because if we take manycore as a given for a moment, that means that we're hopefully doubling the number of cores on some short timeframe, and the only way we're really going to use that is by getting parallelism. So this almost goes without saying: if the thing we come up with doesn't support parallel components, then we've got a problem. Obviously, direct interaction with Internet cloud services is clearly important as well. Okay.
And so nothing that we do ought to prevent that. Interestingly enough, just that remote interaction gives us some security and data vulnerability concerns, which perhaps we can address. And a lot of these new applications that seem to be of interest to people have real-time or responsiveness requirements. People are talking about new GUI interfaces, they're talking about gesture, they're talking about audio and video, and so we would want to see whether we could actually exploit manycore to give us better real-time behavior. And related to that is responsiveness, okay: real-time I usually think of as explicit deadlines or perhaps some streaming requirements like frame rates, whereas responsiveness has to do with, you know, I click on that device and I'd better get something that happens right away. So I'm just going to flash this up here — this acronym was just an amusing one that I came up with, but it reflects some of the things that were really of interest to us. It's RAPPidS, and it stands for responsiveness, agility, power-efficiency, persistence, security and correctness. As all acronyms go, this one's not perfect, but it gives you a flavor for some of the things that we're interested in. Namely, you don't actually see high throughput as a key requirement; it's almost a given that, okay, sure, we want to compute well, but these other things are equally important to us. Okay? We can debate acronyms some other time. So what's the problem with current operating systems? Well, I don't know. They often don't really give you a clean way of expressing the application requirements, okay — and I put "often" in here because there are counterexamples, not many, but a few. They might not let you say things like what's the minimal frame rate I need, or what's the minimal amount of memory bandwidth or QoS or whatever. Perhaps they don't give you guarantees that the application can actually use. So, gee, I've made this component work really well in isolation, but the moment I put it in with a bunch of other things then suddenly it doesn't work well anymore, okay, and that's because it's being interfered with. And there often aren't good ways to express that a particular component cannot be interfered with in order to really have the intended behavior. Full custom scheduling is often not an option. Now, why would I say something about that? Well, if you think about future client applications, in a lot of the work that Kurt just told you about, the parallelism there depends on a scheduler that's application specific, okay? It's particularly tuned for the application. And if we're interested in parallel components in the future, then we really want to make sure that we can support whatever kind of scheduling is needed. And, you know, this one's almost funny to put a question mark on, but are security or correctness actually part of modern operating systems? One might hope so, but it's not always clear. Okay. So the way I view the advent of manycore is that it sort of exacerbates all these problems, because there are a lot more resources that we could either use well or poorly. But it also provides an opportunity to redesign things from the ground up. And so I'm going to view manycore as basically a possibility: it gives me the chance to rethink a few things.
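A hypothetical sketch of the kind of requirement declaration Kubiatowicz says current operating systems lack — minimum frame rate, a guaranteed memory-bandwidth fraction, freedom from interference, a named application-specific scheduler — expressed declaratively so a resource allocator could act on it. This is not an actual Tessellation interface; every field name below is invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceRequirements:
    """What an application would like to tell the OS (hypothetical, not a real API)."""
    min_frame_rate_hz: float = 0.0           # real-time / streaming requirement
    min_memory_bandwidth_frac: float = 0.0   # guaranteed share of DRAM bandwidth
    min_cores: int = 1
    no_interference: bool = False            # "do not co-schedule others on my cores"
    custom_scheduler: Optional[str] = None   # name of an application-specific scheduler


video_chat = ResourceRequirements(
    min_frame_rate_hz=30.0,
    min_memory_bandwidth_frac=0.25,
    min_cores=4,
    no_interference=True,
    custom_scheduler="gang_scheduled_frames",
)

print(video_chat)
```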
And what I want to do for the rest of my talk here is basically tell you about the model of the operating system we're thinking about and some of the interesting implications of that model, okay? And that's pretty much where we'll go. And toward the end I'll tell you about some future directions that we're going in, and talk about a prototype and so on. So the first thing I want to tell you about is this idea of two-level scheduling and space-time partitioning. And it was kind of interesting when Burton Smith came out to give a talk recently; I was kind of nodding my head, yeah, okay, yeah, I agree with that. So basically two-level scheduling, as I am using the term, starts by saying, well, instead of the standard monolithic scheduler that you often see in an operating system — there is some thing in the middle whose job it is to try to do the best it can at satisfying everybody, and oftentimes you'll find it's got lots of options and tweaks and so on, but it's basically monolithic — what we're going to do instead is split it into two pieces. One is resource allocation and distribution, okay? The idea here is that there are entities in the system, and we're going to decide to give them resources. And we're not going to try to figure out how to schedule those resources; we're just going to say, okay, we're going to give you so many cores, we're going to give you so much memory bandwidth, we're going to give you so many resources. And the decision about that is going to be based on constraints about how fast we want that thing to run, or our observations of the way it has behaved in the past, okay? That's going to be our high-level decision: I'm going to hand resources to you. And then at the second level, the application is going to use its application-specific scheduling to use those resources. Okay? So this is kind of a two-level approach. Yes, go ahead.
>>: [inaudible] logical resources?
>> John Kubiatowicz: Well, that's a good question. Are they physical or logical resources? Perhaps I will defer that question for a moment. I'd say that ultimately every resource has to be virtualized in some way, but they're going to be as physical as possible, okay? Because that's going to give us better guarantees of performance. Okay? And you can ask your question again in a moment; we'll see whether I answer it. Okay? Now, this idea of spatial partitioning starts out as a very simple one. So here is a 64-core multicore processor, or chip. And basically what a spatial partition is, is a group of processors with a hardware boundary around it. And so up front I'm admitting the possibility of hardware support. One of the nice things about the ParLab is we actually have the ability to experiment with new hardware wrappers around processors, and I'll show you how we can use some of that in a moment. But basically it's a group of processors within a hardware boundary, and the boundaries are hard. So here's an example in which I've taken the 64-core chip and divided it up into chunks, and the key idea here is basically to go after performance isolation and security isolation by dividing up the resources. Okay? And so each partition here essentially receives, in principle, a vector of resources.
So some number of processors; some dedicated set of resources to which it has exclusive access, for instance complete access to hardware devices, or dedicated raw storage on a disk, or a chunk of the cache, okay; and then some guaranteed fraction of other resources. And here's where hardware might help: things like a fraction of the memory bandwidth. Okay? So I'm assuming here that we might actually have the ability — and we do have preliminary mechanisms for this — to actually say, gee, this partition gets 30 percent of the memory bandwidth and nobody can interfere with that 30 percent. Okay? Now, if we don't actually have that hardware available, then we can try to emulate it. But, anyway, fractions of other services. Yeah, go ahead.
>>: [inaudible] programmable? Can the operating system say, now I want 30 percent, now I want 30 percent?
>> John Kubiatowicz: Oh, sure. My assumption is that this hardware mechanism is fully under the control of the operating system. Yes. Okay. So yeah, 50 percent, 30 percent, 20 — whatever. Some fraction. Yeah, go ahead in the back.
>>: Have you thought about how this could interact with a hypervisor, as far as allowing it to punch through to the lowest process [inaudible]?
>> John Kubiatowicz: So what we're going to do for the moment is let's get rid of the hypervisor, okay? What I'm going to replace it with is something I'm going to loosely call a NanoVisor, okay? And I'm going to do that simply because calling it a hypervisor has baggage. All right. So now let's take a look at something — the first thing I want to address is, okay, this seems interesting, and maybe I can understand that performance isolation might be useful, but it seems like I've burned something in performance right off the bat by clipping things off, you know, isolating things to one set of processors or another. It's interesting: we actually have some folks in the hardware group who have done some experiments with spatial partitioning, and what they've found is that, in fact, if they just take two applications, or multiple applications, and run them simultaneously, just multiplexed in a standard OS fashion, that actually doesn't work as well as cutting the machine down and giving part of the machine to one app and part of the machine to the other app. Now, what's interesting about that, though, is you can't just divide it in half — that's this green bar — that's not the best. The best partitioning is something specific to the apps: okay, maybe I give four processors to one and 60 to the other, okay? So there is some spatial partitioning that is best for those two apps, other than just running a regular scheduler. And this is kind of an interesting possibility here from a performance standpoint. Now, I'm interested in spatial partitioning for lots of other reasons, but I just wanted to show you this to indicate that maybe it doesn't cause you to throw out performance right away. Question in the back, yeah?
>>: Do you have an explanation for why this [inaudible]?
>> John Kubiatowicz: You know, basically one way to look at this is that applications don't scale linearly in many cases, and so there's a point beyond which, as you add more processors, the bang you're getting for your buck is not being made up for by what you're losing at the other application that could be given those processors. So this is kind of an effect of not perfectly linear scaling, among other things.
>>: [inaudible] that produce those kinds of [inaudible]?
>> John Kubiatowicz: You mean the patterns in the sense of what Kurt --
>>: Yes.
>> John Kubiatowicz: So these — I don't think these were particularly done with patterns. These are actually just a set of standard parallel benchmarks that we --
>>: [inaudible].
>> John Kubiatowicz: I cannot. I apologize. I don't have an answer for that. That's a very interesting question which I am now going to try to figure out. That's a good question. Okay. So obviously if we just stuck to spatial partitions that were fixed, then that's not going to be useful, right? I mean, we obviously can't fix something at boot time and keep it that way. So clearly there's going to be what we'd like to call space-time partitioning. And so here's an example of a 16-core machine where we've partitioned it up, and the colors represent partitions. And you could imagine that over time things vary a little, right? Okay. This is probably not at all surprising to anybody, that we might want something like this. And why is this not just standard scheduling? Well, what's interesting here is that, first of all, these time slices are of somewhat coarser granularity than maybe a normal OS time slice. So we're not trying to go after the really fine-grained multiplexing — that would be for the second-level scheduler. What we're trying to do is put these isolated machines out there that basically are not disturbed by other applications, and use their second-level schedulers to get their performance, okay? So I would actually like to call this controlled multiplexing, not uncontrolled virtualization. Okay? Or another way to look at this is, for instance, we're planning the schedule of these slices a bit into the future, because we know enough about what we're trying to give in terms of resources. Okay? And also I'll point out that resources are gang-scheduled. This is important for a lot of different parallel programming paradigms: when I give a set of processors, I'm going to give them all at once, and I give all the resources at once, okay? And I'm going to take them away all at once. And the reason for that is that it gives the user-level scheduler the ability to do a better job of scheduling its resources, because it knows what it's got and what — yes?
>>: [inaudible] actual measurements of what it costs to reconfigure your partitioning?
>> John Kubiatowicz: So we don't have a lot of numbers of that form right now because, first of all, our prototype's in the early stages, and second, we're actually playing with hardware support. So I would claim that, as a hardware person, I could make the changing of this as cheap as possible for the software, but it's not clear that's a good tradeoff. So I think I could tweak the knob in lots of ways as to how expensive this is. Why don't you ask me that question again in a year or something, and I might have a better idea as we really push these ideas. But you could imagine that if the only thing we're talking about is processors, it's the cost of a context switch. If what we're talking about is setting up registers in the machine to get bandwidth isolation and so on, it could be either more or less expensive, depending on what kind of support we want to put in. Okay? Now, I would claim it's not going to be that expensive. We shall see. So let's push this idea a little bit further. If I'm space-time partitioning things, then that isn't really something a programmer wants to deal with. So obviously we need a little virtualization.
And this is getting back to your question earlier. And our view here is an abstraction we call a cell, which is basically a user-level software component with guaranteed resources. Is it a process? Is it a Virtual Private Machine? You know, we got into a lot of arguments once about whether we should call this a process, and, in fact, I was resisting because it's more, it's less, I don't know. But there's an analogy with a process here. It's got code, it's got an address space, it's a protected domain. But maybe there might be more than one protected domain or there might be more paging. So it might be more or less than a process. Okay? What are the properties of the cell? It has full control over the resources it owns. So I would say that while the cell is mapped to the hardware, it can use any resources we give it access to. It contains at least one address space. It has a set of communication channels with other cells, and I'll show you those in a moment. And then it has a few other things, like a security context which maybe automatically encrypts and decrypts information as it crosses cell boundaries. These are a few things we're playing with. And it maybe has the ability to manipulate its address mapping, via some sort of paravirtualized interface. So this is a potentially pretty low-level machine that we're handing to a piece of parallel code, a component, so that it can make best use of it. And realize that the reason we're proposing something like this is to stay out of the way of folks like Kurt who are busy trying to tune things to run as well as possible. We want to make sure that we provide a nice, clean environment. Yes? >>: Would something like a network device map onto these cells? >> John Kubiatowicz: So a network device would typically -- I'll show you a kind of a funny sketch in a moment. But a network device would typically get a cell, or maybe a couple of devices, depending on how you want to program them, might be put into the same cell. So we might have the network devices get a cell with a set of resources and they're allowed to use them any way they want. Okay? And so when mapped to the hardware, the cell gets gang-scheduled hardware threads -- we call these harts -- the guaranteed fractions of resources, and so on. Okay. Question. Yes? >>: [inaudible] multiple resources of each given type. >> John Kubiatowicz: Right. >>: Where does this one instance [inaudible]? >> John Kubiatowicz: So if there's one instance of a given type, there are a couple of possibilities here. One, you give it to exactly one cell and that cell performs the multiplexing, so you actually have some software that acts as a, you know, as a gateway to that device. That would be more in keeping with this philosophy than trying to multiplex that one device automatically underneath, because that would turn us from a NanoVisor back into the hypervisor view. So it's more that you would have a software component that would do explicit multiplexing. So okay, so what do we do with cells? Well, here, you know, I see an application divided into an explicitly parallel core piece and some parallel library with a secure channel between them. So it's kind of a component-based model. Applications are interacting components. And, you know, we get composability here. Obviously we can build this parallel library component separately from this application. This might have some properties that we tune. And then when we use it in this application, we keep those properties.
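A minimal sketch of what a cell might carry, under the description above; the field names and the rate-limited channel are illustrative assumptions, not the real Tessellation interfaces:

```python
# Toy cell abstraction: gang-scheduled harts, guaranteed resource fractions
# while mapped, at least one address space, and channels to other cells.

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Channel:
    # Channel to another cell; QoS expressed here as a cap on requests per
    # second so a shared service cannot be oversubscribed.
    peer: str
    max_requests_per_sec: int

@dataclass
class Cell:
    name: str
    harts: int                          # gang-scheduled hardware threads
    mem_bw_fraction: float              # guaranteed fraction while mapped
    address_spaces: list = field(default_factory=lambda: ["default"])
    channels: list = field(default_factory=list)
    scheduler: Callable = None          # second-level, user-level scheduler

# Example: a parallel library component given 4 harts and 20% of memory
# bandwidth, talking to a file-service cell over a rate-limited channel.
lib = Cell("parallel_lib", harts=4, mem_bw_fraction=0.2,
           channels=[Channel(peer="file_service", max_requests_per_sec=1000)])
```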
So the other interesting thing is that cells being co-resident on a bigger chip -- so remember we're thinking about manycore -- means that potentially you can cross this protection domain rapidly just by sending a message. Okay? There's no context switch involved, which is kind of what we were stuck with on a single processor or a small number of processors, okay? So this kind of echoes a microkernel view in some sense, but it is different in that, A, we're giving it to applications as well as services, and, B, we have this potential of very fast crossing of domains. And within the cell you have fast parallel computation. So we're keeping parallelism fundamentally here. And of course here we could see what might be the full mix, where there are some real-time cells that are doing audio and video, the file service is part of the OS, device drivers might be running in their own cells, and so on, okay? Now, of course it's all about the communication. So I've been kind of ignoring that a little bit. But we're interested in communication for lots of reasons. Communication crosses the resource and security boundaries. The efficiency of communication impacts how much decomposition you're willing to do. Here's an interesting issue. So we're interested in quality of service. And one of the things we're interested in is the potential to give a fraction of a shared file service to applications A and B and guarantee that. So what does that mean? Well, that means that potentially I have to restrict the number of requests per unit time across these channels to make sure that I don't oversubscribe that service. So we're definitely interested in being able to guarantee to some application that needs it a well-defined piece of something. You know, another question which is interesting is: so you send a message, but this cell happens to not be mapped at a given time. Does it wake up right away? Okay. That would be sort of the traditional event-driven approach. And it's certainly something we support. But something more interesting that we support is that, no, it actually wakes up when the thing is scheduled in its time slice, okay? And so interrupts and events are not the only way to send something. And as a result, you don't necessarily have to disturb a parallel item that's running well. Clearly we support interrupts because those are occasionally needed, but we're actually of the view that they're needed a lot less than people use them. And then of course the communication defines the security model. So there are a couple of those we're looking at. But it's really about who you allow to communicate with whom. Okay. So you could say here is, I don't know, Tessellation, right? So we've got a couple of large compute-bound applications, a real-time app. We have some file storage going on. Maybe we have a networking component that's doing intrusion detection continuously. Maybe we have something doing GUI and interfacing with the user in other ways. We might have some device drivers and so on. You know, you could see how this scales -- I mean, how this might go, right? Here's another view for more of a -- so this was kind of a client version. Here might be a server version where we have a bunch of chips. And we actually put QoS guarantees on, say, access to memory bandwidth, access to inter-chip communication, maybe access to the disk, and so on. Question? >>: [inaudible] happens when one or more new applications come into the workload?
>> John Kubiatowicz: So what happens when one or more new applications come into the workload is you have to change the allocations. And I'm going to talk about that in the remaining part of my talk. Okay? So clearly static situations are great in the short term, but they're not so good in the long term. Okay. So, in fact, let's talk about resources here for a moment. Good question. So another look at two-level scheduling. So why do we want to do it? So the first level is really about globally partitioning resources to meet the goals of the system. Now, what are the goals of the system? Okay, they're defined by policies that are both global and possibly local. And so there's, you know, you could imagine arbitrary complexity here. But it's busy partitioning up the resources of the system to various cells to try to meet some goals. And of course, we want to make sure that the partitioning is constant for some sufficiently long period of time that the local schedulers can do a good job. Okay. The second level is application-specific scheduling, okay? Goals might be performance, real-time behavior, responsiveness, and so on. And this is sort of running within the cell. There's another scheduler that's running at user level and doing whatever it wants with resources to meet the goals. Okay. Let's see. I think this is all I want to say. All right. You know, the second-level scheduler can defer interrupts and so on locally because we've got full control. All right. Yes. Sorry? >>: So one of the problems with constant [inaudible] is the consumer client-oriented applications are parallel as well? [inaudible]. Things change on the order of milliseconds [inaudible]. >> John Kubiatowicz: Sure. So the way I would view that is, if a user actually makes a mouse click and wants some major change to happen, we'll be perfectly happy to stop something that was running well to handle the user. I mean, this idea of trying to keep things constant for a long enough time to do well is only true if that's not the thing the user really wanted. So this is potentially a tradeoff between responding rapidly -- that's responsiveness -- and performing well. And we will go for responsiveness in the case of the user. >>: [inaudible] is with this kind of architecture can you respond quickly enough to [inaudible]? >> John Kubiatowicz: So we think the answer is yes. We don't know for sure yet. There are basically a couple of ways of doing that. One is keeping excess resources that you know you can get ahold of when you need them. And so you basically are giving out the excess resources to be used, but you can grab them right away when you need to do something like that. >>: I have a question: you look [inaudible] and you talk a lot about virtual memory and the impact of virtual memory on these clients. Can you say more about that? >> John Kubiatowicz: So virtual memory in the sense of seeing a memory space that's bigger than the amount of DRAM you have? I mean, there are a lot of different ways of using virtual memory. >>: [inaudible]. >> John Kubiatowicz: Okay. Yeah, let's take that offline. But, you know, basically you could imagine partitioning the physical memory and then giving the cell access to do anything that it wants with its virtual memory, including paging in an application-specific way. So just briefly I want to make sure that my last guy -- last colleague has a chance to talk. Oh, question. Yes? >>: About memory coherence. >> John Kubiatowicz: Yes.
>>: Are you going to assume that within a cell all processors have a [inaudible]? >> John Kubiatowicz: So I think that we probably want to assume that a cell has shared memory, cache-coherent shared memory, within it. But we are certainly looking at architectures for which that wouldn't be true. So, you know, our view is kind of: you use whatever resources you've got in a cell, and if you don't have shared memory you use message passing or something. But that's a parallel app that's running in a container. Now, you may ask how can I build a container -- or an application -- that can be handled both with shared memory and message passing and picks one. I'm going to avoid that question for now. That's a potentially hard question. Yeah? >>: Just one question for the [inaudible] resources. How is that determined [inaudible] abstract semantics and those are translated [inaudible] resources, or is it more specific [inaudible] how much resources it wants? >> John Kubiatowicz: It's either. So that's a great question. Does the app give something abstractly in terms of a frame rate, or does it say it wants so many processors? We're actually supporting both. There's what I would call an impedance mismatch between what the programmer understands, which is, say, frame rate, and the resources, okay? And I think that a lot of systems don't even try to address that. And we're experimenting with ways of figuring some of that out automatically. But there is that impedance mismatch. We actually think that rather than accepting it as a problem, we want to actually address it. So basically the state of the system is specified in what we call a space-time resource graph. This is just a chunk of it. But basically cells have what we call space-time resources, which might be four processors for 50 percent of the time, et cetera. And then potentially they can be grouped, so you can actually put resources up higher here, which really means that resources can be allowed to move from cell to cell within the group. Okay? And the guarantees are made at the cell level here. So how do we build this thing? So we actually have a partition policy service, or layer -- I actually see an inconsistency, I apologize -- that is busy doing the allocation. And I'll show you the structure of that. It produces space-time resource graphs, which then get implemented underneath by a mapping layer that takes the graph and decides how to map it to the hardware and into a set of slices, and an underlying partition layer that basically is the NanoVisor that provides the hard boundaries underneath. And you could also say that this mapping layer makes no actual decisions. It's constrained by what it's been told to do. It's really doing a form of bin packing, but it's a form of bin packing that's already been verified to be workable before it's been handed a graph. Yes? >>: [inaudible] obviously not going to maintain [inaudible] how does that impact the scheduling -- >> John Kubiatowicz: Well, so you would -- my view has been, for many reasons, not just the NUMA issue but other hardware issues, that cells are going to consist of co-located processors, not one that's on one side of the machine and one that's on the other. So for any given machine, as your cell gets bigger, there will be some well-defined NUMA properties to it. But it won't be, you know, split in half with the halves on opposite sides of the machine, right?
My suspicion is that the cells will always be co-located to whatever extent is possible. >>: I guess the point is that the NUMA domains form sort of very strong boundaries, meaning that you really can't put data in multiple NUMA domains unless the pattern in the applications [inaudible] that. >> John Kubiatowicz: Sure. >>: So you're trying to do the [inaudible] mapping problem that's a lot harder because of the NUMA non-uniformity. >> John Kubiatowicz: Right. So okay, so now I understand your question. Let me give you a -- here's how I would answer that question. I would say that if you've got a machine of some size and you want to chop it into pieces, the question is: can you build apps that run well? Okay. Now, the flip answer is it's not my problem. The non-flip answer is the following. We're actually looking at interfaces to provide topology information to a layer that's scheduling and deciding how to do that. You could say that if your app can't handle NUMA very well, then it maybe needs to be rewritten, or you need new patterns to look at. You know, the OS is basically providing the service of this machine boundary, which is kind of a clean boundary, and being able to program a NUMA machine well is actually, I would consider, part of a higher layer than the operating system. That's maybe an interesting debate that we could have tomorrow, which I encourage. Okay. So I should finish up here. But let me just show you here -- in fact, I'm not going to walk through everything. But here's an example of our actual architecture; I wanted to show you some layers. So we've got the partitionable resources down here. We've got the mechanism layer, or the NanoVisor, which is busy implementing the partitions, implementing channels, maybe doing QoS enforcement. The partition mapping and multiplexing layer is still part of the trusted code base; it takes space-time resource graphs in and implements them. So there's a validator to make sure that you don't try to do something that violates your security in some way, like shutting off a key operating system cell. And then it plans the resources and then somebody multiplexes it. And notice that there are kind of two key ideas here. Admission control. So we actually reserve the right to reject requests. Please start this cell. No. Okay. Now, that always throws people for a loop, right? But if you don't reject requests, there's no way to make guarantees. Okay. I'm going to say it that way. What do you do when a request is rejected? Well, that's interesting. Maybe you ask the user to change their preferences or something, or maybe there's an automatic mechanism. Now, what I've shown here -- let me just talk about this adaptive loop -- is that in principle, to meet this impedance mismatch that I talked about earlier, where we're expressing frame rates but we've got cores, what happens is we're measuring performance, we're potentially building models of how that performance is going, we're adapting resources, changing our graphs. You can see the loop here. And in principle, admission control, when it can't make a simple change, might ask for a major change. And as long as it meets the policies, maybe that major change will be admitted. Okay. Now, if you ask me the explicit details about this, I'll tell you we're still working on it, for obvious reasons. But this is the philosophy. Okay? We actually have several different modeling approaches that we're looking at for building this. And we'll talk about this tomorrow.
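To make the admission-control point concrete, here is a toy sketch -- with an invented graph representation and machine size, not the real policy service: a request to start a cell is checked against what has already been promised in the space-time resource graph and is rejected if the guarantee cannot be met.

```python
# Toy admission control over a space-time resource graph: each admitted
# cell holds a guaranteed (cores, fraction-of-time) pair; a new request is
# admitted only if the total promised core-time still fits on the machine.

TOTAL_CORES = 64

def promised_core_time(graph):
    # graph: list of dicts like {"cell": "audio", "cores": 2, "time_frac": 0.5}
    return sum(g["cores"] * g["time_frac"] for g in graph)

def admit(graph, request):
    """Return True and update the graph if the request can be guaranteed."""
    needed = request["cores"] * request["time_frac"]
    if promised_core_time(graph) + needed > TOTAL_CORES:
        return False          # rejecting is what makes guarantees possible
    graph.append(request)
    return True

graph = [{"cell": "file_service", "cores": 4, "time_frac": 1.0},
         {"cell": "realtime_audio", "cores": 2, "time_frac": 0.5}]
print(admit(graph, {"cell": "big_app", "cores": 60, "time_frac": 1.0}))  # False: rejected
print(admit(graph, {"cell": "big_app", "cores": 32, "time_frac": 1.0}))  # True: admitted
```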
But you can imagine -- I know Burton Smith really likes this notion of a convex optimization problem, where once you've got a definition for how things behave, then you're trying to optimize for something. Yeah, question. >>: [inaudible] the previous slide that you might reconfigure the system where a longer running application might see its resources, you know, change -- like, you know, maybe I was given 16 cores to work with and then suddenly I have like four or something like that? >> John Kubiatowicz: Yeah. So there are explicit interfaces for resource changes. And so if you say, I can deal with resource changes and be a good citizen, then you'll be told about that. Okay. I'm getting the -- let's get ready to finish up here. But, yeah, so there's an interface for that. Scheduling inside a cell is just user-level scheduling. There's lots of things there. There are also questions of how to divide applications into cells. You can see the obvious questions: if the granularity of the cells is too small, then the policy layer has too much complexity and can't really do a good job. And so that's interesting. And then finally, you could imagine things we might want from the hardware. So, like, for instance, obviously you want it to compute well in parallel, but partitioning support, QoS enforcement mechanisms, fast messaging support -- these are all things we've been looking at. And by the way, Dave mentioned RAMP earlier, this great emulator that allows us to add these mechanisms in and take a look. And so, for instance, we've actually got an emulation of a memory bandwidth partitioner mechanism that we've looked at. And you can show that we get good performance isolation having that mechanism. And it's not too expensive. Okay. So I will conclude. I'll be around for a couple of days. So plenty of questions, I'm sure. But I talked about space-time partitioning and cells as basically a new mechanism with which to construct things. This partitioning service is kind of an interesting part of this process. And we're actually building an OS right now that's got several of the NanoVisor pieces to it and runs code. We had a demo, and it is probably about to go through restructuring number 5,496, but, you know, we're working on it. So all right. I'll stop now. Sorry. [applause]. >> Ras Bodik: Okay. I'm Ras Bodik. I'll talk about the Web browser part of the project, which is really about how the software stack would look on top of the operating system, for the case when we are talking about client computation. So it is true that it was the browser that made smart phones popular, because people mostly started buying them after the browser became usable on them. But in a sense, in the same vein, they have failed. They serve as successful browsers on those platforms, but they are not what they have been on laptops and desktops. On those bigger platforms the browser has become the de facto application platform of choice, and many, if not most, new applications are developed in the browser. This is not the case on the mobile phone. And that's partly because of the performance. If you look at the New York Times front page, it may take 15 seconds to load on your iPhone. And the reason is not that the network is slow -- this was done actually on a fast network. The reason is that your browser is essentially a kind of compiler running on your phone, and it's very computationally intensive. And so the reason is that there is a lot of abstraction tax that you pay when you use the browser.
That's the tax you pay for productivity, the fact that you have a really powerful layout engine and an extensible, dynamically typed scripting language which you can embed DSLs into very easily. And that all shows up. And we did an experiment with a simple application written on top of the Google Maps API; we wrote the same thing in C and it was 100 times faster. It was probably 100 times harder to write also, and we'll get back to that in the talk. But that's the reason why people don't use browsers to write applications; mostly they use things like Android, Silverlight and the iPhone, as the case may be, and their siblings, which are a little bit more flexible and powerful, but they are lower level and more efficient. So the browser is CPU bound. And what's inside? There is essentially a parser that takes whatever the input is, HTML or CSS, and gives you the DOM, which is the abstract syntax tree for the document. And there is a selector engine which effectively takes the CSS styling rules and maps them onto the DOM. I'll talk about it some more. Then there is the layout engine, which positions those elements on the screen. And then you render, which is you just move the bits, blend them onto the graphics on the screen. And then there is JavaScript, which provides the application logic and interactivity, and that may actually redo everything. And which of them is expensive? It turns out that all of them are expensive. So, speaking of Amdahl's law, you cannot really sidestep any of them, and all of them need to be optimized. And so we look in the project at the top four levels. We have a story about the language, which is perhaps the most important story, but it's not based on JavaScript, and I'll try to justify why later in the talk. So these are the three main driving forces, I'd say. We care about low-power devices, phones and, in fact, smaller, predicting that the phone is the next computing platform that we'll use, but it's not the last one. As computers move from the stuff that you have on your desk to your lap to your hand, they'll end up on your ear somewhere. And not so far away. We look at client applications. They'll be interactive, they'll have sensors, they'll have augmented reality, which takes all the sensors together and helps you live your life. And we look at the future of productivity languages. One reason why JavaScript became popular was that we had a lot of spare power in the '90s, and we didn't have the application demands. There was sort of a surplus of compute power in the '90s as scripting became popular. Now you see the opposite pressure. People want to write applications with scripting languages, but the compute power is gone, or the improvements are gone. So what the future of scripting languages looks like is a major question here. So this was the original motivation for the browser project; we realized it's not only a phone, once we observed that one could put a little laser projector into the phone and turn it into a tablet computer in a bar. And that was just a vision. This is a mock-up picture we did by actually holding a real projector above the desk. But it turns out that Microvision released this laser projector just about a month ago. You can buy it for 550 bucks. And I don't own it yet, but apparently it's very impressive. You cannot fit it into the phone yet because the device itself is about as big as the iPhone, but the projector element itself is like the tip of this pen. So I think you will actually see it soon. Please? >>: [inaudible] battery life.
>> Ras Bodik: That mostly includes battery life, yes. But of course there is heat dissipation connected with it. But I think battery's a good [inaudible]. So here is the stack of the browser. So there's the parallel lexer and parser here, of course; as before, there is the CSS matcher, which I'll talk about; the parallel layout engine; and the rendering, where we don't have many results yet because you really need to rewrite OpenGL to be parallel. But at least we are investigating how parallelism could help over there. Now, the whole thing is not written from scratch, although currently it mostly is. It will be generated, in the spirit of a parser generator. You'll have a generator of parsers and you'll be able to create multiple variants and auto-tune across the space of the generated things. The same for the layout. We have formally written the specification of most of CSS with attribute grammars, with the goal of actually generating that engine automatically, with various optimizations that will be of a serial nature, such as incrementalization, and of a parallel nature, such as task parallelism. >>: [inaudible] attribute grammar? >> Ras Bodik: It seems that it does. >>: Wow. Which attribute grammar [inaudible] did you get to choose? Or we can go into that. Sorry. >> Ras Bodik: Okay. I will [inaudible] the key differences among them that you have in mind? >>: I don't know. I thought there were like parsing, there's [inaudible]. >> Ras Bodik: Okay. I guess we'll have to take it offline. >>: Okay. >> Ras Bodik: Okay. So -- but it doesn't end here. Our scripting story is a constraint-based language that combines constraints with events, so it really gives you the ability to put together the layout of the page, the semantics of the layout, and essentially the activity. And there will be a synthesizer whose output is going to be an attribute grammar, and then again you can go through this engine that generates parallel and incremental evaluation. So let's start with parallel lexing and parsing. Lexing is a simple task. You have essentially a string of characters here, and you need to break it into tokens. And the tokens are described here with a regular expression, which really corresponds to an automaton. So you need to run this input through this automaton and determine at each step which state of the automaton you are in. It's as simple as that. The problem is that, of course, this process is naturally serial. It's embarrassingly serial, you could say. You cannot make the next step before you know the state of the previous step. And yet the way we would like to parallelize it is to actually break the input this way, so that you do not need multiple files in order to obtain parallelism -- you can actually parallelize one file, one stream. So for stream processing this seems to make sense. So here is the observation. In lexical analysis at least, maybe less so in regular expression pattern matching, if you start your lexical analysis in any state, after some time you end up in the same state. So we pick an arbitrary state, not necessarily a correct one, but after three steps in this case, because there is one token in here, we all end up in the same state. So there is sort of a notion of a small warmup prefix that is sufficient to get you, pretty much no matter what state you are in, into a correct state, even without having seen what came before in the input.
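A small sketch of that warmup observation, using a made-up three-state tokenizer DFA rather than a real lexer: start the same input from every possible state, and the runs collapse to one state after a short prefix.

```python
# Toy DFA with invented states and transitions: letters continue an
# identifier, a digit continues an identifier like "x1" or else starts a
# number, anything else returns to the "between tokens" state.

STATES = ["between", "in_ident", "in_number"]

def step(state, ch):
    if ch.isalpha():
        return "in_ident"
    if ch.isdigit():
        return "in_ident" if state == "in_ident" else "in_number"
    return "between"

def run_from(state, text):
    for ch in text:
        state = step(state, ch)
    return state

text = "x1 = 42 + rate3"
prefix = text[:3]                       # a short warmup prefix, here "x1 "
print({s: run_from(s, prefix) for s in STATES})
# -> all three starting states end up in "between", so a chunk that only
#    guessed its starting state can verify the guess after the overlap
```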
And this observation lends itself to this algorithm: you take your input and split it into chunks with a certain K-character overlap. And what do you do there? Well, you just predict that you are in some state, which is this one. And you run each of these parallel processes independently, starting from this guessed state, which usually leads to a good warmup, and you realize that yes, indeed we guessed the state correctly after the warmup -- even though we started from a wrong state, the state was again correct -- and so we check that the speculation matched. It's a speculative algorithm, and you get speedup because in this domain things seem to be predictable. So here is the speedup on the IBM Cell processor. For large files the speedup is nearly perfect. For smaller files, which is about 250 kilobytes, scalability is not that great yet. We need to work with the OS and our hardware to tune it a little bit better. But it's still not bad: on five cores you are still nearly five times faster than flex, which is a well-tuned serial lexical analyzer. Parsing. Parsing is the step that comes after you obtain a stream of tokens, not characters -- so this is the result of lexical analysis. And your program is described with a context-free grammar, which might say that the program is, in this case, just one function with one argument variable and a list of statements; a statement can be an assignment of an expression to an ID; and an expression is ID plus E, and so on. [inaudible] you know what I'm talking about. And what the parser does is it goes through the input, usually left to right, and identifies things like: oh, I have an expression here, here, and here. And I also have an expression here; X greater than Y is also an expression. I have a statement here. And therefore I have a statement here. And therefore the whole thing is a program. And this all happens left to right. So the context that you see on the left is important. Now, if you want to do it in parallel, you would like to give one processor only the left part of the input so that another one can work over here. And the way we do it is that there will be sort of a main parser going through the whole input, the one that does the non-speculative work, and on those chunks there will be a speculative parser that will try to preparse the input, essentially allowing memoization, so the main parser will just skip over it when it gets to that part. So that preparser will have to guess at the context of the main parser, which of course it doesn't have, because the main parser is still working on the left. So it will guess correctly that, well, these are expressions and that this is an expression as well. And now it will have a little bit of a conundrum, because it will come here and say, well, this looks like an expression to me, and indeed it is, because you can derive that from E. It is an expression. Of course, in this context it is not: these parentheses are not part of the expression; they are part of the statement's parentheses. But nothing went wrong. You just memoized here that this is an expression, and when the main parser comes to it, it is not going to look for an expression, so this work is, you know, redundant -- a little bit of extra work we've done -- and it will come here and skip over this because we have already parsed it as an expression. And it will also, by the way, skip over that part because that has already been preparsed as a statement.
And now it can put these two pieces together and realize that the whole thing is a statement across these boundaries, and the whole thing is a program. So it is speculation: based on the context that you've gleaned from looking at the tokens in the input, you try to predict the state of the parser, similarly to lexing. And sometimes you do more work, but you never go wrong. And here is some fresh data on predictability. So here are various pairs of identifiers that you can see in the input. And what you see here is that if for each pair you make two guesses as to the state of the parser, you are doing pretty well. For this one you get about 50 percent probability that you guess the state correctly. And the data are actually better than what you see here. All this bad-looking data, with a little bit of static analysis of the grammar, which we haven't done, would probably go like this as well. I'd rather not go into it right now. So this is the parser. Now let's go quickly to the CSS selector matcher. What happens in this step? You have your document, which has a root, and then it has a paragraph, and another one here has text and an image. And here is another paragraph with a bunch of words in it. This one is bold. And the goal is to take a styling rule such as this one, which says an image that is part of a paragraph needs to have the following font size -- and you have a bunch of them; this is essentially the rule, and this is the styling prescription -- and you need to, for each node, find the corresponding rules. So in this case, these are these two. And we have about a thousand nodes and a thousand rules in a typical document. So there is quite a bit of work. In fact, this seems to be the most expensive component in browsers. And well-tuned websites like Gmail don't even use this functionality because it's so slow. In a sense you could say they sacrifice some engineering goodness because this matching is very slow. So let's see what we did. Because this is a huge data-parallel computation, the parallelism is obvious. You just need to know how to slice it. And the most significant gains actually didn't come from parallelism but from locality optimizations such as tiling -- making sure that we have in the cache only the important data at a time, and then you go through the next one and the next one. So these rules are split in intelligent ways so that only a subset of these rules is in memory at a time. And if you look at the speedup, you start with an implementation that is similar to what we have in WebKit, the WebKit browser. After L2 optimizations it goes to probably a factor of 3. After some L1 optimizations it goes to a factor of about 25. And then you add to it the speedup from parallelism, which is about a factor of 3 to 4 on 4 cores. We are quite flat after that, but hopefully more work will help there. Together you are about 60X faster than the original, which probably makes this a non-bottleneck in the browser. So let's talk about CSS. Rather than telling you right up front what happens in this parallel layout engine, let me tell you why formalizing CSS may make sense. So here is a piece of CSS, a few nested boxes which float. And here is how they render on three major browsers. On each of them, they are different. And we still don't quite understand exactly where the ambiguity is in this part of the specification. Actually, Leo probably understands it.
But I'll show you a simpler story which tells you why having a formally specified spec may help you find holes before you actually release the spec. So here are three nested boxes. You give the width of the inner one. This one needs to be half the width of the parent, and this one needs to be as small as the child. And you probably immediately see the problem, that you have a dependence: this one depends on this and this one depends on that. So you have a cyclic dependence. Of course you can solve this layout problem, perhaps with fixed-point iteration. But you know what the output looks like, right? It will be that these two outer boxes have zero width, because that's the only condition under which you can meet both of these constraints. And so this is essentially what the spec silently says these rules mean. When a browser engineer implemented those constraints, he looked at it and probably said, well, either I don't like the cyclic dependencies, because I want to have a bounded number of passes through the tree, or I don't like the output of the specification. So he said, well, I need to break some of these constraints, because I cannot satisfy all of them and still get something good looking with good performance. So he either breaks the outer constraint, so now this outer box is not shrunk to fit its child, or he breaks the inner one, and now this one is not half of its parent. So which one would you pick if you had to break one constraint? Of course both of them look equally good. But they decided for this one -- not all three independently; probably one of them decided and then the others just copied the semantics, because this is how you interpret the CSS spec: you implement the browser, you see what the other browsers do, and then you do the same. So no wonder these ad hoc decisions about which constraints are broken, and just the fact that you implicitly drop these constraints, lead to surprises in CSS. In this case, at least they made a decision that this box doesn't stick out as it does here. But as this remark by a fan of CSS demonstrates, no, you are not always -- [laughter] -- you are not always so lucky. So here is why having a spec would help: if you write this as an attribute grammar, you would immediately see, without seeing any particular document but through static analysis, that you have this cyclic dependence, that something is wrong, and that you need to resolve it, rather than leaving it to ad hoc dropping of constraints, which is of course surprising. So there are these benefits, and Leo did find some surprising holes in the spec, especially when it comes to tables. You can also then write a parallel layout engine: you can look at how layout is computed and identify parallelism in it. And so here is how a layout happens. Again you have a tree, which is pictured here. You have one paragraph and here we have another one. But there is this image here which is a float, meaning the text needs to flow around it. And so what we have: you have a body, you have two paragraphs, you have the word hello. This guy here is a float, which means it can float to the next paragraph and the text goes around it, and here is the rest of the paragraph. And CSS layout, like LaTeX layout, is a so-called flow layout. You lay things one after another, essentially the way you lay bricks. Often the specification says where you need to put the next thing after the previous one has been laid out and you know what part of the screen is free. This is especially the case in LaTeX.
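Going back to the three nested boxes: as a rough illustration of how a formal specification exposes that problem statically -- using an invented, much-simplified dependency representation rather than a real attribute grammar -- the width constraints boil down to a cycle that can be flagged before any document is ever laid out.

```python
# Simplified: each box's width is an attribute that depends on other
# attributes. The nested-boxes example gives: the outer box shrinks to fit
# its child, the middle box is half of its parent, the inner width is given.
# Written as a dependency graph, the outer/middle pair forms a cycle.

deps = {
    "outer.width":  ["middle.width"],   # shrink-to-fit: depends on child
    "middle.width": ["outer.width"],    # 50% of parent: depends on parent
    "inner.width":  [],                 # given explicitly
}

def find_cycle(deps):
    visiting, done = set(), set()
    def visit(node, path):
        if node in visiting:
            return path[path.index(node):]          # found a cycle
        if node in done:
            return None
        visiting.add(node)
        for dep in deps.get(node, []):
            cycle = visit(dep, path + [dep])
            if cycle:
                return cycle
        visiting.discard(node)
        done.add(node)
        return None
    for node in deps:
        cycle = visit(node, [node])
        if cycle:
            return cycle
    return None

print(find_cycle(deps))   # ['outer.width', 'middle.width', 'outer.width']
```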
The description does look like that. And so the layout really looks like this: okay, what is the layout? We need to compute the sizes of each element and their coordinates, X and Y. So you start out by saying this is the base of the whole thing, this is the font size, and these last two, X and Y, are sort of where the cursor currently is. And then you go in order through the graph, computing things the way you would do if you laid bricks one after another. And of course there isn't much parallelism here, because there are these dependencies -- you know, after all, the position of this paragraph depends on this paragraph, and in particular where these words can go depends on how much space this picture takes up. But if you look at the attribute grammar, fine-grained dependencies show up. And now the thing is all of a sudden more parallel. You can now see that, oh, I can compute font sizes of this without touching this part of the subtree. And five phases all of a sudden appear. In the first phase you compute font size and a temporary width. Then you go bottom up. Both of these phases run sequentially one after another, but each is parallel within itself. And then you do another phase and you go up. And in the fifth phase you are ready to compute the absolute positions. And here are some preliminary speedups. This is from a not quite faithful implementation, but on four cores about three. This is hopefully going to be more complete soon. So what you get is that once you have a formal spec, you can then automatically, with an attribute grammar engine, actually find that parallelism and generate the engine with other optimizations, and auto-tune over variants. What smarts will have to go in here we don't quite understand yet. Some part of the parallelism was discovered by Leo tweaking the grammar by hand, so those tweaks will need to be generalized into automatic parallelizations. So let's now look at the motivation for this language. So why constraints? I'll try to give you -- I think I have two or three reasons for why the scripting should be based on constraints. The first one is -- well, the first one actually motivates why the language should have constraints and events together, why it should have layout and scripting, in other words, in the same box. So here you see how we typically program in today's browser. If you have a tile, which may be a piece of a map, an image in the browser, and you want to position it somewhere or set its height, you will in JavaScript write something like tile.height: you take an X, which is an integer variable, and you append 'px', a substring, to it, and this is then passed to the browser proper, which will discover how high the tile should be and will lay it out. Now, what actually happens in the execution is that you take 15, an integer value, convert it into a string, and concatenate it with 'px'. Then you pass this '15px' string down to the browser across the JavaScript/C++ boundary, and there you of course dutifully parse it, break it into 15 and a unit flag, and now you know this was 15. So this is one of the reasons why browsers are slow: the context switch from here to here is expensive, and therefore some research groups, such as the C3 group here and Adobe, put the layout essentially into the same language so that you don't have to cross this boundary. Still, the problem of optimizing this away remains, because it's not something you can partially evaluate, at least not with the standard techniques.
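A bare-bones sketch of that round trip, in Python pseudocode rather than JavaScript and C++, with invented function names: the integer is turned into a string on one side of the boundary only to be parsed back into an integer on the other.

```python
# What effectively happens today: the script side builds "15px", the layout
# side parses it back. The value crosses an expensive boundary as a string
# even though both sides only ever wanted the integer.

import re

def script_side(x: int) -> str:
    return str(x) + "px"                  # e.g. tile.height = x + "px"

def layout_side(value: str) -> tuple:
    m = re.fullmatch(r"(\d+)(px|em|%)", value)
    return int(m.group(1)), m.group(2)    # back to (15, "px")

print(layout_side(script_side(15)))       # (15, 'px') -- a round trip for nothing

# With layout and scripting in one language, the layout engine could be
# handed the typed value directly, with no stringify/parse and no crossing:
def unified(x: int):
    return (x, "px")
```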
So this is motivation for having one language where the layout part and the scripting part are together and can be optimized, and there is no expensive context switch. I told you about the application we wrote to understand how much abstraction tax there is in the browser, the one that was about 100X faster. It was quite hard to write. One key difficulty was that you had to write macros or whatever for converting between four coordinate systems. Because you had a notion of coordinates out there as sort of longitude and latitude. Then there are pixel coordinates. Then there were tiles within the map, because the map was built out of tiles. And there was one more, which I forget. But I don't understand the code that Krste wrote, because this was not easy to get right. It would be much nicer if these conversions were not written in a functional style, directionally -- if they didn't say how to convert, only what relationships hold between these coordinate systems. So I would much rather write something like this. I would like to say there exists a relation, from some family of linear relations, between -- what is the relation? It's between the coordinates on the map and the coordinates on the screen -- and I'd like to say, well, if this X on the map is in a relation with the screen, then X plus one kilometer must be screen.X plus 100 pixels. And hopefully that establishes sufficiently what the relationship is. And from this description we will first come up with what this relation exactly is, so that it is unambiguous, and then synthesize it as a function. So that's where we are going with this. Motivation number 3 is that if you look at some Web pages in terms of visual design, human perception, some of them are easier to navigate than others. This one is reasonably busy in terms of text, and yet it is easy to navigate. So why is that so? The reason is that it is like this document, which I would call beautiful. It's an example of great design, where your eyes can navigate and really know where you are going and where the captions are for particular images. And the theory is that this is because it is an instance of a grid design. You first prepare a grid for your document and then you put images and text into it. And so clearly, you know, you have the images here; a bigger image covers not some arbitrary fraction of the page, as would happen in a typical Web page, but covers these four. And then the text comes here. And there is a whole theory and informal guidelines specified for how to do it, but no language that would make it easy for you to follow this design, or better, that would make it difficult not to follow that design. In fact, we would like to give designers a way to easily build such documents and make it hard to build others. So really I think the long-term goal is to give designers a language for building layout systems such as CSS. Finally, events. So JavaScript events are sort of gotos. If you have a box on the screen that you want to move, you write a piece of code which has two nested handlers, one and another one. These are sort of first-class functions that essentially are interrupt handlers. It's not easy to see what is going on. It's easier to see when you write it as a dataflow program, where you say: I have a source, it's the mouse, and each time you move the mouse it generates a pair of coordinates that go here, and you delay each by 500 milliseconds, and then they set the coordinates of a box on the screen.
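A small sketch contrasting the two styles just described, in Python rather than JavaScript and with an invented miniature framework (not a real browser API; the 500-millisecond delay is elided): nested event handlers versus an explicit dataflow pipeline from the mouse to the box position.

```python
class Box:
    def __init__(self):
        self.x = self.y = 0

# Style 1: nested event handlers -- the logic hides inside callbacks
# registered on the mouse, so the flow of control is hard to see.
class Mouse:
    def __init__(self):
        self.handlers = []
    def on_move(self, fn):
        self.handlers.append(fn)
    def move(self, x, y):
        for fn in self.handlers:
            fn(x, y)

def handler_style(mouse, box):
    def on_move(x, y):
        # in the real version this step would itself be a delayed timer
        # callback nested inside the mouse handler
        box.x, box.y = x, y
    mouse.on_move(on_move)

# Style 2: dataflow -- the same program as an explicit pipeline, so the
# flow of control is the flow of data and is easy to see and analyze.
class Stream:
    def __init__(self):
        self.sinks = []
    def map(self, fn):
        out = Stream()
        self.sinks.append(lambda v: out.push(fn(v)))
        return out
    def sink(self, fn):
        self.sinks.append(fn)
    def push(self, v):
        for fn in self.sinks:
            fn(v)

def dataflow_style(mouse_stream, box):
    coords = mouse_stream.map(lambda ev: (ev[0], ev[1]))   # mouse -> (x, y)
    def move_box(p):
        box.x, box.y = p
    coords.sink(move_box)

box = Box()
moves = Stream()
dataflow_style(moves, box)
moves.push((10, 20))
print(box.x, box.y)   # 10 20
```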
And now that we see what the program is doing, it's easier to analyze, because you know what the flow of control is -- it's the same as the flow of data here. Dataflow is an improvement, and you start seeing it in scripting languages in terms of data binding and such, where you can refer to variables from your HTML document. It's not great when you have a more complex document, because it happens that you need to send messages this way and also that way, and these cycles may lead to oscillation and programs that have bugs. So what do we offer? We offer a system that does not have directional flow like dataflow, but that is bidirectional, or relational, where there is no particular flow direction. So let me just show you two fragments from a case study, a sort of driving application. So imagine you have a video player with the usual play button, the name of the movie, the movie itself, and -- this is important here -- this is your timeline that you can grab and scroll, right? You could grab it here and scroll it. Here is an annotation window. So with this movie comes a set of captions. Each of them has a start time and an end time. And they are displayed here in a scroll box. And as the movie goes on, the annotations march along. But when you grab this and move it, both the movie and these annotations need to be rewound. When you scroll these annotations and you click on, say, this one, the time needs to advance to this annotation, the movie needs to go there, and the annotation needs to be centered over there in the window. So that roughly corresponds, we believe, to what interactive applications will be in the future. So here is how we envision we will write it. What do we have? We have boxes with ports, but these lines here are just constraints. At this level, all these constraints are equality constraints. So this variable here and this variable here must always be equal. In particular, look at the time. There is a video player which has a time port, the annotation display has a time port, and the slider which shows the time has a time port, and all we are saying is that these three are tied together. Any of them can initiate a change in time, and the others need to adapt, but I'm not saying in which direction the messages flow. That will come up automatically from the compiler. Similarly, we have a toggle button which tells you whether you are playing or not playing. And both this part and that part can initiate the change. I can press on the button and start playing. When the movie finishes, on the other hand, the activity goes this way, because you want to gray out that button since the movie stopped playing. Yet I do not need to worry about these messages in both directions; I just state this constraint. So this is how hopefully these applications will just be glued together, even without messages. Now, I want to tell you how events would be handled here. I have one minute, and I think I'll manage. What you see here is this annotation display, which showed all the annotations. You actually create them here. You get a list of annotations and you create a bunch of boxes, just like boxes in your HTML. And they are all put into the V box -- so you can think of the V box as the parent in the DOM of these boxes -- and you put the V box into a scroll box. There is nothing magical here. But let's look inside here. You'll see two other benefits of constraints and events. One is that, well, here is one annotation.
The text from here comes here, and this is the box that displays the annotation. Here are the height and Y. Now, what do I do when I want to make the annotation active? Normally I would need to come here and say, okay, when there is a mouse click on that box, I need to somehow send a message somewhere and adjust the time. And even in event programming, or even in dataflow, you typically need to go to some central repository and send the time, and then the time would come back and say, oh, okay, now we have a new time; let's see if the time is in the interval of this annotation; if it is, you make it active; and if it is active, the Y coordinate needs to be zero, which means this guy is centered. But I do not want to go with this event all the way to the time and figure out how things will actually propagate. All I do here is say, well, okay, if I click on the mouse, I insist that this annotation becomes active. But I don't really care about how the time is set. This will again be done automatically. In order for this annotation to be active I need to change the time, but I don't worry about how. So the time can be inferred from the activity, because things are bidirectional, and it seems to be more natural, more modular, to think about setting the activity of an annotation rather than figuring out what time I need to set so that it becomes active. And these bidirectional constraints occur everywhere. You saw how they can simplify event handling in terms of time, and coordinate system translation is everywhere. Imagine you want to map your landmarks from the map onto the camera view. It's all about coordinate mapping between the map view and the camera view and the angle of the head you are looking at, and so on. A scroll box is full of bidirectional constraints. And visualization is about placing labels such that they do not overlap and look visually appealing. But current solvers are not expressive enough, they are clumsy, and there are other things that we need to address in this work. And there are a few other things that I didn't talk about, technically probably the most interesting ones, and I can go into them if you are interested. Thank you for your time. [applause]