>> Eric Horvitz: So we're at exciting times here. I'll put this up. My team is very excited, and so are many of our colleagues in the academic community, about all the structured and missing content in the world, lots of data about [inaudible] users now. In the old days we always would say, you know, boy, if we just had data. And we focused on the algorithms. Nowadays we have too much data and we're still focusing on the algorithms, but we have had quite a bit of enhanced prowess in learning and inference reasoning strategies. It's still overall an intractable problem, but there's been some nice work on approximations that work pretty well. And we're seeing opportunities to interleave machine intelligence at the core of new kinds of services and experiences. I think this is just at the basement level right now. I see a lot going on, and several talks the last couple of days on the schedule have addressed some of the opportunities there. So we have a lot of sensors out, a lot of connectivity and content. We have these new algorithms that seem to do well, built over 15 years in the probability space, the probabilistic graphical model space. But computation is a big question for people working in this area of intelligence and automation of things that people have tended to do in the past. And it's not clear where we are. Certainly we made great progress in the algorithms, but the computation and memory available has helped out quite a bit. If you look at the work in the Deep Blue project, for example, a lot of the boost in intelligence came from deep plies of search. On the other hand, there was still quite a bit of innovation in how the search was directed, talking to the people working on that project. So if indeed we're at a bend in the curve here, it's bad news for a lot of expectations in an area that has finally seen a bend in the curve of prowess in terms of doing these kinds of things.

Just a quick review. There are several different approaches to machine learning. Typically you have a set of random variables, and you want to build a model over the random variables from a large dataset. One approach that I'm very fond of is structure search, where we actually build probabilistic graphical models by doing search over all structures, looking at different dependencies given a dataset. It's a big search tree typically. Our team has done a whole bunch of work in this space over the years. The idea here is you're searching with a score that gives you the goodness of models given a dataset -- so basically how well a model explains the data -- and you find the best one typically. And then you can do inference and decision-making with that model. And there are all sorts of innovations, like in the relational area, looking at learning in databases. There are ideas of parameter coupling, where you look for templates that you see repeated, given structure, for example, of database relationships. What's nice about machine learning is that besides doing inference about a base-level problem, you can sometimes induce the existence of hidden variables, which is really interesting for science. Like you know that it's not just that A and B look like they're influencing C, but that D is hidden. And you can actually induce or infer the likelihood of a hidden variable, which is almost like magic when it works. You also can identify causality sometimes -- not just inferring an arc of dependency between two variables but actually knowing its direction, A causes B, under certain conditions and given constraints.
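To make that structure-search idea a little more concrete, here is a minimal sketch -- not the MSR system itself -- of greedy, score-based structure search over a small discrete dataset using a BIC-style score; the scoring function, toy data, and variable names are all invented for illustration:

```python
import itertools, math, random
from collections import Counter

def bic_score(data, var, parents):
    # Log-likelihood of column `var` given its parents, minus a BIC-style complexity penalty.
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[var]) for row in data)
    par = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(c * math.log(c / par[cfg]) for (cfg, _), c in joint.items())
    states_var = len({row[var] for row in data})
    penalty = 0.5 * math.log(n) * (states_var - 1) * max(len(par), 1)
    return loglik - penalty

def creates_cycle(parents, child, parent):
    # Adding parent -> child is cyclic if child is already an ancestor of parent.
    stack, seen = [parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def greedy_structure_search(data, variables):
    parents = {v: [] for v in variables}
    while True:
        best = None
        for child, parent in itertools.permutations(variables, 2):
            if parent in parents[child] or creates_cycle(parents, child, parent):
                continue
            gain = (bic_score(data, child, parents[child] + [parent])
                    - bic_score(data, child, parents[child]))
            if gain > 1e-9 and (best is None or gain > best[0]):
                best = (gain, child, parent)
        if best is None:
            return parents
        _, child, parent = best
        parents[child].append(parent)   # commit the best-scoring edge and keep searching

# Toy dataset: B and C are noisy copies of A.
random.seed(0)
data = []
for _ in range(200):
    a = random.randint(0, 1)
    data.append({"A": a,
                 "B": a if random.random() < 0.9 else 1 - a,
                 "C": a if random.random() < 0.9 else 1 - a})
print(greedy_structure_search(data, ["A", "B", "C"]))
```

Each pass adds the single parent edge that most improves the score; real systems use richer scores, search operators, and handling of hidden variables, but the search-with-a-score loop has the same shape.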
So one of the hot areas in machine learning right now is active learning, which, again, is more intractable than base-level learning. And the idea here is that you want to figure out, given any model that you have, how you should extend that model -- what cases you should acquire next for labeling, for example, from a dataset where we may have a bunch of unlabeled data. And this is used in a variety of ways, including understanding how to explore the world perceptually, how to grow your models. There's been some work on tractable approaches to active learning, but it's a challenging space. We've done a lot of fun work on the idea of lifelong learning in our team, where you have a model that's active over the lifetime of its usage, and it's reasoning about how much each new data point is going to help out the performance over the long period of time, over the lifetime of a system's usage, given a sense for that and for user needs.

Another area that we're very excited about -- which, again, is a hard problem, hopefully parallelizable, and there's some opportunity there -- is selective perception. Here's a system we built called SEER. It can listen to the desktop, it can listen to sound. It can do video classification from a camera feed and audio classification. It turns out we can't compute all of this. So what we did in the SEER project was build policies to compute the next best thing to look at, and we could trade off the amount of precision in any of these modes with its tractability. And what we found is you're backing off at times -- for example, on vision, going from a color blob analysis down to a black-and-white analysis because that's all you need right now -- and so on. We run this thing dynamically. It's very nice. There are actually pricing strategies and so on, and the CPU is triaged to do the best it can given the structure of the model and the uncertainties at hand.

So let me just focus a little bit on a few large-scale service problems. I think they're all kind of illuminating, and I'll cover a couple of them here. And given the titles of the talks that I've seen on the program, I think we've seen some talks in this space. I like to say that on our planet we're seeing a proliferation of intention machines everywhere -- layers and components that take in observations and that predict intentions, actions, services. A good example is Web search. Put a set of queries in, and a system is learning -- or a corporation or organization is learning -- what the intention of the search is, as well as what content should be returned and how to basically even target ads, for example. A sister kind of technology is preference machines, where we have sets of preferences being taken into a large library and reasoned about, and we actually make decisions about the products or other content that might be preferred over time. And a good example of this is collaborative filtering, where you have sets of preferences -- actually in an n-space; this is 3-space here. And typically you have clusters, and when a new user comes in you can say, well, this user is sort of like these other users, so you can recommend things that those users might like, for example. We're seeing richer and richer collaborative filtering going on in the world right now. And this includes geocentric collaborative filtering, where we have locations, for example, and times and queries at different locations.
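As a tiny illustration of that collaborative-filtering idea (the users, items, and ratings are made up; this is a sketch of the classic user-user approach, not any production recommender):

```python
import math

# toy preference matrix: user -> {item: rating}
ratings = {
    "ann":  {"movie_a": 5, "movie_b": 4, "movie_c": 1},
    "bob":  {"movie_a": 4, "movie_b": 5, "movie_d": 4},
    "cara": {"movie_c": 5, "movie_d": 2, "movie_e": 4},
}

def cosine(u, v):
    # similarity of two users over the items they both rated
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    return dot / (math.sqrt(sum(x * x for x in u.values())) *
                  math.sqrt(sum(x * x for x in v.values())))

def recommend(user, k=2):
    # score unseen items by similarity-weighted ratings from the other users
    sims = {other: cosine(ratings[user], r) for other, r in ratings.items() if other != user}
    scores = {}
    for other, sim in sims.items():
        for item, rating in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("ann"))   # items liked by users most similar to "ann"
```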
We have IP addresses, and with some nice recent work in MSR we can actually take time of day and day of week and cross it with queries, for example, and we get a better interpretation of what something means. For example, MSG: at some times and some places, and given a trajectory, it means Madison Square Garden; at other times it means something else, and so on.

Let me talk about the ClearFlow case, which is a fun situation, a fun challenge problem. ClearFlow is one of our larger projects on our team. And the idea was: can we predict all velocities on streets in a greater city area given sets of observations, for example, from sensors on the highway system. So that's a sensor, for example, and we have a prediction challenge on all the side streets, arterials, and even the smaller streets. If you could do that, you could do a search over that and route and have [inaudible]. So we have a lot of work going on in taking lots of streams of data -- multiple views on traffic, weather, major events, incident reports with some lightweight NLP -- as well as lots of data from volunteers who have driven around with GPS devices over a period of years in the Seattle region. These are Microsoft employees and their family members. Quite a bit of data there. It's about 700,000 kilometers now and tens of thousands of trips. We also have access to public transit. We've partnered with Seattle -- sorry, King County Metro -- and we have live feeds of data coming in from all the roving para-transit vehicles. The idea is we have a learning challenge here that weaves together the highway system with side streets, and lets us build probabilistic models that can weave together realtime events, time of day, weather, and all sorts of computed relationships about streets and their topologies -- how far is a particular segment, for example, from an on-ramp or off-ramp that's now clogged. And based on that we built a portal called ClearFlow. For the Seattle area, that's about 8 million street segments. Lots of machine learning -- it takes days to get through the machine learning problem. But you can generate a model that can provide routing in Seattle based on the time you're leaving from now -- I'm leaving in 30 minutes, for example. And we actually built an internal prototype that would put MapPoint and ClearFlow side by side -- the default direction system and ClearFlow -- and get a sense for how [inaudible], for example, how we can get onto side streets and so on. The system performed very well. You have to do performance evaluation -- you see all the thousands of points here, how well they're doing in space, and how well we're doing versus a standard posted-speed model. It did so well we actually ended up impressing the product team. We shipped this project in 72 cities a year ago. It's called ClearFlow, on maps.live.com. And you can see here the cities that we're in, including Seattle and the San Francisco area. What's amazing to me is that every few minutes we are inferring the road velocities on 60 million street segments in North America, and hundreds, probably thousands, of cars are routing based on that. Running on 128 cores, every rev of this model -- just the model [inaudible] predictions for North America -- takes several days of computation. And that's where we've parallelized the machine learning algorithm to run on each core in separate processes. We think we could do a lot more there; without that parallelization it would take even longer.
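A rough sketch of that kind of embarrassingly parallel per-region training -- with Python's multiprocessing standing in for the real infrastructure, and the data, the partitioning, and the trivial "model" all toy placeholders rather than anything from ClearFlow:

```python
from multiprocessing import Pool
from statistics import mean

def train_segment_group(job):
    # one worker trains the models for one group of street segments
    group_id, observations = job
    model = {seg: mean(speeds) for seg, speeds in observations.items()}  # stand-in "model"
    return group_id, model

if __name__ == "__main__":
    # fake observations: {segment_id: [observed speeds]}, split into per-core groups
    jobs = [
        (0, {"seg_001": [55, 58, 61], "seg_002": [25, 30, 28]}),
        (1, {"seg_003": [40, 42, 39], "seg_004": [12, 18, 15]}),
    ]
    with Pool(processes=2) as pool:        # the talk mentions 128 cores for the real run
        models = dict(pool.map(train_segment_group, jobs))
    print(models[0]["seg_001"], models[1]["seg_004"])
```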
As you can imagine, without it, it would take us weeks. And the product team has a need to get this revved every few weeks right now, which we're doing. This will get even harder over time as we have fresher and fresher data coming into the system, up to requiring ongoing machine learning as we gather data. Yeah.

>>: Does ClearFlow output affect traffic at all?

>> Eric Horvitz: There's always one smart person in any audience who asks that question. If we're so successful that we're used by a majority of people, that would be true. And then there are other methods we can talk about. We already have been working away on load-balancing approaches. Right now we don't intend to do explicit models of how we want to load balance by giving out, for example, n different randomized directions given what we see in traffic. One could do all sorts of things. Another approach is to actually represent how many views and each logged direction you're giving out in your database, and compensate based on the [inaudible] you're giving to the system based on the current [inaudible] street system. And that's a very interesting research project. It's a great intern project, you know -- how would you actually make ClearFlow work better to load balance if 90 percent of the citizens were using it to route home in the evening, for example.

>>: [inaudible]

>> Eric Horvitz: Exactly. Let me just mention a couple [inaudible]. I want to move from the cloud services now to the client a little bit here. And one comment I want to make here is that I believe privacy is a driving force for innovation in cloud-to-client as well as client-plus-cloud solutions. We don't hear a lot about this, but we're very serious about privacy at Microsoft Corporation in general. It's an interesting area at MSR. It spans several groups, several approaches, several labs. One approach we've been looking at -- we have a couple of different projects here. One is called protected sensing and personalization, PSP. And the idea is: can we still get the value out of services but move almost all machine learning and reasoning into the dominion of users' machines instead of cloud this and cloud that. So the idea is basically a shroud of privacy here. We have data -- for example, GPS data or search data -- all kept within the safety of the metal structure around the boxes that you own. Do machine learning and prediction in a very personal way within this shroud of privacy, and then use those predictive models to do things with realtime data needs, context, and so on, at times even using third-party models that might be developed and [inaudible] models then with volunteers. It's a very promising area, I think, a competitive area. Here's an example: personalized Web search. In our P-Search project -- this was work I did with Jaime Teevan and Susan Dumais -- we go out and do a search on the Web, let's say, and we bring back -- let's say we put the word Lumiere in -- we bring back 400 results behind the scenes. And what we do is learning and reasoning on the client to rerank what's coming back and to re-present the results to the user with a different rank ordering. So when I put Lumiere in running P-Search here -- Lumiere is a, you know, common word; it's used for restaurants, and there's a whole history of the Lumiere brothers and so on.
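The reranking step itself is easy to sketch. Here's a toy client-side pass in the spirit of what's described -- the personal term profile, the result list, and the scoring rule are all invented for illustration, not the actual P-Search method:

```python
# local profile of personal terms (in the real system this would be mined from mail,
# documents, and other content that never leaves the client)
personal_profile = {"bayesian": 3.0, "horvitz": 2.5, "msr": 2.0, "modeling": 1.0}

# results as they came back from the Web search service
web_results = [
    {"rank": 1, "title": "Lumiere restaurant, Vancouver"},
    {"rank": 2, "title": "Lumiere brothers, history of cinema"},
    {"rank": 3, "title": "Lumiere project: Bayesian user modeling at MSR"},
]

def personal_score(result):
    words = result["title"].lower().replace(",", " ").replace(":", " ").split()
    boost = sum(personal_profile.get(w, 0.0) for w in words)
    return boost - 0.1 * result["rank"]      # keep a little of the original Web ranking

reranked = sorted(web_results, key=personal_score, reverse=True)
for r in reranked:
    print(r["title"])   # the personally relevant result floats to the top
```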
But what I get at the top of my Web page is what I mean by Lumiere, based on an analysis of my e-mail, my work, my documents, my relationships -- and that's all done in the privacy of my box. I won't share that with Bing or Google or anybody. There's some sharing going on, but it's an interesting question as to how you can [inaudible] escape that. One example. We've done this work in GPS as well, and so on. So I just want to mention that privacy is going to be a forcing function for doing more computation on clients and not being able to necessarily rely on large datacenters. I think it's a really [inaudible] point.

Another area I want to talk about a little bit -- back to the client as well as the cloud -- is the challenging area of sensing, simulation, and inference in human-computer interaction. We have a lot of work going on on my team in this space. You see what we call the 4-by-6 table here, an interactive Surface collaboration that's doing quite a bit of sensing and a little bit of reasoning. Let me just go to a little video here. Where's my video? So why don't I see my video here. Let's find out why. Okay. I have to pull that up here again. Stand by one second here. The way I was going to save time with the video was by going out and having it set to go here, but it seems to have changed this. So it will just play in the meantime here. All right. So what that sound was just now was this. What you see here is a future direction in mixing tangible objects in the world with simulation. So what's going on here is that these are simulated cars generated by a physics engine. The surface they're on is an actual surface made from construction paper that's being sensed in real time. Gestures can be used to change and add objects to the surface here, and then they interact in this virtual space as real objects. You can balance the car on your hand; it has gravity and a bounce to it, for example. Shadows are simulated. So there's a lot of work going on in simulation plus gesture recognition and sensing, combined with graphics, with CGI, to come up with new kinds of experiences that might mix the virtual and the real in new kinds of ways. I can't say a lot about it, but this is going to be a very big area for certain areas of Microsoft -- in the Microsoft product space. This is another area, basically looking at this 3D camera, showing how we can sense gestures and come up with a 3D representation of hands that actually interacts -- again, with a simulation here. Let me show another kind of area here which I think is exciting as well. The idea now is doing richer and deeper physics simulations. I think the Intel people have done some work in this also, looking at parallelization of algorithms to do efficient work with really high-fidelity glass breaking and water simulations. We're looking at the whole notion of the HCI space being, in part, more and more dependent over time on richer simulations like this. This is all just a surface here that shows how we can do folding, ripping, and tearing by combining the gesture recognition with physics. Again, pretty heavy duty on the computational side. Let me move ahead here a little bit to show you a little bit of tearing, which is kind of fun to look at. There's a whole world of how do you debug these systems, how do you build models that do efficient recognition.
For example, one of the approaches that Andy Wilson has taken in this work -- and most of the work you're seeing here is from his team -- is coming up with a very interesting geometric algorithm that does a nice job of capturing gestures through touch sensing. And I'll just end this part by saying that this is the direction, I think, for user-interface design someday. You don't sit around and worry about the details of the user interface. You have a physics model, and you basically say I want to put down some physical objects and use them directly, and you use a physics simulator to give you your force feedback and your response and so on. So someday you just tell a system I want an input device that has this kind of weight and angular momentum, and you just use your computation to generate the rich behaviors. Again, this is work by Andy Wilson. I'm a brand-new user of Win 7, so I have to find out what happened with my video.

Another interesting realm that we're pushing, which I think is a very tough realm in terms of computation per need, is mixed-initiative collaboration. The idea is someday you'll have a problem, a blue blob in front of you, and you want to solve it. A computer looks at it with you -- it might be a ubiquitous computation scheme -- and says, you know, I can divide that blue blob into subproblems alpha and beta, and, by the way, I can solve beta, I'll tell you [inaudible] solve beta, that's my problem; you, the human being, should solve alpha. And you continue to decompose and iterate. We've done a bunch of work in this space over time. One project was called Lookout, one of the original prototypes in this area, which would actually schedule from free text -- bring up Outlook and populate the schema for Outlook based on the free-text content. And the way that worked, basically, is the system would learn behind the scenes as I used e-mail -- always the machine learning -- and also reason about whether it should do nothing, engage the user, and if so how, or just go ahead and do something. And it has a rich utility model here that users can assess. And if assessed correctly, the system should make users happier as the predictive model predicts the probability of a desired action over time. It's a very rich area when you have mixed-initiative interaction. There are some rich future directions in this space where people and computers work together -- I don't know if people are familiar with the da Vinci machine. Right now it's very popular in medicine. I actually worked on an earlier version of this way back when I was collaborating with somebody at SRI in the early '90s, which became this company. But this is a robotic surgical system that people collaborate with. The robotic side is not very rich yet beyond direct manipulation, but someday the idea of group collaboration and mixed initiative will be very essential. And people at Johns Hopkins right now are actually working in the motor space in mixed initiative. Very exciting area.

So let me now get back to my title, which I called Open-World Intelligence, because I've sort of been building up to this. It's a theme in my group, and we have a sister group, Andrew Ng's group at Stanford, working in this realm. We're very excited about the idea of systems that realize they are incomplete and inadequate in the world. They have a good sense for their own limitations. They assume incompleteness. They represent uncertainty in preferences -- the preferences and intentions of the people they're supporting.
They know that there's a shifting set of goals, sometimes new goals they haven't realized or understood yet. They're in a dynamic world where there's synchronous sensing and acting. There are different actors that are entering and leaving the observable world. And in this world supervised, unsupervised, and active learning are essential. You have to [inaudible] multiple components of services that provide different aspects of sensing, learning, and reasoning. This is an image from before. They deliberate about a spectrum of quality-utility tradeoffs -- for example, considering robustness at the price of optimality. And some directions are looking at the idea of what we call integrative intelligence: taking together components that in the past have been separate. They've each typically been the focus of attention of intractable problems. You heard about speech earlier, I believe. Planning. Robot motion and manipulation. Localization. Vision problems. General reasoning about the world and plans. And we're weaving these components together and getting a sense for how they work when you have dependencies and latency constraints. [inaudible] the NLP component alone is a whole area of work, and they hold conferences -- ACL, for example, and NAACL. So a lot of this work actually brings together communities who may have gone to a AAAI conference, the main AI conference, maybe in 1982, but now are in a whole different area that's just NLP, for example, or vision. Same with robotics.

So let me just mention a little bit about one particular project in this space that's a lot of fun. We call this the Situated Interaction project. Dan Bohus is leading up this effort; I and others collaborate with him. And the idea was to try to build a platform that can do what we call open-world dialogue. And we describe very clearly that open-world dialogue involves people who are coming and going -- not a push-to-talk device on a directory-assistance call, for example. It also involves explicit limitations and reflections about models that are incomplete and so on. So we started by looking at what a receptionist at Microsoft does -- one you all probably had to work with over the last couple of days in the front of Building 99, which we see right here -- what is the task at hand here in terms of satisfying various goals. So we did a lot of recording, with signs up to tell everybody we were recording [inaudible] a little bit of that. We have a bunch of cameras up and we're watching different aspects of the scene here. And we're running a tracking algorithm that can track facing and help us do tagging later. We have an acoustic array microphone to get a sense of who's talking and where the sound is coming from. We're looking at trajectories of entry, at groupings and clusterings of people, to understand when they're together versus separate. We're watching carefully the expressions and timing, down to the latency of the receptionist as she works with people. And so on. One challenge area that came out of this work was building models of multiparty collaborations, which is very new in the dialogue world. The idea -- here's a receptionist, people are coming and going, some people have a goal that's been verified, like I want a shuttle, I want to get into the building, I want to see somebody, can you call somebody up. And the idea is understanding when people are together versus separate, for example, and trying to build models using vision and speech that can maybe do this kind of thing.
So we ended up building a nice platform, trying our best to get a kind of unified experience -- not that you need to have an anthropomorphic experience here, but one that might do all the things you might expect from a receptionist someday -- just to explore the space a little bit. We built this on the Microsoft robotics platform, which let us call many processes and manage dependencies and our need for synchronicity to minimize some latencies. We noticed when we ran this system that even slight latencies between speech and vision made the system seem completely out of whack, unnatural and unusable. So debugging this was quite interesting -- lots of tools that Dan built to do this. And there's also machine learning going on and so on. Let me show you a quick video that Craig Mundie has shown around Microsoft a little bit to give you [inaudible] how this works. That red dot is the gaze of the avatar. It's running on eight cores here. You see the eight cores doing their thing, different modalities.

[video playing]

>> Eric Horvitz: Now, it's unclear whether you had a better experience at the front of the building, but there are many pieces and components here that we're pushing on. There's lots of good technical work going on beneath the covers, and some heuristics and some hacks which are melting away to formality over time. We had a platform and we've been pushing on it as a platform to write different apps to, and we have several different apps now which help us explore what we might call the open-world dialogue space. So if you come to my office right now, outside my door you'll see the PAS system, the Personal Assistant for Scheduling, which is a new version of the receptionist -- which is now down the hall somewhere else getting polished. Now, PAS has access to lots of data about my comings and goings over the years. We're using some components and tools that have this ability. This shows the position right now by my office. We realized we had a lot of work going on that could give brilliance and intelligence to these systems if you just added it in. It's also, again, intractable. This is actually a Web service that we've had up since 2002 called Coordinate. Coordinate continues to -- the server that runs Coordinate continues to look at all the devices and desktop usage and generates models over time to predict how long until people will be at different places. So if I'm gone for an hour and a half, you can give a SQL query to Coordinate and say how long will it be until Eric will be in his office for at least 15 minutes without a meeting, and it will tell you that based on the statistics of comings and goings. How long will I be on the phone until I hang up, based on my statistics of phone use. How long will I be in stop-and-go traffic, given my GPS data, and so on. So it's the first time in the Bayesian machine learning world -- and this is kind of where the computation goes these days -- that we actually build a case library in real time with a SQL query, do machine learning -- that graphical structure search you saw -- and inference, all in a few seconds. A query-specific database. That was the big fanfare when we presented this at UAI a few years ago: that the base store for probabilistic reasoning and machine learning was a database, not a prebuilt machine learning model. And so you can have standing queries of various kinds, like when I'll next read e-mail, for example.
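The flow is easy to sketch in miniature: pull a query-specific case library, fit something on the spot, answer the question. The table, schema, and trivial "model" below are invented stand-ins -- Coordinate's real pipeline does graphical-model structure search over the retrieved cases:

```python
import sqlite3
from statistics import median

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE presence (day_of_week TEXT, hour INTEGER, minutes_until_back REAL)")
conn.executemany("INSERT INTO presence VALUES (?, ?, ?)",
                 [("Mon", 14, 35), ("Mon", 14, 50), ("Mon", 15, 20),
                  ("Tue", 14, 90), ("Mon", 14, 40)])

# 1. Build the case library in real time with a query scoped to the current context.
cases = [row[0] for row in conn.execute(
    "SELECT minutes_until_back FROM presence WHERE day_of_week = ? AND hour = ?",
    ("Mon", 14))]

# 2. "Learn" on just those cases (a trivial empirical model here, rather than the
#    structure search described in the talk) and answer the standing query.
prediction = median(cases)
print(f"Predicted minutes until back in the office for >= 15 minutes: {prediction}")
```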
It's kind of interesting: if I give Paul Koch access to my account, when he gets into Outlook he can see, oh, Eric was last in Outlook six minutes ago and he'll read e-mail within 19 minutes, for example. Anyway, we wanted to give PAS this ability, and also the ability to predict the cost of interruption at any moment, and whether I'll attend a meeting or not. Those are other projects. But here's the experience with PAS, and I'll summarize [inaudible].

[video playing]

>> Eric Horvitz: Here's one more scenario here.

[video playing]

>> Eric Horvitz: So I'll end there. But I wanted to suggest that there are some really hard problems, and parallelizing them would be quite valuable. We have some interesting approaches to doing some of this, like our machine learning work in ClearFlow, which we've already parallelized to 128 cores, and we do our cycle for the product team. That's a system the product essentially depends on for our ClearFlow offering. Cars are being routed right now based on that system. If I had more time today I'd talk a little bit about this whole area that we're very excited about and we're looking for collaborators on: learning and reasoning about computation -- how do we actually learn to automatically distribute workload, do speculative execution. This is a very interesting area for us given work we've done in the past that led to the SuperFetch component in Windows right now, which performs extremely well based on our dataset and so on. And we want to do more of this and we want to move to multicore. In some ways we've promised Craig Mundie we'll be moving into multicore with our machine learning, and I think some of you are already thinking along these lines and doing work in this space. We'd love to catch up. So I'll stop there.

>> John Hart: There we go. Sometimes it's ironic to be talking about making applications go so much faster when just bringing the computer back up takes so long. I want to talk about dynamic virtual environments. I'm John Hart. I'm one of the PIs for the UPCRC at Illinois. And this is work I've done in collaboration with our graphics group in UPCRC, and also our architecture group and our patterns & languages group -- particularly Sarita Adve, Vikram Adve and Ralph Johnson, our faculty; and students Byn Choi, Rakesh, Hyojin and Rob. And these three are sitting right over there, and at some points I'll be standing up here taking credit for a lot of the things they did. In looking at great consumer applications that will get a lot of payoff from parallelism and that should drive multicore in the future, an obvious choice is to look at video games. And you get a spectrum of video games based on two axes: you can either have a lot of photorealism or you can have a lot of flexibility. And we saw this yesterday -- there were a couple of talks on this yesterday. You have these photorealistic first-person shooters, and as they become more flexible they become less photorealistic, very often because you're building new environments, new objects, doing unexpected things in these more flexible video games and environments. And you can't take advantage of the massive amount of precomputation that's needed in order to get these cinematic effects, given current computational power and the need to deliver at least 30 frames a second.
So what we want to do is make games as flexible as Second Life and other social online network games, but we want them to be more photorealistic, more realistic, more lush -- and in doing so, if we can do this, we can avoid the need for precomputation, and that will make video game titles faster to produce and cost less as well. And so just an example -- this one I'm particularly proud of. This is a collaboration I was fortunate to have with Microsoft Research back in the summer of 2002, with a student who was here, and it was on getting precomputed radiance transfer to work on the GPU. And this is what video games -- in this case, the Xbox 360 -- had to do to get effects like you see in Halo, or like the light shining through this bat wing. These effects come from a precomputation that you do at every vertex on the shape, and the precomputation says that whatever light you have in your environment -- whatever light function you have on a large sphere around that vertex -- that light is going to change by the time it actually reaches that vertex, because things will get in the way, some of the light may enter a surface and scatter around and reach it, and other light may interreflect in other places. So we can specify the incoming light as a 25-element vector that specifies the function over that sphere, and we can specify the light arriving at the vertex as a 25-element vector as well. And so all the stuff that happens to light, from the large scale to the small scale, that gives us all of these effects is basically this 25-element by 25-element transformation matrix -- 625 numbers that need to be stored at every vertex. And that's fine as long as the object doesn't move. If the object moves, you have to parameterize things again, and these things take 80 hours to compute just for a still object. And when you have dynamic scenes, they can take even longer. So this is the kind of thing that needs to go into making modern-day video games look photorealistic like what you're seeing at the cinema; whereas when you're making a movie, you can take arbitrarily long to generate a frame of video or a frame of film, but you've got to generate that frame uniquely, 30 frames a second, when you see it in a video game. The other thing I'm really proud of is that my only real contribution to this work was that I drew this picture, which ended up in the MSDN documentation for the technique. Sometimes it's nice to see that stuff happen and get used. And so if we look at the modern day, we have techniques like ray tracing, and we have enough cores, enough parallelism, that we can enable direct ray tracing, realtime ray tracing of scenes. And so here's an NVIDIA demonstration -- Intel's going to have these too -- of a car riding down the street and everything's ray traced. And that's good, except that what they're demonstrating is direct ray tracing, and direct ray tracing is rather easy. This is just tracing a bunch of rays coming from the eye; they all share the same location -- the same starting point -- they're all nearly parallel, all emanating outward, and they're all in order, so that all the geometry access and all the memory access is coherent. And the same thing's happening with the light: you've got rays coming out of the sun and you get this nice hard shadow underneath the car. This is a photograph of another one -- it's a micro Porsche, I guess. And if you look at the reflections here, you're seeing perfect mirror reflections, although a little bit bumpy from the surface geometry.
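For a sense of the arithmetic behind that precomputed radiance transfer description, here is a sketch of the per-vertex runtime step. All values are random placeholders; the only parts taken from the talk are the shapes -- a 25-coefficient spherical-harmonic lighting vector and a 25-by-25 transfer matrix per vertex, the "625 numbers":

```python
import numpy as np

n_coeffs, n_vertices = 25, 10_000

light_sh = np.random.randn(n_coeffs)                        # distant lighting, projected to SH
transfer = np.random.randn(n_vertices, n_coeffs, n_coeffs)  # precomputed offline (the 80-hour part)
brdf_sh  = np.random.randn(n_coeffs)                        # surface response, also in SH (illustrative)

# Per frame: transform the light through each vertex's transfer matrix (which encodes
# shadowing and interreflection), then project onto the surface response.
transferred = np.einsum("vij,j->vi", transfer, light_sh)    # (n_vertices, 25)
radiance = transferred @ brdf_sh                            # one outgoing value per vertex
print(radiance.shape)
```

The expensive part is computing `transfer` offline; the runtime work is just one small matrix-vector product per vertex, which is why the trick only works while the object and its parameterization stay fixed.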
That car is basically a mirror that's been warped into the shape of a car. You know, actual car reflections are a bit glossier: when you have rays of light hitting them, or when you have eye rays -- lines of sight -- hitting them, they spread out. And it's that spread that makes things much more difficult. And in general, the thing we really need ray tracing for -- the thing that we use precomputed radiance transfer for in video games -- is global illumination. And those effects are really hard. You know, after the first bounce you get a mess of rays starting from arbitrary points in arbitrary directions, and you no longer have coherent access to geometry, and you lose cache locality. And so we did some experiments with this. We had this rendering of a car, for example, and we're ray tracing it. And this is a GPU ray tracer. And when you ray trace on the GPU you have to do things in phases: you send all your rays out from the eye, they hit an object, and then once all the rays have arrived at an interaction, you go through and do all your shading. And, you know, the nice thing about rasterization -- and the reason that shaders have all been done in rasterization hardware -- was that when you rasterize a triangle, all the pixels on the triangle are running the same shader. They may look different depending on the outcome of that shader, but they're all running the same program. And so SIMD performance was really good. And we see that same SIMD performance is pretty good on these eye rays. But after that first interaction, the rays start bouncing around -- and this is a path tracer generating this -- and we start to lose that coherence. All 32 elements of an NVIDIA SIMD processor were running one shader in the blue sections of this car image. After that first bounce, neighboring pixels -- rays that started close to each other -- have diverged and are executing completely different shaders. And in these red regions we have, you know, up to 16 different shaders being run. And on a SIMD processor you get a lot of divergence: even though you've got 32 different processors running that, they're going to take 16 times as long because of the serialization. So that incoherence leads to SIMD divergence. And one of the things that's painfully obvious is that we're always going to have these SIMD units in our multicore architectures, because they're so cheap and give us such a great performance boost when we use them correctly. So last summer we started working on this, and we did a little bit more work throughout the year, on the idea of shader sorting: basically tracing rays through a scene and accumulating all the shader requests, and then trying to sort the shader requests so that if you have a SIMD vector 16 or 32 elements wide, you're passing all the same shader program requests to that one SIMD vector, to avoid this divergence and the serialization that you get. And so if you don't do anything -- if you just use a big switch, if I need shader 1 then run shader 1, or else if I need shader 2 run shader 2 -- then you'll get divergence and you'll get serialization. And for simple scenes with simple shaders -- you know, this is a glass Stanford bunny in a red-green Cornell box -- that turns out to be the best way to go. Because the shaders are so simple, we could actually shade faster if we sorted the shaders here, but the overhead of the sorting makes it too expensive.
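The sorting itself is just a bucketing of deferred shading requests by shader id so that each SIMD-wide packet runs a single program. Here's a toy CPU-side version of that bookkeeping -- the hit records and SIMD width are invented for illustration; the real system does this on the GPU:

```python
from itertools import groupby

SIMD_WIDTH = 32

# (pixel_id, shader_id) pairs accumulated after a bounce -- arrives in incoherent order
hits = [(0, 3), (1, 7), (2, 3), (3, 1), (4, 7), (5, 3), (6, 1), (7, 3)]

# Sort by shader id, then cut into SIMD-sized packets that each run one shader.
hits_sorted = sorted(hits, key=lambda h: h[1])
packets = []
for shader_id, group in groupby(hits_sorted, key=lambda h: h[1]):
    group = list(group)
    for i in range(0, len(group), SIMD_WIDTH):
        packets.append((shader_id, [pixel for pixel, _ in group[i:i + SIMD_WIDTH]]))

for shader_id, pixels in packets:
    # each packet is coherent: one shader program over all its lanes, no divergence
    print(f"run shader {shader_id} over pixels {pixels}")
```

The tradeoff the talk describes is exactly the cost of that sort versus the cost of the divergence it removes: cheap shaders make the sort a net loss, expensive (especially procedural) shaders make it a clear win.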
But for almost all the other scenes that we tried -- this car with 16 shaders; this Cornell box with the scanned Lucy statue, which has a procedural texture for the stone and a procedural texture for the floor, so it's just four or five shaders, but the procedural ones are very expensive; the same for the Siebel Center staircase with a procedural copper texture; and this other lab scene that has some 20 or 30 shaders, none of them procedural -- in all of these cases we found that it was much more efficient to sort the shaders and send packets of the same shader to the SIMD vector units. Even with the expense of doing the sorting, it came out -- in this case a little bit faster than just paying the serialization price. And in many cases, especially with procedural elements where you have to run a shader program for a long time, the benefit becomes much greater. In fact, we can render these procedural guys as fast as the nonprocedural guys using this shader sorting. So that's part of it: trying to efficiently shade these objects as they're being ray traced. That's one thing we've looked at. Another is the fact that if we're building these dynamic, networked, online virtual environments, we want to build things, we want to be dynamic. The scene can be changing, and people can construct their own geometry and do unexpected things. And when you're ray tracing, part of the efficiency of ray tracing is that you don't want to intersect every ray with every triangle -- that just ends up being too slow regardless of how fast your computer is. You want to use data structures that will help you narrow down a likely set of rays intersecting a likely subset of the geometry. And so we need these spatial data structures in order to accelerate ray-triangle intersections; to take collections of photons scattered from the light source and gather them into rays that make it to the eye; to classify rays so that we can get likely bundles of rays -- and that ends up being a five-dimensional structure where you've got the three-dimensional anchor of the ray and the two-dimensional latitude and longitude of the ray direction. If you have a bunch of scattered points from a laser scanner and you want to represent them as a surface -- if you want to scan yourself in as your own avatar using cameras or some of the vision techniques we saw yesterday, and you want to reconstruct that as a surface instead of a bunch of points -- you want these spatial data structures so you can find the closest point to a given sample point. And also collision detection. And these things are also commonly used in vision and machine learning. So we wanted to come up with efficient parallel techniques for building these things, and so far that's called ParKD. And just to differentiate: you know, we have 20, 30, 40 years of history of doing parallel algorithms, largely in scientific domains, and that's good, but scientific doesn't pay as well as consumer does -- that's a lesson we learned in graphics about ten years ago. And in scientific domains we have n-body simulations where every body is influencing every other body; we have molecular dynamics where all the bodies are about the same size and you've got water uniformly distributed. And so those set up a certain class of spatial data structures. But in graphics we tend to have objects or polygons or geometry distributed on these submanifolds, these two-dimensional surfaces.
And so we found that the KD-tree tends to be one of the better choices in that case. But there are all sorts of applications, and it turns out KD-trees can be manipulated to support all of these things. And Rob Bocchino is one of the people on our project working on generalizing the results we get for KD-trees to work with a bunch of different spatial data structures. And so we have the uniform grid or the hash table -- and I noticed Hugues Hoppe is here from Microsoft Research, who did a great job getting this spatial data structure to work efficiently in parallel a couple of years ago. And there are quadtrees. And one of the things that's interesting to note is you can have region trees or you can have point trees, and likewise you can have KD-trees and so on. KD-trees can be organized as point trees -- this is how they're used in photon maps -- or they can be organized as region trees, and this is how they get organized for geometry. The kind of structure we're going to focus on is this region tree for organizing mesh triangles -- objects that people have constructed in online environments. And if you do a bad job of constructing these trees, you get an imbalanced KD-tree and you've really lost the advantage of your spatial data structure. And so the challenge then is how do you build one of these trees quickly, efficiently, in parallel, when all the processors are trying to manipulate the same central data structure simultaneously. We looked at some of the previous work. Intel has this nice algorithm that runs on the CPU, and Microsoft Research Asia has this nice algorithm that runs on the GPU for constructing KD-trees. And the CPU one is 4 cores -- I guess it'd be 16 if you count the SOC -- and it's generating this in about six seconds. And 192 GPUs are generating this in about six seconds. So we're finding that a lot of GPU processors run about as fast as a few CPU processors. And there are a lot of other subtle differences between those algorithms as well. And so, the idea of how spatial data structures help us render fast: you have these partitioning planes, and you separate the geometry into triangles that are on one side of the plane or the other side of the plane, and then they create a hierarchy. And so if we have a ray that we want to intersect with this geometry, we see that the ray starts on one side of the plane and ends on the other side of the plane. So we intersect the geometry on the first side of the plane first, and if there's no intersection, then we intersect the geometry on the other side of the plane. And so building these trees is pretty easy: you just need to find the best splitting plane. Given your set of triangles, you split the triangles to be on one side or the other side of the plane depending on where they fall, and then you recursively build the left side -- you take the triangles on the left side and further subdivide them -- and subdivide the right side's triangles and add splitting planes for them. The real tricky part for doing this in parallel, it turns out, is finding the best splitting planes for a given set of triangles. So, choosing a splitting plane. It turns out that computer graphics, looking at this over the last couple of decades, has found the surface area heuristic to be really good for figuring out where to put the splitting plane to get an even split of triangles that's efficient for ray intersection.
And basically the surface area heuristic says that wherever you put the splitting plane, you want to optimize -- minimize or maximize, I forget which one; I think you want to minimize -- the number of triangles on the left side times the surface area of the left side's bounding box, plus the number of triangles on the right side times the area of the right side's bounding box. I think you want to minimize. And if you add those together, that gives you, for our cases, a simplified version of this surface area heuristic. And so we want to evaluate that at these events as we sweep from left to right -- you know, at the points where the number of triangles changes. And so we need to count the number of triangles to the left and the number to the right, and then measure the surface area. And you can do this: there's a nice streaming algorithm that Ingo Wald came up with in 2006 with some collaborators that creates three sorted lists and then sweeps along each of these three sorted lists, keeping a count of how many triangles you have to the left and to the right. And this thing can be implemented pretty easily using some prefix scan operators. And so we did this, and it worked pretty well. And that actually revealed some patterns for processing spatial data structures, for processing hierarchies. And so we've decomposed these into this geometry-parallel pattern and this node-parallel pattern. If we're creating a hierarchy, the classic way of creating a hierarchy on multicore parallel computers is to keep subdividing -- usually with just a serial algorithm -- until you've got one subtree per processor. And once you've got one subtree per processor, each processor can work on its own subtree independently of the others. And so you've got this node-parallel process where the number of nodes equals the number of processors at that given level. And that's pretty easy to parallelize; it just becomes task parallel at that point. The tricky part is up here, and this is where we find geometry parallelism, where you sweep through all of the geometry, all of your data, and then you find the midpoint or the split point where you want to split -- whether it's your spatial median, your object-count median, or in our case the surface area heuristic -- and then you create that split. And at each level, if we're doing geometry parallel, we have all our processors working -- in our case using these scan primitives -- to find the separating point by streaming through all of the geometry, all of your data, and then finding this split point. And so all the processors are processing all of the data at each level in one stream, and they're just inserting these split points wherever they belong. And so the interesting thing is Moore's law, which, you know, in the past has said that a processor gets twice as fast every year and a half or two years. Processors aren't getting any faster, but we're just getting more of them. So we're doubling the number of processors every year and a half or every two years -- every certain amount of time -- which means that this origin line is going to be descending every couple of years. And that means that most of the previous algorithms we've looked at focus on this section and do something very approximate up here. And this is the important part.
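To pin down the sweep just described, here is a serial, one-dimensional sketch of the simplified surface-area-heuristic evaluation: length stands in for surface area, the triangle extents are invented, and a real builder sweeps all three axes and parallelizes the counting with prefix scans rather than recomputing it per event as done here.

```python
def best_sah_split(extents, bound_min, bound_max):
    # extents: list of (lo, hi) triangle extents along this axis
    events = sorted({lo for lo, _ in extents} | {hi for _, hi in extents})
    best = None
    for x in events:
        n_left = sum(1 for lo, _ in extents if lo <= x)     # triangles touching the left side
        n_right = sum(1 for _, hi in extents if hi >= x)    # triangles touching the right side
        # simplified SAH cost: N_left * "area"_left + N_right * "area"_right, minimized
        cost = n_left * (x - bound_min) + n_right * (bound_max - x)
        if best is None or cost < best[0]:
            best = (cost, x)
    return best

triangles = [(0.0, 1.0), (0.5, 2.0), (4.0, 5.0), (4.5, 6.0)]
print(best_sah_split(triangles, 0.0, 6.0))   # split lands at the edge of the gap between the clusters
```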
This is what's going to play a big role -- this is going to dominate the problem -- when we have hundreds of processors. And thus far we've spent a lot of time down here and very little time in the green section. And so if we look at the past contributions, the recent work, and what we've done: in the past couple of years there have been several multicore KD-tree construction algorithms. And in the top half of the tree they've done approximations -- they've avoided the surface area heuristic and just looked at the simple median, the triangle-count median, the spatial median, or some approximation of the surface area heuristic -- and then focused on doing either an exact surface area heuristic or some approximate binned surface area heuristic in the bottom half of the tree. And those have been very effective, but they're not going to scale well when the number of processors grows very large and we're spending more time in the top half of the tree and less time in the bottom half of the tree. And so that's what we've been focusing on: working on that top half of the tree. And so we have techniques that work in place, and then we have a simpler technique that just uses simple nested parallelism. And we wanted to focus on an in-place algorithm partially because as we increase the number of cores, these cores are going to become more and more remote; they're going to require more and more cache locality. And so we want to avoid moving memory around as much as possible as we're processing these things. And so in this in-place construction algorithm, like Ingo Wald's algorithm, we presort in X, Y, and Z -- it's basically a slight variation of Ingo Wald's original formulation. And then as we're cycling through all of the events -- all of the triangle left-right extents -- and counting the number of triangles to the left and the number to the right, when we go to put the record of that event either on the left side or the right side of the tree, instead of moving the memory to the left-side list or the right-side list, we just leave everything in place and put a pointer to the node that that triangle is currently in. And that works great except that when you have a KD-tree, some of these splitting planes are going to go through triangles, and so a portion of the triangle will be on the left side of a plane and a portion of the triangle will be on the right side of the plane. And then we're stuck. We either need to have multiple tree pointers in that record -- which means we need to either have dynamic memory allocation or somehow expand the amount of memory at each point by a fixed amount -- or we store the record of that node in some higher tree node. And we still haven't decided which is better; we're still doing a bunch of experiments to try to figure that out. The other thing we're investigating is inspired a bit by a talk that Tim Sweeney gave at Illinois, as part of Sanjay Patel's Need for Speed seminar series, on the impact of specialized hardware programming on programming time. Tim Sweeney helped write -- he's in charge of writing video game engines and developing video games. And he pays very close attention to how efficient his programmers are at doing this. And he mentioned that when his programmers are doing multithreaded programming -- you know, straight multicore programming -- it takes them twice as long as when they're writing just a serial program.
When they're working on the Cell it takes them five times as long; when they're doing GPGPU programming, it's ten times as long or worse. And he also notes that anything over twice as long is going to put them out of business. So his concern is that we not focus entirely on efficiency. He needs things to go faster, but they don't need to go as fast as possible, especially if it takes people ten times as long in order to eke out that last little bit of performance. And so based on that concern, we also built just a straightforward, simplified parallel KD-tree build using nested parallelism in TBB. That just does a straight-ahead, full-quality SAH computation, and we wanted to do some experiments to see, based on just that simple, straightforward implementation, how well it performs. Let me go back one. And on that example, for example, we get a speedup of about six times on 16 cores. Ideally you'd want a 16-times speedup. But given the constraints of programmer productivity, if you can get that done quickly and ahead of schedule, a 6x speedup still justifies the expense of the additional processors.

>>: Can I ask you how long it took to code these?

>> John Hart: Byn, do you guys have numbers for how long it took you guys to code that up?

>> Byn Choi: The simple one, I'd say a week or less than that maybe.

>>: Compared to the --

>> Byn Choi: Well, actually -- okay. The simplest one, once we had the sequential version, parallelizing it was about a day or two using TBB. But [inaudible].

>>: So if you went to 32 cores would you get a 12x speedup? I mean, as long as you're on that curve, it's probably okay.

>> John Hart: I think so, yeah. Yeah.

>> Byn Choi: You don't even need to be on a linear curve. Because if you look at the way processors use transistors, they didn't get a linear increase in performance versus transistor [inaudible]. So in the economics of the multicore era, you only need to get maybe square-root performance. So it's kind of a stronger justification for what you're doing here.

>> John Hart: Yeah. There are two sides of this. There's the high-performance side where you want to eke out your 16x speedup factor. Then there's the scalability side that says, you know, I'm going to buy a computer that has twice as many processors; I want my stuff to at least run faster.

>> Byn Choi: [inaudible] addressing is trying to keep the industry on the track that it used to be on.

>> John Hart: Let me just conclude. We're starting to gear up to make what I'm tentatively calling McRENDER, which is a Many-core Realtime Extensible Network Dynamic Environments Renderer -- for online social environments, kind of like the OpenSim we saw yesterday -- that leverages ray tracing specifically for these kinds of back-end global illumination things, uses Larrabee's software rasterizer as a programmable rasterizer to get a lot of the same effects we get with ray tracing on the front end, and depends critically on this parallel KD-tree for a lot of the features, in order to get things to run fast enough to be able to deploy in real time on upcoming multicore architectures. Okay. That's it. Thanks. Oh. Question.

>>: So I'm a big fan of BVHs, and it seems like they have a lot of nice properties for these kinds of problems because they're more forgiving than KD-trees: you can deform geometries, and in constructing a BVH you don't have to insert these disjoint regions, so the parallelism should be easier than constructing a KD-tree.
>> John Hart: Yeah, yeah, yeah. And there's not really much difference. The whole point of looking at this was that we wanted to look at hierarchies, spatial data structures in general, and so we needed to settle on one, like a KD-tree. But with the bounding volume hierarchy -- I mean, with a KD-tree you still have a hierarchy as well. The big difference with bounding volume hierarchies is they tend to get built bottom up instead of top down. You tend to surround all of your geometry with bounding volumes, and you tend to merge the bounding volumes from the bottom up, as opposed to a KD-tree where you take a look at all your geometry and split it top down. That's on the horizon too.

>>: [inaudible]

>> John Hart: The what?

>>: I've built them both ways.

>>: Any other questions? Let's thank John.

[applause]

>> Ras Bodik: Okay. I'm Ras Bodik. I'm one of the co-PIs working on the parallel browser project, and I want to start by introducing Leo, a grad student in the white shirt in the back. He's excellent at answering the really hard questions. I'm not saying that to discourage them; I'm saying that so that when I redirect them it's not because I don't have the answer, it's because he has a better one. So what this project wants to do -- one of the goals, and perhaps the key goal -- is to run a Web browser on maybe a one-watt or smaller processor with 800-fold parallelism. And I'll try to start the talk by saying why this may be interesting. So Bell's Law is the corollary of Moore's Law. And it says that as long as the transistors keep coming -- meaning shrinking -- there will be a new computer class appearing with a certain regularity. And the new computer class will reach new people with new applications and redefine the industry leaders. And we are now at the stage where we are making a transition from the laptop computer class to the handset computer class. And something is very different now than it used to be, because in all these past computer classes, the software that you could run on the new class was essentially the same software you could already run on the previous class. In some cases it was the same, as it is on laptops; in other cases it was a variation -- even if you ran a different [inaudible] you still could run the previous one. But it's different now, because the single-thread performance of these processors is not improving, and the efficiency of the handset computers is essentially much smaller because they have much less energy available to them and they can dissipate much less heat. So the power wall means that we need to write different software for these handhelds, because the software of laptops doesn't run -- and it's never going to run efficiently on these handsets -- unless you are willing to wait very long or optimize the software considerably. So single-thread performance is not improving; in fact, it may go a little bit down to improve energy efficiency. But energy efficiency is still getting better, about 25 percent per [inaudible] generation, so that's 25 percent per two years. And that means that maybe every four years you will double the number of your cores. So you get more cycles, but you are not going to get better single-thread performance. So in order to get better performance, these handsets will have to be parallel as well. To convince you that this is not just a supplement to existing classes but indeed a new computer class, it probably makes sense to look at the output alternatives. And starting here you have flexible displays, which are one possibility.
Here we have projectors in phones, which are already on the market. They are not as small as they would like to be, but eventually they'll be so small that you will be able to hide them in glasses and enable applications like this one, where you are looking at the engine, you are repairing it, and you see a superimposition of the schematics of the engine which helps you navigate your work, and the same you can do with navigation when you are working in a foreign city, and so on and so on. So why parallelize the browser; what does the browser have to do with this future computer class? After all, the browser is just one application. And it's not really an application. It's an application platform. It's a way of writing and deploying applications. And programmers like it because the applications are not installed on the browser but are downloaded, such as Gmail, and the JavaScript language and the HTML standards provide a portable environment. So no matter what platform you have, as long as you have a browser, you can run it, with minor differences. It's also a productive environment, because the scripting offers a high-level, dynamically typed environment in which you can easily embed a new DSL, and the layout engine that is in the browser is [inaudible] and makes it easy to build a new user interface. But on handhelds the browsers don't perform really well. And in fact application development on the iPhone, and on phones in general, is different than on laptops. On laptops many new applications are written as Web browser applications, probably almost all of them. But on the phone you write them, on the iPhone as [inaudible] or on Android, in a considerably lower-level programming model. And you do it because that's a much more efficient way of writing applications. And to tell you why people don't use the browser: if you run Slashdot on this laptop, a relatively lightweight page, it may do the download, layout and rendering in about three seconds. On the iPhone it would be seven times more. And hardware people tell us that once things settle down, the fact that this processor has about 17 watts and the handheld one will have half a watt will probably translate to about a 12x slowdown in single-thread performance for the phone. So even if you optimize existing browsers, and there is a lot of room for improvement, the programmers of Web applications will still push the boundaries such that on the laptop these Web pages are fast enough, probably two, three seconds, maybe five; multiply that by ten, and you see why these same Web applications are not going to run fast enough on the phone. So parallelism is one way of improving the performance of the browser. It's not the only way. There are still sequential optimizations, and there are other ways such as running part of the computation on the cloud, but the cloud has limited power because it is quite far away in terms of latency and it's not always connected. So parallelism is one way. Parallelism solves at least one of the two problems that you have, and that's sort of the responsiveness, the latency. You can get the 21 seconds down to maybe 2 seconds if you can parallelize it tenfold. It doesn't solve the energy efficiency really well, because you still do the same amount of work, and so the battery lifetime is not going to be improved as it would be if you did sequential optimizations. But this is why parallelism is still one of the weapons that you may want to use.
And a parallel browser, however, may need a slightly different architecture than existing browsers have, and that's in particular because of JavaScript, which is a relatively unstructured language. There are many go-tos, and I'll try to illustrate them later in the talk, and these may need to be resolved. So I'll try to talk about the anatomy of the browser, but if there are questions about the animation and whether this all makes sense, this may be a good time. Okay. So what do you want to parallelize? So what is a browser? You have a cloud of Web services which the browser accesses; it loads a page and it checks whether it's a page or an image. If it is a page, you decompress it, do lexical analysis, syntactic analysis, build a DOM, which is an abstract syntax tree of the document, and that's the front end. Essentially a compiler. Then you lay out the page and then you render it on a graphics card, and that's your layout part. And then there is the scripting, which provides interactivity or in fact makes the browser a general application platform. And there is the image path also: if what you download is an image, you go sort of on the side, decode it, and then it goes into the layout together with its dimensions, and eventually it is rendered on the page. So here is the scripting, which listens to events such as the mouse and the keyboard but also listens to changes to the DOM. For example, the page may keep loading forever -- the browser -- sorry, the server may keep sending the rest of the page later, incrementally, and the script will react to new nodes being attached to it. The scripts of course may modify the DOM, and in fact they do, because this is how scripts print something on the screen, by adding new boxes and new text and new images. And of course the layout also modifies the DOM, because once you lay it out, you compute its real coordinates on the screen, these are written there, and the script can read them again, so there are a lot of data dependencies over there. And that's not all, because the scripts can request more data from the servers. So that's the third leg, the scripting. So where does the performance go? There isn't a single bottleneck such that if you optimize it the problem goes away. So the lexing and parsing may take 10 percent, sometimes more. It really varies with the application quite a bit. Building of the DOM is another 5 percent. The layout in general is the most expensive component. This is sort of the [inaudible] and DBIPS of the Web browser. And this may be 50 percent. And then there is the actual working with the graphics -- and I think fonts also fall here, right, Leo? Turning fonts into rasterized images also falls there. So all of it needs to be parallelized, and the problem is that many of these algorithms are not algorithms for which we already have parallel versions. We need to invent them from scratch, because so far all of these have been typically implemented in a sequential fashion. For the layout component here, Donald Knuth would even tell you that multicores are not useful for [inaudible] at all. And in fact he wrote his algorithm in a very sequential fashion, so squeezing parallelism out of it is nontrivial, as we learned. So what have we done? We have developed some work-efficient parallel algorithms for some of these problems. Work efficient means that the parallel algorithm doesn't do more work than the sequential one. Sometimes parallelization is easier if you can do more work and then throw some of it out if it turns out that you don't need it.
But this would hurt energy efficiency. So we are going after work-efficient ones. And we've done it for layout, which has two parts. The first one essentially does a parallel map over a tree. I'll show you the algorithm. And the other one turns an in-order traversal over a tree into a parallel one. And then I'll show you one algorithm from the compiler phase, from the lexing: how to take an inherently sequential algorithm and parallelize it. Then, on the scripting side, we spent a lot of time trying to understand the domain and what programming model would be suitable and why JavaScript may not be the right thing to parallelize. And so we looked at programmer productivity also, what abstraction the programmers may want to have, and we go from the callbacks that are at the heart of JavaScript, the heart of AJAX, to actors. And I'll show you small examples. And then, what do you need to add to the language to get better performance? So let's look at lexing. Now, lexing is, you know, a problem that is relatively easy to study and you know it from your compiler classes. You have a string of characters here and you have a description of your lexemes and tokens. So you have a tag, this is the HTML tag, so two angle brackets with some stuff inside. Here is the content and here is the closing tag. And the goal is to go through the string and label it as to whether it belongs to this token or this token or that token, and the way it is done is via a regular expression and, from that, a state machine. And the state actually tells you whether this belongs to that token or the other one. So the goal is really to go through this string and label these guys with states of the state machine. The problem of course is that if you break it into pieces of parallel work, what you see is that you cannot really start scanning this piece using the state machine, because you don't know what the previous state is, and you need it in order to find the next state. So you can only scan this once you know this state here. So this is the inherently sequential dependence. So here is an observation that makes parallelization possible. And it's sort of specific to lexing, so it doesn't apply to arbitrary parallelization of finite state machines. But the observation says that pretty much no matter which start state you start from in that state machine, you will eventually converge to the correct state after a few characters. So whether we start from orange or red or yellow here, you reach a convergence. And this is because in the regular expressions, or the state [inaudible] automata that arise in lexing, you have one or more sort of start states into which the regular expression comes back when it is done with a particular token. And usually these segments are not too long. So this leads to a parallel algorithm that works as follows. You first partition the string among parallel processors in such a way that you get these K characters of overlap. And the K characters are what we are going to use to get from a state that is suitably but more or less arbitrarily chosen into a state that is correct most of the time. So we think that's the way. And now we scan in parallel. And in some cases we are correct; in some cases we are not. So we are correct here and here and here, meaning that the state that we reach after the K characters turns out to be the same as the state this automaton reached, even though it doesn't look like it in this color, but let's assume that it is. And so are these and these.
But we did not guess correctly here, and so this parallel work here needs to be redone, because our way of guessing the start state for that segment didn't work out, so we redo this work and we have a correct result. What I should point out is that if there was a misspeculation somewhere here, you wouldn't have to redo the entire rest of the work, just the segment, as long as by the end of this segment you are in the same state where you would be in the sequential algorithm. So this is a way of obtaining parallelism through speculation in an inherently sequential algorithm by looking at a suitable domain property. So here is how it works on a Cell, which has six cores. And as you go up, this is adding more cores of the Cell. Of course the interesting file sizes are here for current Web pages, and we could improve performance by tuning it a little bit more. It wasn't tuned very much for the small sizes. But already on this page size, what you see is that on five cores we are almost five times faster than Flex, which is a highly optimized lexical analyzer in C. So this is pretty promising. Of course this was for lexing, and parsing is really what we need to parallelize; we are probably halfway done with it, with quite promising results, I'd say. Now let's look at the layout here. So here is where we spend most of the time. And this is where Leo has done a lot of work, and he can answer the details of it. But I'll give you a high-level overview. Sorry. I should go here. So layout has two phases. The first phase is you take your abstract syntax tree of the page, which is here, and then you have rules like this one which tell you -- these are CSS rules which tell you how to lay out the document. And you need to essentially take these rules, you may have thousands of them, and match them onto the tree. These rules will tell you how particular nodes will be laid out. So here we have an image node in the tree. And this will be associated with this rule and this rule. So we have two rules that match this node. And then there is a prioritization phase which will decide whether this rule or that rule applies, but we'll skip that phase here. How is this matching done? Well, you take the path from the node to the root, and that describes a certain string based on the labels on those nodes, and then you have your rule here. This is what is called a selector. And you see whether the end of this rule here matches the end of this path, and then the rest needs to be a substring. This here, the selector, needs to be a substring of that path. So this is the work that you do here. So as you can see, it's highly parallel, because you have many nodes which are independent from each other. You have many rules, and again they're independent from each other. So this is nice parallelism, except there are two things that you need to solve. The first one is load balancing, and that is solved for now by randomly assigning work to processors. Here we show it for three different processors which independently compute the work. So this so far seems to be working okay. And the other problem is memory locality. These hash tables that store these rules are quite large. They don't fit in the cache. So what you do is you do tiling. You first do the matching with a subset of the rules, you perform the match, then you throw this away from memory, and you work on the remaining part of the rules and do the matching.
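To make the selector matching and the tiling concrete, here is a minimal sketch, written in TypeScript purely for illustration; it is not the project's code, and the types, the simplified descendant matching, and the tile size are assumptions. In a parallel version, each chunk of nodes and each tile of rules would be an independent unit of work handed to a different processor.

    // Hypothetical types, for illustration only; real engines match on tags,
    // classes, ids and more, and also handle rule priorities.
    type DomNode = { tag: string; parent: DomNode | null };
    type Rule = { selector: string[] };  // e.g. ["div", "p", "img"]; rightmost part must match the node

    // The path of tag names from the document root down to this node.
    function pathToRoot(n: DomNode): string[] {
      const path: string[] = [];
      for (let cur: DomNode | null = n; cur !== null; cur = cur.parent) path.unshift(cur.tag);
      return path;
    }

    // Simplified "descendant" match: the last selector part must match the node,
    // and the remaining parts must appear, in order, somewhere along the path.
    function matches(selector: string[], path: string[]): boolean {
      if (selector.length === 0 || path.length === 0) return false;
      if (selector[selector.length - 1] !== path[path.length - 1]) return false;
      let i = 0;
      for (let j = 0; j < path.length - 1 && i < selector.length - 1; j++) {
        if (selector[i] === path[j]) i++;
      }
      return i === selector.length - 1;
    }

    // Tiled matching: sweep the rules in cache-sized tiles so the rule tables stay
    // resident; each (chunk of nodes, tile of rules) pair is independent work.
    function matchAll(nodes: DomNode[], rules: Rule[], tileSize: number): Map<DomNode, Rule[]> {
      const result = new Map<DomNode, Rule[]>();
      for (const n of nodes) result.set(n, []);
      const paths = nodes.map(pathToRoot);            // computed once per node
      for (let t = 0; t < rules.length; t += tileSize) {
        const tile = rules.slice(t, t + tileSize);    // one tile of rules at a time
        nodes.forEach((n, k) => {
          for (const r of tile) {
            if (matches(r.selector, paths[k])) result.get(n)!.push(r);
          }
        });
      }
      return result;
    }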
And this turns out to be essential not only for obtaining good sequential speedup but also for enabling parallelism to harvest its potential. So here are the results. And what you see here, here is the original sequential algorithm, so this is speedup 1. Now, if you parallelize that, you get some speedup, maybe a factor of 3 here. Now, if you apply the sequential optimizations, including the tiling, you get to maybe 12. And then parallelism will give you more. And I think the speedup here will be more than you had in the non-optimized version, because you get from eight milliseconds to two, so maybe you get a factor of 4 rather than a factor of 3, so even the parallelism does better. Now, if you look at scaling, scaling is not ideal. You'd like to scale a little bit better here on these eight threads. And perhaps we need to look more at memory locality and whether the random distribution of work on these processors is the right idea. But so far this is promising. Layout. So now comes the process in which you have the abstract syntax tree, and it is annotated with the rules which we matched in the previous phase. And now we need to traverse the tree and actually figure out where each letter goes. And when we have the letters we have the words, and then we figure out where to break the words into lines and where the boxes stack above each other. This is called flow layout, because we're essentially flowing the elements onto the page. And if you follow the specification of this process, the CSS specification, this seems inherently sequential. Because what you do is you start with a tree, it has some initial parameters such as the page is 100 pixels wide, the font size is 12, now the font size of this subtree is 50 percent, and then it says that this image should float, meaning it can go from this paragraph to the next paragraph because it will float out of the paragraph, and you set this width. So this is the input. And now you figure out the rest through an in-order traversal of the tree. You compute the X and Y here. This gives you X and Y here. You change the font size to 6. So what you propagated is the font size, the current cursor, so the position on the page, and the current width. And you continue like this, propagating information, until you're done. And this appears sequential, because if you trace these dependencies, you end up with such in-order chains, and therefore the algorithm is not parallel. But if you take a closer look, you realize that there are some subsets of the computation which in fact can be performed independently of the others. And here we see that the font size of this element can be computed by a top-down traversal. And if you follow this idea and invent some new attributes, you end up with a parallel computation that has five phases. In the first one you go top down and compute the font sizes and some temporary, preliminary widths of these boxes. Then you go bottom -- so we can show this -- then you go bottom up, and for each of the boxes you compute its preferred maximal width, meaning if the box had arbitrary space available for layout, how would it lay itself out. In the case of a paragraph it would mean -- how wide would it be if it didn't do any line breaking. And also you compute the minimum width, which is sort of the narrowest width that it needs to lay itself out properly, and that would mean, in the case of a paragraph, breaking after every word. So this is what you do in the second phase.
Again, it's a parallel tree traversal, bottom up. Once you have that, in the third phase you can actually compute the actual width. Then you are ready to compute, bottom up, the height, because now you know how wide each paragraph is. You compute the height. And then in the fifth phase you actually compute the actual absolute position. So here is the speedup on some preliminary implementation, which is not quite realistic because it doesn't have the font work which you need to put in the leaves. So it doesn't quite account for how much work it takes to turn fonts into rasters and so on. But still, preliminarily, there is some speedup, meaning we did discover some parallelism. Finally, scripting. So scripting is this component here. The script is something that interacts with the DOM and also with the layout, because the script might change the shape of the DOM, it may change the attributes, such as the sizes of these boxes, and the layout needs to kick in and re-lay out the document. So why would one want to parallelize scripting? It doesn't take so much of the computation; maybe 3 to 15 percent of the browser goes into scripting. Well, a lot of what scripting does in the browsers, beyond user interfaces, is visualization of data. So imagine you have an electoral map like this, and now you want to change some parameters and have it change into something like this. And maybe in addition to changing the color you want to change the shapes of the states to reflect the magnitude of some attribute. And perhaps you want to do it as an animation, and perhaps you want to do it with the granularity of counties rather than states. So now you realize that the computation is quite demanding. You have animation which needs to work thirty times a second, so 30 frames a second. That gives you about 33 milliseconds for hundreds of nodes in the AST. And now this is much, much more demanding than what JavaScript can do, definitely on handhelds. And it gets worse if you start doing 3D with various other annotations. So what do you need to do to speed it up? Well, the programming model looks roughly as follows. It's a nonpreemptive [inaudible] model. Okay. These handlers respond to events such as the keyboard or the mouse. And they execute atomically. So if there are two events queued up, you first execute this one, finish, then execute the next one. Between the two, the layout would kick in and render the document. And only then would the second script go. And this would be sort of the nice execution model. So it looks all very friendly to the programmer. But if you want to parallelize it, meaning you perhaps would like to do two layouts in parallel, perhaps you would like to take the effects of this script and lay them out in parallel with the effects of that script, now you need to look closer into the document and understand what the dependencies are. The dependencies are interesting. This is sort of what one needs to do to parallelize programs in a particular domain: understand the dependencies that exist. And here in the browser they come in two kinds. The first one is what you could call document query dependencies. This script here could write into an attribute X, say the position of one node in the DOM, update it to a particular position on the screen, and then the second handler may want to read it. So this is the classical data dependence that exists because you read and write the same memory location. More interesting are the so-called layout dependencies.
What happens here is the handler can write the width, change the width of some, say, box or image on the screen. Now the layout kicks in and it will re-lay out the entire document. And as a result, it may change the width of, say, the California element. And then handler B will read this. And so now we have dependencies here which exist transitively. The handler here did not write stuff that was read by this one, but it wrote something which changed the attributes for layout, the layout then changed the attribute in some other node, and that was read by this one. So in order to parallelize the layout process, you need to understand what the scripts do and how changes, through the layout semantics, influence the other scripts. Then, despite the fact that this is nonpreemptive single threading, there are concurrency bugs. And we discovered three kinds. The first ones are related to animations. Now, it could be that you have two animations running concurrently on the same element and both of them want to change the same attributes, say the size or color, and now they conflict and corrupt the document. Second, there are interactions with the server, and the semantics of the browser-server HTTP interaction is that the responses not only can be lost, but when they come back they could be reordered. And that reordering can break some invariants that the programmer had in mind inside the browser program. And finally, browsers, in order to optimize, run scripts eagerly before the document is actually loaded. And the effect is that sometimes unpredictable things happen. A script that runs too early may destroy the entire document, and the other ordering may also cause problems. I can tell you more details offline. So why is JavaScript not a great language for this? What you see here is a small Web page, an animation effectively, whose goal is to create a box. The box is here. It's a simple div that follows the mouse. It follows the mouse in such a way that the box appears in the position on the screen where the mouse was 500 milliseconds before. So here is the box. And here is the script. And what you see in the script are two callbacks. Essentially these are interrupt handlers, if you will. This one here is called whenever the mouse moves. So you register it here. This is the event to which you register it; here is the callback. And inside here you have another callback which is created each time this one is invoked. It's a closure, and this one is then invoked 500 milliseconds later, and this is the one that actually moves the box by setting its coordinates. And so, you know, this is not so bad, but this is a very simple program. And you already see that there are a few things that are really hard to read. The first one is the control flow. The control flow is not quite obvious. This is invoked when the mouse moves. This is invoked 500 milliseconds later. You need to do an analysis of how these callbacks are registered in order to see how the control flows through the program. The second is the dataflow. If you notice the way things are linked here, this box here has a name which is a string. And you refer to it with this construct. Here is the name reference. So you have a reference from here to here, not through something that is syntactically declared and registered in a symbol table, but through what could be, in the worst case, a dynamically computed string, and very often it is. So the data dependence also is really hard for the compiler to figure out, and so again you don't know how the scripts modify the shared document.
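For reference, the mouse-follow page being described looks roughly like the following reconstruction (TypeScript, essentially the plain JavaScript one would put in the page; the element id "box" and its absolute positioning are assumptions of the sketch, not details from the talk):

    // A <div id="box" style="position:absolute"> is assumed to exist in the page.
    // The box is referred to by a string name, which in the worst case could be a
    // dynamically computed string -- this is what makes the dataflow hard to analyze.
    const box = document.getElementById("box") as HTMLElement;

    // Callback #1: registered for mouse movement. The control flow is implicit in
    // this registration rather than visible in the program text.
    document.addEventListener("mousemove", (e: MouseEvent) => {
      const x = e.pageX, y = e.pageY;
      // Callback #2: a closure created on every mouse move and invoked 500 ms later.
      setTimeout(() => {
        // The write into the DOM that the layout, and any other handler, depends on.
        box.style.left = x + "px";
        box.style.top = y + "px";
      }, 500);
    });

Even in this tiny program, the control flow is only visible by tracing the two registrations, and the dataflow goes through the string name "box", which is exactly the difficulty just described.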
So the proposal is to switch to something like an actor model, or perhaps an even higher level of abstraction where the control flow is changed into dataflow. We can think of it perhaps as streaming. So now you have a mouse here, and we generate the sequence of top and left coordinates. Each time the mouse moves, two values are inserted here and they flow down to these sort of computational components, each of which delays them by 500 milliseconds. After 500 milliseconds they flow down here, and they take this box, which now has a nice structural name that the compiler can actually reason about, and they change its attributes. And now both the control flow, which is here, and the dataflow, which is what we are modifying based on this name, are nice and visible. So in summary, we developed some work-efficient algorithms, which are especially important on these mobile devices where energy efficiency is important. And we looked at the programming model that should be useful for scripting in the browser, and we are now finishing the first version of the design of the language based on constraints. We could think of it as raising the level of abstraction, or functional reactive programming, maybe [inaudible] somewhere higher. And that's where we are. There is a lot of other work going on, such as partial evaluation of the layout process and so on. And I can tell you about it offline. >>: Yeah. So you're [inaudible] familiar with the fact that you can -- in a non-work-efficient way you can use parallel prefix computation to do lexing in parallel. And the drawback with that has always been that the amount of additional work is sort of how many states are in the finite state machine that you do it with. But this insight that there are these persistent states and this fairly small number of possibilities for the input at certain times suggests another way to do this, which is pretty close to what you're doing. You can kind of compute the narrowing spots in possibilities in state space and do scans at that granularity -- not at the single-input-symbol granularity, but rather on the substrings that transport you from, let me call them, popular states to other popular states. So you're doing parallel prefix only on the popular targets, not on the whole state space. >> Ras Bodik: This seems to require that you examine the content of the string beforehand. >>: No, what I'm suggesting to you is that you examine the automaton, actually. But, yeah, but maybe the string too. >> Ras Bodik: Right. So we definitely plan to continue by understanding the automaton. And you can do it in various ways, through profiling perhaps. But I -- if I understand you correctly, it would really require me to have a look at the string and understand that, oh, this part of the string here is likely to converge faster than others, for example, because if I chop this string in the middle of a comment rather than in the middle of an identifier, of course I may be in big trouble. And in sort of current browser programs, the Web page is a combination of HTML, CSS, JavaScript and potentially other languages. So you would at least like to know in which language you are when you're breaking this. >>: [inaudible] >> Ras Bodik: But I need to look at the content first, which sort of requires -- goes against the benefit here. I would like to chop the input directly from the network card. >>: [inaudible] in places that -- >> Ras Bodik: Right. >>: -- and you're making a speculative choice as to which [inaudible]. >> Ras Bodik: Absolutely.
>>: If it was a small subset of states instead of a single state, then there wouldn't be any backtracking. Might not be a win, I'm just -- just an idea. >>: [inaudible] the only area where you apply speculation is in lexing, right? If you had support for [inaudible] transactions, would you see opportunities in other phases of your program? >> Ras Bodik: So first of all, speculation actually is used much more pervasively than that. You have to use it in parsing, because there are very similar inherently sequential dependencies, unless you use something like a CYK parser, which has bad constants, however. So the parser has it. The layout process has it also, because we can break things down into these five phases only if there are no floats. Floats are these images which may float out of their paragraph and influence anything that follows. So you speculate that that doesn't happen. But you cannot prove that these dependencies indeed do not exist. So speculation is used also there. Now, to the heart of your question: would hardware support be useful? Probably not. I think you could probably use it, but these are very domain-specific ways of using speculation; namely, you know exactly which [inaudible] you need to check at the end of the work to see whether the speculation was correct or not. And that, I think, is easier to do in software. Also, we do not need to do any complicated rollback that would require support. We just redo the work. So probably not. I'll be happy to answer other questions offline. [applause] >> Sam King: So hello. My name is Sam King. I'm from the University of Illinois. And I'm here to talk today about designing and implementing secure Web browsers. Or another way to look at it is how you can keep your cores busy for two seconds at a time. So this is joint work done with some of my students: Chris Grier, Shuo Tang and Hui Xue. And we're all from the University of Illinois. So overall, if you look at how people use Web browsers, it's very different now than ten years ago. So ten years ago the Web browser was the application. And the static Web data was the data. But if you roll forward to now, to so-called Web 2.0, the browser has really become more of a platform for hosting Web-based applications. So it's very common for people to check e-mail, do banking, investing, watch television, and do many of their common computing tasks all through the Web browser. In fact, I would argue, and I think Ras did more eloquently, that it's the most commonly used application today. Now, unfortunately, the Web browser is not up to its new role as the operating system in today's computers. And modern Web browsers have been plagued with vulnerabilities. So according to a report from Symantec, in 2007 Internet Explorer had 57 security vulnerabilities. This is over one year. Now, I know what -- when I go around the country and I give this talk and talk to people, they say, Sam, you know, I would never be crazy enough to use Internet Explorer. I'm a Firefox guy. I care about security. Well, it turns out Firefox has also had its share of security problems, with 122 security vulnerabilities over the same period of time. Now, some of you might be Mac users, such as myself, and in our infinite arrogance we understand we have error-free software and therefore we don't have to worry about stuff like this. Well, Safari and Opera have also had their share of problems.
Now, perhaps most alarmingly, the browser plug-ins, which are external applications used to render non-HTML content, accounted for 476 security vulnerabilities over this one-year span. And this is just too much. Now, to make matters worse, there were a number of recent studies from Microsoft, Google, and the University of Washington, all of which show that not only are browsers error prone, but this is a very common, if not the most common, way for attackers to break into computer systems: all through the Web browser. So what does it mean to have a browser-based attack? What does it mean to have your computer system broken into through the browser? So one way this could happen is through so-called social engineering. So what I have shown here on this figure is a screen shot from greathotporn.com. And so you go to this Web site and they show you a very real-looking video plug-in. And when you go to play it, they ask you some questions. They say, okay, you know, I see that you don't have the proper codec. Can you please install this ActiveX codec so you can watch the video. And you click yes. And instead of watching a video you get a pretty nasty piece of malware installed on your computer system. This particular site also would try to exploit browser vulnerabilities at the same time. So that means behind the scenes they're trying to invisibly break into your system. Now, another way that this can happen is through plug-ins. As I had mentioned, plug-ins are very error prone, at least according to the data I have, and if an attacker is able to take over a plug-in, not only do they then own your entire browser but also your entire computer system. So shown here is viewing a PDF, something about designing and implementing malicious hardware; well, there could be some exploit code mixed in there that could take over your entire computer system. Something very benign that we do all the time can lead to your computer system getting broken into. Now, perhaps most surprising is an attack being carried out through a third-party ad. So shown here I have Yahoo!. And I'd like to draw your attention down here to the bottom right, where you can see there's a third-party ad. Well, these ad networks are very complex, where there are many layers of indirection between the ad that's being served and how it actually gets to your browser. And there's been evidence that the top ad networks have served people malicious ads. So what this means is that the ad contains an attack inside of it, and it takes over your browser. Now, this is interesting -- >>: [inaudible] >> Sam King: Even if you don't click on it. So this is interesting because it violates one of the fundamental assumptions that we're taught: hey, if you don't go to greathotporn.com you're not going to get broken into. But that's not actually the case. Legitimate sites can be susceptible, mainly because browsers just bring in information from so many potentially untrusted sources. And finally, and my personal favorite, is this so-called UI redressing attack. So the point of what you're trying to do here is you're trying to get the user to turn on their microphone through Flash. So the way this happens normally is you have a frame that visits a page hosted on the Adobe Web site, and they'll say: this Web site is trying to turn on your microphone; do you want to allow it, yes or no. But what you see here is some text that says: do you allow AJAX; AJAX will improve your user experience. Well, yeah, of course. I'd love to have my user experience improved. Who wouldn't.
But in reality what's happened is that the frame from Adobe is hidden here behind the scenes. And the attacker has covered everything except this allow button. So in an attempt to improve your user experience, you've inadvertently turned on your microphone and given the person that's hosting this Web site complete access to what you're saying. So the current state of the art in Web browsers when we started this project was quite poor, and it's improved a little bit since we originally published this work. So some of the more traditional browser architectures are Firefox and Safari, where these are basically monolithic pieces of software that include everything within a single process. If you have a vulnerability in one single part of one of these browsers, your whole system is taken over. And they do try to enforce some security policies, but these enforcement checks are sprinkled throughout the code in the form of a number of IF statements. Now, since we published the original work that I'm going to be talking about today, some more recent browsers have made some big improvements in terms of security. So Google Chrome and IE8 both do a great job with system-level sandboxing. So, you know, this idea of a so-called drive-by download -- these browsers do a pretty good job with that. But I think one area that these browsers still fall short in, even today, is protecting your browser-level state. And I would argue that as more and more content moves onto the Web, this browser-level state will increase in importance, and protecting it is something that we need to be able to do. So finally, the fundamental problem is that it's really difficult to separate security policies from the rest of the browser. So, frustrated with the state of the art, my students and I set out to build a new Web browser from the ground up. And this is where we started the OP Web browser project. So our overall goal is to be able to prevent attacks from happening in the first place. You know, let's build this browser right from the beginning. However, being engineers and realists, we realized that vulnerabilities will still happen and browsers will still have bugs in them. So even if there's a successful attack, we want to be able to contain it. Finally, at the end of the day, there's still a user using that Web browser. So they're going to download stuff and double-click on it. So the final thing we want to do is provide the ability to recover from attacks. And so what we do is provide an overall architecture for building Web browsers. So we take the Web browser and break it apart into a number of different components. And we can maintain security guarantees even when we're broken into. Now, very much the design was driven by operating systems and formal-methods design principles, as you'll see later in the talk. So overall, I will spend a little bit of time talking about the OP design and one of the aspects of OP that I think is pretty interesting, which is how we applied formal methods to help us reason about security policies. Then I'll talk about the performance of our original prototype. And at the end I'll spend some time touching on some of our future work and the things that we're working on as part of UPCRC. So as I mentioned before, our overall approach with the OP browser is to take the browser and break it into a number of much smaller subcomponents. And really at the heart of this browser is a thin layer of software down here called our browser kernel.
And our browser kernel, like an operating system kernel, is responsible for managing all the OS-level resources below and providing abstractions to everything running up above. Now, the browser kernel is where all of our access control and security -- almost all of our security mechanisms are enforced. Now, the key abstraction that the browser kernel supports is message passing. So that means everything that's running up above communicates through the browser kernel using message passing. Now, the key principal in our browser operating system is the Web page. So each time you click on a link, this creates a new Web page instance. And this is the key principal in our operating system. So the Web page instance is broken down into a number of different separate components, starting with a component for plug-ins, one for HTML parsing and rendering, one for JavaScript, and one for laying out and rendering content. Now, we sandbox these Web page instances heavily. So a Web page instance is composed of a number of different processes. But these processes aren't allowed to interact with the underlying operating system directly. We use OS-level mechanisms to make sure that that doesn't happen, and we force them to communicate through our browser kernel. So now the question is, okay, if all we can do is communicate with the browser kernel, how do we actually do normal browsing things? And that's why we have additional components here over on this side, where we've got a user interface component for display, we've got a network component for fetching new content from the network, and a storage component for any persistent storage needs that the browser might have. So by designing a browser this way, what we were able to do is provide constrained and explicit communication between all of our different components, and everything is operating on browser-level abstractions. So because of this, we're able to enforce many of our security policies inside the browser kernel itself. And our browser kernel is a very, very simple piece of software, which makes it easy to reason about. Now, this is a stark contrast to most modern Web browsers, where the security checks are scattered throughout the browser itself, and this intermingling makes it very difficult to figure out what's going on. So given this overall architecture, there are a number of different things that we've done with it. So some of the things that we've been innovating on have been policy, where we were the first ones to take a plug-in and include it within the browser security policy itself. Now, I think we have a pretty good idea of the mechanisms for doing this type of thing, but what we found out is that policies are much more difficult than we had originally thought. And this is still an area of ongoing research. The second thing we did is, because of the way we designed it, we were able to pretty easily apply formal methods and get more mechanical ways to reason about some of our security policies. And I'll talk about that in a few slides. And then finally, we were able to do things with forensics, meaning your browser has been broken into and you download an executable: can you tell me which Web page this executable came from? Something that's very difficult with a traditional Web browser but that we can do pretty easily.
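As a rough illustration of the message-passing structure described above -- not the OP implementation itself, which isolates components in separate OS processes and communicates over real IPC -- a browser kernel that routes messages and enforces a simplified same-origin check might be sketched like this in TypeScript (the message fields and the single policy check are assumptions of the sketch):

    // Hypothetical message and component types, for illustration only.
    type Message = { from: string; to: string; origin: string; kind: string; payload: unknown };

    interface Component {
      name: string;
      origin: string;                  // the principal this component acts on behalf of
      deliver(msg: Message): void;
    }

    // The browser kernel is the only channel between components. Because every
    // message goes through here, the access-control checks live in one small place
    // instead of being sprinkled through a million-line code base.
    class BrowserKernel {
      private components = new Map<string, Component>();

      register(c: Component): void {
        this.components.set(c.name, c);
      }

      send(msg: Message): void {
        const target = this.components.get(msg.to);
        if (!target) return;                          // unknown target: drop the message
        // Simplified policy: a component may only message components of the same
        // origin (a real kernel also allows trusted services like network and storage).
        if (msg.origin !== target.origin) {
          console.warn("kernel: blocked " + msg.from + " -> " + msg.to + " (origin mismatch)");
          return;
        }
        target.deliver(msg);
      }
    }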
So our overall goal for our use of formal methods in OP was not to have a formally verified Web browser, but we wanted to see how well -- how amenable our design was to using these types of mechanisms. So we wanted to model check part of our specification. Now, specifically, what we do is model check two key invariants, two very important invariants. So the first one is: does your URL address bar equal the page that's currently been loaded? So that's, if you've got a browser and it says it's at address X, is that really the page that's being displayed? So it's a pretty simple invariant and one that I think we all assume. But history has shown this has been a surprisingly difficult invariant to get right. The second invariant that we tried to model check is our browser's implementation of the same-origin policy. And we do this assuming that one or more of our components have been completely compromised. So if we give an attacker the ability to execute arbitrary instructions on our computer system, can we still enforce the same-origin policy? That's the question we're trying to answer. Now, in order to do this modeling, we built a model using Maude, and each of our different subsystems makes up the overall state space for our browser. Now, all of the messages that pass through the browser kernel, these are our state transitions. And we use this for model checking. So one of the interesting things I think that we did in terms of formal methods is that we were able to model an attacker pretty well. And this is because of the sandboxing we're using and because of some of the assumptions we make: we can model an attacker pretty accurately as one of our components that sends arbitrary messages. So they can send whatever message they want, they can drop messages, reorder messages. They can do whatever they'd like as long as it's sending a message. And this is our model for an attacker. So the first invariant that we tried was the URL bar equaling the URL that's been loaded in the browser. So the thing that was interesting about this exercise was that we found a bug in our implementation. So it's a pretty simple thing. At least I thought it's a pretty simple thing to try to program this type of invariant, but what we found is that we still made a very subtle mistake. I know what this slide says, but what I think actually happened is we forgot to take into account an attacker that will drop a message. Maybe with more time we would have got it. But because we were using formal methods, it was something that we found, I think, a lot faster than we would have found otherwise. Another thing that I think is interesting is that we were able to model and -- model check our implementation of the same-origin policy, something that is very difficult to do in modern Web browsers. So overall, I'm not trying to stand here in front of you today and tell you we have a formally verified browser, because that's just not true. But what I think we have is a design that is well suited to this type of analysis, and we have a pretty good starting point for implementing a more secure browser. So at the very least we know the messages that are going back and forth are basically doing the right thing. So now that we've been able to reason about this well, we can focus on implementing the rest of it to make sure that our model and implementation actually match. So one of the things that was interesting to us was performance.
So in order for us to measure performance, what we did is use page load latency times. So shown here on this figure, on the X axis we have page load latency times in milliseconds, so longer is slower, and slower is bad. The two browsers we tested were OP, which is ours, and Firefox. And we tested them on five different Web pages, where we've got Wikipedia, cs.uiuc, craigslist, Google and Live. And what we did is load the page and measure from the time you click the button until when it's displayed on the screen. That's defined as page load latency time. So what we found that was really interesting about this process was not that we're about as fast as Firefox, which more or less we are. The thing that was really interesting was that we're about as fast as Firefox despite our current best efforts to make our browser slow. So we do everything in our power to make this thing slow. It takes 50 OS-level processes to view a single Web page. We've got multiple Java virtual machines running all over the place. You know, we're using IPC as much as possible. Everything is -- you know, all these boxes I show are processes and they're communicating using IPC, so we're doing all the things that should make it slow, but despite this it actually isn't that bad. And I think there are a number -- a couple of reasons why this is true, one of which is multicore. So because we have a sufficient number of computational resources, we can do these things where we add some latency and it doesn't affect the overall loading time. All right. So what I presented so far was some of our older work. This was from a little while ago. And as I mentioned, since we published the basic architecture, other people have started to take up this line of research. So what I want to spend just a few minutes talking about here today is how we're going to use this as a platform for future computing and some of the things that my group and I are working on today. So I think that the original OP architecture is going to do a good job keeping a few cores busy, you know, four, maybe even eight. We'll keep those busy. But what I'm trying to think about now is what types of applications we can build on top of this framework to keep tens of cores busy. So the first thing that we're looking at in this general area is we want to enable our browser to enforce client-side security policies. So this means in the browser we want to be able to see what's going on in a Web page and potentially restrict parts of the Web page, and do this in a way so that we can improve performance. But by restricting a Web page we're changing the Web page itself, which could have a compatibility impact. So what you could think about, from a high level, is maybe you want to visit a Web page and you want to try out 15 different security policies in parallel and then try to pick which one is right for you depending upon the compatibility-versus-security tradeoff that you observe. But one of the problems is that it can be difficult to try to quantify these types of things, especially when you want to -- when you're concerned with how the user has been interacting with a specific Web page. So one of the things we're looking at is a connection to another one of my UPCRC projects, which is Replay. So can we use replay technology as a way to replay Web browsers? And if you look at the two different extremes of the type of replay you could do, on one extreme you could have full-blown deterministic replay.
So this means that you can recreate arbitrary past states and events, instruction by instruction. So this works great at recreating past states and events. This is what it's designed to do. And if you want to do something like reverse debugging, it's great for that. But the problem is, if you want to take the same browser and apply a different security policy, traditional deterministic replay doesn't really work. Because things have changed a little bit now. And so what you really need is some more flexibility. Now, some techniques that you can use that are much more flexible are things like replaying UI events. So you can think of the most simple form of replay on a browser as clicking the refresh button. Right? So it will go out and it will refetch the page and it will basically do the same thing. And this is going to be very flexible. Now, the problem with the naive approach is that you're not getting UI events. So if the user interacts with the Web page, you're missing this. Then there are known approaches for doing these types of things. There's a project called Selenium that does stuff like this. And you can have a very flexible environment for replaying browser events. But the problem is it's not deterministic enough. There are still certain things like JavaScript that will induce types of nondeterminism, especially when you're applying security policies that remove part of the JavaScript from the page. You know, who knows what Selenium is going to do in there. So you've got to spend a little time trying to think about how to cope with some of these types of scenarios. So from a high level, what we want to do is called semantic browser replay, where it's something that's in between on the spectrum of full-blown deterministic replay: we can reproduce as many of the past states as we can, but we do it in a way that's flexible enough that we can try different policies out. So the basic architecture is that you've got a browser instance, and as you load this page you're going to record the sources of nondeterminism as you execute and the user interface events, and then you can replay it on a couple of different browser instances. Now, what you can do is you can have one that doesn't have any security policy, another with a new security policy, maybe a third, fourth, and fifth with even different security policies. And then you can figure out ways to try to determine and quantify how different these pages are. You know, as I mentioned before, one of the things that gets really tricky about this is when you remove a piece of JavaScript -- because this JavaScript might have causal dependencies with JavaScript that you don't remove. And so it creates some interesting challenges that we're still working through today. Now, in addition to replay for browsers, I think one thing that I'm personally very interested in is more formal methods. I'm more of a systems researcher myself, but we're talking to people who do formal methods for a living and trying to see how much we can use these mechanisms and how much of a benefit they have. Another thing that my group has been thinking about are display policies. So this is something that we've been collaborating with Microsoft Research on, a project called Gazelle, where -- I don't know if you guys noticed this, but the UI redressing attack where you turn on the microphone -- I didn't really mention anything today that's going to help with that. So this is still a very wide-open area that needs to be looked at in more detail.
General browser extensibility is yet another area that's interesting. Plug-ins are one example; browser add-ons or extensions are another example. And thinking about how to facilitate browser extensibility where you can provide both flexibility and security is an interesting topic. And finally, the Berkeley project, where they're making individual components faster by parallelizing them -- you know, hopefully we can find ways to take those results and work them into our overall architecture. And so the hope is we'll keep even more cores busy. So overall, the browser has really evolved from an application into a platform for hosting Web-based applications. And as such it's really become much like an operating system. The problem is traditional Web browsers weren't built like operating systems. So when you apply OS principles to designing and building Web browsers, you can make them much more secure. So what I showed here today is our approach to this basic philosophy, where you can decompose a browser into a number of much smaller subsystems, and this provides a number of advantages, including separating the security logic from the [inaudible] browser, and it gives you the ability to formally model some browser interactions. So what I've shown here is a step towards preventing, containing, and recovering from browser-based attacks. So any questions? I have a demo in case anyone's interested afterwards. Was there a question in the back? Yeah. >>: [inaudible] the architecture [inaudible] totally on its side, is it -- do you think that's inherent or is that just kind of [inaudible]? >> Sam King: I'm sorry, I didn't hear the first part of the question. >>: [inaudible] the architecture of the browser, you showed that different Web pages [inaudible] processes and isolation groups. But then you have this [inaudible] network components [inaudible] I'm not surprised to see that [inaudible] do you think that's inherent or just kind of that was just what you did? >> Sam King: I'm not sure what you mean by inherent. Certainly it's a design decision we made. So the big distinction I make is that these are components that we wrote from scratch. You know, these are something that we wrote in JavaScript and we built it -- a hundred lines of code, whereas the browser instances, these are off-the-shelf components. So we're taking WebKit, for example, and jamming it in there. And because we have fewer assurances about the implementation -- it's a million-line artifact -- we use more processes to help there. So it's -- you can draw as many boxes as you want. This is just one design decision we made. >>: So can you give me an example of [inaudible] that you'll be able to? >> Sam King: I'm sorry, one more time? >>: An attack that IE will not be able to prevent but [inaudible] will be able to prevent? >> Sam King: So, you know, I think -- so the first thing is, I wouldn't say there's a concrete attack that we prevent and IE doesn't, but I think the IE implementation of the same-origin policy has had some well-known flaws in the past. Whether or not we would be susceptible to that, I don't know. I mean, I can make a qualitative argument that I think we did it better because it's so much smaller, but, you know, that's more of a qualitative argument. I think one thing that we -- so let me try to answer this at a higher level. I wouldn't say that there's anything like a specific attack that we prevent and they don't.
I think it's -- if you look at most of the policies that we're playing with here, it's basically the same thing that IE's doing. We just draw our boxes in a different way. So the hope is that it's easier to reason about security. But it's basically the same thing. >>: [inaudible] Chrome or Safari? >> Sam King: So this was a -- the performance numbers I showed were from a little bit of an older browser. The new one uses WebKit. For whatever reason we haven't run it against WebKit. We really should, though. I think that's a good suggestion. I think at the end of the day, I personally am just not that worried about performance. Like, you know, as long as it's reasonably fast, I'm going to be happy. We're more focused on the security side. But I agree. We should run that. My intuition tells me we add a little bit of overhead, but not much. >>: Thank you very much. [applause]