>> Eric Horvitz: So it's -- we're at exciting times here. I'll put this up. My team is very
excited and many of our colleagues in the academic community about all the structured
and missing content in the world, lots of data about [inaudible] users now.
In the old days we always would say, you know, boy, if we just had data. And we
focused on the algorithms. Nowadays we have too much data and we're still focusing on
the algorithm, but we have had quite a bit of enhanced prowess in learning and inference
reasoning strategies. Still overall an intractable problem, but there's been some nice work
on approximations that work pretty well.
And we're seeing opportunities to interleave machine intelligence at the core of new
kinds of services and experiences. I think this is just in the basement level right now. I
see a lot going on and several talks the last couple days that I see on the schedule have
addressed some of the opportunities there.
So we have a lot of sensors out, a lot of connectivity and content. We have these new
algorithms that seem to do well. It's built over 15 years in the probability space,
probabilistic graphical model space.
But computation is a big question for people working in this area of intelligence and
automation of things that people have tended to do in the past. And it's not clear where
we are.
Certainly we made great progress in the algorithms, but the computation and memory
available has helped out quite a bit. If you look at the work in the Deep Blue project, for
example, a lot of the boost in intelligence came from deeper plies of search.
On the other hand, there was still quite a bit of innovation in how the search was directed,
talking to the people working on that project.
So if indeed we're at a bend in the curve here, it's bad news for a lot of expectations in an
area that has finally seen a bend in the curve of prowess in terms of doing these kinds of
things.
Just a quick review. Machine learning, there's several different approaches to machine
learning. Typically you have a set of random variables. And you want to build a model
from the random variables from a large dataset.
One approach that I'm very fond of is structure search where we actually build
probabilistic graphical models by doing search over all structures looking at different
dependencies given a dataset. It's a big search tree typically.
Our team has done a whole bunch of work in this space over the years. The idea here is
you're searching with a score that gives you the goodness of models given a dataset. So
basically it's the probability of the model given the data, and you find the best one
typically. And then you can do inference and decision-making with that model.
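To make the structure-search idea concrete, here is a minimal sketch of a score-based search over graph structures: a greedy hill climb that adds one parent edge at a time whenever a BIC-style score improves. The greedy strategy, the BIC penalty, and the toy binary data are illustrative assumptions, not the team's actual algorithm.

```python
import itertools, math, random
from collections import Counter

def bic_score(data, child, parents):
    """Log-likelihood of `child` given `parents` minus a complexity penalty."""
    n = len(data)
    counts = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    ll = sum(c * math.log(c / parent_counts[pa]) for (pa, _), c in counts.items())
    n_params = 2 ** len(parents)          # binary variables: one free parameter per parent config
    return ll - 0.5 * math.log(n) * n_params

def creates_cycle(parents, u, v):
    """Would adding edge u -> v create a directed cycle?"""
    stack, seen = [u], set()
    while stack:
        node = stack.pop()
        if node == v:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def greedy_structure_search(data, variables):
    """Hill-climb over single-edge additions, keeping the acyclic structure with the best score."""
    parents = {v: set() for v in variables}
    improved = True
    while improved:
        improved = False
        best_gain, best_edge = 0.0, None
        for u, v in itertools.permutations(variables, 2):
            if u in parents[v] or creates_cycle(parents, u, v):
                continue
            gain = bic_score(data, v, parents[v] | {u}) - bic_score(data, v, parents[v])
            if gain > best_gain:
                best_gain, best_edge = gain, (u, v)
        if best_edge:
            u, v = best_edge
            parents[v].add(u)
            improved = True
    return parents

if __name__ == "__main__":
    random.seed(0)
    # Toy data: B mostly follows A, C mostly follows B.
    data = []
    for _ in range(500):
        a = random.random() < 0.5
        b = a if random.random() < 0.9 else not a
        c = b if random.random() < 0.9 else not b
        data.append({"A": int(a), "B": int(b), "C": int(c)})
    print(greedy_structure_search(data, ["A", "B", "C"]))
```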
And there's all sorts of innovations, like in the relational area, looking at learning in
databases. There's ideas of parameter coupling, where you look for templates that you
see repeated, given structure, for example, of database relationships.
What's nice about machine learning is you can -- besides doing inference about a
base-level problem you can sometimes induce the existence of hidden variables, which is
really interesting for science. Like knowing that it's not just that A and B look like they're
influencing C, but that some hidden variable D is at work. And you can actually induce or
infer the likelihood of a hidden variable, which is almost like magic when it works.
You also can sometimes identify causality, not just inferring an arc of dependency between
two variables but actually knowing its direction, that A causes B, under certain conditions
and given constraints.
So one of the hot areas in machine learning right now is active learning, which is, again,
more intractable than base-level learning. And the idea here is that you want to figure out,
given any model that you have, how you should extend that model -- what cases should you
acquire next for labeling, for example, from a dataset where we may have a bunch of
unlabeled data.
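As a rough illustration of picking which unlabeled case to acquire next, here is a minimal uncertainty-sampling loop built on scikit-learn; the talk does not say this is the method the team uses, and the model choice, the oracle, and the budget are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def most_uncertain_index(model, unlabeled_X):
    """Index of the unlabeled example the model is least sure about
    (predicted probability closest to 0.5 for a binary classifier)."""
    probs = model.predict_proba(unlabeled_X)[:, 1]
    return int(np.argmin(np.abs(probs - 0.5)))

def active_learning_loop(labeled_X, labeled_y, unlabeled_X, oracle, budget=10):
    """Repeatedly retrain, pick the most informative unlabeled case, and ask the
    oracle (e.g. a human labeler) for its label."""
    X, y = list(labeled_X), list(labeled_y)
    pool = list(unlabeled_X)
    for _ in range(budget):
        model = LogisticRegression().fit(np.array(X), np.array(y))
        i = most_uncertain_index(model, np.array(pool))
        X.append(pool[i])
        y.append(oracle(pool[i]))          # acquire the label for the chosen case
        pool.pop(i)
    return LogisticRegression().fit(np.array(X), np.array(y))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labeled_X = np.array([[-2.0, -2.0], [2.0, 2.0], [-1.5, -0.5], [1.0, 1.5]])
    labeled_y = np.array([0, 1, 0, 1])
    pool_X = rng.normal(size=(200, 2))
    model = active_learning_loop(labeled_X, labeled_y, pool_X,
                                 oracle=lambda x: int(x[0] + x[1] > 0), budget=5)
    print(model.score(pool_X, (pool_X[:, 0] + pool_X[:, 1] > 0).astype(int)))
```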
And this is used in a variety of ways, including understanding how to explore the world
perceptually, how to grow your models.
There's been some work in tractable approaches to active learning. But it's a challenge
space -- we've done a lot of fun work in the idea of lifelong learning in our team where
you have a model that's active over the lifetime of its usage, and it's reasoning about how
much each new data point is going to help performance over the long run, over the lifetime
of the system's usage, given a sense for that lifetime and for user needs.
Another area that we're very excited about, which, again, hard problem, hopefully
parallelizable, and there's some opportunity there, is selective perception. Here's a
system we built called SEER. It can do -- listen to the desktop, it can listen to sound. It
can do video classification from a camera feed, video feed, and audio classification.
Turns out we can't compute all of this. So what we did in the SEER project was we built
policies to compute the next best thing to look at and could trade off the amount of
precision in any of these modes against its tractability. And what we found is that you're
backing off at times -- for example, on vision, going from a color blob analysis down to a
black-and-white analysis when that's all you need right now, and so on.
We run this thing dynamically. It's very nice. There's actually pricing strategies and so
on. And the CPU is triaged to do the best it can given the structure of the model and the
uncertainties at hand.
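Here is a toy sketch of the kind of triage just described: ranking candidate perception actions by expected value per unit of CPU and backing off to cheaper analyses under a budget. The action names, values, and costs are made up for illustration and are not SEER's actual policy, which uses pricing strategies over the model's uncertainties.

```python
def pick_next_sensing_action(actions, cpu_budget_ms):
    """Greedy value/cost triage over candidate perception actions.

    `actions` maps a name to (expected_value, cpu_cost_ms), e.g. a cheap
    black-and-white analysis versus a pricier color-blob analysis."""
    affordable = {name: (v, c) for name, (v, c) in actions.items() if c <= cpu_budget_ms}
    if not affordable:
        return None
    # Rank by expected value per millisecond of CPU, then take the best.
    return max(affordable, key=lambda name: affordable[name][0] / affordable[name][1])

# Hypothetical modalities with made-up value/cost numbers.
actions = {
    "audio_classification":      (0.30, 5.0),
    "bw_blob_analysis":          (0.45, 12.0),
    "color_blob_analysis":       (0.60, 40.0),
    "full_video_classification": (0.80, 120.0),
}
print(pick_next_sensing_action(actions, cpu_budget_ms=50.0))
```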
So let me just focus a little bit on a few large-scale service problems. I think they're all
kind of illuminating -- a couple of them here. And given the titles of the talks that I've
seen on the program, I think we've seen some talks in this space.
I like to say that on our planet we're seeing a proliferation of intention machines
everywhere. Layers and components that take in observations and that predict intentions,
actions, services.
A good example is Web search. Put a set of queries in, and we're learning or a system is
learning, or a corporation or organization, what the intention of the search is as well as
what content should be returned and how to basically even target ads, for example.
A sister kind of technology are preference machines, where we have sets of preferences
being taken into a large library and reasoned about. And we actually make decisions
about the products that might be preferred or other content that might be preferred over
time.
And a good example of this is collaborative filtering, where you have sets of
preferences -- actually in n-space; this is 3-space here. And typically you have clusters, and
when a new user comes in you can say, well, this user is sort of like these other users, so
you can recommend things that those users might like, for example.
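A minimal sketch of the clustering-by-similar-users idea just described, assuming a small explicit ratings matrix; real systems work over far larger, sparser data, and this scoring is only an illustration.

```python
import numpy as np

def recommend(ratings, new_user, k=2, top_n=3):
    """User-based collaborative filtering: find the k existing users most similar
    to the new user (cosine similarity over co-rated items) and recommend the
    items they liked that the new user hasn't rated yet."""
    def cosine(a, b):
        mask = (a > 0) & (b > 0)                      # compare only co-rated items
        if not mask.any():
            return 0.0
        return float(a[mask] @ b[mask] / (np.linalg.norm(a[mask]) * np.linalg.norm(b[mask])))

    sims = np.array([cosine(row, new_user) for row in ratings])
    neighbors = np.argsort(sims)[-k:]                 # indices of the k most similar users
    scores = sims[neighbors] @ ratings[neighbors]     # similarity-weighted item scores
    scores[new_user > 0] = -np.inf                    # don't re-recommend rated items
    return np.argsort(scores)[::-1][:top_n]

# Hypothetical 5-user x 6-item ratings matrix (0 = unrated).
ratings = np.array([
    [5, 4, 0, 0, 1, 0],
    [4, 5, 0, 0, 2, 1],
    [0, 0, 5, 4, 0, 0],
    [1, 0, 4, 5, 0, 0],
    [0, 1, 0, 0, 5, 4],
], dtype=float)
new_user = np.array([5, 0, 0, 0, 2, 0], dtype=float)
print(recommend(ratings, new_user))   # items that similar users liked
```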
We're seeing richer and richer collaborative filtering going on in the world right now.
And this includes geocentric collaborative filtering. We have locations, for example, and
times and queries at different locations.
We have IP addresses and some nice recent work in MSR, we can actually say give me
time of day and day of week, cross it with queries, for example. And we get better
interpretation of what something means. For example, MSG. At some times and some
places, and given a trajectory, it means Madison Square Garden; at other times it means
something else, and so on.
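To illustrate the MSG example, here is a toy sketch that crosses a query with time-of-day and location context to pick an interpretation; the counts are invented stand-ins for what would actually be learned from large query logs.

```python
from collections import defaultdict

# Hypothetical click counts: (query, interpretation, hour_bucket, near_nyc) -> count.
counts = {
    ("msg", "Madison Square Garden", "evening", True):  900,
    ("msg", "monosodium glutamate",  "evening", True):  100,
    ("msg", "Madison Square Garden", "midday",  False):  50,
    ("msg", "monosodium glutamate",  "midday",  False): 400,
}

def interpret(query, hour_bucket, near_nyc):
    """Pick the interpretation with the highest count for this query in this context."""
    scores = defaultdict(int)
    for (q, meaning, h, nyc), c in counts.items():
        if q == query and h == hour_bucket and nyc == near_nyc:
            scores[meaning] += c
    return max(scores, key=scores.get) if scores else None

print(interpret("msg", "evening", True))    # -> Madison Square Garden
print(interpret("msg", "midday", False))    # -> monosodium glutamate
```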
Let me give you -- talk about the ClearFlow case, which is a fun situation, a fun
challenge problem.
ClearFlow is one of our larger projects on our team. And the idea was can we predict all
velocities on streets in a greater city area given sets of observations, for example, from
sensors in the highway system. So that's a sensor, for example, and we have a
prediction challenge for all the side streets, arterials, and even the smaller streets.
If you could do that, you could do a search over that and route and have [inaudible]. So
we have a lot of work going on in taking lots of streams of data, multiple views on traffic,
weather, major events, incident reports with some lightweight NLP, and as well as lots of
data from volunteers who have driven around with GPS devices over a period of years in
the Seattle region. These are Microsoft employees and their family members.
Quite a bit of data there. It's about 700,000 kilometers now and tens of thousands of
trips.
We also have access to public transit. We've partnered with Seattle -- sorry, King County
Metro and we have data coming in live feeds from all the roving para-transit vehicles.
The idea is we want to weave together sort of a learning challenge here that weaves
together the highway system with side streets, lets us build probabilistic models that can
weave together realtime events, time of day, weather, and all sorts of computed
relationships about streets and their topologies -- how far a particular segment is, for
example, from an on-ramp or off-ramp that's now clogged.
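As a hedged illustration of the kind of per-segment velocity prediction described, here is a tiny scikit-learn regression over contextual features such as hour, weather, and distance to a ramp; the feature set, model family, and numbers are assumptions, not ClearFlow's actual design.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical training rows: one per (segment, timestamp) observation.
# Features: hour of day, is_weekend, raining, distance to nearest ramp (km),
# current highway-sensor speed on the nearest highway (km/h).
X = np.array([
    [8,  0, 1, 0.3,  25.0],
    [8,  0, 0, 0.3,  80.0],
    [14, 1, 0, 2.0,  95.0],
    [17, 0, 1, 0.1,  15.0],
    [23, 0, 0, 1.5, 100.0],
])
y = np.array([20.0, 45.0, 50.0, 12.0, 55.0])   # observed GPS speeds on the side street (km/h)

model = GradientBoostingRegressor().fit(X, y)

# Predict the side-street velocity for 5pm, raining, 200m from a clogged off-ramp.
print(model.predict(np.array([[17, 0, 1, 0.2, 18.0]])))
```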
And based on that we built a portal called ClearFlow. Lots of -- for the Seattle area,
about 8 million street segments. Lots of machine learning. It takes days to get through the
machine learning problem. But you can sort of generate a model that can provide routing
in Seattle based on time you're leaving from now, for example. I'm leaving in 30
minutes, for example.
And we actually built an internal prototype that would actually put side-by-side MapPoint
and ClearFlow, the default direction system and ClearFlow, and get a sense for how
[inaudible], for example, want to get off sides -- how we can get onto side streets and so
on.
The system performed very well. We had to evaluate performance, which you can see with
all the thousands of points here -- how well they're doing in space and how well we're doing
versus a standard posted-speed model.
It did so well we actually ended up impressing the product team. We shipped this project
in 72 cities a year ago. It's called ClearFlow on maps.live.com. And you can see here the
cities that we're in including Seattle and San Francisco area.
What's amazing to me is that every few minutes we are inferring the road velocities on 60
million street segments in North America and having hundreds, probably thousands of
cars routing based on that.
It takes us running on 128 cores for every rev of this model, just the model [inaudible]
predictions for North America, several days of computation. And so -- and that's where
we've parallelized the machine learning algorithm to run on each core on separate
processes. And we think we could do a lot more there. Without that it would take even
longer. And as you can imagine, it would take us weeks. And the product team has a
need to get this revved every few weeks right now, which we're doing.
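A rough sketch of the coarse-grained parallelization mentioned -- one training process per core, each handling its own partition of street segments -- using Python's multiprocessing as a stand-in; the partitioning scheme and the stubbed training step are assumptions.

```python
from multiprocessing import Pool

def train_segment_model(segment_batch):
    """Train an independent model for one batch of street segments.
    (Stub: real training would read that batch's observations and fit a model.)"""
    return {seg_id: f"model-for-{seg_id}" for seg_id in segment_batch}

def train_all(segment_ids, n_cores=8):
    """Split the segments into one batch per core and train the batches in parallel."""
    batches = [segment_ids[i::n_cores] for i in range(n_cores)]
    with Pool(n_cores) as pool:
        results = pool.map(train_segment_model, batches)
    merged = {}
    for part in results:
        merged.update(part)
    return merged

if __name__ == "__main__":
    models = train_all(list(range(1000)), n_cores=8)   # 8 cores here; the talk mentions 128
    print(len(models))
```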
This will get even worse over time as we have fresher and fresher data coming into the
system, up to requiring ongoing machine learning as we gather data.
Yeah.
>>: Does ClearFlow output affect traffic at all?
>> Eric Horvitz: There's always one smart person in any audience that asks that
question. If we're so successful where we're used by a majority of people, that would be
true. And then there are other methods we can talk about. We already have been
working away on load balancing approaches.
Right now we don't intend to do explicit models of how we want to load balance by
giving out, for example, n different randomized directions given what we see in
traffic. One could do all sorts of things.
Another point, just actually represent how many views and represent each logged
direction you're giving out in your database and compensate based on the [inaudible]
you're giving to the system based on the current [inaudible] street system. And that's like
a very interesting research project. It's a great intern project, you know: how would you
actually make ClearFlow work better to load balance if 90 percent of the citizens were using
it to route home in the evening, for example.
>>: [inaudible]
>> Eric Horvitz: Exactly. Let me just mention a couple [inaudible]. I want to move
from the cloud services now to the client a little bit here. And one comment I want to
make here is that I believe that privacy is a driving force for innovation in cloud to client
as well as client plus cloud solutions here.
We don't hear a lot about this, but one of our -- we're very serious about privacy at
Microsoft Corporation in general. It's an interesting area at MSR. It spans several
groups, several approaches, several labs.
One approach we've been looking at -- we have a couple of different projects here. One
is called protected sensing and personalization, PSP. And the idea is can we move -- still
get the value out of services but move almost all machine learning and reasoning into the
dominion of users' machines instead of cloud this and cloud that.
So the idea is basically a shroud of privacy here. We have data -- for example, GPS data
or search data -- all kept within the safety, within the metal structure around your own
boxes that you own. Do machine learning and prediction in a very personal way within
this shroud of privacy. And then use those predictive models to do things with realtime
data needs, context, and so on, at times even using third-party models that might be
developed and [inaudible] models then with volunteers.
It's a very promising area I think, a competitive area. Here's an example. Personalized
Web search. So in our P search project -- this was work I did with Jaime Teevan and
Susan Dumais. We go out through a Web and do search on the Web, let's say, and we
bring back -- let's say we put the word Lumiere in. We bring back 400 results behind the
scenes.
And what we do is we do reasoning -- learning and reasoning -- in the client to rerank
what's coming back and to re-present the results to the user with a different rank ordering.
So when I put Lumiere in running P search here, Lumiere is a, you know, common word,
it's used for restaurants and there's a whole history of the Lumiere brothers and so on.
But what gets to the top of my Web page is what I mean by Lumiere,
based on an analysis of my e-mail, my work, my documents, my relationships, and that's
all done in the privacy of my box. I won't share that with Bing or Google or anybody.
There's some sharing going on, but it's an interesting question as to how you can
[inaudible] escape that.
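Here is a minimal sketch of client-side reranking against a locally built profile, in the spirit of the P Search description; the term-overlap scoring and toy documents are purely illustrative, and in this sketch nothing from the profile leaves the machine.

```python
from collections import Counter

def build_profile(local_documents):
    """Term frequencies from the user's own mail and documents, kept on the client."""
    profile = Counter()
    for doc in local_documents:
        profile.update(doc.lower().split())
    return profile

def rerank(results, profile):
    """Re-order results returned by the web service by overlap with the local profile.
    Only the original query was sent out; the profile stays on the client."""
    def score(result):
        return sum(profile[w] for w in result["snippet"].lower().split())
    return sorted(results, key=score, reverse=True)

local_docs = [
    "notes on the Lumiere project and Bayesian user modeling",
    "draft paper on probabilistic models of user goals",
]
results = [
    {"url": "restaurant", "snippet": "Lumiere restaurant fine dining menu"},
    {"url": "research",   "snippet": "Lumiere Bayesian user modeling project history"},
]
profile = build_profile(local_docs)
print([r["url"] for r in rerank(results, profile)])   # the research result rises to the top
```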
One example. We've done this work in GPS as well and so on. So I just want to mention
that privacy is going to be a forcing function for doing more computation on clients and
not being able to necessarily rely on large datacenters. I think it's a really [inaudible]
point.
Another area I want to talk about a little bit -- back to the client as well as the cloud -- is the
challenging area of sensing, simulation, and inference in human-computer interaction. We
have a lot of work going on on my team in this space. You see what we call the 4-by-6
table here and an active Surface collaboration that's doing quite a bit of sensing and a little
bit of reasoning.
Let me just go to a little video here. Where's my video. So why don't I see my video
here. Let's find out why. Okay. I have to pull that up here again. Stand by one second
here. The way I was going to save time with the video was by going out and having it set
to go here, but it seems to have changed this. So it will just play in the meantime here.
All right. So what that sound just was just now was this. What you see here is a future
direction in mixing tangible objects in the world with simulation. So what's going on
here is that these are simulated cars being generated by a physics engine. The
surface they're on is an actual surface made from construction paper that's being
sensed in real time.
Gestures can be used to change and add objects to the surface here, and then they
interact in this virtual space as real objects.
You can balance the car in your hand; it has gravity and a bounce to it, for example.
Shadows are simulated. So there's a lot of work going on in simulation plus gesture
recognition and sensing combined with graphics, with CGI, to come up with new kinds of
experiences that might mix the virtual and the real in new kinds of ways.
Can't say a lot about it, but this is going to be a very big area for certain areas of
Microsoft -- in the Microsoft product space.
This is another area basically looking at this 3D camera showing how we can sort of
sense gestures and come up with a 3D representation of hands that actually interacts
with -- again, with a simulation here.
Let me show another kind of area here which I think is exciting as well. The idea now is
taking -- you know, doing richer and deeper physics simulations. I think the Intel people
have done some work in this also looking at parallelization of algorithms to do efficient
work with really high-fidelity glass breaking and water simulations.
We're looking at the whole notion of the HCI space becoming more and more
dependent over time on richer simulations like this. This is all just a surface here that
shows how we can do folding, ripping, tearing by combining the gesture recognition with
physics. Again, pretty heavy duty on the computational side.
Let me move ahead here a little bit to show you a little bit of tearing, which is kind of fun
to look at.
There's a whole world of how you debug these systems, how you build models that
do efficient recognition. For example, one of the approaches that Andy Wilson has taken
in this work -- and most of the work you're seeing here is from his team -- is coming up with
a very interesting geometric algorithm that does a nice job of capturing gestures through
touch sensing.
And I'll just end by saying a little bit here that this is the direction I think for
user-interface design someday. You don't sit around and worry about the details of user
interface. You have a physics model and you want to basically say I want to basically put
down some physical objects and use them directly and use a physics simulator to give
you your force feedback on your response and so on.
So someday you just tell a system I want an input device that has this kind of weight and
angular momentum, and you just use your computation to generate the rich behaviors.
Again, this is work by Andy Wilson.
I'm a brand-new user of Win 7, so I have to find out what happened with my video.
Another interesting realm that we're pushing, which I think is a very tough realm for
computation per need, is mixed-initiative collaboration. The idea is someday you'll have a
problem, a blue blob in front of you, and you want to solve it. A computer looks at it with
you -- it might be a ubiquitous computation scheme -- and says, you know, I can divide that
blue blob into subproblems alpha and beta, and, by the way, I can solve beta, and I'll tell you
[inaudible] solve beta, that's my problem; the human being, you should solve alpha. And
you continue to decompose and iterate.
We've done a bunch of work in this space over time. One was called Lookout, one of the
original prototypes in this area, that would actually schedule from free text and bring up
Outlook and sort of populate the schema for Outlook based on the free text content.
And the way that worked basically is the system would learn behind the scenes as I used
e-mail -- always doing machine learning and also reasoning about whether it should do
nothing, engage the user, and if so how, or just go ahead and do something.
And it has a rich utility model here that users can assess. And if assessed correctly, the
system should make users happier when the predictive model sort of predicts the
probability of a desired action over time.
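To make the do-nothing / engage / act decision concrete, here is a toy expected-utility calculation; the utility numbers are invented and would in practice be assessed from users, as described above.

```python
def choose_action(p_user_wants_service, utilities):
    """Pick the action with the highest expected utility given the probability that
    the user actually wants the service (e.g. scheduling from an e-mail).

    `utilities[action][outcome]` holds the payoff of taking `action` when the user's
    goal turns out to be `outcome`. The numbers below are purely illustrative."""
    expected = {
        action: p_user_wants_service * payoff["wants_it"]
                + (1 - p_user_wants_service) * payoff["does_not"]
        for action, payoff in utilities.items()
    }
    return max(expected, key=expected.get), expected

utilities = {
    "do_nothing":   {"wants_it": -0.5, "does_not": 0.0},   # missed opportunity vs no harm
    "ask_user":     {"wants_it":  0.6, "does_not": -0.2},  # helpful vs mildly annoying
    "act_for_user": {"wants_it":  1.0, "does_not": -1.0},  # great vs costly false positive
}
print(choose_action(0.85, utilities))   # high confidence: acting automatically wins
print(choose_action(0.30, utilities))   # lower confidence: asking the user wins
```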
It's a very rich area, it's like you have mixed initiative interaction. There are some rich
future directions in this space where people and computers that work together -- I don't
know if people are familiar with the Da Vinci Machine. Right now it's very popular in
medicine. I actually worked on an earlier version of this way back when I was
collaborating with somebody at SRI in the early '90s that became this company.
But this is a robotic surgical system that people collaborate with. The robotic side is not
very rich yet beyond direct manipulation, but someday the idea of group collaboration
and mixed initiative will be very essential. And people at Johns Hopkins right now are
actually doing -- working in the motor space in mixed initiative. Very exciting area.
So let me now get back to my title, which I called Open-World Intelligence, because I've
sort of built up to this.
A theme in my group, and we have a sister group, Andrew Ng's group at Stanford that are
working in this realm. Very excited about the idea of systems that realize they are
incomplete and inadequate in the world. They have a good sense for their own
limitations. They assume incompleteness. They represent uncertainty in their
preferences, the preferences and intentions of the people they're supporting. They know
that there's a shifting set of goals, sometimes new goals they haven't realized or
understood yet. That they're in a dynamic world where there's synchronous sensing and
acting. There's different actors in the world that are entering and leaving the observable
world.
And in this world supervised, unsupervised and active learning are essential. You have to
[inaudible] multiple components of services that provide different aspects of sensing,
learning, and reasoning. This is an image from before. Deliberate about the spectrum of
quality, utility tradeoffs. For example, considering the robustness at the price of
optimality.
And some directions are looking at the idea of what we call integrative intelligence,
taking together components that in the past have been separate. They've all been kind of
like the focus of attention typically of intractable problems. You heard about speech
earlier, I believe.
Planning. Robot motion and manipulation. Localization. Vision problems. General
reasoning about the world and plans.
And we're weaving these components together and getting a sense for how they work
when you have dependencies and latency constraints.
[inaudible] the NLP component alone is a whole area of work, and they hold conferences,
ACL, for example, and NAACL.
So a lot of the work sometimes is you actually bring communities together who may have
gone to a AAAI conference, the main AI conference, maybe in 1982, but now are in a
whole different area that's just NLP, for example, or vision. Same with robotics.
So let me just mention a little bit about one particular project in this space that's a lot of
fun. We call this the Situated Interaction Project. Dan Bohus is leading up this effort. I
and others collaborate with him. And the idea was to try to build a platform that can do
what we call open-world dialogue. And we describe very clearly open-world dialogue
involves people that are coming and going, not a push-to-talk device on a directory
assistance call, for example. It also involves explicit limitations and reflections about
models that are incomplete and so on.
So we started by looking at what a receptionist at Microsoft does -- one you all probably
had to work with over the last couple of days at the front of Building 99, who we see right
here -- what the task at hand is in terms of satisfying various goals.
So we did a lot of recording with signs up to tell everybody we were recording
[inaudible] a little bit of that. We have a bunch of cameras up and we're watching
different aspects of the scene here. And we're running a tracking algorithm that can sort
of track facing, help us do tagging later.
We have an acoustical array microphone to get a sense of who's talking and where the
sound is coming from. We're looking at trajectories of entry, of groupings and
clusterings of people to understand when they're together versus separate.
We're watching carefully the expressions and timing, down to the latency of the
receptionist as she works with people. And so on.
One challenge area that came out of this work was building models of multiparty
collaborations, which is very new in the dialogue world.
The idea -- here's a receptionist, people are coming and going, some people have a goal
that's been verified, like I want a shuttle, I want to get into the building, I want to see
somebody, can you call somebody up. And the idea of understanding when people are
together versus separate, for example, and trying to build models using vision and speech
that can maybe do this kind of thing.
So we ended up building a nice platform, trying our best to create a kind of unified
experience -- not that you need to have an anthropomorphic experience here, but it's
kind of a unified experience that might do all the things that you might expect from a
receptionist someday, just to explore the space a little bit.
We built this on the Microsoft robotics platform, which let us coordinate many processes and
manage dependencies and our need for synchronicity to minimize some latencies.
We noticed that when we ran this system, even slight latencies between speech and vision
made the system seem completely out of whack, unnatural and unusable.
So debugging this was quite interesting. Lots of tools that Dan built to do this. And
there's also machine learning going on and so on.
Let me show you a quick video that Craig Mundie has shown around Microsoft a little bit
to give you [inaudible] how this works. That red dot is the gaze of the avatar. Running
on eight cores here. You see the eight cores doing their thing, different modalities.
[video playing]
>> Eric Horvitz: Now, it's unclear whether you had a better experience at the front of the
building, but there are many pieces and components here that we're pushing on here.
There's lots of good technical work going on beneath the covers and some heuristics and
some hacks which are melting away to the formality over time.
We had a platform and we've been pushing on it as a platform to write different apps to,
and we have several different apps now which help us explore what we might call the
open-world dialogue space.
So if you come to my office right now, outside my door you'll see the PAS system, which
is a new version of the receptionist which now is down the hall somewhere else getting
polished, the Personal Assistant for Scheduling. Now, PAS has access to lots of data
about my comings and goings over the years. We're using some components and tools that
have this ability. This shows the position right now by my office.
We realized we had a lot of work going on that could give brilliance and intelligence to
these systems if you just added them in. They're also, again, intractable. This is actually
a Web service that we've had up since 2002 called Coordinate. Coordinate continually
looks at all the devices and desktop usage and generates models over time to predict how
long it will be until people are at different places.
So if I'm gone for an hour and a half, you can give a SQL query to Coordinate and say
how long will it be until Eric will be in his office for at least 15 minutes without a
meeting. And it will tell you that based on the statistics of comings and goings.
How long will I be on the phone until I hang up based on my statistics of phone use.
How long will I be in stop-and-go traffic given by my GPS data and so on.
So it's the first time in the Bayesian machine learning world that we actually build -- this is
kind of fun for the computation these days -- a case library in real time with a SQL
query, do machine learning -- that graphical structure search you saw -- and
inference, all in a few seconds. A query-specific database.
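A small sketch of the "build a case library from a query, then learn from just those cases" pattern: SQLite stands in for the presence store, and a simple empirical average stands in for the learned model (the real system runs structure search and inference over the retrieved cases). The schema and query are hypothetical.

```python
import sqlite3

def setup_demo_db():
    """A tiny stand-in for the presence log that Coordinate would query."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE presence (weekday INT, hour INT, away_minutes REAL)")
    db.executemany("INSERT INTO presence VALUES (?, ?, ?)", [
        (1, 9, 20), (1, 9, 35), (1, 14, 90), (2, 9, 25), (2, 14, 110), (1, 9, 30),
    ])
    return db

def minutes_until_back(db, weekday, hour):
    """Build a query-specific case library, then 'learn' from just those cases.
    Here the model is simply the mean of the retrieved cases."""
    rows = db.execute(
        "SELECT away_minutes FROM presence WHERE weekday = ? AND hour = ?",
        (weekday, hour),
    ).fetchall()
    cases = [r[0] for r in rows]
    return sum(cases) / len(cases) if cases else None

db = setup_demo_db()
print(minutes_until_back(db, weekday=1, hour=9))   # estimate from Monday-9am cases only
```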
That was the big fanfare when we presented this at UAI a few years ago that we -- the
base store for probabilistic reasoning and machine learning was a database, not a prebuilt
machine learning model.
And so you can have standing queries of various kinds, like when I'll next read e-mail,
for example. It's kind of interesting: if I give Paul Koch access to my account, when he gets
into Outlook he can see, oh, Eric was last in Outlook six minutes ago and
he'll read e-mail within 19 minutes, for example.
Anyway, we wanted to give PAS this ability, also the ability to predict cost of
interruption at any moment, whether I'll attend the meeting or not. Those are other
projects.
But here's the experience with PAS, and I'll summarize [inaudible].
[video playing]
>> Eric Horvitz: Here's one more scenario here.
[video playing]
>> Eric Horvitz: So I'll end there. But wanted to suggest that there are some really hard
problems. Parallelizing them would be quite valuable. We have some interesting
approaches to doing some of this work, like our machine learning work in ClearFlow, which
we've already parallelized to 128 cores so we can do our cycle in time for the product. That's
a system that depends on that essentially for our ClearFlow offering. Cars are being routed
right now based on that system.
If I had more time today I'd talk a little bit about this whole area that we're very excited
about and we're looking for collaborators on, learning and reasoning in computation, how
do we actually learn to automatically distribute workload, do speculative execution. This
is a very interesting area for us given work we've done in the past that led to the
SuperFetch component in Windows right now that performs extremely well based on our
dataset and so on.
And we want to do more of this and we want to move to multicore. In some ways we've
promised Craig Mundie we'll be moving into multicore with our machine learning, and I
think some of you are already thinking along these lines and doing work in this space.
We'd love to catch up. So I'll stop there.
>> John Hart: There we go. Sometimes it's ironic to be talking about making
applications go so much faster and just bringing the computer back up takes so long.
I want to talk about dynamic virtual environments. I'm John Hart. I'm one of the PIs for
the UPCRC at Illinois. And this is work I've done in collaboration with our graphics
group in UPCRC, also our architecture group and our patterns & languages group.
Particularly Sarita Adve, Vikram Adve and Ralph Johnson, our faculty; and students Byn
Choi, Rakesh and Hyojin and Rob.
And these three are sitting right over there at some point, and I'll be standing up here
taking credit for a lot of the things they did.
In looking at great consumer applications that will get a lot of payoff from parallelism
and that should drive multicore in the future, an obvious choice is to
look at video games. And you get a spectrum of video games based on two axes. You
can either have a lot of photo realism or you can have a lot of flexibility.
And we saw this yesterday. There was a couple talks on this yesterday. You have these
photo realistic first person shooters, and as they become more flexible they become less
photo realistic, very often because you're building new environments, new objects, doing
unexpected things in these more flexible video games and environments. And you can't
take advantage of the massive amount of precomputation that's needed in order to get
these cinematic effects given current computational power and the need to deliver frames
at at least 30 frames a second.
So what we want to do is we want to make games as flexible as Second Life and other
social online network games, but we want them to be more photo realistic, more realistic,
more lush, and in doing so also -- if we can do this, we can avoid the need for
precomputation, and that will make video game titles faster to produce and cost less as
well.
And so just an example. This one I'm particularly proud of. This is a collaboration I
had -- I was fortunate to have with Microsoft Research back in the summer of 2002, the
student that was here. And it was on getting precomputed radiance transfer to work on
the GPU. And this is what video games -- in this case, the Xbox 360 -- had to do to get
effects.
You know, like you see in Halo or like you see at the light shining through this bat wing,
and these effects -- it's a precomputation that you do at every vertex on the shape, and the
precomputation says that whatever light you have in your environment, whatever light
function you have in a large sphere around that vertex, that light's going to change by the
time it actually reaches that vertex, because things will get in the way, some of the light
may enter a surface and scatter around and reach it, and other light may interreflect in other
places.
And so we can specify this light as a 25-element vector that specifies the function over that
sphere. And all the stuff that happens to the light from the large scale to the small scale
that gives us all of these effects is basically this 25-element by 25-element
transformation matrix.
So 625 numbers that need to be stored at every vertex. And that's fine as long as the
object doesn't move. And if the object moves and you have to parameterize things again,
and you have a -- these things take 80 hours to compute just for a still object. And when
you have dynamic scenes, they can take even longer.
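To ground the arithmetic in that description: at runtime, per-vertex shading with precomputed radiance transfer reduces to multiplying a 25-element lighting vector by each vertex's 25-by-25 transfer matrix. The sketch below uses random numbers as stand-ins for real spherical-harmonic coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

n_vertices = 1000
# Precomputed per-vertex transfer: how incident light over the sphere turns into
# exit radiance at that vertex (25 x 25 = 625 numbers per vertex, as in the talk).
transfer = rng.standard_normal((n_vertices, 25, 25))

# Environment lighting projected onto 25 spherical-harmonic coefficients.
light_sh = rng.standard_normal(25)

# At runtime the per-vertex shading is just a matrix-vector product, which is why
# the expensive part is the offline precomputation, not the per-frame cost.
exit_radiance_sh = transfer @ light_sh          # shape: (n_vertices, 25)
per_vertex_intensity = exit_radiance_sh[:, 0]   # DC term as a crude brightness proxy
print(per_vertex_intensity.shape)
```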
So this is the kind of thing that needs to go into making modern day video games look
photo realistic like what you're seeing on the cinema; whereas when you're going to a
movie, you can take arbitrarily long to generate a frame of video or a frame of film at a
movie. But you've got to generate that frame uniquely, 30 frames a second, when you see
it in a video game.
The other thing I'm really proud of is my only real contribution to this work was I drew
this picture which ended up in the MSDN documentation for the technique. Sometimes
it's nice to see that stuff happen and get used.
And so if we look at the modern day, we have techniques like ray tracing, and we have
enough cores, enough parallelism, that we can enable direct ray tracing, realtime ray
tracing of scenes. And so here's an NVIDIA demonstration -- Intel's going to have these
too -- of a car riding down the street and everything's ray traced.
And that's good except that what they're demonstrating is direct ray tracing, and direct
ray tracing is rather easy. This is just tracing a bunch of rays coming from the eye; they
share the same location -- the same starting point -- and they're all nearly parallel, all
emanating in order, so that all the geometry access and all the memory
access is coherent.
And the same thing's happening with the light. You've got rays coming out of the sun
and you get this nice hard shadow underneath the car. This is a photograph of another --
it's a micro Porsche, I guess. And if you look at the reflections here, you're seeing perfect
mirror reflections, although a little bit bumpy from the surface geometry. This thing is
basically a mirror that's been warped into the shape of a car.
You know, actual car reflections are a bit glossier when you have rays of light hitting
them or when you have eye rays, lines of sight hitting them. They spread out. And it's
that spread that makes things much more difficult. And in general the thing we really
need ray tracing for, the thing that we use precomputed radiance transfer for for video
games is global illumination.
And those effects are really hard. You know, after the first bounce you get a mess
of rays going -- starting from arbitrary points in arbitrary directions -- and you no longer
have coherent access to the geometry, and you lose cache locality.
And so we did some experiments with this. We had this rendering of a car, for example.
And we're ray tracing it. And this is a GPU ray tracer. And when you ray trace on the
GPU you have to do things in phases, so you send all your rays out from the eye and they
hit an object and then once all the rays have arrived at an interaction, then you go through
and you do all your shading.
And if -- you know, the nice thing about rasterization, and the reason that shaders have all
run on rasterization hardware, was that when you rasterize a triangle, all the pixels on the
triangle were running the same shader. They may look different depending on the
outcome of that shader, but they're all running the same program. And so SIMD
performance was really good. And we see that same SIMD performance is pretty good
on these eye rays.
But after that first interaction and the rays start bouncing around, and this is a path tracer
generating this, we start to lose that coherence. And all 32 elements of an NVIDIA
SIMD processor were running one shader in the blue sections of this car image. After
that first bounce, neighboring pixels, the rays that started close to each other have
diverged and are executing completely different shaders.
And in these red regions we have, you know, up to 16 different shaders being run. And
on a SIMD processor you get a lot of divergence. And even though you've got 32 different
processors running that, they're going to take 16 times as long because of the
serialization.
So you get that incoherence leads to SIMD divergence. And one of the things that's
painfully obvious is that we're always going to have these SIMD units in our multicore
architecture because they're so cheap and give us such a great performance boost when
we use them correctly.
So last summer we started working on this and we did a little bit more work throughout
the year on the idea of shader sorting and basically tracing rays through a scene and
accumulating all the shader requests and then trying to sort the shader requests so that if
you have a SIMD vector of 16 or 32 elements wide, that you're trying to pass all the same
shader program requests to that one SIMD vector to avoid this divergence and to avoid
the serialization that you get.
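Here is a toy illustration of that sorting idea: collect the ray-hit records, sort them by shader, and emit warp-sized batches so each SIMD-wide batch runs a single shader. It is plain Python for clarity, not GPU code, and the warp size and record layout are assumptions.

```python
from itertools import groupby

def make_batches(hit_records, warp_size=32):
    """Group ray-hit records by shader so each SIMD-wide batch runs one shader.

    `hit_records` is a list of (shader_id, ray_id). Without sorting, a warp of
    consecutive hits can reference many shaders and serialize; after sorting,
    most warps are uniform."""
    ordered = sorted(hit_records, key=lambda rec: rec[0])
    batches = []
    for shader_id, group in groupby(ordered, key=lambda rec: rec[0]):
        rays = [ray for _, ray in group]
        for i in range(0, len(rays), warp_size):
            batches.append((shader_id, rays[i:i + warp_size]))
    return batches

# Toy example: 8 hits over 3 shaders, warp size of 4 for readability.
hits = [(2, 0), (1, 1), (2, 2), (0, 3), (1, 4), (2, 5), (0, 6), (2, 7)]
for shader_id, rays in make_batches(hits, warp_size=4):
    print(f"shader {shader_id}: rays {rays}")
```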
And so if you don't do anything, if you just use a big switch, if your shader is -- you
know, if I need shader 1, then run shader 1, or else if I need shader 2, run shader 2, then
you'll get divergence and you'll get serialization. And for simple scenes with simple
shaders, you know, this is a glass Stanford bunny in a red-green Cornell box, that turns
out to be the best way to go.
Because the shaders are so simple, we could actually run faster if we sorted the shaders
here, but the overhead from the sorting makes it too expensive. But for almost all the other
scenes
that we tried, this car was 16 shaders, this Cornell box with this scanned Lucy statue has
procedural texture for the stone and procedural texture for the floor.
So it's just four or five shaders here, but these procedural ones are very expensive. And
same for the Siebel Center staircase with procedural copper texture. And this other lab
scene that has some 20 or 30 shaders, none of them procedural.
In all of these cases we found that it was much more efficient to sort the shaders and send
them to the -- and send packets of the same shader to the SIMD vector units. Even with
the expense of doing the sorting it became much -- in this case a little bit faster than just
paying the serialization price.
And in many cases, especially with procedural elements where you have to run a shader
program for a long time, the benefit becomes much greater. In fact, we can render these
procedural guys as fast as the nonprocedural guys using this shader sorting.
And so that's part of it, is running -- is trying to efficiently shade these objects as they're
being ray traced. That's one thing we've looked at.
Another is the fact that if we're building these dynamic network online virtual
environments, we want to build things, we want to be dynamic.
The scene can be changing and people can construct their own geometry and do
unexpected things. And when you're ray tracing, part of the efficiency of ray tracing is
you don't want to intersect every ray with every triangle. That just ends up being too
slow regardless of how fast your computer is. You want to use data structures that will
help you narrow down a likely set of rays intersecting a likely subset of the geometry.
And so we need these spatial data structures in order to accelerate ray triangle
interactions, to take collections of photons scattered from the light source and gather
them into rays that make it to the eye in order to classify rays so that we can get likely
bundles of rays, and that ends up being a five-dimensional structure where you've got the
three-dimensional anchor of the ray and the two-dimensional latitude and longitude of the
ray direction.
If you have a bunch of scattered points from a laser scanner and you want to represent it
as a surface, if you want to scan yourself in as your own avatar using cameras or using
some of the vision techniques we saw yesterday and you want to reconstruct that as a
surface instead of a bunch of points, you want these spatial data structures so you can
find the closest point to a given sample point. And also collision detection, and these
things are also commonly used in vision and machine learning.
So we wanted to come up with efficient parallel techniques for building these things.
And that's so far called ParKD. And just to differentiate, you know, we have 20, 30, 40
years of history of doing parallel algorithms largely in scientific domains, and that's good
but scientific doesn't pay as well as consumer does. That's a lesson we learned in
graphics about ten years ago.
And in scientific domains, we have n-body simulations where every body is influencing
every other body. We have molecular dynamics where all the bodies are about the same
size and you've got water uniformly distributed. And so they set up a certain class of
spatial data structures.
But in graphics we tend to have objects or polygons or geometry distributed on sort of
these submanifolds, these two-dimensional surfaces. And so we found that the KD-tree
tends to be one of the better choices in that case.
But there's all sorts of applications. And it turns out KD-trees can be manipulated to
support all of these things. And Rob Bocchino is one of the people on our project
working at generalizing the results we get for KD-trees to work with a bunch of different
spatial data structures.
And so we have the uniform grid or the hash table, and I noticed Hugues Hoppe is here
from Microsoft Research; he did a great job getting this spatial data structure to work
efficiently in parallel a couple years ago.
And there's quad trees. And one of the things that's interesting to note is you can have
region trees or you can have point trees. And likewise you can have KD-trees and so on.
And KD-trees can be organized as point trees -- this is how they're used in photon maps --
or they can be organized as region trees, and this is how they get organized for
geometry. The kind of structure we're going to focus on is this region tree for
organizing mesh triangles -- objects that people have constructed in online environments.
And if you do a bad job of constructing these trees, you get an imbalanced KD-tree and
you've really lost the advantage of your spatial data structure.
And so the challenge then is how do you build one of these trees quickly, efficiently in
parallel when everybody's trying to -- all the processors are trying to manipulate the same
central data structure simultaneously.
And there's -- we looked at some of the previous work. Intel has this nice algorithm that
runs on the CPU, and Microsoft Research Asia has this nice algorithm that runs on the GPU
for constructing KD-trees. And the CPU one's 4 cores. I guess it'd be 16 cores if you count
the SOC. And it's generating this in about six seconds. And 192 GPU cores generate this in
about six seconds. So we're finding that a lot of GPU processors run about as fast as a
few CPU processors. And there's a lot of other subtle differences between those
algorithms as well.
And so the idea of how KD-trees -- how spatial data structures -- help us render
fast: you have these partitioning planes, and you separate the geometry into triangles
that are on one side of the plane or the other side of the plane, and then they create a
hierarchy.
And so if we have a ray that we want to intersect with this geometry, we see that the
ray starts on one side of the plane and ends on the other side of the plane. So we intersect
the geometry on the first side of the plane first, and if there's no intersection, then we
intersect the geometry on the other side of the plane.
And so building these trees is pretty easy. You just need to find the best splitting plane.
Given your set of triangles, you split the triangles to be on one side or the other side of
the plane depending on where they fall, and then you recursively build the left side -- you
take the triangles on the left side and further subdivide them -- and then subdivide the
right side's triangles and add splitting planes for them.
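A compact sketch of that build recipe -- pick a plane, partition the triangles by side, recurse -- using a simple spatial-median split for brevity; the surface area heuristic used in the actual work is discussed next.

```python
class KDNode:
    def __init__(self, triangles=None, axis=None, split=None, left=None, right=None):
        self.triangles, self.axis, self.split = triangles, axis, split
        self.left, self.right = left, right

def build_kdtree(triangles, depth=0, max_depth=12, leaf_size=8):
    """Top-down KD-tree build over triangle extents.
    Uses a spatial-median split on the current axis for brevity; a production
    builder would choose the plane with the surface area heuristic instead."""
    if depth >= max_depth or len(triangles) <= leaf_size:
        return KDNode(triangles=triangles)
    axis = depth % 3
    centers = sorted((min(v[axis] for v in tri) + max(v[axis] for v in tri)) / 2
                     for tri in triangles)
    split = centers[len(centers) // 2]
    # A triangle straddling the plane goes to both children (duplication keeps this simple).
    left = [t for t in triangles if min(v[axis] for v in t) <= split]
    right = [t for t in triangles if max(v[axis] for v in t) > split]
    if not left or not right or len(left) == len(triangles) or len(right) == len(triangles):
        return KDNode(triangles=triangles)     # degenerate split; stop here
    return KDNode(axis=axis, split=split,
                  left=build_kdtree(left, depth + 1, max_depth, leaf_size),
                  right=build_kdtree(right, depth + 1, max_depth, leaf_size))

# Each triangle is a tuple of three (x, y, z) vertices.
tris = [((i, 0, 0), (i + 1, 0, 0), (i, 1, 0)) for i in range(100)]
root = build_kdtree(tris)
```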
The real tricky part for doing this in parallel, it turns out, is finding the best splitting
planes for a given set of triangles.
So choosing a splitting plane. Turns out that computer graphics looking at this over the
last couple decades has found the surface area heuristic to be really good for figuring out
where to put the splitting plane to get an even split of triangles that's efficient for ray
intersection.
And basically the surface area heuristic says that wherever you put the splitting plane, you
want to minimize the number of triangles on the left side times the surface area of the
left side's bounding box, plus the number of triangles on the right side times the surface
area of the right side's bounding box. And adding those together gives you, for our cases,
a simplified version of this surface area heuristic.
And so we want to try that at these events as we sweep from left to right -- you know, where
we're changing the number of triangles. And so we need to count the number of triangles to
the left and the number to the right, and then measure the surface areas.
And so you can do this. There's a nice streaming algorithm that Ingo Wald came up with
in 2006 with some collaborators that creates three sorted lists and then sweeps along each
of these three sorted lists and keeps a count of how many triangles to the left and to the
right that you have.
And this thing can be implemented pretty easily using some prefix scan operators. And
so we did this.
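Here is a one-dimensional sketch of that sweep: sort the triangle extents, then walk the candidate planes while keeping running counts of triangles on each side, scoring each plane with the simplified cost N_left * area_left + N_right * area_right. Interval length stands in for box surface area, so this illustrates the counting pattern rather than Wald's full algorithm.

```python
def best_sah_split(intervals, bounds):
    """Sweep candidate planes along one axis and return the plane minimizing
    N_left * area_left + N_right * area_right (simplified SAH).

    `intervals` are (lo, hi) triangle extents on this axis; `bounds` is the
    (lo, hi) extent of the node. 1-D 'area' = length, just for illustration."""
    events = sorted([(lo, "start") for lo, _ in intervals] +
                    [(hi, "end") for _, hi in intervals])
    n_left, n_right = 0, len(intervals)
    best_cost, best_plane = float("inf"), None
    for position, kind in events:
        if kind == "end":
            n_right -= 1                      # this triangle ends: no longer to the right
        area_left = position - bounds[0]
        area_right = bounds[1] - position
        cost = n_left * area_left + n_right * area_right
        if bounds[0] < position < bounds[1] and cost < best_cost:
            best_cost, best_plane = cost, position
        if kind == "start":
            n_left += 1                       # this triangle starts: now lies to the left
    return best_plane, best_cost

intervals = [(0.0, 1.0), (0.2, 1.5), (4.0, 5.0), (4.2, 6.0)]
print(best_sah_split(intervals, bounds=(0.0, 6.0)))   # plane at the edge of the gap between clusters
```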
And that worked pretty well. And it actually revealed some patterns for processing
spatial data structures, for processing hierarchies. And so we've decomposed these into
this geometry-parallel pattern and this node-parallel pattern.
And so at first, if we're creating a hierarchy, the classic way of creating a
hierarchy on multicore parallel computers -- usually it's just a serial algorithm -- is to keep
subdividing until you've got one subtree per processor. And once
you've got one subtree per processor, then each processor can work on its own subtree
independently of the others.
And so you've got this node parallel process where the number of nodes equals the
number of processors at that given level. And that's pretty easy to parallelize. It just
becomes task parallel at that point.
The tricky part is up here, and this is where we find geometry parallelism, where you
sweep through all of the geometry, all of your data, and then you find the midpoint or the
split point where you want to split -- whether it's the spatial median, the object-count
median, or, in our case, the surface area heuristic -- and then you create that split.
And at each point in the level you've got -- if we're doing geometry parallel, we have all
our processors working by, in our case, using these scan primitives in order to find the
separating point by streaming through all of this geometry and all of your data and then
finding this split point.
And so all the processors are processing all of the data at each level in one stream, and
they're just inserting these split points wherever they belong.
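A rough sketch of that two-phase pattern: near the root all workers would cooperate on a single split (stubbed here with a plain median), and once there is one subtree per worker, each worker builds its own subtree independently. Python's concurrent.futures stands in for the task runtime; the project itself used TBB.

```python
from concurrent.futures import ProcessPoolExecutor
from statistics import median

def find_split(points):
    """Geometry-parallel phase stand-in: in the real system every worker scans a chunk
    of the data and partial counts are combined; here we just take the median."""
    return median(points)

def build_subtree(points, depth=0):
    """Node-parallel phase: one worker owns this subtree and recurses serially."""
    if len(points) <= 4 or depth > 16:
        return sorted(points)
    split = find_split(points)
    left = [p for p in points if p < split]
    right = [p for p in points if p >= split]
    if not left or not right:
        return sorted(points)
    return {"split": split,
            "left": build_subtree(left, depth + 1),
            "right": build_subtree(right, depth + 1)}

def parallel_build(points, n_workers=4):
    """Split at the top until there is one subtree per worker, then hand each
    subtree to its own process."""
    subtrees = [points]
    while len(subtrees) < n_workers:
        biggest = max(range(len(subtrees)), key=lambda i: len(subtrees[i]))
        chunk = subtrees.pop(biggest)
        split = find_split(chunk)
        left = [p for p in chunk if p < split]
        right = [p for p in chunk if p >= split]
        if not left or not right:              # degenerate split; keep the chunk whole
            subtrees.append(chunk)
            break
        subtrees += [left, right]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(build_subtree, subtrees))

if __name__ == "__main__":
    import random
    pts = [random.random() for _ in range(10_000)]
    print(len(parallel_build(pts, n_workers=4)))
```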
And so the interesting thing is Moore's law, which, you know, in the past has said that a
processor gets twice as fast every year and a half or two years. Processors aren't getting
any faster, but we're just getting more of them. So we're doubling the number of
processors every year and a half or every two years or, you know, every -- you know,
every certain amount of time, which means that this origin line is going to be descending
every couple of years.
And that means that most of the previous algorithms we've looked at focus on this section
and do something very approximate up here. And this is the important part. This is
what's going to play a big role when -- this is going to dominate the problem when we
have hundreds of processors. And thus far we've spent a lot of time down here and very
little time in the green section.
And so if we look at the past contributions, the recent work, and what we've done: in the
past couple of years there have been several multicore KD-tree construction algorithms.
And in the top half they've done approximations, they've avoided the surface area
heuristic and just looked at the simple median or the triangle number median or the
spatial median or some approximation of the surface area heuristic, and then focused on
doing either an exact surface area heuristic or some approximate binned surface area
heuristic in the bottom half of the tree.
And those have been very effective, but they're not going to scale well when the number
of processors grows very large and we're spending more time in the top half of the tree
and less time in the bottom half of the tree.
And so that's what we've been focusing on, working on that top half of the tree, and so
we have techniques that work in place, and then we have a simpler technique that just
uses simple nested parallelism.
And we wanted to focus on an in-place algorithm partially because, as we increase the
number of cores, these cores are going to
become more and more remote; they're going to require more and more cache locality.
And so we want to avoid moving memory around as much as possible as we're
processing these things.
And so this in-place construction algorithm, like Ingo Wald's algorithm, we presort in X,
Y, and Z. It's basically a slight variation of Ingo Wald's original formulation. And then
as we're cycling through all of the nodes -- or all of the events, all of the triangle left-right
extents and counting the number of triangles to the left and counting the number of
triangles to the right, when we go to put the triangle in a -- the record of that event either
on the left side or the right side of the tree, instead of moving the memory to the left side
list or to the right side list, we just leave everything in place and we just put a pointer to
the node that that triangle is currently in.
And that works great except when you have a KD-tree some of these splitting planes are
going to go through triangles and so the triangle -- a portion of the triangle will be on the
left side of a plane and a portion of the triangle will be on the right side of the plane.
And then we're stuck. We either need to have multiple tree pointers in that, which means
we need to either have a dynamic memory allocation or somehow expand the amount of
memory at each point by a fixed amount, or we store the record of that node in some
higher tree node.
And we haven't still decided which is better. We're still doing a bunch of experiments to
try to figure that out.
The other thing we're investigating is inspired a bit by a talk that Tim Sweeney gave at
Illinois as part of Sanjay Patel's Need for Speed seminar series on the impact of
specialized hardware programming on programming time. Tim Sweeney helps write -- he's
in charge of writing video game engines and developing video games. And he pays
very close attention to how efficient his programmers are at doing this.
And he mentioned that when his programmers are doing multithreaded programming, you
know, straight multicore programming, it takes them twice as long as when they're
writing just a serial program. When they're working on the Cell it takes them five times as
long; when they're doing GPGPU programming, it's ten times as long or worse.
And he also notes that anything over twice as long is going to put them out of business.
So his concern is that as we're -- you know, that we not focus entirely on efficiency. He
needs things to go faster but they don't need to go as fast as possible, especially if it takes
people ten times as long in order to eke out that last little bit of performance.
And so based on that concern, we also built a straightforward, simplified parallel
KD-tree build using nested parallelism in TBB. It just does a straight-ahead,
full-quality SAH computation, and we wanted to do some experiments to see, based on
just that simple straightforward implementation, how well it performs.
Let me go back one. And that was -- that example, for example, we get a speedup of
about six times on 16 cores. Ideally you'd want 16 times speedup. But given the
constraints of programmer productivity, if you can get that done quickly and ahead of
schedule, a 6x speedup still justifies the expense of the additional processors.
>>: Can I ask you how long it took to code these?
>> John Hart: Byn, do you guys have numbers for how long it took you guys to code
that up?
>> Byn Choi: The simple one, I'd say a week or less than that maybe.
>>: Compared to the --
>> Byn Choi: Well, actually -- okay. The simplest one, once we had the sequential
version parallelizing it was about a day or two using TBB. But [inaudible].
>>: So if you went to 32 cores would you get a 12x speedup? I mean, as long as you're
on that curve, it's probably okay.
>> John Hart: I think so, yeah. Yeah.
>> Byn Choi: You don't even need to be on a linear curve. Because if you look at the
way processors use transistors, they didn't get a linear increase in performance versus
transistor [inaudible]. So the economics of multicore era, you only need to get maybe
square root performance. So it's kind of a stronger justification for what you're doing
here.
>> John Hart: Yeah. There's two sides of this. There's the high performance side where
you want to eke out -- you want your 16x speedup factor. Then there's the scalability
side that says, you know, I'm going to buy a computer that has twice as many processors;
I want my stuff to run at least faster.
>> Byn Choi: [inaudible] addressing is trying to keep the industry on the track that it
used to be on.
>> John Hart: Let me just conclude. We're starting to gear up to make what I'm
tentatively calling McRENDER, which is a Many-core Realtime Extensible Network
Dynamic Environments Renderer. For online social environments, kind of like OpenSim
which we saw yesterday, that leverages ray tracing specifically for these kind of back-end
global illumination things. And uses Larrabee's software rasterizer as a programmable
rasterizer to get a lot of the same effects we get with ray tracing on the front end, and
depends critically on this parallel KD-tree for a lot of the features in order to get things to
run fast enough to be able to deploy in real time on upcoming multicore architectures.
Okay. That's it. Thanks. Oh. Question.
>>: So I'm a big fan of BVHs, and it seems like they have a lot of nice properties for
these kinds of problems because they're more forgiving than KD-trees: you can deform
geometries, and in constructing a BVH you don't have to
insert disjoint regions, so the parallelism should be easier than constructing a KD-tree.
>> John Hart: Yeah, yeah, yeah. And there's not really much difference. The whole
point we looked at this was we wanted to look at hierarchies, spatial data structures in
general. And so we needed to settle on one, like a KD-tree.
But the bounding volume hierarchy, you still have -- I mean, with a KD-tree you still
have a hierarchy as well. The big difference with bounding volume hierarchies is they
tend to get built bottom up instead of top down. You tend to surround all of your
geometry with bounding volumes, and you tend to merge the bounding volumes from
bottom up as opposed to a KD-tree where you take a look at all your geometry and split it
top down. That's on the horizon too.
>>: [inaudible]
>> John Hart: The what?
>>: I've built them both ways.
>>: Any other questions? Let's thank John.
[applause]
>> Ras Bodik: Okay. I'm Ras Bodik. I'm one of the co-PIs working on the parallel
browser project, and I want to start by introducing Leo, a grad student in the white shirt
in the back. He's excellent at answering the really hard questions. I'm not saying that to
discourage them; I'm saying that so that when I redirect them it's not because I don't have
the answer, it's because he has a better one.
So what this project wants to do, one of the goals and perhaps the key goal, is to run a
Web browser on maybe a one-watt or smaller processor with full parallelism. And I'll try
to start the talk by saying why this may be interesting.
So Bell's Law is the corollary of Moore's law. And it says that as long as the transistors
keep coming, meaning shrinking, there will be a new computer class appearing with
certain regularity. And the computer class will reach new people with new applications
and redefine the industry leaders.
And we are now at the stage where we are making a transition from the laptop computer
class to handset computer class.
And something is very different now than it used to be, because in all these computer
classes, the software that you could run on the new class was essentially the same
software you could already run on the previous class. In some cases it was the same as on
the laptops; in some other cases it was a variation, but even if you ran a different
[inaudible] you still could run the previous one.
But it's different now because the single thread performance of these processors is not
improving, so the efficiency of the handset computers is essentially much smaller
because they have much less energy available to them and they can dissipate much less
heat.
So the power wall means that we need to write different software for these handhelds,
because the software of laptops doesn't run well, and it's never going to run efficiently
on these handsets, unless you are willing to wait very long and optimize the software
considerably.
So single thread performance is not improving. In fact, it may go a little bit down to
improve energy efficiency. But energy efficiency is still getting better, about 25 percent
per [inaudible] generation, so that's 25 percent per two years.
And that means that every maybe four years you will get -- you will double the number
of your cores. So you get more cycles, but you are not going to get better single-thread
performance.
So in order to get better performance, the software on these handsets will have to be
parallel as well.
To convince you that this is not just a supplement to existing classes but indeed a new
computer class, it probably makes sense to look at the output alternatives. And starting
here you have flexible displays which are one possibility. Here we have bigger
projectors in phones which already are on the market.
They are not as small as we would like them to be, but eventually they'll be so small that
you will be able to hide them in this computer class and enable applications like this one
where you are looking at the engine, you are repairing it, and you see a superimposition
of the schematics of the engine which helps you navigate your work, and the same you can
do with navigation when you are working in a foreign city and so on and so on.
So why parallelize the browser? What does the browser have to do with this future
computer class? After all, the browser is just one application. And it's not really an
application. It's an application platform. It's a way of writing and deploying
applications. And programmers
like it because the applications are not installed on the browser but they are downloaded,
such as Gmail, and the JavaScript language and the HTML standards provide a portable
environment. So no matter what platform you have, as long as you have a browser, you
can run it, with minor differences.
It's also a productive environment, because the scripting offers a high-level, dynamically
typed environment in which you can easily embed a new DSL, and the layout engine that
is in the browser is [inaudible] and makes it easy to build a new user interface.
But on handhelds the browsers don't perform really well. And in fact application
development on the iPhone, and on phones in general, is different than on laptops. On
laptops many new applications are written as Web browser applications, probably almost
all of them. But on the phone you write them on the iPhone as [inaudible] or in Android,
which is a considerably lower-level programming model. And you do it because that's a
much more efficient way of writing applications.
And to tell you why people don't use the browser: if you load Slashdot on this laptop, a
relatively lightweight page, it may take about three seconds to download, lay out, and
render. On the iPhone it would be seven times more. And hardware people tell us that
once things settle down, the fact that this processor has about 17 watts and the handheld
one will have half a watt will probably translate to about a 12x slowdown in single-thread
performance for the phone.
So even if you optimize existing browsers and there is a lot of room for improvement, the
programmers of Web applications will still push the boundaries such that on the laptop
these Web pages are fast enough, probably two, three seconds, maybe five, multiply that
by ten, and you see why these same applications, Web applications are not going to run
fast enough on the phone.
So parallelism is one way of improving the performance of the browser. It's not the only
way. There are still sequential optimizations, there are other ways such as running part of
the computation on the cloud, but the cloud has limited power because it is quite far away
in terms of latency and it's not always connected. So parallelism is one way.
Parallelism solves at least one of the two problems that you have, and that's sort of the
responsiveness, the latency. You can get the 21 seconds down to maybe 2 seconds if you
can parallelize it tenfold.
It doesn't solve the energy efficiency problem really well because you still do the same amount of
work, and so the battery lifetime is not going to be improved as it would be if you did
sequential optimizations. But this is why parallelism is still one of the weapons that you
may want to use.
And parallel browser, however, may need a slightly different architecture than existing
browser has, and that's in particular because of the JavaScript which is a relatively
unstructured language. There are many go-tos, and I'll try to illustrate them later in the
talk, and these may need to be resolved.
So I'll try to talk about the anatomy of the browser, but if there are questions about the
animation and whether this all makes sense, this may be a good time.
Okay. So what do you want to parallelize. So what is a browser. You have a cloud of
Web services which the browser accesses. It loads a page and checks whether it's a page
or an image. If it is a page, you decompress it, do lexical analysis, syntactic analysis,
build a DOM, which is an abstract syntax tree of the document, and that's the front end.
Essentially a compiler.
Then you lay out the page and then you render it on a graphics card, and that's your
layout part.
And then there is the scripting, which provides interactivity and in fact makes the browser
a general application platform. And there is the image path also: if what you download is
an image, it goes sort of on the side, gets decoded, and then goes into the layout together
with its dimensions, and eventually it is rendered on the page.
So here is the scripting which listens to the events such as the mouse and the keyboard
but also listens to the changes to the DOM. For example, the page may keep loading
forever, the browser -- sorry, the server may keep loading the rest of the page later
incrementally and the script will react to new nodes being attached to it.
The scripts of course may modify the DOM and in fact they do because this is how
scripts print something on the screen, by adding new boxes and new text and new images.
And of course the layout also modifies the DOM because once you lay it out, you
compute its real coordinates on the screen, these are written there and the script can read
them again, so there are a lot of data dependencies over there. And there's more, because
the scripts can request more data from the servers.
So that's the third leg, the scripting. So where does the performance go? And there isn't
a single bottleneck that, if you optimized it, would make the problem go away. So the lexing
and parsing may take 10 percent, sometimes more. It really varies with the application
quite a bit. Building of the DOM is another 5 percent.
The layout in general is the most expensive component. This is sort of the [inaudible]
and DMIPS of the Web browser. And this may be 50 percent. And then there is the actual
work with the graphics -- and I think fonts also fall here, right, Leo? Turning fonts
into rasterized images also falls there.
So all of it needs to be parallelized and the problem is that many of these algorithms are
not algorithms for which we already have parallel versions. We need to invent them from
scratch, because so far all of these have been typically implemented in a sequential fashion.
For the layout component here, Donald Knuth would even tell you that
multicores are not useful for [inaudible] at all. And in fact he wrote his algorithm in a
very sequential fashion, so squeezing parallelism out of it is nontrivial as we learned.
So what have we done. We have developed some work-efficient parallel algorithms for
some of these problems. Work efficient means that the parallel algorithm doesn't do
more work than the sequential one. Sometimes parallelization is easier if you can do more
work and then you throw some of it out if it turns out that you don't need it. But this
would hurt energy efficiency. So we are going after work-efficient ones.
And we've done it for layout, which has two parts. The first one essentially does a
parallel map over a tree -- I'll show you the algorithm -- and the other one turns an
in-order traversal over a tree into a parallel one.
And then I'll show you one algorithm from the compiler phase from the lexing, how to
take an inherently sequential algorithm and parallelize it.
So then on the scripting side, we spend a lot of time trying to understand the domain and
what programming model would be suitable and why JavaScript may not be the right
thing to parallelize.
And so we looked at programmer productivity also, what abstraction the programmers
may want to have, and we go from the callbacks that are at the heart of JavaScript, the
heart of AJAX, to actors. And I'll show you small examples. And then what do you need
to add into the language to get better performance.
So let's look at lexing. Now, lexing is, you know, a problem that is relatively easy to
study and you know it from your compiler classes. You have a string of characters here
and you have a description of your lexemes and tokens, so you have a tag, this is the
HTML tag, so two parentheses with some stuff inside. Here is the content and here is the
closing parentheses.
And the goal is to go through the string and label it as whether it belongs to this token or
this token or that token, and the way it is done is via a regular expression and from that a state
machine. And the state actually tells you whether this belongs to that token or the other
one.
So the goal is really to go through this string and label these guys with states of the state
machine.
The problem of course is that if you break it into pieces of parallel work, what you see is
that you cannot really start scanning this using the state machine because you don't know
what the previous state is, and you need it in order to find the next state.
So you can only scan this once you know this state here. So this is the inherently
sequential dependence.
So here is an observation that makes parallelization possible. And it's sort of specific to
lexing, so it doesn't apply to arbitrary parallelization of finite state machines. But the
observation says that pretty much no matter which start state you start from in that state
machine you will eventually converge to a correct state after a few characters.
So whether we start from orange or red or yellow here, you reach a convergence. And
this is because in the regular expressions or the state [inaudible] automata that arise in
lexing, you have one or more sort of start state into which the regular expression comes
back when it is done with a particular token.
And usually these segments are not too long. So this leads to a parallel algorithm that works
as follows. You first partition the string among parallel processors in such a way that you
get these K characters of overlap. And the K characters are what we are going to use to get
from a state that is suitably but more or less arbitrarily chosen into a state that is correct
most of the time.
So we think that's the way. And now we scan in parallel. And in some cases we are
correct; in some cases we are not. So we are correct here and here and here, meaning that
the state that we reach after the K characters turns out to be the same as the state this
automaton reached, even though it doesn't look like it in this color, but let's assume that it is.
And so are these and these. But we did not guess correctly here, and so this parallel work
here needs to be redone, because our way of guessing the start state for that segment
didn't work out, so we redo this work and we have a correct result.
What I should point out, if there was a misspeculation somewhere here, you wouldn't
have to redo the entire rest of the work, just the segment, as long as by the end of this
segment you are in the same state where you would be in the sequential algorithm.
So this is a way of obtaining parallelism through speculation in an inherently sequential
algorithm by looking at a suitable domain property.
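As a rough illustration of the speculative scan just described -- not the project's actual code -- here is a minimal C++ sketch. The toy two-state DFA, the chunking, the overlap length K, and the verification loop are all assumptions made to keep the example self-contained; token emission is omitted and only the DFA states are tracked, and plain std::thread stands in for whatever runtime was really used.

#include <algorithm>
#include <string>
#include <thread>
#include <vector>

// Toy DFA standing in for a real lexer table: state 0 = outside an HTML tag,
// state 1 = inside a tag. A real lexer has many more states; this is only to
// make the sketch self-contained.
using State = int;
static State next_state(State s, char c) {
    if (s == 0) return c == '<' ? 1 : 0;
    return c == '>' ? 0 : 1;
}

struct ChunkResult { State at_begin; State at_end; };

// Scan [begin, end), warming up on the K characters before `begin` so the
// guessed start state has a chance to converge to the true state by `begin`.
static ChunkResult scan_chunk(const std::string& in, size_t begin, size_t end,
                              size_t K, State start) {
    State s = start;
    for (size_t i = (begin < K ? 0 : begin - K); i < begin; ++i) s = next_state(s, in[i]);
    ChunkResult r{s, s};
    for (size_t i = begin; i < end; ++i) r.at_end = s = next_state(s, in[i]);
    return r;
}

// Speculative parallel scan: every chunk guesses its start state, then a cheap
// sequential pass checks the guesses and redoes only the mispredicted chunks.
State parallel_lex(const std::string& in, size_t chunks, size_t K) {
    size_t n = in.size(), step = (n + chunks - 1) / chunks;
    std::vector<ChunkResult> r(chunks);
    std::vector<std::thread> pool;
    for (size_t c = 0; c < chunks; ++c)
        pool.emplace_back([&, c] {
            size_t b = std::min(n, c * step), e = std::min(n, b + step);
            r[c] = scan_chunk(in, b, e, c == 0 ? 0 : K, /*guess=*/0);
        });
    for (auto& t : pool) t.join();

    State carry = r[0].at_end;                       // chunk 0 is always exact
    for (size_t c = 1; c < chunks; ++c) {
        size_t b = std::min(n, c * step), e = std::min(n, b + step);
        if (r[c].at_begin != carry)                  // misspeculation: redo this chunk only
            r[c] = scan_chunk(in, b, e, /*K=*/0, carry);
        carry = r[c].at_end;
    }
    return carry;                                    // final DFA state for the whole input
}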
So here is how it works on a Cell, which has six cores. And as you go up this is adding
more cores of the Cell. Of course the interesting file sizes are here for current Web
pages, and we could improve performance by tuning it a little bit more. It wasn't tuned
up really much for the small sizes.
But already at this page size, what you see is that on five cores we are almost
five times faster than Flex, which is a highly optimized lexical analyzer in C.
So this is pretty promising. Of course this was for lexing, and parsing is really what we
need to parallelize next; we are probably halfway done with it, with quite
promising results I'd say.
Now let's look at the layout here. So here is where we spend most of the time. And this
is where Leo has done a lot of work, and he can answer details of it. But I'll give you a
high-level overview. Sorry. I should go here.
So a layout has two phases. The first phase is you take your abstract syntax tree of the
page, which is here, and then you have rules like this one which tell you -- these are CSS
rules which tell you how to lay out the document.
And you need to essentially take these rules, you may have thousands of them, and match
them onto the tree. These rules will tell you how particular nodes will be laid out.
So here we have an image node in the tree. And this will be associated with this rule and
this rule. So we have two rules that match this node. And then there is a prioritization
phase which will decide whether this rule or that rule applies, but we'll skip that phase
here.
How is this matching done? Well, you take this path from the node to the root, and that
describes a certain string based on the labels on those nodes, and then you have your rule
here. This is what is called a selector. And you see whether the end of this rule here
matches the end of this path, and then the rest needs to be a substring. This here, the
selector needs to be a substring of that path. So this is the work that you do here.
So as you can see, it's highly parallel because you have many nodes being independent
from each other. You have many rules, again they're independent from each other. So
this is a nice parallelism, except there are two things that you need to solve.
The first one is load balancing, and that is solved for now by randomly assigning work to
processors. Here we show it for three different processors which independently compute
the work.
So this so far seems to be working okay. And the other problem is memory locality.
These hash tables that store these rules are quite large. They don't fit in the cache.
So what you do is tiling. You first do the matching with a subset of the rules, you
perform the match, then throw that subset out of memory and work on the remaining
part of the rules, and do the matching. And this turns out to be essential not only for
obtaining good sequential speedup but also for enabling parallelism to harvest its
potential.
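A minimal C++ sketch of the node-parallel, rule-tiled matching just described; the Node and Selector types, the simplified matching test, and the round-robin assignment of nodes to threads are illustrative assumptions, not the actual implementation.

#include <algorithm>
#include <string>
#include <thread>
#include <vector>

// Hypothetical, heavily simplified model of CSS selector matching: a node's
// path_to_root holds the element labels from the node itself up to the root,
// and a selector matches if its last label is the node's own label and its
// remaining labels appear, in order, somewhere among the ancestors (a rough
// stand-in for descendant selectors; real CSS matching is richer).
struct Node     { std::vector<std::string> path_to_root; std::vector<int> matched_rules; };
struct Selector { std::vector<std::string> labels; };

static bool matches(const Node& n, const Selector& s) {
    if (s.labels.empty() || n.path_to_root.empty()) return false;
    if (s.labels.back() != n.path_to_root.front()) return false;  // rightmost label = node itself
    size_t j = s.labels.size() - 1;                               // labels still to find, going up
    for (size_t i = 1; i < n.path_to_root.size(); ++i)            // strict ancestors only
        if (j > 0 && n.path_to_root[i] == s.labels[j - 1]) --j;
    return j == 0;
}

// Tiled, parallel matching: nodes are split across threads (a stand-in for the
// random work assignment described in the talk), and each thread sweeps the
// rules one cache-sized tile at a time so the working set stays small.
void match_all(std::vector<Node>& nodes, const std::vector<Selector>& rules,
               size_t num_threads, size_t tile) {
    std::vector<std::thread> pool;
    for (size_t t = 0; t < num_threads; ++t)
        pool.emplace_back([&, t] {
            for (size_t r0 = 0; r0 < rules.size(); r0 += tile)            // one tile of rules
                for (size_t i = t; i < nodes.size(); i += num_threads)    // this thread's nodes
                    for (size_t r = r0; r < std::min(rules.size(), r0 + tile); ++r)
                        if (matches(nodes[i], rules[r]))
                            nodes[i].matched_rules.push_back(static_cast<int>(r));
        });
    for (auto& th : pool) th.join();
}

Each thread touches a disjoint set of nodes, so no locking is needed on the per-node result lists; the tile loop is what keeps the rule tables cache-resident, which is the sequential optimization the speaker says matters as much as the parallelism.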
So here are the results. And what you see here, here is the original sequential algorithm
here, so this is speedup 1. Now, if you parallelize that, you get some speedup maybe a
factor of 3 here.
Now, if you apply sequential optimization including the tiling, you get to maybe 12. And
then parallelism will give you more. And I think the speedup here will be more than you
had in the non-optimized version because you get from eight milliseconds to two, so
maybe you get factor of 4 rather than a factor of 3, so even the parallelism does better.
Now, if you look at scaling, scaling is not ideal. You'd like to scale a little bit better here
on these eight threads. And perhaps we need to look more at memory locality and
whether the random distribution of work on these processors is the right idea. But so far
this is promising.
Layout. So now comes the process in which you have the abstract syntax tree, and it is
annotated with the rules which we matched in the previous phase. And now we need to
traverse the tree and actually figure out where each letter goes. And when we have the
letters we have the words and then we figure out where to break the words into lines and
where the boxes stack above each other.
This is called flow layout, because we're essentially flowing the elements onto the
page.
And if you follow the specification of this process, the CSS specification, this seems
inherently sequential. Because what you do, you start with a tree, it has some initial
parameters such as the page is 100 pixels wide, the font size is 12, now the font size of
this subtree is 50 percent and then it says that this image should float, meaning it can -- it
can go from this paragraph to the next paragraph because it will float out of the paragraph
and you set this width. So this is the input.
And now you figure out the rest through in-order traversal of the tree. You compute the
X and Y here. This gives you the X and Y here. You change the font size to 6. So what
you propagate is the font size, the current cursor -- the position on the page -- and the
current width. And you continue like this, propagating information, until you're done.
And this appears sequential, because if you trace these dependencies, you end up with
such in-order chains, and therefore the algorithm is not parallel.
But if you take a closer look you realize that there are some subset of the computation
which in fact can be performed independently of the others. And here we see that the
font size of this element can be computed by a top-down traversal.
And if you follow this idea and invent some new attributes, you end up with a parallel
computation that has five phases. In the first one you go top down, compute the font
sizes, and some temporary preliminary width of these boxes.
Then you go bottom up -- so we can show this. For each of the boxes you compute its
preferred maximal width, meaning if the box had arbitrary space available for layout,
how would it lay itself out. In the case of a paragraph it would mean how wide it would
be if it didn't do any line breaking.
And you also compute the minimum width, which is sort of the narrowest width that it
needs to lay itself out properly, and that would mean, in the case of a paragraph,
breaking after every word.
So this is what you do in the second phase. Again, it's a parallel tree traversal, bottom
up. Once you have that, in the third phase you can actually compute the actual widths.
Then you are ready to compute the heights, because now you know how wide each
paragraph is. You compute the heights in the fourth phase. And then in the fifth phase
you actually compute the absolute positions.
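Here is a minimal C++ sketch of those five phases over a simplified box tree; the Box attributes and the stacking rules are assumptions made for illustration, floats are ignored (the talk notes they break this phase split), and the child loops are written as plain recursion even though, as described, the children in each top-down or bottom-up pass are independent and could be processed in parallel.

#include <algorithm>
#include <memory>
#include <vector>

// Hypothetical box tree; only the attributes mentioned in the talk are modeled.
struct Box {
    std::vector<std::unique_ptr<Box>> kids;
    float font_scale = 1;                             // e.g. 0.5 for "font-size: 50%"
    float intrinsic_width = 0, intrinsic_height = 0;  // e.g. text or image extent
    float font_size = 12, pref_width = 0, min_width = 0;
    float width = 0, height = 0, x = 0, y = 0;
};

// Phase 1, top down: propagate font sizes. The children are independent of one
// another, so this loop could be forked in parallel; the sketch keeps recursion.
void pass_fonts(Box& b, float parent_font) {
    b.font_size = parent_font * b.font_scale;
    for (auto& k : b.kids) pass_fonts(*k, b.font_size);
}

// Phase 2, bottom up: preferred width (no line breaking at all) and minimum
// width (break wherever possible). Again the children are independent.
void pass_pref_min_widths(Box& b) {
    for (auto& k : b.kids) pass_pref_min_widths(*k);
    b.pref_width = b.intrinsic_width;
    b.min_width  = b.intrinsic_width;
    for (auto& k : b.kids) {
        b.pref_width += k->pref_width;
        b.min_width   = std::max(b.min_width, k->min_width);
    }
}

// Phase 3, top down: actual widths, clamped between minimum and preferred.
void pass_actual_widths(Box& b, float avail) {
    b.width = std::clamp(avail, b.min_width, std::max(b.min_width, b.pref_width));
    for (auto& k : b.kids) pass_actual_widths(*k, b.width);
}

// Phase 4, bottom up: heights, now that widths (and hence line breaks) are known.
void pass_heights(Box& b) {
    for (auto& k : b.kids) pass_heights(*k);
    b.height = b.intrinsic_height;
    for (auto& k : b.kids) b.height += k->height;     // simple vertical stacking
}

// Phase 5, top down: absolute positions. The running cursor over siblings is a
// dependence, but all heights are already known, so a prefix sum over the
// children would expose parallelism here too.
void pass_positions(Box& b, float x, float y) {
    b.x = x; b.y = y;
    float cursor = y;
    for (auto& k : b.kids) { pass_positions(*k, x, cursor); cursor += k->height; }
}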
So here is the speedup on some preliminary implementation which is not quite realistic
because it doesn't have the font work which you need to put in the leaves. So it doesn't
quite account for how much work it takes to turn fonts into rasters and so on. But still
preliminary there is some speedup, meaning we did discover some parallelism.
Finally, scripting. So scripting is this component here. The script is something that
interacts with the DOM and also with the layout because the script might change the
shape of the DOM, it may change the attributes, such as the sizes of these boxes, and the
layout needs to kick in and re-layout the document.
So why would one want to parallelize scripting? It doesn't take so much of the
computation, maybe 3 to 15 percent of the browser goes into scripting.
Well, a lot of what scripting does in the browsers beyond user interfaces is visualization
of data. So imagine you have an electoral map like this and now you want to change
some parameters and it's to change into something like this.
And maybe in addition to changing the color you want to change the shapes of the states
to reflect the magnitude of some attribute. And perhaps you want to do it in animation,
and perhaps you want to do it with the granularity of counties rather than states.
So now you realize that the computation is quite demanding. You have animation which
needs to run thirty times a second, so 30 frames a second. That gives you about 33
milliseconds for hundreds of nodes in the AST. And now this is much, much more
demanding than what JavaScript can do, definitely on handhelds. And it gets worse if you start
doing 3D with various other annotations.
So what do you need to do to speed it up? Well, the programming model looks roughly
like as follows. It's a nonpreemptive [inaudible] model. Okay. These handlers
respond to events such as the keyboard or the mouse. And they execute atomically. So if
there are two events queued up, you first execute this one, finish, then execute the next
one. Between the two the layout would kick in and render the document. And only then
the second script would go.
And this would be sort of the nice execution model. So it looks all very friendly to the
programmer. But if you want to parallelize it, meaning you perhaps would like to do two
layouts in parallel, perhaps you would like to take the effects of this script and lay it out
in parallel with the effects of that script, now you need to look closer into the document
and understand what the dependencies are.
The dependencies are interesting. This is sort of what one needs to do to parallelize
programs in a particular domain, is understand the dependencies that exist. And here in
the browser they have two kinds.
The first one is what you could call document query dependencies. This script here could
write into an attribute X, say the position of one node in the DOM, update it to a
particular position on the screen, and then the second handler may want to read it. So this
is the classical data dependence that exists because you read and write the same memory
location.
More interesting are so-called layout dependencies. What happens here, the handler can
write the width, change the width of some, say, box or image on the screen. Now the
layout kicks in and it will re-lay out the entire document. And as a result, it may change
the width of the California element. And then handler B will read this.
And so now we have dependencies here which exist transitively. The handler here did
not write stuff that was read by this, but it wrote something which changed the attributes
for layout, the layout then changed the attribute in some other node which was read by
this.
So in order to parallelize the layout process, you need to understand what the scripts do
and how changes propagated through the layout semantics influence the other scripts.
Then despite the fact that this is nonpreemptive single threading, there are concurrency
bugs. And we discovered three kinds. The first kind is related to animations. Now, it
could be that you have two animations running concurrently on the same element and
both of them want to change the same attributes, say the size or color, and now they
conflict and corrupt the document.
Second, there are interactions with the server, and the semantics of the browser-server
HTTP interaction is that the responses not only can be lost, but when they come back they
could be reordered. And that reordering can break some invariants that the programmer had in
mind inside the browser program.
And finally, browsers, in order to optimize, run scripts eagerly before the document is
actually loaded. And the effect is that sometimes unpredictable things happen. A script
that runs too early may destroy the entire document, and the other ordering may also
cause problems. I can tell you more details offline.
So why is JavaScript not a great language for this? What you see here is a small Web
page, animation effectively, whose goal is to create a box. The box is here. It's a simple
div that follows the mouse. It follows the mouse in such a way that the box appears in a
position on the screen where the mouse was 500 milliseconds before.
So here is the box. And here is the script. And what you see in the script are two
callbacks. Essentially these are interrupt handlers, if you will. This one here is called
whenever the mouse moves. So you register it here. This is the event to which you
register it, here is the callback.
And inside here you have another callback which is created each time this one is invoked.
It's a closure, and this one is then invoked 500 milliseconds later, and this is the one that
actually moves the box by setting its coordinates.
And so, you know, this is not so bad, but this is a very simple program. And you already
see that there are a few things that are really hard to read. The first one is the control
flow. The control flow is not quite obvious. This is invoked when the mouse moves.
This is invoked 500 milliseconds later. You need to do analysis of how these callbacks
are registered in order to see how the control flows through the program.
The second is dataflow. If you notice the way things are linked here, this box here has a
name which is a string. And you refer to it with this construct. Here is the name
reference. So you have a reference from here to here, not through something that is
syntactically created and registered in a symbol table, but it could be in the worst case a
dynamically computed string, and very often it is.
So the data dependence also is really hard for the compiler to figure out, and so again you
don't know how the scripts modify the shared document.
So the proposal is to switch to something like an actor model, or perhaps an even higher
level of abstraction where the control flow is changed into dataflow. We can think of it
perhaps as streaming.
So now you have a mouse here, we generate the sequence of top and left coordinates.
Each time the mouse moves, two values are inserted here and they flow down here to these
sort of computational components, each of which delays them by 500 milliseconds. After 500
milliseconds they flow down here, and they take this box which now has a nice structural
name that the compiler can actually reason about and it changes its attributes.
And now both the control flow, which is here, and the dataflow, which is what we are
modifying based on this name, are nice and visible.
So in summary, we developed some work-efficient algorithms, which are especially
important on these mobiles where energy efficiency is important. And we looked at the
programming model that should be useful for scripting in the browser, and we are now
finishing the first version of the design of the language based on constraints. We could
think of it as raising the level of abstraction or functional reactive programming, maybe
[inaudible] somewhere higher.
And that's where we are. There is a lot of other work going on, such as partial evaluation
of the layout process and so on. And I can tell you about it offline.
>>: Yeah. So you're [inaudible] familiar with the fact that you can -- in a
non-work-efficient way -- use parallel prefix computation to do lexing in parallel.
And the drawback with that has always been that the amount of additional work is sort of
how many states are in the finite state machine that you do it with.
But this insight that there are these persistent states and this fairly small number of
possibilities for the input at certain times suggests another way to do this, which is pretty
close to what you're doing. You can kind of compute the narrowing spots in possibilities
in state space and do scans at that granularity, not at the single-input-symbol
granularity, but rather on the substrings that transport you from what let me call popular
states to other popular states.
So you're doing parallel prefix only on the popular targets, not on the whole state space.
>> Ras Bodik: This seems to require that you examine the content of the string
beforehand.
>>: No, what I'm suggesting to you is you examine the automaton actually. But, yeah,
but maybe the string too.
>> Ras Bodik: Right. So we definitely plan to continue by understanding the automaton.
And you can do it in various ways through profiling perhaps.
But I -- if I understand you correctly, it would really require me to have a look at the
string and understand that, oh, this part of the string here is likely to converge faster than
others, for example, because if I chop this string in the middle of a comment rather than
in the middle of an identifier, of course I may be in big trouble. And in sort of current
browser programs, the Web page is a combination of HTML, CSS, JavaScript and
potentially other languages.
So you would at least like to know in which language you are when you're breaking this.
>>: [inaudible]
>> Ras Bodik: But I need to look at the content first, which sort of requires -- goes
against the benefit here. I would like to chop the input directly from the network card.
>>: [inaudible] in places that --
>> Ras Bodik: Right.
>>: -- and you're making a speculative choice as to which [inaudible].
>> Ras Bodik: Absolutely.
>>: If it was a small subset of states instead of single state, then there wouldn't be any
backtrack. Might not be a win, I'm just -- just an idea.
>>: [inaudible] the only area that you apply speculation is in lexing, right? If you had
support for [inaudible] transactions, would you see opportunities in other phases of your
program?
>> Ras Bodik: So first of all, speculation actually is used much more pervasively than
that. You have to use it in parsing because there is very similar inherently sequential
dependencies, unless you use something like CYK parser, which has bad constants,
however.
So but parser has it. The layout process has it also, because we can break things down
into these five phases only if there are no floats. Floats are these images which may float
out of their paragraph and influence anything that follows. So you speculate that that
doesn't happen. But so you cannot prove that these dependencies indeed do not exist. So
speculation is used also there.
Now, to the heart of your question, would hardware support be useful --
probably not. I think you could probably use it, but these are very domain-specific ways
of using speculation; namely, you know exactly which [inaudible] you need to check at
the end of the work to see whether the speculation was correct or not. And that I think is
easier to do with software.
Also, we do not need to do any complicated rollback that would require support. We
just redo the work. So probably not.
I'll be happy to answer other questions offline.
[applause]
>> Sam King: So hello. My name is Sam King. I'm from University of Illinois. And
I'm here to talk today about designing and implementing secure Web browsers. Or
another way to look at it is how you can keep your cores busy for two seconds at a time.
So this is joint work done with some of my students: Chris Grier, Shuo Tang and Hui
Xue. And we're all from University of Illinois.
So overall if you look at how people use Web browsers, it's very different now than ten
years ago. So ten years ago the Web browser was the application. And the static Web
data was the data. But if you roll forward now to so-called Web 2.0, the browser has
really become more of a platform for hosting Web-based applications.
So it's very common for people to check e-mail, do banking, investing, watch television,
and do many of their common computing tasks all through the Web browser. In fact, I
would argue, and I think Ras did more eloquently, that it's the most commonly used
application today.
Now, unfortunately, the Web browser is not up to its new role as the operating system in
today's computers. And modern Web browsers have been plagued with vulnerabilities.
So according to a report from Symantec, in 2007 Internet Explorer had 57 security
vulnerabilities. This is over one year. Now, I know what -- when I go around the
country and I give this talk and talk to people, they say, Sam, you know, I would never be
crazy enough to use Internet Explorer. I'm a Firefox guy. I care about security.
Well, turns out Firefox has also had its share of security problems, with 122 security
vulnerabilities over the same period of time.
Now, some of you might be Mac users, such as myself, and in our infinite arrogance we
understand we have error-free software and therefore we don't have to worry about stuff
like this.
Well, Safari and Opera have also had their share of problems.
Now, perhaps most alarmingly the browser plug-ins, which are external applications used
to render nonHTML content accounted for 476 security vulnerabilities over this one-year
span. And this is just too much.
Now, to make matters worse, there were a number of recent studies from Microsoft,
Google, and University of Washington, all of which show not only are browsers error
prone, but this is a very common, if not the most common, way for attackers to break into
computer systems, is all through the Web browser.
So what does it mean to have a browser-based attack? What does it mean to have your
computer system broken into through the browser?
So one way this could happen is through so-called social engineering. So what I have
shown here on this figure is a screen shot from greathotporn.com. And so you go to this
Web site and they show you a very real looking video plug-in. And when you go to play
it, they ask you some questions. They say, okay, you know, I see that you don't have the
proper codec. Can you please install this Active X codec so you can watch the video.
And you click yes. And instead of watching a video you get a pretty nasty piece of
malware installed on your computer system.
This particular site also would try to exploit browser vulnerabilities at the same time. So
that means behind the scenes they're trying to invisibly break into your system.
Now, another way that this can happen are through plug-ins. As I had mentioned,
plug-ins are very error prone, at least according to the data I have, and if an attacker is
able to take over a plug-in, not only do they then own your entire browser but also your
entire computer system.
So shown here as viewing a PDF, something about designing and implementing
malicious hardware, well, there could be some exploit code mixed in there that could take
over your entire computer system. Something very benign that we do all the time can
lead to your computer system getting broken into.
Now, perhaps most surprising is an attack being carried out through a third-party ad.
So shown here I have Yahoo!. And I'd like to draw your attention down here to the
bottom right where you can see there's a third-party ad. Well, these ad networks are very
complex, where there are many layers of indirection between the ad that's being served
and how it actually gets to your browser. And there's been evidence that the top ad
networks have given people malicious ads. So what this means is that the ad contains an
attack inside of it, and it takes over your browser.
Now, this is interesting --
>>: [inaudible]
>> Sam King: Even if you don't click on it.
So this is interesting because it violates one of the fundamental assumptions that we're
taught: hey, if you don't go to greathotporn.com you're not going to get broken into. But
that's not actually the case. Legitimate sites can be susceptible, mainly because browsers
just bring in information from so many potentially untrusted sources.
And finally, and my personal favorite is this so-called UI redressing attack. So the point
of what you're trying to do here is you're trying to get the user to turn on their
microphone through Flash.
So the way this happens normally is you have a frame that visits a page hosted on the
Adobe Web site and they'll say this Web site is trying to turn on your microphone; do you
want to allow, yes or no.
But what you see here is some text that says do you allow AJAX; AJAX will improve
your user experience. Well, yeah, of course. I'd love to have my user experience
improved. Who wouldn't.
But in reality what's happened is that frame from Adobe is hidden here behind the scenes.
And the attacker has covered everything except this allow button. So in an attempt to
improve your user experience, you've inadvertently turned on your microphone and given
the person that's hosting this Web site complete access to what you're saying.
So the current state of the art in Web browsers when we started this project was quite
poor, and it's improved a little bit since we originally published this work.
So some of the more traditional browser architectures are Firefox and Safari, where these
are basically monolithic pieces of software that include everything within a single
process.
If you have a vulnerability in one single part of one of these browsers, your whole system
is taken over. And they do try to enforce some security policies, but these enforcement
checks are sprinkled throughout the code in the form of a number of IF statements.
Now, since we published the original work that I'm going to be talking about today, some
more recent browsers have made some big improvements in terms of security. So
Google Chrome and IE8 both do a great job with system-level sandboxing. So, you
know, this idea of a so-called drive-by download.
These browsers do a pretty good job with that. But I think one area where these browsers
still fall short, even today, is protecting your browser-level state. And I would argue that
as more and more content moves onto the Web, this browser-level state will increase in
importance, and protecting it is something that we need to be able to do.
So finally, the fundamental problem is that it's really difficult to separate security policies
from the rest of the browser.
So frustrated with the state of the art, my students and I set out to build a new Web
browser from the ground up. And this is where we started the OP Web browser project.
So our overall goal is to be able to prevent attacks from happening in the first place. You
know, let's build this browser right from the beginning.
However, being engineers and realists, we realized that vulnerabilities will still happen
and browsers will still have bugs in them. So even if there's a successful attack, we want
to be able to contain it.
Finally, at the end of the day, there's still a user using that Web browser. So they're going
to download stuff and double click on it. So the final thing we want to do is provide the
ability to recover from attacks.
And so what we do is provide an overall architecture for building Web browsers. So we
take the Web browser and break it apart into a number of different components. And we
can maintain security guarantees even when we're broken into.
Now, very much the design was driven by operating systems and formal method design
principles, as you'll see later in the talk.
So overall I will spend a little bit of time talking about the OP design and one of the
aspects of OP that I think is pretty interesting, which is how we applied formal methods
to help us reason about security policies. Then I'll talk about the performance of our
original prototype. And at the end I'll spend some time touching on some of our future
work and the things that we're working on as part of UPCRC.
So as I mentioned before, our overall approach with the OP browser is to take the
browser and break it into a number of much smaller subcomponents. And really at the
heart of this browser is a thin layer of software down here called our browser kernel.
And our browser kernel, like an operating system kernel, is responsible for managing all
the OS-level resources below and providing abstractions to everything running up above.
Now, the browser kernel is where all of our access control and security -- almost all of
our security mechanisms are enforced. Now, the key abstraction that the browser kernel
supports is message passing. So that means everything that's running up above
communicates through the browser kernel using message passing.
Now, the key principal in our browser operating system is the Web page. So each time
you click on a link, this creates a new Web page instance. And this is the key principal in
our operating system.
So the Web page instance is broken down into a number of different separate
components, starting with a component for plug-ins, one for HTML parsing and
rendering, one for JavaScript and one for laying out and rendering content.
Now, we sandbox these Web page instances heavily. So this is -- a Web page instance is
composed of a number of different processes. But these processes aren't allowed to
interact with the underlying operating system directly. We use OS-level mechanisms to
make sure that that doesn't happen, and we force it to communicate through our browser
kernel.
So now the question is, okay, if all we can do is communicate with the browser kernel,
how do we actually do normal browsing things. And that's why we have additional
components here over on this side, where we've got a user interface component for
displaying, we've got a network component for fetching new content from the network,
and a storage component for any persistent storage needs that the browser might have.
So by designing a browser this way, what we were able to do is provide constrained and
explicit communication between all of our different components, and everything is
operating on browser-level abstractions.
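As a hypothetical illustration of that message-passing discipline -- this is not the OP implementation, just a sketch of the idea -- the kernel can be pictured as the one place where every message is checked before delivery. The Message fields, the component names, and the toy origin check below are all invented for the example.

#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>

// Hypothetical message-passing browser kernel: components never touch the OS
// directly; every request flows through the kernel, which applies an
// access-control check before delivering it.
struct Message {
    std::string from;     // sending component, e.g. "js:example.com"
    std::string to;       // receiving component, e.g. "network"
    std::string op;       // operation, e.g. "fetch", "store", "render"
    std::string payload;  // operation argument (URL, cookie key, pixels, ...)
};

class BrowserKernel {
public:
    using Handler = std::function<void(const Message&)>;

    void register_component(const std::string& name, Handler h) {
        handlers_[name] = std::move(h);
    }
    void send(Message m) { inbox_.push(std::move(m)); }

    // The single, central place where security policy is enforced.
    bool allowed(const Message& m) const {
        // Toy stand-in for a same-origin check: a script component may only
        // ask the network for URLs that mention its own origin.
        if (m.to == "network" && m.from.rfind("js:", 0) == 0)
            return m.payload.find(m.from.substr(3)) != std::string::npos;
        return true;
    }

    void run_once() {
        while (!inbox_.empty()) {
            Message m = std::move(inbox_.front());
            inbox_.pop();
            auto it = handlers_.find(m.to);
            if (it != handlers_.end() && allowed(m)) it->second(m);  // disallowed messages are dropped
        }
    }

private:
    std::map<std::string, Handler> handlers_;
    std::queue<Message> inbox_;
};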
So because of this, we're able to enforce many of our security policies inside the browser
kernel itself. And our browser kernel is a very, very simple piece of software which is -- which makes it easy to reason about.
Now, this is a stark contrast to most modern Web browsers where the security checks are
scattered throughout the browser itself, and this intermingling makes it very difficult to
figure out what's going on.
So given this overall architecture, there are a number of different things that we've done
with it. So some of the things that we've been innovating on have been policy, where we
were the first ones to take a plug-in and include it within the browser security policy
itself.
Now, I think we have a pretty good idea of the mechanisms for doing this type of thing,
but what we found out is that policies are much more difficult than we had originally
thought. And this is still an area of ongoing research.
The second thing we did is because of the way we designed it, we were able to pretty
easily apply formal methods and give us more mechanical ways to reason about some of
our security policies. And I'll talk about that in a few slides.
And then finally, we were able to do things with forensics, meaning your browser has
been broken into and you download an executable, can you tell me which Web page this
executable came from. Something that's very difficult with a traditional Web browser but
that we can do pretty easily.
So our overall goal for our use of formal methods in OP was not to have a formally
verified Web browser but we wanted to see how well -- how amenable our design was to
using these types of mechanisms. So we wanted to model check part of our specification.
Now, specifically what we do is we model check two key and very important invariants.
So the first one is does your URL address bar equal the
page that's currently been loaded. So that's if you've got a browser, it says it's at address
X, is that really the page that's being displayed. So it's a pretty simple invariant and one
that I think we all assume. But history has shown this has been a surprisingly difficult
invariant to get right.
The second invariant that we tried to model check is our browser's implementation of the
same origin policy. And we do this assuming that one or more of our components have
been completely compromised. So if we give an attacker the ability to run arbitrary
instructions on our computer system, can we still enforce the same origin policy? That's
the question we're trying to answer.
Now, in order to do this modeling, we built a model using Maude, and each of our
different subsystems make up the overall state space for our browser. Now, all of the
messages that pass through the browser kernel, these are our state transitions. And we
use this for model checking.
So one of the interesting things I think that we did in terms of formal methods is that we
were able to model an attacker pretty well. And this is because of the sandboxing we're
using and because of some of the assumptions we make, we can model an attacker pretty
accurately as one of our components that sends arbitrary messages.
So they can send whatever message they want, they can drop messages, reorder
messages. They can do whatever they'd like as long as it's sending a message. And this is
our model for an attacker.
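Continuing the hypothetical kernel sketch above -- and only as an illustration, since the actual modeling was done in Maude rather than C++ -- the attacker in this model is simply a component that emits arbitrary messages; a crude randomized stand-in for the exhaustive model checking might look like this.

#include <random>
#include <string>

// Crude randomized stand-in for the exhaustive attacker used in model checking:
// a compromised component may send any message it likes, in any order.
// (Reuses the hypothetical BrowserKernel sketch above.)
void attacker(BrowserKernel& kernel, int steps) {
    std::mt19937 rng(42);
    const std::string targets[] = {"network", "storage", "ui", "layout"};
    const std::string ops[]     = {"fetch", "store", "render", "navigate"};
    std::uniform_int_distribution<int> pick(0, 3);
    for (int i = 0; i < steps; ++i) {
        kernel.send({"js:attacker.com", targets[pick(rng)], ops[pick(rng)],
                     "http://victim.com/secret"});
        kernel.run_once();  // an invariant check would run after each delivery
    }
}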
So the first invariant that we tried was the URL bar equalling the URL that's been loaded
in the browser. So the thing that was interesting about this exercise was that we found a
bug in our implementation. So it's a pretty simple thing. At least I thought it's a pretty
simple thing to try to program this type of invariant, but what we found is that we still
made a very subtle mistake.
I know what this slide says, but what I think actually happened is we forgot to take into
account an attacker that will drop a message. Maybe with more time we would have got
it. But because we were using formal methods, it was something that we found I think a
lot faster than we would have found otherwise.
Another thing that I think is interesting is that we were able to model and -- model check
our implementation of the same origin policy, something that is very difficult to do in
modern Web browsers.
So overall I think what I'm not trying -- I'm not trying to stand here in front of you today
and tell you we have a formally verified browser, because that's just not true. But what I
think we have is a design that is well suited to this type of analysis, and we have a pretty
good starting point for implementing a more secure browser.
So at the very least we know the messages that are going back and forth are basically
doing the right thing. So now that we've been able to reason about this well, we can
focus on implementing the rest of it to make sure that our model and implementation
actually match.
So one of the things that was interesting to us was performance. So in order for us to
measure performance, what we did is use page load latency times. So shown here on this
figure on the X axis we have page load latency times in milliseconds, so longer is slower,
slower is bad. The two browsers we tested were OP, which is ours, and Firefox. And we
tested it on five different Web pages where we've got Wikipedia, cs.uiuc, craigslist,
Google and Live.
And what we did is load the page and measure from the time you click the button until it's
displayed on the screen. That's defined as page load latency time.
So what we found that was really interesting about this process was not that we're about
as fast as Firefox, which more or less we are. The thing that was really interesting was
that we're about as fast as Firefox despite our current best efforts to make our browser
slow.
So we do everything in our power to make this thing slow. It takes 50 OS-level processes
to view a single Web page. We've got multiple Java virtual machines running all over the
place. You know, we're using IPC as much as possible. Everything is -- you know, all
these boxes I show are processes and they're communicating using IPC, so we're doing all
the things that should make it slow, but despite this it actually isn't that bad.
And I think there are a number -- a couple of things, reasons why this is true. One of
which is multicore. So because we have sufficient number of computational resources,
we can do these things where we add some latency and it doesn't affect the overall
loading time.
All right. So what I presented so far was some of our older work. This was from a little
while ago. And as I mentioned, since we published the basic architecture, other people
have started to take up this line of research.
So what I want to spend just a few minutes talking about here today is how we're going to
use this as a platform for future computing and some of the things that my group and I are
working on today.
So I think that the original OP architecture is going to do a good job keeping a few cores
busy, you know, four, maybe even eight. We'll keep those busy. But what I'm trying to
think about now is what types of applications can we build on top of this framework to
keep tens of cores busy.
So our first -- the first thing that we're looking at in this general area is we want to enable
our browser to enforce client-side security policies. So this means in the browser we
want to be able to see what's going on in a Web page and potentially restrict parts of the
Web page and do this in a way so that we can improve performance.
But by restricting a Web page we're changing the Web page itself which could have a
compatibility impact. So what you could think about this from a high level is maybe you
want to visit a Web page and you want to try out 15 different security policies in parallel
and then try to pick which one is right for you depending upon the compatibility versus
security tradeoff that you observe.
But one of the problems is that it can be difficult to try to quantify these types of things,
especially when you want to -- when you're concerned with how the user has been
interacting with a specific Web page.
So one of the things we're looking at is a connection to another one of my UPCRC
projects which is Replay. So can we use Replay technology as a way to replay Web
browsers. And if you look at the two different extremes of the type of replay you could
do, is on one extreme you could have full-blown deterministic replay. So this means that
you can recreate arbitrary past states and events instruction by instruction. So this works
great at recreating past states and events. This is what it's designed to do. And if you
want to do something like reverse debugging, it's great for that.
But the problem is if you want to take the same browser and apply a different security
policy, traditional deterministic replay doesn't really work. Because things have changed
a little bit now. And so what you really need is some more flexibility.
Now, some techniques that you can use that are much more flexible are things like
replaying UI events. So you can think of the most simple form of replay on a browser is
clicking the refresh button. Right? So it will go out and it will refetch the page and it
will basically do the same thing. And this is going to be very flexible.
Now, the problem with the naive approach is that you're not getting UI events. So if the
user interacts with the Web page, you're missing this. Then there are known approaches
for doing these types of things. There's a project called Selenium that does stuff like this.
And you can have a very flexible environment for replaying browser events.
But the problem is it's not deterministic enough. There's still certain things like
JavaScript that will induce types of nondeterminism, especially when you're applying
security policies that remove part of your JavaScript out of the page. You know, who
knows what Selenium is going to do in there. So you've got to spend a little time trying
to think about how to cope with some of these types of scenarios.
So from a high level, what we want to do is called semantic browser replay, where it's
something that's in between the spectrum of full-blown deterministic replay where we
can reproduce as many of the past states as we can but where we do it in a way that's
flexible enough that we can try different policies out.
So the basic architecture is that you've got a browser instance and as you load this page
you're going to record the sources of nondeterminism as you execute, and the user
interface events, and then you can replay it on a couple of different browser
instances.
Now, what you can do is you can have one that doesn't have any security policy, another
with a new security policy, maybe a third, fourth, and fifth with even different security
policies. And then you can figure out ways to try to determine and quantify how
different are these pages.
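As a hypothetical sketch of that compare-across-policies idea -- the Event and Policy types and the crude diff metric below are invented for illustration; a real system would replay actual DOM and JavaScript state rather than strings.

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// One recorded session: UI events plus the captured sources of nondeterminism.
struct Event  { std::string kind; std::string data; };   // "ui", "net", "timer", ...
struct Policy { std::string name; bool (*allow)(const Event&); };
struct PageState { std::string rendered; };              // whatever we choose to diff on

PageState replay(const std::vector<Event>& log, const Policy& p) {
    PageState s;
    for (const auto& e : log)
        if (p.allow(e)) s.rendered += e.kind + ":" + e.data + ";";  // stand-in for executing it
    return s;
}

// Replay the same log under every candidate policy and report how far each
// result diverges from the unrestricted baseline (a crude compatibility score).
std::vector<std::pair<std::string, std::size_t>>
compare_policies(const std::vector<Event>& log, const std::vector<Policy>& policies) {
    PageState base = replay(log, Policy{"baseline", [](const Event&) { return true; }});
    std::vector<std::pair<std::string, std::size_t>> report;
    for (const auto& p : policies) {
        PageState s = replay(log, p);
        std::size_t diff = base.rendered.size() -
                           std::min(base.rendered.size(), s.rendered.size());
        report.emplace_back(p.name, diff);
    }
    return report;
}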
You know, as I mentioned before, one of the things that gets really tricky about this is
when you remove a piece of JavaScript -- because this JavaScript might have causal
dependencies with JavaScript that you don't remove. And so it creates some interesting
challenges that we're still working through today.
Now, in addition to replay for browsers, I think one thing that I'm personally very
interested in are more formal methods. I'm more of a systems researcher myself, but
we're talking to people who do formal methods for a living and trying to see how much
we can use these mechanisms and how much of a benefit it has.
Another thing that my group has been thinking about is display policies. So this is
something that we've been collaborating with Microsoft Research on, a project called
Gazelle, where -- I don't know if you guys noticed this, but the UI redressing attack where
you turn on the microphone -- I didn't really mention anything today that's going to help
with that. So this is still a very wide-open area that needs to be looked at in more detail.
A general browser extensibility is yet another area that's interesting. Plug-ins are one
example, browser add-ons or extensions are another example. And thinking about how to
facilitate browser extensibility where you can provide both flexibility and security is an
interesting topic.
And finally there's the Berkeley project, where they're making individual components
faster by parallelizing them; you know, hopefully we can find ways to take those results
and work them into our overall architecture.
And so the hope is we'll keep even more cores busy.
So overall the browser has really evolved from an application into a platform for hosting
Web-based applications. And as such it's really become much like an operating system.
The problem is traditional Web browsers weren't built like operating systems. So when
you apply OS principles to designing and building Web browsers, you can make it much
more secure.
So what I showed here today is our approach to this basic philosophy where you can
decompose a browser into a number of much smaller subsystems, and this provides a
number of advantages including separating the security logic from the [inaudible]
browser, and it gives you the ability to model formally some browser interactions.
So what I've shown here is a step towards preventing, containing, and recovering from
browser-based attacks. So any questions? So I have a demo in case anyone's interested
afterwards. Was there a question in the back? Yeah.
>>: [inaudible] the architecture [inaudible] totally on its side, is it -- do you think that's
inherent or is that just kind of [inaudible]?
>> Sam King: I'm sorry, I didn't hear the first part of the question.
>>: [inaudible] the architecture of the browser, you showed that different Web pages
[inaudible] processes and isolation groups. But then you have this [inaudible] network
components [inaudible] I'm not surprised to see that [inaudible] do you think that's
inherent or just kind of that was just what you did?
>> Sam King: I'm not sure what you mean by inherent. Certainly it's a design decision
we made. So the big distinction I make is that these are components that we wrote from
scratch. You know, these are something that we wrote in JavaScript and we built it -- a
hundred lines of code, whereas the browser instances, these are off-the-shelf components.
So we're taking WebKit, for example, and jamming it in there. And because we have less
assurances about the implementation, it's a million-line artifact, we use more processes to
help there. So it's -- you can draw as many boxes as you want. This is just one design
decision we made.
>>: So can you give me an example of [inaudible] that you'll be able to?
>> Sam King: I'm sorry, one more time?
>>: In an attack that the IE will not be able to prevent but [inaudible] will be able to
prevent?
>> Sam King: So, you know, I think -- so the first thing is, I wouldn't say there's a
concrete attack that we prevent and IE doesn't, but I think the IE implementation of the
same origin policy has had some well-known flaws in the past.
Whether or not we would be susceptible to that, I don't know. I mean, so I can make a
qualitative argument that I think we did it better because it's so much smaller, but, you
know, that's more of a qualitative argument.
I think one thing that we -- so let me try to answer this for a higher level. I wouldn't say
that there's anything like a specific attack that we prevent and they don't. I think it's -- if
you look at most of the policies that we're playing with here, it's basically the same thing
that IE's doing. We just draw our boxes in a different way. So the hope is that it's easier
to reason about security.
But it's basically the same thing.
>>: [inaudible] Chrome or Safari?
>> Sam King: So this was a -- the performance numbers I showed were from a little bit
of an older browser. The new one uses WebKit. For whatever reason we haven't run it
against WebKit. We really should, though. I think that's a good suggestion. I think at
the end of the day, I personally am just not that worried about performance. Like, you
know, as long as it's reasonably fast, I'm going to be happy. We're more focused on the
security side.
But I agree. We should run that. My intuition tells me we add a little bit of overhead, but
not much.
>>: Thank you very much.
[applause]