>> Juan Vargas: Ding dong. Good morning, everybody. Welcome. As many of
you know, Microsoft and Intel are funding something called the UPCRC, which
stands for the Universal Parallel Computing Research Centers, at the University of
California, Berkeley and at the University of Illinois at Urbana-Champaign.
Today we have the great honor of having five visitors from Berkeley who will be
visiting several teams in incubation, products, and research. And the visit starts
with this presentation. The presentation will start with David Patterson giving us
an introduction to the center, followed by Kurt, followed by John, followed by Ras.
>> David Patterson: Okay. Nice to see everybody again. I'll be here -- I was
here last month, I'll be here next month. I don't know about July. It seems like
I'm probably going to be here.
I'll give you an overview of this lab. It started two, two and a half years ago. We
are really driven by applications, which is rare for a lot of computer science
projects. What we pitched in the proposal that Intel and Microsoft selected
were the applications in black. And we've enjoyed the applications' influence so
much that we've expanded to the new ones that are shown in blue. And
Kurt's going to be talking about patterns in these applications right after me.
I'm going to talk about some of our big bets. One of our big
bets is that the way to make parallel software is to have a really
good software architecture. In our old Berkeley View report we talked about
these 12 or 13 dwarfs, or motifs as we called them. We call those the
computational patterns, which are shown on the right. And then in the process
of doing the research, we decided that these programming patterns, which are good for
any kind of program, will be the key to our being able to get parallel software; in
particular, that software should have as its architecture a composition of these
patterns on the left. This has been captured in something that's called our
pattern language. And Kurt's going to talk about this.
This is happening with people here and at Illinois and at Intel.
Our mantra has been, and I would say other people say
this now too, that we see the world as productivity level programmers and
efficiency level programmers. Domain experts are examples of productivity
level programmers, programming in something like Python or Ruby. The
efficiency level language programmers are a lot of people like the people here, you
know, the C#, C++ types of programmers. They're aiming for bare-metal
efficiency, where the domain experts would be happy with pretty good speedup
and productivity.
And so we struggled with this one. In fact, if you were to read our proposal, the
productivity level thing was kind of a hole in our proposal. We should do
something about this. We didn't really know what to do. We were kind of afraid
that we were going to have to invent our own programming language, or that that
was the only viable way forward, because the chances of inventing a
programming language and it actually catching on seemed zero, but we didn't
know what to do. But now we have a story, and I'll tell you about that.
So one of the things we wanted to do was to make efficiency programmers
productive. We and many people believe auto-tuning is a better way to do
code generation than simply traditional static compiling. You know, looking at a
specific example on the right, in this stacked bar graph we increase the number of
threads. The auto-tuning part is that red part there. So there are significant speedups
from auto-tuning: generating code and seeing what runs well on a particular
computer rather than deciding everything at compile time. The problem with that is
it takes really smart people who know the architecture and the algorithms to pull that
off, and it kind of uses up one grad student at a time.
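A minimal Python sketch of the auto-tuning loop being described: generate candidate variants of a kernel, time each one on the target machine, and keep the best. The variant generator and the example parameters (make_blocked_matmul, the block sizes) are hypothetical stand-ins, not the ParLab tools.

    import time

    def autotune(make_variant, candidate_params, workload):
        """Generate code variants, time each on this machine, keep the fastest."""
        best_params, best_time = None, float("inf")
        for params in candidate_params:
            kernel = make_variant(params)            # build one candidate implementation
            start = time.perf_counter()
            kernel(workload)                         # run it on a representative input
            elapsed = time.perf_counter() - start
            if elapsed < best_time:
                best_params, best_time = params, elapsed
        return best_params

    # e.g. autotune(make_blocked_matmul, [{"block": b} for b in (16, 32, 64, 128)], test_matrices)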
So we've been trying to see what we can do about that. We won't have time to tell
you about our successful use of machine learning here. There have been some breakthroughs
by members of the team on how to avoid computation, and even for things that have
been around forever, like dense matrices, we've made big improvements in that.
But I will talk about SEJITS. SEJITS is this new idea that wasn't in the
proposal. And I think of it as making productivity programmers efficient. Or, if
you know what auto-tuning is, you could think of it as auto-tuning with higher-order
functions.
So rather than invent a programming language to help productivity programmers,
we would pick one that's already productive, like Python, say. Now, these
scripting languages like Python really are sophisticated languages that have all
the powerful features that you would like to see in a programming language. And
the acronym comes from the fact that this infrastructure is going to specialize the
computation, but it's not going to have to take any Python program; it can
selectively do a function at a time. We will have the option of doing just-in-time
compiling to create efficiency level language code. And then the other part of
selective embedded is that you write this thing in the language itself and you can use
the standard interpreter; you don't have to modify it or anything like that.
So, to try to get at the name here: if we're writing in Python, here are these Python
methods. A couple of them are marked as interesting to go faster,
the G and the H one. Because it's standard Python, if you don't have a
specializer for it, it just gets interpreted as it was before. How do you mark them?
Well, maybe the programmer marks them, or maybe we monitor the performance
and see which ones take a lot of time.
The H does have the SEJITS mechanism, so it gets invoked. There's a
specializer for that hardware, and that invokes this system on the fly.
The embedded part of the name means that it's actually written in Python itself;
those pieces are indicated in blue. You write them in Python, the just-in-time
compiling generates what you see, and then the specialization gets you
the speedup there.
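A minimal sketch of the mechanism being described, assuming a hypothetical specializer registry rather than the actual SEJITS interface: a marked function is handed to a platform specializer if one exists, and otherwise runs in the ordinary Python interpreter.

    import functools

    SPECIALIZERS = {}   # hypothetical registry: function name -> code generator for this hardware

    def specialize(func):
        """Mark a Python function for selective just-in-time specialization."""
        compiled = {}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            codegen = SPECIALIZERS.get(func.__name__)
            if codegen is None:
                return func(*args, **kwargs)              # no specializer: plain interpretation
            if func.__name__ not in compiled:
                compiled[func.__name__] = codegen(func)   # emit and compile efficiency-level code once
            return compiled[func.__name__](*args, **kwargs)
        return wrapper

    @specialize
    def h(image):
        # productivity-level code; a registered specializer could lower this to C or CUDA
        return [[2 * px for px in row] for row in image]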
So we're not going to have time to talk about that, but Armando Fox right here is
leading the charge on that effort, and he can answer questions about it. We're pretty
excited about it. And recently we've come to think it will not only help with
multicore or manycore, it could help with cloud computing as well; some
of the same arguments apply there. And Armando can tell you about that.
We have a big effort in correctness, debugging, and verification, as does Microsoft.
This is led by Koushik Sen, who was an intern here
many years ago. Koushik seems to be winning all the awards in the project as
far as I can tell. He's won several best paper awards. One of his papers got
selected for the Research Highlights section of the Communications of the ACM. So
that's almost like yet another award.
He has these really interesting ideas at the intersection of
testing and debugging, which people here have done as well: active testing. The
idea is that at the productivity layer we're going to try not to let people express some
of these problems that can really be hairy to debug. But at the efficiency layer
we can't avoid that.
And Koushik has developed techniques that have worked extremely well even on
popular open source software, uncovering previously unknown bugs, which has
led to all of these awards. And Ras can answer questions about correctness and
debugging.
Kubi will talk about our operating systems effort. A lot of the emphasis is on
isolation and quality of service; that's, you know, what we're after. Oh, I left off
Burton's. So Burton Smith said, hi, Dave, how are you doing, on my fall visit,
my fall tour to Microsoft. He said, by the way, I've solved the resource allocation
problem. I can't tell you about it because I'm filing a patent, but I've solved it,
okay.
So okay. That's interesting. We'd like to hear it when we can. So he came this
spring. He came in April, he came on April Fools' Day. It wasn't a joke. He
actually claims to have made big progress on it. And we're implementing, or
going to implement, it to help us with quality of service.
We've also got some ideas for how to make parallel libraries work well together.
Often a library assumes that no one else has the machine but it; with our Lithe
system they can be scheduled together. And Kubi can tell you about
that as well.
Closer to what I know about, we think one of the problems in architecture is that
people use software simulators and it's so -- you know, when you're having
scores or hundreds of cores, it just takes forever. And so you can't run very
much. And I think the architecture community is kind of bottlenecked in how
much they can simulate to make progress on manycore.
We and others, particularly Chuck Thacker here, are betting on FPGAs. Our
version of it is a simulator that does 64 cores on a very inexpensive board,
which was a big advantage for us. When the students were running out of time, they
just used their credit cards and got some more.
This version runs 250 times faster. So what does 250 times faster mean? That
means students can get a result in an hour versus 10 days. And it
makes it clear that research is latency oriented, right? You want
to try this experiment to see if it works, whether the parameters are right; getting an
answer back in an hour versus waiting 10 days is completely different.
It changed the stress level, the rate at which we can do research. And we were
very proud that last January at our research retreat we were able to do a demo tying
all these pieces together. Summing up, you know, this really is an
integrated research project. When you're not in academia it's often difficult to
tell the difference between a facade and a real integrated project. This is a real
integrated project. All these pieces are working together. We all sit with each
other. We all interact. And, you know, it's paying off.
I'm surprised at how much progress we've made. This is a really hard problem,
right? All these companies have gone out of business trying to solve it. We've
made a lot of progress in two years. And I'm surprised.
We've got some visibility besides the Communications of the ACM coverage. We do a boot camp. So
far we're doubling every year, and so we're going to bigger rooms, where we
invite everybody to come and hear our versions of the ideas. There's HotPar,
which we're involved in; you know, it's a Usenix workshop, but it's at the Berkeley
campus.
Kurt's been leading the way on the ParaPLoP patterns workshop. And there's
a wide following of his pattern language. People really are coming, amazingly from all
over the world, to say, I've got a problem, Kurt, show me how to parallelize it.
And he does. And they get speedup. And we're even planning on testing this
idea on undergraduates: Kurt's going to teach a class this fall. And that's the
overview. I presume there's not going to be any questions. It would be great if
there were questions, but I bet there aren't. It's too early in the morning. So
Kurt, I will pass it on.
I thought saying there wouldn't be any questions would get you guys to --
>>: Sometimes it works.
>> David Patterson: Sometimes it works.
>> Kurt Keutzer: Okay. Am I alive? No. Microphone?
>>: Did you turn it on?
>> Kurt Keutzer: I've got a little green light here. I'm on. Okay. Cool.
Okay. So I run a group within ParLab called the PALLAS group, which has one
of those kind of megalomaniacal acronyms. We're looking at applications,
libraries, languages, algorithms and systems. And what we're kind of motivated
by are these new manycore chips. We don't see the Intel folks putting up the
Larrabee slide as often as they used to, so that worries me a little. But you can
go out and buy one of these today.
So you've got, depending on how you count them, at least 16 or as many as 512
processors, depending on how you look at it. So we can build them, but the
question is how do we program them; in particular, how do we program them to
do stuff that we're really interested in, not just stuff to demo particular machines,
but stuff that we're really interested in.
And so my feeling is that the future of microprocessors will be really limited by
what we can program. So I'm going to start out by telling you how I think you
should not parallel program.
So you take your code, you profile it, you look at that performance profile, scratch
your head for a while, you go, oh, I can add some more threads here, and you iterate
through this loop until you think it's fast enough, at least, and you ship
it.
And the problem with that approach -- first of all, this is not something that
academics just sit around and put up as a straw man to knock over for fun.
If you talk to my, you know, friends at Intel: if you buy enough
parts, they'll ship an application engineer over to help you parallelize your code,
and this is almost exactly what they'll tell you to do today.
And the problem with it is not that it's, you know, kind of aesthetically
unpleasing somehow; it's that there are lots of failures. There are lots and lots of
times, for any of you who have actually tried this, where at the end the software running on N
processor cores is actually slower than the software running on just one.
So if we look at, well, why is that? Well, let's think about what this person is
thinking about when they go through that inner loop and say,
okay, I need to re-code this with more threads, right? So there are books out
there. They start looking at locks, threads, semaphores. They do UML
sequence diagrams and try to puzzle through this for a while. And I don't know if any of
you have tried that, but after a while you start looking something like this, that is,
kind of anxious and depressed.
And one of our colleagues at Berkeley, Edward Lee, has actually written a very
good paper analyzing some of the problems with coding with kind of a
thread mentality.
So what's the alternative? Well, the methodology that we're pushing is a
similar loop. And there's a human in the middle. So in that regard it's not so
different. But what we're focusing on is actually
architecting the software to identify parallelism, not just merely jumping in to find
threads.
And so we spend a lot of time on architecture, and then today it's really a thought
experiment: let's see, how does that software architecture map onto the
hardware architecture? You write up some code. You do some performance
profiling. You go through this same loop, but in the inner loop of this we're
rethinking the architecture, we're not just trying to code in a few more threads.
So why is that different? Well, what this person is thinking about when they
say re-architect with patterns is: well, today code is serial. That's a fact. But the
world is parallel. So there's a sense that if we really dig deep into the application,
if we really understand it, then we will find the parallelism. And then we just need
to reflect that in the software.
And so what that person is thinking about is the computational patterns I'll show
you in a moment, and, as I guess Dave showed you as well, structural patterns and
general software architecture. And I think if you look at this individual, that looks
like a well-adjusted, happy individual to me there, kind of contemplating his
architecture.
Okay. So Dave flashed a slide -- so I'm sure all of you have heard about
patterns. Patterns have been around and popular I think for about 15 years now
at least. What's different about our endeavor, called Our Pattern Language, which
coordinates with Tim Mattson's work on a pattern language for parallel programming,
is that we're trying to do an entire pattern language, which means we're trying to
do a set of patterns which will take you all the way from an application all the way
down to a detailed implementation.
Now, it's not that adding some patterns at, say, any one layer of this is not a useful
thing to do, it's just that it's not quite as aggressive or as comprehensive as what
we're trying to do, which is kind of soup to nuts. You start out with an application
and, going through the pattern language, you will end up with a detailed
implementation.
So I don't have time to talk through all this, but I did want to go in a little more
detail at the top level here at these structural and computational patterns. And so
the structural patterns are, you know, if you look at those, you go, wait a minute, I've
seen those before. Those are the Garlan and Shaw architectural styles, with
maybe a few additions like MapReduce and Iterator here and so forth. But these
basically define how you structure your software. They don't tell you anything
about what's actually computed.
And although in teaching software engineering I've generally used this analogy
that software is like a building, after a long time it occurred to me that software
is actually a lot more like a factory, and in that regard these structural patterns
are about how you actually lay out, say, your factory plant.
So complementing those are the key computations. Now, this I think is
more uniquely the ParLab contribution. In ParLab, long before the
Intel-Microsoft, you know, UPCRCs were even a gleam in anyone's eye,
Dave, Krste Asanovic and I, Ras, Kubi and so forth would get together on
Tuesday afternoons and we would look at different application areas.
Sometimes we'd look at individual applications, sometimes we would look very
broadly across application areas like high performance computing, or computer-aided
design of integrated circuits, or embedded, and we would look across and basically
ask ourselves a question: what are the core computations that are performed in
those application areas?
So this wasn't exactly drafted overnight. We did jump-start from Phil Colella's
observation that he had identified seven classes of computation which
were broadly used in high performance computing. So red here means highly
used. Orange means somewhat used. Green less. And blue means probably
not evident at all. And literally week by week, as we looked at more
application areas, we slowly added more patterns and came up with this list here.
In general my experience now is -- I haven't challenged this audience yet, and I'll
be here for two days, so bump into me in the hall and say, what about this
computation, and if it's not on here, then we'll add it.
But I mean, after three years now, things have really settled down quite a bit. So
these describe computations, and it's interesting -- I think a software audience like
this can maybe relate to it -- how little we've
actually talked about computation. I mean, we've talked to death about, you
know, these architecture styles and object oriented programming, data hiding,
and this stuff. We don't really talk about the core computations as much when
we talk about engineering [inaudible] software. And so to me it's as though we've
been talking about the layout of the factories without actually talking about the
machinery that goes in there. Because these computations are to me the actual
machinery of the factory.
So when we put those together it's analogous to the entire manufacturing plant,
and this is an entire kind of anatomically correct software architecture, the high
level software architecture of one of our applications, the large vocabulary
continuous speech recognition.
Now, literally the afternoon that that kind of went through my mind -- wow, this is
really more like a factory -- I honest to God ran over to our engineering library and
said, boy, there must be a lot known about this. And I can't say that in looking at
the textbooks on factory optimization I was blown away by novel
ideas. But I was very gratified that, wow, they are thinking about the same
things, like scheduling, latency, throughput, workflow, resource management,
capacity.
Probably one of the most interesting chapters was what they call work-cell design,
which essentially gets at, you know, what's the appropriate size of a little work
cell to minimize traffic and so forth. There's a long discussion of tradeoffs
like that. So I think this analogy is holding up well. At least it helps me
think about the problems that we're facing here in software.
So the formula that I'm going to show you in the next slides is: we identify
particular applications, we work with domain experts, and typically each
domain expert had their state of the art algorithm that they wanted to go faster. We
took our tall, skinny, parallel programmers, which means my graduate students,
and we architected the software using the patterns I described, ran
that on parallel hardware, and produced what I think you would agree are
not, you know, modest improvements -- yeah, heck of a job, good engineering
work -- but really game-changing speedups which have the potential to be scalable.
So -- yeah?
>>: Couldn't you put an application like Outlook, which seems to me to be
constantly hanging, in your list of things, and, you know, [inaudible] faster if you
give it many cores?
>> Kurt Keutzer: Yeah. So Outlook is kind of an event-driven architecture, or you
might kind of put a model-view-controller around it, and then -- but then, you know,
we would have to dig in and look at the core computations. But in a lot of these
kind of office automation applications, I mean, they're really, you know, various types of graph
[inaudible] sitting there traversing lists and things like that.
>>: [inaudible] look at the list of applications that --
>> Kurt Keutzer: Oh, sorry.
>>: That by itself doesn't seem to match to one of the columns --
>> Kurt Keutzer: Yeah. That's --
>>: [inaudible] kind of application.
>> Kurt Keutzer: That's absolutely true. And I guess this is a good reason to
come to Microsoft: there's no doubt that office automation type
applications have been neglected in our focus. Part of that is because -- it's not
that we're really shying away from large bodies of code, but we are shying away
from large bodies of code that we can't get down to some basic kernel, right?
So, you know, we'll be around for two days, and if you can help me understand,
well, if we sped up these particular kernels, you know, by some --
>>: [inaudible] kernel there. There are a lot of small things here and there, but
[inaudible] make use of those things.
>> Kurt Keutzer: Right.
>>: But many of the applications we have, like Visual Studio -- Visual Studio has
the same problem, is that there isn't like a graph algorithm that runs for 20 hours.
>> Kurt Keutzer: Right.
>>: [inaudible].
>> Kurt Keutzer: Well, you know, I don't think that this is something that I'm
going to mastermind during my talk here. But I'm here for two days, I'm at your
disposal. I'm completely happy to sit down and take a look at those.
And also I will say something a little more, which is, you know, we haven't been
chickening out, you know. I mean, even -- as soon as
we published the Berkeley View, folks at other universities went around giving
talks on Amdahl's law, making remarks that some universities seem to have
forgotten Amdahl's law and things like that.
>>: That was in our Berkeley View report.
>> Kurt Keutzer: Well, no, we mentioned Amdahl's law. But they weren't given to
suddenly doing tutorials on Amdahl's law after 20 years because, you know, they
just thought it was time to get out the tutorial material. People were standing up in front
of audiences saying some universities have forgotten Amdahl's law, you know, over
the fact that we put in a line or two.
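For reference, Amdahl's law is the bound in question; a one-line version in Python, with the 95 percent figure chosen purely as an illustration:

    def amdahl_speedup(p, n):
        # p = fraction of the work that parallelizes, n = number of cores
        return 1.0 / ((1.0 - p) + p / n)

    # e.g. amdahl_speedup(0.95, 64) is about 15.4, far short of 64.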
But we -- I'm going to show you a bunch of applications where, you know, people
thought we would be defeated by Amdahl's law and we were not. Yes, sir?
>>: [inaudible] seems like in general the focus has been downwards, looking at
how to use parallelism in hardware. And another source of parallelism is in the
environment. I mean, in finance you have streams of data coming in,
and in robots you have sensors giving you streams of activity. In Outlook
you get streams of requests from all sorts of sources.
And it seems like there's that aspect of parallelism, which is
dealing with data streams and asynchrony in the environment, being able to react to
that. And in order to be able to react to it quickly you also need to take
advantage of the parallelism in the underlying hardware. So I think a lot of the
applications actually combine these two. If I'm in games, I'm using GPUs and I'm
using multi, multicore to speed up things. But you've also got to have an
architecture that is responsive to network events, to input with things like
[inaudible]; I've got a lot of multimodal input.
So it seems like you've sliced out a very important part and focused more on --
like he was saying, sort of the kernel. And I think there's a whole set of
applications which, you know, have this external environment and asynchrony
that --
>> Kurt Keutzer: Yeah.
>>: Speech recognition a little.
>> Kurt Keutzer: Right. So first of all, I mean, briefly -- and again, if you don't
like it -- I mean, I'm here for two days to talk about the broad-ranging cure for
cancer, whatever, you know, what about Zimbabwe, things like that. All that stuff
I'm happy to talk about.
But what I'd like to do is make sure I get through what we have done --
>>: Sorry, you're at Microsoft.
>>: You're going to get questions. So I'm sorry --
>> Kurt Keutzer: No, no --
>>: [inaudible] a very legitimate question.
>> Kurt Keutzer: So let me address your question.
>>: And I'm asking you a very specific technical point.
>> Kurt Keutzer: Okay. So all right, then I'll --
>>: So I'd like to hear your thoughts on that.
>> Kurt Keutzer: Okay. So first I think you have to understand the domain of the
problem that we're trying to address, which is: there is, for the first time in history,
inexpensive parallelism being packaged in parts and put in people's laptops and
desktops. And the question is what people are going to do with that.
Now, that may -- I think that's a very relevant question for Microsoft.
>>: That's great. And I want to have a conversation, and we can have that later.
But I think the other trends are that these things are networked, the computers are
smaller, the disks are huger. So the fact that it's just more processors that we
need to leverage is just a small component of a much larger
ecosystem, and I think you [inaudible] you need to track all those things.
>> Kurt Keutzer: Well, you know --
>>: And take them all into [inaudible], not just about --
>> Kurt Keutzer: So you probably say, I'm from Microsoft. Well, I'm from Bell
Labs, right. And I'll stand up here for 24 hours and discuss this with you, if you
want. In the meantime --
>>: No, we can take that offline. I'm just saying it's not just about trends in the
hardware, it's disks, it's networking --
>> Kurt Keutzer: Yeah, but there's obviously a lot of system-level trends and
there's a lot of software out there. But, I mean, I think the nexus of that, in a lot
of the work that I do -- the world that I live in -- is a desktop or laptop. And
if that doesn't utilize the processors that are now
economical, then all this other stuff is going to be held up by that, all right? Okay.
Fair enough is good enough. Okay.
So it basically proceeds this way, how do you parallelize things. And so what
I'd like to do is go into a little bit of detail on one application and then, at this point,
just kind of literally flash at you a bunch of other applications, you know.
And part of this is to show -- you know, it's kind of a long-winded way of saying,
trust me, we really did some of this; it's not just the professor talking, where we
said we were going to architect these things and the students went off and
hacked up code.
And part of this is really to get some insight into how this proceeds. So this was
one of the first things we looked at. You know, Berkeley is a hotbed of
machine learning, and support vector machines are at the core of a lot of what we
do. Support vector machines are basically two-class classifiers, which is great
for, like, separating out the baby pictures from the flower pictures. And the basic
approach here, this architecture that I have at the top level, is a
pipe-and-filter architecture. We train on some initial examples: you put in
some seed photos of children, you put in some seed photos of flowers,
you train your classifier, and then you get some new images, you extract the
features, you exercise the classifier, and you get your results.
So at the top level here we have a very simple pipe-and-filter architecture. As a
matter of fact, at the top level of almost any software you see a series of filters
connected by pipes. So we're going to go a little bit deeper on that in this
example.
So looking at this feature extractor, if we pop open this filter, what we see is
another little mini pipe-and-filter architecture. And then if we pop open this filter,
we see a little structured grid computation acting over the image. If we pop open
this filter, what we'll see is another of what we call structural patterns, MapReduce,
which is mapping computations over this. What kind of computations is it
mapping? Dense linear algebra. And then the actual building up of the feature
vector, the descriptor, is a MapReduce computation that maps another
computation on there and gathers it all together. And what's being done at each
of those maps is a structured grid.
So at this point we would say, yeah, that's a high
level architecture. You've got to turn it on its side; you see a tree structure
there. Okay. Then if we go into the train classifier, basically this is an iterative
approach here. So what we're doing here is, on our training data, we're
just slowly piecing together a frontier that says, basically, these are the babies and
these are the flowers. And this is essentially optimization. We're doing that by
solving a quadratic programming problem to create the space. And we're
iterating over that until all of the points to be classified -- all the original training
points -- are within a margin of error.
If we pop open this filter we have an iterator; inside that iterator we have a
simple pipe and filter. Look inside those filters, again we have a MapReduce. Look
inside those maps, we'll see again just some dense linear algebra.
So finally the third piece is this exercise classifier. So after we've done the
training and we actually have a new photo to classify -- basically we
built a frontier during training and we've got to evaluate which side of this frontier
we're on: is this a flower or is this a baby? And so inside of this exercise classifier we just
have a simple pipe and filter where we compute dot products against all the
elements that define the frontier of the support vector machine -- that's using dense
linear algebra -- and then sum up, trying to say which side of this frontier
this is on. Okay?
So that's really it, nothing up either sleeve. And you can decide whether that's all pretty
obvious or not. But by systematically going through this, rather than just jumping
in and seeing where the bottleneck is in the support vector machine code we
downloaded and so forth, we feel we were able to get a lot more parallelism.
And so we published that work a couple of years ago now at the International
Conference on Machine Learning, and we were able to get about 100X speedups.
And if you want to download this and try it yourself, last time we looked there
were about 900 downloads of this software.
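A minimal sketch, in Python with NumPy, of the structure just described: a top-level pipe-and-filter whose filters are maps over dense linear algebra. The function names and the toy linear kernel are illustrative assumptions, not the published code.

    import numpy as np

    def extract_features(images):
        # MapReduce over the images: map each image to a feature vector, gather into a matrix
        return np.stack([np.asarray(img, dtype=float).ravel() for img in images])

    def exercise_classifier(support_vectors, coeffs, bias, features):
        # dot products of each new feature vector against the support vectors
        # (dense linear algebra), then a reduction to a signed score
        scores = features @ support_vectors.T @ coeffs + bias
        return np.sign(scores)            # e.g. +1 = baby, -1 = flower

    # pipe-and-filter composition:
    #   feats = extract_features(new_images)
    #   labels = exercise_classifier(sv, coeffs, b, feats)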
Okay. So I am literally -- whoa. I think I'm down to like five minutes now to go
through seven more applications. So in the interest of time -- I presume
there are people from different applications groups here -- I'm just going to flash
and reiterate, you know, the method on some of these, and then if you want to
talk about speech recognition or computational finance or object recognition, I'll be
happy to talk more about that.
So again, our approach is we find domain experts. Jitendra Malik, for example, has a strong
computer vision group. We looked at his state of the art algorithm for detecting
contours here. And this is a kind of classic example of the way in which we
interacted with other faculty. Great algorithm, best results in the world, runs
really slow. So we dug into the architecture of that. This is the top level
architecture, which, as I say, always looks a little bit like pipe and filter, and,
you know, the moral of the story is we got about a 130X speedup of this. We
released the software. It's got about 490 downloads. And about that 130X speedup,
when I talk about these game-changing speedups: when Jitendra first came to us,
his software, his contour detector, was so slow it was a non-starter
for Adobe to even talk to him. By the time he ended he's saying, okay, well, let's
do video, right? At these speeds we can begin to talk about doing
contour detection on video. MRI is very different. So Professor [inaudible], a new
professor at Berkeley, had a very fast algorithm for doing compressed sensing
in MRI. So he was able to gather the data very quickly, and using this
compressed sensing reduces the amount of MRI time by an estimated factor of
four. That's great for you and me. For children that can be really critical, making the
difference between needing an anesthetic or not.
But the problem is the radiologist needs an image to decide: do we need to take
some more images? So the reconstruction has to be in realtime. His
reconstruction took hours, right? So it was basically a non-starter for clinical use.
So we dug into the architecture of that -- pipe and filter at the top here, it's a little
easier to see -- and we were able to find these kind of nice large fork-join areas where
we could do dense linear algebra across here. The fork-join here, data
parallelism across these Fourier transforms and so forth.
And we were able then to speed that up by about a factor of 100, and so it went
from being kind of an interesting academic exercise to something that
could actually be employed in clinical use, and it's been used in over 200 trials
now, producing images that radiologists really look at.
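A minimal sketch of the fork-join, data-parallel structure mentioned here, assuming independent Fourier transforms per k-space slice or coil; it only illustrates the pattern, not the compressed-sensing solver itself.

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    def reconstruct(kspace_slices):
        """Fork: one inverse FFT per k-space slice. Join: combine by sum of squares."""
        with ProcessPoolExecutor() as pool:
            images = list(pool.map(np.fft.ifft2, kspace_slices))
        return np.sqrt(sum(np.abs(im) ** 2 for im in images))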
So those are very dense computational type problems. Speech recognition is
very different. So I mean, you have the basic idea: speech recognition, we
have voice input, you want to get a word sequence out. So there's some signal
processing up front, and there's lots of potential for kind
of data parallelism there in the speech processing up front. But what we
wanted to focus on is what we thought was actually the harder problem, which is the
inference engine here. So here's a kind of approximately anatomically correct
high-level architecture of the inference engine iterating through the individual
phones. Here's a little bit more detail on that.
And to get some sense of why this is a lot harder: you know, if you dug into those
other computations that I kind of flashed up and really went down to the details of the
dense linear algebra, you would see a matrix-matrix multiply, things like that.
You're not shocked that we can make that run faster on a GPU, right, and so
forth. Well, here, this is a weighted finite state transducer, and so we have the
problem -- which those of you who have tried to parallelize graph algorithms will have
noticed -- that you never quite know where you're going next, even though you've got a lot of
places that you need to go. And once you get there, somebody else may be
racing to get there first.
And so there's lots of problems with how to parallelize these graph computations.
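A minimal Python sketch of that race: workers expand parts of the frontier in parallel, and the update of a successor state's best cost has to be made atomic (a lock here; on a GPU this would be an atomic min). The data layout is an assumption for illustration.

    import math
    from concurrent.futures import ThreadPoolExecutor
    from threading import Lock

    def expand_chunk(chunk, arcs, costs, lock):
        # chunk: list of (state, cost); arcs: state -> [(successor, weight), ...]
        touched = set()
        for state, cost in chunk:
            for dst, weight in arcs.get(state, []):
                new_cost = cost + weight
                with lock:                                   # another worker may race to the same dst
                    if new_cost < costs.get(dst, math.inf):
                        costs[dst] = new_cost
                        touched.add(dst)
        return touched

    def parallel_step(frontier, arcs, costs, workers=4):
        lock = Lock()
        chunks = [frontier[i::workers] for i in range(workers)]
        with ThreadPoolExecutor(workers) as pool:
            results = pool.map(lambda c: expand_chunk(c, arcs, costs, lock), chunks)
        return set().union(*results)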
But nevertheless we were able to speed it up by a factor of 11 on a manycore machine.
And, you know, that level is a lot different from, say, 100,
but the interesting fact is that we were able to get faster than realtime. So if you were trying
to, say, do a realtime [inaudible], you know, on your laptop, you could envision
actually being able to do speech recognition of, say, a meeting in
realtime. Yes, sir?
>>: [inaudible].
>> Kurt Keutzer: I'm sorry?
>>: Are the speedups ever super linear? Do you have more than 11 cores?
>> Kurt Keutzer: Oh, so it depends -- it goes all the way back to that first slide:
you know, whether you say a GPU is 16 processors
or 512 or something like that. So if anybody shows you more than 512X,
something funny's going on. But, you know, we have good scaling; that is
to say, as you throw more cores at it we do consistently go faster. But
we're not, say, getting more than 512X speedups. We are getting more
than 16X speedups, as you've seen.
Okay. Here's another one, computational finance. So, value at risk. Sounds
like a good thing to do, particularly after October 2008 or so. So here's a very
simple kind of top level architecture of the four steps of Monte Carlo in finance.
And we were able to run that 60X faster on a parallel processor which has still yet to
be released from Intel, the Larrabee processor. I think we can say that. We're
being recorded. I hope I can say that.
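A minimal sketch of the generic four-step Monte Carlo value-at-risk calculation being referred to, with an assumed lognormal single-asset model; the model and parameters are illustrative, not the kernel that was ported to Larrabee.

    import numpy as np

    def value_at_risk(position, mu, sigma, horizon_days=1, n_paths=100_000, alpha=0.99, seed=0):
        rng = np.random.default_rng(seed)
        # 1) generate random market scenarios
        returns = rng.normal(mu * horizon_days, sigma * np.sqrt(horizon_days), n_paths)
        # 2) revalue the position under each scenario
        future_values = position * np.exp(returns)
        # 3) form the loss distribution
        losses = position - future_values
        # 4) read off the alpha-quantile
        return np.quantile(losses, alpha)

    # e.g. value_at_risk(1_000_000, mu=0.0, sigma=0.02) gives a 99% one-day VaR estimate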
This was done during a summer internship. So those are the applications that
we have that are kind of tidied up, published, peer reviewed. People seem to be
excited about them, people want to download the software. I'll just literally flash at
you then some other things we're doing, just in case you're interested.
So object recognition. So kind of building on the contour detection work
which we described to you earlier, we're looking at this basic problem: you've
got some basic categories like swans, bottles, Apple logos and so forth that
you've trained on, and then you have a new image and you want to be able to
identify that image inside of a photo, say. And we've been able to speed up pretty
significantly Jitendra Malik's algorithms, on both the training portion, 7X, and
the classification portion there.
So as I said, we got such speedups on still images that Jitendra said, well, why aren't
we doing video? So we worked with one of his post-docs, Thomas Brox,
here, who was looking at this whole issue of how do you follow motion
in a video. So we have something like this: we'd like to actually follow the
motion of this chameleon walking through here. And we architected that out,
architected it down to where we got to some highly data-parallel computations. And we were
able to get a 35X speedup. And to give some tangible sense of what that
really buys you: if you actually try to track the motion of somebody moving as
fast as, say, a tennis player with that leg extended, other approaches just won't
be able to capture and identify that that leg is actually moving
continuously with the whole person. But we were able to do that in realtime with
this.
And then last, I just want to clarify: this is not a Berkeley professor and his graduate
student. I want to be clear about that, particularly since this is being recorded,
and, you know, I have a beard, no hair on top like that.
So in poselet detection what you're trying to do is identify key
features of human beings. So basically Lubomir, another post-doc of Jitendra
Malik, has been working on how we can use unique features of human poses to
identify humans quickly. And again, we were able to speed that up about 17X.
And in a sense, you know, our work is building on top of itself: the
matrix multiply time in the middle of that support vector machine slide that I
showed you earlier is actually the bottleneck in this computation.
Okay. So just to wrap up here. So, you know, you can probably guess: in this
formula that I just showed you earlier, can you see what the bottleneck is? Pretty
clear what's wrong, what's the big limiter in this picture? Yes, sir.
>>: You?
>> Kurt Keutzer: Well, in a roundabout way. But it's these tall, skinny programmers,
right? Basically everything that I've shown you has required one to three
tall-skinny programmers to dig in there and spend a lot of time. So my
students do these speedups and architect these
applications as a kind of initiation fee. But what we're really doing is using these
tall-skinny programmers to build application frameworks, so the domain experts
can use these frameworks without having to have some expert programmer there
side by side, going every step of the way, doing the implementation.
And to give you in just a couple of slides what this looks like: I showed you
earlier this recognition inference engine in the middle of this. And
oftentimes if you really dig through, you know, C++ code -- hopefully not -- or, you
know, even MATLAB, to try to understand what domain experts are doing,
they're often just doing a wide range of experiments, choosing some very
high level parameters. What's the pruning threshold? How do we want to do the
observation probability? How do we actually want to
represent the words -- do we want to use a weighted finite state transducer, or
do we need a lexical tree, and so forth?
And so these very high level decisions that domain experts want to make end up,
you know, invoking an awful lot of code when you actually want to experiment
with them, and what we're trying to do is package that up in a way that they don't
have to deal with that. So we kind of see three different ways. One is kind of the
radio button, bullet-point selection menu type, where it's pretty clear there are a few
choices that you want.
The second is where they might actually just go in and code the key kernel. And
the third is where they might actually take, say, some series of filters in a
pipe-and-filter and run a [inaudible] computation across a number of those, right?
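A hypothetical sketch of those three styles of customization; none of these names come from the ParLab frameworks, they are only meant to show the shape of the interface.

    def recognize(audio,
                  pruning_threshold=1e-4,        # style 1: "radio button" parameter choices
                  observation_prob=None,         # style 2: plug in your own key kernel
                  frontend=()):                  # style 3: compose your own pipe-and-filter stages
        features = audio
        for stage in frontend:                   # run the user's filter chain
            features = stage(features)
        score = observation_prob or (lambda feats, state: 0.0)   # framework default if none given
        # ... the parallel inference engine would consume `features` and `score` here ...
        return features, score

    # usage: recognize(samples, observation_prob=my_gmm_scorer, frontend=(window, fft, mel))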
So it's just to give you some intuition. Yes?
>>: 43 out of 36 slides.
>> Kurt Keutzer: [inaudible]. So to conclude: we're pretty gung-ho about
single-node parallelism. And we believe that the key here is to start with the
software architecture, not just jump into coding. And I believe I have a lot more
credibility saying this today than I had two years ago, because we've actually had
some success doing this.
In conjunction with domain experts we've demonstrated this approach in a
wide range of areas, so I think, you know -- I mean, office automation
sounds pretty tough, but, you know, once we got past speech recognition, we
weren't afraid of graphs and so forth. And the goal here is essentially to
encapsulate what we've learned in frameworks and put what we've gained in the
hands of application developers.
Okay. While the next person is setting up, I'll be happy to take some questions.
[applause].
>> Kurt Keutzer: Yes, sir?
>>: [inaudible].
>> Kurt Keutzer: Sorry. I have a little trouble hearing you.
>>: [inaudible].
>> Kurt Keutzer: Oh, great question. Great question. Yeah. Thanks -- you
know, you always love the questions that get you to talk about one more slide,
even if I [inaudible]. So yeah. As you look at [inaudible] some geometric
decomposition and then you're going into some sort of [inaudible] implementation,
that's a very -- that path is repeated in 70 percent of the applications that we see, and
so we're building essentially not a high-level application framework but a
programming framework that says, if you're doing MapReduce in this variety, if
you're doing [inaudible] computations, then, you know, here's a programming
framework in which you can do those [inaudible] language. And then that goes
on to support all the computer vision applications like you described. [inaudible]. Anything
else? Okay. Thank you.
>> John Kubiatowicz: Okay. Can you hear me? Can you hear me now?
Testing. Oh, there we go.
My name is John Kubiatowicz, although most people call me Kubi because they
can't pronounce my last name, so that's fine. But I'm going to talk a little bit
about some of the operating systems work that we're doing there in the ParLab.
And so you might legitimately ask the question: what actual operating system
support do we need for manycore? And perhaps we could just take Windows
or Linux or something and port it and just be done with it. But, you know, these
are mature operating systems. There's a lot of functionality in there. It becomes
very hard to experiment with. It's possibly fragile, and it may or may not be what
we actually want.
So the approach that we're taking in the ParLab is to actually say well suppose
we start from scratch, what might we do? Okay. We can do this because we're
not designing a product, right? We don't have to support everything right at the
beginning.
So clearly, applications of interest -- and what are applications of interest? We're
really kind of looking forward, asking ourselves what people are going to want
going forward with manycore applications, possibly on the client. And clearly the
whole point of this is that explicitly parallel components are key,
okay. Because if we take manycore as, you know, a given for a moment, that
means that we're hopefully doubling the number of cores on some short timeframe.
And the only way we're really going to be using that is by getting parallelism. So this
almost goes without saying. Okay. So if the thing we come up with doesn't
support parallel components, then we've got a problem.
You know, obviously direct interaction with the Internet cloud services, that's
clearly important as well. Okay. And so nothing that we do ought to prevent that.
Interestingly enough, just that remote interaction gives us some security and data
vulnerability concerns which perhaps we can address.
And a lot of these new applications that seem to be of interest to people have
real-time or responsiveness requirements. People are talking about new
GUI interfaces, they're talking about gesture, they're talking about audio and
video, and so we would want to see whether we could actually exploit manycore
to give us some, you know, better real-time behavior. And
related to that is responsiveness, okay. By real-time I usually mean explicit
deadlines or perhaps some streaming requirements like frame rates, whereas
responsiveness has to do with, you know, I click that device and I'd better get
something that happens right away.
So I'm just going to flash this up here; this acronym was just an amusing one that
I came up with. But it sort of reflects some of the things that are really of
interest to us. It's RAPPidS, and it stands for responsiveness, agility,
power-efficiency, persistence, security and correctness. As all acronyms go, this
one's not perfect, but it kind of gives you a flavor for some of the things that we're
interested in. Namely, you don't actually see high throughput as a key
requirement; it's almost a given that, okay, sure, we want to compute well,
but these other things are equally important to us. Okay? And we can
debate acronyms some other time.
But so what's the problem with current operating systems? Well, I don't know.
They often don't really give you a clean way of expressing the application
requirements, okay. And I put "often" in here because there are
counterexamples, not many, but a few. They might not
let you say things like, what's the minimal frame rate I need, or what's the minimal
amount of memory bandwidth, or QoS, or whatever.
Perhaps they don't give you guarantees that the application can actually use. So
gee, I've made this component work really well in isolation, but the moment I
put it in with a bunch of other things then suddenly it doesn't work well anymore,
okay, and that's because it's being interfered with. And there often aren't good ways
to express that a particular component cannot be interfered with in order to
really have the behavior intended.
Full custom scheduling is often not an option. Now, why would I say something
about that? Well, if you think about future client applications in a lot of the work
that Kurt just told you about, the parallelism there depends on a scheduler that's
application specific, okay? It's particularly tuned for the application. And if we're
interested in parallel components in the future, then we really want to make sure
that we can support whatever kind of scheduling is needed.
And, you know, this one's almost funny to put a question mark on, but, you
know, are security or correctness actually part of modern operating systems? One
might hope so. But it's not always clear. Okay. So the way I view the advent of
manycore is that it sort of exacerbates all these problems, because there's a lot
more resources that we could either use well or poorly. But it also provides an
opportunity to sort of redesign things from the ground up. And so I'm going to
view manycore as basically a possibility, you know. It gives me the chance to
rethink a few things. And what I want to do for the rest of my talk
here is basically tell you about the model of the operating system we're
thinking about and some of the interesting implications of that model, okay? And
that's pretty much where we'll go. And toward the end I'll tell you about some
future directions that we're going in and talk about a prototype and so on.
So the first thing I want to tell you about is this idea of two-level scheduling and
space-time partitioning. And it was kind of interesting when Burton Smith came
out to give a talk recently: I was kind of nodding my head, yeah, okay, yeah, I
agree with that.
So basically, two-level scheduling as I am using the term starts from the standard
monolithic scheduler that you often see in an operating system: there is some thing
in the middle whose job it is to try to do the best it can at satisfying everybody.
Okay? And oftentimes you'll find it's got lots of options and tweaks and so on, but
it's basically monolithic.
And instead what we're going to do is we're going to split it into two pieces. One
is the resource allocation and distribution, okay? And what this -- the idea here is
that there are entities in the system, and we're going to decide to give them
resources. And we're not going to try to figure out how to schedule those
resources, we're just going to say okay, we're going to give you so many cores,
we're going to give you so much memory bandwidth, we're going to give you so
many resources. And the decision about that is going to be based on constraints
about how fast we want that thing to run or our observations the way it goes in
the past, okay? And that's going to be our high-level decision is I'm going to
hand resources to you.
And then at the second level, the application is going to use its application
specific scheduling to use those resources. Okay? So this is kind of a two-level
approach here. Yes, go ahead.
>>: [inaudible] logical resources?
>> John Kubiatowicz: Well, that's a good question. Are they physical or logical
resources? Perhaps I will defer that question for a moment. I'd say that
ultimately every resource has to be virtualized in some way, but they're going to
be as physical as possible, okay? Because that's going to give us better
guarantees of performance. Okay? And you can ask your question again in a
moment. We'll see whether I answer it. Okay?
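A minimal sketch of the two-level split being described, with assumed names rather than the ParLab prototype's: level one decides how many resources each entity gets, and level two, inside the application, decides how to use them.

    def allocate(total_cores, demands):
        """Level 1: divide cores among cells in proportion to their stated demand."""
        total_demand = sum(demands.values())
        return {cell: max(1, total_cores * d // total_demand) for cell, d in demands.items()}

    def app_level_schedule(tasks, granted_cores):
        """Level 2: the application's own scheduler; simple round-robin here."""
        plan = {core: [] for core in range(granted_cores)}
        for i, task in enumerate(tasks):
            plan[i % granted_cores].append(task)
        return plan

    # grants = allocate(64, {"speech": 3, "vision": 5})   # e.g. {"speech": 24, "vision": 40}
    # plan = app_level_schedule(my_tasks, grants["speech"])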
Now, so this idea of spatial partitioning is -- starts out as a very simple one. So
here is a 64 core multicore processor, or chip. And basically what a spatial
partition is it's a group of processors with a hardware boundary around it. And so
up front I'm admitting the possibility of hardware support. And one of the nice
things about the ParLab is we actually have the ability to experiment with new
hardware wrappers around processors. And I'll show you how we can use some
of that in a moment.
But basically it's a group of processors within a hardware boundary. And the
boundaries are hard. So here's an example in which I've taken kind of the
64-core chip and I've divided it up into chunks, and the key idea here is basically
to go after performance and security isolation -- that's performance isolation
and security isolation; sorry, I should have put parentheses around that -- by
dividing up the resources. Okay?
And so each partition here essentially receives in principle a vector of resources:
some number of processors, some dedicated set of resources which it has
exclusive access to -- for instance, complete access to hardware devices or
dedicated raw storage on a disk or a chunk of the cache, okay? --
and then some guaranteed fraction of other resources. And here's where
hardware might help, for things like a fraction of the memory bandwidth. Okay?
So I'm assuming here that we might actually have the ability, and we do have
preliminary mechanisms for this, to actually say, gee, this partition gets 30 percent
of the memory bandwidth and nobody can interfere with that 30 percent. Okay?
Now, if we don't actually have that hardware available, then we can try to
emulate that. But, anyway, fractions of other services. Yeah, go ahead.
>>: [inaudible] programmable can the operating system say now I want 30
percent, now I want 30 percent.
>> John Kubiatowicz: Oh, sure. So my assumption, my assumption is that this
hardware mechanism is fully under the control of the operating system. Yes.
Okay. So yeah, 50 percent, 30 percent, 20 -- whatever. Some fraction. Yeah,
go ahead in the back.
>>: Have you thought about how this could interact with a hypervisor as far as
allowing to punch through to the lowest process [inaudible].
>> John Kubiatowicz: So what we're going to do for the moment is let's get rid of
the hypervisor, okay? What I'm going to replace it with is something I'm going to
loosely call a NanoVisor, okay? And I'm going to do that simply because calling
it a hypervisor has baggage.
All right. So now let's take a look at something about -- so the first thing I want to
do is okay, this seems interesting and maybe I can understand that performance
isolation might be useful, but it seems like I've burned something in performance
right off the bat by clipping off, you know, isolating things to one set of processors
or another.
It's interesting, we actually have some folks in the hardware group that have
done some experiments with spatial partitioning and what they've found here is
that, in fact, if they just sort of take two applications or multiple applications and
run them simultaneously, just multiplexed in sort of a standard OS fashion, that
actually doesn't work as well as cutting the machine down and giving sort of part
of the machine to one app and part of the machine to the other app.
Now, what's interesting about that, though, is you can't just divide it in half.
That's this green bar. The sort of -- that's not the best. The best partitioning is
something specific to the apps. Okay, maybe I give four processors to one and
60 to the other, okay? So there is some spatial partitioning that is best for those
two apps other than just running a regular scheduler.
And this is kind of an interesting possibility here from performance standpoint.
Now, I'm interested in spatial partitioning for lots of other reasons. But I just
wanted to show you this to indicate that maybe it doesn't cause you to throw out
performance right away. Question in the back, yeah?
>>: Do you have an explanation for why this [inaudible].
>> John Kubiatowicz: You know, basically one way to look at this is that
applications don't scale linearly in many cases, and so there's a point beyond
which, as you add more processors, the bang you're getting for your buck is not
making up for what you're losing at the other application that could be
given those processors. So this is kind of an effect of not perfectly linear scaling,
among other things.
>>: [inaudible] that produce those kinds of [inaudible].
>> John Kubiatowicz: You mean the patterns in the sense of what Kurt --
>>: Yes.
>> John Kubiatowicz: So these -- I don't think these were particularly done with
patterns. These are actually just a set of standard parallel benchmarks that we --
>>: [inaudible].
>> John Kubiatowicz: I cannot. I apologize. I don't have an answer for that.
That's a very interesting question, which I am now going to go try to
figure out. That's a good question.
Okay. So obviously if we just stuck to spatial partitions that were fixed, then
that's not going to be useful, right, I mean we obviously can't fix something at
boot time and keep it that way. So clearly there's going to be what we'd like to
call space-time partitioning.
And so, you know, here's an example of a 16-core machine where we've
partitioned it up and the colors represent partitions. And you could imagine that
over time things vary a little, right? Okay. This is probably not at all surprising
to anybody that we might want something like this.
And why is this not just standard scheduling? Well, what's interesting here is that
first of all these time slices are somewhat coarser granularity than maybe a
normal OS time slice. So we're not trying to go after the really fine grain
multiplexing. That would be for the second level scheduler. What we're trying to
do is we're trying to put these isolated machines out there that basically are not
disturbed by other applications and use their second-level schedulers to get their
performance, okay?
So I would actually like to call this controlled multiplexing, not uncontrolled
virtualization. Okay? Or another way to look at this is for instance, we're
planning on scheduling these slices a bit in the future because we know enough
about what we're trying to give in terms of resources. Okay? And also I'll point
out that resources are gang-scheduled. So this is important for a lot of different
parallel programming paradigms that when I give a set of processors I'm going to
give them all at once. And I give all the resources at once. Okay? And I'm going
to take them away all at once. And the reason for that is that that gives the
user-level scheduler the ability to do a better job of scheduling its resources
because it knows what it's got and what -- yes?
>>: [inaudible] user actual measurements or what it costs to reconfigure your
partitioning?
>> John Kubiatowicz: So we don't have a lot of numbers of that form right now
because, first of all, our prototype's in the early stages. And the second is we're
actually playing with hardware support. So I would claim that as a hardware
person I could make the changing of this as cheap as possible for the software,
but it's not clear that's a good tradeoff. So I think I could tweak the knob in lots of
ways as to how expensive this is.
Why don't you ask me that question again in a year or something, and I might
have a better idea as we really push these ideas. But you could imagine that if
the only thing we're talking about is processors, it's the cost of a context switch.
If what we're talking about is setting up registers in the machine to get bandwidth
isolation and so on, it could either be more or less expensive, depending on what
kind of support we want to put in. Okay?
Now, I would claim it's not -- it's not going to be that expensive. We shall see.
So let's push this idea a little bit further. So if I'm space-time partitioning things
then that isn't really something a programmer wants to deal with. So obviously
we need a little virtualization. And this is getting back to your question earlier.
And our view here is an abstraction we call a cell, which is basically a user-level
software component with guaranteed resources.
Is it a process? Is it a Virtual Private Machine? You know, we got into a lot of
arguments once about whether we should call this a process, and, in fact, I was
resisting because it's more, it's less, I don't know. But it's -- there's an analogy
with a process here. It's got code, it's got an address space, it's a protected
domain. But maybe there might be more than one protected domain or there
might be more paging. So it might be more or less than a process. Okay?
What are the properties of the cell? It has full control over the resources it owns.
So I would say that while the cell is mapped to the hardware, it can use any
resources we give it access to. It contains at least one address space. It has a
set of communication channels with other cells, and I'll show you those in a
moment. And then it has a few other things like a security context, which may
automatically encrypt and decrypt information as it crosses cell boundaries.
These are a few things we're playing with. And maybe it has the ability to
manipulate its address mapping via some sort of paravirtualized interface. So
this is a potentially pretty low-level machine that we're handing to a
piece of parallel code, a component so that it can make best use of it.
And realize that the reason we're proposing something like this is to stay out of
the way of folks like Kurt who are busy trying to tune things to run as well as
possible. We want to make sure that we provide a nice, clean environment.
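A rough sketch of what that cell abstraction bundles together; the field names are invented for illustration, not Tessellation's actual interfaces.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class Channel:
        peer: str                      # the cell on the other end
        max_requests_per_sec: float    # QoS limit enforced at the boundary
        encrypted: bool = False        # the security context may encrypt crossings

    @dataclass
    class Cell:
        name: str
        address_spaces: List[str]      # at least one protected domain
        guaranteed: Dict[str, float] = field(default_factory=dict)   # e.g. {"harts": 4, "mem_bw": 0.3}
        channels: List[Channel] = field(default_factory=list)
        can_remap_pages: bool = False  # optional paravirtualized address mapping

    file_service = Cell("file-service", ["fs-as"], {"harts": 2, "mem_bw": 0.10})
    app = Cell("compute-app", ["app-as"], {"harts": 8, "mem_bw": 0.40},
               [Channel("file-service", max_requests_per_sec=2000.0)])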
Yes?
>>: Would something like a network device map on to these cells?
>> John Kubiatowicz: So a network device would typically -- I'll show you a kind
of a funny sketch in a moment. But a network device would typically get a cell or
maybe a couple of devices, depending on how you want to program it might be
put into the same cell. So we might have the network devices get a cell with a
set of resources and they're allowed to use them any way they want. Okay?
And so when mapped to the hardware, the cell gets gang-scheduled hardware
threads. We call these Harts. It gets guaranteed fractions of resources and so
on. Okay. Question. Yes?
>>: [inaudible] multiple resources of each given type.
>> John Kubiatowicz: Right.
>>: Where does this one instance [inaudible].
>> John Kubiatowicz: So if there's one instance of a given type, there are a
couple of possibilities here. One, you give it to exactly one cell and that cell
performs the multiplexing, so you actually have some software that acts as a, you
know, as a gateway to that device. That would be more in keeping with this
philosophy than trying to multiplex that one device automatically underneath,
because that would turn us from a NanoVisor back into a hypervisor.
So it's more that you would have a software component that would do explicit
multiplexing.
So okay, so what do we do with cells? Well, here you know I see an application
divided into an explicit parallel core piece and some parallel library with a secure
channel between them. So it's kind of a component based model. Applications
are interacting components. And, you know, we get composability here.
Obviously we can build this parallel library component separately from this
application. This might have some properties that we tune. And then when we
use it in this application, we keep those properties.
So the other interesting thing is that cells being co-resident on a bigger chip, so
remember we're thinking about manycore, means that potentially you can cross
this protection domain rapidly just by sending a message. Okay? There's no
context switch involved, which is kind of what we were stuck with on a single
processor or a small number of processors, okay?
So this kind of echoes a microkernel view in some sense but is different in that, A,
we're giving it to applications as well as services, and, B, we have this potential
of very fast crossing of domains. And within the cell you have fast parallel
computation. So we're keeping parallelism fundamentally here. And of course
here we could see what might be the full mix where there's some real-time cells
that are doing audio and video, the file service is part of the OS. Device drivers
might be running in their own cells, and so on. Okay?
Now, of course it's all about the communication. So I've been kind of ignoring
that a little bit. But we're interested in communication for lots of reasons.
Communication crosses the resource and security boundaries. The efficiency of
communication impacts how much decomposition you're willing to do. Here's an
interesting issue here.
So we're interested in quality of service. And one of the things we're interested in
is the potential to give a fraction of a shared file service to applications A and B
and guarantee that. So what does that mean? Well, that means that potentially I
have to restrict the number of requests per unit time across these
channels to make sure that I don't oversubscribe that service.
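One simple way the request-per-unit-time restriction could be enforced is a token bucket on each client's channel into the shared service; this is a generic sketch of that idea, not Tessellation's actual mechanism, and the rates are invented.

    import time

    class ChannelLimiter:
        """Caps the rate of requests crossing one cell-to-cell channel."""
        def __init__(self, requests_per_sec, burst=10):
            self.rate = requests_per_sec
            self.burst = burst
            self.tokens = burst
            self.last = time.monotonic()

        def try_send(self):
            """Return True if a request may cross the channel right now."""
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False       # caller must queue or back off

    # Application A gets 70% and B gets 30% of a file service rated at 1000 req/s.
    channel_a = ChannelLimiter(700)
    channel_b = ChannelLimiter(300)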
So we're definitely interested in being able to guarantee to some application that
needs it a well defined piece of something. You know, another question which is
interesting is so you send a message but this cell happens to not be mapped at a
given time. Does it wake up right away? Okay. That would be sort of the
traditional event driven approach. And it's certainly something we support. But
something more interesting that we support is that, no, it actually wakes up when
the thing is scheduled in its time slice, okay? And so interrupts and events are
not the only way to send something. And as a result, you don't necessarily have
to disturb a parallel item that's running well. Clearly we support interrupts
because those are occasionally needed, but we're actually of the view that
they're needed a lot less often than people use them.
And then of course the communication defines the security model. So there's a
couple of those we're looking at. But it's really about who do you allow to
communicate with whom? Okay.
So you could say here is, I don't know, tessellation, right? So we've got a couple
of large compute-bound applications, a real-time app. We have some file storage
going on. Maybe we have a networking component that's doing intrusion
detection continuously. Maybe we have something doing GUI and interfacing
with the user in other ways. We might have some device drivers and so on.
You know, you could see how this scales -- I mean how this might go, right?
Here's another view for more of a -- so this was kind of a client version. Here
might be a server version where we have a bunch of chips. And we actually put
QoS guarantees on say access to memory bandwidth, access to inter-chip
communication, maybe access to the disk and so on. Question. Question?
>>: [inaudible] happens when one or more new applications come into the
workload?
>> John Kubiatowicz: So what happens when one or more new applications come into
the workload is you have to change the allocations. And I'm going to talk about
that in the remaining part of my talk. Okay? So clearly static situations are great
in the short term, but they're not so good in the long term. Okay.
So, in fact, let's talk about resources here for a moment. Good question. So
another look at two-level scheduling. So why do we want to do it? So the first
level is really about globally partitioning resources to meet the goals of the
system. Now, what are the goals of the system? Okay. They're defined by
policies that are both global and possibly local. And so there's, you know, you
could imagine arbitrary complexity here. But it's busy partitioning up the
resources of the system to various cells to try to meet some goals.
And of course, we want to make sure that the partitioning is constant for some
sufficiently long period of time that the local schedulers can do a good job. Okay.
Second level is application-specific scheduling, okay? Goals might be
performance, real-time behavior, responsiveness and so on. And this is sort of
running within the cell. There's another scheduler that's running at user level and
doing whatever it wants with resources to meet the goals. Okay. Let's see. I
think this is all I want to say. All right. You know, the second-level scheduler can
defer interrupts and so on locally because we've got full control. All right. Yes.
Sorry?
>>: So one of the problems with constant [inaudible] is the consumer client
oriented applications parallel as with well? [inaudible]. Things change on the
order of milliseconds [inaudible].
>> John Kubiatowicz: Sure. So the way I would view that is if a
user actually makes a mouse click and wants some major change to happen,
we'll be perfectly happy to stop something that was running well to handle the
user. I mean, this idea of trying to keep things constant for long enough time to
do well is only true if that's not the thing the user really wanted. So this is
potentially a tradeoff between responding rapidly, that's responsiveness, and
performing well. And we will go for the responsiveness case in the case of the
user.
>>: [inaudible] is with this kind of architecture can you respond quickly enough to
[inaudible].
>> John Kubiatowicz: So we think the answer is yes. We don't know for sure
yet. Basically a couple of ways of doing that. One is keeping excess resources
that you know you can get hold of when you need them. And so you
basically are giving out the excess resources to be used, but you can grab them right
away when you need to do something like that.
>>: I have a question is that you look [inaudible] and you talk a lot about virtual
memory and the impact of virtual memory on these clients. Can you say more
about that?
>> John Kubiatowicz: So virtual memory in the sense of seeing a memory space
that's bigger than the amount of DRAM you have? I mean, there's a lot of
different ways of using virtual memory.
>>: [inaudible].
>> John Kubiatowicz: Okay. Yeah, let's take that offline. But, you know,
basically you could imagine partitioning the physical memory and then giving it
access to do anything that it wants with its virtual memory, including paging in an
application specific way.
So just briefly I want to make sure that my last guy -- last colleague has a chance
to talk. Oh, question. Yes?
>>: About memory coherence.
>> John Kubiatowicz: Yes.
>>: Are you going to assume that within a cell all processors have a [inaudible].
>> John Kubiatowicz: So I think that we probably want to assume that a cell has
shared memory, cache coherent shared memory within it. But we are certainly
looking at architectures for which that wouldn't be true. So, you know, our view is
kind of you use whatever resources you've got in a cell, and if you don't have
shared memory you use message passing or something. But that's a parallel
app that's running in a container.
Now, you may ask how can I build a container that -- or an application that can
be handled both with shared memory and message passing and it picks one. I'm
going to avoid that question for now. That's a potentially hard question. Yeah?
>>: Just one question for the [inaudible] resources. How is that determined
[inaudible] abstract semantics and those are translated [inaudible] resources or is
it more specific [inaudible] how much resources it wants?
>> John Kubiatowicz: It's either. So that's a great question. Does the app give
something abstractly in terms of a frame rate or does it say it wants so many
processors? We're actually supporting both. There's what I would call an
impedance mismatch between what the programmer understands, which is, say,
frame rate and the resources, okay? And I think that a lot of systems don't even
try to address that. And we're experimenting with ways of figuring some of that
out automatically. But there is that impedance mismatch. But we actually think
that rather than accepting it as being a problem, we want to actually address it.
So basically the state of the system is specified in what we call a space-time
resource graph. This is just a chunk of it. But basically cells have what we
call space-time resources, which might be four processors for 50 percent of the
time, et cetera. And then potentially they can be grouped, so you can actually put
resources up higher here, which really means that resources can be allowed to
move from cell to cell within the group. Okay? And the guarantees are
made at the cell level here.
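A sketch of what a space-time resource graph might hold; the structure and names are invented for illustration, not the actual Tessellation data structures.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SpaceTimeResource:
        kind: str              # "harts", "mem_bandwidth", "cache_ways", ...
        amount: float          # how many units
        time_fraction: float   # fraction of each scheduling period they are held

    @dataclass
    class CellNode:
        name: str
        guarantees: List[SpaceTimeResource]    # guarantees are made per cell

    @dataclass
    class CellGroup:
        members: List[CellNode]
        # Resources placed on the group may migrate between member cells.
        shared: List[SpaceTimeResource] = field(default_factory=list)

    video = CellNode("video-decoder", [SpaceTimeResource("harts", 4, 0.5)])
    audio = CellNode("audio", [SpaceTimeResource("harts", 1, 1.0)])
    media = CellGroup([video, audio], shared=[SpaceTimeResource("harts", 2, 1.0)])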
So how do we build this thing? So we actually have a partition policy service or
layer. I actually see an inconsistency. I apologize. That is busy doing the
allocation. And I'll show you the structure of that. It produces space-time
resource graphs which then get implemented underneath by a mapping layer that
takes the graph and decides how to map that to the hardware and into a set of
slices and an underlying partition layer that basically is the NanoVisor that
provides the hard boundaries underneath.
And you could also say that this mapping layer makes no actual decisions. It's
constrained by what it's been told to do. It's really doing a form of bin packing on
this, but it's a form of bin packing that's already been verified to be workable
before it's been handed a graph. Yes?
>>: [inaudible] obviously not going to maintain [inaudible] how does that impact
the scheduling --
>> John Kubiatowicz: Well, so you would -- you would -- my view has been for
many reasons, not just the NUMA issue, but other hardware issues that cells are
going to be -- consist of co-located processors, not one that's on one side of the
machine and one that's on the other. So for any given machine, as your cell gets
bigger, it will have some well-defined NUMA properties to it. But it won't
be, you know, the thing split in half and on opposite sides of the machine,
right? My suspicion is that the cells will always be co-located to whatever extent
is possible.
>>: I guess the point is that the NUMA domains form sort of very strong
boundaries meaning that you really can't put data in multiple NUMAs unless the
pattern in the fuel applications [inaudible] that.
>> John Kubiatowicz: Sure.
>>: So you're trying to do the [inaudible] mapping problem that's a lot harder
because of the NUMA non-uniformity.
>> John Kubiatowicz: Right. So okay, so now I understand your question. Let
me give you a -- here's how I would answer that question. I would say that if
you've got a machine of some size and you want to chop it into pieces, the
question is can you build apps that run well. Okay.
Now, the flip answer is it's not my problem. The non-flip answer is the following.
We're actually looking at interfaces to provide topology information to a layer
that's scheduling and deciding how to do that. You could say that if your app can't
handle NUMA very well that it needs maybe to be rewritten or you need new
patterns to look at. You know, the OS is providing the services basically of this
machine boundary that's kind of a clean boundary and being able to program a
NUMA machine well is actually, I would consider, part of a higher layer than the
operating system. That's a -- maybe an interesting debate that we could have
tomorrow, which I encourage.
Okay. So I should finish up here. But let me just show you here -- in fact, I'm not
going to walk through everything. But here's an example of our actual
architecture I wanted to show you some layers. So we've got the partitionable
resources down here. We've got the mechanism layer or the NanoVisor is busy
doing the -- implementing the partition, implementing channels, maybe doing
QoS enforcement.
The partition mapping and multiplexing layer is basically, it's still part of the
trusted code base that takes space-time resource graphs in and implements
them. So there's a validator to make sure that you don't try to do something that
violates your security in some way like shutting off a key operating system cell.
And then it plans the resources and then somebody multiplexes it.
And notice that there's kind of two key ideas here. Admission control. So we
actually reserve the right to reject requests. Please start this cell. No. Okay.
Now, that always throws people for a loop, right? But if you don't reject requests,
there's no way to make guarantees. Okay. I'm going to put that -- I'm going to
say that that way. What do you do when a request is rejected? Well, that's
interesting. Maybe you ask the user to change their preferences or something,
or maybe there's an automatic mechanism.
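The admission-control idea in its simplest possible form, as a sketch with invented names and numbers: a new cell is admitted only if its guarantees still fit next to everything already admitted, and rejecting the rest is exactly what makes the guarantees meaningful.

    def try_admit(new_cell, admitted, capacity):
        """new_cell and admitted hold per-resource demands; capacity is the machine."""
        for resource, demand in new_cell.items():
            in_use = sum(cell.get(resource, 0.0) for cell in admitted)
            if in_use + demand > capacity.get(resource, 0.0):
                return False      # refusing requests is what makes guarantees possible
        admitted.append(new_cell)
        return True

    capacity = {"harts": 64, "mem_bw": 1.0}
    admitted = [{"harts": 40, "mem_bw": 0.5}]
    print(try_admit({"harts": 16, "mem_bw": 0.3}, admitted, capacity))  # True
    print(try_admit({"harts": 16, "mem_bw": 0.4}, admitted, capacity))  # False: would oversubscribe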
Now, what I've shown here, let me just talk about this adaptive loop is in principle
to meet this impedance mismatch that I talked about earlier, we're expressing
frame rates, but we've got cores, what happens is we're measuring performance,
we're building potentially models of how that performance is going, we're
adapting resources, changing our graphs. You can see the loop here. And in
principle, admission control when it can't make a simple change might ask for a
major change. And as long as it meets the policies, maybe that major change
will be admitted.
Okay. Now, ask me the explicit details about this. I'll tell you we're still working
on it, for obvious reasons. But this is the philosophy. Okay?
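The shape of that adaptive loop, reduced to a sketch: the application states a frame-rate goal, the policy layer measures what it actually gets, and a deliberately crude model nudges the allocation up or down. The thresholds and the pretend measurements are invented; a real system would use much better models.

    def adapt(harts, measured_fps, target_fps, max_harts=64, step=1):
        if measured_fps < 0.95 * target_fps and harts < max_harts:
            return harts + step    # under-performing: ask for more resources
        if measured_fps > 1.10 * target_fps and harts > 1:
            return harts - step    # over-provisioned: give some back
        return harts               # close enough: leave the partition alone

    harts = 4
    for measured in [18.0, 22.0, 27.0, 31.0]:   # pretend frame-rate measurements
        harts = adapt(harts, measured, target_fps=30.0)
    print(harts)   # the cell has been grown to 7 harts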
We actually have several different modeling things that we're looking at for
building this. And we'll talk about this tomorrow. But you can imagine I know
Burton Smith really likes this notion of a convex optimization problem where once
you've got a definition for how things behave then you're trying to optimize for
something. Yeah, question.
>>: [inaudible] the previous slide that you might reconfigure the system where a
longer running application might see its resources, you know, change like a --
you know, maybe I was given 16 cores to work with and then suddenly I have like
four or something like that?
>> John Kubiatowicz: Yeah. So there's explicit interfaces for resource changes.
And so if you say that I can deal with resource changes and be a good citizen
then you'll be told about that. Okay. I'm getting the -- let's get ready to finish up
here. But, yeah, so there's an interface for that.
Scheduling inside a cell is just user-level scheduling. There's lots of interesting
things there. There's also questions of how to divide applications into cells. You
can see the obvious questions: if the granularity of the cells is too small, then the
policy layer has got too much complexity and can't really do a good job. And so
that's interesting.
And then finally, you could imagine things we might want from the hardware. So
like, for instance, obviously you want it to compute well in parallel, but partitioning
support, QoS enforcement mechanisms, fast messaging support, these are all things
we've been looking at. And by the way, Dave mentioned RAMP earlier, this
great emulator that allows us to add these mechanisms in and take a look.
And so, for instance, we've actually got an emulation of a memory bandwidth
partitioner mechanism that we've looked at. And you can show that we get good
performance isolation having that mechanism. And it's not too expensive. Okay.
So I will conclude. I -- and people -- I'll be around for a couple of days. So plenty
of questions, I'm sure. But I talked about space-time partitioning and cells as
basically a new mechanism in which to construct things.
This partitioning service is kind of an interesting part of this process. And we're
actually building an OS right now that's got several of the NanoVisor pieces to it
and runs code. We had a demo, and it's probably about to go through
restructuring number 5,496 but, you know, we're working on it. So all right. I'll
stop now. Sorry.
[applause].
>> Ras Bodik: Okay. I'm Ras Bodik. I'll talk about the Web browser part of the
project, which is really about how the software stack would look on top of the
operating system, for the case when we are talking about client computation.
So it is true that it was the browser that made smart phones popular, because
mostly people started buying them after the browser became usable on them. But
in a sense, in the same vein, they failed. They are successful browsers on
those platforms, but they are not what they have been on laptops and desktops.
On those bigger platforms the browser has become the de facto application platform
of choice, and many, if not most, new applications are developed in the browser.
This is not the case on the mobile phone. And that's partly because of the
performance. If you look at the New York Times front page, it may take 15 seconds
to load on your iPhone. And the reason is not that the network is slow; this was
done actually on a fast network. The reason is that your browser is essentially
a kind of compiler running on your phone, and it's very computationally
intensive.
And so the reason is that there is a lot of abstraction tax that you pay when you
use the browser. That's the tax you pay for productivity, the fact that you have a
really powerful layout engine, an extensible, dynamically typed scripting language
which you can embed DSLs into very easily. And that all shows up.
And we did an experiment with a simple application written on top of the Google
Maps API; we wrote the same thing in C and it was 100 times faster. It was probably
100 times harder to write also, and we'll get back to it in the talk. But that's the
reason why people don't use browsers to write applications; mostly they use
things like Android, Silverlight and the iPhone, as the case may be, and their siblings,
which are a little bit more flexible and powerful, but lower level and more efficient.
So the browser is CPU bound. And what's inside? There is essentially a parser
that takes whatever the input is, HTML, CSS, and gives you the DOM, which is the
abstract syntax tree for the document. And there is a selector engine which takes
the CSS styling rules and effectively maps them onto the DOM. I'll talk about it
some more. Then there is the layout engine, which positions those elements on
the screen. And then you render, which is you just move the bits, blend them onto
the graphics on the screen. And then there is JavaScript, which provides the
application logic and interactivity, and that may actually redo everything.
And which of them is expensive? Turns out that all of them are expensive. So
speaking of Amdahl's law, you cannot really sidestep any of them, and all of
them need to be optimized.
And so we look in the project at the top four levels. We have a story about the
language, which is perhaps the most important story, but it's not based on
JavaScript and I'll try to justify why later in the talk.
So these are the three main driving forces, I'd say. We care about low-power
devices, phones and, in fact, smaller devices, predicting that the phone is the
next computing platform that we'll use, but it's not the last one.
The next thing will be -- as computers move from the stuff that you have on your
desk to your lap to your hand, they'll end up on your ear somewhere. And not so
far away. We look at client applications. They'll be interactive, they'll have
sensors, they'll have augmented reality, which takes all the sensors together
and helps you live your life.
And we look at the future of productivity languages. One reason why JavaScript
became popular was that we had a lot of spare power in the '90s. And we didn't
have the application demands. There was sort of a surplus of compute
power in the '90s as scripting became popular. Now you see the opposite
pressure. People want to write applications with scripting languages. But the
compute power is gone or the improvements are gone.
So what the future of scripting languages looks like is a major question here.
So this was the original motivation for the browser project; we realized it's not only
a phone, once we observed that one could put a little laser projector into the
phone and turn it into a tablet computer in a bar. And that was just a vision. This
is a mock-up picture we did by actually holding a real projector above the desk.
But it turns out that Microvision released this laser projector just about a month
ago. You can buy it for 550 bucks. And I don't own it yet, but apparently it's very
impressive.
You cannot fit it into the phone yet because the device is about as big as
the iPhone, but the projector element itself is like the tip of this pen. So I think
you will actually see it soon. Please?
>>: [inaudible] battery life.
>> Ras Bodik: That mostly includes battery life, yes. But of course there is heat
dissipation connected with it. But I think battery's a good [inaudible].
So here is the browser stack. There is the parallel lexer and parser here of
course; as before, there is the CSS matcher, which I'll talk about; the parallel
layout engine; and the rendering, where we don't have many results yet because
you really need to rewrite OpenGL to be parallel. But at least we are
investigating how parallelism could help over there.
Now, the whole thing is not written from scratch -- although currently it mostly is. It
will be generated in the spirit of a parser generator. You'll have a generator of
parsers and you'll be able to create multiple variants and auto-tune across the
space of the generated things.
The same for the layout. We have formally written the specification of most of
CSS with attribute grammars, with the goal of actually generating that engine
automatically, with various optimizations that will be of a serial nature, such as
incrementalization, and of a parallel nature, such as task parallelism.
>>: [inaudible] attribute grammar?
>> Ras Bodik: It seems that it does.
>>: Wow. Which attribute grammar [inaudible] did you get to choose? Or we
can go into that. Sorry.
>> Ras Bodik: Okay. I will [inaudible] the key differences among them that you
have in mind?
>>: I don't know. I thought there were like parsing, there's [inaudible].
>> Ras Bodik: Okay. I guess we'll have to take it offline.
>>: Okay.
>> Ras Bodik: Okay. So -- and but it doesn't end here. So our scripting story is
a constraint based language that combines constraints with events, so it really
gives you the ability to put together the layout of the page, the semantics of the
layout, and essentially the activity and there will be a synthesizer whose output is
going to be an attribute grammar and then again you can go through this engine
that generates parallel and incremental evaluation.
So let's start with parallel lexing and parsing. So lexing is a simple task. You
have essentially a string of characters here, and you need to break it into tokens.
And the tokens are described here with regular expressions, which really
correspond to an automaton. So you need to run this input through this
automaton and determine at each step which state of the automaton you are in.
It's as simple as that.
The problem is that of course this process is naturally serial. It's embarrassingly
serial you could say. You cannot make the next step before you know the state
of the previous step. And yet, the way we would like to parallelize it is to actually
break the input this way, so that you do not need multiple files in order to
obtain parallelism; you can actually parallelize one file, one stream. So for
stream processing this seems to make sense.
So here is the observation. In at least lexical analysis, maybe less so in regular
expression pattern matching, if you start your lexical analysis in any state, after
some time you end up in the same state. So we pick an arbitrary state, not
necessarily a correct one, but after three steps in this case, because
there is one token here, we all end up in the same state. So there is sort of a
notion of a small warmup prefix that is sufficient to get you, pretty much no matter
what state you are in into a correct state, even without having seen what was
before in the input.
And this observation lends itself to this algorithm: you take your input and
split it into chunks with a certain K-character overlap. And what do you do there?
Well, you just predict that you are in some state, which is this. And you run each
of these parallel processes independently, starting from this guessed state, which
usually gives a good warmup, and you realize that yes, indeed, we guessed the state
correctly after the warmup; even though we started from a wrong state, the state
was again correct, and so you check that the speculative matches were right. It's a
speculative algorithm, and you get speedup because in this domain things seem
to be predictable.
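Here is a minimal sketch of that speculative chunked lexing, with a toy DFA and invented token classes rather than the group's real lexer: each chunk's entry state is guessed in parallel by running a short warm-up over the preceding K characters, and the check at the end confirms the guesses converged to what a serial scan would have produced.

    from concurrent.futures import ThreadPoolExecutor

    START, IDENT, NUMBER = range(3)

    def step(state, ch):
        # Toy lexer DFA: identifiers, numbers; anything else resets the state.
        if ch.isalpha() or ch == "_":
            return IDENT
        if ch.isdigit():
            return IDENT if state == IDENT else NUMBER
        return START

    def run(text, state=START):
        for ch in text:
            state = step(state, ch)
        return state

    def speculative_states(text, chunk_size=64, warmup=8):
        """Entry state of each chunk, guessed in parallel after a short warm-up."""
        starts = list(range(0, len(text), chunk_size))
        def guess(start):
            if start == 0:
                return START
            # Start from an arbitrary state and let the overlap converge it.
            return run(text[start - warmup:start], START)
        with ThreadPoolExecutor() as pool:
            return list(pool.map(guess, starts))

    def sequential_states(text, chunk_size=64):
        """Ground truth: entry states from a serial left-to-right scan."""
        states, state = [], START
        for i, ch in enumerate(text):
            if i % chunk_size == 0:
                states.append(state)
            state = step(state, ch)
        return states

    source = "int count42 = 0; while (count42 < 10) { count42 = count42 + 1; }" * 20
    assert speculative_states(source) == sequential_states(source)
    print("speculative chunk entry states matched the serial scan")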
So here is the speedup on the IBM cell processor. For large files the speedup is
nearly perfect. For smaller files, around 250 kilobytes, scalability is not that
great yet. We need to work with the OS and our hardware to tune it a little bit
better. But it's still not bad. On five cores you are still nearly five times faster
than flex, which is a well-tuned serial lexical analyzer.
Parsing. Parsing is the step that comes after you obtain a stream of tokens, not
characters. So this is the result of lexical analysis. And your program is
described with a context-free grammar, which might say that the program is, in
this case, just one function with one argument variable and a list of statements, a
statement can be an assignment of an expression to an ID, and
an expression is ID plus E and so on.
[inaudible] you know what I'm talking about. And what the parser does is, it goes
through the input, usually left to right, and identifies things like oh, I have an expression here,
here, and here. And I also have an expression here. X greater than Y is also an
expression. I have a statement here. And therefore I have a statement here.
And therefore the whole thing is a program.
And this all happens left to right. So the context that you see on the left is
important. Now, if you want to do it in parallel, you give one processor only the
left part of the input so that another one can work here. And the way we do it is
that there will be sort of a main parser going through the whole input, the one
that does non-speculative work, and on those chunks there will be a speculative
parser that will try to preparse the input, essentially allowing memoization, so
the main parser will just skip over it when it gets to that part.
So that preparser will have to guess at the context of the main parser, which of
course it doesn't have because it's still working on the left. So it will guess
correctly that, well, these are expressions and that this is an expression as well.
And now it will have a little bit of a conundrum, because it will come here and say,
well, this looks like an expression to me, and indeed it is, because you can derive
it from E. It is an expression. Of course in this context it is not. These
parentheses are not part of the expression; they are part of the
statement's parentheses.
But nothing went wrong. You just memoized here that this is an expression, and
when the main parser comes to it, it is not going to look for an expression, so
this work is, you know, redundant, a little bit of extra work we've done. And it will
come here, it will skip over this because we have already parsed it as an
expression. And it will also, by the way, skip over that part because that has
already been preparsed as a statement. And now it can put these two pieces
together and realize the whole thing is a statement across these boundaries, the
whole thing is a program.
So it is speculation: based on the context that you've gleaned from looking at
the tokens in the input, you try to predict the state of the parser, and similarly that
this is an E. And sometimes you do more work, but you never go wrong.
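A compact sketch of the same never-wrong speculation for parsing, using packrat-style memoization over a toy expression grammar (my construction, not the group's parser): worker threads pre-parse later chunks of the token stream and fill a shared memo table, and the main left-to-right parser simply reuses entries that are already recorded; guesses made with the wrong context just become unused entries.

    from concurrent.futures import ThreadPoolExecutor
    import re

    memo = {}   # (rule name, position) -> end position, or None if no match

    def memoized(rule):
        def wrapper(tokens, pos):
            key = (rule.__name__, pos)
            if key not in memo:
                memo[key] = rule(tokens, pos)
            return memo[key]
        return wrapper

    @memoized
    def factor(tokens, pos):        # factor := NUMBER | '(' expr ')'
        if pos < len(tokens) and tokens[pos].isdigit():
            return pos + 1
        if pos < len(tokens) and tokens[pos] == '(':
            end = expr(tokens, pos + 1)
            if end is not None and end < len(tokens) and tokens[end] == ')':
                return end + 1
        return None

    @memoized
    def term(tokens, pos):          # term := factor ('*' factor)*
        end = factor(tokens, pos)
        while end is not None and end < len(tokens) and tokens[end] == '*':
            nxt = factor(tokens, end + 1)
            if nxt is None:
                break
            end = nxt
        return end

    @memoized
    def expr(tokens, pos):          # expr := term ('+' term)*
        end = term(tokens, pos)
        while end is not None and end < len(tokens) and tokens[end] == '+':
            nxt = term(tokens, end + 1)
            if nxt is None:
                break
            end = nxt
        return end

    def preparse(tokens, start, stop):
        # Speculative worker: guess that an expression starts at each position
        # in its chunk; wrong guesses are just unused memo entries.
        for pos in range(start, stop):
            expr(tokens, pos)

    def parallel_parse(text, chunk=8):
        tokens = re.findall(r"\d+|[()+*]", text)
        with ThreadPoolExecutor() as pool:
            for start in range(chunk, len(tokens), chunk):
                pool.submit(preparse, tokens, start, min(start + chunk, len(tokens)))
        # Main parser: ordinary left-to-right parse that reuses pre-filled entries.
        return expr(tokens, 0) == len(tokens)

    print(parallel_parse("1 + 2 * (3 + 4) + 5 * 6 + (7 + 8) * 9"))   # True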
And here is some fresh data on predictability. So here are various pairs of
identifiers that you can see in the input. And what you see here is that if for
each pair you make two guesses as to the state of the parser, you are doing
pretty well. For this one you get about 50 percent probability that you guess the
state correctly. And the data are actually better than what you see here. All
this bad-looking data, with a little bit of static analysis of the grammar, which we
haven't done, would probably go like this as well. I'd rather not go into it right
now.
So this is the parser. Now let's go quickly to the CSS selector matcher. What
happens in this step? You have your document which has a root and then it has
a paragraph and another one here has a text and an image. And here is another
paragraph with a bunch of words in it. This one is bold.
And the goal is to take a styling rule such as this one, which says an image that is
part of a paragraph needs to have the following font size, and you have a bunch
of them. This is essentially the rule, and this is the styling prescription. You
need to, for each node, find the corresponding rules. So in this case, these are
these two. And we have about a thousand nodes and a thousand rules in a typical
document. So there is quite a bit of work. In fact, this seems to be the most
expensive component in browsers. And well-tuned websites like Gmail don't
even use this functionality because it's so slow.
And in a sense you could say they sacrifice engineering goodness because this
matching is very slow. So let's see what we did. Because this is a huge
data-parallel computation, the parallelism is obvious. You just need to know how
to slice it. And the most significant gains actually didn't come from parallelism
but from locality optimizations such as tiling, making sure that we have in the
cache only the important data at the time, and then you go through the next one
and the next one.
So these rules are split in intelligent ways so that only a subset of these rules is
in memory at a time. And if you look at the speedup, you start with an
implementation that is similar to what is in WebKit, the WebKit browser.
And after L2 optimizations it goes to probably a factor of 3. After some
L1 optimizations it goes to a factor of about 25.
And then you add to it the speedup from parallelism, which is about a factor of 3
or 4 on 4 cores. It is quite flat after that, but hopefully more work will help there.
But together you are about 60X faster than the original, which probably makes
this a non-bottleneck in the browser.
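A small sketch of the selector-matching step itself (a generic illustration, not the tuned algorithm being described): for every node, find the rules whose selectors match it, with descendant selectors like "p img" checked by walking the ancestor chain. Grouping rules by their rightmost tag is a crude stand-in for the rule-splitting just mentioned, and the per-node work is independent, which is where the data parallelism comes from.

    from collections import defaultdict
    from concurrent.futures import ThreadPoolExecutor

    class Node:
        def __init__(self, tag, parent=None):
            self.tag, self.parent = tag, parent

    def matches(node, selector):
        """selector is a descendant chain such as 'p img' (rightmost part = the node)."""
        parts = selector.split()
        if node.tag != parts[-1]:
            return False
        ancestor = node.parent
        for tag in reversed(parts[:-1]):
            while ancestor is not None and ancestor.tag != tag:
                ancestor = ancestor.parent
            if ancestor is None:
                return False
            ancestor = ancestor.parent
        return True

    def match_all(nodes, rules):
        by_rightmost = defaultdict(list)          # group rules by rightmost tag
        for selector, style in rules:
            by_rightmost[selector.split()[-1]].append((selector, style))
        def rules_for(node):
            return [style for selector, style in by_rightmost[node.tag]
                    if matches(node, selector)]
        with ThreadPoolExecutor() as pool:        # every node is independent work
            return list(pool.map(rules_for, nodes))

    root = Node("html"); body = Node("body", root); p = Node("p", body)
    img = Node("img", p)
    rules = [("p img", {"font-size": "10pt"}), ("img", {"border": "0"}),
             ("div img", {"border": "1"})]
    print(match_all([img, p], rules))
    # [[{'font-size': '10pt'}, {'border': '0'}], []]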
So let's talk about CSS. Rather than telling you right up front what happens in
this parallel layout engine, let me tell you why formalizing CSS may make sense.
So here is a piece of CSS, a few nested boxes with floats. And here is how they
render on three major browsers. And on each of them, they are different. And
we still don't quite understand exactly where there is ambiguity in this part of the
specification. Actually Leo probably understands it.
But I'll show you a simpler story which tells you why having a formally specified
spec may help you find holes before you actually release the spec. So here are
three nested boxes. You give the width of the inner one. This one needs to be
half the width of the parent, and this one needs to be as small as the child. And
you probably immediately see the problem, that you have a dependence that this
one depends on this and this one depends on that. So you have a cyclic
dependence.
Of course you can solve this layout problem, perhaps with fixed-point
iteration. But you know what the output looks like, right? It will be that
these two outer boxes have zero width, because that's the only condition under
which you can meet both of these constraints. And so this is essentially what the
spec silently says these rules mean.
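A quick check of that cyclic example (my own reading of the two constraints, not a browser implementation): iterate the constraints to a fixed point and the two outer widths indeed collapse to zero.

    inner = 100.0                    # the one width that was actually specified
    outer, middle = 1000.0, 1000.0   # arbitrary starting guess
    for _ in range(60):
        middle = 0.5 * outer         # "half the width of the parent"
        outer = middle               # "as small as (shrink to fit) the child"
    print(round(outer, 6), round(middle, 6))   # both converge to 0.0
    print(inner > outer)             # the inner box sticks out of its zero-width parents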
When an engineer implemented those constraints, the browser engineer looked at it
and probably said, well, either I don't like the cyclic dependencies, because I want
to have a bounded number of passes through the tree, or he didn't like the output
of the specification. So he said, well, I need to break some of these constraints,
because you cannot satisfy all of them and still get something good looking and
with performance. So he either breaks the outer constraint, so now this outer
box does not shrink to fit its child, or he breaks the inner one, and now this one is
not half of its parent.
So which one would you pick if you had to break one constraint? Of course both
of them look equally good. But they decided for this one; not all three decided,
probably one of them decided and then the others just copied the semantics,
because this is how you interpret the CSS spec: you implement the
browser, you see what the other browsers do, and then you do the same.
So no wonder these ad hoc decisions about which constraints are broken, and just
the fact that you implicitly drop these constraints, lead to surprises in CSS. In
this case, at least they made a decision that this box doesn't stick out as it does
here. But as this remark by a fan of CSS demonstrates, no, you are not always --
[laughter] -- you are not always so lucky.
So here is why having a spec would help: if you write this as an attribute
grammar, you would immediately see, without seeing any particular document but
through static analysis, that you have this cyclic dependence and something is
wrong and you need to resolve it, rather than leaving it to ad hoc dropping of
constraints, which is of course surprising.
So there are these benefits, and Leo did find some surprising holes in the spec,
especially when it comes to tables. You can also then write a parallel layout
engine: you can look at how layout is computed and identify parallelism in it.
And so here is how a layout happens. Again you have a tree which is pictured
here. You have one paragraph and here we have another one. But there is this
image here which is a float, meaning the text needs to flow around it. And so
what we have: you have a body, you have two paragraphs, you have the word
hello. This guy here is a float, which means it can float to the next
paragraph and text goes around it, and here is the rest of the paragraph.
And CSS layout, like LaTeX layout, is a so-called flow layout. You lay
things one after another, essentially the way you lay bricks. Often the
specification says where you need to put the next thing after the previous one
has been laid out and you know what part of the screen is free. That's especially
the case in LaTeX; the description does look like that. And so the layout really
looks like, okay, what is the layout? We need to compute the sizes of each element
and their coordinates, X and Y. So you start out by saying this is the base of the
whole thing, this is the font size, and these last two, X and Y, are sort of where the
cursor currently is. And then you go in order through the graph, computing things
the way you would if you laid bricks one after another.
And of course there isn't much parallelism here because there are these
dependencies, you know, after all, the position of this paragraph depends on this
paragraph and in particular where these words can go depends on how much
space this picture takes up. But if you look at the attribute grammar, fine grained
dependencies show up. And now the thing is all of a sudden more parallel. You
can now see that oh, I can compute font sizes of this without touching this part of
the subtree. And five phases all of a sudden appear.
And in the first phase you compute font size and a temporary width. Then you
go bottom up. These phases run sequentially, one after another, but they are
parallel within themselves. And then you do another phase and you go up. And in
the fifth phase you are ready to compute the absolute positions.
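A sketch of that phased evaluation on a drastically simplified, invented box model: the phases run one after another, but within each phase the root's subtrees are independent and can be handled in parallel, which is exactly the structure the attribute grammar exposes.

    from concurrent.futures import ThreadPoolExecutor

    class Box:
        def __init__(self, scale=1.0, children=()):
            self.scale, self.children = scale, list(children)
            self.font_size = None
            self.width = None

    def phase1_font(box, inherited):
        # Top-down phase: an inherited attribute, independent per subtree.
        box.font_size = inherited * box.scale
        for child in box.children:
            phase1_font(child, box.font_size)

    def phase2_width(box):
        # Bottom-up phase: a synthesized attribute, also independent per subtree.
        box.width = max(sum(phase2_width(c) for c in box.children),
                        box.font_size * 2)
        return box.width

    def layout(root, base_font=16.0):
        with ThreadPoolExecutor() as pool:
            root.font_size = base_font * root.scale
            list(pool.map(lambda c: phase1_font(c, root.font_size), root.children))
            child_widths = list(pool.map(phase2_width, root.children))
            root.width = max(sum(child_widths), root.font_size * 2)
            # ...a real engine has more phases (heights, absolute positions),
            # each sequential with respect to the others but parallel inside.

    page = Box(children=[Box(0.9, [Box(), Box(1.2)]), Box(1.1, [Box()])])
    layout(page)
    print(page.width, [c.font_size for c in page.children])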
And here are some preliminary speedups. This is from a not quite faithful
implementation, but on four cores it is about three. This is going to hopefully be
more complete soon.
So what you get is: once you have a formal spec, you can then automatically, with an
attribute grammar engine, actually find that parallelism and generate the engine
with other optimizations, and auto-tune over variants. What smarts we'll have to put
in there, we don't quite understand yet. Some part of the parallelism was
discovered by Leo tweaking the grammar by hand. So those will need to be
generalized into automatic parallelizations.
So let's now look at the motivation for this language. So why constraints? So I'll
try to give you -- I think I have two or three reasons for why the scripting should
be based on constraints.
The first one is -- well, the first one actually motivates why the language should
have constraints and events together, why it should have layout and scripting, in
other words, in the same box. So here you see how we typically program in
today's browser. If you have a tile which may be a piece of a map, an image in
the browser, if you want to position it somewhere or set its height you will in
JavaScript write something like tile.height equals: you take an X, which is an integer
variable, and you append PX, a string, to it, and this is then passed to the
browser proper, which will discover how high the tile should be and will lay
it out.
Now, what happens actually in the execution is that you take 15, an integer
value, convert it into a string, concatenate it with PX. Then you pass this 15 PX
string down to the browser across the JavaScript C++ boundary and there you of
course dutifully parse it, break it into 15 and a flag and now you know this was
15.
So this is one of the reasons why browsers are slow: the context switch from
here to here is expensive, and therefore some research groups, such as the C3
one here and Adobe, pull the layout essentially into the same language so
that you don't have to cross this boundary. Still, this problem of optimizing this
away remains, because it's not something you can partially evaluate, at least not
with the standard techniques.
So this is the motivation for having one language where the layout part and the
scripting part are together and can be optimized. And there is no expensive
context switch.
I told you about the application we wrote to understand how much abstraction
tax there is in the browser, the one that was about 100X faster. It was quite hard
to write. One key difficulty was that you had to write macros or whatever for
converting between four coordinate systems. Because you had a notion of
coordinates out there, sort of longitude and latitude. Then there are coordinates of
pixels. Then there were tiles within the map, because the map was built out of
tiles. And there was one more which I forget.
But I don't understand the code that Krste wrote because this was not easy to get
right.
It would be much nicer if these conversions were not written in a functional style,
directionally; if they didn't say how to convert, only what relationships hold
between these coordinate systems. So I would much rather write something like
this. I would like to say there exists a relation, from some family of linear relations,
between -- what is the relation? It's between the coordinates on the map and
coordinates on the screen, and I'd like to say, well, if this X on the map is in
a relation with the screen, then X plus one kilometer must be screen.X
plus 100 pixels. And hopefully that establishes sufficiently what the relationship is.
And from this description we will synthesize it: we will first come up with what this
relation is exactly, so that it is unambiguous, and then synthesize it as a function.
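A toy illustration of that idea (not the synthesizer itself): state the linear relationship between map and screen x-coordinates once, anchored at one assumed known correspondence, and both conversion directions fall out of it.

    KM_TO_PX = 100.0 / 1.0           # the stated relation: 1 km corresponds to 100 px
    anchor_map_km, anchor_screen_px = 12.0, 640.0   # one known pair (assumed)

    def map_to_screen(x_km):
        return anchor_screen_px + (x_km - anchor_map_km) * KM_TO_PX

    def screen_to_map(x_px):
        # The inverse comes from the same relation; nothing new is specified.
        return anchor_map_km + (x_px - anchor_screen_px) / KM_TO_PX

    assert abs(screen_to_map(map_to_screen(14.5)) - 14.5) < 1e-9
    print(map_to_screen(14.5))       # 890.0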
So that's where we are going with this. Motivation number 3 is if you look at
some Web pages in terms of visual design, human perception, some of them are
easier to navigate than others. This one is reasonably busy in terms of text, but
yet it is easy to navigate. So why is it so?
The reason is that it is like this document, which I would call beautiful. It's an
example of great design where you can navigate your eyes and really know
where you are going and where the captions are for particular images. And the
theory is that this is because it is an instance of a grid design.
You first prepare a grid for your document and then you put images, text into it.
And so clearly you know, you have the images here, a bigger image covers not
some arbitrary fraction of the page, as would happen in a typical Web page, but
covers these four. And then the text comes here.
And there is a whole theory and informal guidelines specified for how to do it. But
there is no language that would make it easy for you to follow this design, or better,
that would make it difficult not to follow that design. In fact, we would like to give
designers a way to easily build such documents and make it hard to build others.
So really I think the long-term goal is to give designers a language for building
layout systems such as CSS.
Finally, there are events. So JavaScript events are sort of gotos. If you have a box
on the screen that you want to move, you write a piece of text which has two nested
handlers, one and another one. These are sort of first-class functions that
essentially are interrupt handlers. It's not quite easy to see what is going on.
It's easier to see when you write it as a dataflow program, where you say I have a
source, it's the mouse, and each time you move the mouse it generates a pair of
coordinates that go here, and you delay each by 500 milliseconds. And then they
set the coordinates of a box on the screen. And now that we see what the program
is doing, it's easier to analyze, because you know what the flow of control is,
because it's the same as the flow of data here.
Dataflow is an improvement, and you start seeing it in scripting languages in
terms of data binding and such where you can refer to variables from your HTML
document. It's not great when you have a more complex document because it
happens that you need to send messages this way and also that way, and these
cycles may lead to oscillation and programs that have bugs.
So what do we offer? We offer a system that does not have directional flow like
dataflow, but that is bidirectional, or relational, where there is no particular flow. So
let me just show you two fragments from a case study, a sort of driving application.
So imagine you have a -- you have a video player with the usual play button and
the name of the movie, and the movie, and this is important here. This is your timeline
that you can grab and scroll, right? You could grab it here and scroll it. Here is
an annotation window. So with this movie comes a set of captions. Each of
them has a start time and end time. And they are displayed here in a scroll box.
And as the movie goes on, the annotations march along.
But when you grab this and move it, both the movie and these annotations need
to be rewound. When you scroll these annotations and you
click on, say, this one, the time needs to advance to this annotation, the
movie needs to go there, and the annotation needs to be centered over there in
the window.
So that roughly corresponds, we believe, to what interactive applications will
be in the future. So here is how we envision we will write it. What do we have?
We have boxes with ports, and these lines here are just constraints. At this level,
all these constraints are equality constraints. So this variable here and this
variable here must always be equal. In particular, look at the time. There
is a video player which has a time port, and the annotation display has a time port.
And the slider which shows the time has a time port, and all we are saying is
that these three are tied together. And any of them can initiate a change in time,
and the others need to adapt, but I'm not saying in which direction the messages
flow. That will come up automatically from the compiler.
Similarly we have a toggle button which tells you whether you are playing or not
playing. And both this part and that part can initiate the change. I can press on
the button and start playing. When the movie finishes, on the other hand, the
activity goes this way, because you want to gray out that button since the movie
stopped playing. Yet I do not need to worry about these messages in both
directions; I just set this constraint.
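A bare-bones sketch of that bidirectional equality constraint (illustrative only, invented class names): several components expose a time port, the ports are tied together, and a change initiated by any one of them propagates to the rest without the programmer ever choosing a message direction.

    class Port:
        def __init__(self, name, on_change=None):
            self.name, self.value, self.peers, self.on_change = name, 0.0, [], on_change

        def set(self, value):
            if self.value == value:
                return                     # already consistent: stop propagating
            self.value = value
            if self.on_change:
                self.on_change(value)
            for peer in self.peers:
                peer.set(value)

    def tie(*ports):
        """Impose an equality constraint across all the given ports."""
        for port in ports:
            port.peers = [p for p in ports if p is not port]

    player_time = Port("video player", on_change=lambda t: print(f"seek video to {t}s"))
    slider_time = Port("timeline slider")
    annot_time = Port("annotation display")
    tie(player_time, slider_time, annot_time)

    slider_time.set(42.0)    # the user drags the timeline...
    print(annot_time.value)  # ...and the other components follow: 42.0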
So this is how hopefully these applications will be just glued together. Even
without messages. Now, I want to tell you how events would be handled here.
And I have one minute, and I think I'll manage. What you see here is this
annotation display, which showed all the annotations. You actually create them
here. You get a list of annotations and you create a bunch of boxes, just like in
your HTML, single dense boxes. And they are all put into the V box, so you can
think of the V box as the parent in the DOM of these boxes, and you put the V box
into a scroll box. There is nothing magical here.
But let's look inside here. You'll see two other benefits of constraints and events.
One is that, well, here is one annotation. The text from here comes here,
and this is the box that displays the annotation. Here is the height and Y. Now,
what do I do when I want to make the annotation active? Normally I would need
to come here and say, okay, when there is a mouse click on that box I need to
somehow send a message somewhere and adjust the time. And even in event
programming, or even in dataflow, you typically need to go to some central
repository and send the time, and then the time would come back and say, oh,
okay, now we have a new time; let's see if the time is in the
interval of this annotation; if it is, you make it active; and if it is active, the Y
coordinate needs to be zero, which means this guy is centered.
But I do not want to go with this event all the way to time and figure out how
things will actually propagate. All I do here is I say well, okay, if I click on the
mouse, I insist that this annotation becomes active. But I don't really care about
how the time is set. This will be again done automatically. In order for this
annotation to be active I need to change the time, but I don't worry about how.
So I can infer the time from active, because things are bidirectional, and it
seems to be more natural, more modular, to think about setting activity on an
annotation rather than figuring out what time I need to set so that it becomes
active.
And so these bidirectional constraints occur everywhere. You saw how they can
simplify event handling in terms of time; coordinate system translation is
everywhere. Imagine you want to map landmarks from the map onto the
camera view. It's all about coordinate mapping between the map view
and the camera view and the angle of the head you are looking at, and so on.
A scroll box is full of bidirectional constraints. And visualization is about placing
labels such that they do not overlap and look visually appealing.
But current solvers are not expressive enough, they are clumsy, and there are other
things that we need to address in this work. And there are a few other things that
I didn't talk about, technically probably the most interesting ones, and I can go
into them if you are interested. Thank you for your time.
[applause]