
>> Juan Vargas: We are going to continue with the workshop. I'm going to start the sessions on
applications. As you can see from the sequence of speakers, we have two speakers coming from
Illinois and three speakers coming from UC Berkeley.
So what we hope to accomplish is to have the two presentations from Illinois in about one hour
and the three presentations from UC Berkeley in one hour and ten minutes or so. We can always
get short on the break.
So we will start with John Hart presenting AvaScholar, then Minh Do doing 3D reconstruction,
followed by Gerald Friedman from Berkeley talking about PyCASP, followed by Leo talking
about parallel components for an energy-efficient Web browser, and followed by Ras who's
going to talk about program synthesis for systems biology.
So we'll start with John. And John, of course, comes from the University of Illinois.
>> John Hart: Thanks a lot. So I want to talk about AvaScholar and dig into some of the details
of AvaScholar.
AvaScholar is the applications layer of our work through UPCRC and I2PC, as a way of, you know, looking forward at what are going to be the killer applications requiring lots of parallel computation in the future.
And it's fairly obvious that those are going to be visual computing based. There are many applications, but given the kind of computing we want to do at present and the kind of computing we want to do in the future, it's obviously going to be some form of visual computing.
And it's an exciting area right now, because the areas of visual computing -- graphics and computer vision and machine learning and so on -- are really starting to gain a lot of excitement, because we can do things that just didn't work a few years ago.
That's in part because computing has gotten faster thanks to Moore's law; these techniques work only so well today, and we can make them work much better if we can scale them up even further.
So that's sort of the target: looking at visual computing as an application driving our parallel computing work.
And also, if we want to look at the kinds of sample programs we want to give to our teams developing parallel programming tools, we want to give them software that's representative of the kind of applications that we'll expect in a few years.
So we're not only giving them visual computing code, but we're giving them visual computing graduate student code, which I think is the worst kind of code there is.
And it's motivated by our work developing this AvaScholar system. The AvaScholar system consists of two pieces. There's been a lot of excitement about online education and delivering courses and so on, but beyond the companies providing these, and beyond videos of lectures being worked on tablets and recorded and played back, there are interactions with students.
And so some things that are lacking, that we hope to fix in the next few years, are these two modules: this kind of remote online instruction where you can hold up a visual aid, a three-dimensional visual aid -- and, you know, we use the scholar name to mean academic education, but this would work equally well in meetings.
And as Josep [phonetic] mentioned, I've always envisioned a Toyota engineer in the U.S. holding up an accelerator pedal assembly, trying to explain what's going on to an engineer in Japan. In having that conversation, simple video may not give enough information to see the intricacies of what's going on in an assembly. You may want some deeper three-dimensional representation. And so, some way of building that.
And then at the other end we have the student module which is just basically a simple Web cam
or what will likely be a Kinect that's embedded in everybody's laptop computer and cell phone or
whatever platform we're on in the next few years, some student receiving these lectures, and then
some indication, some agglomerated indication of what those students are responding to.
And so we have tools that do this now, that do soft biometrics that can tell the expression of a student somewhat reliably, and they can also give us the demographics -- the age, the sex, and other attributes of the student population. And so as you're giving your lecture, you can find out that 40 percent of women from 35 to 45 aren't really interested in what you're talking about, and you can adjust your lecture accordingly.
And this would be useful in an online context. This would be useful in an ordinary classroom
presentation. And as Josep just mentioned yesterday, it would also be useful in a political speech
when you're trying to give a popular speech and you want to know who's listening to you and
who's not.
So I want to give you an idea of what this looks like. And so see if any of this works. Nope. Let
me try it again. Hmm. Try one more time. There we go.
So here's an example of the system running. And on the left here you can see expressions, and here's some face tracking software that can track my face as I'm moving around. And if I hit 1, it starts fitting -- there. Well, it's not a good fit. Let me try again. That's a better fit. Now I've got a grid that's on my face that's tracking my face. And it's working okay. It took two tries to actually fit the grid. I shaved this morning, so that helps, but I still have this. I'm just not going to shave this goatee for a demo.
But as I'm talking, you're going to see a lot of different expressions detected by the software,
mostly surprise that it's working and a little bit of fear. And I found that I can do a lot of this by
just moving my eyebrows.
So it works reasonably well. And this is just one part of our student interface to this system. If we can scale it up, it will work more reliably. We can use larger models trained with more detail. This one's just tracking basically the motion of these grids. We can have the grid follow the face more accurately, and we can detect a wider variety of faces. With some faces you get much more facial hair, and it has trouble following them.
So that's the --
>>: [inaudible].
>> John Hart: Not yet.
>>: [inaudible] motion at all.
>> John Hart: I'm going to the bozo school of acting. This is surprise. This is sad.
So this is all built on a house of cards. In order to get these things running, we have all this other
technology underneath. And we want to be able to scale this up. I want this to work reliably
looking at a person, at an individual student. I want this reliably looking at an entire classroom
of students.
And we can't do that right now. And this is one component, and it has to work along with the
shrug detector so we can tell if students aren't understanding what's being presented to them and
a few other demographics, biometrics as well.
So we want to scale that up. And so we need to make these things run, you know, faster and more robustly. And, you know, looking at current trends, we're going to have to make what was otherwise serial code parallel. So it's a good example of trying to use the parallel programming tools we have to solve a problem.
And it's built on top of all these tools. In order to do the instructor part, we need to do surface reconstruction from multiple cameras. In order to do the student part, we need soft biometrics. And these require things like computations between multiple images in order to infer depth.
And things like alignment, ICP alignment, where you can take multiple 3D scans and align them
as quickly as possible.
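As a rough illustration of why ICP leans so hard on nearest neighbor queries, here is a minimal sketch of one rigid ICP iteration -- a textbook version, not the group's actual code:

import numpy as np
from scipy.spatial import cKDTree

def icp_iteration(src, dst):
    """One rigid ICP step: match each source point to its closest
    destination point, then solve for the best rotation/translation
    (the orthogonal Procrustes problem on the matched pairs)."""
    matches = cKDTree(dst).query(src)[1]      # the C in ICP: closest points
    p, q = src, dst[matches]
    pc, qc = p - p.mean(0), q - q.mean(0)
    u, _, vt = np.linalg.svd(pc.T @ qc)
    if np.linalg.det((u @ vt).T) < 0:         # guard against reflections
        vt[-1] *= -1
    r = (u @ vt).T
    t = q.mean(0) - r @ p.mean(0)
    return src @ r.T + t                      # source points moved toward dst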
And if you boil those all the way down, it basically rests on the shoulders of these two fundamental technologies: something to do a really fast nearest neighbor in parallel, and then some image processing and histogramming routines that basically give you feature detection -- the ability to look at images and come up with some vector describing what's in the image.
And so surface reconstruction is mostly just registering a couple of images so that they're aligned. And there's this really good algorithm for doing this that was developed at Disney Research in Zurich that ran in about 20 minutes -- the top-of-the-line algorithm -- and we've been trying to speed it up. We've got it down to about 20 seconds, and now we're trying to use the rest of the tools to get another two orders of magnitude in order to get this thing into real time, which I think would be quite an accomplishment.
There are deformable alignment algorithms. We've got an algorithm that's going to be presented at SIGGRAPH -- it's not by us, but somebody else -- that we're implementing. It's based on this technique that came out of Stanford back in 2007 to take multiple scans of a hand moving, for example, and be able to align them by segmenting them into individual components.
And you can look at the running times for this: 13 minutes, 51 minutes, over an hour. That's the kind of thing we're trying to implement and scale up to the point where it can run in real time.
So lots of big challenges there. Basically this is -- another way of thinking of this is it's doing
KinectFusion for moving objects. And Minh Do's going to talk about our progress on doing that
and the AvaScholar instructor module in the talk after mine.
So I want to focus on sort of those low-level tools, in particular nearest neighbor problems.
And these nearest neighbor problems come up all the time. It's one of the fundamental algorithms we use in visual computing and in machine learning and in many other areas. In 3D, applications use it for surface reconstruction from scattered point data, because you need to know where the neighboring points are quickly and efficiently, and for aligning points. In the ICP algorithm, the C stands for closest point, so you always need to know where your point neighbors are.
And it also happens in high-dimensional applications. Anytime you stitch images, you're finding features and comparing them with features in similar locations in other images. So you're looking for neighboring feature points in this high-dimensional feature space. Those features can be just the pixels, you know, lined up in one long vector, or they can be these other spaces of features.
But these features can be vectors that are hundreds or thousands of elements long, and then you need to find nearest neighbors in those high-dimensional spaces. And this nearest neighbor problem dominates -- it's the critical code segment for a lot of visual computing applications.
And dealing with spatial data. Spatial data is very -- is distributed in a very nonuniform fashion.
If you think of where all the atoms are in this room, it's in a very nonuniform distribution.
If you look at parallel speedups -- this came from Pradeep Dubey and through Jim Held, you know, Intel's scalability curves -- and you look at the least scalable visual computing applications they have, their game rigid body and production cloth are both dominated by collisions. With cloth it's keeping the cloth from turning inside out, and with rigid bodies it's keeping these rigid bodies from intersecting each other.
And so early on we did some work looking at parallel patterns for how k-d trees are constructed, and discovered that the state-of-the-art routines for building these k-d trees -- these tree-based spatial decompositions -- had very simple, trivial parallelism at the top, where they were one-processor, two-processor, four-processor style parallelism, and then became parallel at the lower levels of the tree as they were built in sort of a breadth-first fashion.
And as the number of processors increases, you know, we're going to lose a lot of parallelism at the top, and this processing of the top layers will end up dominating the entire problem.
So as computers become more and more parallel, the actual performance of this algorithm will decrease as a result. And so we came up with this algorithm called ParKD that basically streamed through all the data that you want to partition into areas and used all of the parallelism available on both the bottom half and the top half of the tree, and, you know, set some speed records for constructing these trees back in 2010.
And we had two different approaches. One was a nested parallel construction where you're just forking off tasks for each new level. And we had this in-place construction that basically didn't move any data around but had a lot of pointer indirection.
And, you know, we found a bunch of interesting things about the parallelism of these processes: that nested ended up being faster than in-place in our examples, but that the in-place algorithm scaled better. And we've been able to improve those results quite a bit in the years since.
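The nested-parallel style is easy to picture: every split forks a task for a child subtree. Here's a toy version, with Python threads standing in for TBB tasks (real ParKD also extracts parallelism inside the top-level splits by streaming the data, which this sketch does not):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def build(points, depth=0, pool=None):
    # Leaf: stop splitting once buckets are small.
    if len(points) <= 8:
        return ("leaf", points)
    axis = depth % points.shape[1]            # cycle through the axes
    order = np.argsort(points[:, axis])
    mid = len(points) // 2
    split = points[order[mid], axis]
    lo, hi = points[order[:mid]], points[order[mid:]]
    # Nested-parallel construction: fork a task per child near the root,
    # where a breadth-first build has only 1-, 2-, 4-way parallelism.
    if pool is not None and depth < 3:
        left = pool.submit(build, lo, depth + 1, pool)
        right = build(hi, depth + 1, pool)
        return ("node", axis, split, left.result(), right)
    return ("node", axis, split, build(lo, depth + 1), build(hi, depth + 1))

pts = np.random.rand(100_000, 3)
with ThreadPoolExecutor(max_workers=8) as pool:
    tree = build(pts, pool=pool)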
And so we have that for low-dimensional nearest neighbors. For high-dimensional nearest neighbors, things are different. When the dimension gets greater than about 15, then, you know, what constitutes a nearest neighbor and what metric you use becomes a little murky. And there are approximate algorithms that work very fast, where you're taking random traversals down a spatial subdivision tree, subdividing along one of the five axes with the highest variance at each step.
And we can compute that dimension of highest variance just by looking at a small subset of the points.
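Here's a sketch of that split rule as described -- estimate per-dimension variance from a small subset, then pick at random among the top five, which is what gives each randomized tree a different traversal:

import numpy as np

def pick_split_dim(points, rng, sample=100, top_k=5):
    """Estimate per-dimension variance from a small random subset of the
    points, then choose the split dimension at random among the top_k
    highest-variance axes (randomized k-d forest style)."""
    idx = rng.choice(len(points), size=min(sample, len(points)), replace=False)
    var = points[idx].var(axis=0)
    return rng.choice(np.argsort(var)[-top_k:])

rng = np.random.default_rng(0)
feats = np.random.rand(10_000, 128)   # e.g. SIFT-like descriptors
dim = pick_split_dim(feats, rng)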
And these algorithms work pretty well, and the best one out there is called FLANN, by Muja and Lowe, up at UBC. It's fast, but it's only a serial algorithm. There are parallel versions of it, but the parallel versions just distribute multiple queries over parallel processors. They don't speed up individual queries. And very often you need a single query to be fast.
So we did some theoretical analysis to find out what the maximum parallel performance of one of these trees could be, and we did some parallel implementations just using TBB with depth-first scheduling, got some reasonably good scalable results on single queries and construction of these trees, and, in fact, beat existing algorithms, and beat a GPU algorithm for doing this.
And one of the interesting things we were able to accomplish is comparing CPU performance to GPU performance fairly. We have a CPU that has four processors, but each of those four processors has eight-element vectors now, these AVX vectors. And so we made sure we took full advantage of those vector units.
So, you know, when you're programming a GPU, you may have what ends up being, you know, 16 or more processors, each with 32-element vectors, and you want to make sure you're comparing that properly with the CPU that has four processors with eight-wide vectors.
And so on just a four-processor Ivy Bridge system we're getting 22X, 27X speedups over the pure scalar, pure serial code, because, you know, we're using the multiple cores plus the vector units appropriately. And we beat the pants off the GPU implementation of a variation of this. And that was estimated from kind of the top-of-the-line GPU implementation.
So we've made some good headway in implementing these nearest neighbor searches. The other thing we're working on is ViVid, and ViVid is this vision video library that has all sorts of layers built on kind of this low-level GPU implementation, a C++ layer, and a Python layer. And we've been mostly focusing at the low end, the low layer of this construction.
And ViVid is basically our main feature detector, our way of processing images into the features that describe the images -- the features that get tracked when you're trying to keep a grid on a moving face, for example.
And it consists of three components. One's a filter bank component, one is a block histogramming component, and then the third one is this pairwise distance, which ends up being, you know, the kind of brute force nearest neighbor algorithm you use for this particular kind of data.
And you have blocks. You take your image, and you have a moving window that creates these small 16-pixel-by-16-pixel blocks, and each of these blocks can be separated into 16 cells, each of which is 4 pixels by 4 pixels, and then we compute histograms over each one of these little 4-by-4 cells. And those histograms are basically the responses of convolving these particular filters -- something like a hundred filters -- over those little 4-pixel-by-4-pixel regions.
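The talk doesn't spell out the exact weighting ViVid uses, but the structure is roughly this: per cell, a histogram over filters, fed by each pixel's strongest filter response. A minimal NumPy sketch under that assumption:

import numpy as np

def block_descriptor(responses):
    """responses: (F, 16, 16) filter-bank responses for one 16x16 block,
    F ~ 100 filters. For each of the 16 4x4 cells, accumulate a histogram
    over filters from each pixel's winning (highest-response) filter."""
    F = responses.shape[0]
    best = responses.argmax(axis=0)          # winning filter per pixel
    weight = responses.max(axis=0)           # its response strength
    desc = np.zeros((4, 4, F))
    for cy in range(4):
        for cx in range(4):
            b = best[4*cy:4*cy+4, 4*cx:4*cx+4].ravel()
            w = weight[4*cy:4*cy+4, 4*cx:4*cx+4].ravel()
            np.add.at(desc[cy, cx], b, w)    # scatter into the cell histogram
    return desc.ravel()                      # one long feature vector

resp = np.random.rand(100, 16, 16)           # stand-in filter responses
vec = block_descriptor(resp)                 # 16 cells x 100 filters = 1600-D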
Now, why would you do that? Well, that converts this block into this long vector that's basically the response of these histograms, and you can then use it to describe the content of that block such that you can compare it to blocks in other images. And if the same kind of image content is in that block, then it will give you the same feature vector. And the distances between those feature vectors behave much better.
If you just compare the pixels in a 16-by-16-pixel block with another 16-by-16 block, you'll get a bad answer, even if that block is just moved over one pixel. If everything's moved over one pixel, you'll get a bad answer; but if you use one of these feature descriptors, then you get a much better indication that those two images are displaying the same thing.
And so this gives us a feature descriptor, kind of like SIFT or SURF or other feature detectors, and this one works pretty well. And this has been sort of the target of our parallelization effort -- to try to speed these things up, scale them up.
And so we've implemented them. The convolution is implemented as a GPU process doing four blocks at the same time in order to save the overhead of the apron space you need when doing a convolution. The histogramming basically looks through the filter responses. You're applying about a hundred filters to each of those little cells in order to see what the response is -- so doing a hundred convolutions at a time and then looking at the answers to those convolutions.
And what we want is the histogram of the filters that give you the highest response. And so we have to do some histogramming. And there are all sorts of techniques we've been looking at in order to improve this histogramming: using an atomic scatter instead of a gather, and ignoring low responses -- not histogramming filters that don't respond very well -- so that we avoid the chance of a collision. So we can do faster, kind of sloppier histogramming that still works well.
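The "sloppier but faster" idea is simple to state in code: drop the weak responses before scattering, so the atomic adds collide less often. A CPU sketch of the logic, with np.add.at playing the role of the GPU's atomic scatter:

import numpy as np

def sloppy_histogram(best, weight, num_filters, threshold=0.1):
    """best/weight: per-pixel winning filter index and its response.
    Pixels with a sub-threshold response are simply ignored, trading a
    little accuracy for far fewer scatter collisions."""
    keep = weight > threshold
    hist = np.zeros(num_filters)
    np.add.at(hist, best[keep], weight[keep])
    return hist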
And then there's this pairwise distance matrix, which is basically computing the distance between feature vectors -- you know, every feature vector with every other feature vector -- in order to find matches in that particular case.
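Written out, this stage is the classic all-pairs distance matrix, and the standard trick is to turn it into one big matrix multiply, which is exactly what GPUs and vector units are good at:

import numpy as np

def pairwise_sq_dists(a, b):
    """All-pairs squared Euclidean distances between two sets of feature
    vectors, via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b."""
    aa = (a * a).sum(1)[:, None]
    bb = (b * b).sum(1)[None, :]
    return aa + bb - 2.0 * (a @ b.T)

q = np.random.rand(512, 256)     # query descriptors
r = np.random.rand(4096, 256)    # reference descriptors
nearest = pairwise_sq_dists(q, r).argmin(axis=1)   # brute-force matches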
And so we've gotten quite a few speedups. It's interesting to compare the CPU speedups with the CUDA speedups, doing 3-by-3 filters, a filter bank of 3-by-3 filters versus 5-by-5 or 7-by-7. The CPU does just fine with all of those, but we hit some serious register --
>>: Sir, in each case your speedups are relative to what?
>> John Hart: Serial.
>>: And it's the same serial for both of them?
>> John Hart: Yes.
>>: Thank you.
>> John Hart: And, in fact, these 4Xs actually go up a little bit more than 4X, because there's been some performance improvement in addition to just making them parallel.
And histogramming, you know, we don't get as good a CPU speedup, but the GPU speedup is
significantly degraded. And likewise with pairwise distance. We're doing quite well.
>>: [inaudible].
>> John Hart: Yeah.
>>: [inaudible] is that the case here as well?
>> John Hart: No. Not yet. We're trying to.
>>: Ah. Okay. So you're --
>> John Hart: Yeah. These are just four cores without using the vectors.
>>: Oh. Okay.
>> John Hart: And these are, of course, using the vectors.
>>: So now it makes a lot more sense.
>> John Hart: Yeah. Yeah. Sorry. I should have mentioned that, differentiated that from the
nearest neighbor.
So we're struggling. Our current plan is -- this is using TBB, but we really want to get an OpenCL implementation of this so we can kind of target all the architectures. And we're in the process of doing that.
We've got some preliminary results. So here's OpenCL on two GPUs -- on the same GPU comparing CUDA to OpenCL -- and OpenCL is falling a little short. But, you know, we just finished these results, and it may be as much an issue of not knowing the same tricks with OpenCL that we know with CUDA. But we're getting there. And that's just on the pairwise distance. We're in the process of getting the rest of the elements onto OpenCL.
>>: Have you done experiments yet with running that on OpenCL on an Intel CPU?
>> John Hart: That's why we're doing it. We don't have those results yet, though.
>>: [inaudible] feedback we'd very much appreciate.
>> John Hart: Yes. And I do know that. Yeah. These results are I think two days old.
>>: Okay.
>> John Hart: And so also, you know, we've looked at power. And this is a collaboration with -- I think Josep yesterday talked about taking some of that face grid matching code, or the motion tracking code, and applying some of the tools from DPJ to that code. And so, you know, we've been using the AvaScholar application to provide code for some of these other tools.
And David Padua and Maria Garzaran have been working on basically trying to schedule things like these filter banks across multiple processors. And in the process of doing that, we got some nice data on the CPU system power on an Ivy Bridge system for running those hundred filters on a sample image. And so this is the power rate, you know, joules per second, as you turn the clock rate of the CPU up or down.
And you can see that, you know, if you have more power, you end up running it faster. And if you have less power, you end up running it slower. And the question is, that's fine, but system power is joules per second, and so if you want to look at the total energy used, you've got to divide this by this in order to figure out the --
>>: [inaudible].
>> John Hart: No, you have to multiply that by that in order to figure out the total amount of energy used. And so if we do that, we can find out at what clock rate we want to run our CPU if we want to compute the feature detector using the least amount of energy possible.
And the answer ends up being the clock rate that runs it in about 114 milliseconds. I think that's the 3.2 gigahertz clock rate.
So the 3.4 gigahertz clock rate computes it faster -- computes it in a little over a hundred milliseconds -- but using a lot more energy. And so the sweet spot for this particular application gets revealed by that data.
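A back-of-the-envelope version of that tradeoff (the runtimes roughly match the numbers quoted above; the wattages are made-up illustrations, since the talk doesn't give them):

runs = {                      # clock -> (runtime in seconds, system watts)
    "3.4 GHz": (0.102, 52.0),
    "3.2 GHz": (0.114, 44.0),
}
for clock, (t, p) in runs.items():
    # energy (joules) = power (watts) x time (seconds)
    print(f"{clock}: {t*1e3:.0f} ms x {p:.0f} W = {t*p:.2f} J")
# 3.4 GHz: 102 ms x 52 W = 5.30 J
# 3.2 GHz: 114 ms x 44 W = 5.02 J  <- slower clock, less total energy

The faster clock finishes sooner but can still cost more joules, which is why the sweet spot sits below the top frequency.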
>>: Yes, and you got this power by just plugging a wall power meter in?
>> John Hart: That's a good question. I don't -- it's -- no. I don't remember what we used to
measure that. It was much more complicated than with a wall power meter, though. Because we
have those results too. But these are -- I don't remember the details for how that is. I can find
out, though.
>>: I guess the question is, there are so many different parts of the system, and the different parts are measured in different ways, so it may only be the wall [inaudible] multiplied by time. I don't know.
>> John Hart: Right, right. Yeah. I would have to check. This is something we provided the code for and I happened to see the results from, so it's not my project. But I can go back. I mean, we have some Ivy Bridge systems, and we instrumented one, and I think the instrumentation was more significant than just looking at the wall power coming out of it.
And this is just kind of scratching the surface. You know, you want to do this on your cell phone, right? If your battery is getting low and you still want to point your camera at something and have it automatically translated, or have it tell you what you're looking at, you want to find out at what clock rate to run the thing and still get the answer using the least amount of battery.
And the other thing this shows is another example of, you know, using AvaScholar code to provide sample code for some of these other parallel processing and power projects.
Okay. And that's it. Thanks.
[applause].
>> Minh Do: All right. So I'm going to present a continuation of what John just explained, of the kind of overarching project that we are doing at the University of Illinois.
And I'm going to zoom in on one big aspect of that, which is the presentation side: how can we enhance the visualization aspect of AvaScholar. But you can also see it in the context of very important, ubiquitous applications out there that are going to be deployed on a variety of devices.
And here's a set of my colleagues and former and current Ph.D. students that have been working on this project. Matthieu -- Matt -- is now actually working right here at Microsoft.
So let's start with the very common application that all of us benefit from: current video communication systems. You know, ranging from Skype, Google Talk, VUDU, and iChat.
These kinds of systems, I think, are the reason why every laptop now has a camera on it, and the cell phone has not one but two -- there's a rear-facing and there's a front-facing camera, and the front-facing one is really just for video chat.
So certainly it is a very ubiquitous application, and it's also one that demands a lot of computation.
But my thesis here is, well, that is still very simple. Because what you see in those systems is basically: record the video, the visual scene, and simply transmit and then play it back.
All right. Now, you know, that obviously is very effective at letting people communicate visually, but the fact that many of us still travel to go meet people face-to-face really indicates that users demand something more. The ultimate goal is: can we replace those actual face-to-face meetings?
And that's where a big set of demands now, and research in our area now, is how to provide the computational tools that enhance these visual communication systems and provide users a very immersive experience. I don't need to elaborate much more on that, because if you happened to see the plenary talk by Rick Rashid, you know, it's exactly about this: how can we provide these remote augmented realities.
Now, the key thing is, you see there are systems out there that try to, you know, elevate these commodity visual communication systems -- like TelePresence -- but they all require very expensive hardware and room setups, whereas we all like the convenience of carrying our laptop or mobile phone around. So the vision is: can we, using computation but with commodity hardware and cameras, provide realtime visual communication systems that are able to let multiple parties remotely interact and visually feel that they are there in the same common space? So that is the vision.
And if you elaborate a little bit, you can think about this overall structure in which we have multiple devices and some processing: on the capturing side you get the information, you packetize it, and then you send it to multiple parties, and then individual viewers can select their viewpoint and their composition, depending on what they are interested in and who they are talking to.
So that's going to open up a much richer visual communication system than what we have seen so far with Skype and so on.
So, just to recap: we have an existing structure that is enormously successful -- think about Skype -- but if you look at it, it's simply record, capture, and then transmit. On the capturing side, cameras have more and more pixels now, already reaching the limit of what a human can see. On the display side, people already have very high resolution displays, also reaching the capacity of the human eye.
In terms of the bandwidth, you know, we always worry about the bandwidth, but psychology tells us very clearly that the human eye can only absorb about 10 megabits per second, and more and more of the network -- even Wi-Fi now -- starts reaching that.
So you can see that, from the technology [inaudible], we are reaching that human limit of absorbing visual information.
What's really missing is: can we synthesize, can we generate some of these more novel viewing experiences, beyond what a single camera can capture?
So I think that opens a lot of opportunities. And because the application demands a realtime experience, very efficient computing on commodity hardware is a key element there.
So, just to set up a key example, here's a paper we recently presented. It kind of articulates that there could be a future where we can take multiple parties, put them in a common environment, and they can really feel they are being there -- being there in the sense that, in the same environment, they know the other party is also in that environment and is being looked at. But of course you can see it still has a lot of artifacts; the lighting is not right.
So, again, it's showing there's potential to enhance, you know, users' Skype-style communication systems. But, yes, a lot more computation is needed there.
So what are the research challenges, and what is computation going to provide us? Here are a number of really important tasks -- [inaudible] gives us just a scratch on the surface -- and how parallel computing can help them. But, again, the key bottom line here is that we want to use low-cost commodity hardware: the cameras, as well as the CPU and GPU on our mobile platform or laptop. And the experience has to be, you know, like being there -- really, you want to sit down with another person and feel like you are both in the same environment.
And the problems are going to be how to capture, how to synthesize, and how to transmit and encode that information.
Okay. Now, if you think about our current video communication, you think about capturing the color information, and what that is is taking a 3D world and projecting it into an image. And when you project a 3D world into an image, what you lose is exactly the depth information. You know, I have a pixel with some color now, but if I only have a single image, I don't know how far that light traveled before it hit the object.
So that's why you can see the excitement around the Kinect in our communities -- computer vision, video processing, and computer graphics -- in the sense that now we have a commodity, low-cost device that can capture, in addition to the color information, the depth information. And that allows us now, in real time, to really fully capture that 3D information, and then be able to do something much more sophisticated than just simply displaying a recorded video.
But the challenge coming in is that with these devices -- as we already know very well from digital cameras, right, there's a lot of processing inside the SLR or the handheld camera in order to turn a typically noisy captured signal into a high-quality image -- it's the same with the depth camera. You know, we often end up with very poor quality data. But, yes, we want to present a very high quality visual experience to the user.
So here's where, again, computation can come in. And let me, you know, illustrate with a simple application here. In the scenario we think about, we have some commodity color and depth cameras. And that small setup could potentially all be integrated inside our laptop, you know, in the display bezel. And we want the ability to take those actual captured images and synthesize a novel viewpoint. I won't go into detail on why that would be important, but here's an algorithm that we proposed, and basically it is very standard but, you know, clever ways of fusing and utilizing the complementary color and depth information to enhance the visual quality.
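The talk doesn't give the algorithm's details, but the core step of any depth-image-based view synthesis is the same: back-project each pixel to 3D using its depth, move the camera, and project again. A minimal sketch, assuming a pinhole camera with intrinsics K and a new pose (R, t), and leaving out the hole filling and color/depth fusion where the real work -- and most of the computation -- lives:

import numpy as np

def reproject(color, depth, K, R, t):
    """Warp a color image to a new viewpoint using its depth map.
    Assumes positive depth everywhere; no z-buffering or hole filling."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).T
    pts = np.linalg.inv(K) @ pix * depth.reshape(-1)   # back-project to 3D
    pts = R @ pts + t[:, None]                         # move to the new camera
    proj = K @ pts
    uv = (proj[:2] / proj[2]).round().astype(int).T    # project to pixels
    ok = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    out = np.zeros_like(color)
    out[uv[ok, 1], uv[ok, 0]] = color.reshape(-1, color.shape[-1])[ok]
    return out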
And the example here is in the real lab, and we were able to do it in real time -- thanks to using a lot of GPU -- to synthesize another viewpoint. Which, again, goes back to the example John mentioned early on, AvaScholar: you can think about the teacher now talking to students. And we think one thing students like about coming to the real class is the feeling that the teacher is looking at them while they're learning or listening to the lecture, whereas remote participants normally feel completely left out, because they never get the eye gaze of the professor; the professor tends to look elsewhere.
But now, with this eye gaze correction, the remote participants can feel like, well, they are being looked at, and they feel they are in the same environment.
Now, this problem -- I'm sure if you use Skype, for example, you experience it too: the camera is on top and you look at the screen, so to the remote parties it feels like you are looking down. And, again, you know, that breaks the communication, it breaks the trust.
So, again, it needs to be real time, and this processing using color and depth, deployed on parallel architectures, provides us realtime capability with significant speedup.
Another example, which John briefly mentioned: again, when you have typical data from the Kinect, then, due to many, many reasons -- you know, the infrared interferes with a lot of other things when the Kinect tries to capture the depth image -- you end up with a lot of noisy data.
One thing we know is that we understand the object that is moving -- it's the same human face, just moving around -- so we can have a very efficient way of accumulating data over something, you know, deformable, and based on that we can build a very high-quality depth map. Again, it has to be in real time, so that when you present and render that person to the other side, you have a very high-quality image.
You can even do some stylization. So, for example, render that person to look younger, or lighter, you know, for the viewers -- for example, to increase engagement. Many potential applications can be done once you capture this data in real time.
And of course, you know, again we deployed that, working with a lot of our colleagues here who are experienced in parallel computing, and got significant speedups.
So let me -- I was deeply inspired by the talk by David yesterday, you know, what are the big ideas. So if I can leave one key idea here, let me go through one key example.
One thing we found through this experiment is, you know, we have some applications where we really need real time, so we go talk to our colleagues in parallel computing and learn all the tricks, and with their help we're finally able to get realtime algorithms.
Now, in the reverse direction: slowly, as we learn what makes an algorithm easily parallelizable, there's some feedback going the other way as well.
So here's one example. One key application in image and video processing is how to enhance quality. And this is a very basic building block -- you know, it's inside all of the cameras, all of the laptops here. And it's a building block for, for example, style matching, image stylization, and many others.
The problem here is that you have some data and it is noisy -- I'm exaggerating here -- and you try to enhance a particular pixel using some kind of local average. So of course, you know, we know some kind of simple linear filtering would help, but the key challenge is: how can you do that filtering so that you average pixels that are, you know, in the same object, but not pixels from outside, from some other area?
So, in other words, you want some kind of edge-preserving filter. And that is what the bilateral filter is: one of the very powerful tools, now ubiquitous in visualization and image processing.
The idea is, instead of just using the typical spatially invariant filter, where the weight depends on the relative position of the neighboring pixel to the current pixel, we also bring in the difference in value. And using that, we can build up an adaptive weight for locally refining the estimate of the current pixel.
And of course the bilateral filter, you can see, is very challenging to parallelize on a GPU, for example; there's been a lot of effort on that. And the algorithm, of course, when it was developed, was developed purely as an algorithm -- a sequential one -- and then, you know, it got [inaudible], and, okay, someone else can try to parallelize it and make it faster.
But then, what's interesting -- and there's some work coming out actually from Microsoft Research in Asia -- the observation is, well, okay: now we understand which elements of the algorithm would be very easy to parallelize and which would not.
And of course the goal is to enhance the visual quality of the image, and we can approximate it in a different way. I won't get into the detail, but there is the so-called guided filter, which essentially breaks that sequential, spatially adaptive convolution into multiple stages. And each stage simply computes the sum of the pixels around some box, and that box can have some adaptive size. And to find the sum around the box, as we know, we can use the so-called integral image, which is -- you know, once I know the sum of the image up to each point, then any box here I can get in just four operations.
So, extremely fast. And now this guided filter is embarrassingly parallelizable, and it provides, you know, almost 200X speedup compared to the best GPU implementation of the bilateral filter.
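The integral-image trick in code: one cumulative-sum pass, and then the sum over any box is four lookups, independent of the box size -- and every output pixel is independent of every other, which is what makes the stage embarrassingly parallel:

import numpy as np

def box_sums(img, r):
    """Sum over every (2r+1)x(2r+1) window of img, via an integral image.
    ii[i, j] holds the sum of all pixels above and left of (i, j), so any
    box sum is ii[bot,right] - ii[top,right] - ii[bot,left] + ii[top,left]."""
    p = np.pad(img, r, mode="edge")                      # handle the borders
    ii = np.pad(p, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    h, w = img.shape
    k = 2 * r + 1
    return ii[k:k+h, k:k+w] - ii[:h, k:k+w] - ii[k:k+h, :w] + ii[:h, :w]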
And, yes, it now brings a lot of these very hard problems -- like, you know, optical flow, depth maps, stereo matching -- into real time.
And, again, understanding the parallelism, we extended that even further, in work that I recently presented at CVPR. We can extend that shape to arbitrary support. And, again, we have this multistage filtering architecture, which looks locally and then aggregates the results together.
And the key idea is that each of those stages relies on very simple computation. You look at a scan line and you just find the difference between two points, for example.
And just to get the point across: for example, we want to enhance this particular pixel using the neighboring pixels, right, and the weights -- which pixels should contribute to that particular dotted pixel there, listed over here. So for the bilateral filter, again, the key idea is that I only want pixels of the same object to influence my estimate here. I don't want pixels from other objects to have influence. And those are the masks, right -- adaptive masks; they vary from one pixel to the next.
So these are the examples. But then you can see that by decomposing this filtering into multiple stages, each of them is very simple to parallelize. Again, it basically just requires a lot of sums and averaging. I won't go into the detail of the math, but we can show and prove that we get the same effect, and of course it's very efficient -- for example, taking a very noisy depth map, the picture on the top right there, and down here it's improved using the guided filter, or using our newly developed filter.
So, again, the main idea here, I think, is that as we move ahead with a lot of highly parallel computing power in hand, we can aim for something much more impressive than what current visual communication systems provide. We can change the viewpoint, we can correct for the eye gaze, we can throw away the background and embed the objects -- you know, multiple parties -- in the right perspective in a virtual environment.
All of this relies on capturing the depth, this 3D information, in real time, and of course none of it would be possible in real time without the help of parallel computing.
But, you know, one of the key things I would say here is that because this is visualization, there's no single right or unique answer. There are a lot of approximations. So, again, co-design of the algorithm, developed together with someone who knows the parallel architecture, allows us to come up with very effective, highly parallelizable algorithms that, again, provide that high performance.
Okay. Thanks.
[applause].
>> Minh Do: Yes.
>>: There was a demo about ten years ago between Berkeley and Illinois where there was a
dancer on stage at Illinois and a dancer on stage in Berkeley and they danced together by having
3D capture and sending the images back and forth, so they danced with one another's images.
And the bottleneck that they discovered was [inaudible] to make all of that happen. And so has
that changed?
>> Minh Do: Yeah. Yes. So yeah. We [inaudible] work at Illinois, my colleagues [inaudible] the bandwidth at that time -- yeah. So ten years ago, of course, you didn't have that high bandwidth. The other thing is they really didn't know how to process the data. So what I'm thinking about is: like our human eyes, we see the information in 2D, and that is what people can manage at 10 megabits per second at most.
Now, that [inaudible] system captured the full 3D data [inaudible] and shipped it over.
So what we think here is, well, by taking the data and doing some kind of analysis, we can condense that information and get closer to what needs to be represented on the other side.
So, to answer the question: yes, the bandwidth is getting better now, but also, with some more processing, we can reduce that information and still provide that high-quality rendering in the end.
>>: So the other reason I ask, it sounds like you want to scale up to many people all interacting,
so you're going to need more bandwidth because you have many different people you're trying to
merge into one image that everybody shares.
>> Minh Do: Great question. Yes. Yes. So the question here is, you know, what about when I have multiple parties. So there are two ways. One is, with a small number of parties, you can do a mesh, in which everyone talks to everyone else; or else you do everything in the cloud, so you send to a node, and that node does the composition, and then it sends back the composite image.
>>: I'm confused why you say the bilateral filter is difficult to parallelize. Is it that it's difficult
to vectorize, or is it that it's difficult to parallelize?
>> Minh Do: The reason for that is that the weight here on the range depends on the difference of the current pixel with many of the local ones -- so there's a lot of dependency and, you know -- so it's hard to cut, you know --
>>: So it's a vectorization problem.
>> Minh Do: Yes, yes.
>>: I see.
>> Minh Do: And the support -- normally people try to get [inaudible] support so they get a good estimate.
>>: So this -- because there's not good scatter/gather support on these [inaudible].
>> Minh Do: Yes, yes. Whereas if we turn the problem around -- at least with the current one -- if we use a guided filter on this cross base, it becomes extremely simple to parallelize. So this kind of formulation here means the algorithm was designed with sequential computing in mind, whereas now that we understand parallelism, people are starting to develop algorithms that are much more suitable for deployment on multicore and many-core.
>>: I think the hard part of parallelizing that is when you actually go to implement it: you have the threshold below which -- you know, if you look at the thing in the red box, a lot of those pixels are black after you've applied the green thing, so you threshold those, and that thresholding is the if statement that occurs with vectorization. It's not in that equation --
>>: Okay, thank you. That helps. Because looking at the equations, it's like that's a standard geometric decomposition problem to be doing --
>> Minh Do: Yeah, but you don't compute that [inaudible] threshold way, many others --
>>: Sure. Sure. I can see that.
>>: Or you have to do [inaudible] get back to zero.
>>: Or you use a k-d tree.
>>: Yes.
>>: Yeah.
>>: Okay.
>> Juan Vargas: Okay. So many, many questions. Thank you very much.
[applause].
>> Juan Vargas: Now we have Gerald Friedman talking about PyCASP, scalable multimedia content analysis on parallel platforms using Python. He comes from UC Berkeley.
>> Gerald Friedland: So yeah. Some of you have been at the Par Lab retreat, and you have probably seen different, more specialized versions of this. This is basically an overview talk, the not-so-specialized version.
So first of all, a little correction: I'm Gerald Friedland, with the land, not the man. And I work mostly at ICSI, the International Computer Science Institute. That's a nonprofit, private institute affiliated with UC Berkeley. But this is collaborative work with many people. There's definitely Katya and Kurt in there, who are at UC Berkeley, but then also lots of other UC Berkeley students, as well as a little bit of work done by Dan Ellis, who's actually at Columbia University.
So let me just dig into this.
Basically the motivation for me -- I'm not a core parallel guy, I'm a multimedia guy -- the motivation for me is that if I look at the Internet, I'm actually happy, because right now the multimedia field is growing drastically. Like, every two years we have the same amount of data uploaded as had been uploaded in all the years before -- pretty much a doubling of the data on the Internet.
This is some random graphic of how this might look, from Raphael Troncy from MozCamp. It's old and [inaudible], but it's basically just saying we have more and more data every minute.
We can also ask other people. For example, YouTube claims that 65,000 videos are uploaded per day. And that's just YouTube. Or that there are 48 hours of video per minute -- that means every minute I speak, another 48 hours of material land on the site. And if you think about it, this is just YouTube, right? There are so many other social networking sites.
Another one is Flickr, and we know that they have about a million image uploads per day -- and everybody knows that they are officially going downhill, but still, that's a million images per day.
And then many people, when they think of social networks, think of Twitter, but what they often forget is that about every hundredth message has an image or a video associated with it, which, at a hundred million messages per day, gives you about a million images or videos per day as well.
So all this multimedia data is currently going in massive amounts onto the Web.
So, so what? Right? Why do we care? The usual answer could be, well, let Twitter and YouTube and so on deal with that. But, in fact, it's more than that, because what's really interesting for us as researchers in so, so many fields is that this consumer-produced multimedia content allows empirical studies at never-before-seen scale. Right?
So basically these videos -- and people usually look at me angrily when I say this, and correct me after I've said it -- these videos are a mirror onto the world, right? And the reason why people look angrily at me is because they are a highly biased mirror onto the world, right? There are various biases. And actually, just studying these biases is already interesting.
But in any case, if you look at the literature right now, sociology, medicine, economics, even the environmental sciences have actually studied these videos. And we as computer scientists are actually still more at the level of, how do you even find a video, right? So we started at the very bottom, but in reality we can do a service to all these fields.
And of course there's a buzzword for it now. It's called big data. We've done large scale for quite a while, and then somebody came up with a buzzword. Now I can put myself in that bin. Okay.
In any case, the problem here is a practical one: how can students and researchers, as individuals, effectively work on big data? The problem is, whenever I say something like, cool, we have all this data, then people say, well, yeah, but in order to access it you need to go to Google -- or to Microsoft, for that matter -- right?
And the other problem is, even if you have the data, many people, because they can't work on a million videos -- and that would even be small data compared to big data -- work on 10 images, try their algorithm, and say, oh, it works for ten images, so it should probably work for 10 million as well. And that may be true or may not be true.
So in any case, what researchers want to do is keep doing what they're doing. They want to play around with different statistical modeling techniques for a given problem -- like, you know, I have this method; can I train Gaussian mixture models, can I train neural networks, can I train whatever I want -- and they basically want to prototype their effort.
But the bottleneck is the processing time. That's why, as I said before, they'll probably choose a subsample that's so small it might not scale.
And you want to prototype in a productivity language, right? If you use EC2 or something right now, you're probably prototyping in C++, and that's not really prototyping, that's already hard-core coding. So you want to prototype with something like MATLAB or Python.
And the other thing is, you also want to leverage and integrate existing tools. You have used your face recognizer tool forever; it works, and you build on it. You don't want to now have to ask, how do I port this to this weird parallel hardware that I don't know how to use?
So that's basically -- these are the requirements.
So what are the issues? Computation time is the main bottleneck. If you develop anything with big data, just the feature extraction takes forever. Right? I'm currently in the Aladdin project, which is 150,000 videos, and for one deliverable, just the features take a week. So that's the problem: computation is the main bottleneck.
And that makes for a slow experiment turnaround, which makes people more conservative about experiments, which is what we don't want.
And then the deployment part is the other one. Let's say I have written up something, some experiment that sort of works; then how do I scale it up to, you know, 10 million videos and so on? Yeah.
And the problem there is, the way we do it right now is we scale it up by asking an expert programmer to write -- you know, a computational driver would be the simpler phrasing -- basically to port it to a platform that makes it massively parallel, for example CUDA or multicore.
The problem here is that once you've ported it to one of them -- and we know that multicore and GPU parallelism, for example, are different -- then it's stuck on that platform. And then, you know, in a couple of years there's another platform out, and you want to do it completely differently, and maybe your old platform doesn't even exist anymore -- I mean, we even had this problem in Par Lab -- and then your code does not run anymore. So that's not how you want to work.
So basically we have this approach. We call it, right now, PyCASP. And we try to solve all these problems at once.
Let's see how far we got. So basically the idea is you write all your code in Python. Right? So as an application programmer, you don't deal with the sort of super-specialized thing.
And then we have something called SEJITS, which basically Just-in-Time compiles this Python down to whatever platform. And then we try to provide a productive environment. That means we want to do Python, but on the other hand, when we use Python, we want efficiency in the background so that you can run your stuff fast. And that efficiency, since it's hidden, can also be portable, because we call whatever we have depending on the hardware. And then we hope that because it's portable it's also scalable -- scalable in the sense that you can run it on one CPU, eight CPUs, a GPU, or a cluster of GPUs. And in fact these are the experiments that I'm going to show you.
And so the whole idea is that there are different people involved. Right? You have the hardware architect; then you have the expert parallel programmer who creates something, as I said, like a computational driver; and then you have the application developer who just writes Python. And then, in the end, the end user uses your application -- but that's not so important right now, because we're talking about research, so we might not have end users yet; that comes later, when you actually deploy these programs.
And then the problem is: that's all nice, but where do you start, right? You have so many problems -- how many of these drivers do you write, since we're talking about computation?
And so our idea for that is, first of all, let's stick to content analysis. That's just because the people involved in the project have some background in it. And the second thing is, let's see if patterns can help with this approach.
So basically the idea is to identify the most important patterns and then, based on these patterns, try to identify which of them will influence the most applications. And, yes, we will never, ever be able to solve the entire space of problems. But there's probably something like a long-tail distribution of the algorithms that are used and their frequency, and if you can have the most heavily used algorithms parallelized, that would already help quite a lot.
So before I go on, I wanted to say one thing about SEJITS, sort of what the idea is. There are people in this room who can explain this way better than me, but the overview is that you write your code in whatever language -- in this case Python -- and then there are certain template files that tell the framework how to compile it down to C, and then later on to really run a compiler to binary code that can then run on the platform, for example CUDA or multicore.
And this whole thing is called Selective Embedded Just-in-Time Specialization. But, again, there are other people in this room who can tell you more about this. We're just using it.
And so, back to the patterns. These are the patterns that we came up with for an initial take on the framework -- we actually came up with more patterns for a more sophisticated framework that would also solve more really multimedia problems; right now this is very audio specific.
And basically we said, well, there are three major groups of patterns. First of all, there are structural patterns -- I'm starting at the bottom here. Structural patterns mean you need to compose components somehow. Right? The typical component composition is a pipeline, right? So we have A doing some filtering, B doing some filtering, C doing some filtering; how do you compose those into a pipeline? So this would be pipe-and-filter. But then there's iterative, and then there's MapReduce, and there are going to be more of these. But these are, first of all, three that are really important.
And then we have computational patterns. So we know, for example, that we have these spectral methods, graph methods, and then inside these methods, inside these algorithms, we can go down and say, well, what are the concrete algorithms based on these patterns that we want to implement. And then we end up with the application patterns, which are quite specific but not that specific. Convolution, as everybody can imagine, is not that specific. It's pretty general.
So basically these are sort of the first patterns that we sort of imagine, but from those we actually
only have implemented the GMM pattern and the MapReduce pattern. And, in fact, the K means
is sort of a side effect.
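To see why k-means falls out of the GMM pattern almost for free, note that it is just EM with hard assignments and fixed identity covariances -- a toy 1-D sketch:

```python
import random

# k-means as degenerate GMM training: the E-step makes hard assignments to
# the nearest center, and the M-step recomputes each center as a mean.
def kmeans(points, k, iters=20):
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # E-step: hard assignment to nearest center
            j = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[j].append(p)
        centers = [sum(c) / len(c) if c else centers[i]  # M-step: means
                   for i, c in enumerate(clusters)]
    return centers

print(kmeans([1.0, 1.1, 0.9, 5.0, 5.2, 4.8], k=2))
```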
But now instead of saying, oh, we didn't have a lot of progress, I'm going to use this small thing
to my advantage. I'm going to say, well, let's take these two, how generalizable are they. Right?
Because a major problem of this framework will be are these patterns actually the suitable tool
for parallelization. That means, are they specialized enough that we can implement them, but general enough that they help lots of applications at the same time? Right? So there's a real tradeoff to strike.
And so what's interesting right now, we only, as I said, have two patterns. But then from these
two patterns we could already create three applications, and one was sort of a speaker diarization
application -- I'm going to talk about these in a minute -- video event detection application, and a
music retrieval application.
And as you can see, these are like three rather different applications just based upon two core
patterns.
So let me talk a little bit about what they do so you see why they're different. Speaker diarization -- we worked on this for historic reasons -- is basically who spoke when. The input is a speech track, and the output is clusters: this is speaker one, this is speaker two, this is speaker three, this is speaker one again.
A couple of years ago this used to run in roughly real time: a 10-minute speech would take about 10 minutes to process. Now it's 250 times faster than that, so 250 minutes of speech take about one minute of processing.
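For a flavor of what's inside that engine, here is a very rough sketch of the who-spoke-when loop -- agglomerative clustering over GMMs, in the spirit of the ICSI diarization system; `train_gmm` and `bic_score` are hypothetical helpers, not a real API:

```python
# Merge segment clusters greedily while the (hypothetical) BIC score says a
# merge improves the model; what remains is one cluster per speaker.
def diarize(segments, train_gmm, bic_score):
    clusters = [train_gmm([seg]) for seg in segments]  # one GMM per segment
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))]
        (i, j), gain = max(((p, bic_score(clusters[p[0]], clusters[p[1]]))
                            for p in pairs), key=lambda x: x[1])
        if gain <= 0:  # no merge helps anymore: stop
            break
        merged = train_gmm(clusters[i].segments + clusters[j].segments)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

The GMM training inside this loop is exactly the kind of bottleneck the GMM pattern's specialized back end would speed up.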
And, anyway, then you have video event detection. As I said, I'm part of this Aladdin project, which is TRECVID MED. So we have 150,000 videos and we have to categorize them into wedding ceremony, changing attire, and the weirdest category, which we haven't solved yet, is winning a race without a vehicle. So give me all the videos that show winning a race without a vehicle. So, yeah, we're working on that.
The point is we're part of a big team where we're using this on the audio, and this is of course really helpful because now we can run all these experiments on 150,000 videos.
And then the one that I want to go into further as an example in this talk is music recommendation. I don't know if you know, but recently Columbia -- and that's a collaboration with [inaudible] -- released a database of 1 million songs. 1 million songs is quite exciting because we estimate that there are about 16 million songs in total recorded since recording began.
So that means if you solve a problem in this 1-million-song space, there's a high likelihood you've basically already solved it completely, because 16 million is not so far away from 1 million. And, in fact, in terms of scalability, we definitely solved it. Just in terms of accuracy we still have to work on that.
But the point is the -- this database is really big data. I mean, it's a million songs.
And, anyway, what we created was a music recommendation system that works based on the content. Most of you probably know music recommendation is still based on tags, right? So I say rock, and then you get rock songs back. But the idea here was to do it by song content. That means the query is itself a list of songs. So, for example, it could be your iTunes library, but instead of the metadata you give it the MP3s, and then it finds similar songs based on the MP3s.
Again, that was just an example to show we could do this, rather than, you know, something maximally interesting for the user.
But the point here is can we do this. Right? So, first of all, we wanted -- we said we wanted to
be more productive than just implementing low-level C code. Well, so what we did is we
actually implemented all these three in C, and then we implemented them again using our
framework.
And what happens is that you see that the whole speaker diarization engine, who spoke when, now takes 50 lines of Python code. And the video event detection -- that's not entirely correct -- is basically the diarization engine plus a couple more lines of code. I should correct that because there's a little more; it's probably a hundred lines.
And then the music recommendation is about 500 lines of code. And I'm going to actually go
deeper into the music recommendation system because I'm going to show you the typical
diagrams that you see for such a system, which are quite complex, but, you know, we can
definitely do it in 500 lines of Python.
And now people might say, yeah, but the back end takes all the code. Right. Except these 500 lines of Python are runnable without the back end -- it just takes forever. But if we add the back end, then it's much faster. Right?
So lines of code -- yeah, right. Okay. So what we have there is this big database, and I'm going to go through this quickly because I'm assuming people either know it or it would take a little too long.
So basically the point is, we train on the large database of all the songs in the 1-million-song set, and once that's done we have a space of features and models for the 1 million songs. That's the offline processing we do.
And then in the online phase -- the demo works a little differently and it's a little bit confusing, so please pay attention -- what the thing normally does is, you give it an MP3, it computes features, compares them against the general universal background model, and gives you back all the MP3s it thinks are similar.
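As a bare-bones sketch of that online phase, under strong assumptions -- each song already summarized offline as a feature vector (say, adapted from the universal background model), so a query is just a nearest-neighbor search; all names here are hypothetical:

```python
import numpy as np

# Offline: stack per-song feature vectors into one normalized matrix.
def build_index(song_features):  # dict: song name -> feature vector
    names = list(song_features)
    m = np.stack([song_features[n] for n in names])
    return names, m / np.linalg.norm(m, axis=1, keepdims=True)

# Online: one dense matvec over all songs -- the part a GPU cluster batches.
def query(names, matrix, query_vec, top_k=10):
    sims = matrix @ (query_vec / np.linalg.norm(query_vec))
    return [names[i] for i in np.argsort(-sims)[:top_k]]

songs = {f"song_{i}": np.random.rand(64) for i in range(1000)}
names, matrix = build_index(songs)
print(query(names, matrix, np.random.rand(64)))
```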
Now, in the demo here I can't actually upload an MP3; it would take way too long. So what we did is create a little Web site called Pardora -- the similarity to known services is probably intentional. Anyway, what that Pardora site does is, you give it a keyword, that keyword retrieves some music from a different subset, and then that music is used to query the engine.
It's a roundabout way to do it. But, again, we just wanted to make our point.
And, in fact, let me show, first of all -- okay. It doesn't matter that I'm running on battery. Hah. I know. Okay. Okay. What? Okay. I need a power adapter, otherwise it won't connect.
>>: This one?
>> Gerald Friedland: Yeah. Thank you.
>>: [inaudible].
>> Gerald Friedland: That's a lot of intelligence for a little computer to not let me continue my presentation. Okay. Anyway, good. So this is sort of the demo. But before I go into the demo, we of course did some measurements: basically, based on the query and on the songs returned and the songs sifted through, how does the system scale? And there are actually a couple of interesting points.
So this is basically really based on 1 million songs. We're actually querying 1 million songs. So
in the last retreat I think when we presented this we were querying only 10K songs because we
had some issues. Now we are querying 1 million songs, the entire thing.
And then a couple of interesting points here about this database: you get some bumps, and these are basically I/O bumps. If you do the full 1-million-song thing it still works, but if you do less than that, you don't have to do the same kind of I/O caching as you do for the full thing.
But anyway, it's scaling almost linearly, and it actually takes less than a second to query 1 million songs. When we got to that point, we were really happy. Now we can do all kinds of algorithm development based off 1 million songs. I mean, that's really cool.
Of course this takes a cluster of 16 GPUs to do that. Sure. But in the front end it's only Python,
and I can now try all kinds of music retrieval algorithms based on a million songs.
And, again, if I solve this -- if I solve it on a million songs, I probably have solved it for most of
the space. Because, again, it's 1/16 of the space already.
So I'm just going to show you how this feels. I have a little video. The right side is the GUI where somebody enters a keyword to retrieve some songs, which are then used for retrieval of other songs. And that is not the interesting part, even though people will be attracted to the graphs.
I hope that on the left side you can see basically what happens there. It's basically the proof-of-concept part: you'll see it's Python that's being used, and the first time the query is done, the Python is compiled -- you'll see GCC messages. You'll also see how long the query takes, and when we do a second query later on, you'll see it's not compiled again because we already compiled it.
So the compile time is basically part of that first query. That's what we're doing.
So this is the sort of -- yeah. It's really Python. Right? And now it's basically compiling it. And then somebody's entering, I think, jazz or indie rock. Okay. And then we're querying the database. And then we're getting some results here. And then we ask Grooveshark to play them for us. Except you won't hear them right now because we don't have audio. Yeah. So that would be sort of the player.
So when we go back and you do a second query, again, that part is not the interesting one. We
go here again. It's not compiled again. It just does the query. And then that's it. So that's kind
of cool to do it on 1 million songs this way.
So the last two slides are the conclusion, which is basically that we now have a pattern-oriented framework for hardware-independent specialization, which we like. And we aim at productivity, efficiency, portability, scalability. And we showed in the proof of concept that we can create three diverse audio-based multimedia content analysis applications based off two patterns that we implemented.
And this will allow a better handle on big data for diverse research communities. I'm hoping, and actually believing, that this will enable new research because the experimentation time barrier is lowered. Right? So that's the point. Now I can do all kinds of stuff directly on a million songs rather than subsampling.
And, yeah, there's future work, though. So this all is sort of like -- looks like a product, but of
course it's not. We have to do more. We have to implement more patterns. We'd like to extend
this to visual and textual media, and later maybe also to other computational tasks.
And then there's one other thing that I talked about only briefly: this all came from the Par Lab, so it was all about parallelization. But the problem is, if you handle big data, there's
also an I/O bottleneck. And the I/O bottleneck can sometimes really be bad. So that needs to be
sort of tackled in the framework as well.
But the most important thing, I think, is for us to develop a community where the expert programmers for the parallel hardware work together with the researchers, the users of the framework, so that the users tell them what they need and the hardware specialists can make the stuff most efficient.
Yeah. And that's it.
>> Juan Vargas: So are there any questions? There's one question.
>> Gerald Friedland: Yeah.
>>: So how long is the preprocessing of the million songs?
>> Gerald Friedland: So for 10K it's about 14 seconds, and then it scales almost linearly, so it's a
couple of minutes only.
>>: How many songs have you tapped in the database at this point?
>> Gerald Friedland: 1 million.
>>: 1 million?
>> Gerald Friedland: Yeah. It's the entire database.
>>: So if I check for [inaudible] going to there.
>> Gerald Friedland: I don't know. So the selection of the 1 million is not done by me. But you
can check it.
>>: Like 282 gigabytes, so it's huge.
>> Gerald Friedland: Yeah.
>>: But I don't know if they completely automate it. There's some hand annotation in the
database, I think.
>> Gerald Friedland: There is some hand annotation. Yeah. That's the whole reason why they
have the database, because there's hand annotation. And then -- it was actually hand annotated,
now that I -- yeah.
>>: I thought they had some hand annotation because they had tags. Though, at any rate --
>> Gerald Friedland: I think it's all hand annotated. There's no automatic.
>>: Yeah, yeah.
>> Gerald Friedland: Yeah, yeah.
>>: Yeah, because I started looking at it. As a musician I find it fascinating.
>> Gerald Friedland: It is fascinating.
>>: And I want to go through and write graph clustering algorithms to try and see if I can track
influences across artists.
>> Gerald Friedland: Yeah. Now you can. Actually just, you know -- just you can.
>>: Find the roots of music.
>>: Yeah, yeah. I think it would be really cool.
>> Juan Vargas: Okay. So the next speaker is Leo, and he's going to be talking about parallel
components for an energy-efficient Web browser. And he [inaudible].
>> Leo Meyerovich: I'll probably bug Dave in ten minutes. Yeah. So I'm Leo. My advisor, Ras, is sitting back there in the black shirt. And a lot of the demos I'll be talking about were actually just done by two of our undergrads, one this weekend and the other about a month ago. So this is pretty cool.
And so actually for a lot of the people who have seen earlier parts of this work, we actually, like
I just said, we have a lot of new stuff, both in terms of architecture and kind of new ways of
parallel programming.
So I actually wanted to kick off with a demo. It's probably a risky idea because it's made -- I
actually challenged the undergrads this weekend to do a new visualization using our system.
And so kind of in the spirit of one of the workshops going on next door, where they're talking about elections. And so here we have a very, very simple visualization of -- actually, this is -- you're
not seeing anything, are you? See, everybody's flipping here. Well, I'll do this demo, then we'll
see what happens.
Okay. So here what we're seeing is just a population map of Russia, just who lives where. And then -- so in the recent election not everybody voted, so I'm just going to resize based on who voted in the different districts in Russia.
And if anybody's familiar with Russia, things are kind of fun there. So what's really kind of fun to look at is, you know, which districts had a normal voting percentage and which districts maybe had a hundred percent, which sounds a little bit suspicious.
And so now we just fade in. Like some of these, like the really dark ones, are the ones that are a
hundred percent. So probably it's not you voting for the president, it's the president voting for
you in these dark ones.
But this actually -- like I'm kind of cheating here. Like this isn't really Russia. This is like only
about -- I think this one's about a hundred districts. Maybe we can load in about a thousand
districts and we can kind of look at what's going on in that one.
But you see now this tween is going a lot slower. If I try to change the size, it's really bumpy, right?
In reality, in Russia we have about 100,000 districts. And so if you want to kind of understand
what's going on in 100,000 districts, we're not going to be running this in the browser today.
But using the same code that outputted this thing, this visualization, in the browser, we got this compiler running on a GPU last week, and so this is where we get into dangerous demo territory. We can do the same visualization using 100,000 nodes. Where is my mouse? Here we go. So now we're using the same kind of visualization, where I can do the tween, I can resize the nodes -- a hundred thousand nodes -- I can change the colors.
This is actually kind of interesting. See this dark solid one? That was one of those regions of really 100 percent voting. And the red here is actually all the same political party, United Russia. So be careful if you talk to those guys, I guess.
But this -- this is kind of fun. Because using the same kind of high-level scripting language we generated the same visualization widget: you can run it in the browser or you can do this GPU version.
And what we started to realize is, when we actually do visualizations of, you know, more like 100,000 things, a million things, maybe the old visualization paradigms don't actually work. And now we have this new domain of, well, how do we understand this stuff, how do we slice and dice it.
And if you can script this in a weekend, I think this can be kind of exciting. All right.
So now for the actual -- let's do some science. Okay. So, like I said, I want to talk about
architecture and patterns. We just saw in Soviet Russia president votes for you. And so, like I
said, this is big data. But we're also -- part of the reason we're doing this browser work is
actually we're interested in small devices.
So you've seen a lot about Google Glass. If you look in the right corner, there's actually a contact lens that people figured out how to put a display on. So this is another domain where, in the coming years, essentially whatever app you have to run, you have to run the browser. So we want to be targeting these things. So big data, small devices.
All of them, as predicted, hit the power wall; we're not getting single-threaded performance. And if you actually look at the devices today, the solution's pretty clear. For the Windows 8 phones and pretty much everybody else -- NVIDIA phones, everybody -- it's four cores, 128-bit-wide SIMD per core, and then you have a coprocessor; in the case of the Tegra 3 it's 12 GPU cores.
So this is what the browser needs to be running on. Today it's only using a little bit of that.
And so -- okay. So what do we need to speed up, to parallelize, if we want to optimize this stuff? Actually, I will run out of power if I don't turn off the GPU. Sorry.
So the question is, what do we need to parallelize? I'm showing you this cool benchmark that the IE 8 team did. You can slice and dice this different ways, but you would predict that, okay, JavaScript is probably something we'll have to target. And you're right. In this case it's about 20 percent. And as I showed in that early demo, we're working on some new DSLs that will take care of this part of it.
But a lot of our work is actually fixing the 80 percent case, which is these libraries for the rest of the browser -- rendering, layout, parsing, all these things. So a lot of our work is actually just straight-up algorithms for irregular code. And that's been the project over the past year: both sides of this.
So I just want to focus on the non-JavaScript part, just those libraries. Basically a Web page comes in and you parse it like a normal compiler. Once you get this document, you have some templating step, which is basically -- imagine you want to say all the headings on my page have double the font size; you want to attach those constraints to all the nodes in your document.
Then we lay it out, figure out what goes where, and then finally we send it to render to fill in those pixels.
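Squashed into a toy straight-line sketch (all stubs, just to fix the data flow -- real engines interleave and incrementalize these stages), that pipeline looks like:

```python
# parse -> attach styles (templating) -> layout -> render, as trivial stubs.
def parse(html):        return {"tag": "root", "text": html, "children": []}
def attach_styles(dom): return {**dom, "style": {"font_size": 16}}
def layout(styled):     return {**styled, "rect": (0, 0, 800, 600)}
def paint(box):         return f"painted {box['tag']} at {box['rect']}"

def render_page(html):
    return paint(layout(attach_styles(parse(html))))

print(render_page("<h1>hello</h1>"))
```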
And for a lot of these stages we have to come up with pretty new algorithms for these irregular computations. And for all of it we have to figure out what the new browser architecture is -- that's been a lot of our work.
And what I actually want to spend most of the rest of the talk on is how we deal with the code explosion problem. Because if we're aggressively optimizing this, the algorithms, and doing new architecture, we have to do this for a whole lot of code. The browser is one of the biggest code bases running on your client today. So I'll talk about that.
But before then I do want to make a couple of kind of essentially shout outs to our collaborators
in kind of recent efforts on the architecture and algorithm side.
So for the algorithms we've done a lot on things like finite state machines and computations over
trees. And actually it's starting to show up elsewhere. One of our collaborators, I don't know if
he's in the room, but Todd Mytkowicz has been looking at things like, well, what if you want to
do malware detection on the Internet, like your search engine finds fraud results, like finds
malicious Web pages.
A lot of the same algorithms, and actually even advanced versions of them, are showing up in that type of work. And then on the architecture side, the Qualcomm team took a look at our templating algorithms, and now I'm actually working with the Mozilla team on architecting their new Servo browser, where, again, we're looking at how to put in the templating algorithms and maybe even -- still unsure -- using our actual layout engine, the actual binaries we're generating.
So there's been a lot of cool stuff. But I want to talk about how we deal with the code explosion,
like how do we deal with all this code complexity of really aggressively optimizing this big code
base.
And so I want to talk about how we do this for a layout engine, which I've been mostly focusing
on. And the layout is the component that figures out what goes where, what's the sizes of things,
what are the positions, for each word where is it.
And people have tried to parallelize this type of computation before. And when it comes out in practice, it becomes this tradeoff between essentially performance and correctness: you pick one. And generally on the Web, if your Web page looks wrong, people don't like your browser. So it's a rough ride to actually achieve this.
So to give you kind of intuition for why this is, for a lot of the algorithms we found to get really
good parallel speedups, we have to actually -- we actually created new algorithms. I don't want
to get into them here, but hopefully some of these sound like terms you haven't heard before.
And so that kind of gives you the idea.
And so we're taking a different approach to writing this type of code. First we actually write a specification of how the layout language CSS is supposed to work. From there we run it through our synthesis tool, a special type of compiler I'll talk about, that will not only find you a sequential implementation but will find a parallel one.
And kind of fitting with the talks today, we're going to find how to decompose this layout engine into parallel tree traversals, and once we've used the synthesis tool to find this decomposition, it's more traditional compilation and specialization to go from the tree traversals to the specialized implementations -- that's more like the SEJITS style of work. But we need to get there first.
And so I want to talk about that, kind of this -- the rest of the talk is just about how we get there.
So, like I said, the layout engine figures out what goes where. And in this case we found that a lot of layout languages can be thought of this way, and we figured out how to parallelize CSS.
And the basic intuition is given a document, the layout engine is going to do a sequence of tree
traversals.
So, for example, the first tree traversal might be a bottom-up one where I compute the width and the height for each node -- I'll solve those constraints locally -- so maybe I'll have a black and a green thread and they'll run in parallel. Once they're ready, we move on to the next nodes; maybe the red thread completes on some node. And we keep going up.
Maybe we have top-down tree traversals for some of them. In this case every node is essentially
a logical spawn for this type of parallelism. And here we're computing the X and Y attribute for
each document node here. And maybe there's some final bottom-up traversal. And we have this
whole kind of essentially spaces of different types of traversal strategies.
And for whatever layout engine, like, for example, that demonstration I showed you in the
beginning, like you can generally decompose it into these types of traversals. And this is, again,
at the logical level, not the actual implementation.
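Here is a sequential sketch of those two logical passes on a toy box-layout tree -- my own illustration, not the synthesized engine: a bottom-up pass solves width/height locally, then a top-down pass places children. Each node's children are independent in both passes, which is where the logical parallelism comes from.

```python
class Box:
    def __init__(self, w=0, h=0, children=()):
        self.w, self.h, self.x, self.y = w, h, 0, 0
        self.children = list(children)

def size_bottom_up(node):        # pass 1: widths/heights flow from leaves up
    for c in node.children:
        size_bottom_up(c)
    if node.children:            # horizontal box: widths add, heights max
        node.w = sum(c.w for c in node.children)
        node.h = max(c.h for c in node.children)

def place_top_down(node, x=0, y=0):  # pass 2: x/y flow from the root down
    node.x, node.y = x, y
    cx = x
    for c in node.children:      # every child is a logical spawn point
        place_top_down(c, cx, y)
        cx += c.w

root = Box(children=[Box(3, 2), Box(4, 1), Box(2, 5)])
size_bottom_up(root)
place_top_down(root)
print((root.w, root.h), [(c.x, c.y) for c in root.children])
```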
So write -- so the question is how do we write these. And so, like I said, we split the problem.
First you just write kind of the sequential spec or actually the logical spec where you just say
given an input what should the output look like.
And we have a nice little language here. It's descended from something called attribute
grammars. It's a declarative formalism. And the cool thing is we actually added a language
extension on top of this which lets you talk about the decomposition into these tree traversals.
And this is interesting in two ways. First, if you look at the basic schedule, the stuff in green is
actually kind of like the stuff I showed you a couple slides back of those tree traversals. And the
way you read this is the first tree traversal is going to be this bottom-up tree traversal. And when
you -- you instantiate it in a way that will compute all of the width attributes for each node.
We're not actually saying how do that. That's not the functional spec. And then maybe we
actually want to compose tree traversals. In this case we think there's at least two tree traversals,
and so after we do the bottom-up, you know, semicolon, then do the next one.
So this is already cool because we split out this definition of the decomposition into these tree traversals. But the second part, which really gets into some new territory here, is that actually writing all of this out explicitly is painful. Oftentimes you don't actually know the valid decomposition, or having to specify each attribute in a full layout engine would just be painful.
So you can actually just write question marks, essentially, inside this bit of code, and it will be up to our synthesizer to figure out valid ways to correctly fill in these holes -- and maybe even optimize, or answer questions to us about them.
This has changed how we actually develop. We're building different tools essentially to exploit
this mechanism, so it kind of changes how we program, do parallel programming. And to
understand that, I'm going to actually talk a little bit about how that works, and then you'll kind
of understand the tools.
Basically we think about this as a search for how to fill this in. When you put a new traversal into our system, you tell it how to make a small step in the search, and from there we can stitch together the overall search.
So, for example, if you have a horizontal box and maybe it has some assignment to the X
attribute that's a function of some V attribute, we need -- and then you want to ask, well, if I have
a top-down tree traversal is it okay to sequence on a bottom-up tree traversal that will compute
that attribute, is that a valid way to continue, to compose on extra work.
And so when you add a traversal to our system, to kind of teach it how to use it, basically what the addition would do is look at: okay, if this is a bottom-up tree traversal and I want to see if it really could compute this X attribute on each node, well, either that V that it depends on is computed in a previous tree traversal, because then everything is already computed, or the attribute is computed in that same tree traversal. Imagine like a stencil wavefront where it's computing as you go along. And so therefore, if it's bottom-up, that means the attribute is available from earlier in that same traversal.
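A toy version of that local legality check -- a hypothetical encoding, since the real synthesizer works over attribute grammars rather than sets:

```python
# An assignment x := f(v) may go in a bottom-up traversal only if each v it
# depends on was computed in an earlier traversal, or is computed at the same
# node or below within this one (available on the way up).
def can_schedule_bottom_up(assignment, earlier_attrs, this_traversal_attrs):
    target, deps = assignment                    # e.g. ("x", {"v"})
    return all(v in earlier_attrs or v in this_traversal_attrs for v in deps)

print(can_schedule_bottom_up(("x", {"v"}), {"v"}, set()))   # True: v done before
print(can_schedule_bottom_up(("x", {"v"}), set(), {"v"}))   # True: v in same pass
print(can_schedule_bottom_up(("x", {"v"}), set(), set()))   # False: v unavailable
```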
And once we have all of these local steps, we can actually do full searches -- searches for full completions of the schedules. So maybe we first guess that we can decompose the system into a top-down traversal that only computes the width.
Clearly we don't compute everything, and actually maybe the width is entangled with the height, so we might need to compute more. So this guess is wrong, and then we'll try another alternative. We find out this alternative works, but it doesn't compute everything, so we'll have to sequence on another tree traversal to compute more attributes. Maybe again that works, but it still didn't compute everything, so we have to sequence on another one.
And you can imagine building -- you can find different sorts of programs this way. And so
basically now our computer's working for the developer. It's finding this decomposition into tree
traversals for them.
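A toy enumeration of that outer search -- hypothetical names and a made-up legality rule, just to fix the idea:

```python
from collections import deque

# Breadth-first search over traversal schedules: each step appends a traversal
# and the attributes it may legally compute; stop when a schedule covers all.
def find_schedules(all_attrs, kinds, legal, max_steps=3):
    found, queue = [], deque([((), frozenset())])
    while queue:
        schedule, done = queue.popleft()
        if done == all_attrs:
            found.append(schedule)
            continue
        if len(schedule) < max_steps:
            for kind in kinds:
                computable = legal(kind, done)   # stand-in for the local checks
                if computable - done:
                    queue.append((schedule + ((kind, computable),),
                                  done | computable))
    return found

# Made-up rule: sizes resolve bottom-up; positions need sizes and go top-down.
def legal(kind, done):
    if kind == "bottom_up":
        return frozenset({"width", "height"})
    if kind == "top_down" and {"width", "height"} <= done:
        return frozenset({"x", "y"})
    return frozenset()

print(find_schedules(frozenset({"width", "height", "x", "y"}),
                     ["bottom_up", "top_down"], legal))
```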
And this is really powerful. So this has completely changed how we develop programs. So, for
example, when you start out writing this widget, you actually don't know anything really about
how to decompose it, so you just put a giant question mark and ask, well, can you do anything.
And this is basically enumeration of the search space and basically you stop once you find all
parallel things.
Then later on you might be happy with that, you continue with development, and then you might
want to make it a bit better. Like maybe you're in a browser and you want to tweak your
browser. And so there you might say, well, I actually expected a different parallelization scheme
to work, and you might constrain the search space to only look in using those decompositions
and ask did that work, like did it work, did it not work, and then you can start understanding how
that works.
Or you might actually use something like an auto-tuner to say, well, maybe on different devices
different traversals are better. Maybe they'll compute less and therefore it will fit into smaller
caches per traversal.
Then finally once you really are happy you have to realize that we're dealing with lots of code
and like evolving standards, and so whenever the logical spec changes, we need to make sure
that we don't break -- or that we keep working with -- the parallel decomposition. Like maybe the
parallel decomposition assumes some sort of independence, and then once you change your
code, that gets violated. And now that we have that as part of the language and part of the
synthesizer, it will actually check to see if this schedule is still valid.
And so I'll actually show you a bit about this.
So okay. As an example, like I said, we've been using this to write the CSS spec. Here you see our favorite Wikipedia Web site. It's a little off, but it's pretty good for about a summer's worth of work. And here we see some tricky features like nested text and tables. And the cool thing is, to parallelize this, the CSS language, it came down to this sequence of tree traversals -- and actually a freshman at Berkeley was able to specify how to do it.
In the six lines -- or one, two, three, four, five lines, he specified what the parallel decomposition
is for CSS. And pretty much you understand this spec already. There's one extra thing I can talk
to you about later. It's pretty much a richer form of decomposition.
Of course, once you have the decomposition, you have to actually say, well, how do we map it down to hardware, how do we do the specialization. I don't want to go into the algorithms here, but I do want to say this is real. We have actually been writing new algorithms. You saw in the beginning we've been doing stuff on GPUs; there our target is to do about 1 million nodes in real time.
On multicore we ran on an Opteron. We see sub-linear scaling on eight cores, relative to sequential. And actually, because we're using code generators, we can do very aggressive sequential optimizations, so in this graph, where I combine the results, we actually have superlinear speedup.
And then we've also been playing with vectorization, so within a core. And so we found for
some tree traversals we can vectorize them. And here we show on some Web sites that, yes,
on -- with our new algorithm we actually do get 4X speedup. There's a reason it's super linear. I
can talk to you later about that.
But what's kind of cool is we actually started measuring power numbers. In this case we see about a 3.5X energy savings. So, you know, power and performance are similar in this particular case.
So this is a quick recap of what we just saw. The reason we're doing our work is because browsers need to change. They need to handle bigger jobs and also move onto smaller form factors. And we found that we want to do DSLs for that 20 percent case, but for the 80 percent case we really need to get these algorithms out of, you know, papers and into real big code. And getting there will be a big step.
And part of what we found to be very effective is decompositions into these traversals. And unlike kind of traditional, you know, parallel for loops or whatnot, we found synthesis to be kind of an enabling technique, on both the compiler side and the expression side.
So that is it, and I hope that inspired at least some thoughts with this talk.
>> Juan Vargas: Do we have any questions for Leo? Thank you. Thank you very much.
[applause].
>> Juan Vargas: The last speaker for the session is Ras, Ras Bodik. He's going to be talking
about program synthesis for systems biology. He comes from UC Berkeley. And after him we'll
have a break and we will come back to the room to continue with the next session at 3:30.
>> Ras Bodik: Okay. I think I can start. So what we are looking at is one of the smallest
parallel programs that nature built for us. You can think of it as a small distributed system. It's
a worm that contains about 900 cells. It's small enough to serve as a subject for various studies
in biology because it has translucent skin so you can see how the cells develop with your naked
eyes, or nearly naked eyes.
And before I tell you more about our collaborators and how this work came together, I want to
leave you with a little bit of a puzzle which asks what do dining philosophers have in common
with systems biology, especially developmental systems biology. So here you see on the right
how the worm develops and here on the left how the dining philosophers are trying to dine.
So this talk should be given by Saurabh, the postdoc in my group who essentially discovered a
connection between the synthesis of concurrent programs that we did as part of the UPCRC and
systems biology.
Jasmin is our biologist collaborator -- there is apparently no better place to find a biologist than Microsoft Research -- and she works together with Nir, one of the people who started the field of executable biology. And Ali and Evan are our students: Ali's a grad student; Evan is an undergrad heading to MIT this fall. And they did, of course, as you know, all the work.
So systems biology is the process of building models, mathematical or otherwise, of complex processes in biology. And what developmental biology is looking at is one of those questions -- which, I should emphasize, we are not trying to answer directly. We are not curing cancer. We are even far from understanding it. But in order to understand cancer you want to understand what happens with the genes, how they malfunction so that cell growth and differentiation break.
So to understand cancer you want to investigate cell differentiation, because broken differentiation may lead to cancer.
There are two ways a cell can differentiate. One is that a single cell develops into cells of different types. The other is that multiple identical cells differentiate by communicating with each other.
And so to understand differentiation you need to understand communication between cells.
And so the worm is the perfect animal because you can see the cells, how they divide, and it grows relatively quickly. And there is one particular part of the worm that is studied by biologists: there are about six or seven cells and another cell which is called the anchor cell. This one sends signals, stronger to the cells that are close by and weaker to those further away. The six cells are called vulval precursor cells. They are sort of stem cells that, depending on the communication between them and the signal from the anchor cell, divide in a particular way and thereby determine their fate.
So here is the state of these cells before they actually take their fate. And here is after one step of
division. So we can see now one of these cells, this one divided into two, and they have -- they
have taken a particular fate, fate No. 2. This one divided and took fate No. 2, and then
depending on what fate they have, they start growing differently and then eventually form the
desired organ after a few more divisions.
And what we want to understand is actually what program the cells run in this division cycle
from the point when they're identical, not differentiated, to the point where they actually
establish their fate.
So this is the -- you can ask the biological question of understanding how fate is determined in
different ways, but systems biology asks it essentially by trying to determine what is the program
that the cells run in order to so robustly determine their fate, robustly in the sense that in a
wild-type animal without any mutation, the fate is always 3, 2, 1, 2, 3 and the organ is formed
properly.
Okay. So our goal is to infer the program -- or, for those of us working in program synthesis, to synthesize it from experiments in the lab. So now, what is common with the dining philosophers? Well, if you look at the dining philosophers, you see that they, or a particular set of dining philosophers, somehow always end up eating. Perhaps they end up eating their food in a bounded amount of time, without deadlock. So clearly they are running some algorithm in their heads that allows them to communicate and accomplish the goal.
Now, you could ask them what the program is, but philosophers are not going to talk with you. So you need to do something else. And the methodology, which is of course a metaphor for -- yes?
>>: Can you actually explain the dining philosophers problem?
>> Ras Bodik: The dining philosophers problem is that you have -- every philosopher has a
meal in front of himself or herself and there are chopsticks, there is a chopstick between every
meal, and the philosopher can only eat if they grab both of the chopsticks. And now you can see
the possibility of the deadlock, for example, if they don't have any reasonable program running,
each of them holds one chopstick. By not releasing it, nobody else can grab two, and the dinner
party stops.
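As an aside, here is one classic deadlock-free "program in their heads" -- a standard textbook solution, not anything inferred from the worm: impose a global order on the chopsticks and always pick up the lower-numbered one first, so a cycle of waiting can never form.

```python
import threading

N = 5
chopsticks = [threading.Lock() for _ in range(N)]

def philosopher(i, meals=3):
    left, right = i, (i + 1) % N
    first, second = sorted((left, right))  # global order breaks the cycle
    for _ in range(meals):
        with chopsticks[first]:            # always grab lower-numbered first
            with chopsticks[second]:
                print(f"philosopher {i} eats")

threads = [threading.Thread(target=philosopher, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
```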
So the philosophers are not going to talk to you and tell you what program they are running to
actually successfully eat. So you need to do something like what biologists do with the cells. You
can mutate the philosopher somehow, run their program, observe the results, then repeat it for
various other mutations, and then from that infer what programs they're actually running.
And the mutations that come to mind, perhaps blindfold the philosopher so they cannot read
what the others are doing, you can tie their hands behind their back or something of that sort.
And then maybe you come up with a program like this, which will tell you that once the philosopher has both forks, they eat, and then they could perhaps eat again -- this model is described with a Petri net -- or they can release the forks so that somebody else can eat, then they wait for somebody else to release the forks, and then they can repeat the whole thing again.
And so this model of the philosopher of course is not obtained by observing a real philosopher,
but this is the process that we want to do. We want to observe the process and then infer the
program because then the program describes what the cell is doing.
What is important in the process is the language, the modeling language that we choose. Here
they use Petri net to describe what philosophers do. We have designed a different language.
So how do biologists actually do it? They mutate the cell, and then they let it develop and observe the changes, observable changes in the phenotype of the cell.
So what you see here is the wild-type sort of cell that has not been mutated; here is time, the development. And here is the one, [inaudible], that was mutated in a particular way. And what biologists observe is that something bad has happened, and from that they make inferences.
Okay. Now, these experiments are recorded in the following way. They say for a set of genes,
what are the mutations. Well, here is one particular experiment, number 3, where this gene is
wild type, this one is wild type, this one has been knocked out, and this one is wild type. And so
on and so on.
And the results tell you what fate these cells, six of them, have taken. So here fate 3, fate 3, fate
2. This one is the interesting part. It actually requires a new algorithm. Here we are saying the
cells, depending on how you run the experiment, could end up in fate 1 or 2.
So sort of the cell is nondeterministic, and it's telling you that under that particular mutation, in experiment 7, the system becomes nondeterministic, sort of multistable. Depending on chance, the cells can take different fates.
And here is the unique modeling challenge, that we want to synthesize a program that is
complete in the sense that it can reproduce all observed behaviors, right, so if there are various
conditions and there are different nondeterministic outcomes that biologists observe in the lab,
you need to synthesize a program that can replay them all, because if it doesn't, then presumably
it's not modeling everything that the cell is doing.
Okay. Now, from these general mutation experiments, biologists infer protein interactions. So they say, if you knock out some of these genes then there is negative regulation on the following proteins. How do they know there is regulation of proteins? Because they know what these proteins do. And so from observable differences they work back towards these proteins, and now they have sort of pairwise interactions.
So an interaction would say, sort of, lst-2 negatively regulates this map. Okay. Now,
these pairwise things are then put together into informal diagrams that look like this. So what we
have here are three of those VPC cells, okay, here is the anchor cell, here are signals between
them, strong, medium, and weaker, because this cell and that cell are further away.
And inside you have proteins, okay, and these positively influence each other. And you have
also these receptors that sort of act as ports on the cell. You can sort of abstractly model them as
proteins too.
Here it shows how these three cells take one fate, another fate, another fate by interacting in a
particular way.
So these are the diagrams that biologists put together from the many experiments. What you
often don't know is: are these diagrams complete, are these interactions actually explaining the
cells taking their fate, number one. Number two, we would really like to run this program and
observe how the cells communicate and what sort of exchange of information happens in order
for the fate to take a particular result. So we would really like to turn this static information into
an accurate dynamic model that can run in a simulation.
So here is where executable biology, Jasmin and Nir's work, comes in. Executable biology writes a model that is a program that you can run, model check, and verify that indeed it agrees with the experiments.
So this verification can tell you that, oh, there is something incorrect in the model: under a particular execution it disagrees with experiments. And model checking is needed to handle the combinatorial nature of communication, the fact that cells proceed at different rates.
And you can discover potentially new interactions between proteins and then go to the lab and
verify whether your prediction is correct and sort of work from in silico to in vivo understanding.
So, the semantics of these models, those that we work with: the concentrations are usually discretized, time is discretized as well, and that's usually enough to find the sort of causal behavior, the interesting behavior, even though we don't know concentrations precisely.
But in this setting usually perhaps the scientists don't really care what the actual concentration is.
The cells are concurrent automata that progress at slightly different rates, so there is asynchrony, but not arbitrary asynchrony -- they sort of keep a little flag between each other. This concept is called bounded asynchrony. And timing is modelled with state progression in these automata.
Okay. Here is a program, a model that our collaborators wrote in a language called Reactive
Modules, the language used for model checking. And I won't go into detail, but this is just a
small fragment. And it turned out that writing models in this language is laborious -- not because the language is bad, but because writing these models in general is laborious: they involve timing, asynchrony, [inaudible] discretization of concentrations.
So models are hard to understand and debug. RM is partly a problem because it allows you to write models that do not correspond to biological explanations -- you can build strange abstractions due to its clairvoyance. I won't go into details here.
So executable biology is great, but writing models is difficult, so here is where we come in. And
we try to synthesize these models rather than writing them by hand.
Okay. So our contribution is that we'll synthesize this model from some initial knowledge
provided by the biologists, and so you obtain the model faster. But we go beyond that. We try to tell
you whether there are other executable models that can explain the phenomena differently. And
then we'll tell you what experiments you need to run, if this is so, to disambiguate those models
and rule out those that are incorrect.
So here is a quick example of the language that we have designed. We don't use Petri nets; we use something else. It's a high-level language, because that leads to smaller programs, a smaller search space, and faster synthesis, and it resembles these biological diagrams, because biologists then hopefully can read it better.
And the language has sort of several levels of hierarchy. The overall model takes a cell mutation -- which you can think of as a configuration input -- and a schedule, and outputs the fate pattern. Inside the body you have several cells as automata, which advance asynchronously according to the schedule and communicate. Within each cell we have these proteins, or components in other words, and these are modelled essentially as lookup functions, a discretization of how concentrations change.
So each component has a concentration and gets a signal, and in the next timestep it goes up or down depending on its state and the other concentrations.
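A toy rendering of those semantics, under heavy assumptions -- discrete levels 0..2, update functions as plain Python lambdas, and a schedule that says which rounds the cell gets to step (all names hypothetical):

```python
# One cell: each component's update function maps current levels and external
# inputs to a delta; levels are clamped to the discrete range 0..2.
def step_cell(levels, update_fns, inputs):
    new = dict(levels)
    for protein, fn in update_fns.items():
        new[protein] = min(2, max(0, levels[protein] + fn(levels, inputs)))
    return new

def run(levels, update_fns, inputs, schedule):
    for active in schedule:      # bounded asynchrony: some rounds a cell stalls
        if active:
            levels = step_cell(levels, update_fns, inputs)
    return levels

# A made-up component: a receptor driven up by the anchor signal and pushed
# down by an inhibitor. These update functions are the part to be synthesized.
updates = {"receptor": lambda lv, inp: (1 if inp["anchor"] > lv["receptor"] else 0)
                                       - (1 if lv["inhibitor"] == 2 else 0)}
print(run({"receptor": 0, "inhibitor": 0}, updates,
          {"anchor": 2}, schedule=[True, False, True, True]))
```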
And it is these update functions that we synthesize. The rest is provided by the biologists, because it corresponds to the knowledge of the system they already have. But this is the part that is really hard to write by hand.
So, the results. We were able to synthesize the model that previously took our collaborator several months to write. We had obtained a nearly complete hand-written program for that model; it had some bugs, and after a few weeks of trying to fix them we failed, so we just said let's go straight to synthesis.
We believe this worked partly because of the language's high-level nature and partly because our synthesis algorithms do automate the hard part.
This is what came into the synthesizer. Essentially here are the six cells, okay, here is the anchor
cell. Each cell has the following skeleton. Essentially you have various proteins here and
interactions between them, and then we synthesize essentially the update functions that go in and that correspond to the timing.
This one in particular is pretty difficult because it has two positive interactions coming in and one negative, and it is not clear how they influence this important pathway.
And here is the input, actually a fraction of the input, coming from the experiments. It otherwise
had about 40 or 50 rows. And here is the example of two update functions that come out. So
there are three discrete levels, and it tells you how the level changes depending on what the input states are.
And you now start to see why it is so difficult to write these update functions by hand, to actually build a model that corresponds to the experiments observed in the lab.
The second result is we are able to help essentially with the following scenario. Imagine that the
biologist develops or synthesizes a model that verifies against all experiments. And he makes
conclusions from the model, publishes the results, and then another biologist performs more
mutation experiments. These mutation experiments are rarely complete because they're
expensive to make, and under some mutations the cells actually die.
And so as this new experiment appears, it could be that it invalidates the model that the biologist
had, it invalidates the conclusion, and you have your nightmare scenario. So here we'd like to
ascertain that the experiments that we have performed so far actually are sufficient to create a
nonambiguous sort of constraint system, and in particular we would like to confirm that there is
no alternative model that explains the phenomena in the cell.
And so we have concluded that for this particular modeling the experiments that people have
performed so far are sufficient, and there is no alternative model that could have a different outcome, a different fate.
Okay. Now, you could ask the other question which is now imagine you don't trust these
experiments and you would like to redo them to validate them. You would like to avoid doing
all 50 experiments and maybe do a smaller set. And so we can tell you the minimal sort of set of experiments you need to rerun to still have a nonambiguous system that leads to the same model.
It turned out that 10 percent of the experiments is enough, because they do constrain the result, and we
can tell you which 10 percent you need to perform to have confidence in your modeling.
Finally, I showed you there are no behaviorally distinct models -- at least there is no other model that would have a different fate on some future experiment, some future mutation. But it could be that there are models that differ in how they work inside. Perhaps in one of them there is an interaction between these two proteins [inaudible] and in another between another two proteins. That you cannot distinguish by looking at the fate and the development of the cell because, after all, they have the same fates. What you need to do is, so to speak, instrument the cell with fluorescent genes and observe what happens, how these genes express during the development.
Here is the time. Right. So this is sort of a more involved experiment. We would again like to
reduce the amount of work that you need to do by tagging only proteins that would differentiate
two models. And so here we can tell you where tagging would be necessary. And indeed we
found an alternative model that of course behaves the same way on the outside but inside it's a
different interaction between proteins that explains the cell. And this is something that apparently
surprised our biologist collaborators.
So, to conclude, what I didn't talk about is actually the work that perhaps those of us in formal methods find more interesting: executable biology actually pushes on the boundaries of what we do.
The synthesis of these models is different from the synthesis of concurrent programs, because we want maximal programs, in the sense that if the cell exhibits races, the model must replay them all, because otherwise it's not modeling what's going on in the cell.
And it turns out that leads to a new kind of synthesis problem that doesn't need a 2QBF but a 3QBF logical solver.
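Roughly -- and this is my reconstruction of the quantifier structure, not the speaker's exact formulation -- ordinary concurrent-program synthesis is an exists-forall (2QBF) question, while maximality adds an inner existential over schedules:

```latex
% 2QBF: some program is correct under every schedule and input.
\exists P.\ \forall \sigma, i.\ \mathit{run}(P, i, \sigma) \models \mathit{spec}

% 3QBF: additionally, every observed outcome b must be replayable by some
% schedule, giving three alternating quantifier blocks.
\exists P.\ \Big( \forall \sigma, i.\ \mathit{run}(P, i, \sigma) \models \mathit{spec} \Big)
 \;\wedge\; \Big( \forall b \in \mathit{Obs}.\ \exists \sigma.\ \mathit{run}(P, i_b, \sigma) = b \Big)
```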
The specs are incomplete because we don't have all the experiments. And so there is a lot of
unknown behavior. And that motivates analysis that goes beyond synthesis. We don't want to
just synthesize the model but actually analyze the space of the plausible models, plausible given
the observations. And so in that sense we are not really synthesizing but modeling the plausible
explanations.
And then you want to answer questions such as are more experiments necessary and so on and so
on.
And so I'll stop here and thank you for your attention.
[applause].
>> Juan Vargas: Do we have questions for Ras? Yes.
>>: So [inaudible] it seems that you're dealing with some complex situations, for instance you
want to be able to [inaudible] do you have a sense how often [inaudible] as opposed to when they
[inaudible] so how frequently these type of more complex analysis [inaudible]?
>> Ras Bodik: Well, I don't know how advanced other modeling is, but in this particular model
of the cell, we are dealing with multiple genes already. This is so slow. Okay. Just a second.
So here we are looking at four genes, okay, and another cell which could be formed or
essentially killed before the experiment.
So here we have four or more genes already. I know that our collaborators have another dataset
that is richer by maybe one or two more genes which has about 70 experiments. Perhaps not all
of them necessary, some redundant as it may turn out. But it looks like the trend goes really deeper and deeper into combinatorial mutations.
>>: We know for cancers there can be literally hundreds of [inaudible] genes involved in a
particular pathology, so these can grow huge. You're really looking at a simple little almost toy
problem, which is what makes C. elegans so fascinating.
>> Ras Bodik: Right. So I'm not an expert on how the complexity of the experiments go.
Perhaps in a particular experiment that biologists do they focus on sort of one interaction
between a gene and a protein or a set of genes and a pathway of proteins and they could say just
one edge in this model which I've shown here. But overall then it does lead to a combinatorial
space.
>> Juan Vargas: There's another question.
>>: So often the biological rules are more like rules of thumb. So I could imagine that you go through and run this combinatorial problem and you come up with either a really strange, like physically impossible, answer or no answer. So do you have any idea how you could tell for sure that the biologists have oversimplified the problem? Like you mentioned that often the signaling is discretized, which is probably not realistic, just to take one thing from this particular experiment.
>> Ras Bodik: Well, we don't have enough experience, but in this particular case we were able to synthesize an executable model whose discretization does explain the behavior.
>>: And the biologists were reasonably happy with it, or --
>> Ras Bodik: Well, I wouldn't speak for all biologists. You know, which modeling is appropriate for biological systems may take a long time to resolve. So biologists apparently prefer modeling with differential equations, and then there is the school of stochastic modeling.
You could think of this concurrent modeling, concurrent program modeling, as sort of an abstraction of stochastic modeling where you replace probabilities with nondeterministic decisions. Right? Instead of probabilities we are tuning the schedule, which tells you how cells alternate.
There is a simple argument which suggests that this modeling is an abstraction of what happens deep inside, if you believe that proteins behave according to their concentrations.
If that's not the case, then this modeling might not be a true abstraction of the system. It's a good
question. I don't think anybody really knows it.
>>: I guess as a follow-up, any thoughts on how to compare schools of thought? You could
imagine that the different groups of biologists want different models and multiple of them will
explain the results.
>> Ras Bodik: Well, a safe answer for me at this point would be to say that there are situations in which this sort of modeling -- causal, nonquantitative but qualitative -- is appropriate. Maybe when you're looking at longer-running processes where you want to see how protein concentrations develop based on starting conditions, stochastic modeling is better, whereas the kind that purely focuses on whether the communication happens this way or the other way might be better elsewhere.
Going beyond that would be speculation for me at this point, so I'd rather not say much more.
>> Juan Vargas: Well, thank you. Thank you, Ras.
>> Ras Bodik: Thank you.
>> Juan Vargas: Thank you very much.
[applause]