>> Juan Vargas: We are going to continue with the workshop. I'm going to start the sessions on applications. As you can see from the sequence of speakers, we have two speakers coming from Illinois and three speakers coming from UC Berkeley. So what we hope to accomplish is to have the two presentations from Illinois in about one hour and the three presentations from UC Berkeley in one hour and ten minutes or so. We can always cut the break short. So we will start with John Hart presenting AvaScholar, then Minh Do doing 3D reconstruction, followed by Gerald Friedman from Berkeley talking about PyCASP, followed by Leo talking about parallel components for an energy-efficient Web browser, and followed by Ras who's going to talk about program synthesis for systems biology. So we'll start with John. And John, of course, comes from the University of Illinois. >> John Hart: Thanks a lot. So I want to talk about AvaScholar and dig into some of the details of AvaScholar. AvaScholar is the applications layer of our work through UPCRC and I2PC, a way of looking forward at what are going to be the killer applications requiring lots of parallel computation in the future. And it's fairly obvious that those are going to be visual computing based. There are many applications, but given the kind of computing we want to do at present and the kind of computing we want to do in the future, it's obviously going to be some form of visual computing. And it's an exciting area right now, because work in graphics and computer vision and machine learning and so on is really starting to generate a lot of excitement: we can do things that just didn't work a few years ago, in part because computing has gotten faster thanks to Moore's law. These methods work only so well today, and we can make them work much better if we can scale them up even further. So that's the rationale for visual computing as an application driving our parallel computing work. And also, if we want to look at the kinds of sample programs we want to give to our teams developing parallel programming tools, we want to give them software that's representative of the kind of applications we'll expect in a few years. So we're not only giving them visual computing code, but we're giving them visual computing graduate student code, which is, I think, the worst kind of code there is. And it's motivated by our work developing this AvaScholar system. This AvaScholar system consists of two pieces. There's been a lot of excitement about online education and delivering courses and so on, but beyond the companies providing these, and beyond videos of lectures worked on tablets and recorded and played back, there's interaction with students. And so some things that are lacking, that we hope to fix in the next few years, are these two modules. One is this remote online instruction where you can hold up a visual aid, a three-dimensional visual aid. And, you know, we use the scholar name to mean academic education, but this would work equally well in meetings. As Josep [phonetic] mentioned, I've always envisioned a Toyota engineer in the U.S. holding up an accelerator pedal assembly trying to explain what's going on to an engineer in Japan. In that conversation, simple video may not be enough information to see the intricacies of what's going on in an assembly. You may want some deeper three-dimensional representation. And so some way of building that.
And then at the other end we have the student module, which is just basically a simple Web cam, or what will likely be a Kinect that's embedded in everybody's laptop computer and cell phone or whatever platform we're on in the next few years, some student receiving these lectures, and then some indication, some agglomerated indication of what those students are responding to. And so we have tools that do this now, that do soft biometrics, that can tell the expression of a student fairly reliably, and they can also give us the demographics, the age, the sex, and other attributes of the student population. And so as you're giving your lecture, you can find out that 40 percent of women from 35 to 45 aren't really interested in what you're talking about, and you can adjust your lecture accordingly. And this would be useful in an online context. This would be useful in an ordinary classroom presentation. And as Josep mentioned yesterday, it would also be useful in a political speech, when you're trying to give a popular speech and you want to know who's listening to you and who's not. So I want to give you an idea of what this looks like. Let's see if any of this works. Nope. Let me try it again. Hmm. Try one more time. There we go. So here's an example of the system running. On the left here you can see expressions, and here's some face tracking software that can track my face as I'm moving around. And if I hit 1 to start fitting -- there. Well, it's not a good fit. Let me try again. That's a better fit. Now I've got a grid on my face that's tracking my face. And it's working okay. It took two tries to actually fit the grid. I shaved this morning, so that helps, but I still have this. I'm just not going to shave this goatee for a demo. But as I'm talking, you're going to see a lot of different expressions detected by the software, mostly surprise that it's working and a little bit of fear. And I found that I can do a lot of this by just moving my eyebrows. So it works reasonably well. And this is just one part of our student interface to this system. If we can scale it up, it will work more reliably. We can use larger models trained with more detail. This one's just tracking basically the motion of these grids. We could have the grid follow the face more accurately, and we could detect a wider variety of faces. Some people have much more facial hair, and it has trouble following those faces. So that's the -- >>: [inaudible]. >> John Hart: Not yet. >>: [inaudible] motion at all. >> John Hart: I went to the Bozo school of acting. This is surprise. This is sad. So this is all built on a house of cards. In order to get these things running, we have all this other technology underneath. And we want to be able to scale this up. I want this to work reliably looking at an individual student, and I want it to work reliably looking at an entire classroom of students. And we can't do that right now. And this is one component, and it has to work along with the shrug detector, so we can tell if students aren't understanding what's being presented to them, and a few other demographic and biometric detectors as well. So we want to scale that up. And so we need to make these things run faster and more robustly. And, you know, looking at current trends, we're going to have to make what was otherwise serial code parallel. So it's a good example of trying to use the parallel programming tools we have to solve a problem. And it's built on top of all these tools.
In order to do the instructor part, we need to do surface reconstruction from multiple cameras. In order to do the student part, we need soft biometrics. And these require things like depth computations between multiple images in order to infer depth, and things like alignment, ICP alignment, where you take multiple 3D scans and align them as quickly as possible. And if you boil those all the way down, it basically rests on the shoulders of two fundamental technologies: something to do a really fast nearest neighbor search in parallel, and then some image processing and histogramming programs that basically give you feature detection, the ability to look at images and come up with some vector describing what's in the image. And so surface reconstruction is mostly just registering a couple of images so that they're aligned. And there's this really good algorithm for doing this that was developed at Disney Research Zurich that ran in about 20 minutes, this top-of-the-line algorithm, and we've been trying to speed it up. We've got it down to about 20 seconds, and we're trying to use the rest of the tools to get another two orders of magnitude in order to get this thing into real time, which I think would be quite an accomplishment. There are deformable alignment algorithms. We've got an algorithm that's going to be presented at SIGGRAPH -- it's not by us, but somebody else -- that we're implementing. It's based on a technique that came out of Stanford back in 2007 to take multiple scans of a hand moving, for example, and align them by segmenting them into individual components. And you can look at the running times for this: 13 minutes, 51 minutes, over an hour. That's the kind of thing we're trying to implement and scale up to the point where it can run in real time. So lots of big challenges there. Another way of thinking of this is that it's doing KinectFusion for moving objects. And Minh Do is going to talk about our progress on doing that and the AvaScholar instructor module in the talk after mine. So I want to focus on those low-level tools, in particular nearest neighbor problems. These nearest neighbor problems come up all the time. It's one of the fundamental algorithms we use in visual computing and in machine learning and in many other areas. 3D applications use it for surface reconstruction from scattered point data, because you need to know where the neighboring points are quickly and efficiently, and for aligning points. In the ICP algorithm, the C stands for closest point, so you always need to know where your point neighbors are. And it also happens in high-dimensional applications. Anytime you stitch images, you're finding features and comparing them with features in similar locations in other images. So you're looking for neighboring feature points in this high-dimensional feature space. Those features can be just the pixels lined up in one long vector, or they can be these other spaces of features. But these features can be vectors that are hundreds of thousands of elements long, and then you need to find nearest neighbors in those high-dimensional spaces. And this nearest neighbor problem dominates -- it's the critical code segment for a lot of visual computing applications. And it means dealing with spatial data, and spatial data is distributed in a very nonuniform fashion. If you think of where all the atoms are in this room, it's a very nonuniform distribution.
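To make concrete why nearest neighbor search is the critical inner loop here, the following is a minimal sketch of one rigid ICP iteration in Python/NumPy. It is an illustration, not the AvaScholar code: the brute-force matching line is exactly the piece that the parallel tree-based nearest neighbor work discussed next is meant to accelerate, and the closed-form alignment is the standard Kabsch/SVD solution.

```python
import numpy as np

def icp_iteration(src, dst):
    """One rigid ICP step: match each source point to its nearest
    destination point, then solve for the best rigid transform.
    src: (N, 3) points to align; dst: (M, 3) target points."""
    # Brute-force nearest neighbor, O(N*M) -- the hot spot that a
    # parallel k-d tree replaces with fast per-point queries.
    d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
    matches = dst[d2.argmin(axis=1)]

    # Closed-form rigid alignment (Kabsch): rotate/translate src
    # onto its matched points using an SVD of the covariance.
    mu_s, mu_m = src.mean(0), matches.mean(0)
    H = (src - mu_s).T @ (matches - mu_m)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:   # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_m - R @ mu_s
    return src @ R.T + t       # iterate until the alignment converges
```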
If you look at parallel speedups -- this came from Pradeep Dubey through Jim Held, you know, Intel's scalability curves -- the least scalable visual computing applications they have, their game rigid body and production cloth workloads, are both dominated by collisions. With cloth it's keeping the cloth from turning inside out, and with rigid bodies it's keeping these rigid bodies from intersecting each other. And so early on we did some work looking at parallel patterns for how k-d trees are constructed, and discovered that the state-of-the-art routines for building these tree-based spatial decompositions had very simple, trivial parallelism at the top -- one-processor, two-processor, four-processor style parallelism -- and then became parallel at the lower levels of the tree as they were built in sort of a breadth-first fashion. And as the number of processors increases, we're going to lose a lot of parallelism at the top, and the processing of the top layers will end up dominating the entire problem. So as computers become more and more parallel, the actual performance of this algorithm will decrease as a result. And so we came up with this algorithm called ParKD that basically streamed through all the data that you want to partition and used all of the parallelism available on both the bottom half and the top half of the tree, and, you know, set some speed records for constructing these trees back in 2010. And we had two different approaches. One was a nested parallel construction where you're just forking off tasks for each new level. And we had this in-place construction that basically didn't move any data around but had a lot of pointer indirection. And we found a bunch of interesting things about the parallelism of these processes: nested ended up being faster than in-place in our examples, but the in-place algorithm scaled better. And we've been able to improve those results quite a bit in the years since. So we have that for low-dimensional nearest neighbors. For high-dimensional nearest neighbors, things are different. When you get above about 15 dimensions, then what constitutes a nearest neighbor and what metric you use becomes a little murky. And there are approximate algorithms that work very fast, where you're taking random traversals down a spatial subdivision tree, subdividing on one of the five axes with the highest variance at each step. And we can compute that dimension of highest variance just by looking at a small subset of the points. These algorithms work pretty well, and the best one out there is called FLANN, by Muja and Lowe, up at UBC. But it's only a serial algorithm. There are parallel versions of it, but the parallel versions just distribute multiple queries over parallel processors. They don't speed up individual queries. And very often you only have a single query and need it answered fast. So we did some theoretical analysis, tried to find out what the maximum parallel performance of one of these trees could be, and we did some parallel implementations just using TBB with depth-first scheduling, got some reasonably good scalable results on single queries and construction of these trees, and, in fact, beat existing algorithms and beat a GPU algorithm for doing this.
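As a sketch of the randomized-tree idea just described -- split on a random axis among the highest-variance dimensions, estimated from a small sample of the points -- here is a toy recursive construction in Python. This is an illustration of the approach, not FLANN's actual code, and the leaf size, sample size, and top-five choice are the kinds of tuning parameters the real library exposes.

```python
import numpy as np

def build_tree(pts, idx=None, leaf_size=16, top_k=5, sample=128, rng=None):
    """Toy FLANN-style randomized k-d tree: at each node, estimate the
    per-dimension variance from a small sample of points, then split on
    a randomly chosen axis among the top_k highest-variance ones."""
    rng = rng or np.random.default_rng()
    if idx is None:
        idx = np.arange(len(pts))
    if len(idx) <= leaf_size:
        return ('leaf', idx)
    s = idx if len(idx) <= sample else rng.choice(idx, sample, replace=False)
    axis = rng.choice(np.argsort(pts[s].var(axis=0))[-top_k:])
    thresh = np.median(pts[idx, axis])
    left, right = idx[pts[idx, axis] <= thresh], idx[pts[idx, axis] > thresh]
    if len(left) == 0 or len(right) == 0:   # degenerate split: stop here
        return ('leaf', idx)
    return ('node', axis, thresh,
            build_tree(pts, left, leaf_size, top_k, sample, rng),
            build_tree(pts, right, leaf_size, top_k, sample, rng))
```

Several such trees are built with different random seeds, and an approximate query descends all of them, keeping the best candidates found so far; the parallel versions mentioned above either farm independent queries out to different cores or, as in this group's work, parallelize the construction and the descent of a single query.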
And one of the interesting things we were able to accomplish is comparing CPU performance to GPU performance fairly. We have a CPU that has four processors, but each of those four processors has eight-element vectors now, these AVX vectors. And so we made sure we took full advantage of those vector units. When you're programming a GPU, you may have what ends up being 16 or more processors, each with 32-element vectors, and you want to make sure you're comparing that properly with a CPU that has four processors with eight-wide vectors. And so on just a four-processor Ivy Bridge system we're getting 22X, 27X speedups over the pure scalar, pure serial code, because we're using the multiple cores plus the vector units appropriately. And we beat the pants off the GPU implementation of a variation of this. And that was estimated from kind of the top-of-the-line GPU implementation. So we've made some good headway in implementing these nearest neighbor searches. The other thing we're working on is ViVid, and ViVid is this vision video library that has all sorts of layers built on a low-level GPU implementation: a C++ layer and a Python layer. And we've been mostly focusing at the low end, the low layer of this construction. ViVid is basically our main feature detector, our way of processing images into the features that describe them, the features that get tracked when you're trying to keep a grid on a moving face, for example. And it consists of three components. One's a filter bank component, one is a block histogramming component, and the third one is this pairwise distance, which ends up being the kind of brute-force nearest neighbor algorithm you use for this particular kind of data. And you have blocks. You take your image, and you have a moving window that creates these small 16-pixel-by-16-pixel blocks, and each of these blocks can be separated into 16 elements, each of which is a 4-pixel-by-4-pixel cell, and then we compute these histograms over each one of those little cells. Those histograms are basically the results of convolving these particular filters -- something like a hundred filters -- over those little 4-pixel-by-4-pixel regions. Now, why would you do that? Well, that converts the block into this long vector, basically the histogram of filter responses, that you can then use to describe the content of that block such that you can compare it to blocks in other images. If the same kind of image content is in that block, it will give you the same feature vector. And the distances between those feature vectors are a much better measure. If you just compare the pixels in one 16-by-16-pixel block with another 16-by-16 block, you'll get a bad answer even if that block is just moved over one pixel. If everything's moved over one pixel, you'll get a bad answer. But if you use one of these feature descriptors, then you get a much better indication that those two images are displaying the same thing. So this is giving us a feature descriptor, kind of like SIFT or SURF or other feature detectors, and this one works pretty well. And this has been the target of our parallelization effort, to try to speed these things up and scale them up. And so we've implemented them.
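Here's a rough NumPy sketch of the descriptor pipeline just described: a 16-by-16 block is cut into 16 cells of 4-by-4 pixels, a filter bank of about a hundred filters is applied, and each cell's histogram records which filter won at each pixel. The random 3-by-3 filters stand in for the real trained bank, so this only illustrates the structure of the computation, not ViVid itself.

```python
import numpy as np
from scipy.signal import convolve2d

def block_descriptor(block, filters):
    """Histogram-of-filter-responses descriptor for one 16x16 block."""
    # Filter bank responses over the block: shape (n_filters, 16, 16).
    resp = np.stack([convolve2d(block, f, mode='same') for f in filters])
    best = resp.argmax(axis=0)            # winning filter index per pixel
    hists = []
    for cy in range(0, 16, 4):            # 16 cells of 4x4 pixels each
        for cx in range(0, 16, 4):
            cell = best[cy:cy + 4, cx:cx + 4]
            hists.append(np.bincount(cell.ravel(), minlength=len(filters)))
    return np.concatenate(hists)          # one long feature vector

rng = np.random.default_rng(0)
bank = rng.standard_normal((100, 3, 3))   # stand-in for the trained filters
vec = block_descriptor(rng.standard_normal((16, 16)), bank)
print(vec.shape)                          # (1600,) = 16 cells x 100 bins
```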
The convolution is implemented as a GPU process doing four blocks at the same time, in order to save the overhead of the apron space you need when doing a convolution. The histogramming basically looks through the filter responses. You're applying about a hundred filters to each of those little cells in order to see what the response is, so you're doing a hundred convolutions at a time and then looking at the answers, and what we want to histogram is the filters that give you the highest response. And so we have to do some histogramming. And there are all sorts of techniques we've been looking at to improve this histogramming: using an atomic scatter instead of a gather, and ignoring low responses -- not histogramming filters that don't respond very well, so that we avoid the chance of a collision -- so we can do faster, kind of sloppier histogramming that still works well. And then there's this pairwise distance matrix, which is basically computing the distance between two feature vectors -- every feature vector with every other feature vector -- in order to find matches in that particular case. And so we've gotten quite a few speedups. It's interesting to compare the CPU speedups with the CUDA speedups, doing a filter bank of 3-by-3 filters versus 5-by-5 or 7-by-7. The CPU does just fine with all of those, but we hit some serious register -- >>: Sir, in each case your speedups are relative to what? >> John Hart: Serial. >>: And it's the same serial for both of them? >> John Hart: Yes. >>: Thank you. >> John Hart: And, in fact, these 4Xs actually go up a little bit more than 4X, because there's been some performance improvement in addition to just making them parallel. And histogramming, you know, we don't get as good a CPU speedup, but the GPU speedup is significantly degraded. And likewise with pairwise distance, we're doing quite well. >>: [inaudible]. >> John Hart: Yeah. >>: [inaudible] is that the case here as well? >> John Hart: No. Not yet. We're trying to. >>: Ah. Okay. So you're -- >> John Hart: Yeah. These are just four cores without using the vectors. >>: Oh. Okay. >> John Hart: And these are, of course, using the vectors. >>: So now it makes a lot more sense. >> John Hart: Yeah. Yeah. Sorry. I should have mentioned that, differentiated that from the nearest neighbor numbers. So we're struggling. This is using TBB, but we really want to get an OpenCL implementation of this so we can target all the architectures. And we're in the process of doing that. We've got some preliminary results. So here's OpenCL on two GPUs, and on the same GPU comparing CUDA to OpenCL, OpenCL is falling a little short. But, you know, we just finished these results, and it may be as much an issue of not knowing the same tricks in OpenCL that we know in CUDA. But we're getting there. And that's just on the pairwise distance. We're in the process of getting the rest of the elements onto OpenCL. >>: Have you done experiments yet with running that OpenCL on an Intel CPU? >> John Hart: That's why we're doing it. We don't have those results yet, though. >>: [inaudible] feedback we'd very much appreciate. >> John Hart: Yes. And I do know that. Yeah. These results are I think two days old. >>: Okay.
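Going back to the two kernels above, here is a compact sketch of the thresholded histogramming trick and the brute-force pairwise distance matrix, in NumPy. The threshold value is illustrative, and squared Euclidean distance is assumed; the point is only to show how small these kernels are and why they vectorize so well.

```python
import numpy as np

def thresholded_histogram(best, strength, n_filters, tau=0.1):
    """Histogram the winning filter indices but skip weak responses --
    the 'sloppier but faster' trick that also reduces scatter collisions
    in the GPU version (tau is an illustrative threshold)."""
    keep = strength > tau
    return np.bincount(best[keep].ravel(), minlength=n_filters)

def pairwise_sq_dists(A, B):
    """All squared distances between rows of A (n, d) and B (m, d),
    using ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, so the bulk of the
    work is one big matrix multiply -- ideal for CPU vectors or a GPU."""
    return (A * A).sum(1)[:, None] + (B * B).sum(1)[None, :] - 2 * A @ B.T
```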
>> John Hart: And so also, you know, we've looked at power. And this is a collaboration -- I think Josep yesterday talked about taking some of that face grid matching code, the motion tracking code, and applying some of the tools from DPJ to it. So we've been using the AvaScholar application to provide code for some of these other tools. And David Padua and Maria Garzaran have been working on basically trying to schedule things like these filter banks across multiple processors. And in the process of doing that, we got some nice data on the CPU system power on an Ivy Bridge system for running those hundred filters on a sample image. So this is the power rate, you know, joules per second, obtained by basically being able to turn the clock rate of the CPU up or down. And you can see that if you use more power, you end up running it faster, and if you use less power, you end up running it slower. And the question is -- that's fine, but system power is joules per second, and so if you want to look at the total energy used, you've got to divide this by this in order to figure out the -- >>: [inaudible]. >> John Hart: No, you have to multiply that by that in order to figure out the total amount of energy used. And so if we do that, we can find out at what clock rate we want to run our CPU if we want to compute the feature detector using the least amount of energy possible. And the answer ends up being at the clock rate that runs it in about 114 milliseconds. I think that's the 3.2 gigahertz clock rate. So the 3.4 gigahertz clock rate computes it faster, in a little over a hundred milliseconds, but using a lot more energy. And so the sweet spot for this particular application gets revealed by that data. >>: Yes, and you got this power by just plugging a wall power meter in? >> John Hart: That's a good question. No, I don't remember what we used to measure that. It was much more complicated than a wall power meter, though, because we have those results too. But I don't remember the details of how that was done. I can find out, though. >>: I guess the question is there are so many different parts of the system, and the different parts are measured in different ways, so it may only be the wall [inaudible] multiplied by time. I don't know. >> John Hart: Right, right. Yeah. I'd have to check. This is something we provided the code for and I happened to see the results from, so it's not my project. But I can go back. I mean, we have some Ivy Bridge systems, and we instrumented one, and I think the instrumentation was more significant than just looking at the wall power coming out of it. And this is just kind of scratching the surface. You want to do this on your cell phone, right? If your battery is getting low and you still want to point your camera at something and have it automatically translated, or have it tell you what you're looking at, you want to find out at what clock rate to run the thing and still get the answer while using the least amount of battery. And the other thing is, this is just another example of using AvaScholar code to provide sample code for some of these other parallel processing and power projects. Okay. And that's it. Thanks. [applause]. >> Minh Do: All right.
So I'm going to present a continuation of what John just explained, the kind of overarching project that we are doing at the University of Illinois. And I'm going to zoom in on the bigger aspect of that, which is on the presentation side: how can we enhance the visualization aspect of AvaScholar. But you can also see it in the context of very important ubiquitous applications out there that are going to be deployed on a variety of devices. And here's a set of my colleagues and former and current Ph.D. students that have been working on this project. Matthew -- Matt is now actually working right here at Microsoft. So let's start with a very common application that all of us benefit from: current video communication systems, ranging from Skype to Google Talk, VUDU, and iChat. These kinds of systems, I think, are the reason why all of the laptops have a camera on them, and the cell phones have not one but two -- there's a rear-facing and a front-facing camera, which is really just for video chat. So certainly it is a very ubiquitous application, and it's also one that demands a lot of computation. But my thesis here is, well, that is still very simple, because what you see in those systems is basically: record the video, the visual scene, and simply transmit it and then play it back. Now, that obviously is very effective at letting people communicate visually, but the fact that many of us still travel to meet people face-to-face really indicates that users demand something more. The ultimate goal is: can we replace those actual face-to-face meetings? And that is a big demand now, and the research in our area now is how to provide the computational tools that enhance these visual communication systems and provide users a very immersive experience -- I need to elaborate more on that -- because if you happened to see the plenary talk by Rick Rashid, it's exactly about this, how can we provide this remote augmented reality. Now, there are systems out there that try to elevate these commodity visual communication systems, like TelePresence, but they all require very expensive hardware and room setups, whereas we all like the convenience of carrying our laptop or mobile phone around. So the vision is: can we, using computation but with commodity hardware and cameras, provide real-time visual communication systems able to let multiple parties interact remotely and visually feel that they are there in the same common space? That is the vision. And if you elaborate a little bit, you can think about this overall structure in which we have multiple devices and some processing: on the capturing side you get the information, you packetize it, and then you send it to multiple parties, and individual viewers can then select their viewpoint, their composition, depending on what they are interested in and on who they are talking to. So that's going to open up a much richer visual communication system than what we have seen so far with Skype and so on. So just to recap, you can see we have an existing structure, which I think is enormously successful -- think about Skype -- but if you look at it, you can see it's simple: record, capture, and then transmit.
You can see on the capturing side, cameras have more and more pixels now; they're already reaching the limit of what the human eye can see. On the display side, people already have very high resolution displays, also reaching the capacity of the human eye. In terms of bandwidth, we always worry about the bandwidth, but psychology tells us very clearly that the human eye can only absorb about 10 megabits per second, and more and more of the network, even Wi-Fi now, starts reaching that. So you can see that from the technology [inaudible] we are reaching that human limit of absorbing visual information. What's really missing is: can we synthesize, can we generate some of these more novel viewing experiences that are beyond what a single camera can capture? I think that opens a lot of opportunities. And because the application demands a real-time experience, very efficient computing on commodity hardware is a key element there. So just to set up a key example, here's a paper we recently presented that kind of articulates that there will be a future when we can take multiple parties, put them in a common environment, and they can really feel being there -- being there in the sense that, in the same environment, they know that the other party is also in that environment, that they are being looked at. But of course you can see it still has a lot of artifacts; the lighting is not right. So, again, it's showing there's a potential to enhance our users' Skype-style communication systems, but, yes, a lot more computation is needed there. So what are the research challenges, and what is computation going to provide us? Here are a number of really important tasks [inaudible], scratching the surface, and how parallel computing can help them. But, again, the key bottom line here is that we want to use low-cost commodity hardware -- the cameras as well as the CPU and GPU on our mobile platform or laptop. The experience has to be like being there: you want to sit down with another person and feel like you are both in the same environment. And the problems are going to be how to capture, how to synthesize, and how to transmit and encode that information. Okay. Now, our current visual communication is about capturing the color information, and what that is, is taking a 3D scene and projecting it into an image. And when you project a 3D scene into an image, what you lose is exactly the depth information. I have a pixel with some color now, but if I only have a single image, I don't know how far away the object was that the light bounced off. So that's why you can see the excitement around the Kinect in our community -- computer vision, video processing, and computer graphics -- in the sense that now we have a commodity, low-cost device that can capture, in addition to the color information, the depth information. And that allows us now, in real time, to really fully capture that 3D information and then be able to do something much more sophisticated than simply display a recorded video.
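A minimal sketch of the point being made: once you have a depth image, a pinhole camera model lets you undo the projection and recover the 3D points a color-only camera throws away. The intrinsics fx, fy, cx, cy below are illustrative Kinect-like values, not parameters from the talk.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Unproject a depth image (H, W), in meters, into an (H*W, 3)
    point cloud using the pinhole model: x = (u - cx) * z / fx, etc.
    This is exactly the 3D information lost at projection time."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Illustrative intrinsics for a 640x480 depth camera.
pts = depth_to_points(np.full((480, 640), 2.0), fx=525, fy=525, cx=320, cy=240)
```

Re-rendering those points through a different virtual camera is the basis of the novel-view synthesis and eye-gaze correction described next.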
But the challenge coming in is that with these devices -- as we already learned and know very well from digital cameras, right, there's a lot of processing inside the SLR or the handheld camera in order to turn a typically noisy captured signal into a high-quality image -- it's the same here with the depth camera: we often end up with very poor quality data. But we want to present a very high-quality visual experience to the user. So here's where, again, computation can come in. And let me illustrate with a simple application. In the scenario we think about, we have some commodity color and depth cameras, and a small setup that could potentially all be integrated inside our laptop, in the display bezel, and the ability to take those captured images and synthesize a novel viewpoint. I won't go into detail about why that would be important, but here's an algorithm that we proposed, and basically it is a very standard but clever way of fusing and utilizing the complementary color and depth information to enhance the visual quality. And the example here is in the real lab, and we were able to do it in real time, thanks to using a lot of GPU, to synthesize another viewpoint. Going back to the AvaScholar example John mentioned early on, you can think about the teacher talking to students. One reason students like to come to the real class is the feeling that the teacher is looking at them while they're learning or listening to the lecture, whereas remote participants normally feel completely left out, because they never meet the eye gaze of the professor; the professor tends to look elsewhere. But now, with this eye gaze correction, the remote participant can feel like they are being looked at, and they feel they are in the same environment. Now, this problem -- I'm sure if you use Skype, for example, you also experience it: the camera is on top, you look at the screen, so to the remote parties it feels like you are looking down. And, again, it breaks the communication, it breaks the trust. So, again, it needs to be real time, and this processing using color and depth, which we deploy on a parallel architecture, provides us real-time capability with significant speedup. Another example, as John briefly mentioned: when you have typical data from the Kinect, then due to many, many reasons -- the infrared interferes with a lot of other things when the Kinect tries to capture the depth image -- you end up with a lot of noisy data. One thing we know is that we understand the object that's moving -- it's the same human face, just moving around -- so we can have a very efficient way of accumulating data, dealing with something deformable, and based on that we can build a very high-quality depth map. Again, it has to be in real time, so that when you present and render that person to the other side, you have a very high-quality image. You can even do some stylization -- for example, render that person to look younger or lighter to increase viewer engagement. Many potential applications open up once you capture this data in real time.
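One simple form of the accumulation idea -- a sketch under the assumption of already-aligned frames, not the group's actual deformable algorithm -- is to run an exponential average over depth frames while skipping the zero pixels a Kinect reports where it has no depth:

```python
import numpy as np

def accumulate_depth(frames, alpha=0.2):
    """Temporal fusion sketch: exponentially average aligned depth
    frames; invalid (zero) pixels are skipped, so noise averages out
    and holes fill in as valid measurements arrive."""
    acc, seen = None, None
    for d in frames:
        valid = d > 0
        if acc is None:
            acc = np.where(valid, d.astype(float), 0.0)
            seen = valid.copy()
            continue
        first = valid & ~seen            # first observation fills a hole
        acc[first] = d[first]
        rest = valid & seen              # later ones are blended in
        acc[rest] = (1 - alpha) * acc[rest] + alpha * d[rest]
        seen |= valid
    return acc
```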
And of course, again, we deploy that, working with a lot of our colleagues here who are experienced in parallel computing, and get a lot of significant speedup. So let me -- I was deeply inspired by the talk by David yesterday about what the big ideas are. So if I can leave one key idea here, let me go through one key example. One thing we found through this experimentation is that when we have some application that really needs real time, we talk to our colleagues in parallel computing, learn all the tricks, and with their help finally get real-time algorithms. But in the reverse direction, as we slowly learn what makes an algorithm easily parallelizable, there's some feedback going the other way as well. So here's one example. One key application in image and video processing is how to enhance quality. This is a very basic building block -- it's inside all of the cameras, all of the laptops here -- and it's a building block for, for example, stereo matching, image stylization, and many others. The problem here is that you have some data and it is noisy -- I've exaggerated here -- and you try to enhance a particular pixel using some kind of local average. Of course, we all know simple linear filtering would help, but the key challenge is how you do that filtering so that you average pixels from the same object and not from outside, from other areas. In other words, you want some kind of edge-preserving filter. And that is where the bilateral filter, one of the very powerful tools, has become ubiquitous in visualization and image processing. The idea is that instead of just using the typical spatially invariant filter, where the weight depends on the relative position of the neighbor pixel to the current pixel, we also bring in the difference in value. And using that we can build an adaptive weight to use locally when we try to refine the estimate of the current pixel. And of course the bilateral filter, you can see, is very challenging to parallelize on a GPU, for example; there's been a lot of effort on that. The algorithm, when it was developed, was developed purely as a sequential one, and then it got [inaudible] and, okay, someone else can try to parallelize it and make it faster. But then, what's interesting -- and there's some work coming out of Microsoft Research Asia on this -- the observation is that now we understand which elements of an algorithm are very easy to parallelize and which are not, and since the goal is to enhance the visual quality of the image, we can approximate it in a different way. I won't get into the detail, but there is the so-called guided filter, which essentially breaks that sequential, spatially adaptive convolution into multiple stages. And each stage simply computes the sum of the pixels in some box, and that box can have an adaptive size. And to find the sum over a box, as we know, we can use the so-called integral image: I know the sum of the image up to this point, so any box sum I can get with just four operations. So extremely fast.
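For reference, here are both halves of that argument in a few lines of NumPy: the bilateral filter's per-pixel adaptive weight (a spatial term times a range term), and the integral-image trick that lets a box sum of any size cost four array lookups. The sigma values and radius are illustrative.

```python
import numpy as np

def bilateral_pixel(I, y, x, radius=3, sigma_s=2.0, sigma_r=0.1):
    """Bilateral filter at one pixel: the weight combines spatial
    distance and intensity difference, so averaging stays inside the
    object instead of blurring across the edge."""
    y0, y1 = max(y - radius, 0), min(y + radius + 1, I.shape[0])
    x0, x1 = max(x - radius, 0), min(x + radius + 1, I.shape[1])
    patch = I[y0:y1, x0:x1]
    yy, xx = np.mgrid[y0:y1, x0:x1]
    w = (np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_s ** 2)) *
         np.exp(-(patch - I[y, x]) ** 2 / (2 * sigma_r ** 2)))
    return (w * patch).sum() / w.sum()

def box_sums(I, r):
    """Sum of every (2r+1)-square box via an integral image: after two
    cumulative sums, each box costs exactly four lookups."""
    S = np.pad(I, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    H, W = I.shape
    y0 = np.clip(np.arange(H) - r, 0, H); y1 = np.clip(np.arange(H) + r + 1, 0, H)
    x0 = np.clip(np.arange(W) - r, 0, W); x1 = np.clip(np.arange(W) + r + 1, 0, W)
    return S[y1][:, x1] - S[y0][:, x1] - S[y1][:, x0] + S[y0][:, x0]
```

The range weight in the first function is the data dependence that makes the bilateral filter awkward to vectorize; the second function has none, which is the structural difference the guided filter exploits.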
And now this guided filter is embarrassingly parallelizable, and it provides almost a 200X speedup compared to the best GPU implementation of the bilateral filter. And, yes, it now brings a lot of these very hard problems -- optical flow, depth maps, stereo matching -- into real time. And, again, understanding the parallelism, we extended that even further in work that I recently presented at CVPR. We can extend the support to an arbitrary shape, again having this multistage filtering architecture which looks locally and then aggregates the results. And the key idea is that each of those stages relies on very simple computation: you look at a scan line and you just take the difference between two points, for example. And just to get the point across: for example, we want to enhance this particular pixel using the neighboring pixels, and the weights say which pixels should contribute to that particular pixel there, listed over here. With the bilateral filter, again, the key idea is that I only want pixels of the same object to influence my estimate here; I don't want pixels from other objects to have influence. And those are the masks, adaptive masks; they vary from one pixel to the next. So these are the examples, but you can see that by decomposing this filtering into multiple stages, each of them is very simple to parallelize. Again, it basically just requires a lot of summing and averaging. So I won't go into the detail of the math, but we can show and prove that we get the same effect, and of course it is very efficient -- for example, taking a very noisy depth map, the picture on the top right there, and down here it's improved using the guided filter or using our newly developed filter. So, again, the main idea here is, one, as we move forward with a lot of highly parallel computing power in hand, I think we can aim for something much more impressive than what current visual communication systems provide. We can provide things like changing the viewpoint, we can correct for the eye gaze, we can throw away the background and embed the objects -- multiple parties -- in the right perspective in a virtual environment. All of this relies on capturing the depth, this 3D information, in real time, and of course none of it would be possible in real time without the help of parallel computing. But one of the key things I would say here is that because this is visualization, there's no single right or unique answer; there are a lot of approximations. So, again, co-design of an algorithm, developed together with someone who knows the parallel architecture, allows us to come up with very effective, highly parallelizable algorithms that provide that high performance. Okay. Thanks. [applause]. >> Minh Do: Yes. >>: There was a demo about ten years ago between Berkeley and Illinois where there was a dancer on stage at Illinois and a dancer on stage at Berkeley, and they danced together by having 3D capture and sending the images back and forth, so they danced with one another's images. And the bottleneck that they discovered was [inaudible] to make all of that happen. And so has that changed? >> Minh Do: Yeah. Yes. So yeah. We [inaudible] work at Illinois, my colleagues [inaudible] the bandwidth at that time -- yeah.
So ten years ago, of course, you didn't have that high bandwidth. The other thing is they really didn't know how to process the data. What I'm thinking about is that our human eyes see information in 2D, and that is where people absorb 10 megabits per second at most. Now, that system captured the full 3D data [inaudible] and shipped it over. So what we think here is, by taking the data and doing some kind of analysis, we can condense that information and get closer to what needs to be represented on the other side. So to answer the question: yes, the bandwidth is getting better now, but also, with some more processing, we can reduce that information and still provide high-quality rendering in the end. >>: So the other reason I ask: it sounds like you want to scale up to many people all interacting, so you're going to need more bandwidth, because you have many different people you're trying to merge into one image that everybody shares. >> Minh Do: Great question. Yes. Yes. So the question is, what about when I have multiple parties? So there are two ways. One is, for a small number of parties, you can do a mesh, where everyone talks to everyone else. Or everything can go through the cloud: you send to a node, the composition happens there, and then it sends back the composite image. >>: I'm confused why you say the bilateral filter is difficult to parallelize. Is it that it's difficult to vectorize, or is it that it's difficult to parallelize? >> Minh Do: The reason is that the weight here on the range depends on the difference between the current pixel and many of the local ones -- so there are a lot of dependencies, and it's hard to cut, you know -- >>: So it's a vectorization problem. >> Minh Do: Yes, yes. >>: I see. >> Minh Do: And the support -- normally people try to get [inaudible] support so they get a good estimate. >>: So this -- because there's not good scatter/gather support on these [inaudible]. >> Minh Do: Yes, yes. Whereas we turn the problem around -- at least with the current one, if we use a guided filter on this cross-based support, it becomes extremely simple to parallelize. So this kind of formulation here means that you compute sequentially -- the algorithm was designed with sequential computing in mind -- whereas now that we understand parallelism, people are starting to develop algorithms that are much more suitable for deployment on multicore and many-core. >>: I think the hard part of parallelizing that is when you actually go to implement it: the threshold below -- you know, if you look at the thing in the red box, a lot of those pixels are black after you've applied the green thing, so you threshold those, and that thresholding is the if statement that interferes with vectorization. It's not in that equation -- >>: Okay, thank you. That helps. Because looking at the equations, it's like that's a standard geometric decomposition problem to be doing -- >> Minh Do: Yeah, but you don't compute that [inaudible] threshold way, many other -- >>: Sure. Sure. I can see that. >>: Or you have to do [inaudible] get back to zero. >>: Or you use a k-d tree. >>: Yes. >>: Yeah. >>: Okay. >> Juan Vargas: Okay. So many, many questions. Thank you very much. [applause].
>> Juan Vargas: Now we have Gerald Friedman talking about PyCASP, scalable multimedia content analysis on parallel platforms using Python. He comes from UC Berkeley. >> Gerald Friedland: So yeah. Some of you have been at the Par Lab retreat, and you have probably seen different, more specialized versions of this. This is basically an overview talk, the not-so-specialized version. So first of all, a little correction: I'm Gerald Friedland, with the land, not the man. And I work mostly at ICSI, the International Computer Science Institute. That's a nonprofit, private institute affiliated with UC Berkeley. But this is collaborative work with many people. There's definitely Katya and Kurt in there, who are at UC Berkeley, but then also lots of other UC Berkeley students, and a little bit of work has been done by Dan Ellis, who's actually at Columbia University. So let me just dig into this. Basically the motivation for me -- I'm not a core parallel guy, I'm a multimedia guy -- the motivation for me is that if I look at the Internet, I'm actually happy, because right now the multimedia field is growing drastically. Like every two years we have the same amount of data uploaded as had been uploaded in all the [inaudible] years before, pretty much a doubling of data on the Internet. This is some random graphic of how this might look, from Raphael Troncy at MozCamp. It's old and [inaudible], but it's basically just saying we have more and more data every minute. We can also ask other people. For example, YouTube claims that 65,000 videos are uploaded per day. And that's just YouTube. Or that there are 48 hours of video per minute -- that means every minute I speak, another 48 hours of material land on the site. And this is just YouTube, right? There are so many other social networking sites. Another one is Flickr, and we know that they have about a million image uploads per day -- everybody knows that they are officially going downhill, but still, there's a million images per day. And then many people, when they think of social networks, think of Twitter, but what they often forget is that about every hundredth message has an image or a video attached, which, at a hundred million messages per day, gives you about a million images or videos per day as well. So all this multimedia data is currently going in massive amounts onto the Web. So, so what? Right? Why do we care? The usual answer could be, well, let Twitter and YouTube and so on deal with that. But, in fact, it's more than that, because what's really interesting for us as researchers in so, so many fields is that this consumer-produced multimedia content allows empirical studies at never-before-seen scale. Right? So basically these videos -- and people usually look at me angrily when I say this and correct me after I've said it -- these videos are a mirror onto the world, right? And the reason people look at me angrily is that they are a highly biased mirror onto the world, right? And there are various biases. Actually, just studying these biases is already interesting. But in any case, if you look at the literature right now, sociology, medicine, economics, the environmental sciences actually have already studied these videos. And we as computer scientists are more at the level of: how do you even find a video, right? So we start at the very bottom, but in reality we can do a service to all these fields.
And of course there's a buzzword for it now. It's called big data. We've done large scale for quite some while, and then somebody came up with a buzzword; now I can put myself into that bin. Okay. In any case, the problem here is a practical one: how can students and researchers, as individuals, effectively work on big data? The problem is, whenever I say something like, cool, we have all this data, people say, well, yeah, but in order to access it you need to go to Google, or to Microsoft for that matter, right? And the other problem is, if you have the data, then many people, because they can't work on a million videos -- and that would even be small data compared to big data -- work on ten images, try their algorithm, and say, oh, it works for ten images, so it should probably work for ten million as well. And that may be true or may not be true. So in any case, what researchers want to do is keep doing what they're doing. They want to play around with different statistical modeling techniques for different problems -- like, I have this method: can I train Gaussian mixture models, can I train neural networks, can I train whatever I want -- and they basically want to prototype their effort. But the bottleneck is the processing time. That's why, as I said before, they'll probably choose a subsample that's so small that it might not scale. And you want to prototype in a productivity language, right? If you use EC2 or something right now, you're probably prototyping in C++, and that's not really prototyping, that's already hard-core coding. So you want to prototype with something like MATLAB or Python, mostly. And the other thing is you also want to leverage and integrate existing tools. You have used your face recognizer tool forever; it works and you build on it. You don't want to suddenly ask, how do I port this to this weird parallel hardware that I don't know how to use? So these are the requirements. So what are the issues? The issues are that computation time is the main bottleneck. If you develop anything with big data, just the feature extraction takes forever. I'm currently in the Aladdin project, which is 150,000 videos, and for one deliverable, just the features take a week. So that's the problem: computation is the main bottleneck. And that makes for a slow experiment turnaround, which makes people more conservative about experiments, which is what we don't want. And then the deployment part is the other one. So let's say I have written something, some experiment that sort of works; then how do I scale it up to the ten million videos and so on? And the problem there is that the way we do it right now is we scale it up by asking an expert programmer to write -- you know, a computational driver would be sort of the simpler phrasing -- to basically port it to a platform that will make it massively parallel, for example CUDA or multicore. The problem is, once you've ported it to one of them -- and we know that multicore and GPU parallelism, for example, are different -- then it's stuck to that platform. And then, in a couple of years, there's another platform out and you want to do it completely differently, and then your other platform might not even exist anymore -- I mean, we even had this problem in Par Lab -- and your code does not run anymore. So that's not how you want to work.
So basically we have this approach. We call it PyCASP right now. And we try to solve all these problems at once -- let's see how far we got. The idea is you write all your code in Python, right? So as an application programmer, you don't deal with the super specialized stuff. And then we have something that's called SEJITS, which basically Just-in-Time compiles this Python down to whatever platform. And with that we try to provide a productive environment. That means we want to write Python, but when we use Python we want efficiency in the background so that you can run your stuff fast. And that efficiency, since it's hidden, can also be portable, because we can call whatever we have depending on the hardware. And then we hope that because it's portable it's also scalable -- scalable in the sense that you can run it on one CPU, eight CPUs, on a GPU, or on a cluster of GPUs. And in fact these are the experiments that I'm going to show you. And the whole idea is that there are different people involved, right? You have the hardware architect, then you have the expert parallel programmer who creates something like, as I said, a computational driver, and then you have the application developer who just writes Python. And in the end the end user uses your application -- but that's not so important right now, because we're talking about research, so we might not have end users; it will matter later, when you actually deploy these programs at that level. And then the problem is: that's all nice, but where do you start, right? There are so many problems -- how many of these drivers do you write, since we're talking about computation? Our idea for that is, first of all, let's stick to content analysis. That's just because the people involved in the project have some background in this. And the second thing is, let's see if patterns can help with this approach. So the idea is to identify the most important patterns and then, based on these patterns, try to identify which of them will influence the most applications. And, yes, we will never ever be able to solve the entire space of problems. But there's probably something like a long-tail distribution over the algorithms that are used and their frequency, and if you can have the most-used algorithms at the top parallelized, that would already help quite a lot. So before I go over this, I wanted to say one thing about SEJITS -- this is the idea. There are people in this room who can explain this way better than me, but the overview is that you write your code in whatever language, in this case Python, and then there are certain template files that tell the framework how to compile it down to C, and later on to actually run a compiler to produce binary code that can run on the platform, for example CUDA or multicore. And this whole thing is called Selective Embedded Just-in-Time Specialization. But, again, there are other people in this room who can tell you more about this. We're just using it. And so back to the patterns. These are the patterns that we came up with for an initial take on the framework -- we actually came up with more patterns for a more sophisticated framework that would also solve more really multimedia problems. Right now this is very audio specific.
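To make the division of labor concrete, here is a hypothetical sketch, in plain Python, of what the SEJITS workflow feels like from the application side. None of these names are the real Asp/PyCASP API; the sketch just shows the mechanism: the first call pays a compile cost for the detected platform, the result is cached, and later calls reuse it -- which is also why the demo at the end shows GCC output only on the first query.

```python
# Hypothetical sketch of the SEJITS mechanism -- not the real API.
_cache = {}

def specialize(func):
    """Decorator standing in for a SEJITS specializer: compile once
    per (function, platform) pair, then reuse the compiled version."""
    def wrapper(*args):
        key = (func.__name__, detect_platform())
        if key not in _cache:
            _cache[key] = compile_for(func, key[1])   # slow, happens once
        return _cache[key](*args)
    return wrapper

def detect_platform():
    return 'cuda'   # in reality: probe for GPUs, core count, etc.

def compile_for(func, platform):
    # The real framework walks the Python AST, fills in C/CUDA templates,
    # invokes the host compiler, and loads the binary. Here we just
    # return the pure-Python fallback so the sketch stays runnable.
    return func

@specialize
def train_gmm(features):
    pass   # the application programmer only ever writes this level
```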
Basically we said, well, there are sort of three major groups of patterns. First of all, there are structural patterns, and I'm starting at the bottom here. Structural patterns mean you need to compose components somehow, right? The typical component composition is a pipeline: we have A doing some filtering, B doing some filtering, C doing some filtering -- how do you compose that pipeline? So this would be pipe-and-filter. But then there's iterative, and then there's MapReduce, and there are going to be more of these. But these are, first of all, three that are really important. And then we have computational patterns. So we know, for example, that we have these spectral methods and graph methods, and inside these methods, inside these algorithms, we can then go down and say, well, what are the concrete algorithms based on these patterns that we want to implement? And then we end up with the application patterns, which are quite specific but not that specific. Convolution, as everybody can imagine, is not that specific; it's pretty general. So basically these are the first patterns we imagined, but of those we have actually only implemented the GMM pattern and the MapReduce pattern. And, in fact, the K-means is sort of a side effect. But now, instead of saying, oh, we didn't make a lot of progress, I'm going to use this small number to my advantage. I'm going to say, well, let's take these two: how generalizable are they? Because a major question for this framework will be whether these patterns are actually the suitable tool for parallelization. That means: are they specialized enough that we can implement them, but generalized enough that they help lots of applications at the same time? That's the tradeoff. And what's interesting right now is that we only have two patterns, as I said, but from these two patterns we could already create three applications: a speaker diarization application -- I'm going to talk about these in a minute -- a video event detection application, and a music retrieval application. And as you can see, these are three rather different applications, just based on two core patterns. So let me talk a little bit about what they do so you see why they're different. Speaker diarization -- we worked on this for historic reasons -- is basically who spoke when. You give the input as a speech track, and then the output is clusters: this is speaker one, this is speaker two, this is speaker three, this is speaker one again. A couple of years ago this used to be on the order of real time; that means for a 10-minute speech it would take about 10 minutes. Now it's 250 times faster than that: 250 minutes of speech take one minute of processing. And then you have video event detection, which is -- as I said, I'm part of this Aladdin project, which is TRECVID MED. So we have 150,000 videos and we have to categorize them into wedding ceremony, changing a tire, and the weirdest category, which we haven't solved yet, is winning a race without a vehicle. So: give me all the videos that show winning a race without a vehicle. Yeah, we're working on that.
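As a toy illustration of why a fast GMM pattern covers diarization -- and explicitly not the ICSI algorithm, which agglomeratively merges and retrains clusters -- one can fit a mixture over per-frame audio features and read speaker turns off the component assignments:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def toy_diarization(features, n_speakers, win=100):
    """Toy 'who spoke when': fit one GMM over frame features (e.g.
    MFCCs), assign every frame to a component, then majority-vote
    over fixed windows to get a speaker label per window."""
    gmm = GaussianMixture(n_components=n_speakers).fit(features)
    frame_labels = gmm.predict(features)
    turns = []
    for start in range(0, len(frame_labels), win):
        chunk = frame_labels[start:start + win]
        turns.append(int(np.bincount(chunk).argmax()))   # majority vote
    return turns   # e.g. [0, 0, 1, 0, 2, ...]
```

Nearly all of the time in a loop like this goes into GMM training and scoring, which is why one parallel GMM pattern buys speedups across such different-looking applications.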
The point is we're part of a big team, and we're using this on the audio, and of course it's really helpful that we can now run all these experiments on 150,000 videos. And then the one I want to go into further as an example in this talk is music recommendation. I don't know if you know, but recently Columbia -- in collaboration with [inaudible] -- released a database of 1 million songs. A million songs is quite exciting, because we estimate that about 16 million songs in total have been recorded since recording existed. So if you solve a problem on this 1-million-song space, there's a high likelihood you've basically solved it completely, because 16 million is not so far away from 1 million. And in terms of scalability we definitely solved it; in terms of accuracy we still have to work on it. But the point is that this database is really big data -- it's a million songs. Anyway, what we created was a music recommendation system that works based on the content. Most of you probably know that music recommendation is still mostly based on tags: I say "rock," and you get rock songs back. The idea here was to do it by song content. That means I give you a list of songs in terms of songs -- for example, it could be your iTunes library, but instead of the metadata you give it the MP3s, and it finds similar songs based on the MP3s. Again, this was just an example of what we could do, rather than something maximally interesting for the user. The point is: can we do this? Right? So, first of all, we said we wanted to be more productive than just implementing low-level C code. What we did is we actually implemented all three of these in C, and then we implemented them again using our framework. And what happens is that the whole speaker diarization engine -- who spoke when -- now takes 50 lines of Python code. The video event detection is basically the diarization engine plus a couple more lines of code -- I should correct that, because there's a little more; it's probably a hundred lines. And the music recommendation is about 500 lines of code. I'm going to go deeper into the music recommendation system, because I'm going to show you the typical diagrams you see for such a system, which are quite complex -- but we can definitely do it in 500 lines of Python. Now people might say, yeah, but all the code is in the back end. Right. Except those 500 lines of Python are runnable without the back end -- it just takes forever. With the back end, it's much faster. Right? So, lines of code -- yeah, right. Okay. So what we do is we have this big database -- I'm going to go through this quickly because I'm assuming people either know it or it would take a bit too long. Basically, we train on the large database of all the songs in the million-song set, and once we have that, we have a space of features and models for the 1 million songs. That's some offline processing we do. And then in the online phase, we have a little demo that works a little differently and is a little bit confusing, so please pay attention.
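As a hedged sketch of what content-based retrieval looks like at its core: each song is reduced to a feature vector (in the real system, features derived from the audio and a universal background model), and a query is matched by similarity against the whole database. This is plain numpy with random stand-in data, so it runs anywhere -- just nowhere near the 16-GPU speed mentioned later:

    import numpy as np

    rng = np.random.default_rng(0)
    db = rng.standard_normal((100_000, 64)).astype(np.float32)  # stand-in "songs"
    db /= np.linalg.norm(db, axis=1, keepdims=True)             # unit-normalize

    def recommend(query_vec, k=10):
        """Return indices of the k songs most similar to the query features."""
        q = query_vec / np.linalg.norm(query_vec)
        sims = db @ q                       # cosine similarity against all songs
        return np.argsort(sims)[-k:][::-1]  # top-k, most similar first

    print(recommend(rng.standard_normal(64).astype(np.float32)))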
What the system does, as I said, is: you give it an MP3, it computes features, compares them against the general universal background model, and gives you back all the MP3s it thinks are similar. Now, because in the demo here I can't actually upload an MP3 -- it would take way too long -- what we did is we created a little Web site called Pardora; the similarities to known services are probably intentional. In any case, what Pardora does is: you give it a keyword, that keyword retrieves some music from a different subset, and then that music is used as the query to the engine. It's a roundabout way to do it. But, again, we just wanted to make our point. And, in fact, let me show, first of all -- okay. So, I'm running on battery -- hah, I know. Okay. Okay. What? Okay. I need a power adapter, otherwise it won't connect. >>: This one? >> Gerald Friedland: Yeah. Thank you. >>: [inaudible]. >> Gerald Friedland: That's a lot of intelligence for a little computer, to not let me continue my presentation. Okay. In any case, good. So before I go into the demo, we of course did some measurements: based on the queries, the songs returned, and the songs sifted through, how does the system scale? There are actually a couple of interesting points. This is really based on 1 million songs -- we're actually querying 1 million songs. At the last retreat, I think, when we presented this, we were querying only 10K songs because we had some issues. Now we are querying 1 million songs, the entire thing. A couple of interesting points here: you get some bumps, and those are basically I/O bumps. If you do the full 1-million-song run, you have to do a kind of caching in I/O that you don't need for smaller runs. But in any case, it scales almost linearly, and it actually takes less than a second to query 1 million songs. When we got to that point, we were really happy: now we can do all kinds of algorithm development on 1 million songs. I mean, that's really cool. Of course it takes a cluster of 16 GPUs to do that. Sure. But the front end is only Python, and I can now try all kinds of music retrieval algorithms on a million songs. And, again, if I solve it on a million songs, I've probably solved it for most of the space, because it's already 1/16 of the space. So I'm just going to show you how this feels. I have a little video. The right side is the GUI, where somebody enters a keyword to retrieve some songs, which are then used to retrieve other songs. That's not the interesting part, even though people will be attracted to the graphs. On the left side, I hope you can see what actually happens. It's basically the proof-of-concept part: you'll see it's Python being used, and the first time the query runs, the Python is compiled -- you'll see GCC messages. You'll also see how long the query takes. When we do a second query later, you'll see it's not compiled again, because it's already compiled; the compile time is part of that first query. So that's what we're doing.
So this is it -- yeah. It's really Python. Right? And now it's compiling. And then somebody's entering, I think, jazz or indie rock. Okay. And then we're querying the database. And we get some results here. And then we ask Grooveshark to play them for us -- except you won't hear them right now, because we don't have audio on. Yeah. So that would be the player. And when we go back and do a second query -- again, that part is not the interesting one -- we go here again: it's not compiled again, it just runs the query. And that's it. So that's kind of cool, to do it on 1 million songs this way. The last two slides are the conclusion, which is basically: we now have a pattern-oriented framework for hardware-independent specialization, which we like. We aim at productivity, efficiency, portability, and scalability. And we showed as a proof of concept that we can create three diverse audio-based multimedia content analysis applications based on the two patterns we implemented. This will allow a better handle on big data for diverse research communities, and I'm hoping -- actually, I believe -- it will enable new research, because the experimentation-time barrier is lowered. Right? That's the point: now I can do all kinds of things directly on a million songs rather than subsampling. There's future work, though. This all looks like a product, but of course it's not. We have to do more. We have to implement more patterns. We'd like to extend this to visual and textual media, and later maybe also to other computational tasks. And there's one other thing that I talked about only briefly: this all came out of the Par Lab, so it was all about parallelization. But if you handle big data, there's also an I/O bottleneck, and the I/O bottleneck can sometimes be really bad, so that needs to be tackled in the framework as well. But the most important thing, I think, is for us to develop a community where the expert programmers for the parallel hardware work together with the researchers -- the users of the framework -- so that the users tell them what they need and the hardware specialists can make that as efficient as possible. Yeah. And that's it. >> Juan Vargas: So are there any questions? There's one question. >> Gerald Friedland: Yeah. >>: So how long is the preprocessing of the million songs? >> Gerald Friedland: For 10K it's about 14 seconds, and it scales almost linearly, so it's only a couple of minutes. >>: How many songs have you put in the database at this point? >> Gerald Friedland: 1 million. >>: 1 million? >> Gerald Friedland: Yeah. It's the entire database. >>: So if I check for [inaudible] going to be there? >> Gerald Friedland: I don't know. The selection of the 1 million wasn't done by me. But you can check it. >>: It's like 282 gigabytes, so it's huge. >> Gerald Friedland: Yeah. >>: But I don't know if it's completely automated. There's some hand annotation in the database, I think. >> Gerald Friedland: There is some hand annotation. Yeah. That's the whole reason they have the database -- because there's hand annotation. It was actually hand annotated, now that I -- yeah. >>: I thought they had some hand annotation because they had tags. Though, at any rate -- >> Gerald Friedland: I think it's all hand annotated. There's no automatic annotation. >>: Yeah, yeah. >> Gerald Friedland: Yeah, yeah. >>: Yeah, because I started looking at it.
As a musician I find it fascinating. >> Gerald Friedland: It is fascinating. >>: And I want to go through and write graph clustering algorithms to try to see if I can track influences across artists. >> Gerald Friedland: Yeah. Now you can. Actually, just -- you know -- you just can. >>: Find the roots of music. >>: Yeah, yeah. I think it would be really cool. >> Juan Vargas: Okay. So the next speaker is Leo, and he's going to be talking about parallel components for an energy-efficient Web browser. And he [inaudible]. >> Leo Meyerovich: I'll probably bug Dave in ten minutes. Yeah. So I'm Leo. My advisor, Ras, is sitting back there in the black shirt. And a lot of the demos I'll be talking about were actually done by two undergrads of ours -- one this weekend and the other about a month ago. So this is pretty cool. And for people who have seen earlier parts of this work: like I just said, we have a lot of new stuff, both in terms of architecture and in terms of new ways of doing parallel programming. So I wanted to kick off with a demo. It's probably a risky idea, because I actually challenged the undergrads this weekend to do a new visualization using our system. And in the spirit of one of the workshops going on next door, where they're talking about elections -- here we have a very, very simple visualization of -- actually, you're not seeing anything, are you? See, everybody's flipping here. Well, I'll do this demo, and then we'll see what happens. Okay. So here what we're seeing is just a population map of Russia -- who lives where. Then, for a recent election, not everybody voted, so I'm going to resize based on who voted in the different districts in Russia. And if anybody's familiar with Russia, things are kind of fun there. What's really fun to look at is which districts had a normal voting percentage and which districts maybe had a hundred percent, which sounds a little bit suspicious. So now we just fade in -- some of these, the really dark ones, are the ones at a hundred percent. So probably it's not you voting for the president; it's the president voting for you in those dark ones. But I'm kind of cheating here: this isn't really Russia. This is only about -- I think this one's about a hundred districts. Maybe we can load in about a thousand districts and look at what's going on there. But you see now this tween is going a lot slower; if I try to change the size, it's really bumpy, right? In reality, Russia has about 100,000 districts. And if you want to understand what's going on in 100,000 districts, we're not going to be running that in the browser today. But using the same code that produced this visualization in the browser, we got the compiler running on a GPU last week -- and this is where we get into dangerous demo territory -- so we can do the same visualization with 100,000 nodes. Where is my mouse? Here we go. So now we're doing the same kind of visualization: I can do the tween, I can resize the nodes -- a hundred thousand nodes -- I can change the colors. This is actually kind of interesting. See this dark solid one? That was one of those regions of 100 percent voting. And this red here is actually all the same political party -- United Russia.
So be careful if you talk to those guys, I guess. But this is kind of fun, because using the same high-level scripting language we generated the same visualization widget: you can debug it in a normal browser, or you can run this GPU version. And what we started to realize is that when we do visualizations of more like 100,000 things, a million things, maybe new visualization paradigms will actually work. And now we have this new domain of, well, how do we understand this stuff and how do we slice and dice it. And if you can script this in a weekend, I think that's kind of exciting. All right. So now for the actual -- let's do some science. Okay. So, like I said, I want to talk about architecture and patterns. We just saw that in Soviet Russia, the president votes for you. And so, like I said, this is big data. But part of the reason we're doing this browser work is that we're also interested in small devices. You've seen a lot about the Google glasses; in the right corner there's actually a contact lens that people figured out how to put a display on. So this is another domain where, in the coming years, essentially whatever app you have to run, you have to run the browser. So we want to be targeting these things. Big data, small devices. And all of them, as predicted, hit the power wall: we're not getting single-thread performance anymore. And if you look at the devices today, the solution is pretty clear. For the Windows 8 phones and pretty much everybody else -- the NVIDIA phones, everybody -- it's four cores, 128-bit-wide SIMD per core, and then you have a coprocessor; in the case of the Tegra 3, it's 12 GPU cores. So this is what the browser needs to be running on, and today it's only using a little bit of that. So, okay -- what do we need to speed up and parallelize if we want to optimize this stuff? Actually, I will run out of power if I don't turn off the GPU. Sorry. So the question is what we need to parallelize. I'm showing you this nice benchmark that the IE 8 team did. You can slice and dice it different ways, but you would predict that JavaScript is probably something we'll have to target. And you're right -- in this case it's about 20 percent. And as I showed in that early demo, we're working on some new DSLs that will take care of that part of it. But a lot of our work is actually fixing the 80 percent case, which is the libraries for the rest of the browser: rendering, layout, parsing, all these things. So a lot of our work is straight-up algorithms for regular code. That's been the project over the past year -- both sides of this. I want to focus on the non-JavaScript part, just those libraries. Basically, a Web page comes in and you parse it like a normal compiler. Once you have the document, there's a templating step -- imagine you want to say that all the headings on your page get double the font size; you attach those constraints to all the nodes in your document. Then we lay it out -- figure out what goes where -- and finally we send it to the renderer to fill in the pixels. And for a lot of these steps we have to come up with pretty new algorithms for these regular computations. And for all of it, we have to figure out what the new browser architecture is -- that's been a lot of our work.
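To make that four-stage pipeline concrete, here is a rough sketch of the flow -- parse, attach style constraints, layout, render. The data structures are toy stand-ins, not a real engine, and the parser is deliberately faked:

    # Toy browser pipeline: text -> tree -> styled tree -> positioned tree -> pixels.
    class Node:
        def __init__(self, tag, children=()):
            self.tag, self.children = tag, list(children)
            self.font_size = 0   # filled in by the templating/style step
            self.x = self.y = 0  # filled in by layout

    def parse(html):                      # stage 1: text -> document tree (faked)
        return Node("body", [Node("h1"), Node("p")])

    def apply_styles(node, base=12):      # stage 2: attach constraints to nodes
        node.font_size = base * (2 if node.tag == "h1" else 1)
        for c in node.children:
            apply_styles(c, base)

    def layout(node, y=0):                # stage 3: decide what goes where
        node.y, y = y, y + node.font_size
        for c in node.children:
            y = layout(c, y)
        return y

    def render(node, depth=0):            # stage 4: "fill in the pixels" (print)
        print("  " * depth, node.tag, node.font_size, node.y)
        for c in node.children:
            render(c, depth + 1)

    doc = parse("<h1>...</h1><p>...</p>")
    apply_styles(doc); layout(doc); render(doc)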
And what I want to spend most of the rest of the talk on is how we deal with the code explosion problem. Because if we're aggressively optimizing this -- the algorithms and the new architecture -- we have to do it for a whole lot of code; the browser is one of the biggest code bases running on your client today. So I'll talk about that. But before then I do want to give a couple of shout-outs to our collaborators on recent efforts on the architecture and algorithm side. For the algorithms, we've done a lot on things like finite state machines and computations over trees, and it's starting to show up elsewhere. One of our collaborators -- I don't know if he's in the room -- Todd Mytkowicz, has been looking at things like malware detection on the Internet: your search engine finding fraudulent results, finding malicious Web pages. A lot of those are the same algorithms, and even advanced versions of them are showing up in that type of work. On the architecture side, the Qualcomm team took a look at our templating algorithms, and now I'm working with the Mozilla team on architecting their new Servo browser, where, again, we're looking at how to put in the templating algorithms and maybe even -- still unsure -- using our actual layout engine, the actual binaries we're generating. So there's been a lot of cool stuff. But I want to talk about how we deal with the code explosion -- the code complexity of really aggressively optimizing this big code base. And I want to talk about how we do this for a layout engine, which is what I've mostly been focusing on. Layout is the component that figures out what goes where: what the sizes of things are, what the positions are, where each word goes. People have tried to parallelize this type of computation before, and in practice it becomes a tradeoff between essentially performance and correctness -- you pick one. And generally on the Web, if your Web page looks wrong, people don't like your browser. So it's a rough ride to actually achieve this. To give you some intuition for why: for a lot of the algorithms, to get really good parallel speedups we actually had to create new algorithms. I don't want to get into them here, but hopefully some of these sound like terms you haven't heard before -- that gives you the idea. So we're taking a different approach to writing this type of code. First, we write a specification of how the layout language, CSS, is supposed to work. From there we run it through our synthesis tool -- a special type of compiler I'll talk about -- that will find not only a sequential implementation but a parallel one. And, fitting with the talks today, we're going to find how to decompose the layout engine into parallel tree traversals. Once we use the synthesis tool to find this decomposition, it's more traditional compilation and specialization to go from the tree traversals to the specialized code -- that's more like the SEJITS style of work. But we need to get there first, and the rest of the talk is about how we get there.
So, like I said, the layout engine figures out what goes where. We basically found that a lot of layout languages can be thought of this way, and we found out how to parallelize CSS. The basic intuition is: given a document, the layout engine does a sequence of tree traversals. For example, the first tree traversal might be a bottom-up one where I compute the width and height for each node -- I solve those constraints locally -- so maybe I have a black and a green thread running in parallel. Once they're ready, we move on to the next nodes; maybe the red thread completes on some node. And we keep going up. Maybe we have top-down tree traversals for some of them; in that case every node is essentially a logical spawn for this type of parallelism, and here we're computing the X and Y attributes for each document node. And maybe there's some final bottom-up traversal. We have this whole space of different types of traversal strategies, and for whatever layout engine -- for example, the demonstration I showed you in the beginning -- you can generally decompose it into these types of traversals. And this is, again, at the logical level, not the actual implementation. So the question is how we write these. Like I said, we split the problem. First you write the logical spec, where you just say: given an input, what should the output look like. We have a nice little language for this; it's descended from something called attribute grammars, a declarative formalism. And the cool thing is we added a language extension on top of it that lets you talk about the decomposition into these tree traversals. This is interesting in two ways. First, if you look at the basic schedule, the stuff in green is like the tree traversals I showed you a couple of slides back. The way you read it is: the first tree traversal is this bottom-up traversal, instantiated in a way that computes all of the width attributes for each node. We're not saying how to do that -- that's the functional spec's job. And then maybe we want to compose tree traversals: in this case we think there are at least two, so after we do the bottom-up one -- semicolon -- we do the next one. So this is already cool, because we've split out the definition of the decomposition into these tree traversals. But the second part, which really gets into new territory, is that writing all of this out explicitly is painful. Often you don't actually know a valid decomposition, or having to specify how each attribute flows through the layout engine would just be painful. So you can actually just write question marks inside this bit of code, and it's up to our synthesizer to figure out valid ways to correctly fill in these holes -- maybe even to answer optimization questions about them for us. This has changed how we actually develop; we're building different tools to exploit this mechanism, so it changes how we do parallel programming. And to understand that, I'm going to talk a little about how it works, and then you'll understand the tools.
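Here is a hedged sketch of the two logical traversals just described: a bottom-up pass solving width and height locally at each node (each subtree is an independent task, so these could run in parallel), then a top-down pass assigning positions. It is written sequentially; the parallelism is at the logical level, and the sizing rule is a simplified horizontal-box rule, not real CSS:

    class Box:
        def __init__(self, w=10, h=10, children=()):
            self.w, self.h, self.children = w, h, list(children)
            self.x = self.y = 0

    def bottom_up_size(node):
        for c in node.children:          # each child is a logical spawn point
            bottom_up_size(c)
        if node.children:                # horizontal box: widths add, heights max
            node.w = sum(c.w for c in node.children)
            node.h = max(c.h for c in node.children)

    def top_down_position(node, x=0, y=0):
        node.x, node.y = x, y
        for c in node.children:          # children laid out left to right
            top_down_position(c, x, y + node.h)
            x += c.w

    root = Box(children=[Box(), Box(children=[Box(), Box()])])
    bottom_up_size(root); top_down_position(root)
    print(root.w, root.h)                # -> 30 10 for this toy tree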
Basically, we think about this as a search for how to fill in the holes. When you add a new traversal type to our system, you tell it how to make a small step in the search, and from there we can stitch together the overall search. For example, say you have a horizontal box with an assignment to the X attribute that's a function of some V attribute. You want to ask: if I have a top-down tree traversal, is it okay to sequence it after a bottom-up tree traversal that computes that attribute -- is that a valid way to continue, to compose on extra work? So when you add a traversal to the system, to teach it how to use it, what the addition does is check: if this is a bottom-up tree traversal and I want to see whether it can compute this X attribute on each node, then either the V it depends on was computed in a previous tree traversal -- in which case everything is already available -- or the attribute is computed in that same tree traversal. Imagine something like a stencil wavefront, computing as you go along: if the traversal is bottom-up, that means the attribute has to be available from earlier in that same traversal. Once we have all of these local steps, we can do full searches -- searches for complete schedules. Maybe we first guess that we can decompose the system into a top-down traversal that computes only the width. Clearly we don't compute everything, and maybe the width is entangled with the height, so we might need to compute more; this guess is wrong, and we try another alternative. We find an alternative that works, but it doesn't compute everything, so we have to sequence on another tree traversal to compute more attributes. Maybe that works too but still doesn't compute everything, so we sequence on another one. And you can imagine finding different sorts of programs this way. So now the computer is working for the developer: it's finding this decomposition into tree traversals for them. And this is really powerful -- it has completely changed how we develop programs. For example, when you start out writing a widget, you don't really know anything about how to decompose it, so you just put in a giant question mark and ask: can you do anything? That's basically enumeration of the search space, and you stop once you've found all the parallel options. Later on, you might be happy with that and continue development, and then you might want to make it a bit better -- maybe you're in a browser and you want to tweak your browser. There you might say, I actually expected a different parallelization scheme to work, and you might constrain the search space to only those decompositions and ask: did it work, did it not work -- and then you can start understanding why. Or you might use something like an auto-tuner: maybe on different devices different traversals are better -- maybe they compute less per traversal and therefore fit into smaller caches.
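As a rough illustration of that search loop, here is a toy version -- a much-simplified stand-in for the real synthesizer. Each attribute declares which way its dependencies flow ("up" = computed from children, "down" = from the parent), and we greedily sequence traversal passes until every attribute is computed; the attribute table is invented:

    ATTRS = {  # attr -> (flow direction, other attrs it reads)
        "width":  ("up",   set()),
        "height": ("up",   {"width"}),
        "x":      ("down", {"width"}),
        "y":      ("down", {"height"}),
    }

    def schedule():
        done, passes = set(), []
        while len(done) < len(ATTRS):
            progress = False
            for kind in ("up", "down"):
                # candidate attrs: right flow, deps either done or same-pass
                batch = {a for a, (flow, deps) in ATTRS.items()
                         if a not in done and flow == kind
                         and deps <= done | {b for b, (f, _) in ATTRS.items() if f == kind}}
                # keep only attrs whose same-pass deps really land in this batch
                batch = {a for a in batch if ATTRS[a][1] <= done | batch}
                if batch:
                    passes.append((kind, sorted(batch)))
                    done |= batch
                    progress = True
            if not progress:
                raise ValueError("no valid schedule")  # the "wrong guess" case
        return passes

    print(schedule())  # e.g. [('up', ['height', 'width']), ('down', ['x', 'y'])]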
Then finally, once you're happy, you have to remember that we're dealing with lots of code and evolving standards. So whenever the logical spec changes, we need to make sure we don't break the parallel decomposition. Maybe the parallel decomposition assumes some sort of independence, and once you change your code, that gets violated. Now that the decomposition is part of the language and part of the synthesizer, it will actually check whether the schedule is still valid. So I'll show you a bit of this. Okay. As an example, like I said, we've been using this to write the CSS spec. Here you see our favorite Wikipedia Web site. It's a little off, but it's pretty good for about a summer's worth of work. And here we see some tricky features like nested text and tables. The cool thing is that parallelizing the CSS language came down to: this is the sequence of tree traversals -- and that's how a freshman at Berkeley was able to specify how to do it. In five lines -- one, two, three, four, five -- he specified what the parallel decomposition is for CSS. And you pretty much understand this spec already; there's one extra thing I can talk to you about later -- it's a richer form of decomposition. Of course, once you have the decomposition, you have to say how to map it down to hardware, how to do the specialization. Here I don't want to go into the algorithms, but I do want to say this is real. We've actually been writing new algorithms, and you saw in the beginning we've been doing stuff on GPUs; there, our target is about 1 million nodes in real time. On multicore, we ran on an Opteron and found roughly sublinear scaling across the eight cores, relative to sequential. And because we're using code generators, we can do very aggressive sequential optimizations, so in this graph, where I combine the results, we actually have superlinear speedup. We've also been playing with vectorization, so within a core. We found that some tree traversals can be vectorized, and here we show on some Web sites that with our new algorithm we do get a 4X speedup. There's a reason it's superlinear; I can talk to you later about that. Also kind of cool: we started measuring power numbers, and in this case we see about a 3.5X energy savings. So power and performance are, in this particular case, similar. So this is a quick recap of what we just saw. The reason we're doing this work is that browsers need to change: they need to handle bigger jobs and also move onto smaller form factors. We want to do DSLs for the 20 percent case, but for the 80 percent case we really need to get these algorithms out of papers and into real big code, and getting there will be a big step. Part of what we found to be very effective is decompositions into these traversals. And unlike traditional parallel for-loops or whatnot, we found synthesis to be an enabling technique, on both the compiler side and the expression side. So that is it, and I hope that inspired at least some thoughts. >> Juan Vargas: Do we have any questions for Leo? Thank you. Thank you very much. [applause]. >> Juan Vargas: The last speaker for the session is Ras, Ras Bodik.
He's going to be talking about program synthesis for systems biology. He comes from UC Berkeley. And after him we'll have a break, and we'll come back to the room to continue with the next session at 3:30. >> Ras Bodik: Okay. I think I can start. What we are looking at is one of the smallest parallel programs that nature built for us. You can think of it as a small distributed system. It's a worm that contains about 900 cells. It's small enough to serve as a subject for various studies in biology, and it has translucent skin, so you can see how the cells develop with your naked eyes -- or nearly naked eyes. Before I tell you more about our collaborators and how this work came together, I want to leave you with a little puzzle: what do dining philosophers have in common with systems biology, especially developmental systems biology? Here you see on the right how the worm develops, and here on the left how the dining philosophers are trying to dine. This talk should really be given by Saurabh, the postdoc in my group, who essentially discovered the connection between the synthesis of concurrent programs that we did as part of the UPCRC and systems biology. Jasmin is our biologist collaborator -- there is apparently no better place to find a biologist than Microsoft Research -- together with Nir, one of the people who started the field of executable biology. And Ali and Evan are our students: Ali is a grad student; Evan is an undergrad heading to MIT this fall. And they did, of course, as you know, all the work. So, systems biology is the process of modeling, mathematically or otherwise, complex processes in biology. And developmental biology looks at a question which -- I should emphasize -- we are not trying to answer directly: we are not curing cancer; we are even far from understanding it. But in order to understand cancer, you want to understand what happened with the genes, how they malfunction so that cell growth and differentiation break. So to understand cancer you want to investigate cell differentiation, because broken differentiation may lead to cancer. There are two ways a cell can differentiate. In one, a single cell develops into cells of different types. In the other, multiple identical cells differentiate by communicating with each other. So to understand differentiation, you need to understand communication between cells. And the worm is the perfect animal, because you can see the cells and how they divide, and it grows relatively quickly. There is one particular part of the worm that biologists study. There are about six or seven cells, and another cell called the anchor cell, which sends a signal -- stronger to the cells that are close by and weaker to those further away. The six are called vulval precursor cells. They are sort of stem cells that, depending on the communication between them and the signal from the anchor cell, divide in a particular way and determine their fate. Here is the state of these cells before they take their fate, and here is after one step of division. We can see that one of these cells divided into two and took a particular fate, fate No. 2. This one divided and took fate No. 2, and then, depending on what fate they have, they start growing differently and eventually form the desired organ after a few more divisions.
And what we want to understand is what program the cells run in this division cycle, from the point when they're identical and undifferentiated to the point where they establish their fate. You can ask the biological question of how fate is determined in different ways, but systems biology asks it essentially by trying to determine what program the cells run in order to determine their fate so robustly -- robustly in the sense that in a wild-type animal, without any mutation, the fate pattern is always 3, 2, 1, 2, 3 and the organ forms properly. Okay. So our goal is to infer the program -- we're working in program synthesis -- and we want to synthesize it from experiments done in the lab. So now, what is common with the dining philosophers? Well, if you look at the dining philosophers, you see that a particular set of dining philosophers somehow always ends up eating -- they eat their food in a bounded amount of time, without deadlock. So clearly they are running some algorithm in their heads that allows them to communicate and accomplish the goal. Now, you could ask them what the program is, but philosophers are not going to talk to you. So you need to do something else. And the methodology, which is of course a metaphor for -- yes. >>: Can you actually explain the dining philosophers problem? >> Ras Bodik: The dining philosophers problem is that every philosopher has a meal in front of himself or herself, and there is a chopstick between every pair of meals, and a philosopher can only eat after grabbing both adjacent chopsticks. And now you can see the possibility of deadlock: if they don't run any reasonable program, each of them holds one chopstick, and by not releasing it, nobody can ever grab two, and the dinner party stops. So the philosophers are not going to talk to you and tell you what program they are running to successfully eat. You need to do what biologists do with cells: mutate the philosophers somehow, run their program, observe the results, repeat for various other mutations, and from that infer what program they're actually running. Mutations that come to mind: perhaps blindfold a philosopher so they cannot see what the others are doing, or tie their hands behind their back, or something of that sort. And then maybe you come up with a program like this one, described with a Petri net, which tells you that once a philosopher has both chopsticks, they eat; then they could perhaps eat again, or they release the chopsticks so that somebody else can eat, then they wait on somebody else to release theirs, and then they repeat the whole thing. Of course, this model of the philosopher is not obtained by observing a real philosopher, but this is the process we want to follow: observe the process and then infer the program, because the program describes what the cell is doing. What is important in the process is the modeling language we choose. Here they use a Petri net to describe what philosophers do; we have designed a different language. So how do biologists actually do it? They mutate the cell, let it develop, and observe the changes -- observable changes in the phenotype of the cell. So what you see here is the wild type, the cell that has not been mutated.
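As a small executable version of the philosopher model just sketched -- purely illustrative, not the model from the slides -- chopsticks are tokens on the table, and each philosopher cycles through acquire-both, eat, release. Grabbing both chopsticks or neither is what rules out the deadlock where everyone holds one chopstick forever:

    import itertools

    N = 3
    chopsticks = [True] * N          # True = chopstick is on the table

    def try_eat(i):
        left, right = i, (i + 1) % N
        if chopsticks[left] and chopsticks[right]:       # grab both or neither
            chopsticks[left] = chopsticks[right] = False
            print("philosopher", i, "eats")
            chopsticks[left] = chopsticks[right] = True  # release both
            return True
        return False

    # A round-robin schedule; "mutations" would perturb this (e.g., a
    # blindfolded philosopher could be modeled by skipping their turn).
    for i in itertools.islice(itertools.cycle(range(N)), 9):
        try_eat(i)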
Here is time, the development, and here is one where [inaudible] was mutated in a particular way. And what biologists observe is that something bad has happened, and from that they make inferences. Okay. Now, these experiments are recorded in the following way. For a set of genes, you record the mutations. Here is one particular experiment, number three, where this gene is wild type, this one is wild type, this one has been knocked out, and this one is wild type. And so on and so on. And the results tell you what fate these cells -- six of them -- have taken: here fate 3, fate 3, fate 2. This one is the interesting part; it actually requires a new algorithm. Here we are saying that the cells, depending on how you run the experiment, could end up in fate 1 or 2. So the cell is nondeterministic: it's telling you that under that particular mutation, experiment 7, the system becomes nondeterministic, multistable. Depending on chance, the cells can take different fates. And here is the unique modeling challenge: we want to synthesize a program that is complete, in the sense that it can reproduce all observed behaviors. If under various conditions there are different nondeterministic outcomes that biologists observe in the lab, you need to synthesize a program that can replay them all, because if it doesn't, then presumably it's not modeling everything the cell is doing. Okay. Now, from these mutation experiments, biologists infer protein interactions. They say: if you knock out some of these genes, then there is negative regulation on the following proteins. How do they know there is regulation of proteins? Because they know what these proteins do. And so from the observable differences they work back towards these proteins, and now they have pairwise interactions. An interaction would say, for instance, that lst-2 negatively regulates this MAP pathway. Okay. Now, these pairwise interactions are then put together into informal diagrams that look like this. What we have here are three of those VPC cells, here is the anchor cell, and here are the signals between them -- strong, medium, and weak, because this cell and that cell are further away. Inside you have proteins, and these positively influence each other. You also have these receptors, which act as ports on the cell; you can abstractly model them as proteins too. It shows how these three cells take one fate, another fate, another fate by interacting in a particular way. These are the diagrams biologists put together from the many experiments. What you often don't know is, number one, whether these diagrams are complete -- whether these interactions actually explain the cells taking their fates. Number two, we would really like to run this program and observe how the cells communicate, what sort of exchange of information happens in order for the fates to come out a particular way. So we would really like to turn this static information into an accurate dynamic model that can run in a simulation. And here is where executable biology -- Jasmin and Nir's work -- comes in. Executable biology writes a model that is a program: you can run it, model check it, and verify that it indeed agrees with the experiments. So these verification models can tell you that, oh, there is something incorrect in the model.
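To show how the mutation experiments become synthesis constraints, here is a hedged sketch. Each experiment fixes which genes are knocked out and records the set of observed fate patterns (a set, because some mutants are multistable), and a candidate model is acceptable only if, over all schedules, it reproduces every observed pattern and nothing extra -- the completeness requirement. The gene names, fates, and toy model below are illustrative, not the real dataset:

    EXPERIMENTS = [
        # (knocked-out genes, fate patterns observed in the lab for the 6 VPCs)
        (frozenset(),           {(3, 2, 1, 2, 3, 3)}),                           # wild type
        (frozenset({"lst-2"}),  {(3, 2, 2, 2, 3, 3)}),
        (frozenset({"lin-15"}), {(1, 2, 1, 2, 1, 1), (2, 1, 2, 1, 2, 2)}),       # multistable
    ]

    def acceptable(model, schedules):
        for knocked_out, observed in EXPERIMENTS:
            produced = {model(knocked_out, s) for s in schedules}
            if produced != observed:      # must replay all behaviors, no extras
                return False
        return True

    # A trivial stand-in model: deterministic, so it misses the multistable case.
    def toy_model(knocked_out, schedule):
        return (3, 2, 1, 2, 3, 3) if not knocked_out else (3, 2, 2, 2, 3, 3)

    print(acceptable(toy_model, schedules=[0]))  # -> False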
Under a particular execution it disagrees with the experiments, and model checking is needed to handle the combinatorial nature of communication -- the fact that cells proceed at different rates. And you can discover potentially new interactions between proteins, then go to the lab and verify whether your prediction is correct, and so work from in silico to in vivo understanding. Now, the semantics of the models we work with: the concentrations are usually discretized, time is discretized as well, and that's usually enough to find the causal behavior, the interesting behavior, even though we don't know concentrations precisely. In this setting the scientists perhaps don't really care what the actual concentration is. The cells are concurrent automata that progress at slightly different rates, so there is asynchrony, but not arbitrary asynchrony -- they keep a small lag between each other. This notion is called bounded asynchrony. And timing is modeled with state progression in these automata. Okay. Here is a program, a model, that our collaborators wrote in a language called Reactive Modules, a language used for model checking. I won't go into detail -- this is just a small fragment. And it turned out that writing these models is laborious, not because the language is bad, but because the models in general involve timing, asynchrony, and [inaudible] discretization of concentrations. So models are hard to understand and debug. RM is partly a problem because it allows you to write models that do not correspond to biological explanations -- you can build strange abstractions due to its clairvoyance; I won't go into details here. So executable biology is great, but writing models is difficult, and here is where we come in: we try to synthesize these models rather than writing them by hand. Okay. Our contribution is that we synthesize the model from some initial knowledge provided by the biologists, so you obtain models faster. But we go beyond that: we try to tell you whether there are other executable models that can explain the phenomena differently, and then we tell you what experiments you need to run, if there are, to disambiguate those models and rule out the incorrect ones. Here is a quick example of the language we have designed. We don't use Petri nets; we use something else. It's a high-level language, because that leads to smaller programs, a smaller search space, and faster synthesis, and it resembles those biological diagrams, so biologists can hopefully read it better. The language has several levels of hierarchy. The overall model takes a cell mutation, which you can think of as a configuration input, and a schedule, and it outputs the fate pattern. Inside the body you have several cells as automata, which advance asynchronously according to the schedule and communicate. Within each cell we have the proteins -- components, in other words -- and these are modeled essentially as lookup functions, discretizations of how concentrations change. Each component has a concentration and receives a signal, and at the next timestamp the concentration goes up or down depending on its state and the other concentrations. And it is these update functions that we synthesize. The rest is provided by the biologists, because it corresponds to knowledge of the system they already have -- but the update functions are the part that is really hard to write by hand. So, the results.
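Here is a minimal sketch of those modeling semantics, under stated assumptions: protein levels take a few discrete values, each component's next level is a lookup on the current state, and cells advance under bounded asynchrony (no cell gets more than one step ahead of another). The update table is invented -- in the real system, tables like this are exactly what gets synthesized:

    import random

    UPDATE = {  # (own level, incoming signal level) -> next level; levels in {0,1,2}
        (0, 0): 0, (0, 1): 1, (0, 2): 2,
        (1, 0): 0, (1, 1): 1, (1, 2): 2,
        (2, 0): 1, (2, 1): 2, (2, 2): 2,
    }

    def step(levels, signals, cell):
        levels[cell] = UPDATE[(levels[cell], signals[cell])]

    levels  = [0, 0, 0]
    signals = [2, 1, 0]   # anchor-cell signal: strong, medium, weak
    steps   = [0, 0, 0]   # how many steps each cell has taken

    random.seed(1)
    for _ in range(12):
        # bounded asynchrony: only cells not ahead of the slowest may fire
        eligible = [c for c in range(3) if steps[c] <= min(steps)]
        c = random.choice(eligible)
        step(levels, signals, c)
        steps[c] += 1

    print(levels)  # -> [2, 1, 0]: the fate level tracks the signal strength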
So we were able to synthesize a model that previously took our collaborator several months to write by hand. That hand-written program was nearly complete, but it had some bugs, and after a few weeks of trying to fix it we failed -- so we just said, let's go straight to synthesis. We believe this worked partly because of the high-level nature of the language and partly because our synthesis algorithms automate the hard part. This is what went into the synthesizer: essentially, here are the six cells, here is the anchor cell, and each cell has the following skeleton -- various proteins here and interactions between them. And we synthesize the update functions that go inside, the ones that correspond to timing. This one in particular is pretty difficult, because it has two positive interactions coming in and one negative, and it is not clear how they influence this important pathway. And here is the input -- actually, a fraction of the input -- coming from the experiments; in full it had about 40 or 50 rows. And here is an example of two update functions that come out. There are three discrete levels, and the function tells you how the level changes depending on the input states. You now start to see why it is so difficult to write these update functions by hand -- to actually build a model that corresponds to the experiments observed in the lab. The second result is that we are able to help with the following scenario. Imagine that a biologist develops, or synthesizes, a model that verifies against all experiments. He draws conclusions from the model, publishes the results, and then another biologist performs more mutation experiments. Mutation experiments are rarely complete, because they're expensive to run, and under some mutations the cells actually die. So as new experiments appear, it could be that one invalidates the model the biologist had -- it invalidates the conclusions, and you have your nightmare scenario. So here we would like to ascertain that the experiments performed so far are sufficient to create an unambiguous constraint system, and in particular to confirm that there is no alternative model that explains the phenomena in the cell. And we concluded that for this particular model, the experiments people have performed so far are sufficient: there is no alternative model that could produce a different outcome, a different fate. Okay. Now, you could ask another question: imagine you don't trust these experiments and would like to redo them to validate them. You would like to avoid redoing all 50 experiments and instead do a smaller set. We can tell you the minimal set of experiments you need to rerun to still have an unambiguous system that leads to the same model. It turned out that 10 percent of the experiments are enough, because they do constrain the result, and we can tell you which 10 percent to perform to have confidence in your modeling. Finally, I showed you there are no behaviorally distinct models -- at least, no other model that would have a different fate on some future experiment, some future mutation. But it could be that there are models that differ in how they work inside: perhaps in one of them there is an interaction between these two proteins, and [inaudible] in another between another two proteins. That you cannot distinguish by looking at the fate and the development of the cell, because, after all, they produce the same fates.
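As a hedged sketch of the disambiguation idea described above: given two candidate models that agree on every experiment performed so far, enumerate the remaining mutations and look for one where their predictions differ. If such a mutation exists, that is the experiment to run next; if not, the models are behaviorally indistinguishable. The models and gene names are toy stand-ins:

    from itertools import combinations

    GENES = ["lin-15", "lst-2", "lin-12"]

    # Two invented candidate models mapping a knockout set to an outcome.
    def model_a(ko): return ("broken",) if "lin-15" in ko else ("ok",)
    def model_b(ko): return ("broken",) if ko >= {"lin-15", "lst-2"} else ("ok",)

    def distinguishing_experiment(m1, m2):
        for r in range(len(GENES) + 1):               # all knockout combinations
            for ko in map(frozenset, combinations(GENES, r)):
                if m1(ko) != m2(ko):
                    return ko                          # run this experiment next
        return None   # behaviorally identical on every possible experiment

    print(distinguishing_experiment(model_a, model_b))  # -> frozenset({'lin-15'})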
What you need to do is instrument the cell, so to speak, with fluorescent genes and observe what happens -- how these genes express during development. Here is the time axis. Right. So this is a more involved experiment, and we would again like to reduce the amount of work by tagging only the proteins that would differentiate two models. So here we can tell you where tagging is necessary. And indeed we found an alternative model that of course behaves the same way on the outside but has a different interaction between proteins inside to explain the cell. And this is something that apparently surprised our biologist collaborators. To conclude, what I didn't talk about is the work that those of us in formal methods perhaps find more interesting: executable biology actually pushes on the boundaries of what we do. The synthesis of these models is different from the synthesis of concurrent programs, because we want maximal programs, in the sense that if the cell exhibits races, the model must replay them all, because otherwise it's not modeling what's going on in the cell. And it turns out that this leads to a new kind of synthesis problem that needs not a 2QBF but a 3QBF logical solver. The specs are incomplete, because we don't have all the experiments, so there is a lot of unknown behavior, and that motivates analysis that goes beyond synthesis: we don't want to just synthesize one model but to analyze the space of plausible models -- plausible given the observations. So in that sense we are not really synthesizing but modeling the plausible explanations. And then you want to answer questions such as whether more experiments are necessary, and so on and so on. So I'll stop here, and thank you for your attention. [applause]. >> Juan Vargas: Do we have questions for Ras? Yes. >>: So [inaudible] it seems that you're dealing with some complex situations, for instance you want to be able to [inaudible] do you have a sense how often [inaudible] as opposed to when they [inaudible] so how frequently do these types of more complex analyses [inaudible]? >> Ras Bodik: Well, I don't know how advanced other modeling is, but in this particular model of the cell, we are dealing with multiple genes already. This is so slow. Okay. Just a second. Here we are looking at four genes, plus another cell, which could be formed or essentially killed before the experiment. So we have four or more genes already. I know that our collaborators have another dataset that is richer by maybe one or two more genes and has about 70 experiments -- perhaps not all of them necessary, some redundant, as it may turn out. But it looks like the trend goes deeper and deeper into combinatorial mutations. >>: We know for cancers there can be literally hundreds of [inaudible] genes involved in a particular pathology, so these can grow huge. You're really looking at a simple little, almost toy problem, which is what makes C. elegans so fascinating. >> Ras Bodik: Right. So I'm not an expert on how the complexity of the experiments grows. Perhaps in a particular experiment biologists focus on one interaction between a gene and a protein, or a set of genes and a pathway of proteins, and they can pin down just one edge in this model I've shown here. But overall it does lead to a combinatorial space. >> Juan Vargas: There's another question. >>: So often the biological rules are more like rules of thumb.
So I could imagine that you run this combinatorial problem and you come up with either a really strange, physically impossible answer, or no answer at all. Do you have any ideas for how to tell that the biologists have oversimplified the problem? Like you mentioned that the signaling is discretized, which is probably not realistic -- just to pick on one thing in this particular experiment. >> Ras Bodik: Well, we don't have enough experience yet, but in this particular case we were able to synthesize an executable model whose discretization does explain the behavior. >>: And the biologists were reasonably happy with it, or -- >> Ras Bodik: Well, I wouldn't speak for all biologists. You know, which modeling is appropriate for biological systems may take a long time to resolve. Biologists apparently prefer modeling with differential equations, and then there is the school of stochastic modeling. You could think of this concurrent-program modeling as an abstraction of stochastic modeling, where you replace probabilities with nondeterministic decisions. Right? Instead of probabilities, we are tuning the schedule, which tells you how cells alternate. There is a simple argument suggesting that this modeling is an abstraction of what happens deeper inside, if you believe that proteins behave according to their concentrations. If that's not the case, then this modeling might not be a true abstraction of the system. It's a good question. I don't think anybody really knows. >>: I guess as a follow-up, any thoughts on how to compare the schools of thought? You could imagine that different groups of biologists want different models, and multiple of them will explain the results. >> Ras Bodik: Well, a safe answer at this point would be to say that there are situations in which this sort of causal, nonquantitative but qualitative modeling is appropriate; maybe when you're looking at longer-running processes, where you want to see how protein concentrations develop from starting conditions, stochastic modeling is better; and a model that purely focuses on whether the communication happens this way or the other way might be better elsewhere. Going beyond that would be speculation for me at this point, so I'd rather not say much more. >> Juan Vargas: Well, thank you. Thank you, Ras. >> Ras Bodik: Thank you. >> Juan Vargas: Thank you very much. [applause]