>> Jim Larus: Welcome to the first UPCRC Multicore Applications Workshop. That's quite a mouthful. I should point out for those of you who are not familiar with this room that on the other side of this wall there's breakfast, there's some fruit and coffee and juice and things like that. Feel free to help yourself. We will have a break later in the morning, but if you want something now -- >>: And if you're embarrassed to charge past a speaker for breakfast, you can go around. >> Jim Larus: You could go all the way around all of the lecture halls, or you could just go on that side of the room. We'd be okay with that. So I just wanted to say a few words and sort of set the context for this. I think that most of our visitors know what I'm going to say, but many of the Microsoft people may not be familiar with the context. And so I thought I'd just spend a couple minutes explaining what UPCRC is and why we're having this workshop. So I think the first statement is pretty obvious to everyone in this room, that multicore's success is hugely important to both Microsoft and Intel. And because of that the two companies got together a few years ago to jointly fund two research centers: one at UC Berkeley and one at the University of Illinois at Urbana-Champaign. And they are large multidisciplinary, multi-investigator projects. I didn't actually count, but my guess is that both of them have roughly ten faculty members involved and around 20 graduate students and staff. They're funded for three to five years, with a lot of money coming from both of the companies to fund these projects. So it's a really big research effort, far larger than anything else that I think Microsoft Research has tried in the past ten years I've been here. And they really have a broad focus. If you look at the projects, they're encompassing programming, systems issues, a little bit of computer architecture, and applications. And really the goal of both of these is to help bring multicore computing into the mainstream of computing. Can we figure out a way of making multicore accessible to the broad range of programmers? And can multicore on the desktop have a lot of compelling applications and compelling tools? Why this workshop? Well, there are really two challenges facing multicore. The first is how do we program parallel shared memory computers, a challenge that's been around for a long time, certainly as long as I've been in the computer science community. And then the other challenge, of course, is what are the killer applications for multicore. Why would my mother go out and buy an 8-way parallel computer for her desktop? I'd still like to know the answer to that question if anybody has any suggestions. And these two questions are really sort of dependent on each other. How do we build the tools without knowing what kind of applications we're going to write for these machines? How do we build the applications without the tools? And so really the goal of this project is to try to make progress on both of them. And because these projects are run in CS departments and because we're Microsoft, it looks like there's a lot more emphasis placed on the first question than the second question. 
And so one of the goals of this workshop is to try to redress that imbalance and put a little more light on the application side of things and bring together people from Intel and Microsoft and Illinois and Berkeley who are really focused on the application side of these questions and get them to sort of talk about what we're doing, where the opportunities for future advancement are, what we would like to hear from the -- what we would like to see from the tools research, and basically maybe start getting more of a community of people interested in this and hopefully get a little bit more emphasis on the applications than on the -- just on the tools side of things. So that's what the motivation for this workshop is. Really happy that all of you came, that we've got a great turnout from all four groups of researchers. We have talks from all four groups of researchers. And having said that, let me just sort of put up the schedule. We're going to start this morning with a sort of welcome from Tony Hey, and then we're going to start moving into something that just was, for the lack of a better term, called visual computing by me, which is a real hodgepodge of different areas. I'm sure that people who are in these areas will accuse me of contributory negligence when I sort of put them all together. But I think there's a lot of interesting synergies. We have a break this morning. We have a slightly late lunch; I apologize for those of you who are on other time zones for the lunch times. We have some more talks and then a separate session this afternoon talking about social interactions. And then this evening there's a reception for all our visitors and invited Microsoft guests. I have maps. It's literally just about a few blocks that way from here in a new facility Microsoft just opened. So those of you who need to know where to go, I'll be happy to give you a map. Great. And if anybody has any questions and concerns, please sort of catch me, ask. All the visitors should have -- if you checked in at the front desk, you should have gotten a little thing that will let you hook up to the wireless network. If you haven't, again, check with me and we can get you one. The bathrooms, just to answer the other question, are if you go out that door, walk all the way down to the end of the hall, turn left, they're right there. Any other questions before we get going? Great. Tony. >> Tony Hey: Great. Well, not much more to say. Certainly add my welcome to everybody from Berkeley, Illinois, Intel and Microsoft Research. And I just thought it would be interesting to give a couple of perspectives. How I got interested in parallel computing was a seminar given by Carver Mead when I was on sabbatical at Caltech. And it was an interesting seminar. There was a room full of Caltech faculty and everybody was waiting and waiting and waiting. And at about a quarter past the hour someone went and fetched Carver Mead from his lab; he'd forgotten about it. And he gave a great talk. He gave it with a slide deck and his talk was really an explanation of Moore's law. So that's one of the things that -- he was the Gordon Moore professor at Caltech, and he explained why Moore's law worked. And his talk was breathtaking. He gave a talk which showed there were no -- this is '81 -- no engineering obstacles to transistors and chips getting smaller and faster for the next 20 years. And it's clearly -- that was in '81, and we've really gone a bit beyond that, but we've now come to the end of the free ride. 
So that was the thing that inspired me to get into parallel computing. I was doing quantum chromodynamics. I was a physicist in those days. And I would classify QCD as an application that can soak up as many cycles as you need. But it was clear we needed a bigger computer than the VAX-11/780, which is what I had at the time. So that's how I got interested. And it was also clear to me at that point that distributed memory systems -- it was the days of Crays and supercomputers and shared memory. Distributed memory systems were in the end going to outperform these shared memory systems like the Y-MP and X-MP and so on. So it took a long time to come to pass, but I think that's clear: for the scientific applications, distributed memory programming is the way people program these machines with tens of thousands of processors at the national labs and so on. There were things like SIMD and MIMD, of course. And I worked in Europe on a processor called the transputer. Now, the transputer was an interesting processor. It had a CPU and an FPU and it had memory and communication hardware on it. I think it was about sort of 10, 15 years ahead of its time. But it is now clear that with the many-core, multicore ideas, we're now revisiting some of those ideas that were embodied in the transputer. So I'm finding that particularly interesting. There used to be a joke when I gave talks on parallel computing, which, you know, parallel computing is the future, and then people said and always will be. And the interesting thing is that it is now really the future. It is actually -- I had not foreseen that it was going to be so critical for the IT industry, for Intel and Microsoft, to actually master programming on multicore chips. And we have to do better than MPI. We never succeeded in 20 years -- my research group and others in the audience who we interacted with, we never made it easy. And so we really need to do better. And I'm looking forward to the focus on applications. My favorite application is the one that Dave Patterson used, which is one of these. As I get older, you know -- I need one of these which has a camera, and so I'm listening to Rick's talk in a moment, which will actually say, well, that looks like Dan Reed over there, he's aged a bit in the last five years and last time you met him you talked about this and that and he owes you a beer. And so then I can go and say hi, Dan, nice to see you. How about that beer? Because I've completely forgotten his name. He, of course, has a similar device which tells him that he owes me a beer, and then he avoids me. Okay. So those are the sort of consumer applications that maybe -- that Jim's mother might buy. You never know. But there I think is the key, and I'm very, very pleased to see the focus on real practical applications which could be of general interest as opposed to quantum chromodynamics, which is not of general interest, I have to say. So just one last word. I first met Andrew -- and so this is how we put the deal together with Intel. It was because Andrew called me up and said how about doing something together. And as you know, Intel and Microsoft working together, it's like two porcupines mating. It has to be done with care. I think I first met Andrew when he was chair of the PPOC [phonetic] program committee. And he organized, I don't know, some program committee meeting on January the 2nd in San Diego. So it seemed a bit unreasonable, but then I flew all the way from the UK to go and do that. 
And I was amazed at this program committee in that there were all these young Turks and they trashed everybody's papers, so nothing was good enough and they just selected a few. And then the program committee members who had papers, well, then had to leave the room. And then the remaining program committee members trashed their papers too. So it was an interesting experience. I argued for some generosity. So I'm looking forward to today's events and seeing what progress has been made. And I think it's really important. And the other important thing that Jim didn't say is that the reception in the Spitfire bar actually has beer. So I'll certainly be there tonight. Okay. So I think we're ahead of schedule, Jim, so unless Dennis wants to say something, I think it's Rick -- you ready to start? I'll hand it over to you, Jim. >> Jim Larus: Yes. I just -- one announcement. All non-Microsoft speakers, we need you to sign one of these release forms for your talk. >>: We have one form they have to sign, very small -- >> Jim Larus: We have one form you have to sign really small on it. But it's just to give us permission to both -- there are two boxes you also have to initial. One is to just use it internally; the other one is to put it externally. Hopefully everybody's okay with putting it externally. I've had a number of requests from both schools that -- and I think these talks would actually be a real asset to people interested in the area. So please remember to do that. And also they would like your talk on this memory stick, so please do that as well. Great. So our first speaker -- let me just go back, put that back up -- is Rick Szeliski from MSR who's going to be talking about some of the great and really fascinating work that's been going on in his group on vision. >> Rick Szeliski: Okay. Thank you, Jim. Is my microphone live now? Well -- okay, thank you. So, first of all, let me say that even though the program says it's just me, there's actually three of us talking. I thought it was just easier to keep it short on the content line. But I'm going to give a brief introduction. Sudipta Sinha, who is a researcher in our group, will talk a little bit about Photosynth and some of the ways we're extending it. And then Sameer Agarwal, who is a postdoc at the University of Washington, will talk about solving very, very large matching problems involving tens of thousands or hundreds of thousands of images. So let me just start with a brief introduction to some of the applications that you might see of computer vision where a lot of parallelism is required. And I'm not going to cover most of these applications. As a matter of fact, some of the subsequent speakers from the various other institutions attending here will be covering some of these in more detail. But just to give you a flavor for why we need large amounts of computing, and will continue to do so for the next decade or so, there's this idea of computational photography, where rather than just taking a photo with a camera and then printing it or sharing it on the Web you can take multiple photographs -- let's say different focus settings or different exposures or with and without a flash, or even from different points of view -- and then put them together. And we're starting to see this really permeate into the general public, so people now commonly take lots of photos like this and create panoramas. But another thing you can do is take different exposures with your cameras and then merge them all together. 
So our group has been working in these areas. And we'll show you one particular thing called Photosynth which is a way of taking photographs from different points of view and moving around between them. So this is one area that can soak up a lot of computation. One that takes even more is whenever you move to video. There the processing requirements get a lot, lot higher because you basically have to process an image 30 frames a second. So, of course, there are the classic applications like compression. Phil Chou's group works in that area. Video enhancement, which is an issue with some of the low-quality cameras we have or maybe stabilizing videos that are taken from jittery cameras. And something that's just on the horizon, this 3D video. Now, you might say, oh, 3D video, 3D movies, people have talked about this forever, what's the big deal, it doesn't look so exciting. But the fact is a lot of the movie studios are thinking this is what's going to keep people coming back to the movie houses. And a lot of major directors -- Disney is investing heavily in this, James Cameron's working on a feature film. And of course if it comes to the movie theater, it's going to come to your home eventually. So we're going to see 3-dimensional or at least stereoscopic video coming to your home probably in the next, who knows, three, four years. And that will soak up more computing power because you might want to display this video for different viewers in the room or change your viewpoint. Then there's stuff more in the immediate mobile space where you might be wanting to recognize photos. So just like when Tony said he wants to see Dan's photo and recognize who that is, that's one application. And then also recognizing where you are, taking a photo of your environment, getting the name of that building, translating signs that you see, overlaying information as you're looking around. So there's this whole space of things. And of course you might say, well, okay, this is the mobile platform so this has nothing to do with multicore. Of course most of you probably know better, right? I mean, the mobile devices will be just as multicore as everything else. And then there are applications that you can think of as being a little bit more embedded, like robotics. We're starting to see a lot of computer vision making its way into cars. So, for example, under poor visibility conditions, sometimes the vision system can detect things and sound an alarm for the driver better than the drivers themselves can. People talk a lot about home monitoring, applications for our aging population. And eventually we would like to have the computer understand you and your state of mind and your desires just as well as a human assistant can. So there's this whole idea of looking at people. So these are just some potential computer vision applications. And I probably don't have to explain to you why they soak up a lot of pixels -- or computons, because it just takes a lot of time to analyze an image and to really understand what's going on. So in our group, which is called the Interactive Visual Media Group, we have a number of projects. This project called Lincoln is the one where we recognize things on cell phones. Photo Tourism was a project that started at the University of Washington as a collaboration, and Sameer, who will be up shortly, will be talking about some of these things. It evolved into the Microsoft product called Photosynth which Sudipta will talk about. 
We do things with stitching, we've done 3D video, we've done video walkthroughs. And mostly I put this slide up to -- for people who are interested and didn't get enough details from my talk to just go visit our group page. You can find more detail about some of these research projects. So I will mention one application very briefly. This isn't rocket science, but it just shows you why multicore, when done correctly, is an incredible benefit. This is just the simple idea of convolving an image, blurring it, sharpening it, using what's known as a separable filter, which is, you use a one-dimensional filter horizontally and a one-dimensional filter vertically. And one of the developers in our group, Simon Winder, thought about this hard and decided that the right way to do this is to stripe the image vertically and have each core basically working on its own separate area. And then after it's finished doing a vertical convolution, then it transposes things as it's writing them out, and then you can do it once more and this results in a very fast application. So this is a great piece of code. We use it all the time. Unfortunately we gave this code to a different product group, the Photosynth team, that uses some of these tools. And they discovered they were already multicoring at the task level. They were giving each core a different image. And everything collided and slowed down. So this really shows the need for things like ConcRT, like operating systems which will basically arbitrate between different applications. They each think that they know how to use the parallelism on the multicore. So this will be an issue in the future. So let me summarize my introduction by just saying that visual media in general will probably provide almost unlimited computational requirements. Some of these applications I think are very rich and compelling and everybody understands what they do. Everybody wants a cell phone that will tell you what is in front of it. People like these kinds of three-dimensional immersive experiences like Photosynth. But for those of us who -- you know, our group works primarily on the applications side, but when we have to think about how to program it, there are a lot of interesting challenges because not everything is just obviously data parallel. There are also a lot of sparse systems techniques being used and things that aren't really even at the pixel level or at the continuous numerical level. There are things like information retrieval. Basically, if you want to match an image against a very large database, you have to start using things like inverted indices and document frequency counts and things like that. So this is a very rich area. And with that I'm going to turn the mic over to Sudipta, who is going to take it from here and tell you a little bit about what he's been doing in the Photosynth area. 
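A minimal sketch of the striped separable filter Rick describes above, in Python with NumPy. It only illustrates the decomposition -- vertical stripes handed to different cores, a one-dimensional pass, a transpose on the way out, then a second pass -- and is not the actual production SIMD code; all names here are illustrative.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def convolve_columns(stripe, kernel):
    # 1-D convolution down each column of one vertical stripe.
    out = np.empty_like(stripe)
    for c in range(stripe.shape[1]):
        out[:, c] = np.convolve(stripe[:, c], kernel, mode="same")
    return out

def separable_filter(image, kernel, workers=4):
    def one_pass(img):
        # Split into vertical stripes, one per worker/core.
        stripes = np.array_split(img, workers, axis=1)
        with ThreadPoolExecutor(max_workers=workers) as pool:
            done = list(pool.map(lambda s: convolve_columns(s, kernel), stripes))
        # Transpose while writing out, so the second pass is again vertical.
        return np.hstack(done).T
    # Two vertical passes with a transpose in between = full 2-D separable filter.
    return one_pass(one_pass(image))

blurred = separable_filter(np.random.rand(480, 640).astype(np.float32),
                           np.array([1, 4, 6, 4, 1], dtype=np.float32) / 16.0)
```

As the anecdote about the Photosynth team suggests, the interesting problem is not this loop itself but what happens when two components each assume they own all the cores.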
>> Sudipta Sinha: Thanks, Rick. So I'm going to talk about Photosynth, which some of you may have already seen. Basically what's happening in that little video clip up there is we are doing a slide show between images of a scene which were taken from different positions, but by having processed those images to figure out how they overlap and how the cameras were actually located in a 3D scene we were actually able to do a much better scene transition, which hopefully gives you a better feel of what the scene looks like. So the main idea behind Photosynth is you process the images and automatically figure out where the cameras are located, and then based on that you -- it gives -- it provides you the ability to browse a large collection of photos in 3D. So by figuring out that those two photographs were taken from nearby positions, it allows you to actually move around in the 3D world showing the photographs that were taken. So this is sort of the classical approach in sort of the image-based rendering community, which people have been trying to do. And what Photosynth really had sort of been able to do is bring that to sort of the consumer. So the task that I'm going to talk about mainly, the application is often called view interpolation. And Sameer later on is going to talk more about the background of how Photosynth actually works by -- he will go into detail on feature matching and structure from motion. But basically starting with those photographs shown on the left, you come up with -- you have to estimate where the cameras are and a sparse 3D point cloud of the scene. So then Photosynth actually works on top of this. So what I'm going to talk about is how can we improve the transitions that I showed you. So the way Photosynth works right now is it uses a single-plane proxy for every image to do the transitions. Whereas, if you actually knew some approximate three-dimensional shape of the scene, this would actually allow you to do a much better transition. So this is sort of showing a quick comparison of what happens when -- this is showing the scene -- photo transition with a single-plane proxy. And the reason things are blurred is because the single-plane proxy doesn't fit the 3D geometry of the scene. So now what I'm showing you is the way the transitions would look if you had approximate depth of the scene. So things are much more aligned, and it actually feels like the camera is moving from one position to another in this sequence. So the big challenge is how do we compute depth automatically for a wide range of inputs; we're talking about consumer photography where people are not going to take photographs in special conditions. So this has been a big -- sort of a big challenge in the computer vision community, and we've started to see robust and really practical stereo algorithms which sort of address some of the challenges in the dense correspondence problem. So the basic problem that stereo solves is given two images of the scene, figuring out a dense map of pixels from the first image into the second image. And although this sounds like -- so this is sort of a common theme in computer vision. And like Rick said, this is extremely compute heavy and really the key to solving the stereo problem as well. People have recently shown a big speedup by using the GPU. So the GPU has this SIMD model to exploit parallelism, and that's what the speedup comes from in stereo. So here is an example. One of the algorithms proposed recently, this is a -- the way this algorithm works is it first figures out depth at specific points in the scene. And then once you have the depth at those pixels, it sort of spreads them. So it's a surface growing algorithm. And the benefit is that it's quite easily parallelizable. Because really you need to operate -- you're operating on a local region of the scene and gradually computing depth for all the pixels. 
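For intuition, here is a deliberately naive block-matching stereo sketch in Python/NumPy. It is not the semi-dense, surface-growing algorithm just described; it only shows why dense correspondence is per-pixel, per-disparity work that maps naturally onto SIMD/GPU-style parallelism. It assumes a rectified grayscale image pair given as floating-point arrays of the same shape.

```python
import numpy as np

def block_matching_disparity(left, right, max_disp=32, radius=3):
    # left, right: rectified grayscale float arrays of identical shape.
    h, w = left.shape
    best_cost = np.full((h, w), np.inf, dtype=np.float32)
    disparity = np.zeros((h, w), dtype=np.int32)
    kernel = np.ones(2 * radius + 1, dtype=np.float32)   # box window
    for d in range(max_disp):
        shifted = np.roll(right, d, axis=1)               # test one disparity hypothesis
        sad = np.abs(left - shifted)                      # per-pixel absolute difference
        # Aggregate the cost over a window with a separable box filter.
        sad = np.apply_along_axis(np.convolve, 0, sad, kernel, mode="same")
        sad = np.apply_along_axis(np.convolve, 1, sad, kernel, mode="same")
        better = sad < best_cost                          # winner-take-all per pixel
        best_cost[better] = sad[better]
        disparity[better] = d
    return disparity
```

Every pixel is touched once per candidate disparity, which is exactly the workload the GPU's SIMD model speeds up.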
So there's a final step which is sort of a global step, which is once you've computed individual depth maps for the images, then how do you parallelize that, how do you fuse them together into your model, which is basically what this animation is showing on the right. So for Photosynth, we are exploring a similar pipeline. But one of the differences is that we are looking for a way to come up with a lightweight reconstruction of the scene. So the way -- one of the approaches that we are exploring is starting from the sparse set of scene points which were recovered by structure from motion, we do some line reconstruction. And then based on the sparse geometric information we recover a set of candidate dominant planes. And once we have this set of dominant planes, we figure out how to assign each of these planes to pixels in the images. And so the pseudo-colored image on the right shows you one of those assignments. And corresponding to that assignment, you get a depth map. So this is going to be our representation with which we can do the kind of transitions that I showed earlier. So in this big processing pipeline, there are a large number of sort of tasks that we solve. Some of these are sort of -- so there are really three levels of parallelism that we can exploit there. So things that are trivial to parallelize are sort of the [inaudible] tasks: feature detection, edge extraction, vanishing point detection. And then there are tasks which work at the level of groups of images. So typically -- so, for example, the semi-dense stereo algorithm works by picking a reference image, figuring out a set of neighboring images, and then computing the depth map of the reference image. And then this is repeated for each image in the collection. And then there are certain tasks like global tasks, and Sameer will talk about bundle adjustment, which is sort of one of the key nonlinear optimization problems that have to be solved in these kinds of systems. So the main idea is that it should be fairly easy to figure -- once you locate which are your regions, sort of the parallel regions in your pipeline, it should be possible to exploit task-level parallelism by running multiple instances of your problem and solving them in parallel. And that's basically what I have on the right showing that -- the diagram shows that idea. So yeah. So those are some of the steps I was talking about. In our processing pipeline, sort of most of the time is spent at either -- at both the dense stereo step, so dense stereo is solving this dense correspondence map between pairs of images. And the graph-cut optimization is, again, trying to solve the pixel -- the label assignment problem on the whole image. So those are really -- those steps are really operating at each pixel in the image, and that's why it takes so much time to run. So this is showing how long Photosynth takes to run, sort of the different steps in there. And Photosynth is fairly fast, and people have been building collections of close to a thousand images all on one machine. It's also very memory intensive because you pretty much work with large sets of images in memory all the time. So to summarize, we are looking at ways to improve Photosynth. And some of the improvements that we are thinking of require a lot more compute than Photosynth currently uses. But a lot of this should also be parallelizable, and we are looking into how to achieve that. 
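A minimal sketch of the task-level parallelism just described. The stage functions (detect_features, depth_map) are hypothetical placeholders standing in for the real pipeline steps; the point is only the shape of the decomposition into per-image, per-reference-image, and global work.

```python
from concurrent.futures import ProcessPoolExecutor

def detect_features(image_path):        # per-image task (placeholder)
    ...

def depth_map(ref_image, neighbors):    # per-reference-image task (placeholder)
    ...

def run_pipeline(image_paths, neighbor_sets, workers=8):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Level 1: embarrassingly parallel per-image work
        # (feature detection, edge extraction, vanishing points).
        features = list(pool.map(detect_features, image_paths))
        # Level 2: one semi-dense stereo job per reference image,
        # each using its own set of neighboring images.
        depths = list(pool.map(depth_map, image_paths, neighbor_sets))
    # Level 3: the global steps (plane assignment, graph cuts, bundle
    # adjustment) still run as single jobs and are harder to parallelize.
    return features, depths
```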
And behind the scenes sort of there's the structure from motion problem as well as the image-based modeling problem, which is -- this whole domain is extremely CPU intensive. And there's a lot of scope to exploit task parallelism. Yes. >>: If I understand correctly, what you were describing was sort of the batch processing time for those sets of images, right? >> Sudipta Sinha: Yes. >>: Is there an incremental kind of characterization? If I were to add an image to a set or [inaudible] reprocess the whole thing? >> Sudipta Sinha: No. So yeah. So there are different steps. So the feature matching step, it's actually possible to incrementally add these images the way you talked about. So once you know the reconstruction for N cameras, given a new camera, you figure out -- you do the feature detection for that image, but then figuring out the pose for that camera is a fairly incremental step. >>: So do you know what the actual bottlenecks are at the next level of detail past the one you showed at the end? I mean, what are the parts that are taking the time [inaudible]? >> Sudipta Sinha: So when you do the dense scene modeling, it's really the order of the number of pixels. So dense stereo -- so one of the things we're exploring here is coming up with a simplified representation of the scene. But it still requires you to solve the dense stereo problem. Because you're assigning depth to every pixel. So that's really where the -- where sort of the bottlenecks are going to be, because -- >>: Maybe the simplest question is: is it memory bound or is it compute bound? Since it's -- it's a lot of data, right? >> Sudipta Sinha: It's data, but it's going to be compute bound. And, again, so the graph-cut optimization is sort of one of the components which is used a lot, so we are using it in our pipeline, it's used in other image stitching tasks. It's -- it depends on what resolution you set up your graph. But it's again -- so the graph-cut step is somewhat harder to parallelize compared to stereo, because stereo is basically -- you can run -- it's all -- it can be done in SIMD fashion, whereas graph-cut -- again, there are different graph-cut algorithms, some of which are possible to parallelize, like the push-relabel type of graph-cut algorithms are easier to parallelize. But that's, again, another component which takes time and it's commonly used in a lot of applications. >>: [inaudible] full automation, no user input? >> Sudipta Sinha: As of now, so everything I described is without user input, yeah. >>: But is your goal to sort of make this sort of available on the client and to enable that scenario, or, I mean, push a lot of this to the client where then the user couldn't interact? >> Sudipta Sinha: So, I mean, sort of there has been work in the interactive domain in the research community. But I think the real power of the system will come when you do everything automatically. And we're talking about working with hundreds of images and potentially up to thousands. >>: No, but I'm thinking of the people who use Photoshop and they do all sorts of correction on the images and sort of the ability to sort of combine editing at this mass scale. Because I think that my experience is that, you know, there's -- you know, I have this huge number of images and the ability to see them all is great, but also to apply mass edits, you know, and do things like that would be very fundamental. 
But I would need the stuff probably on my CPU -- on my desktop to take advantage of things, you know, the large image, all that stuff. >> Sudipta Sinha: Yeah, I mean, so the 3D modeling part of the pipeline is somewhere where the user can definitely improve the results. It wouldn't change the -- well, so you could probably skip the dense stereo part and do something faster there if the user gave you some input. But otherwise, I mean, compute-wise, it's still going to be spending the same amount of resources on the problem. >>: Yeah. So I guess I don't understand why the dense stereo step is done pixel by pixel. Couldn't you just look at -- you know, identify what you might call disparity classes and segment the image into sets of pixels with the same disparity and then work on those as a group instead of individually? >> Sudipta Sinha: Yes, you could, but then you're making -- so when you start off with no knowledge about the scene, and if you want to get an accurate depth map, you would have to basically go down to every pixel. So the problem is there's a lot of -- the problem is inherently ill posed. So we do end up enforcing this constraint that pixels near to each other should have the same disparity. But to -- given a perfectly textured scene, if you wanted to compute an accurate depth map, you would have to basically step -- take one-pixel steps in this disparity space to come up with the optimal solution. So, yes, there are faster algorithms which are trying to either sample the disparity space less densely or making early commitments that can sort of -- those are sort of the speedups that are possible. >>: I was kind of -- I mean, we should take this [inaudible] segment the disparity space and take it as far as you can go. If you have to go further to pixel level for pieces of it, you do that. But you might get a pretty clean solution. >> Jim Larus: Yeah. And there are stereo matchers that first segment the image, then work at that level. >>: Yeah, yeah, yeah. Yeah. >> Rick Szeliski: Let's take one more question. I want to make sure the last of our triplet of speakers has a chance. So go ahead. >>: Two quick ones. One, is this only three dimensional or do you assume that all the images are sort of shot from this ground level [inaudible]. >> Sudipta Sinha: So the input set is completely unstructured. We don't make any assumptions on the scene. That's sort of one of the things that makes it challenging. We want this to work on general input collections. >>: And the other one, if the input set is not a collection of still images but video that was shot while the camera was moving, is it sort of an easier task since you could potentially use motion vectors that are already in the [inaudible] video stream? >> Sudipta Sinha: So the [inaudible] that I've described and the whole system is geared toward static scenes. Because the whole idea is matching feature points, we assume that the scene is static so that things have not moved. If things are moving, then the moving objects will not be reconstructed. >>: [inaudible] but the camera is moving [inaudible] moving around the scene. >> Sudipta Sinha: Yeah. So that's, again, the same structure from motion problem, because it's a sequence of images taken from different viewpoints. And that's easy to build into Photosynth. >> Rick Szeliski: It does get easier. >>: So if a car goes by, it might be trouble. >> Sudipta Sinha: Well, it's going to be treated as an outlier in this whole pipeline. 
Everything else will be reconstructed unless you specifically -- yeah. >> Rick Szeliski: So thank you. >> Sameer Agarwal: Thank you. My name is Sameer Agarwal and this is joint work with Noah Snavely at Cornell; Ian Simon and Steve Seitz at the University of Washington; Rick Szeliski at Microsoft Research. And as Rick and Sudipta talked about, this work originated from work done at the U-Dub known as Photo Tourism, which aimed at reconstructing basically tourist landmarks from tourist photo collections from places like Flickr. And the size of the system that we were talking about there was a couple hundred to a thousand or so images at that time. And the system was then built into a product by Microsoft and -- known as Photosynth, and it can handle, I believe, a couple thousand images right now. But if you go back -- go on the Web and look at image collections, I just went to Flickr and typed the word Venice -- and this is the image count that comes up in Picasa -- it's 7.8 million. For cities like Rome we're talking about 26 million images. And New York and London are 40 million images. And the aim -- and the thing that we wanted to do was build a system that will go on the Web, type the name of a city, download every image for that city, and try and reconstruct that city, the three-dimensional structure of that city. To give ourselves a slightly more practical target, we decided all right, we'll take -- we'll download a million images of Rome, match all the images and build a 3D model of the city and do it in a fully distributed manner on a thousand cores in 24 hours. So what I'll describe today is some progress towards that, and I'll try and convince you that we're not completely crazy. So why do it? Well, first of all, because it's cool. Because we can do it. But on a more pragmatic note, the single most interesting thing about tourist photographs is that they go where things like Street View won't go. They capture notions of interest. People go to interesting places. They capture things at different times of the day, different locations. They capture interiors as well as exteriors. And our hope is that when we do these reconstructions, all these things that people have captured in their photographs, they'll be represented in there. And once we have these models, we can put them in things like Google Earth, Virtual Earth. And the next generation of GPS, for example, instead of giving you this line diagram, will likely show you how you will get there. Another thing -- we did something very similar to a project at Georgia Tech, which is called 4D Cities, where they are trying to map the evolution of the city of Atlanta over time, especially since Atlanta has undergone some very dramatic changes in its architecture. And the thing is that once you have the 3D structure and the visual representation in your computer, you can mine it, you can understand how the city grew, what are similarities across different parts of the city, what are sort of canonical structures across cities. And every single time I tell somebody about this project, there's usually somebody in the audience who asks me about setting a game in it. Grand Theft Auto: Rome, for example, where I can actually blow up real stuff in a real city instead of trying to fake a city. So what are the challenges here? 
The challenges are both scale [inaudible] computer vision algorithms, especially structure from motion systems like this, have traditionally been designed assuming you have a single processor and most of what you need to do can fit in RAM or it can be easily accessed from the disk. For the scales of things that we are talking about, there is no -- there certainly isn't a question of being able to fit in RAM, but you can't even think about fitting most of this stuff on a single disk. So we are necessarily talking about a distributed memory system. Then once you start talking about [inaudible] distributed memory system, the other question is how do you actually distribute. So we've been classically -- most computer vision systems assume a single processor and things being done serially. Now this offers us a fundamental opportunity to look at ways in which we can look at problem decomposition, efficient ways of breaking these problems up and combining these results. So there's work to be done both in data distribution, in dynamic load balancing. And when you get to the nitty-gritty, sparse distributed linear algebra. So where are we now? In the system that I'll -- well, very briefly talk about today, some of the stuff is parallel. Our image download, our feature extraction and image matching system, which is sort of the first thing that we really spent some time working on, are parallel. Our actual geometric reconstruction system is not a distributed memory [inaudible]. It uses -- it has components [inaudible] where it uses a nonlinear least squares solver which is [inaudible] parallel. But the big part, the distributed memory system, is basically the matching problem. So to give you an idea of the kind of thing that we are trying to do, what we are fundamentally trying to do or need to do as a first stage of this problem is that given a pair of images, we need to identify what points in those two images correspond to the same 3D point in the real world. And even when implemented very efficiently, this is still a very compute-intensive problem. If you were to do this for a million images, we're talking about generating at least a couple terabytes of data on disk, about half a trillion pairwise image comparisons. And if you're hoping to do about 10,000 comparisons per second, you're looking at spending at least a year and a half doing this. So there's no way doing sort of all -- comparing all pairs of images is going to be practical here. So the trick is to figure out what are the important -- what are the important places in your image collection to spend time on. The other thing which is sort of very key about this problem is that most of the images are actually garbage. Since we are just going to Flickr and typing the word Rome and downloading everything that comes up, a lot of the images have nothing to do with the 3D structure of Rome. So it's also sort of a needle-in-a-haystack problem where only about 10 to 15 percent of the images that we'll download will eventually have any effect on the 3D structure that we reconstruct. The majority -- the vast majority of images won't be of any use. So it's important that we very quickly remove them from consideration. 
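As a quick sanity check on the numbers just quoted, a back-of-the-envelope calculation:

```python
# All-pairs matching cost for a million images.
n_images = 1_000_000
pairs = n_images * (n_images - 1) // 2        # ~5e11, "about half a trillion"
rate = 10_000                                  # pairwise comparisons per second
years = pairs / rate / (3600 * 24 * 365)
print(f"{pairs:.2e} pairs -> {years:.1f} years at {rate} comparisons/s")
# ~1.6 years, i.e. the "year and a half" mentioned in the talk.
```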
The inspiration for the algorithm that we designed is this match graph for the city of Rome. So we took about 20,000 images and we actually compared every pair of those images to see -- to sort of get a view of what this match graph looks like. So if there are two images which see the same part of the world, then there's an edge connecting them. And as you can see, this is a fairly sparse graph but quite clumpy. And this is very characteristic of tourist datasets because there are tourist locations where people will go and take photographs. And then, you know, they'll walk the street, not take a lot of photographs, then come to the next tourist attraction, take a lot of photographs. But, again, there is some distribution around the tourist locations, but what you see is clumpiness. Which means that if you're able to find these clumps and get a couple of photographs inside these clumps, it's probably quite easy to grow these clumps. And that's the approach that we take. So we have a matching system which is -- which has multiple rounds of matching. And in each round we first heuristically come up with pairs of images that we think are looking at the same thing in the world, and then we go ahead and do sort of this detailed matching between them: say all right, compare the interesting points in this image with the interesting points in the other image and see if they're looking at the same thing. And then there is some geometric cleanup. Those of you who know what RANSAC is, we use some Random Sample Consensus algorithms to geometrically verify that what we are doing is not just two black pixels being matched; that the matches actually geometrically make sense. So our system architecture, it's a two-layer system. It's sort of home brew. There's an underlying layer of Python code, which is a distributed computing engine that we wrote. And the actual matching system is written as an application on top of it. It's actually platform independent. I wrote it on Linux and then I brought it to Microsoft Research and I ran it on the Windows cluster here. It's aimed at doing data-intensive computation. It's very MapReduce-like. And it has extensive support for local caching of data. So we know the entire structure of the computation, so we took care to design operators and caching algorithms which are aware of this. So what sort of performance does this give? So we looked at three different datasets of increasing size. We looked at the city of Dubrovnik, which is a city on the coast in Croatia. At the time of our experiment, there were about 60,000 images on Flickr. And doing the matches there took about five hours. This was about 320 cores. Rome and Venice we did 150- and 250,000 images, and they took nine and 27 hours respectively. This of course raises the question, all right, you're doing this sort of approximate matching; how good are you? So we went back to our groundtruth dataset for Rome and compared our matching results to it. And it turns out that for about a quarter of a percent of compute effort, we were able to get more than 90 percent of the true matches. And there are no false positives here because the detailed matching algorithm that was used is the same in the groundtruth experiment and the one that we are doing. >>: I'm sorry, groundtruth here is just the brute-force algorithm? >> Sameer Agarwal: The brute-force algorithm. >>: [inaudible] >> Sameer Agarwal: [inaudible] checking. We're assuming that that part of the algorithm is good enough that we can treat it as a groundtruth. You're very right. So we had awesome results. 
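Before following the reconstruction process, here is a minimal sketch of what verifying a single proposed image pair might look like, assuming OpenCV's SIFT, brute-force matching with a ratio test, and RANSAC fundamental-matrix estimation. This is not the actual distributed, multi-round matcher; it only illustrates the per-pair "detailed matching plus geometric cleanup" idea, and the thresholds shown are arbitrary.

```python
import cv2
import numpy as np

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def verify_pair(img1, img2, ratio=0.8, min_inliers=20):
    # img1, img2: grayscale uint8 images of a heuristically proposed pair.
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    if des1 is None or des2 is None:
        return False
    # Ratio test keeps only distinctive correspondences.
    knn = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in (p for p in knn if len(p) == 2)
            if m.distance < ratio * n.distance]
    if len(good) < min_inliers:
        return False
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    # RANSAC fit of a fundamental matrix: the geometric cleanup that rejects
    # matches inconsistent with a single rigid scene.
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
    return mask is not None and int(mask.sum()) >= min_inliers
```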
These are -- oh. I should talk about the reconstruction process. So the reconstruction process basically starts with a pair of images. We figure out what sort of 3D construction or 3D world it captures. We add some more points by triangulating the position -- the camera position of those two images. And then depending on which cameras see those points, we estimate the pose of some more images. And then do a [inaudible] bundle adjustment, which is basically just a fancy name for a nonlinear least squares minimization where we try and adjust both the 3D position of the points as well as the parameters of the cameras so that the structure matches the observations made in the images. And we repeat this until we can't add any more images to it. So it's an incremental process. The key compute task here is the bundle adjustment, the nonlinear least squares problem. It's a very large, sparse nonlinear least squares problem where -- let me just describe the symbols. M is the number of observations, sort of the number of points that we see in each image, [inaudible] the number of actual 3D points being reconstructed, so a single point may be seen in multiple images. And N is the number of cameras. And the characteristic of this problem is that M is way larger than P, which is way larger than N. To give you an example of the size for one of the Venice components, we had 14,000 images, so that's N equal to 14,000, P of about 4.4 million points, and M corresponding to about 27 million. So we're talking about -- so if you're using Levenberg-Marquardt, at every time step you're solving a linear system which has about 54 million rows and about 14, 15 million columns. The trick to solving this -- so this linear system has a very nice structure. It's a structure that's present in most [inaudible] problems. And we exploited that to actually reduce the size of the linear system, using a [inaudible] trick, down to something which is basically the number of cameras by the number of cameras. So we bring it down to about 14,000 times 9 -- 140,000 or so by 140,000. And the state-of-the-art software for doing this was not fast enough for our purposes, so we wrote our own. It's designed to exploit all levels of sparsity. It has a bunch of nice bells and whistles for people who care and [inaudible]. And the existing state-of-the-art solver basically only used a dense solver. And our solver implemented both sparse methods as well as preconditioned CG solvers. And in the experiments that we have tried, it's an order of magnitude faster than the solvers which are out there. The largest problem that we have solved is the one that I talked about earlier with about 14,000 images and 27 million observations. >>: What kind of preconditioner are you using here? >> Sameer Agarwal: This uses a very simple block diagonal preconditioner and works quite well. So here are the results for the city of Rome. So for the city of Rome, when you do this matching, we don't know what's in the city of Rome. We just have a big pile of images. We then go back and look at, all right, what do the various sort of clumps look like. So we chose the more interesting ones, and I'm showing you those. So this is the Colosseum. It contains I think about 1,400 images. So the black [inaudible] that you see are the positions of the cameras, and that's the 3D point cloud. So I talked about the interior. So this is the interior of St. Peter's Cathedral. And this has about 2,400 images. So, remember, the original dataset for Rome was about 150,000 images. These are the particular clumps corresponding to the interesting tourist sites that came out of it. 
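As an aside, the reduction alluded to here is the standard Schur-complement trick for bundle adjustment. Written in the talk's symbols (N cameras, P points, M observations), a textbook Levenberg-Marquardt step looks roughly like this; this is the generic form, not the speakers' actual solver.

```latex
% Residuals r_i = u_i - \pi(c_{k(i)}, x_{j(i)}), i = 1..M (two rows each),
% unknowns: N camera blocks c (about 9 parameters each) and P point blocks x (3 each).
\begin{align*}
  \begin{pmatrix} B & E \\ E^{\top} & C \end{pmatrix}
  \begin{pmatrix} \Delta c \\ \Delta x \end{pmatrix}
  &=
  \begin{pmatrix} v \\ w \end{pmatrix}
  && \text{(normal equations; $C$ is block diagonal, one $3\times 3$ block per point)} \\
  \bigl(B - E\,C^{-1}E^{\top}\bigr)\,\Delta c
  &= v - E\,C^{-1}w
  && \text{(reduced camera system, roughly $9N \times 9N$ instead of $(9N+3P)^2$)}
\end{align*}
```

Because C is block diagonal, C^{-1} is cheap, which is why the system collapses to "number of cameras by number of cameras" and why a block-diagonal-preconditioned CG solver is a natural fit.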
Then this is the Venice dataset. This is the Canal. And this started with a quarter of a million images, and this contains I think another 3,000 images on the Rialto Bridge. The biggest reconstruction that we did is the San Marco Square, which is the 14,000-image reconstruction. And... >>: [inaudible] >> Sameer Agarwal: That's a good question. So there are two kinds of [inaudible]. Some of them are actually photographs from the Campanile Tower. The others are actually cameras which are actually not very well conditioned, so we haven't actually removed them from the data -- their position is actually quite uncertain. So sometimes when you're using a telephoto lens, you'll get these things floating in the air. I could have pruned it and shown you a cleaner version, but this is a slightly cruddier version. >>: [inaudible] helicopter shot. >> Sameer Agarwal: There might be a few helicopter shots. So I talked about building Rome in a day. So I don't quite have Rome in a day. What I do have for you is Dubrovnik in a day. So this is the old city of Dubrovnik. And here we were actually able to reconstruct the entire city. It's small enough and there were enough shots both from the ground as well as from a hill that you could sort of match all the photographs together. So this contains about 4,800 photographs, starting from about 16,000 photographs. And this entire reconstruction was done in less than a day. >> Rick Szeliski: Now, you're looking at the point cloud version, but, you know, you could put the photographs in here and do the kind of transitions that Sudipta was showing earlier. >>: What do the points represent? The center of the photograph? >> Sameer Agarwal: No. The points are actually on the surface. So, for example, if I were to image this room, these are points on the walls. These are solid surfaces. >>: [inaudible] these two photographs? >> Sameer Agarwal: [inaudible] photographs, yes. >>: So I'm not quite sure I understand the density of the point clouds. So when you zoom in, some areas are much denser than others. >> Sameer Agarwal: Yes. >>: Is that geometric or is that photo density? >> Sameer Agarwal: It depends both on photo density as well as the amount of texture present. So, for example, if you have a flat-colored wall and you don't really detect very many features, then you can't really match them. We depend on getting sort of discriminative points that we can match across images. So where are we now? Our image matching system is a distributed batch-oriented system. Since we wrote it ourselves, it's not particularly [inaudible]. So if it dies, it dies. The largest experiment that we've done up to now is a quarter of a million images in 27 hours on 500 cores. This is at least 10 to 15 times slower than where we want to be. We are quite confident of getting the 10x. The 15x would require -- I think 10x is [inaudible] in the current system, the next 5x requires more research. Yes. >>: [inaudible] computational estimate on matching painting the images over again? >> Sudipta Sinha: The video that I showed you or -- >>: [inaudible] Rick talked about -- >>: Building a 3D model out of this [inaudible]. >> Sameer Agarwal: So the navigation using the basic Photosynth interface, that's pretty straightforward. That you can do in real time now, once you have this data. Because the key thing that you need there is the position of the camera, and then you can just [inaudible]. But if you want to do something like what Sudipta is doing, we're talking quite a bit of time. Hard for me to even put a number on it. 
>>: Yeah. But you've already done all the least squares heavy lifting, right? >> Sameer Agarwal: Only for the camera -- >>: Huh? >> Sameer Agarwal: We have the camera poses from this. What Sudipta is doing is still getting a very dense reconstruction on a per-pixel basis. That's another order of magnitude work. >>: Yeah, yeah, yeah. So if you have -- >>: So if you treated this [inaudible] then it's an order of magnitude less than doing for every pixel. But it's still a fair bit because you just don't know if the density is enough. >>: [inaudible] nonhomogeneous density [inaudible]. >>: Sure. >>: It should be easy; it turns out not to be. >>: It ought to be easier than that. I mean, you shouldn't have to go all the way back to the least squares of pixels from these least squares of features. I mean, there ought to be some bridge. >> Jim Larus: Let's thank this group. They've done a great job. [applause] >> Jim Larus: [inaudible] get the next [inaudible] going, Dennis Lin. >> Dennis Lin: Okay. Good morning. My name is Dennis Lin. And in some senses I have the opposite issue because the work that I have done and am going to talk about today was work with Mert Dikmen and other members of the Image Formation and Processing Group at the University of Illinois. But only I came here, so I get to talk about everything. Now, when our advisor last got up and talked about computer vision, he basically said -- well, as he said, I quote, we have no idea what we're doing. And in some sense that's true. Because if you look at other parts of, say, image or video processing, we've kind of stabilized on the algorithms. In image compression, we pretty much stabilized on JPEG. You take little blocks, you do a little bit of DCT encoding and you have a compressed image. And you do kind of the same thing with video, except you do some motion compensation and there's some other tricks in there. But in some sense, you know, there is still work to be done. But there's been something that's good enough for the industry to accept and people have the standard that they are working off of. Similarly, in some sense vision and rendering are opposites. And in rendering, industry has largely settled on micro-polygonization. Micro if you're rendering movies, just regular polygonization if you're doing video games. And we don't have that in vision yet. We haven't really gotten to the point where we have a lot of stable algorithms where we all -- that we can claim as our go-to algorithms when we want to solve vision problems. And that's what makes it exciting. The other thing that makes it exciting is that vision is highly parallel. The best-known vision system, of course, is our human brain. A large percentage of our brain is devoted to processing visual input. And it is a highly parallel system. Our neurons are clocked at something like 10 hertz, but we have lots of them and so things go fast. The other thing is that vision is local, which makes it good for -- makes it nice for data parallelism. We tend to worry about objects which are local in space. Now, the next person will tell you how the light bouncing off of this projector is affecting the way my face looks. But when I worry about vision, I tend not to -- when we're doing object detection, we tend not to worry about that too much. We tend to worry about just the couple pixels around that object, and we try to pick out what -- recognize what that object is. 
Similarly, when we describe events, we can describe long-running events, like me pacing back and forth at this talk. But often when we describe events for, say, event detection, we're talking about short-term events because that's just easier for us to understand. So in many ways we're talking about fairly compact subsets of space time that we're trying to recognize. Now, on the other hand, when we're starting to work with vision and videos, we start to run into performance issues. So we have realtime constraints if we're doing -- so we're going to talk about two applications: hand tracking and video event detection. And hand tracking we're trying to use for human-computer interfaces. So we have a real -- a fairly strict realtime constraint. We probably don't need to go all the way to 30 frames a second, but at least, you know, five to ten would be nice. Some -- when we work -- go up to batch processing, things actually get worse. Because then we're expected to be processing hundreds of hours of video. And even then a realtime system would be kind of on the slow side. So if you have a hundred hours of video to process, we really don't want to be waiting two, three days for this to finish. So in some sense we actually want super real time when we move to the batch systems. So first I'm going to talk about my work on hand tracking. The idea is to have a gesture-based interface. And it would be useful for environments where you don't want to be carrying around an input device, maybe a public terminal, maybe an augmented or virtual reality system where it's too cumbersome to carry things around. And it's also a first step towards sign language recognition. To actually complete sign language recognition, we need to be able to track both hands at once and understand something about facial expressions. And this is a hard task. The hand is a relatively small object. We have a data glove that we use to record hand gestures. And making this motion here exceeds the frequency at which I can actually sample. You would actually see the data go from fully closed to fully -- it can go from fully closed to fully open between samples at 80 hertz, which was the sampling rate of our glove. And the nice thing about tracking hands is that you're sure what color it is. It's always skin color and you don't have to worry about clothing, artifacts, wrinkles. The bad thing is that there's nothing to track. You can, you know -- maybe my shirt, you can lock onto my buttons and kind of track me as I move around. On the fingers really by the time -- at the resolutions that you're likely to see, you basically have a featureless blob. So we actually use [inaudible]. Then comes the infuriating part of this. The hand is structured. Obviously the set of all possible hand silhouettes is a relatively small space compared to the space of all possible binary images. But it's not characterized easily by a bunch of linear vectors. Because if we combine two of these, we end up with a blur. And that's because the motion is articulated. There is pattern in the motion. These [inaudible] can only pivot around this joint because there is a joint there. But to characterize this in a nice kind of linear or easy to understand way is just impossible. You end up with a fairly complicated space that you have to work your way through. And the space is large. Your fingers, if you count all the degrees of freedom, have about 20 internal degrees of freedom; you add six to place the palm and [inaudible] about 26 total degrees of freedom for each hand. 
If you think about the rest of your body, not counting the hands, you have about 50 degrees of freedom if you count all the joints. So it's kind of like saying that both your hands together are getting close to the entire joint angle space of your body. So there are a couple of approaches to this. One way is to take your image and do a bunch of regression or some database search or some other method and go straight from here to here. So go from the image to the joint angles, which is what we're ultimately after. We want to know the state of the hand. The other approach is to do analysis by synthesis. We propose a set of joint angles and we synthesize the corresponding hand. And then we compare it with what we actually see in the camera. And then we do some twiddling. We propose something else. And we update this and we go around the circle until this image matches that image, which means that these joint angles are right. Now, I've drawn this as a kind of simple circle. In order to get this to work, you actually need to do lots and lots of candidates at once -- tens of thousands, if not more. And the other important part here is this error computation. If you just use sum-squared difference, you run into the issue that the error goes in the wrong way. So you want to decrease error when you're solving. Suppose that green is what we see in the camera, blue is what a current candidate is proposing, white is the overlap. So in this case the camera has a thumb sticking out but our virtual thumb isn't quite out far enough. If we move the thumb further out, we actually increase the error, if we just use a naive sum-squared difference metric. The reason is that we're exposing more of the thumb without reducing the error associated with the camera. And, conversely, if we just shrink the thumb back into the palm, we actually reduce the error. This is the wrong direction to go. And if we just use this kind of metric, we wouldn't be able to actually converge the hand. The basin of attraction becomes much smaller. Obviously, if we continued doing this, it would eventually go down to the minimum. But getting there would be difficult. So what we need to do is actually use chamfer distance. And to do chamfer -- so this is a little bit technical, but the key point is that if you look at this part here, the error associated with this point here is its distance to the closest blue. So this means that if we try to tuck the thumb back into the palm, the error associated with this becomes greater because it's now really far away from any pixel in our rendered image. And so, conversely, if we move the thumb closer, this error goes down and that compensates for the fact that we're adding error here. >>: [inaudible] every pixel? >> Dennis Lin: Yes. And actually that's one of the keys -- so, yes. We sum independently over each pixel and it's the actual total sum that we care about. That's actually one of the important things that we worry about later when we try to accelerate this. So in order to achieve chamfer distance, we actually need to perform a distance transform. Now, often what people would do in the literature is they would perform an exact chamfer distance transform on the camera image, because that's easy. That only comes in 10 or 30 times a second.
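To make the contrast between the two error metrics concrete, here is a minimal Python sketch of a chamfer-style silhouette error of the kind described above. This is an illustration only, not the speaker's implementation: it assumes binary silhouette masks and uses SciPy's Euclidean distance transform rather than a GPU renderer.

```python
# Hypothetical sketch of the chamfer-style silhouette error described above.
# Assumes `camera` and `rendered` are boolean numpy arrays (silhouette masks)
# of the same shape; not the speaker's actual code.
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_error(camera, rendered):
    # distance_transform_edt gives, per pixel, the distance to the nearest
    # zero, so passing the inverted mask gives distance to the nearest
    # silhouette pixel of the *other* image.
    dist_to_rendered = distance_transform_edt(~rendered)
    dist_to_camera = distance_transform_edt(~camera)
    # Camera pixels far from any rendered pixel are penalized, and vice
    # versa.  This is why tucking the virtual thumb into the palm increases
    # the error instead of decreasing it.
    return dist_to_rendered[camera].sum() + dist_to_camera[rendered].sum()

def naive_ssd(camera, rendered):
    # The naive metric for comparison: it only counts non-overlapping pixels,
    # so it rewards hiding the thumb rather than extending it.
    return np.sum(camera != rendered)
```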
And they do some kind of approximation, a database lookup or something, for the other direction, because that distance transform you have to do tens of thousands of times -- well, maybe a couple thousand times per frame. So this is what we want. This is a simplified diagram. If we have two rectangles that correspond roughly to parts of our hand -- so just imagine that these are two bits of our hand -- we want to compute the distance transform. And if you're rendering on a graphics card, you can do something similar. You can render this image. So we have the same blue regions. And this doesn't extend forever, but we don't need that; we only need a little bit of a buffer. And it's a little hard to see, but you can see that it goes from red to yellow, and it does the right thing in the middle where these two roughly meet. So this isn't a circle. It's kind of a rectangle. We can fix that by adding more polygons. But this is a close enough approximation for most purposes. And the key to this is that we can render this per candidate, so we can get a full distance transform for every generated image. And the way we do this is that when we render every polygon as part of the hand, we also render a bunch of additional polygons at the back of the z-buffer. And we can use the z-buffer to disambiguate which part should go in front. So this is the first pass that I made at making the hand tracker. And the nice thing about using OpenGL for rendering is that, well, we're actually rendering a hand. So this isn't like traditional GPGPU, where you render kind of an image, a square or rectangle, and you don't actually ever expect to look at it. We actually want an image of the hand to compare to the image that we have from the camera. And so in that case, OpenGL gives us nice things: a perspective camera, hardware polygonization, the hardware z-buffer which gives us a distance transform, mipmap-based zooming, perspective-correct camera angles. The problem is that I started running into bottlenecks. Nobody plays games at a thousand frames a second, and I want tens of thousands of these things per second. Also, if I were writing a multimillion-dollar game, the people at NVIDIA would rewrite their drivers so that my code ran fast. But I'm not, so my code runs slow. And OpenGL is a very flexible interface, but it also gives the driver implementers a lot of flexibility, so I can't ever quite figure out what I'm doing wrong or how I can make things go fast. Also, the exact kind of shading and drawing I'm doing is not a perfect fit for modern graphics hardware. But the biggest bottleneck I found was the fact that I needed to actually render the image. If I render the image, I need to actually write it into memory. And that turned out to be a bottleneck because I don't actually care about the error image; I care about the sum of the errors. And if I can get to the point where I don't need to do all those memory writes, and then the read back in, then I would actually get a significant performance advantage. So I ended up writing my own software renderer, which is different from OpenGL. And since I'm dealing with relatively simple geometries -- I'm only rendering silhouettes of hands -- I based it off of a simple [inaudible] which is basically distance to the line segment.
And so each of these bits of finger is a little line segment, and then you compute -- everything inside a certain radius of that line segment is considered inside the finger, and everything outside is outside. But since each time we compute this inside/outside test we also compute the distance, we can compute the distance transform on the fly as we go. So with the software renderer -- oh, I'm sorry. We implemented it in C++ and in CUDA. There's actually also a Python version for reference. And so the CUDA version has the advantage of avoiding the writing to memory. So CUDA gives you access to the small on-chip cache of a GPU, and what we do is render into that cache and then only write out the final results, because all we care about is just the sum; we don't care about the intermediate values. So these renderings are all 128 by 128. So relatively small in terms of actual rendering, but I want lots of them. And it's actually bigger than the on-chip cache, which is only 16 K. So you have to do some tiling. So here are the speed performance results. So the fastest -- actually, if you pull a couple more tricks, I've gotten this a little bit higher, maybe closer to 10,000 comparisons per second. And okay. And this is actually a pretty good speed. It's still not quite where I want it to be, but it's reasonable -- you start getting reasonable results when you are talking about 25,000 comparisons per second. Don't look too hard at this number, and really don't look too hard at that number. I didn't put in anywhere near as much effort in optimizing this. This is actually running on Mesa, so just the slowest possible OpenGL implementation in existence. This is semi-reasonable. The image I'm rendering should fit in L2 cache, so it should be okay. But I didn't take any special care to make sure everything fits in L1 cache. And I'm not using vector instructions like SSE. So I can probably get another order of magnitude, maybe another 10x or so, on this if we pushed it. And this is using a single core -- a single Core 2 core. And even though this is a dual-GPU card, I'm only using one GPU for this number here. And let's look at results. So here's tracking the hand. And it does a reasonable job. I've actually sped the video up 2x relative to what the live output would be. Or, in other words, the tracker actually runs at half the speed that you're currently seeing. When you're doing tracking, it's not exactly fair to talk about frames per second. So this is displayed at ten frames a second, and the tracker runs at five frames a second. But in some sense it's all about latency, because if you have a thousand frames a second and you can keep up with that somehow, your tracking becomes a lot easier because nothing changes from one frame to another. >>: [inaudible] >> Dennis Lin: Yes. Oh, I'm sorry. We use two cameras because we're only using the silhouette. And if you have a situation where you do this, it becomes impossible to see where the fingers are. Whereas, if you have a second view you can get some guesses. So you can see when I get to the -- kind of the -- >>: The little nonblue artifacts, red, at the right side, for example, what are those? >> Dennis Lin: I think those are just artifacts. So the -- >>: Errors or something? >> Dennis Lin: No. That's the other rather important thing.
I don't really have groundtruth for this because I don't know where my joints really are, and I'm not about to sit in front of an x-ray to actually figure this out. I mean, I've seen one paper where they took a bunch of cadavers, inserted a surgical wire, moved the joints around and then took x-rays of that. But there's a limit to what I'm willing to do for science. [laughter] >> Dennis Lin: So that's our first application. Our second application is video event detection. And this is a case where -- this is part of the TREC evaluations sponsored by NIST. TREC originally was a text retrieval conference, and then they had TRECVID, which is about video processing, and they took a bunch of, say, BBC video and tried to annotate it. And then they added surveillance event detection. And surveillance event detection is basically that we want to detect events. So here's a person putting a cell phone to his or her ear. There are two people hugging up here -- embrace. And then -- so this is London Gatwick Airport. There are five different camera views; one was of an elevator, which isn't that interesting. Oh, sorry. First is a person putting down an object. People aren't supposed to walk through those doors; there's a person pointing there. And there's going to be one person taking a picture. So that's a sample of the events that we are supposed to be detecting. There are ten of them in total. We didn't actually do all of them. We did a reasonable number of them. So like I said, this is surveillance, but it's from London Gatwick Airport: five stationary cameras, 50 hours of training, 50 hours of testing. That was 2008. In 2009, this year, they gave us this also as training, and they're going to give us 40 more hours as testing. And so if you add this all together, we have 9 million frames to process. And this is our overall pipeline. It's actually fairly straightforward. We take our video, we do some background subtraction to subtract out completely uninteresting parts, we extract some features. And then we just use nearest neighbor as our classifier. So we take the features that we extracted and our canonical features from the database, and we do lots of comparisons, and then we output some values. And we do some post-processing filtering for the output. The key here is that we generate lots and lots of data in this step, because we have on the order of maybe a thousand windows per frame by the time we're done scaling, and then we generate 600-value features out of this. >>: [inaudible] >> Dennis Lin: We actually used [inaudible] motion features. So it's actually a fairly simple feature, which probably is why it doesn't work that well. So we take optical flow on a bunch of frames. So this is a motion feature, so we're working our way through time. And we take the optical flow image and we separate it into positive X, minus X, positive Y and minus Y. So we sum all the positive Xs and we sum all of the negative Xs and we get two discrete values. That way it lets us understand if something's waving back and forth, as opposed to standing still. Because if we just added everything together, it would average out to zero. And then we just concatenate these together. And there are some parameters on how many frames to take and how finely to subdivide each region and so on, so forth. But we ended up picking something that had about 600 values per window. And a window spanned about a second in time.
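As a rough illustration of the rectified optical-flow feature just described, here is a minimal Python sketch. It assumes dense flow fields are already available (for example from OpenCV), and the grid and window sizes are made-up parameters for illustration, not the ones the group actually used.

```python
# Hypothetical sketch of the motion feature described above: within each
# spatial cell, sum positive and negative flow components separately so that
# back-and-forth motion does not cancel to zero.  Grid/window sizes here are
# illustrative only.
import numpy as np

def motion_feature(flows, grid=(4, 4)):
    """flows: list of (H, W, 2) optical flow fields spanning roughly one second."""
    gh, gw = grid
    feature = []
    for flow in flows:
        h, w, _ = flow.shape
        for i in range(gh):
            for j in range(gw):
                cell = flow[i * h // gh:(i + 1) * h // gh,
                            j * w // gw:(j + 1) * w // gw]
                fx, fy = cell[..., 0], cell[..., 1]
                # Four rectified sums per cell: +x, -x, +y, -y.
                feature.extend([fx[fx > 0].sum(), -fx[fx < 0].sum(),
                                fy[fy > 0].sum(), -fy[fy < 0].sum()])
    return np.asarray(feature)  # concatenated over cells and frames
```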
So if we look at the processing involved in this, this is the version that we kind of used before. It was dominated by the feature extraction and the pairwise distance. The optical flow and the image decoding were insignificant. We then ported these two parts to the GPU. And as you can see, they're now these two pieces here. So we significantly shrunk those two components of the processing. We added this here, this component for transferring to the GPU. But we've made significant strides in actually improving the overall speed. So here's the per-component timing. We didn't actually use a GPU optical flow. We actually used the same code, so that's why these times are the same. Like I said, we added some time because we needed to transfer things to the GPU. But we got a massive speedup in the feature extraction and the pairwise distance. Again, don't look at this too closely. I spent about a day writing both of these versions, like one day each. So they're not very well optimized, and those numbers can come down significantly. And then on to results. So this is the sad part, actually. So if we look at it -- take the pointing example. There are about 2,000 actual pointing instances. We managed to find about 200 of them, but we gave about 30,000 false positives. That's actually pretty bad. That gives a score of 3.6. Zero is ideal. One is if you submitted nothing. So as you can see, not too many people actually broke 1 for this. I mean, it's a really hard vision problem. It's one of those things about computer vision: the human brain, which is really good at these things, can detect pointing no matter which way I'm facing, if I'm pointing down here, pointing up here. Getting computers to do it, we're still not quite there yet. We did do better on -- for some tasks we actually used a specialized detector. So remember the opposite-flow event, when people are walking through these doors the wrong way? Well, what we did is we used a 3D Gabor filter which detected motion flowing in this direction, so we used a Gabor filter bank. And then we detected the peaks in that and we produced output. And that actually worked reasonably robustly. So in some sense, if we can understand our problem and we have some chance to fiddle with it, we tend to do better. So here we actually broke 1. We didn't do the best. IntuVision clearly did much better, because they got 9 out of the 12 true positives and only returned 12 false positives. So I think the key takeaway here is that if we have some chance of tuning our algorithm or adapting our algorithm to the task at hand, we tend to do significantly better than if we try to just apply a general purpose algorithm. And even if we are doing general purpose algorithms, we need some way of experimenting to try to tune the parameters. Which brings me to a short plug for the framework we actually built. Mert isn't here, so I can tell you that this is actually the second implementation of the system. The first time we tried it, it was written basically purely in C. And we had lots of issues with it -- basically it turned into one giant ball of mud and it was basically impossible to extend. So we decided to start over. And it's a Python framework. And the idea is to take care of a lot of the grunt work of reading in video, identifying the video, pulling in the video labels, pulling in the location labels.
And then we implemented a bunch of the features. We borrow heavily from OpenCV and various libraries. And we also just incorporated an optical flow library, a GPU-based optical flow library. I haven't used it yet, but I'm told it's actually pretty robust. So here's a Web site if you ever want to go and visit, submit patches. And I think that brings me to the end of my talk. The key point I want to emphasize is that having faster computers helps us get our job done. I don't think the hand tracking the way that I've proposed it is feasible without the amount of performance I'm getting out of the GPUs right now. Also, for event detection, it doesn't necessarily help us get good results. But in some sense, letting us get bad results faster lets us improve faster. And, you know, like I said in the beginning, there is no go-to algorithm for computer vision, say, in event detection. So we're definitely on the cutting edge of research here. And having more processing lets us explore our space better and lets us try new things, especially if we're going to be dealing with, say, 40 hours of testing data -- or if we're doing tens of hours of testing data, we actually want, say, super real time evaluation to see how our algorithms are doing. And I think at the same time we also need a system which lets us do that. We need something that's actually fairly flexible and can adapt to changing conditions and changing hardware and changing algorithms. So I think that concludes my talk. Are there any questions? Yes. >>: If you're running the security analysis in real time, can you bring up the accuracy? >> Dennis Lin: I imagine not. Because, I mean, we only make one pass through the data in our testing phase. We do do a little bit of post-processing. So, if anything, it would bring down the accuracy. At some point if we're doing buffering, then it probably becomes equivalent. Again, the events themselves are local. So if we buffer enough video, it's just like doing batch detection. So unfortunately there are more significant flaws in our algorithm as currently implemented. We're actually going to be trying different features. Since we have a Gabor feature, we're going to do a Gabor filter bank. Maybe that will work. And we're trying some other ideas. And hopefully we can get better results for this year. Yes. >>: What's your [inaudible] rate? >> Dennis Lin: So the video is PAL at 25 frames a second. We actually didn't have the resources to process that, so we actually skipped to every fifth frame, so we were only processing a fifth of the frames. And I think it was taking like a minute to do a frame. So we were running a cluster. When we did this, we were running on a cluster of CPUs because we didn't have the GPU code up and running. Let's see. The new numbers. So once we have the new numbers -- okay. So that's in milliseconds. And that's basically the total execution for one frame. So I think we're actually getting to the point where we're doing a second a frame. Or let's see. If we add these numbers together -- about 200 milliseconds per frame. So we're doing about five frames a second with the GPU version. >>: [inaudible] >> Dennis Lin: Well, the optical flow is just OpenCV's implementation. So at our frame size it looks like it was running about 80 milliseconds per frame. >> Jim Larus: Well, let's thank our speaker.
[applause] >> Andrew Wagner: Okay. Great. So I'm sort of in the same position Dennis was in, in the sense that I've been collaborating with a bunch of great people at the University of Illinois: my advisor, Yi Ma, and particularly my lab mate, John Wright. And I get to present a whole line of research that we've been doing on face recognition. So why face recognition? Face recognition is one of these tasks that people have been working on in computer vision for a very long time but that still hasn't reached a very large level of adoption out in the real world. There doesn't seem to be any really huge technical reason why we shouldn't be expecting our cars to just have little cameras in them that recognize us, know who we are and let us in. Same with our houses. But it's still not working, for some reason. So, in any case, I'm going to go into some depth on some of the progress we've been making towards getting face recognition to work for the access control case. And all of this is going to be for just a single test image. We're not doing anything with video. So why is face recognition desirable as a general technique for access control and recognition? Well, it's noncontact. You don't have to touch anything. The measurement can happen instantaneously, if you're just taking a single picture of the person. So currently we're limited by the time it takes us to actually process the one image we take. No really special sensors are needed. With iris recognition, you have to stand pretty much right next to the thing to get it to work. And attacks are also very conspicuous. One of the applications in the real world that it does have -- I think it's Toshiba that has a laptop with face recognition software built in. But people have found that if you hold up a picture of the person, that's enough to defeat it. But if you're doing that outside of a building, it's going to be pretty conspicuous that you're waving around a photo of someone, right? So that's a plus compared to, you know, other techniques, where if you're swiping a card it's not obvious if you're holding someone else's card, for instance. And also recognition could be fully automated as well. The dream is that you walk up to the door, it recognizes you and just lets you in. Another couple of applications where face recognition is beginning to be adopted is in people -- I'm sorry, is there a question? Okay -- is in people sorting their photos on their home computers. So this is a pretty low-stakes and easy application, because, number one, you have a small number of people. Since you're just sorting photos of your kids, no one's going to die if you have a crummy recognition rate with it. And last of all, you can kind of mask the poor performance of the face recognition algorithm if you have a clever user interface and you just say, oh, we're going to suggest some potential matches. It doesn't have to have a very good recognition rate to be useful. However, face recognition hasn't gained much adoption at all for security applications. After 9/11, there were some very high-profile public trials where they tried installing very expensive face recognition systems in public places, like in an airport, and they also installed it in a football stadium, and it pretty much failed. It used up far more of the employees' time than it was worth. And it was only working, you know, at best half the time. So why is face recognition difficult?
It turns out that people's faces look very different under different illuminations. Humans are very good at this just naturally -- it's a question for the psychologists to figure out exactly how the human brain does it. But we have some sort of 3D representation of people's faces and we know what they look like under different illuminations if we have seen that person a whole lot before. Humans actually aren't very good at face recognition if you've only gotten one glimpse of a person or you only have one image of a person, or if you're searching over a lot of people. And this is a pretty common experience. This is my first time in this town, and just walking around, I kept seeing people -- oh, I know that person, oh, no, I don't, no, I do -- you know, I kept seeing people I thought I recognized. But, again, even humans' face recognition isn't as good as we think it is. And, again, the other things that make face recognition difficult are that, obviously, if you're just looking at a single image of a face, it looks totally different if the pose is different. And also you may not always get a clean shot of a person's face. They may be wearing a hat. They may be wearing sunglasses. They may just have their mouth open, and that creates an area similar to an occlusion. And some of the classical algorithms don't do very well on this -- techniques like Eigenfaces, LDA, nearest neighbor. So the line of research in our lab is based on an idea of searching for a sparse representation of your test image in terms of your training images. And I'll go into more detail about what I actually mean by that later. So first I'm going to talk about how we actually take training images. You need to take images of the people to build a database of what they look like. And it turns out that it's very sensitive to how you take those images. You really have to do a good job of that. Also we have to do a very good job of automatically aligning the test image to the training images. And I'm also going to give a quick overview of the techniques we're investigating to make this happen in a reasonable time frame. So just a quick single example of the size of data that I'm talking about. The full thing is the full image, roughly as it comes off the camera. And these are the portions of the image that we actually use for recognition. And the alignment is defining the mapping from a 60-by-80 image back to the full resolution image. These rectangles in the other space are 60 by 80 pixels. So if you have poor alignment, that's what this black rectangle is. And it doesn't have to be shifted very much to make a difference. If the alignment is poor, then when you find your representation, the coefficient for the correct user is often lower than the other ones. So that's a fail. You'll see what these actually mean later. And similarly, if you use good alignment but you don't have a large enough set of illuminations in your training set, then you also often get a very low coefficient for the user that you care about. But if you nail them both, if you have a very good alignment, the white box, and you also have a very good set of training illuminations, you can do recognition very accurately. So a little more on illumination.
So a lot of the classical assumptions that people make about objects are, first of all, that they're convex. This simplifies things a lot because you can't have any shadowing from one part of a person's face to another. They assume a Lambertian reflectance model. And they assume that the lighting of the object is distant. And if you make all of these assumptions, there's a really elegant proof that just looks at the spherical harmonic basis functions for illuminations. They've shown that you can get away with basically a nine-dimensional subspace. If you just have the right nine images of a person's face, then you could, under those classical assumptions, do a very good job of representing that person's face. Unfortunately there's a catch here, and that's that most of those classical assumptions just don't really hold at all. So here's an example of a testing image: you can see we have shadowing in a number of places on our faces, we have specularities in our eyes, on our nose, forehead. However, one nice thing that we have is that there is still this linear relationship between the training images and the testing image. Because your camera's effectively counting photons, the superposition principle still holds. So our strategy in our system is to experimentally determine what illuminations we actually need to keep around. So sort of the traditional way that has been used for capturing these recognition images, at least for research systems, is to either build a big dome or build a big array of camera flashes around your user. But you can probably get an idea: that's a pretty major undertaking and it hasn't been done really all that many times. Most researchers in this area just operate off of public datasets. It's difficult to reconfigure. It doesn't scale well in the number of flashes. If you want to double the number of flashes, you have to pretty much start over with your hardware. And you have to build that whole thing to get very good coverage and also have the illumination distant. So I said there's got to be a better way. So the way we came up with is to use projectors to indirectly illuminate the person's face. It will end up being a lot easier to reconfigure. It will be easier to change the illumination patterns. And it will be easier to construct, deploy, and still get good angular coverage. So the hardware setup looks like this. We have an array of DLP projectors, just because they have a good contrast ratio, shining light on the walls -- we're not shining light directly on the user -- that indirectly illuminates a user sitting in a chair. And we have cameras that are synchronized with the projectors to take images. So just to give you an idea that we're getting very good angular coverage: if the user is sitting here, I sort of diagrammed out the beam, so we're able to illuminate their face from below as well as from above vertical, or past vertical even. And, again, horizontally, we have them close enough to the corner relative to the illumination that the widest part of the light is actually hitting them from past horizontal. So now we have all this flexibility. We can generate pretty much any illumination on the person's face that we want to. And we need to restrict this problem a little bit. So the technique that we use is we just chose a pretty fine sampling of illuminations in the illumination space.
And then we iteratively chose subsets of them to try and figure out which ones we actually need. And as you can imagine, there are diminishing returns when you do this. >>: Single source? Single light source is what you're simulating? >> Andrew Wagner: Yes. We're using a single light source at a time, and we're taking a bunch of them in rapid sequence. So we end up with a database of the person's face under different illuminations. So, in any case, there are diminishing returns in the number of training images that you take. Pretty rapid falloff. We ended up settling on just grabbing that point, 38 training images. Again, much larger than the theoretically predicted nine training images. And there are also diminishing returns in terms of the angular coverage. So the subsets we were choosing were increasing radially, and it turns out -- and this is pretty intuitive -- that illuminations from directly behind the person's head are not really so important. You still do need illuminations from behind horizontal, because light coming from behind you can still hit a portion of your face that's visible in the camera. And that's sort of the test for whether a point light source can affect the image. So, again, there are two configurations that we take images in. We take some with the user facing the wall, so we get all the frontal illuminations, then we flip them around in the chair and we take another set of images, for a total of 38. Anyone have any questions about this configuration? Okay. And, again, we're not shining light directly into the user's face. >>: So how long does it take for [inaudible]. >> Andrew Wagner: Right now it's taking a few seconds to capture all of these images. But just because the synchronization between the cameras and the projectors could use some work, you could probably get that under a second if you had access to the hardware. Like I haven't taken apart the projectors and gotten access to the clock for the DLP chip. There are people who have done this for doing realtime -- if you're building 3D models of a person's face in real time, they've done this, where they take apart the projector and use the clock from the chip to really synchronize the projector with their camera. So now that we've got this great database of training images of a person's face under different illuminations, how do we go ahead and align the images? So as I said before, you know, just touching base again: if you have bad alignment, it doesn't work. If you have insufficient training images, it doesn't work. If you get them both, it's right. So now we're on to the second thing. So what do we do with our data? We take our training images, those are all at the top, and right now we're performing a manual alignment on them. Since we take them in rapid succession, it's not as difficult as it seems. You only need to click features twice for each person who sits down. But we're also working on automating that. So in any case, we manually get a good alignment of all the training images to each other. We throw out the color information and we stack up the pixels into columns of a matrix. So for each user we have a matrix of data that's the number of pixels tall by the number of training images wide. So this is an extremely tall matrix. And we have one of these for each user.
Even when we concatenate them into one big global data matrix, this is still a very, very tall matrix in general. So how do we actually find this representation? The assumption that SRC makes is that our testing image is a weighted superposition of our training images by the coefficients X, and we also assume that we have some sparse error. So some pixels on the person's face could be just flat wrong due to some reason, usually occlusion. However, that model assumes that the images are aligned. Now, if you include alignment in the optimization, then it unfortunately breaks the convexity of your problem; in particular, the way we have it formulated, it's breaking the convexity of our constraint set. So what we can do is basically linearize our constraint and iterate between solving the system and recomputing the linearization. Furthermore, as a performance optimization, we can get away with performing a similar optimization on a per-user basis. So we align the test image to each individual user's set of training images first, and then after that we can perform a global optimization. Okay. Again, just reiterating. Since we've linearized our constraint set, we have to solve this optimization problem. And then we recompute the Jacobian of our error function with respect to the parameterization of our transformation, of our image warping. >>: [inaudible] your constraints will be satisfied? I mean, you can [inaudible] but it doesn't guarantee that your constraints will be satisfied. >> Andrew Wagner: Yeah. Exactly. I mean, for the convex problem, you're guaranteed to converge to something if you set up your optimization correctly. But you're absolutely right. With this iterative alignment, it's possible, if the person's face starts off with a far enough deviation, that it will converge to something completely different. >>: What I was thinking was that this is the same kind of thing that you do for things like sequential programming [inaudible] putting a loop outside this linearization which controls the linearization itself. And you need to test how much you progress along the objective as well as the constraints and then make sure that one doesn't sort of bounce to the other [inaudible]. Just doing this would only work very, very [inaudible]. >> Andrew Wagner: Okay. I'll look into that and we can talk offline if you want. So he's absolutely correct. This does pretty much only work locally. But the good news -- well, I'll get to the region of convergence results later. >>: If the person is sitting [inaudible]? >> Andrew Wagner: In between the frontal set of images and the rear set of images, there's alignment to be done. But that's alignment within training images. Right now I'm talking about alignment of the testing image to the training images. The testing image comes in and it has some completely unknown alignment. So what we do, after we perform this per-user alignment, is we invert that transformation that we found and we apply it to each training user's set of images. So now we have a new global A in which, at least for the correct user, the training images are aligned with the testing image. And we perform a global optimization that's pretty much exactly out of the SRC paper, where we're, again, minimizing the L1 norm of our X and our error subject to the constraint. And this is an important thing: we use the coefficients in our representation for recognition.
You expect to see large coefficients on the correct user. >>: I notice your error metric is based on an L1 norm throughout. Any intuition about why you think that's a good idea? >> Andrew Wagner: Well, so the L1 norm is a convex relaxation of the sparsity of these vectors. So we know that in this step X should be sparse -- at least we know that there exists at least one solution where only a small number of the coefficients in X are large. Because A is 38 times the number of users you have wide, but we only expect to see energy concentrated on the images for the correct user. And E is sparse because the errors, at least large errors like occlusions, tend to be spatially localized in the image. Only a small subset of the pixels will be completely corrupted. Does that answer your question? Okay. So, again, an overview of the algorithm. We find some alignment between our testing image and each of our sets of training images for each user. And as a performance optimization, we find that we can get away with just keeping a subset of those for the final recognition step, which I showed on the last slide. Okay. Now, back to address your question about alignment a little more. We computed the region of attraction of our alignment algorithm by synthetically perturbing our test image by different amounts and computing the actual full recognition rates at each point. And it turns out that the alignment works for up to roughly about a quarter of the width between the outside of your eye corners, which, if your face is misaligned by that much, is actually a pretty significant misalignment. And it's able to converge even for a dramatic perturbation of angle. You can be almost -- you know, a little past 45 degrees and still have it align. And just a comparison of our performance: if we use the L2 norm, we found that there are a lot of cases where it doesn't converge to a good thing. And in this case we're using a perspective transformation as our class of image warpings, so the rectangle there is -- you can see it's skewed. But it actually does converge to something much more reasonable if we're using the L1 norm. And although our algorithm is designed and really only meant for images of training users taken from one viewpoint that you know -- like right now we're always using frontal images -- we found that it does exhibit at least a little bit of robustness to out-of-plane pose variation. So that's just a little bonus thing. That's not a big result. There are a lot of people working on performing face recognition really in the general case, where they're building three-dimensional models of the person's face so they really can recognize them from a lot of angles. We're not doing that. And just some results to give you an idea of how we're doing on public datasets. So we pretty much clobber the traditional subspace-based face recognition methods. On recognition we're succeeding in getting recognition rates above 90 percent. And for detection -- so recognition is where you're answering the question of, you know, who's in the image; detection is just is there a person in your dataset -- we clobber them as well. Our algorithm performs a lot better. And then, again, the diagonal in this ROC diagram is random. So if you're just giving random results, then you get a diagonal. >>: [inaudible] flip meaning true positive false. >> Andrew Wagner: Right. Okay.
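To show the shape of the sparse-representation step discussed above, here is a minimal Python sketch of an SRC-style recognition pass: the test image is modeled as a sparse combination of training images plus a sparse per-pixel error, and the user is chosen by residual. It is an illustration under stated assumptions, not the group's implementation -- it uses a simple iterative soft-thresholding loop on the augmented matrix [A, I] rather than the interior point solver mentioned later, and the parameter values are made up.

```python
# Hypothetical sketch of SRC-style recognition: y ≈ A x + e with sparse x
# and sparse e, solved here with a toy ISTA (iterative soft-thresholding)
# loop on the augmented dictionary [A, I].  Not the speakers' actual solver.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def src_recognize(A, labels, y, lam=0.01, iters=500):
    """A: (pixels x images) matrix of stacked, normalized training images.
    labels: per-column user id.  y: test image as a flat vector."""
    m, n = A.shape
    B = np.hstack([A, np.eye(m)])          # augmented dictionary [A, I]
    w = np.zeros(n + m)                    # w = [x; e]
    L = np.linalg.norm(B, 2) ** 2          # Lipschitz constant of the gradient
    for _ in range(iters):
        grad = B.T @ (B @ w - y)           # tall matrix-vector products
        w = soft_threshold(w - grad / L, lam / L)   # per-coefficient shrink
    x, e = w[:n], w[n:]
    # Classify by per-user residual: keep only that user's coefficients.
    best, best_res = None, np.inf
    for user in np.unique(labels):
        xu = np.where(labels == user, x, 0.0)
        res = np.linalg.norm(y - A @ xu - e)
        if res < best_res:
            best, best_res = user, res
    return best, x, e
```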
And so some of the examples where our algorithms can still have trouble. So for the Multi-PIE dataset, the training images were taken in a different session. They actually had everyone come back in on a later date. And it turned out that a lot of things had changed in people's faces. Some people dyed their hair. Some people were wearing glasses in their testing image that they weren't wearing in their training images, and vice versa; some people took off their glasses, changed their makeup, grew a beard. So there are still a lot of challenges even once you've taken care of illumination. And here are just some experiments on a dataset that we took in our lab. Most of the public datasets have rather insufficient testing images. They take all these images under the illumination rig, but they don't have time to take people outside and to different entrances in your building and really collect a database of people under different illuminations. So we did that in our set. So in our dataset, we have much better training illuminations because we have our nice carefully controlled acquisition system. But we have much less controlled testing images as far as illumination goes. And we're still able, again, to get above 90 percent recognition rate on the reasonable cases where it's just normal people's faces or optical eyeglasses. Our performance does begin to fall if they have occlusions like sunglasses. Okay. So we're able to get very good recognition rates. But there's a downside to this. The recognition takes a long time. Right now it's taking about five minutes -- well, between two and five, depending on the number of training users you have in your dataset. Right now we're operating with about a hundred training users with about 38 training images each. And really to be useful for access control, it needs to happen in under a second. You don't want to be standing there waiting for the computer to crunch before it lets you in the building if it's raining. And there are a couple of areas where we can try to get the speedup. So there's this level of coarse-grained parallelism that I already talked about where we're doing per-user alignment on a -- >>: So how much would having a wet face increase your error rate? >> Andrew Wagner: Due to specularity? >>: Due to rain. >> Andrew Wagner: Due to rain. >>: Yes. >>: It's important in Seattle. [laughter] >> Andrew Wagner: Well, I mean, that's actually a very good point. So one of the big things about a person's face that changes when they get wet, either because of sweat or because of rain, is that you get a whole lot more specular reflection. And that's something that, if you have an insufficient set of training illuminations, you're much more susceptible to failing on. So having this larger set of 38 training images already captures some of the information about how your face responds to specular reflections. You can't really do a perfect job, but you can do better. And also the real challenge will be to achieve very efficient fine-grained parallelism of this. So our algorithm gets very good results. It's able to make good use of global information in the image. We haven't had to chop up our image to operate on subsets of it at all. So we haven't sacrificed any of the global information of the image. It's conceptually very simple, but computationally it's very expensive. We're doing a bunch of linear algebra on very tall matrices.
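To make the scale of those "very tall matrices" concrete, here is a minimal Python sketch of assembling the data matrix the way the talk describes it: grayscale 60-by-80 training images stacked as columns, one block of columns per user, then concatenated and normalized. The image size and the 100-user / 38-image figures come from the talk; everything else (names, normalization details) is illustrative.

```python
# Hypothetical sketch of building the global data matrix described earlier:
# each 60x80 grayscale training image becomes one unit-norm column.  With
# ~100 users x 38 illuminations this is a dense 4800 x 3800 matrix.
import numpy as np

def build_data_matrix(users):
    """users: dict mapping user id -> list of (60, 80) grayscale images."""
    columns, labels = [], []
    for user_id, images in users.items():
        for img in images:
            col = np.asarray(img, dtype=np.float64).reshape(-1)  # 4800 pixels
            col /= np.linalg.norm(col) + 1e-12                   # normalize column
            columns.append(col)
            labels.append(user_id)
    A = np.stack(columns, axis=1)          # pixels tall, training images wide
    return A, np.asarray(labels)
```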
And these are the reasons why we basically can't get away with using off-the-shelf L1 representation algorithms. So our arrays are extremely tall: the number of pixels you have by the number of users. We have some domain-specific constraints. Since illumination is positive, we constrain our coefficients to be positive as well; otherwise there's a danger of overfitting. >>: So then why won't the number of users grow a lot? I mean, more people than pixels? Or why is that not likely? >> Andrew Wagner: It could. Right now we're focusing on access control. And usually there are only so many people who work in and should be in a given facility. In the building I work in, it's 300 people. But, yeah, exactly. And that is another direction in which you'd like to scale up. And right now our main idea for that is doing a per-user alignment. Right now most of our computation happens in the alignment step. >>: [inaudible] 90,000 employees for 300 buildings. >> Andrew Wagner: Yeah. True. If you're doing something for an entire campus, that will be a much larger challenge. But, you know, it will probably be useful to add security to your swipe cards before it is able to completely replace them. And that will at least give you a little bit of robustness against people stealing someone else's swipe card and using that to get in. It can give you at least some idea of whether it's actually the person who should be holding that card. So what do the computations actually look like? There are sort of two general classes of algorithms for performing this sparse reconstruction. The first class is first order methods, basically projection-based methods, where the expensive parts that are inside the inner loop end up being multiplications -- wait, I'm talking about first order methods. So for first order methods, you have a very tall matrix, your data matrix A, and you're multiplying that by vectors. For the first order method, you get a matrix vector -- that's a typo. You get matrix vector operations both for the tall matrix times the small vector on the other side, but you also have the left multiply by A transpose that you have to compute. And you also have a per-pixel thresholding step. >>: [inaudible] >> Andrew Wagner: Yeah. These As are dense. So the sparsity that I'm talking about is sparsity in the representation, the coefficients that you're searching for, not in the data matrix itself. >>: Please tell me that you don't also have a 38-by-38 inverse. >> Andrew Wagner: Well, for the second order method, it's a linear system that you have to solve to get the step direction. All of these results that I've given you so far were with the second order interior point method. >>: [inaudible] using a pseudo-inverse [inaudible]. >> Andrew Wagner: Hmm? It's a linear system that we're solving, and we're using a linear system solver. We're not computing a pseudo-inverse and then multiplying. So, in any case, in the second order methods, the interior point methods, the most expensive computation is an A transpose times a humongous diagonal matrix -- we're only storing the diagonal in this computation -- times another matrix. So this precludes using off-the-shelf matrix multiplication routines. And as we just mentioned, there's a smaller system that you have to solve to finish computing the step direction.
And right now this is -- at least for per-user alignment, this is only a 32-by-32 system. So it doesn't take a lot of computation to solve it. But if you're going to extend this to scale it up to a whole lot of users, if you're planning on having a thousand users or more, you're going to have to start including a whole lot more users in the set of users that you keep around for the final recognition step. >>: Is single enough, or do you need double? Or how much precision do you need? >> Andrew Wagner: Our interior point method implemented on the CPU uses double precision. But we think we can get away with single precision. So numerical precision actually is an issue in this, and one of the things that makes the computational problem not something that you can just use off-the-shelf algorithms for. In particular, the largest matrix, A, is images that came off of a camera, so it only has 8-bit precision to start with. So some of the optimizations that we're looking at are, on the GPU, keeping A in its original resolution and then just keeping around scale factors for how each column and row should be scaled, because normalization is something that you also have to take care of. I've skipped that because it's an implementation detail. Okay. And, again, just thanking my lab mates. What questions do you guys have? Any more? Okay. Thanks a lot. [applause]