>> Jim Larus: Welcome to the first UPCRC Multicore Applications Workshop. That's
quite a mouthful.
I should point out for those of you who are not familiar with this room that on the other
side of this wall there's breakfast, there's some fruit and coffee and juice and things like
that. Feel free to help yourself. We will have a break later in the morning, but if you
want something now --
>>: And if you're embarrassed to charge past a speaker for breakfast, you can go around.
>> Jim Larus: You could go all the way around all of the lecture halls, or you could just
go on that side of the room. We'd be okay with that.
So I just wanted to say a few words and sort of set the context for this. I think that most
of our visitors know what I'm going to say, but many of the Microsoft people may not be
familiar with the context. And so I thought I'd just spend a couple minutes explaining
what UPCRC is and why we're having this workshop.
So I think the first statement is pretty obvious to everyone in this room, that multicore's
success is hugely important to both Microsoft and Intel. And because of that the two
companies got together a few years ago to jointly fund two research centers: one at UC
Berkeley and one at the University of Illinois at Urbana-Champaign.
And they are large multidisciplinary, multi-investigator projects. I didn't actually count,
but my guess is that both of them involve roughly ten faculty members and
around 20 graduate students and staff. They're funded for three to five years. A lot of
money coming from both of the companies to fund these projects. So it's a really big
research effort, far larger than anything else that I think Microsoft Research has tried in
the past ten years I've been here.
And they really have a broad focus. If you look at the projects, they're encompassing
programming, systems issues, a little bit of computer architecture, and applications.
And really the goal of both of these is to help bring multicore computing into the
mainstream of computing. Can we figure out a way of making multicore accessible to
a broad range of programmers, and of giving multicore on the desktop a lot of compelling
applications and compelling tools?
Why this workshop? Well, there are really two challenges facing multicore. The first is
how do we program parallel shared memory computers, a challenge that's been around
for a long time, certainly as long as I've been in the computer science community. And
then the other challenge, of course, is what are the killer applications for multicore.
Why would my mother go out and buy an 8-way parallel computer for her desktop? I'd still
like to know the answer to that question if anybody has any suggestions.
And these two questions are really sort of dependent on each other. How do we build the
tools without knowing what kind of applications we're going to write for these machines?
How do we build the applications without the tools?
And so really the goal of this project is to try to make progress on both of them. And
because these projects are run in CS departments and because we're Microsoft, it looks
like there's a lot more emphasis placed on the first question than the second question.
And so one of the goals of this workshop is to try to redress that imbalance and put a
little more light on the application side of things and bring together people from Intel and
Microsoft and Illinois and Berkeley who are really focused on the application side of
these questions and get them to sort of talk about what we're doing, where the
opportunities for future advancement are, what we would like to hear from the -- what we
would like to see from the tools research, and basically maybe start getting more of a
community of people interested in this and hopefully get a little bit more emphasis on the
applications than on the -- just on the tools side of things.
So that's what the motivation for this workshop is. Really happy that all of you came,
that we've got a great turnout from all four groups of researchers. We have talks from all
four groups of researchers. And having said that, let me just sort of put up the schedule.
We're going to start this morning with a sort of welcome from Tony Hey, and then we're
going to start moving into something that I just, for the lack of a better term, called
visual computing, which is a real hodgepodge of different areas. I'm sure that
people who are in these areas will object when I sort of put them all
together. But I think there are a lot of interesting synergies.
We have a break this morning. We have a slightly late lunch; I apologize for those of
you who are on other time zones for the lunch times. We have some more talks and then
a separate session this afternoon talking about social interactions. And then this evening
there's a reception for all our visitors and invited Microsoft guests.
I have maps. It's literally just about a few blocks that way from here in a new facility
Microsoft just opened. So those of you who need to know where to go, I'll be happy to
give you a map.
Great. And if anybody has any questions and concerns, please sort of catch me, ask. All
the visitors should have -- if you checked in at the front desk, you should have gotten a
little thing that will let you hook up to the wireless network. If you haven't, again, check
with me and we can get you one.
The bathrooms, just to answer the other question, are -- if you go out that door, walk all
the way down to the end of the hall, turn left, they're right there.
Any other questions before we get going? Great.
Tony.
>> Tony Hey: Great. Well, not much more to say. Certainly add my welcome to
everybody from Berkeley, Illinois, Intel and Microsoft Research.
And I just thought it would be interesting to give a couple of perspectives. How I got
interested in parallel computing was a seminar given by Carver Mead when I was on
sabbatical at Caltech. And it was an interesting seminar. There was a room full of
Caltech faculty and everybody was waiting and waiting and waiting. And at about a
quarter past the hour someone went and fetched Carver Mead from his lab; he'd forgotten
about it.
And he gave a great talk. He gave it with a slide deck and his talk was really an
explanation of Moore's law. So that's one of the things that -- he was the Gordon Moore
professor at Caltech, and he explained why Moore's law worked.
And his talk was breathtaking. He gave a talk which showed there were no -- this is
'81 -- no engineering obstacles to transistors and chips getting smaller and faster for the
next 20 years. And it's clearly -- that was in '81, and we've really gone a bit beyond that,
but we've now come to the end of the free ride.
So that was the thing that inspired me to get into parallel computing. I was doing
quantum chromodynamics. I was a physicist in those days. And I would classify QCD
as an application that can soak up as many cycles as you need. But it was clear we
needed a bigger computer than the VAX-11/780, which is what I had at the time.
So that's how I got interested. And it was also clear to me at that point that distributed
memory systems -- it was the days of Crays and supercomputers and shared memory.
Distributed memory systems were in the end going to outperform these shared memory
systems like the Y-MP and X-MP and so on.
So it took a long time to come to pass, but I think it's now clear that for scientific
applications, distributed memory programming is the way people program these
machines with tens of thousands of processors at the national labs and so on.
There were things like SIMD and MIMD, of course. And I worked in Europe on a
processor called the transputer. Now, the transputer was an interesting processor. It had
a CPU and an FPU and it had memory and communication hardware on it. I
think it was about sort of 10, 15 years ahead of its time. But it is now clear that with the
many-core, multicore ideas, we're now revisiting some of those ideas that were embodied
in the transputer. So I'm finding that particularly interesting.
There used to be a joke when I gave talks on parallel computing, which, you know,
parallel computing is the future, and then people said and always will be.
And the interesting thing is that it is now really the future. It is actually -- I had not foreseen
that it was going to be so critical for the IT industry, for Intel and Microsoft to
actually master programming on multicore chips. And we have to do better than MPI.
We never succeeded in 20 years -- my research group and others in the audience who we
interacted with, we never made it easy.
And so we really need to do better. And I'm looking forward to the focus on applications.
My favorite application is the one that Dave Patterson used, which is one of these. As I
get older, you come -- I need one of these which has a camera, and so I'm listening to
Rick's talk in a moment, which will actually say, well, that looks like Dan Reid over
there, he's aged a bit in the last five years and last time you met him you talked about this
and that and he owes you a beer. And so then I can go and say hi, Dan, nice to see you.
How about that beer? Because I've completely forgotten his name.
He, of course, has a similar device which tells him that he owes me a beer, and then he
avoids me.
Okay. So those are the sort of consumer applications that maybe -- that Jim's mother
might buy. You never know. But there I think is the key, and I'm very, very pleased to
see the focus on real practical applications which could be of general interest as opposed
to quantum chromodynamics, which is not of general interest, I have to say.
So just one last word. I first met Andrew -- and so this is how we put the deal together
with Intel. It was because Andrew called me up and said how about doing something
together. And as you know, Intel and Microsoft working together, it's like two
porcupines mating. It has to be done with care.
I think I first met Andrew when he was chair of the PPOC [phonetic] program committee.
And he organized, I don't know, some program committee meeting on January the 2nd in
San Diego. So it seemed a bit unreasonable, but then I flew all the way from the UK to
go and do that. And I was amazed at this program committee in that there were all these
young turks and they trashed everybody's papers, so nothing was good enough and they
just selected a few.
And then the program committee members who had papers, well, then had to leave the
room. And then the remaining program committee members trashed their papers too. So
it was an interesting experience. I argued for some generosity.
So I'm looking forward to today's events and seeing what progress has been made. And I
think it's really important.
And the other important thing that Jim didn't say is that the reception in the Spitfire bar
actually has beer. So I'll certainly be there tonight.
Okay. So I think we're ahead of schedule, Jim, so unless Dennis wants to say something,
I think it's Rick -- you ready to start? I'll hand it over to you, Jim.
>> Jim Larus: Yes. I just -- one announcement. All non-Microsoft speakers, we need
you to sign one of these release forms for your talk.
>>: We have one form they have to sign, very small --
>> Jim Larus: We have one form you have to sign, really small on it. But it's just to give
us permission to both -- there are two boxes you also have to initial. One is to just use it
internally; the other one is to put it externally. Hopefully everybody's okay with putting
it externally. I've had a number of requests from both schools that -- and I think these
talks would actually be a real asset to people interested in the area.
So please remember to do that. And also they would like your talk on this memory stick,
so please do that as well. Great.
So our first speaker -- let me just go back, put that back up -- is Rick Szeliski from MSR
who's going to be talking about some of the great and really fascinating work that's been
going on in his group on vision.
>> Rick Szeliski: Okay. Thank you, Jim. Is my microphone live now? Well -- okay,
thank you.
So, first of all, let me say that even though the program says it's just me, there's actually
three of us talking. I thought it was just easier to keep short on the content line. But I'm
going to give a brief introduction.
Sudipta Sinha, who is a researcher in our group, will talk a little bit about Photosynth and
some of the ways we're extending it. And then Sameer Agarwal, who is a postdoc at the
University of Washington, will talk about solving very, very large matching problems
involving tens of thousands or hundreds of thousands of images.
So let me just start with a brief introduction to some of the applications that you might
see of computer vision where a lot of parallelism is required.
And I'm not going to cover most of these applications. As a matter of fact, some of the
subsequent speakers from the various other institutions attending here will be covering
some of these in more detail.
But just to give you a flavor for why we need large amounts of computing, and we'll
continue to do so for the next decade or so, there's this idea of computational
photography, where rather than just taking a photo with a camera and then printing it or
sharing it on the Web you can take multiple photographs -- let's say different focus
settings or different exposures or with and without a flash, or even from different points
of view -- and then put them together.
And we're starting to see this really permeate into the general public, so people now
commonly take lots of photos like this and create panoramas. But another thing you can
do is take different exposures with your camera and then merge them all
together.
So our group has been working in these areas. And we'll show you one particular thing
called Photosynth which is a way of taking photographs from different points of view and
moving around between them. So this is one area that can soak up a lot of computation.
One that takes even more is whenever you move to video. There the processing
requirements get a lot, lot higher because you basically have to process an image 30
frames a second.
So, of course, there are the classic applications like compression. Phil Chou's group
works in that area. Video enhancement, which is an issue with some of the low-quality
cameras we have or maybe stabilizing videos that are taken from jittery cameras.
And something that's just on the horizon, this 3D video. Now, you might say, oh, 3D
video, 3D movies, people have talked about this forever, what's the big deal, it doesn't
look so exciting. But the fact is a lot of the movie studios are thinking this is what's
going to keep people coming back to the movie houses. And a lot of major directors --
Disney is investing heavily in this, James Cameron's working on a feature film.
And of course if it comes to the movie theater, it's going to come to your home
eventually. So we're going to see 3-dimensional or at least stereoscopic video coming to
your home probably in the next, who knows, three, four years. And that will soak up
more computing power because you might want to display this video for different
viewers in the room or change your viewpoint.
Then there's stuff more in the immediate mobile space where you might be wanting to
recognize photos. So just like when Tony said he wants to see Dan's photo and recognize
who that is, that's one application. And then also recognizing where you are, taking a
photo of your environment, getting the name of that building, translating signs that you
see, overlaying information as you're looking around.
So there's this whole space of things. And of course you might say, well, okay, this is the
mobile platform so this has nothing to do with multicore. Of course most of you
probably know better, right? I mean, the mobile devices will be just as multicore as
everything else.
And then there are applications that you can think of as being a little bit more embedded,
like robotics. We're starting to see a lot of computer vision making its way into cars. So,
for example, under poor visibility conditions, sometimes the vision system can detect
things and sound an alarm for the driver better than the driver themselves can do.
People talk a lot about home monitoring, applications for our aging population. And
eventually we would like to have the computer understand you and your state of mind
and your desires just as well as a human assistant can. So there's this whole idea of
looking at people.
So these are just some potential computer vision applications. And I probably don't have
to explain to you why they soak up a lot of pixels -- or computons, because it just takes a
lot of time to analyze an image and to really understand what's going on.
So in our group, which is called the Interactive Visual Media Group, we have a number
of projects. This project called Lincoln is the one where we recognize things on cell
phones. Photo Tourism was a project that started at the University of Washington as a
collaboration, and Sameer will be up, will be talking about some of these things.
It evolved into the Microsoft product called Photosynth which Sudipta will talk about.
We do things with stitching, we've done 3D video, we've done video walkthroughs. And
mostly I put this slide up to -- for people who are interested and didn't get enough details
from my talk to just go visit our group page. You can find more detail about some of
these research projects.
So I will mention one application very briefly. This isn't rocket science, but it just shows
you why multicore, when done correctly, is an incredible benefit.
This is just the simple idea of convolving an image, blurring it, sharpening it, using
what's known as a separable filter, where you use a one-dimensional filter horizontally
and a one-dimensional filter vertically.
And one of the developers in our group, Simon Winder, thought about this hard and
decided that the right way to do this is to stripe the image vertically and have each core
basically working on its own separate area. And then after it's finished doing a vertical
convolution, then it transposes things as it's writing them out, and then you can do it once
more and this results in a very fast application.
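(A rough illustration of that striping idea, not Simon Winder's actual code: each worker convolves its own band of columns with a 1-D kernel and writes the result transposed, and the same routine applied twice gives the full separable 2-D filter.)

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def convolve_columns_transposed(image, kernel, n_workers=4):
    """Convolve each column with a 1-D kernel, writing the result
    transposed so the second pass can reuse the same routine."""
    h, w = image.shape
    out = np.empty((w, h), dtype=image.dtype)   # transposed output

    def work(band):
        lo, hi = band
        for col in range(lo, hi):
            # Vertical 1-D convolution on this worker's own stripe of columns.
            out[col, :] = np.convolve(image[:, col], kernel, mode="same")

    # Split the columns into one contiguous stripe per worker.
    edges = np.linspace(0, w, n_workers + 1, dtype=int)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(work, zip(edges[:-1], edges[1:])))
    return out

def separable_filter(image, kernel, n_workers=4):
    # Two passes of "convolve columns, transpose on write" give the full
    # separable 2-D filter and leave the image back in row order.
    tmp = convolve_columns_transposed(image, kernel, n_workers)
    return convolve_columns_transposed(tmp, kernel, n_workers)
```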
So this is a great piece of code. We use it all the time. Unfortunately we gave this code
to a different product group, the Photosynth team, that uses some of these tools. And
they discovered they were already multicoring at the task level. They were giving each
core a different image. And everything collided and slowed down.
So this really shows the need for things like ConcRT, for operating systems which will
basically arbitrate between different applications that each think they know how to
use the parallelism on the multicore. So this will be an issue in the future.
So let me summarize my introduction by just saying that visual media in general will
probably provide almost unlimited amounts of computational requirements. Some of
these applications I think are very rich and compelling and everybody understands
what they do. Everybody wants a cell phone that will tell you what is in front of it.
People like these kinds of three-dimensional immersive experiences like Photosynth.
But for those of us who -- you know, our group works primarily on the applications side, but
when we have to think about how to program it, there are a lot of interesting challenges because
not everything is just obviously data parallel.
There are also a lot of sparse systems techniques being used and things that aren't really
even at the pixel level or at the continuous numerical level. There are things like
information retrieval. Basically, if you want to match an image against a very large
database, you have to start using things like inverted indices and document frequency
counts and things like that.
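(To make the information-retrieval point concrete, here is a minimal, hypothetical sketch of the bag-of-visual-words idea: quantize each image's features into visual-word IDs, build an inverted index from word to images, and score candidate matches by term frequency weighted by inverse document frequency.)

```python
import math
from collections import defaultdict, Counter

def build_inverted_index(image_words):
    """image_words: dict mapping image_id -> list of visual-word ids
    (e.g. quantized local descriptors). Returns word -> {image_id: count}."""
    index = defaultdict(Counter)
    for image_id, words in image_words.items():
        for w in words:
            index[w][image_id] += 1
    return index

def query(index, n_images, query_words, top_k=10):
    """Score database images against a query image's visual words
    with a simple tf-idf accumulation over the inverted index."""
    scores = Counter()
    for w, tf in Counter(query_words).items():
        postings = index.get(w)
        if not postings:
            continue
        idf = math.log(n_images / len(postings))   # rare words count more
        for image_id, count in postings.items():
            scores[image_id] += tf * count * idf
    return scores.most_common(top_k)   # likeliest matching images
```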
So this is a very rich area. And with that I'm going to turn the mic over to Sudipta, who
is going to take it from here and tell you a little bit about what he's been doing in the
Photosynth area.
>> Sudipta Sinha: Thanks, Rick. So I'm going to talk about Photosynth, which some of you
may have already seen.
Basically what's happening in that little video clip up there is we are doing a slide show
between images of a scene which were taken from different positions, but by having
processed those images to figure out how they overlap and how the cameras were
actually located in a 3D scene we were actually able to do a much better scene transition,
which hopefully gives you a better feel of what the scene looks like.
So the main idea behind Photosynth is you process the images and automatically figure
out where the cameras are located, and then based on that you -- it gives -- it provides
you the ability to browse a large collection of photos in 3D. So by figuring out that those
two photographs were taken from nearby positions, it allows you to actually
move around in the 3D world showing the photographs that were taken.
So this is sort of the classical approach in sort of the image-based rendering community,
which people have been trying to do. And what Photosynth really has sort of been able
to do is bring that to the consumer.
So the task that I'm going to talk about mainly, the application is often called view
interpolation. And Sameer later on is going to talk more about the background of how
Photosynth actually works -- he will go into detail on feature matching and structure from
motion.
But basically starting with those photographs shown on the left, you come up with -- you
have to estimate where the cameras are and a sparse 3D point cloud of the scene. So then
Photosynth actually works on top of this.
So what I'm going to talk about is how can we improve the transitions that I showed you.
So the way Photosynth works right now is it uses a single-plane proxy for every image to
do the transitions. Whereas, if you actually knew some approximate three-dimensional
shape of the scene, this would actually allow you to do a much better transition.
So this is sort of showing a quick comparison of what happens when -- this is showing
the scene -- photo transition with single-plane proxy. And the reason things are blurred is
because the single-plane proxy doesn't fit the 3D geometry of the scene.
So now what I'm showing you is the way the transitions would look if you had
approximate depth of the scene.
So things are much more aligned, and it actually feels like the camera is moving from one
position to another in this sequence.
So the big challenge is how do we compute depth automatically for a wide range of
inputs. We're talking about consumer photography, where people are not going to take
photographs under special conditions.
So this has been a big -- sort of a big challenge in the computer vision community, and
we've started to see robust and really practical stereo algorithms which sort of address
some of the challenges in the dense correspondence problem.
So the basic problem that stereo solves is given two images of the scene, figuring out a
dense map of pixels from the first image into the second image. And although this
sounds like -- so this is sort of a common theme in computer vision. And like Rick said,
this is extremely compute heavy and really the key to solving the stereo problem as well.
People have recently shown a big speedup by using the GPU. So the GPU has this SIMD
model to exploit parallelism, and that's what the speedup comes from in stereo.
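(As a toy example of why stereo maps so well onto SIMD hardware, here is a naive block-matching disparity search in NumPy; every pixel evaluates the same arithmetic over every candidate disparity, which is exactly the pattern a GPU exploits. This is the textbook winner-take-all algorithm, not the semi-dense method described next.)

```python
import numpy as np
from scipy.ndimage import uniform_filter

def block_matching_disparity(left, right, max_disp=64, radius=3):
    """Naive winner-take-all stereo: for each left-image pixel, find the
    horizontal shift into the right image that minimizes the mean absolute
    difference over a small window."""
    h, w = left.shape
    cost = np.full((max_disp, h, w), np.inf, dtype=np.float32)

    for d in range(max_disp):
        # Per-pixel absolute difference at this candidate disparity,
        # aggregated over a (2*radius+1)^2 window.
        diff = np.abs(left[:, d:].astype(np.float32) -
                      right[:, :w - d].astype(np.float32))
        cost[d, :, d:] = uniform_filter(diff, size=2 * radius + 1)

    # Winner-take-all: every pixel independently picks its best disparity,
    # which is why this is so friendly to SIMD / GPU execution.
    return np.argmin(cost, axis=0)
```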
So here is an example. One of the algorithms proposed recently, this is a -- the way this
algorithm works is it first figures out depth at specific points in the scene. And then once
you have the depth at those pixels, it sort of spreads them. So it's a surface growing
algorithm.
And the benefit is that it's quite easily parallelizable. Because really you need to
operate -- you're operating on a local region of the scene and gradually computing depth for
all the pixels.
So there's a final step which is sort of a global step, which is once you've computed
individual depth maps for the images, then how do you parallelize that, how do you fuse
them together into your model, which is basically what this animation is showing on the
right.
So for Photosynth, we are exploring a similar pipeline. But one of the differences is that we
are looking for a way to come up with a lightweight reconstruction of the scene. So the
way -- one of the approaches that we are exploring is starting from the sparse set of scene
points which were recovered by structure from motion, we do some line reconstruction.
And then based on the sparse geometric information we recover a set of candidate
dominant planes.
And once we have this set of dominant planes, we figure out how to assign each of these
planes to pixels in the images. And so the pseudo-colored image on the right shows you
one of those assignments. And corresponding to that assignment, you get a depth map.
So this is going to be our representation with which we can do the kind of transitions that
I showed earlier.
So in this big processing pipeline, there are a large number of sort of tasks that we solve.
Some of these are sort of -- so there's really three levels of parallelism that we can exploit
there. So things that are trivial to parallelize are sort of the [inaudible] tasks: feature
detection, edge extraction, vanishing point detection.
And then there are tasks which work at the level of groups of images. So typically -- so,
for example, the semi-dense stereo algorithm works by picking a reference image,
figuring out a set of neighboring images, and then computing the depth map of the
reference image. And then this is repeated for each image in the collection.
And then there are certain tasks like global tasks, and Sameer will talk about bundle
adjustment, which is sort of one of the key nonlinear optimization problems that have to
be solved in these kind of systems.
So the main idea is that it should be fairly easy to figure out -- once you locate
the parallel regions in your pipeline, it should be possible to exploit
task-level parallelism by running multiple instances of
your problem and solving them in parallel. And that's basically what I have on the right --
the diagram shows that idea.
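(A minimal sketch of that task-level pattern, with hypothetical per-image routines passed in as parameters: each reference image plus its neighbors is an independent job, so a process pool can farm them out.)

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def depth_map_job(reference, collection, pick_neighbors, compute_depth_map):
    """One independent task: compute a depth map for a single reference
    image against its chosen neighbors. The two callables are stand-ins
    for the real per-image work."""
    neighbors = pick_neighbors(reference, collection)
    return reference, compute_depth_map(reference, neighbors)

def all_depth_maps(collection, pick_neighbors, compute_depth_map, n_workers=8):
    """Every image in the collection becomes one job; the jobs share no
    mutable state, so they can run on separate cores (or machines)."""
    job = partial(depth_map_job, collection=collection,
                  pick_neighbors=pick_neighbors,
                  compute_depth_map=compute_depth_map)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return dict(pool.map(job, collection))
```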
So yeah. So those are some of the steps I was talking about. In our processing pipeline,
sort of most of the time is spent at either -- at both the dense stereo step, so dense stereo is
solving this dense correspondence map between pairs of images. And the graph-cut
optimization is, again, trying to solve the pixel -- the label assignment problem on the
whole image. So those are really -- those steps are really operating at each pixel in the
image, and that's why it takes so much time to run.
But so this is showing how long Photosynth takes to run, sort of the different steps in
there. And Photosynth is fairly fast, and people have been building collections of close to
a thousand images all on one machine. It's also very memory intensive because you
pretty much work with large sets of images in memory all the time.
So to summarize, we are looking at ways to improve Photosynth. And some of the
improvements that we are thinking of require a lot more compute than Photosynth
currently uses. But a lot of this should also be parallelizable, and we are looking into
how to achieve that.
And behind the scenes sort of there's the structure from motion problem as well as the
image-based modeling problem, which is -- this whole domain is extremely CPU
intensive. And there's a lot of scope to exploit task parallelism. Yes.
>>: If I understand correctly, what you were describing was sort of the batch processing
time for those sets of images, right?
>> Sudipta Sinha: Yes.
>>: Is there an incremental kind of characterization? If I were to add an image to a set or
[inaudible] reprocess the whole thing?
>> Sudipta Sinha: No. So yeah. So there are different steps. So the feature matching
step, it's actually possible to incrementally add these images the way you talked about.
So once you know the reconstruction for N cameras, given a new camera, you figure
out -- you do the feature detection for that image, but then figuring out the pose for that
camera is a fairly incremental step.
>>: So do you know what the actual bottlenecks are at the next level of detail past the
one you showed at the end? I mean, what are the parts that are taking the time
[inaudible]?
>> Sudipta Sinha: So when you do the dense scene modeling, it's really the order of the
number of pixels. So dense stereo -- so one of the things we're exploring here is coming
up with simplified representation of the scene. But it still requires you to solve the dense
stereo problem. Because you're assigning depth to every pixel. So that's really where
the -- where sort of the bottlenecks are going to be, because --
>>: Maybe the simplest question is: is it memory bound or is it compute bound? Since
it's -- it's a lot of data, right?
>> Sudipta Sinha: It's data, but it's going to be compute bound. And, again, so the
graph-cut optimization is sort of one of the components which is used a lot, so we are
using it in our pipeline, it's used in other image stitching tasks. It's -- it depends on what
resolution you set up your graph.
But it's again -- so the graph-cut step is somewhat harder to parallelize compared to
stereo, because stereo is basically -- you can run -- it's all -- it can be done in SIMD
fashion, whereas graph-cut -- again, there are different graph-cut algorithms, some which
are possible to parallelize, like the push-relabel type of graph-cut algorithms are easier to
parallelize.
But that's, again, another component which takes time, and it's used in a lot of
applications.
>>: [inaudible] full automation, no user input?
>> Sudipta Sinha: As of now, so everything I described is without user input, yeah.
>>: But is your goal to sort of make this sort of available on the client and to enable that
scenario, or, I mean, push a lot of this to the client where then the user couldn't interact?
>> Sudipta Sinha: So, I mean, sort of there has been work in the interactive domain in
the research community. But I think the real power of the system will come when you do
everything automatically. And we're talking about working with hundreds of images and
potentially up to thousands.
>>: No, but I'm thinking of the people who use Photoshop and they do all sorts of
correction on the images and sort of the ability to sort of combine editing in this mass
scale. Because I think that my experience is that, you know, there's -- you know, I have
this huge number of images and the ability to see them all is great, but also to apply mass
edits, you know, and do things like that would be very fundamental. But I would need
the stuff probably on my CPU -- on my desktop to take advantage of things, you know,
the large image, all that stuff.
>> Sudipta Sinha: Yeah, I mean, so the 3D modeling part of the pipeline is somewhere
where the user can definitely improve the results.
It wouldn't change the -- well, so you could probably skip the dense stereo part and do
something faster there if the user gave you some input.
But otherwise, I mean, compute-wise, it's still going to be spending the same amount of
resources on the problem.
>>: Yeah. So I guess I don't understand why the dense stereo step is done pixel by pixel.
Couldn't you just look at -- you know, identify what you might call disparity classes and
segment the image into sets of pixels with the same disparity and then work on those as a
group instead of individually?
>> Sudipta Sinha: Yes, you could, but then you're making -- so when you start off with
no knowledge about the scene, and if you want to get an accurate depth map, you would
have to basically go down to every pixel.
So the problem is there's a lot of -- the problem is inherently ill-posed. So we do end up
enforcing this constraint that pixels near to each other should have the same disparity.
But to -- given a perfectly textured scene, if you wanted to compute an accurate depth
map, you would have to basically step -- take one-pixel steps in this disparity space to
come up with the optimal solution.
So, yes, there are faster algorithms which are trying to either sample the disparity space
less densely or making early commitments that can sort of -- those are sort of the
speedups that are possible.
>>: I was kind of -- I mean, we should take this [inaudible] segment the disparity space
and take it as far as you can go. If you have to go further to pixel level for pieces of it,
you do that. But you might get a pretty clean solution.
>> Jim Larus: Yeah. And there are stereo matchers that first segment the image, then
work at that level.
>>: Yeah, yeah, yeah. Yeah.
>> Rick Szeliski: Let's take one more question. I want to make sure the last of our triplet
of speakers has a chance. So go ahead.
>>: Two quick ones. One, is this only three dimensional or do you assume that all the
images are sort of shown from this ground level [inaudible].
>> Sudipta Sinha: So the input set is completely unstructured. We don't make any
assumptions on the scene. That's sort of one of the things that makes it challenging. We
want this to work on general input collections.
>>: And the other one, if the input set is not a collection of still images but video
that was shot while the camera was moving, is it sort of an easier task since you
could potentially use motion vectors that are already in the [inaudible] video stream?
>> Sudipta Sinha: So the [inaudible] that I've described and the whole system is geared
toward static scenes. Because the whole idea is matching feature points, we assume that
the scene is static so that things have not moved. If things are moving, then the moving
objects will not be reconstructed.
>>: [inaudible] but the camera is moving [inaudible] moving around the scene.
>> Sudipta Sinha: Yeah. So that's, again, the same structure from motion problem,
because it's a sequence of images taken from different viewpoints. And that's easy to
build into Photosynth.
>> Rick Szeliski: It does get easier.
>>: So if a car goes by, it might be trouble.
>> Sudipta Sinha: Well, it's going to be treated as an outlier in this whole pipeline.
Everything else will be reconstructed unless you specifically -- yeah.
>> Rick Szeliski: So thank you.
>> Sameer Agarwal: Thank you. My name is Sameer Agarwal and this is joint work
with Noah Snavely at Cornell; Ian Simon and Steve Seitz at the University of Washington; Rick
Szeliski at Microsoft Research.
And as Rick and Sudipta talked about, this work originated from work done at the U-Dub
known as Photo Tourism, which aimed at reconstructing basically tourist landmarks
from tourist photo collections from places like Flickr.
And the size of the system that we were talking about there was couple hundred to a
thousand or so images at that time.
And the system was then built into a product by Microsoft and -- known as Photosynth,
and it can handle, I believe, a couple thousand images right now.
But if you go back -- go on the Web and look at image collections, I just went to Flickr
and typed the word Venice -- and this is the image sizes that come up in Picasa -- it's 7.8
million. For cities like Rome we're talking about 26 million images. And New York and
London are 40 million images.
And the aim -- and the thing that we wanted to do was build a system that will go on the
Web, type the name of a city, download every image for that city, and try and reconstruct
that city, the three-dimensional structure of that city.
To give ourselves a slightly more practical target, we decided all right, we'll take -- we'll
download a million images of Rome, match all the images and build a 3D model of the
city and do it in a fully distributed manner on a thousand cores in 24 hours.
So what I'll describe today is some progress towards that, and I'll try and convince you
that we're not completely crazy.
So why do it? Well, first of all, because it's cool. Because we can do it. But on a more
pragmatic note, the single most interesting thing about tourist photographs is that they go
where things like street view won't go. They capture notions of interest. People go to
interesting places. They capture things at different times of the day, different locations.
They capture interiors as well as exteriors.
And our hope is that when we do this reconstructions, all these things that people have
captured in their photographs, they'll be represented in there.
And once we have these models, we can put them in things like Google Earth, Virtual
Earth. And the next generation of GPS, for example, instead of giving this line diagram
will likely show you how you will get there.
Another thing -- we did something very similar to a project at Georgia Tech, which is
called 4D Cities, where they are trying to map the evolution of the city of Atlanta over
time, especially since Atlanta has undergone some very dramatic changes in its
architecture.
And the thing is that once you have the 3D structure and the visual representation in your
computer, you can mine it, you can understand how the city grew, what are similarities
across different parts of the city, what are sort of canonical structures across cities.
And every single time I tell somebody about this project, there's usually somebody in the
audience who asks me about setting a game in it. Grand Theft Auto Rome, for example,
where I can actually blow up real stuff in a real city instead of trying to fake a city.
So what are the challenges here? The challenges are both scale [inaudible] computer
vision algorithms, especially structure from motion systems like this, have traditionally
been designed assuming you have a single processor and most of what you need to do can
be fit in RAM or it can be easily accessed from the disk.
For the scales of things that we are talking about, there is no -- there certainly isn't a
question of being able to fit in RAM, but you can't even think about fitting most of this
stuff on a single disk. So we are necessarily talking about a distributed memory system.
Then once you start talking about [inaudible] distributed memory system, the other
question is how do you actually distribute. So we've been classically -- most computer
vision systems assume a single processor and things being done serially. Now this offers
us a fundamental opportunity to look at ways in which we can look at problem
decomposition, efficient ways of breaking these problems up and combining these
results.
So there's work to be done both in data distribution, in dynamic load balancing. And
when you get to the nitty-gritty, sparse distributed linear algebra.
So where are we now? In the system that I'll -- well, very briefly talk about today, some
of the stuff is parallel. Our image download, our feature extraction and image matching
system, which is sort of the first thing that we really spent some time working on, are
parallel.
Our actual geometric reconstruction system is not a distributed memory [inaudible]. It
uses -- it has components [inaudible] where it uses a nonlinear least squares solver which is
[inaudible] parallel. But the big part, the distributed memory system, is basically the
matching problem.
So to give you an idea of the kind of thing that we are trying to do, what we are
fundamentally trying to do or need to do as a first stage of this problem is that given a pair
of images, we need to identify what points in those two images correspond to the same
3D point in the real world.
And even when implemented very efficiently, this is still a very compute-intensive
problem. If you were to do this for a million images, we're talking about generating at
least a couple terabytes of data on disk, about half a trillion pairwise image comparisons.
And if you're hoping to do about 10,000 comparisons per second, you'd spend at least a year
and a half doing this.
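(A quick back-of-the-envelope check on those numbers:)

```python
images = 1_000_000
pairs = images * (images - 1) // 2   # ~5.0e11, the "half a trillion" comparisons
seconds = pairs / 10_000             # at 10,000 comparisons per second
print(seconds / (3600 * 24 * 365))   # ~1.6 years
```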
So there's no way doing sort of all -- comparing all pairs of images is going to be
practical here.
So the trick is to figure out what are the important -- what are the important places in
your image collection to spend time on.
The other thing which is sort of very key about this problem is that most of the images
are actually garbage. Since we are just going to Flickr and typing the word Rome and
downloading everything that comes up, a lot of the images have nothing to do with the
3D structure of Rome. So it's also sort of a needle in haystack problem where only about
10 to 15 percent of the images that we'll download will eventually have any effect on the
3D structure that we reconstruct. The majority -- the vast majority of images won't be of
any use.
So it's important that we very quickly remove them from consideration.
The inspiration for the algorithm that we designed is this match graph for the city of
Rome. So we took about 20,000 images and we actually compared every pair of those
images to see -- to sort of get -- to get a view of what does this match graph look like.
So if there are two images which see the same part of the world, then there's an edge
connecting them.
And as you can see, this is a fairly sparse graph but quite clumpy. And this is very
characteristic of tourist datasets because there are tourist locations where people will go and
take photographs. And then, you know, they'll walk the street, not take a lot of
photographs, then come to the next tourist attraction, take a lot of photographs.
But, again, there is some distribution around the tourist locations, but what you see is
clumpiness. Which means that if you're able to find these clumps and get a couple of
photographs inside these clumps, it's probably quite easy to grow these clumps. And
that's the approach that we take.
So we have a matching system which is -- which has multiple rounds of matching. And
in each round we first heuristically come up with pairs of images that we think are
looking at the same thing in the world, and then we go ahead and do sort of this detailed
matching between them -- say all right, compare the
interesting points in this image with the interesting points in the other image and see if
they're looking at the same thing.
And then there is some geometric cleanup. Those of you who know what RANSAC is,
we use some Random Sample Consensus algorithms to geometrically verify that what we
are doing is not just two black pixels being matched; they actually geometrically make
sense.
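(Not the actual system, but a rough per-pair sketch in OpenCV of what "detailed matching plus geometric verification" means: match local features, then let RANSAC on the fundamental matrix throw away matches that don't fit a single rigid scene.)

```python
import cv2
import numpy as np

def verified_matches(img1, img2, min_inliers=16):
    """Sketch of one pairwise check: match local features between two
    images, then keep only matches consistent with a common epipolar
    geometry (so two black pixels that merely look alike get rejected)."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    if des1 is None or des2 is None:
        return None

    # Nearest-neighbor matching with Lowe's ratio test.
    good = []
    for pair in cv2.BFMatcher().knnMatch(des1, des2, k=2):
        if len(pair) == 2 and pair[0].distance < 0.8 * pair[1].distance:
            good.append(pair[0])
    if len(good) < min_inliers:
        return None

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

    # Random Sample Consensus on the fundamental matrix.
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
    if F is None or mask is None or int(mask.sum()) < min_inliers:
        return None
    inliers = mask.ravel() == 1
    return pts1[inliers], pts2[inliers]
```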
So our system architecture, it's a two-layer system. It's sort of home brew. There's an
underlying layer of python code, which is a distributed computing engine that we wrote.
And the actual matching system is written as an application on top of it.
It's actually platform independent. I wrote it on Linux and then I brought it to Microsoft
Research and I ran it on the Windows cluster here. It's aimed at doing data-intensive
computation. It's very MapReduce-like. And it has extensive support for local caching
of data. So we know the entire structure of the computation, so we took care to design
operators and caching algorithms which are aware of this.
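(The engine itself isn't shown here, but the MapReduce-like shape of one matching round can be sketched as follows, with hypothetical helpers; the real system distributes work across machines and caches feature data on each node's local disk.)

```python
from concurrent.futures import ProcessPoolExecutor
from functools import lru_cache, partial

@lru_cache(maxsize=256)
def load_features(feature_path):
    """Stand-in for the local cache: each worker keeps recently used
    feature files in memory so repeated pairs don't re-read the disk."""
    with open(feature_path, "rb") as f:
        return f.read()          # a real system would load descriptors here

def match_pair(pair, match_fn):
    a, b = pair
    return pair, match_fn(load_features(a), load_features(b))

def matching_round(candidate_pairs, match_fn, n_workers=32):
    """'Map' step: score every proposed pair in parallel.
    'Reduce' step: keep only the pairs that verified successfully."""
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(partial(match_pair, match_fn=match_fn),
                           candidate_pairs)
    return {pair: m for pair, m in results if m is not None}
```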
So what sort of performance does this give? So we looked at three different datasets of
increasing size. We looked at the city of Dubrovnik, which is a city on the coast in
Croatia. At the time of our experiment, there were about 60,000 images on Flickr. And
doing the matches there took about five hours. This was about 320 cores.
Rome and Venice we did 150- and 250,000 images, and they took nine and 27 hours
respectively. This of course raises the question, all right, you're doing this sort of
approximate matching; how good are you?
So we went back to our groundtruth dataset for Rome and compared our matching
results to it. And it turns out that for about a quarter of a percent of the compute effort,
we were able to get more than 90 percent of the true matches. And there are no false
positives here because the detail matching algorithm that was used is the same in the
groundtruth experiment and the one that we are doing.
>>: I'm sorry, groundtruth here is just the brute-force algorithm?
>> Sameer Agarwal: The brute-force algorithm.
>>: [inaudible]
>> Sameer Agarwal: [inaudible] checking. We're assuming that that part of the
algorithm is good enough that we can treat it as a groundtruth. You're very right.
So we had awesome results. These are -- oh. I should talk about the reconstruction process.
So the reconstruction process basically starts with a pair of images. We figure out what
sort of 3D construction or 3D world it captures. We add some more points by
triangulating the position -- the camera position of those two images.
And then depending on what cameras see those points, estimate the pose of some more
images. And then do a [inaudible] bundle adjustment, which is basically just a fancy
name for a nonlinear least squares minimization where we try and adjust both the 3D
position of the points as well as the parameters of the cameras so that the structure
matches the observation made in the images. And we repeat this until we can't add any
more images to it. So it's an incremental process.
The key compute task here is the bundle adjustment, the nonlinear least squares problem.
It's a very large, sparse nonlinear least squares problem where -- let me just describe the
symbols. M is the number of observations, sort of the number of points that we see in
each image, [inaudible] the number of actual 3D points being reconstructed, so a single
point may be seen in multiple images. And N is the number of cameras.
And the characteristic of this problem is that M is way larger than P, which is way larger than N.
To give you an example of the size for one of the Venice components, we had 14,000
images corresponding to, so that's N equal to 14,000, P of about 4.4 million points, and M
corresponding to about 27 million.
So we're talking about -- so if you're using Levenberg-Marquardt, at every time step
you're solving a linear system which has about 54 million rows and about 14, 15 million
columns.
The trick to solving this -- so this linear system has a very nice structure. It's a structure
that's present in most [inaudible] problems. And we exploited that to actually reduce the
size of the linear system using a [inaudible] trick down to something which is basically
the number of cameras by the number of cameras. So we bring it down to about 14,000
times 9 -- 140,000 or so by 140,000.
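(For reference, the reduction being described is the standard Schur-complement trick; the notation below is generic, not taken from the talk. The Levenberg-Marquardt normal equations split into camera and point blocks:)

```latex
\begin{bmatrix} B & E \\ E^{\top} & C \end{bmatrix}
\begin{bmatrix} \delta_{c} \\ \delta_{p} \end{bmatrix}
=
\begin{bmatrix} v \\ w \end{bmatrix}
```

(Because C is block diagonal, with one small block per 3D point, it is cheap to invert, and the points can be eliminated:)

```latex
\left(B - E\,C^{-1}E^{\top}\right)\delta_{c} = v - E\,C^{-1}w,
\qquad
\delta_{p} = C^{-1}\!\left(w - E^{\top}\delta_{c}\right)
```

(The remaining system is only as large as the number of camera parameters -- with roughly 9 parameters per camera, that is the 14,000 times 9, about 140,000 by 140,000 system mentioned above.)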
And the state-of-the-art software for doing this was not fast enough for our purposes, so
we wrote our own. It's designed to exploit all levels of sparsity. It has a bunch of nice bells
and whistles for people who care and [inaudible].
And the existing state-of-the-art solver basically only used a dense solver. Our
solver implemented both sparse methods as well as preconditioned CG solvers.
And in the experiments that we have tried, it's an order of magnitude faster than the
solvers which are there.
The largest problem that we have solved is the one that I talked about earlier with about
14,000 images and 27 million observations.
>>: What kind of preconditioner are you using here?
>> Sameer Agarwal: This uses a very simple block diagonal preconditioner and works
quite well.
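(A minimal sketch of a block-diagonal, block-Jacobi preconditioner with SciPy's conjugate gradient solver, assuming the reduced camera matrix S from above with roughly 9 parameters per camera; this is an illustration, not the group's solver.)

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def block_jacobi_preconditioner(S, block=9):
    """Invert each diagonal block of S (roughly one block per camera)
    and pack the inverses into a sparse block-diagonal matrix that CG
    can apply as a preconditioner."""
    S = S.tocsr()
    inv_blocks = []
    for i in range(0, S.shape[0], block):
        D = S[i:i + block, i:i + block].toarray()
        inv_blocks.append(np.linalg.inv(D))
    return sp.block_diag(inv_blocks, format="csr")

def solve_reduced_camera_system(S, rhs):
    # Preconditioned conjugate gradients on the reduced (camera-only) system.
    M = block_jacobi_preconditioner(S)
    x, info = spla.cg(S, rhs, M=M, maxiter=200)
    return x
```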
So here are the results for the city of Rome. So the city of Rome when you do this
matching, we don't know what's in the city of Rome. We just have a big pile of images.
We then go back and look at, all right, what do the various sort of clumps look like. So
we chose the more interesting ones, and I'm showing you those.
So this is the Colosseum. It contains I think about 1,400 images. So the black
[inaudible] that you see are the positions of the cameras, and that's the 3D point cloud.
So I talked about the interior. So this is the interior of St. Peter's Cathedral. And this has
about 2,400 images.
So, remember, the original dataset for Rome was about 150,000 images. These are the
particular clumps corresponding to the interesting tourist sites that came out of it.
Then this is the Venice dataset. This is the Canal. And this started with a quarter of a
million images, and this contains I think another 3,000 images on the Rialto Bridge.
The biggest reconstruction that we did is the San Marco Square, which is the 14,000
image reconstruction. And...
>>: [inaudible]
>> Sameer Agarwal: That's a good question. So there are two kinds of [inaudible].
Some of them are actually photographs from the Campanile tower. The others are
actually cameras which are actually not very well conditioned, so we haven't actually
removed them from the data -- their position is actually quite uncertain. So at some time
when you're using a telephoto lens, you'll get these things floating in the air. I could have
pruned it and showed you a cleaner version, but this is a slightly cruddier version.
>>: [inaudible] helicopter shot.
>> Sameer Agarwal: There might be a few helicopter shots.
So I talked about building Rome in a day. So I don't quite have Rome in a day. What I do
have for you is Dubrovnik in a day. So this is the old city of Dubrovnik. And here we
were actually able to reconstruct the entire city. It's small enough and there were enough
shots both from the ground as well as from a hill that you could sort of match all the
photographs together.
So this contains about 4,800 photographs, starting from about 16,000 photographs. And
this entire reconstruction was done in less than a day.
>> Rick Szeliski: Now, you're looking at the point cloud version, but,
you know, you could put the photographs in here and do the kind of transitions that
Sudipta was showing earlier.
>>: What do the points represent? The center of the photograph?
>> Sameer Agarwal: No. The points are actually on the surface. So, for example, if I
were to image this room, these are points in the walls. These are solid surfaces.
>>: [inaudible] these two photographs?
>> Sameer Agarwal: [inaudible] photographs, yes.
>>: So I'm not quite sure I understand the density of the point clouds. So when you
zoom in, some areas are much denser than others.
>> Sameer Agarwal: Yes.
>>: Is that geometric or is that photo density?
>> Sameer Agarwal: It depends both on photo density as well as the amount of texture
present. So, for example, if you have a flat-colored wall and you don't really detect very
many features, then you can't really match them.
We depend on getting sort of discriminative points that we can match across images.
So where are we now? Our image matching system is a distributed batch-oriented
system. Since we wrote it ourselves, it's not particularly [inaudible]. So if it dies, it dies.
The largest experiment that we've done up to now is quarter of a million images in 27
hours on 500 cores. This is at least 10 to 15 times slower than where we want to be.
We are quite confident of getting the 10x. The 15x would require -- I think 10x is
[inaudible] in the current system, the next 5x requires more research. Yes.
>>: [inaudible] computational estimate on matching painting the images over again?
>> Sudipta Sinha: The video that I showed you or --
>>: [inaudible] Rick talked about --
>>: Building a 3D model out of this [inaudible].
>> Sameer Agarwal: So the navigation using the basic Photosynth interface, that's pretty
straightforward. That you can do in real time now, once you have this data. Because the
key thing that you need there is the position of the camera, and then you can just
[inaudible]. But if you want to do something like what Sudipta is doing, we're talking
quite a bit of time. Hard for me to even put a number on it.
>>: Yeah. But you've already done all the least squares heavy lifting, right?
>> Sameer Agarwal: Only for the camera --
>>: Huh?
>> Sameer Agarwal: We have the camera poses from this. What Sudipta is doing is still
getting a very dense reconstruction on a per-pixel basis. That's another order of
magnitude work.
>>: Yeah, yeah, yeah. So if you have --
>>: So if you treated this [inaudible] then it's an order of magnitude less than doing it for
every pixel. But it's still a fair bit because you just don't know if the density is enough.
>>: [inaudible] nonhomogeneous density [inaudible].
>>: Sure.
>>: It should be easy; it turns out not to be.
>>: It ought to be easier than that. I mean, you shouldn't have to go all the way back to
the least squares of pixels from these least squares of features. I mean, there ought to be
some bridge.
>> Jim Larus: Let's thank this group. They've done a great job.
[applause]
>> Jim Larus: [inaudible] get the next [inaudible] going, Dennis Lin.
>> Dennis Lin: Okay. Good morning. My name is Dennis Lin. And in some senses I
have the opposite issue because the work that I have done, that I'm going to talk about today,
was work with Mert Dikmen and other members of the Image Formation and Processing
Group at the University of Illinois. But only I came here, so I get to talk about
everything.
Now, when our advisor last got up and talked about computer vision, he basically said --
well, as he said, I quote, we have no idea what we're doing. And in some sense that's
true. Because if you look at other parts of, say, image or video processing, we've kind of
stabilized on the algorithms.
In image compression, we pretty much stabilized on JPEG. You take little blocks, you do
a little bit of DCT encoding and you have a compressed image.
And you do kind of the same thing with video, except you do some motion compensation
and there's some other tricks in there.
But in some sense, you know, there is still work to be done. But there's been
something that's good enough for the industry to accept and people have the standard that
they are working off of.
Similarly, in some sense vision and rendering are opposites. And in rendering, industry
has largely settled on micro-polygonization. Micro if you're rendering movies, just
regular polygonization if you're doing video games.
And we don't have that in vision yet. We haven't really gotten to the point where we have
a lot of stable algorithms where we all -- that we can claim as our go-to algorithms
when we want to solve vision problems.
And that's what makes it exciting.
The other thing that makes it exciting is that vision is highly parallel. The best-known
vision system, of course, is our human brain. A large percentage of our brain is devoted
to processing visual input. And it is a highly parallel system. Our neurons are clocked at
something like 10 hertz, but we have lots of them and so things go fast.
The other thing is that vision is local, which makes it good for -- makes it nice for data
parallelism.
We tend to worry about objects which are local in space. Now, the next person will tell
you how the light bouncing off of this projector is affecting the way my face looks. But
when I worry about vision, I tend not to -- when we're doing object detection, we tend not
to worry about that too much. We tend to worry about just the couple pixels around that
object, and we try to pick out what -- recognize what that object is.
Similarly, when we describe events, we can describe long-running events, like me pacing
back and forth at this talk. But often when we describe events for, say, event detection,
we're talking about short-term events because that's just easier for us to understand.
So in many ways we're talking about fairly compact subsets of space time that we're
trying to recognize.
Now, on the other hand, when we're starting to work with vision and videos, we start to
run into performance issues. So we have realtime constraints if we're doing -- so we're
going to talk about two applications: hand tracking and video event detection. And hand
tracking we're trying to use it for human computer interfaces. So we have a real -- a
fairly strict realtime constraint. We probably don't need to go all the way to 30 frames a
second, but at least, you know, five to ten would be nice.
Some -- when we work -- go up to batch processing, things actually get worse. Because
then we're expected to be processing hundreds of hours of video. And even then a
realtime system would be kind of on the slow side. So if you have a hundred hours of
video to process, we really don't want to be waiting two, three days for this to finish.
So in some sense we actually want super real time when we move to the batch systems.
So first I'm going to talk about my work on hand tracking. The idea is to have a
gesture-based interface. And it would be useful for environments where you don't want to
be carrying around an input device, maybe a public terminal, maybe an augmented or
virtual reality system where it's too cumbersome to carry things around.
And it's also a first step towards sign language recognition. To actually complete sign
language recognition, we need to be able to track both hands at once and understand
something about facial expressions.
And this is a hard task. The hand is relatively small and moves fast. We have a data glove that we use to record hand gestures, and making this motion here exceeds the frequency at which I can actually sample. You would actually see the data go from fully closed to fully open between consecutive samples at 80 hertz, which was the sampling rate of our glove.
And the nice thing about tracking hands is that you're sure what color it is. It's always
skin color and you don't have to worry about clothing, artifacts, wrinkles.
The bad thing is that there's nothing to track. You know, maybe on my shirt you can lock onto my buttons and kind of track me as I move around. But on the fingers, at the resolutions that you're likely to see, you basically have a featureless blob. So we actually use [inaudible].
Then comes the infuriating part of this. The hand is structured. Obviously the set of all
possible hand silhouettes is a relatively small space compared to the space of all possible
binary images. But it's not characterized easily by a bunch of linear vectors. Because if
we combine two of these, we end up with a blur. And that's because the motion is
articulated. There is pattern in the motion.
These [inaudible] can only pivot around this joint because there is a joint there. But to
characterize this in a nice kind of linear or easy to understand way is just impossible.
You end up with a fairly complicated space that you have to work your way through.
And the space is large.
Your fingers, if you count all the degrees of freedom, have about 20 internal degrees of freedom; you add six to place the palm, and [inaudible] about 26 degrees of freedom for each hand.
If you think about the rest of your body and you -- not counting the hands, you have
about 50 degrees of freedom if you count all the joints.
So it's kind of like saying that both your hands are getting close to the entire joint angle
space of your body.
So there are a couple approaches to this. One way is to take your image and do a bunch
of regression or some database search or some other method and go straight from here to
here. So go from the image to the joint angles, which is what we're ultimately after. We
want to know the state of the hand.
The other approach is to do analysis by synthesis. We propose a candidate set of joint angles and we synthesize what the hand would look like. Then we compare that with what we actually see in the camera, we do some twiddling, we propose something else, and we go around the loop until this image matches that image, which means that these joint angles are right.
Now, I've drawn this as kind of a simple loop. In order to get this to work, you actually need to do lots and lots of candidates at once -- tens of thousands, if not more.
And the other important part here is this error computation. If you just use sum-squared difference, you run into the issue that the error goes in the wrong way -- you want to decrease error when you're solving. Suppose that green is what we see in the camera, blue is what the current candidate is proposing, and white is the overlap.
So in this case we -- the camera has a thumb sticking out but our thumb -- our virtual
thumb isn't quite out far enough. If we move the thumb further out, we actually increase
the error, if we just use a naive sum-squared difference metric. The reason is because
we're exposing more of the thumb without reducing the error associated with the camera.
And, conversely, if we just shrink the thumb back into the palm, we actually reduce the error. This is the wrong direction to go. And if we just use this kind of metric, we wouldn't be able to actually converge the hand. The basin of attraction becomes much smaller.
Obviously, if we continued doing this, it would eventually go down to the minimum. But getting there would be difficult.
So what we need to do is actually use chamfer distance. And to do chamfer -- so this is a little bit technical, but the key point is that if you look at this part here, the error
associated with this point here is its distance to the closest blue. So this means that if we
try to tuck the thumb back into the thumb -- into the palm, this -- the error associated with
this becomes greater because it's now really far away from any pixel in our rendered
image.
And so, conversely, if we move the thumb closer, this error goes down and that
compensates for the fact that this -- we're adding error here.
>>: [inaudible] every pixel?
>> Dennis Lin: Yes. And actually that's one of the keys -- so, yes. We sum
independently over each pixel and it's the actual total sum that we care about. That's
actually one of the important things that we worry about later when we try to accelerate
this.
So in order to achieve chamfer distance, we actually need to perform a distance
transform. Now, often what people would do in the literature is they would perform an exact chamfer distance transform on the camera image, because that's easy -- that only comes in once per frame, 10 or 30 times a second. And for the other direction they do some kind of approximation, a database lookup or something, because that distance transform you have to do tens of thousands of times -- well, maybe a couple thousand times per frame.
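Just to make that metric concrete, here is a minimal sketch in Python of a symmetric chamfer-style error between a binary camera silhouette and a binary rendered silhouette, using an exact distance transform from scipy. The array names, the toy masks, and the equal weighting of the two directions are illustrative assumptions, not the tracker's actual code.

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def chamfer_error(camera_mask, rendered_mask):
        # distance_transform_edt gives, at every pixel, the distance to the
        # nearest zero, so inverting a mask yields distance to its "on" pixels.
        dist_to_rendered = distance_transform_edt(~rendered_mask)
        dist_to_camera = distance_transform_edt(~camera_mask)
        # Camera pixels far from any rendered pixel penalize a thumb that is
        # tucked in; rendered pixels far from any camera pixel penalize a thumb
        # that sticks out too far.  The total is a single scalar sum.
        return dist_to_rendered[camera_mask].sum() + dist_to_camera[rendered_mask].sum()

    # Toy usage with two offset blobs.
    camera = np.zeros((128, 128), dtype=bool); camera[40:90, 30:70] = True
    candidate = np.zeros((128, 128), dtype=bool); candidate[45:95, 35:75] = True
    print(chamfer_error(camera, candidate))

With this kind of metric, pulling the virtual thumb toward the real one reduces the error, which is exactly the behavior the plain sum-squared difference was missing.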
So this is what we want. This is a simplified diagram. If we have two rectangles that correspond roughly to parts of our hand -- so just imagine that these are two bits of our hand -- we want to compute the distance transform.
And if you're rendering on a graphics card, you can do something similar. You can render
this image. So we have the same blue regions. And this doesn't extend forever, but we
don't need that; we only need a little bit of a buffer. And it's a little hard to see, but you
can see that it goes from red to yellow, and it does the right thing in the middle where
these two roughly meet.
So this isn't a circle. It's kind of a rectangle. We can fix that by adding more polygons.
But this is a close enough approximation for most purposes.
And the key to this is that we can render this per candidate, so we can get a full
distance -- sorry. We can get a full distance transform for every generated image.
And the way we do this is, when we render every polygon that is part of the hand, we also render a bunch of additional polygons at the back of the z-buffer. And we can use the z-buffer to disambiguate which part should go in front.
So this is the first pass that I made at making the hand tracker. And the nice thing about
using OpenGL for rendering is that, well, we're actually rendering a hand. So this isn't
like traditional GPUs where you render kind of an image, a square or rectangle, and you
don't actually ever expect to look at it. We actually want an image of the hand to
compare to the image that we have on the camera.
And so in that case, OpenGL gives us nice things: a perspective camera, hardware polygonization, the hardware z-buffer, which gives us a distance transform, mipmap-based zooming, perspective-correct camera angles.
The problem is that I started running into bottlenecks. Nobody plays games at a thousand
frames a second. And I want tens of thousands of these things per second.
Also, if I were writing a multimillion-dollar game, the people at NVIDIA would rewrite their drivers so that my code ran fast. But I'm not, so my code runs slow. And
OpenGL is a very flexible interface, but it also gives the driver implementers a lot of
flexibility, so I can't ever quite figure out what I'm doing wrong or how I can make things
go fast.
Also, the exact kind of shading and drawing I'm doing is not a perfect fit for modern
graphics hardware. But the biggest bottleneck I found was the fact that I needed to
actually render the image.
So if I render the image, I need to actually write it into memory. And that actually turned out to be a bottleneck, because I don't actually care about the error image; I care about the sum of the errors. And if I can get to the point where I don't need to do all those memory writes, and then the read back in, then I would actually get a significant performance advantage.
So I ended up writing my own software renderer, which is different from OpenGL. And since I'm dealing with relatively simple geometries -- I'm only rendering silhouettes of hands -- I based it off of a simple [inaudible], which is basically distance to the line segment.
And so each of these bits of finger is a little line segment. Everything inside a certain radius of that line segment is considered inside the finger, and everything outside is outside. And since each time we compute this inside/outside test we also compute the distance, we can compute the distance transform on the fly as we go.
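As a rough illustration of that idea, here is a minimal numpy sketch that renders a silhouette as a union of capsules -- distance to a line segment, thresholded by a radius -- while keeping the distance field around as an approximate distance transform. The segment list, radius, and image size are made-up values, not the actual hand model.

    import numpy as np

    def render_capsules(segments, radius, size=128):
        """segments: list of ((x0, y0), (x1, y1)) finger bones in pixel coordinates."""
        ys, xs = np.mgrid[0:size, 0:size]
        pts = np.stack([xs, ys], axis=-1).astype(float)        # (size, size, 2)
        dist = np.full((size, size), np.inf)
        for p0, p1 in segments:
            p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
            seg = p1 - p0
            # Project every pixel onto the segment, clamped to the endpoints.
            t = np.clip(((pts - p0) @ seg) / (seg @ seg), 0.0, 1.0)
            closest = p0 + t[..., None] * seg
            dist = np.minimum(dist, np.linalg.norm(pts - closest, axis=-1))
        silhouette = dist <= radius                             # inside/outside test
        dist_transform = np.maximum(dist - radius, 0.0)         # distance to the silhouette
        return silhouette, dist_transform

    # Toy usage: two "finger" segments.
    sil, dt = render_capsules([((40, 60), (40, 20)), ((60, 60), (75, 25))], radius=6)
    print(sil.sum(), dt.max())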
So with the software renderer, we implemented -- oh, I'm sorry. We implemented it in C++ and in CUDA. There's actually also a Python version for reference. And the CUDA version has the advantage of avoiding the writing to memory. CUDA gives you access to the small on-chip cache of a GPU, and what we do is we render into that cache and then only write out the final result, because all we care about is just the sum; we don't care about the intermediate values.
So these renderings are all 128 by 128. So relatively small in terms of actual rendering, but I want lots of them. And that's actually bigger than the on-chip cache, which is only 16K. So you have to do some tiling.
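Here is one way the tiling could look, sketched in Python: the silhouette and its distance field are produced one horizontal strip at a time, and only a scalar error is accumulated, so the full 128-by-128 error image never has to be written out. The tile height, the capsule geometry, and the exact error terms are my own illustrative choices, not the CUDA kernel itself.

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def tiled_error(camera_mask, camera_dt, segments, radius, size=128, tile_rows=16):
        # camera_dt: distance transform of the camera silhouette, computed once
        # per frame (for example with scipy, as in the earlier sketch).
        total = 0.0
        for y0 in range(0, size, tile_rows):
            ys, xs = np.mgrid[y0:y0 + tile_rows, 0:size]
            pts = np.stack([xs, ys], axis=-1).astype(float)
            dist = np.full((tile_rows, size), np.inf)
            for a, b in segments:
                a, b = np.asarray(a, float), np.asarray(b, float)
                seg = b - a
                t = np.clip(((pts - a) @ seg) / (seg @ seg), 0.0, 1.0)
                dist = np.minimum(dist, np.linalg.norm(pts - (a + t[..., None] * seg), axis=-1))
            rendered = dist <= radius
            cam_tile = camera_mask[y0:y0 + tile_rows]
            # Keep only the per-tile scalar; the tile itself is thrown away.
            total += np.maximum(dist - radius, 0.0)[cam_tile].sum() \
                   + camera_dt[y0:y0 + tile_rows][rendered].sum()
        return total

    cam = np.zeros((128, 128), dtype=bool); cam[40:90, 30:70] = True
    print(tiled_error(cam, distance_transform_edt(~cam), [((40, 60), (40, 20))], radius=6))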
So here are the speed performance results. So the fastest -- actually, if you pull a couple more tricks, I've gotten this a little bit higher, maybe closer to 10,000 comparisons per second. And this is actually a pretty good speed. It's still not quite where I want it to be, but you start getting reasonable results when you are talking about 25,000 comparisons per second.
Don't look too hard at this number, and really don't look too hard at this number. I didn't put anywhere near as much effort into optimizing this. This is actually running on Mesa, so just about the slowest possible OpenGL implementation in existence.
This is semi-reasonable. The image I'm rendering should fit in L2 cache, so it should be okay. But I didn't take any special care to make sure everything fits in L1 cache. And I'm not using vector instructions like SSE. So I can probably get another order of magnitude, maybe another 10x or so, on this if we pushed it.
And this is using a single core -- a single Core 2 core. And even though this is a dual-GPU card, I'm only using one GPU for this number here.
And let's look at results. So here's tracking the hand. And it does a reasonable job. This
is actually running -- I've actually sped the video up 2x relative to what the live output
would be. Or, in other words, the tracker actually runs at half the speed that you're
currently seeing.
When you're doing tracking, it's not exactly fair to talk about frames per second. So this
is display at ten frames a second, and the tracker runs at five frames a second. But in
some sense it's all about latency because if you have a thousand frames a second and you
can keep up with that somehow, your tracking becomes a lot easier because nothing
changes from one frame to another.
>>: [inaudible]
>> Dennis Lin: Yes. Oh, I'm sorry. We use two cameras because we're only using the
silhouette. And if you have a situation where you do this, it becomes impossible to see
where the fingers are. Whereas, if you have a second view you can get some guesses.
So you can see when I get to the -- kind of the --
>>: The little nonblue artifacts there at the right side, for example, what are those?
>> Dennis Lin: I think those are just artifacts. So the --
>>: Errors or something?
>> Dennis Lin: No. That's the other rather important thing. I don't really have groundtruth for this, because I don't know where my joints really are, and I'm not about to sit in front of an x-ray to actually figure this out. I mean, I've seen one paper where they took a bunch of cadavers, inserted a surgical wire, and then moved the joints around and took x-rays of that. But there's a limit to what I'm willing to do for science.
[laughter]
>> Dennis Lin: So that's our first application. Our second application is video event detection. And this is part of the TREC events sponsored by NIST. TREC originally was a text retrieval conference, and then they had TRECVID, which is about video processing, and they took a bunch of, say, BBC video and tried to annotate it.
And then they added surveillance and event detection. And surveillance event detection
is basically we want to detect events. So here's a person putting a cell phone to his ear or
her ear. There are two people hugging up here, embrace. And then -- so this is London
Gatwick Airport. There are five different camera views; one was of an elevator, which
isn't that interesting.
Oh, sorry. First there's a person putting down an object. People aren't supposed to walk through those doors; there's a person pointing there. And there's going to be one person taking a picture.
So that's a sample of the events that we are supposed to be detecting. There's ten of them
in total. We didn't actually do all of them. We did a reasonable number of them. So like
I said, this is surveillance, but it's from London Gatwick Airport, five stationary cameras,
50 hours of training, 50 hours of testing. That was 2008. 2009, this year, they gave us
this also as training, and they're going to give us 40 more hours as testing.
And so if you add this all together, we have 9 million frames to process.
And this is our overall pipeline. It's actually fairly straightforward. We take our video,
we do some background subtraction to subtract out completely uninteresting parts, we
extract some features. And then we just use nearest neighbor as our classifier.
So we take the features that we extracted and our canonical features from the database, and we do lots of comparisons, and then we output some values. And we do some post-processing filtering on the output.
The key here is that we generate lots and lots of data in this step, because we have on the order of maybe a thousand windows per frame by the time we're done scaling, and then we generate 600-value features out of this.
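To give a feel for where that data volume goes, here is a minimal Python sketch of the pairwise-distance / nearest-neighbor step. The window count, database size, feature length, and the use of Euclidean distance are illustrative assumptions, not the actual classifier settings.

    import numpy as np

    windows = np.random.rand(1000, 600)      # ~1000 windows per frame, 600-value features
    database = np.random.rand(5000, 600)     # canonical features with known event labels
    labels = np.random.randint(0, 10, 5000)  # event label for each database entry

    # Pairwise squared Euclidean distances, shape (1000, 5000), without an explicit
    # double loop: ||w - d||^2 = ||w||^2 - 2 w.d + ||d||^2.
    d2 = (windows ** 2).sum(1)[:, None] - 2 * windows @ database.T + (database ** 2).sum(1)[None, :]

    nearest = d2.argmin(axis=1)              # closest database entry per window
    predicted = labels[nearest]              # nearest-neighbor label per window
    print(predicted[:10])

The big dense multiply in the middle is the kind of operation that was later moved onto the GPU.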
>>: [inaudible]
>> Dennis Lin: We actually used [inaudible] motion features. So it's actually a fairly
simple feature, which probably is why it doesn't work that well.
So we take optical flow on a bunch of frames. So this is a motion feature; we're working our way through time. And we take the optical flow image and we separate it into positive X, minus X, positive Y, and minus Y. So we sum all the positive Xs and we sum all of the negative Xs and we get two separate values. That way it lets us understand if something's waving back and forth, as opposed to standing still. Because if we just added everything together, it would average out to zero.
And then we just concatenate these together. And there are some parameters on how many frames to take and how big to subdivide each region, and so on and so forth. But we ended up picking something that had about 600 values per window. And a window spanned about a second in time.
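Here is a rough Python sketch of that kind of motion feature: split the optical flow into its positive and negative x and y parts, sum each part over a coarse grid of cells for every frame in the window, and concatenate. The grid size and frame count are made-up parameters, not the ones the team settled on.

    import numpy as np

    def motion_feature(flows, grid=(4, 4)):
        """flows: array (T, H, W, 2) of optical flow (dx, dy) over a roughly one-second window."""
        T, H, W, _ = flows.shape
        gh, gw = grid
        feat = []
        for t in range(T):
            dx, dy = flows[t, ..., 0], flows[t, ..., 1]
            # Positive and negative parts are kept separate so that back-and-forth
            # motion does not cancel out to zero.
            for p in (np.maximum(dx, 0), np.maximum(-dx, 0),
                      np.maximum(dy, 0), np.maximum(-dy, 0)):
                cells = p[:H // gh * gh, :W // gw * gw].reshape(gh, H // gh, gw, W // gw)
                feat.append(cells.sum(axis=(1, 3)).ravel())   # one sum per spatial cell
        return np.concatenate(feat)

    # Toy usage: 10 frames of random flow over a 64x64 window.
    f = motion_feature(np.random.randn(10, 64, 64, 2))
    print(f.shape)   # 10 frames x 4 parts x 16 cells = 640 values, in the right ballpark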
So if we look at the processing involved in this -- so this is the version that we kind of used before -- it was dominated by the feature extraction and the pairwise distance. The optical flow and the image decoding were insignificant. We then ported these two parts to the GPU.
And as you can see, they're now these two pieces here. So we significantly shrunk those two components of the processing. We added this here, this component transferring to the GPU. But we've made significant strides in actually improving the overall speed.
So here's the per-component timing. We didn't actually use a GPU optical flow; we actually used the same code, so that's why these times are the same. Like I said, we added some time because we needed to transfer things to the GPU. But we got a massive speedup in the feature extraction and the pairwise distance.
Again, don't look at this too closely. I spent about a day writing both of these versions, like one day each. So they're not very well optimized, and those numbers can come down significantly.
And then on to results. So this is the sad part actually. Take the pointing example. There are about 2,000 actual pointing instances. We managed to find about 200 of them, but we gave about 30,000 false positives. That's actually pretty bad. That gives a score of 3.6. Zero is ideal. One is if you submitted nothing.
So as you can see, not too many people actually broke 1 for this. I mean, it's a really hard vision problem. It's one of those things about computer vision where the human brain, which is really good at these things, can detect pointing no matter which way I'm facing, if I'm pointing down here, pointing up here.
Getting computers to do it, we're still not quite there yet. We did do better for some tasks where we actually used a specialized detector. So remember the opposite flow, when people are walking through these doors the wrong way? Well, what we did is we used a 3D Gabor filter which detected motion flowing in this way -- so we used a Gabor filter bank. And then we detected the peaks in that and we produced output.
And that actually worked reasonably robustly. So in some sense, if we can understand
our problem and we have some chance to fiddle with it, we can tend to do better. So here
we actually broke -- we actually broke 1. We didn't do the best. IntuVision clearly did
much better because they got 9 out of the 12 true positives and only returned 12 false
positives.
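As a rough illustration of that kind of specialized detector, here is a minimal Python sketch of a spatiotemporal (3D) Gabor filter tuned to one motion direction, applied to a stack of frames, with a simple threshold standing in for peak detection. The kernel size, frequency, direction, and threshold are illustrative choices, not the parameters of the actual TRECVID detector.

    import numpy as np
    from scipy.ndimage import convolve

    def gabor3d(size=9, freq=0.25, direction=(1.0, 0.0), speed=1.0, sigma=2.5):
        """Real part of a 3D Gabor kernel tuned to motion along `direction` at `speed`."""
        r = np.arange(size) - size // 2
        t, y, x = np.meshgrid(r, r, r, indexing='ij')
        dx, dy = direction
        # A plane wave that drifts over time responds to motion in one direction.
        phase = 2 * np.pi * freq * (dx * x + dy * y - speed * t)
        return np.exp(-(x**2 + y**2 + t**2) / (2 * sigma**2)) * np.cos(phase)

    def wrong_way_response(frames, threshold=None):
        """frames: (T, H, W) grayscale stack.  Returns squared response and a peak mask."""
        kernel = gabor3d(direction=(-1.0, 0.0))    # e.g. leftward motion through the doors
        response = convolve(frames.astype(float), kernel, mode='constant') ** 2
        if threshold is None:
            threshold = response.mean() + 3 * response.std()
        return response, response > threshold

    # Toy usage on random frames.
    resp, peaks = wrong_way_response(np.random.rand(20, 48, 64))
    print(resp.shape, peaks.sum())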
So I think the thing -- the key takeaway here is that if we have some chance of tuning our
algorithm or adapting our algorithm to the task at hand, we tend to do significantly better
than if we try to just apply a general purpose algorithm.
And even if we are doing general purpose algorithms, we need some way of
experimenting to try to tune the parameters.
So which brings me to a short plug on the system that we -- framework we actually built.
Mert isn't here, so I can tell you that this is actually the second implementation of the
system.
The first time we tried it, it was written basically purely in C. And we had lots of issues with it -- basically it turned into one giant ball of mud and it was basically impossible to extend. So we decided to start over.
And it's a Python framework. And the idea is to take care of a lot of the grunt work of reading in video, identifying the video, pulling in the video labels, pulling in the location labels.
And then we implemented a bunch of the features. We borrow heavily from OpenCV and various libraries. And we also just incorporated an optical flow library -- a GPU-based optical flow library. I haven't used it yet, but I'm told it's actually pretty robust.
So here's a Web site if you ever want to go and visit, submit patches.
And I think that brings me to the end of my talk.
The key point I want to emphasize is that having faster computers helps us get our job done. I don't think the hand tracking the way that I've proposed it is feasible without the amount of performance I'm getting out of the GPUs right now.
Also, for event detection, it doesn't necessarily help us get good results. But in some sense, letting us get bad results faster lets us improve faster.
And, you know, like I said in the beginning, we -- there is no go-to algorithm for
computer vision, say, in event detection. So we're definitely on the cutting edge of
research here. And having more processing lets us explore our space better and lets us
try new things, especially if we're going to be dealing with, say, 40 hours of testing data
and we actually want to see -- or if we're doing tens of hours of testing data, we actually
want, say, super real time evaluation to see how our algorithms are doing.
And I think also at the same time we also need a system which lets us do that. We need
something that's actually fairly flexible and can adapt to changing conditions and
changing hardware and changing algorithms.
So I think that concludes my talk. Are there any questions? Yes.
>>: If you're running the security analysis in real time, can you bring up the accuracy?
>> Dennis Lin: I imagine not. Because, I mean, we only make one pass through the data in our testing phase. We do do a little bit of post-processing. So, if anything, it would bring down the accuracy.
At some point if we're doing buffering, then it probably becomes equivalent. Again, the
events themselves are local. So if we buffer enough video, it's just like doing static -- it's
just like doing batch detection.
So unfortunately there are more significant flaws in our algorithm as currently
implemented. We're actually going to be trying different features. Since we have a
Gabor feature, we're going to do a Gabor filter bank. Maybe that will work. And we're
trying some other ideas. And hopefully we can get better results for this year. Yes.
>>: What's your [inaudible] rate?
>> Dennis Lin: So the video is PAL at 25 frames a second. We actually didn't have the resources to process that, so we actually skipped frames and processed only about one frame in five. And I think it was taking like a minute to do a frame. So we were running on a cluster. When we did this, we were running on a cluster of CPUs because we didn't have the GPU code up and running.
Let's see. The new numbers. So once we have the new numbers -- okay. So that's in
milliseconds. And that's basically the total execution for one frame. So I think we're
actually getting to the point where we're doing a second a frame. Or let's see. If we add
these numbers together -- about 200 milliseconds per frame.
So we're doing about five frames a second with the GPU version.
>>: [inaudible]
>> Dennis Lin: Well, the optical flow is just OpenCV's implementation. So at our frame size it looks like it was running about 80 milliseconds per frame.
>> Jim Larus: Well, let's thank our speaker.
[applause]
>> Andrew Wagner: Okay. Great. So I'm sort of in the same position Dennis was in, in the sense that I've been collaborating with a bunch of great people at the University of Illinois: my advisor, Yi Ma, and particularly my lab mate, John Wright. And I get to present a whole line of research that we've been doing on face recognition.
So why face recognition? Face recognition is one of these tasks that people have been
working on in computer vision for a very long time but still hasn't reached a very large
level of adoption out in the real world.
There doesn't seem to be any really huge technical reason why we shouldn't be expecting
our cars to just have little cameras in them that recognize us, they know who we are and
let us in. Same with our houses. But still not working, for some reason.
So, in any case, I'm going to go into some depth on some of the attempts that we've -- some of the progress we've been making towards getting face recognition to work for the access control case.
And all of this is going to be for just a single test image. We're not doing anything with
video.
So why is face recognition desirable as a general technique for access control and
recognition? Well, it's noncontact. You don't have to touch anything. The measurement
can happen instantaneously, if you're just taking a single picture of the person. So
currently we're limited by the time it takes us to actually process the one image we take.
No really special sensors are needed. With iris recognition, for example, you have to stand pretty much right next to the thing to get it to work.
And attacks are also very conspicuous. One of the applications that does exist in the real world is -- I think it's Toshiba -- a laptop that has face recognition software built in. But people have found that if you hold up a picture of the person, that's enough to defeat it.
But if you're doing that outside of a building, it's going to be pretty conspicuous that
you're waving around a photo of someone, right? So that's a plus compared to, you
know, other techniques, where if you're swiping a card it's not obvious if you're holding
someone else's card, for instance.
And also recognition could be fully automated as well. The dream is that you walk up to the door, it recognizes you and just lets you in.
So another couple of applications where face recognition is beginning to be adopted is in people -- I'm sorry, is there a question? Okay -- is in sorting their photos on their home computers. So this is a pretty low-stakes and easy application because, number one, you have a small number of people, since you're just sorting photos of your kids, and no one's going to die if you have a crummy recognition rate with it.
And last of all, you can kind of mask the poor performance of the face recognition
algorithm if you have a clever user interface and you just say, oh, we're going to suggest
some potential matches. It doesn't have to have a very good recognition rate to be useful.
However, face recognition hasn't gained much adoption at all for security applications.
And after 9/11, there were some very high-profile public trials where they tried installing
very expensive face recognition systems in public places, like in airports, and they also
installed it in a football stadium, and it pretty much failed. It used up far more time of the
employees than it was worth. And it was only working, you know, at best half the time.
So why is face recognition difficult? It turns out that people's faces look very different under different illuminations. Humans are very good at handling this just naturally -- it's a question for the psychologists to figure out exactly how the human brain does this. But we have some sort of 3D representation of people's faces and we know what they look like under different illuminations if we have seen that person a whole lot before.
Humans actually aren't very good at face recognition if you've only gotten one glimpse of
a person or you only have one image of a person, or if you're searching over a lot of
people. And this is a pretty common experience. I know -- this is my first time in this
town, and just walking around, I kept seeing people, oh, I know that person, oh, no, I
don't, no, I do -- you know, I kept seeing people I thought I recognized. But that's, again,
our -- even humans' face recognition isn't as good as we think it is.
And, again, the other thing that makes face recognition difficult is that obviously, if you're just looking at a single image of a face, it looks totally different if the pose is different.
And also you may not always get a clean shot of a person's face. They may be wearing a
hat. They may be wearing sunglasses. They may just have their mouth open, and that
creates an area similar to an occlusion.
And classical -- some of the classical algorithms don't do very well on this. So
techniques like Eigenfaces, LDA, nearest neighbor.
So the line of research in our lab is based on an idea of searching for sparse
representation of your test image in terms of your training images. And I'll go into more
detail about what I actually mean about that later.
So I'm going to talk about -- first I'm going to talk about how we actually take training
images. You need to take images of the people to build a database of what they look like.
And it turns out that that's very -- it's very sensitive to how you take those images. You
really have to do a good job of that.
Also we have to do a very good job of automatically aligning the test image to the
training images. And I'm also going to give a quick overview of the techniques we're
investigating to make this happen in a reasonable time frame.
So just a quick single example of the size of data that I'm talking about. The full thing is the full image, roughly as it comes off the camera. And these rectangles are the portions of the image that we actually use for recognition.
And the alignment is defining the mapping from a 60-by-80 image back to the full resolution image. These rectangles in the other space are 60 by 80 pixels.
So if you have poor alignment, that's what this black rectangle is. And it doesn't have to
be very -- it doesn't have to be shifting very much to make a difference. If the alignment
is poor, then when you find your representation, the coefficient for the correct user is
often lower than the other ones. So that's a fail. You'll see what these actually mean
later.
And similarly, if you use good alignment but you don't have a large enough set of
illuminations in your training set, then you also get a very -- often a very low coefficient
for the user that you care about. But if you nail them both, if you have a very good
alignment, the white box, and you also have a very good set of training illuminations, you
can do recognition very accurately.
So a little more on illumination. A lot of the classical assumptions that people make about objects are, first of all, that they're convex. This simplifies things a lot because you can't have any shadowing from one part of a person's face to another. They assume a Lambertian reflectance model. And they assume that the lighting of the object is distant.
And if you make all of these assumptions, there's a really elegant proof that is just
looking at the spherical harmonic basis functions for illuminations. They've shown that
you can get away with a very -- with basically a nine-dimensional subspace. If you just
have the right nine images of a person's face, then you could under those classical
assumptions do a very good job of representing that person's face.
Unfortunately there's a catch here, and that's that most of those classical assumptions just
don't really hold at all. So here's an example. A testing image, you can see we have
shadowing in a number of places on our faces, we have specularities in our eyes, on our
nose, forehead.
However, one nice thing that we have is that there is still this linear relationship between
the training images and the testing image. Just because your camera's effectively
counting photons, the superposition principle still holds.
So our strategy in our system is to experimentally determine what illuminations we
actually need to keep around.
So sort of the traditional way that has been used for capturing these recognition images,
at least for research systems, is to either build a big dome or build a big array of camera
flashes in front of your -- around your user. But you can probably get an idea. That's a
pretty major undertaking and it hasn't been done really all that many times. Most people
just -- most researchers in this area just operate off of public datasets.
It's difficult to reconfigure. It doesn't scale well in the number of flashes. If you want to
double the number of flashes, you have to pretty much start over with your hardware.
And you have to build that whole thing to get very good coverage and also have the
illumination distant.
So I said there's got to be a better way. So the way we came up with is to use projectors
to indirectly illuminate the person's face. It will end up being a lot easier to reconfigure.
It will be easier to change the illumination patterns. And it will be easier to construct,
deploy, and still get good angular coverage.
So the hardware setup looks like this. We have an array of DLP projectors, just because
they have a good contrast ratio, shining light on the walls -- we're not shining light
directly on the user -- that indirectly illuminates a user sitting in a chair. And we have
cameras that are synchronized with the projectors to take images.
So just to give you an idea that we're getting a very good angular coverage, if the user is
sitting here, I sort of diagrammed out the beam, so we're able to illuminate their face
from below as well as from above vertical or past vertical even. And, again, horizontally,
we have them close enough to the corner relative to the illumination that the widest part
of the light is actually hitting them from past horizontal.
So now we have all this flexibility. We can generate pretty much any illumination on the
person's face that we want to. And we need to -- so we need to restrict this problem a
little bit. So the technique that we use is we just chose a pretty fine sampling of
illuminations in the illumination space. And then we iteratively chose subsets of them to
try and figure out which ones we actually need.
And as you can imagine, there are diminishing returns when you do this.
>>: Single source? Single light source is what you're simulating?
>> Andrew Wagner: Yes. We're using a single light source at a time, and we're taking a
bunch of them in rapid sequence. So we end up with a database of the person's face
under different illuminations.
So, in any case, there are diminishing returns in the number of training images that you take. Pretty rapid falloff. We ended up settling on just grabbing that point, 38 training images. Again, much larger than the theoretically predicted nine training images.
And there are also diminishing returns in terms of the angular coverage. So we were -- the subsets we were choosing were increasing radially, and it turns out -- and this is pretty intuitive -- that illuminating the person from directly behind their head is not really so important. You still do need illuminations from behind horizontal, because light coming from behind you can still hit a portion of your face that's visible in the camera. And that's sort of the test for whether a point light source can affect the image.
So, again, there are two configurations that we take images in. We take some with the user facing the wall, so we get all the frontal illuminations; then we flip them around in the chair and we take another set of images, for a total of 38.
Anyone have any questions about this configuration? Okay. And, again, we're not
shining light directly into the user's face.
>>: So how long does it take for [inaudible].
>> Andrew Wagner: Right now it's taking a few seconds to capture all of these images.
But just because the synchronization between the cameras and the projectors could use
some work, you could probably get that under a second if you had access to the hardware.
Like I haven't taken apart the projectors and gotten access to the clock for the DLP chip.
There are people who have done this for doing realtime -- if you're building 3D models of
a person's face in real time, they've done this, where they take apart the projector and use
the clock from the chip to synchronize -- to really synchronize the projector with their
camera.
So now that we've got this great database of training images of a person's face under
different illuminations, how do we go ahead and align the images.
So as I said before, you know, just touching base again, if you have bad alignment, it
doesn't work. If you have -- if you have insufficient training images, it doesn't work. If
you get them both, it's right. So now we're on to the second thing.
So what do we do with our data? We take our training images -- those are all at the top -- and right now we're performing a manual alignment on them. Since we take them in rapid succession, it's not as difficult as it seems. You only need to click features twice for each person who sits down. But we're also working on automating that.
So in any case, we manually get a good alignment of all the training images to each other. We throw out the color information and we stack up the pixels into columns of a matrix. So for each user we have a matrix of data that's the number of pixels tall by the number of training images wide. So this is an extremely tall matrix.
And we have one of these for each user. Even when we concatenate them into one big global data matrix, it is still a very, very tall matrix in general.
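Here is a minimal Python sketch of that stacking step: each grayscale training image becomes one column, the per-user blocks sit side by side in a global matrix, and the columns are normalized (a detail mentioned again near the end of the talk). The image size, user count, and variable names are placeholders.

    import numpy as np

    H, W = 80, 60                 # downsampled face window, e.g. 60-by-80 pixels
    n_users, n_illums = 20, 38    # 38 training illuminations per user

    # Stand-in for the captured images: images[user][k] is an (H, W) grayscale array.
    rng = np.random.default_rng(0)
    images = [[rng.random((H, W)) for _ in range(n_illums)] for _ in range(n_users)]

    # One tall block per user: every image raveled into a column.
    A_blocks = [np.column_stack([img.ravel() for img in user_imgs])   # (H*W, 38)
                for user_imgs in images]

    # Global data matrix, with unit-norm columns.
    A = np.hstack(A_blocks)
    A /= np.linalg.norm(A, axis=0, keepdims=True)
    print(A.shape)   # (4800, 760): each per-user block is 4800-by-38, far taller than wide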
So how do we actually find this representation? The assumption that SRC makes is that our testing image is a weighted superposition of our training images by the coefficients X, and we also assume that we have some sparse error. So some pixels on the person's face could be just flat wrong due to some reason, usually occlusion.
However, that model assumes that the images are aligned. Now, if you include alignment in the optimization, then it unfortunately breaks the convexity of your problem; in particular, the way we have it formulated, it breaks the convexity of our constraint set.
So what we can do is we can basically linearize our constraint and iterate between
solving the system and recomputing the linearization.
Furthermore, as a performance optimization, we can get away with performing a similar
optimization on a per-user basis. So we align the test image to each individual user's set
of training images first, and then after that we can perform a global optimization.
And there -- okay. Again, just reiterating: since we've linearized our constraint set, we have to solve this optimization problem, and then we recompute the Jacobian of our error function with respect to the parameterization of our transformation, of our image warping.
>>: [inaudible] your constraints will be satisfied? I mean, you can [inaudible] but it
doesn't guarantee that your constraints will be satisfied.
>> Andrew Wagner: Yeah. Exactly. I mean, the convex problem, you're guaranteed to
converge to something if you set up your optimization correctly. But you're absolutely
right. With this iterative alignment, it's possible that if the person's face starts off with a
far enough deviation that it will converge to something completely different.
>>: What I was thinking was that this is the same kind of thing that you do for things like sequential programming [inaudible], putting a loop outside this linearization which controls the linearization itself. And you need to test how much you progress along the objective as well as the constraints and then make sure that one doesn't sort of bounce to the other [inaudible]. Just doing this would only work very, very [inaudible].
>> Andrew Wagner: Okay. I'll look into that and we can talk offline if you want.
So he's absolutely correct. This does pretty much only work locally. But the good
news -- well, I'll get to the region of convergence results later.
>>: If the person is sitting [inaudible]?
>> Andrew Wagner: In between the frontal set of images and the rear set of images, there's alignment to be done. But that's alignment within the training images. Right now I'm talking about alignment of the testing image to the training images. The testing image comes in and it has some completely unknown alignment.
So what we do, after we perform this per-user alignment, is we invert that transformation
that we found and we apply it to each training user's set of images. So now we have a
new global A in which, at least for the correct user, the training images are aligned with
the testing image. And we perform a global optimization that's pretty much exactly out
of the SRC paper where we're, again, minimizing over the L1 norm of our X and our
error subject to the constraint.
And we actually use the -- we use the -- this is an important thing. We use the
coefficients in our representation for recognition. You expect to see large coefficients on
the correct user.
>>: I notice your error metric is based on an L1 norm throughout. Any intuition about
why you think that's a good idea?
>> Andrew Wagner: Well, so the L1 norm is a convex relaxation of the sparsity of these vectors. So we know that X should be sparse in this step -- at least we know that there exists one solution where only a small number of the coefficients in X are large. Because A is 38 times the number of users you have wide, but we only expect to see energy concentrated on the images for the correct user.
And E is sparse because the error, at least for large errors like occlusions, will tend to be spatially localized in the image. Only a small subset of the pixels will be completely corrupted. Does that answer your question? Okay.
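To make the sparse representation step concrete, here is a minimal sketch that solves min ||x||_1 + ||e||_1 subject to A x + e = y by casting it as a linear program with scipy. The tiny random A and the off-the-shelf LP solver are purely illustrative -- as the talk discusses later, the real system needs custom solvers on much taller matrices, and it also adds a positivity constraint on the coefficients, which is omitted here.

    import numpy as np
    from scipy.optimize import linprog

    def src_l1(A, y):
        """Solve min ||x||_1 + ||e||_1  subject to  A x + e = y  as a linear program."""
        m, n = A.shape
        B = np.hstack([A, np.eye(m)])            # unknowns w = [x; e]
        k = n + m
        # Split w = u - v with u, v >= 0 so that ||w||_1 = sum(u + v).
        c = np.ones(2 * k)
        res = linprog(c, A_eq=np.hstack([B, -B]), b_eq=y,
                      bounds=[(0, None)] * (2 * k), method="highs")
        w = res.x[:k] - res.x[k:]
        return w[:n], w[n:]                      # coefficients x, sparse error e

    # Toy usage: 50-pixel "images", 3 users with 4 training images each.
    rng = np.random.default_rng(1)
    A = rng.standard_normal((50, 12)); A /= np.linalg.norm(A, axis=0)
    x_true = np.zeros(12); x_true[4:8] = rng.random(4)   # energy on the second user's columns
    x, e = src_l1(A, A @ x_true)
    print(np.round(x, 2))   # ideally, the large coefficients sit on columns 4..7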
So, again, an overview of the algorithm. We find some alignment between our testing
image and each of our sets of training images for each user. And as a performance
optimization, we find that we can get away with just keeping a top -- just keeping a
subset of those for the final recognition step, which I showed on the last slide.
Okay. Now, back to address your question about alignment a little more.
We computed the region of attraction of our alignment algorithm by synthetically perturbing our test image by different amounts and computing the actual full recognition rates at each point.
And it turns out that the alignment works up to roughly about a quarter of the width between the outside corners of your eyes, which is actually a pretty significant misalignment. And it's able to converge even for a dramatic perturbation of angle. You can be almost -- you know, a little past 45 degrees and still have it align.
And just a comparison of our performance: if we use the L2 norm, we found that there are a lot of cases where it doesn't converge to a good thing. And in this case we're using a perspective transformation as our class of image warpings, so the rectangle there -- you can see it's skewed. But it actually does converge to something much more reasonable if we're using the L1 norm.
And although our algorithm is designed and really only meant for images of training users taken from one known viewpoint -- right now we're always using frontal images -- we found that it does exhibit at least a little bit of robustness to out-of-plane pose variation. So that's just a little bonus thing. That's not a big result.
There are a lot of people working on performing face recognition really in the general
case where they're building three-dimensional model of the person's face so they really
can recognize them from a lot of angles. We're not doing that.
And just some results to give you an idea of how we're doing on public datasets. So we
pretty much clobber the traditional subspace-based face recognition methods. On
recognition we're succeeding in getting recognition rates above 90 percent. And for
detection -- so recognition is where you're answering the question of, you know, who's in
the image; detection is just is it a person in your dataset -- we clobber them as well. Our
algorithm performs a lot better.
And then, again, the diagonal in this ROC diagram is random. So if you have a -- if
you're just giving random results, then you get a diagonal.
>>: [inaudible] flip meaning true positive false.
>> Andrew Wagner: Right.
Okay. And so some of the examples where our algorithms can still have trouble. So for
the multi-PIE dataset, the training images were taken in a different session. They actually
had everyone come back in on a later date.
And it turned out that a lot of things had changed in people's faces. Some people dyed their hair. Some people were wearing glasses in their testing image that they weren't in their training images, and vice versa; some people took off their glasses, changed their makeup, grew a beard. So there are still a lot of challenges even once you've taken care of illumination.
And here are just some experiments on a dataset that we took in our lab, since most of the public datasets have rather insufficient testing images. They take all these images under the illumination rig, but they don't have time to take people outside and to different entrances in your building and really collect a database of people under different illuminations.
So we did that in our set. So in our dataset, we have much better training illuminations
because we have our nice carefully controlled acquisition system. But we have much less
controlled testing images as far as illumination goes. And we're able to still, again, get
above 90 percent recognition rate on the reasonable cases where it's just normal people's
faces or optical eyeglasses. Our performance does begin to fall if they have occlusions
like sunglasses.
Okay. So we're able to get very good recognition rates. But there's a downside to this. The recognition takes a long time. Right now it's taking about five minutes -- well, between two and five, depending on the number of training users you have in your dataset. Right now we're operating with about a hundred training users with about 38 training images each.
And really to be useful for access control, it needs to happen in under a second. You
don't want to be standing there waiting for the computer to crunch before it lets you in the
building if it's raining.
And there are a couple of areas where we can try to get the speedup. So there's this level of coarse-grained parallelism that I already talked about, where we're doing per-user alignment on a --
>>: So how much would having a wet face increase your error rate?
>> Andrew Wagner: Due to specularity?
>>: Due to rain.
>> Andrew Wagner: Due to rain.
>>: Yes.
>>: It's important in Seattle.
[laughter]
>> Andrew Wagner: Well, I mean, that's actually a very good point. So one of the big things about a person's face that changes when they get wet, either because of sweat or because of rain, is that you get a whole lot more specular reflection. And that's something that, if you have an insufficient set of training illuminations, makes you much more susceptible to failure.
So having this larger set of 38 training images already captures some of the information
about how your face responds to specular reflections. You can't really do a perfect job,
but you can do better.
And also the real challenge will be to achieve very efficient fine-grained parallelism of this.
So our algorithm gets very good results. It's able to make good use of global information in the image. We haven't had to chop up our image to operate on subsets of it at all, so we haven't sacrificed any of the global information in the image. It's conceptually very simple, but computationally it's very expensive. We're doing a bunch of linear algebra on very tall matrices.
And these are the reasons why we basically can't get away with using off-the-shelf L1 representation algorithms. Our arrays are extremely tall -- the number of pixels you have by the number of users. We have some domain-specific constraints. Since illumination is positive, we constrain our coefficients to be positive as well; otherwise there's a danger of overfitting.
>>: So then why won't the number of users grow a lot? I mean, more people in pixels?
Or why is that not likely?
>> Andrew Wagner: It could. Right now we're focusing on access control. And usually
there's only so many people who work and should be in a given facility. In the building I
work in, it's 300 people. But, yeah, you're -- exactly. And that is another direction in
which you'd like to scale up. And right now our main idea for that is doing a per-user
alignment. Right now most of our computation happens in the alignment step.
>>: [inaudible] 90,000 employees for 300 buildings.
>> Andrew Wagner: Yeah. True. If you're doing something for an entire campus, that will be a much larger challenge.
But, you know, it will probably be useful to add security to your swipe cards before it is able to completely replace them. And that will at least give you some -- a little bit of robustness against people stealing someone else's swipe card and using that to get in. It can give you at least some idea of whether it's actually the person who should be holding that card.
So what do the computations actually look like? There are sort of two general classes of algorithms for performing this sparse reconstruction. The first class is first order methods, basically projection-based methods, where the expensive parts that are inside the inner loop end up being multiplications -- wait, I'm talking about first order methods.
So for first order methods, you have a very tall matrix, your data matrix A, and you're multiplying that by vectors. You get a matrix vector -- that's a typo -- you get matrix-vector operations both for the tall matrix times the small vector on the other side, but you also have the left multiply, by A transpose, that you have to compute. And you also have a per-pixel thresholding step.
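For a sense of what that inner loop looks like, here is a minimal ISTA-style sketch in Python: one multiply by the tall A, one multiply by A transpose, and an element-wise soft-thresholding step per iteration. The step size, regularization weight, and iteration count are illustrative, and this solves the unconstrained lasso form rather than the exact constrained problem described in the talk.

    import numpy as np

    def soft_threshold(v, t):
        # Element-wise shrinkage toward zero (applied per pixel when the unknowns
        # include the error image e).
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def ista(A, y, lam=0.05, iters=200):
        """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 with a first-order method."""
        step = 1.0 / np.linalg.norm(A, 2) ** 2    # 1 / Lipschitz constant of the gradient
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            residual = A @ x - y                  # tall matrix times small vector
            grad = A.T @ residual                 # the "left multiply" by A transpose
            x = soft_threshold(x - step * grad, step * lam)
        return x

    # Toy usage: a tall 500-by-40 matrix and a sparse ground truth.
    rng = np.random.default_rng(2)
    A = rng.standard_normal((500, 40)); A /= np.linalg.norm(A, axis=0)
    x_true = np.zeros(40); x_true[[3, 17, 30]] = [1.0, -0.5, 0.8]
    print(np.round(ista(A, A @ x_true)[[3, 17, 30]], 2))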
>>: [inaudible]
>> Andrew Wagner: Yeah. These As are dense. The sparsity that I'm talking about is in the representation, the coefficients that you're searching for, not in the data matrix itself.
>>: Please tell me that you don't also have a 38-by-38 inverse.
>> Andrew Wagner: We actually -- well, for the second order method -- it's a linear
system that you have to solve to get the step direction for second order methods.
Right now most of -- all of these results that I've given you so far were with the second
order interior point method.
>>: [inaudible] using a pseudo-inverse [inaudible].
>> Andrew Wagner: Hmm? It's a linear system that we're solving, and we're using a
linear system solver. We're not computing a pseudo-inverse and then multiplying.
So, in any case, in the second order methods, the interior point methods, the most expensive computation is A transpose times a humongous diagonal matrix times another matrix. We only store the diagonal in this computation. So this precludes using off-the-shelf matrix multiplication routines.
And as we just mentioned, there's a smaller system that you have to solve to finish computing the step direction.
And right now this is -- at least for per-user alignment, this is only a 32-by-32 system. So
it doesn't take a lot of computation to solve it. But if you're going to extend this to scale
it up to a whole lot of users, if you're planning on having a thousand users or more, you're
going to have to start including a whole lot more users in the set of users that you keep
around for the final recognition step.
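Here is a minimal numpy sketch of that expensive product, A transpose times diag(d) times another matrix, computed while storing only the diagonal d, followed by the small solve that finishes the step direction. The sizes, the choice of A itself as the second matrix, and the right-hand side are placeholders for illustration.

    import numpy as np

    rng = np.random.default_rng(3)
    n_pixels, n_cols = 4800, 32           # tall A: e.g. 60*80 pixels by 32 columns
    A = rng.standard_normal((n_pixels, n_cols))
    d = rng.random(n_pixels)              # interior-point weights, one per pixel
    rhs = rng.standard_normal(n_cols)     # placeholder right-hand side

    # A.T @ diag(d) @ A without ever forming the n_pixels-by-n_pixels diagonal:
    # scale the rows of A by d, then multiply by A.T.
    H = A.T @ (d[:, None] * A)            # small (32, 32) result

    # The small linear system that finishes computing the step direction.
    step = np.linalg.solve(H, rhs)
    print(step.shape)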
>>: Is single enough, or do you need double? Or how much precision do you need?
>> Andrew Wagner: Our interior point method implemented on the CPU uses double precision. But we think we can get away with single precision. So numerical precision actually is an issue in this, and it's one of the things that makes the computational problem not something that you can just use off-the-shelf algorithms for.
In particular, the largest matrix, A, is images that came off of a camera, so it only has 8-bit precision to start with. So some of the optimizations that we're looking at are keeping -- on the GPU -- A in its original resolution and then just keeping around scale factors for how each column or row should be scaled. Because normalization is something that you also have to take care of. I've skipped that because it's an implementation detail.
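A minimal sketch of that storage idea, assuming the goal is simply to keep A as raw 8-bit camera data plus one scale factor per column, so the normalized matrix never has to be materialized in floating point; where exactly the scaling is folded in is my assumption.

    import numpy as np

    rng = np.random.default_rng(4)
    A_u8 = rng.integers(0, 256, size=(4800, 760), dtype=np.uint8)   # raw camera pixels

    # One scale factor per column so that (A_u8 * col_scale) has unit-norm columns.
    col_scale = 1.0 / np.linalg.norm(A_u8.astype(np.float64), axis=0)

    def normalized_matvec(x):
        # Equivalent to (A_u8 * col_scale) @ x, but A stays in 8 bits:
        # fold the per-column scales into x before the multiply.
        return A_u8 @ (col_scale * x)

    x = rng.standard_normal(760)
    exact = (A_u8 * col_scale) @ x        # reference with the scaled matrix materialized
    print(np.allclose(normalized_matvec(x), exact))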
Okay. And, again, just thanking my lab mates. What questions do you guys have? Any
more? Okay. Thanks a lot.
[applause]