>>: All right. If everybody will take your seats, we can get going again. So just one
announcement. The reception this evening, I have maps up here. It's very close to where we are
now. It's literally I'd say a quarter of a mile from where we are. So it's easy to walk. It will take
five minutes to walk from here.
There are maps up there. The basic trick is just walk up on one side of the parking garage or the
other, the shorter way is to walk on that side of the parking garage to the next street, turn left, and
walk up the street to the stoplight, cross the street. You'll see it in front of you.
Microsoft has a big new complex of buildings. There's a soccer field there. There's a bunch of
places that look like a store. The restaurant there is called Spitfire. That's where we'll be at 6:00.
And everybody's obviously invited. If you need a map, they're up front. I have plenty of copies.
So all right. Back to the presentations. So there we go.
>>: So my talk is going to be about work that's going on at Berkeley and at the UPCRC about
applying image recognition techniques to image retrieval.
And so we can start the talk, I guess, while this is getting settled down. But I think
fundamentally the problem we're trying to solve is we're trying to make digital media actually
worth something to people. And there's this needle in the haystack problem, when you start
getting a lot of digital media in your setup at home that you can't actually find it. So you can't
actually use it.
I like to say that the incremental value of a new piece of data is sometimes negative. Right?
Because the more data you accumulate, the less you can actually use it. So that's a problem
that we need to fix if we're going to keep moving forward.
So the problem that we're looking at is trying to find images from consumer databases. So these
are all pictures of my family and I would like to find one that I'm interested in. How do I do that?
So I'd like to talk a little bit about this spectrum that I see between image recognition techniques
and retrieval. So there's already a lot of solutions out there for doing image search of various
kinds.
At the one end where we're talking about things that have traditionally been called image
retrieval, we have very lightweight search problems where we're looking at metadata search over
tagged images.
If I'm looking for a picture of Jim in my database it looks for all pictures that have a tag with Jim in
it and it pulls them up.
There's a spectrum that goes all the way down to image recognition techniques, where we go from
just looking at textual context and image tags to looking at coarse features on images, like give me
the histogram of this image. If I search for an image that looks like a mountain, well, a lot of
mountains have similar color profiles. So that might work. Of course that doesn't work on a lot
of problems. So we continually add finer and finer features until we get to the level of actually
trying to understand the semantics of what's going on in the image, and as we do this we go from
computations per image ranging from milliseconds all the way to several minutes.
And so the goal here in this project is to show how parallelism can sort of shift the kinds of things
that we do in image retrieval, and bring things that traditionally have been far too expensive to
apply to image search into the domain where we can actually think about that.
So there's a lot of interesting computationally intensive algorithms that people use in computer
vision, and we saw some great talks this morning that were going over some computer vision
problems.
The ones that I'm going to talk about today that we've been working on are really high quality
image contour detection and some machine learning techniques for classifying images. Namely
support vector machines.
So we see things like, for example, the highest quality image contour detector that's known today
takes minutes per image. With parallelism and rethinking the algorithms behind these, we take it
down to seconds.
So we're trying to make it so that we can apply techniques which were far too expensive to even
consider for search applications and bring them into search.
And so this is kind of the outline of what a lot of retrieval applications look like. We start out with
a set of images. If we have a database, presumably we're incrementally updating this database.
But as we get new images we perform feature extraction on them.
Then the user is going to give us a query. They're going to ask a question to this database. And
that is going to be used along with the features that we have to train a classifier that differentiates
between images that are interesting or potentially interesting to the user and those which are not.
And then that classifier is going to be applied to the database and things are going to be returned
to the user and iterate from there.
So first I'm going to talk about this image contour detection problem we've been working on which
falls under feature extraction. So one of the fundamental steps that we have to do to understand
objects in an image is actually figure out what they are. So image segmentation is a very
fundamental problem in computer vision, and it's closely allied with image contour detection.
So once you have a set of good contours from an image, it's fairly straightforward to go to a good
segmentation of that image.
What I mean by that is the work to find good contours can be on the order of several minutes, and
the work to actually go from contours to segments can be on the order of one second.
So these are closely related problems, and the idea is once we have segments of an image, we
can then extract features from each of the segments and classify the segments and expose them
for searching.
The problem is that image segmentation is a hard problem. And high quality segmentations can
take minutes per image. So let's talk a little bit about the image contour detector that we're
working with.
So firstly it's good to point out that like most computer vision problems, this is subjective. The
actual contours in an image depend on your perspective and maybe even your mood.
For example, if I showed you a picture of a koala and asked you to tell me what the boundaries of
objects in the scene are, some people might say that the koala's arm is a separate object from
the koala. And they would be right, I mean depending on their perspective.
I mean, sometimes you want to differentiate between the arm of the koala and other times you
want to say no the koala is just one object. So it's actually a very subjective problem. And the
surprising thing is that humans actually agree pretty well on this problem. And so [inaudible], a
computer vision professor we're collaborating with at Berkeley, got together a group of
undergrads and had them label a set of several hundred images, find all the boundaries, and do
multiple labelings of each of the images, and then there's some sort of correspondence problem
to figure out which labeled contours actually match up.
And it turns out that people actually agree sort of on what the important contours of the image
are. As you can see in this middle image right here, we've got overlays of a lot of humans'
interpretation of the koala. You can see there's only one person who decided the arm was a
separate object from the rest of the koala but it's still an important piece of information.
On the right here are contours that are generated by the algorithm that we're talking about. And
so when we talk about accuracy of image contour detection, we're going to be basing it off of the
ground truth on this test set of images that people have hand labeled. Our goal is to find
objects that people find interesting.
And so speaking to that, here's a precision-recall curve. That's related to the ROC curves that
people were showing earlier. So for any of these detectors there's a straightforward thing to do.
I could say, if I was really concerned about never wrongly labeling a pixel as an edge, that this
image has no edges in it. And I would be right in the sense that I would never label a pixel as an
edge when it's not one.
However, I wouldn't find any of the pixels which are edge pixels. So that would be over in this
regime, where we're high precision. We're never claiming that a pixel's an edge when it's not.
But low recall in the sense that we didn't actually find any edges.
Conversely, we can label every single pixel in the image as an edge and we will for sure not miss
any edge pixels but it won't be very useful.
So there's always a trade-off in these kinds of algorithms between precision and recall and so the
goal is to be as far up toward the top and to the right as possible. Sorry. So this algorithm that
we're working with, called the global probability of boundary algorithm, was published last summer.
It's currently the most accurate image contour detector known. This graph is showing something
like the past decade of computer vision research. They went through and found the papers, applied
the techniques from those papers to their database, and showed that, yes, they actually are improving
the quality of contours, and we've gotten to this point, this red point.
If you boil down the curve into a single number that balances precision and recall, called the
F metric, then humans are at 0.79 and this algorithm gets 0.70. Something like the
Sobel edge detector or the Canny edge detector would be somewhere around 0.5.
So the problem with this algorithm, even though it's getting pretty good and it's finding really
useful boundaries, is that it's very computationally intensive. It runs at about 3.7 minutes per small
image, and that's a 0.15-megapixel image. That limits its applicability. If you tried to index all the
images on the web, by the time you actually went through, even applying a huge datacenter to
the problem, there would be so many more images that you wouldn't be able to keep up.
So this is a really computationally intensive problem. And so we've been looking at parallel
algorithms and implementations for it. And we have brought down the time for image
contour detection from about four minutes to two seconds using very parallel algorithms on Nvidia
graphics processors, and I'll talk about that. What I want you to get from this slide is that there are
a lot of pieces to it. There are things like machine learning algorithms, like k-means, where we're
doing clustering on the image.
There are traditional image processing things like convolving with filter banks, and computer vision
things that are used in other approaches, such as non-max suppression and intervening contour cues.
We have a big generalized eigensolver, which was the most dominant part of the computational
problem, and I'll talk about it in a second. And there's other image processing like skeletonization.
There's a lot of things going on here and we've worked on all of them.
Just to sort of give you a flavor of the kind of computations we're looking at: k-means is a
clustering problem where you start out with unlabeled data and you need to cluster that data
into k clusters. The way that's done is basically the guess-and-check method.
You start out by assigning random labels to all of your points. And then you figure out
where the centroids of all the clusters implied by those labels are.
And then you relabel the points based upon which centroid is closest to each point and iterate
and hope you converge somewhere.
And this is done on the image to find important textures. So the image is convolved with a set of
filters, which gives a feature vector for every pixel about 30 elements long, and then we cluster
those into, say, 32 or 64 different clusters. That gives us a way of ignoring some of the noise in
the image while still finding similarity between pixels.
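As a rough illustration of the clustering step just described, here is a minimal k-means sketch in Python; the feature dimensions, cluster count, and random data are stand-ins, not the pipeline's actual parameters or code:

```python
import numpy as np

def kmeans(features, k, iters=20, seed=0):
    """features: (n_pixels, n_dims) array. Returns labels and centroids."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    labels = rng.integers(0, k, size=n)          # start with random labels
    for _ in range(iters):
        # centroid of each cluster implied by the current labels
        centroids = np.stack([
            features[labels == c].mean(axis=0) if np.any(labels == c)
            else features[rng.integers(0, n)]     # re-seed an empty cluster
            for c in range(k)
        ])
        # relabel each point by its nearest centroid, then iterate
        d = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):    # converged
            break
        labels = new_labels
    return labels, centroids

# e.g. 30-dimensional filter responses for every pixel, clustered into 32 "texture" classes
pixels = np.random.rand(10000, 30)
labels, centers = kmeans(pixels, k=32)
```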
So another thing that's going on here is gradient computation. We're looking for edges, and this is
kind of a way of suppressing noise when we're looking for edges. Imagine two half-disks
centered at every pixel, each oriented at some orientation and with some radius.
If we sum up all of the responses in each half of the disk and compare how different those sums
are, then we get a metric of how strong an edge we have at that orientation, at that radius, at that
pixel.
So we looked at this, the original implementation actually went through all of the work of summing
up pixels over each of these half disks. And we decided there was a better way. So we changed
the algorithm to use integral images. Integral images are basically a way of using parallel prefix
sums, which have been around for a long time to do these sorts of computations where we're
doing sums of overlapping windows on images.
The way it works you basically take a window of -- you basically do a computation where the
value at each point is equal to the sum of all of the elements to its top left. So, for example, this
box, if I do the integral image, then it's going to turn into a single pixel right corresponding -- let's
see where is it? -- corresponding about here in the image. That's going to be the sum of all
these things in here. And this allows us to avoid repeated overlapping summations.
That brings down the computational complexity and really helps out. It's also good for
parallelism, because we've replaced a lot of histogramming with scans, which have much
better data dependencies.
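As a rough illustration of the integral-image trick (a minimal NumPy sketch, not the actual GPU code; the window coordinates are arbitrary examples):

```python
import numpy as np

def integral_image(img):
    """Each output entry is the sum of all input pixels above and to its left
    (inclusive), computable with two passes of parallel prefix sums."""
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in O(1) via four lookups in the integral image."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0: total -= ii[r0 - 1, c1 - 1]
    if c0 > 0: total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0: total += ii[r0 - 1, c0 - 1]
    return total

img = np.random.rand(480, 640)
ii = integral_image(img)
# sum over an arbitrary window without re-adding overlapping pixels
print(np.isclose(box_sum(ii, 10, 20, 60, 90), img[10:60, 20:90].sum()))  # True
```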
So the next thing that was really important computationally was this eigensolver. This is a spectral
graph partitioning approach. So we had some talks this morning about graph partitioning, and it's
a very useful problem. The thing is, there are a lot of different metrics you can apply to graph
partitioning. And the min-cut metric, which is the common one used in a lot of graph
partitioning approaches, is bad for image segmentation and image contour detection because it
tends to cut off lots of tiny little pieces everywhere.
So min cut says: break the graph in the way that cuts the fewest edges in the graph.
But what it does is tend to isolate things that don't have a lot of edges connecting them to the rest
of the graph. If you change the metric from a min-cut metric to a normalized-cut metric, where we're
looking to minimize the number of edges that we cut, normalized by the sum of all of the
edges leaving the subgraph that we're cutting, that problem actually is much better for
image segmentation.
However it's NP hard. So instead of solving it directly we approximate it with an eigen solver. So
it's kind of interesting to show that these problems are related. And we can apply an eigen solver
to this problem.
So the other way to think about this is that it allows us to use global information about the
things in the image to find boundaries. The gradients I was just talking about are a very
local operation, where for every pixel we're looking at a radius, trying to see if things are different
enough to say there's an edge at that pixel.
Instead of doing that, this is more of a way of globally understanding the image. So every
pixel is related only to a few pixels in the surrounding neighborhood, and we turn those relations
into a sparse matrix that describes how each pixel is related to some of its neighbors. When we
find the eigenvectors of this matrix, they turn out to correspond to regions that should group
together. And here's an example of what it looks like. So here's a picture of a guy in a hula skirt
and one of the eigenvectors that comes out of this.
and the eigenvector, one of the eigenvectors that come out of this.
One thing to notice is that the eigenvector does a good job finding important boundaries in the
image. There's a lot of texture in here. There's a lot of confusion in the foliage back here of
things that might be edges. But it's hard to say.
And the eigenvector that comes out of this formulation is actually really nice. So the eigensolver
problem itself is actually very interesting computationally. In the Par Lab we happen to have a
number of experts on sparse eigensystems, Jim Demmel being one of them, and we went to Jim
and said, we have this eigenproblem, how do we solve it? He helped us do some algorithmic
exploration, and we found that a Lanczos algorithm with a not commonly used test
called the Cullum-Willoughby test ended up being the best choice for this problem.
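As a rough sketch of the spectral step, here is a small Python example using SciPy's ARPACK-based sparse eigensolver (a Lanczos-type method); the toy affinity matrix is a stand-in for the real pixel-neighborhood graph, and none of the Cullum-Willoughby-style Lanczos tuning described above is reproduced here:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def spectral_embedding(W, k=4):
    """W: sparse, symmetric, nonnegative affinity matrix (pixel-to-neighbor
    similarities). The normalized-cut relaxation boils down to the top
    eigenvectors of D^-1/2 W D^-1/2, equivalently the smallest eigenvectors
    of the normalized graph Laplacian. eigsh calls ARPACK, an implicitly
    restarted Lanczos method."""
    d = np.asarray(W.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    M = d_inv_sqrt @ W @ d_inv_sqrt
    vals, vecs = eigsh(M, k=k + 1, which='LA')  # eigenvalues return in ascending order
    return vecs[:, :k]                          # drop the top, roughly constant, eigenvector

# toy stand-in for the sparse pixel-neighborhood graph built from the image
n = 500
A = sp.random(n, n, density=0.02, random_state=0)
W = ((A + A.T) * 0.5).tocsr()
emb = spectral_embedding(W, k=4)   # one "grouping" vector per column
print(emb.shape)                   # (500, 4)
```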
These are the kind of results we're seeing. So like I said earlier, the original code was running at
about 3.7 minutes. We took it and parallelized it over a dual-socket quad-core system using
Pthreads and got it down to about 30 seconds. And our GPU implementation takes two seconds.
So we can say that there's a lot of good parallelism there. It's interesting to see how Amdahl's law
bites us in one form or another on these different platforms.
Of course a lot of the reduction in the eigensolver is because we changed the algorithm and went
to a more efficient one for this problem. But these kinds of results are exciting because this
particular image contour detector is very powerful, but it's not very widely used because it's so
computationally intensive. We believe with these kind of results people can start applying this in
a lot of situations where they couldn't before.
Just some other data points about our GPU implementation. We actually ran it on a number of
different GPUs. This is the same binary, just running it on a different number of cores, all the way
from two cores in the integrated graphics that are now in Apple's laptops, all the way up to the
GTX 280, which is the biggest GPU Nvidia sells right now.
And things scale pretty well. You'll notice that at 30 cores we have two data points. Somehow I
dropped a label, but the lower one is the Tesla board, which has lower memory bandwidth. So that
kind of measures the memory bandwidth dependence here.
And also we can check scaling behavior in terms of image size to see what kinds of images we
can handle. It turns out that on the Tesla board, with four gigabytes of memory, we're limited to
4.8-megapixel images, because there's a lot of data generated in this process.
But we can see that the runtime scales fairly well in the number of pixels. And most important is
accuracy. If you're speeding something up but you're throwing accuracy out the window, it's not
much of a help.
On this benchmark comparing against ground truth for all of these hand contoured images, we
achieve the exact same accuracy as the serial version. So kind of summarizing for this segment
of my talk, I think that parallelism really is able -- we really can use parallelism to practically take
things that were much too computationally intensive to be widely applied and bring them into
domains they wouldn't have been able to address previously.
And so I'm pretty excited about that. So the next thing I want to talk about is classification. So
this is the process of analyzing images. And classifying them as interesting or not.
Or they can be used for other recognition purposes as well. So we have spent some time looking at
support vector machines, a widely used technique for classification. In the content-based image
retrieval context, using a support vector machine means finding a decision surface that
separates image classes, and these classes are going to be defined by the search query itself.
So the goal of the classifier is to maximize the margin between the classes, which gives us a
pretty general classifier that hopefully is resistant to noise. And so training an SVM is the
process of taking a set of training images for which you know the truth and learning the correct
classifier from them. It's a quadratic programming optimization problem where the
number of variables is equal to the number of training images that we have ground truth labels
for, and the goal is to find a weight for each of those images.
And the great thing about support vector machines is that they can be nonlinear. So you can find
a linear classification surface, which is basically just a hyperplane in some feature space
separating positive examples from negative examples, or you can use a kernel, which allows you
to get rather nonlinear classifiers. But because of the formulation of the SVM you avoid overfitting
to your data, which is the common problem with nonlinear classification approaches.
So to do this we are actually using the sequential minimal optimization algorithm which was
invented at Microsoft Research by John Platt. I guess he's probably not here today. I was
hoping to say hi.
But it's a great algorithm, and it's widely used in support vector machine training. What it does is:
you may have several thousand up to hundreds of thousands of training points where you know the
truth, and the question is, how do I find a set of weights for all of these? The hardest possible
thing would be to take some guess about the weights and update that entire
vector of weights every time you make a step. That's a full traditional quadratic programming
method. Sequential minimal optimization goes to the other extreme of only updating two of the
weights at a time.
So out of your 10,000 weights you're only going to look at two of them. In that case the quadratic
programming problem turns into a trivial one dimensional problem that we could have solved
back in high school.
Basically we have some constraints that describe a box in two dimensions, and then we have
another constraint that describes a line in two dimensions, so we're going to be maximizing some
quadratic function over a line, and that's pretty easy.
So what this means is that the actual optimization steps of updating the vector are trivial. The
hard part is figuring out which two weights to update. So the actual work in the algorithm is
computing some [inaudible] optimality conditions, which is done for every training point, and you
reduce over all of those to figure out which points you're going to be updating in the next step of
your algorithm, and you just keep iterating.
So we implemented this also on Nvidia graphics processors, and using the Gaussian kernel, which
is fairly widely used, we saw nine to about 35 times speedup over LIBSVM, which is the standard
package for doing support vector machine training on CPUs.
And so we were pretty excited about that. We published a paper on it and put our software out
on the Internet, and we think we have around 250 downloads of it as of today, which makes me
pretty happy. So people are actually using it, and the bad thing about that is that I get bug reports.
But --
>>: [inaudible].
>>: Yeah, I guess that's right. I don't actually have to fix them. They just need to tell me about
them.
>>: [inaudible].
>>: Right. So the other side of using a classifier, beyond the training I just talked about where
you're trying to find a decision surface, is actually applying that decision surface to a database.
And for support vector machines that involves evaluating an equation, and it ends up looking like
a big matrix multiply plus some other stuff. So we also implemented that and got fairly good results.
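As a rough sketch of what that classification step looks like with a Gaussian kernel, written so the dominant cost is a matrix multiply (an illustration with made-up sizes, not the actual implementation):

```python
import numpy as np

def svm_classify(support_vectors, sv_labels, sv_alphas, b, gamma, database):
    """Evaluate f(x) = sum_i alpha_i y_i K(sv_i, x) + b for every feature vector
    in `database`. With the Gaussian kernel this boils down to a big pairwise
    squared-distance computation expressed with matrix multiplies."""
    sq_sv = (support_vectors ** 2).sum(axis=1)[:, None]     # (n_sv, 1)
    sq_db = (database ** 2).sum(axis=1)[None, :]            # (1, n_db)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b  (the a.b term is the matrix multiply)
    sq_dist = sq_sv + sq_db - 2.0 * support_vectors @ database.T
    K = np.exp(-gamma * np.maximum(sq_dist, 0.0))            # (n_sv, n_db)
    scores = (sv_alphas * sv_labels) @ K + b
    return scores        # > 0 means "interesting"; rank the database by score

# toy usage: 200 support vectors against a database of 10,000 image feature vectors
rng = np.random.default_rng(0)
scores = svm_classify(rng.standard_normal((200, 128)), rng.choice([-1.0, 1.0], 200),
                      rng.random(200), b=0.1, gamma=0.05,
                      database=rng.standard_normal((10000, 128)))
```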
We had both a Pthreads version on the CPU, on a two-way Core 2 Duo, as well as a GPU
version using Nvidia's graphics processor from 2006. So these results are kind of dated, but
overall our classification results were identical to the serial version. When you took the classifier
we trained with our own training method and applied it using our own classification method, they
produced the same results. And I also want to talk briefly about another project that we've got
going on, on kernel dimensionality reduction, which is another technique for performing
classification.
In this technique, what we're trying to do is take a high dimensional dataset like, for example,
images with features on them and we want to take it from something with really high dimension
down to something with much lower dimension going from 600 dimensions to 20 dimensions.
In so doing, the classification problem becomes a lot easier. So I think this picture sums
it up really well. This is from a 13-dimensional real-world dataset describing wine.
It measured wine in 13 different ways, and then the classification is whether it's any good, or
something like that.
And if you take that real-world dataset and use kernel dimensionality reduction, the three different
classes that we're trying to distinguish boil down into three nearly disjoint regions in the two-
dimensional space. Whereas other techniques for doing dimensionality reduction tend to leave
the data on top of each other, which makes it harder to classify.
So Mark Murphy in my group at Berkeley is working on this. It's a nonconvex optimization
problem, so training this thing involves simulated annealing and gradient descent, and to cut to the
chase, he's implemented a portion of it and it's working pretty well.
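The kernel dimensionality reduction described here is a supervised, nonconvex formulation; purely as a simpler stand-in that shows the general flavor of kernel-based projection down to a couple of dimensions, here is scikit-learn's kernel PCA on a 13-dimensional wine dataset. This is not the algorithm Mark is implementing.

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import KernelPCA

# a 13-dimensional wine dataset with 3 classes, similar in shape to the one described
X, y = load_wine(return_X_y=True)            # 178 samples, 13 measurements each
X = StandardScaler().fit_transform(X)        # put the 13 measurements on a common scale

# project from 13 dimensions down to 2 with an RBF kernel
X_2d = KernelPCA(n_components=2, kernel='rbf', gamma=0.05).fit_transform(X)
print(X_2d.shape)                            # (178, 2): points that are much easier to classify
```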
He's seeing fairly good speed up. So again parallelism will work here. So I think that the
challenge of all of this work is putting it together into a functional CBIR system that lets people
search a database and actually returns what they want to see. And we're working on that.
We haven't actually done all of that yet. We've been working on these components instead. But
we believe that using more sophisticated feature extraction techniques like the contour detection I
showed you as well as more sophisticated classification techniques will lead to improved
performance. And that will let people do things they couldn't do before. And that's what we're
trying to demonstrate in our work.
So that's it.
>>: So a question here: when the user provides the query to the system, do you solve the
problem of open set? Maybe the query simply isn't labeled in the set. What do you do?
>>: Right. So the way that we are looking at that is so the user is going to provide some images
that they're interested in seeing.
>>: [inaudible].
>>: A couple of them.
>>: Figuring out what they want, koala.
>>: Right. This is content-based. So the user says I want an image that looks like this one.
>>: So rather than user recognition to label whatever.
>>: Right. No, this confusion is natural, because they're related problems. If I were able to go
through my database and find all of the things that had a koala in them, and then I was given an
image that had a koala in it and had to figure that out, it would make searching a lot better. That
is essentially performing this recognition problem.
So they are related problems. And we're coming up with a sort of hybrid solution that takes
advantage of previous queries, things it has learned about the database during previous
queries, to make the queries coming from the user stronger.
>>: So you have to find some similar kind of thing.
>>: That's why we're -- that's why we're working on exactly how to do that. But, yes, there are
CBIR systems that do things both ways and we're trying to come up with something that makes
sense.
>>: Question, how fast does it have to be for it to be a product in the sense that somebody would
take it and market it. How do you know when it's best?
>>: Right. So I think there's two major components of these systems. One is the feature
extraction stuff, which is kind of done offline. It's done when the images enter the database. And
hopefully that needs to be fast enough to keep up with the stream of images that are entering the
database. And then the second part is the actual online part when the user's entering a query,
and that needs to be user tolerable.
So what does the user tolerate for a search, I think a couple of seconds. That needs to be really
quick.
>>: 300 milliseconds.
>>: There you go 300 milliseconds from Jim Lair.
>>: Let's thank our speaker.
[applause]
>>: Hello. Am I on? So thanks for fitting me in. Rick talked this morning about stuff we can do
with large collections of images. We all know one very easy way to get a large collection of
images is to record some video. I'll talk about a bunch of work we've done in the area of video.
And my goal with all these explorations, or my dream, is to take effects that Hollywood people
make today and ensure that my number one customer, the only person I know who actually does
anything with video, which is my mother, can actually play around with them.
One of the things -- so this is my mom's laptop. She likes it very much. And I guess one of the
funny things while we've spent all this time saying parallel programming is just around the corner,
does anyone know how many cores are in this thing?
>>: Four.
>>: Is it only four?
>>: Two or four.
>>: In the GPU.
>>: So you have to be careful. I'm using the Intel definition for core because we have Intel people
in the audience. The Intel definition of a core is slightly different. They would say 16 or 32.
>>: I see.
>>: Improve the core efficiency.
>>: The main thing is that the programming model is parallel. So we've been saying it's in the
future. Well, it's already in the past in some sense, because people coding for this thing are
coding for a machine that has, in the programmer's mind, many more cores than are in it. I know
you didn't like the picture. I gave you a picture that's much more likely to be the kind of thing you
like the look of.
So what are we going to do? Well, here's a classic: here's a piece of video of my office. And I've
performed a traditional video edit on it, in which I've overlaid a logo. And I've used some extra
special graphics to make it nicely 3-D. But it's not really 3-D. It's just a 2-D overlay on a 2-D
video. What we might want is something a bit more like this video here. The 3-D object I've
embedded is living more in the world of the video rather than superimposed on it. So I can do
this computation.
But effectively the key to doing this, and we can see the logo's beginning to appear again, the
key to doing this is to run Photosynth on the several hundred frames of video to generate 3-D
cameras. So the same 3-D camera positions you saw before now form a smooth trajectory, and you
render the 3-D object with that. We've worked on ways to make it easy to make animations like this
one. So you thought the blink tag was bad when it came out in '92, you thought WordArt was
bad; well, this is the future.
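As a rough sketch of the rendering step once per-frame cameras have been recovered (the generic pinhole camera parameterization here is an assumption, not the actual bundler output format):

```python
import numpy as np

def project(points_3d, K, R, t):
    """Project 3-D points into one frame given intrinsics K and pose (R, t),
    i.e. x ~ K [R | t] X. Returns 2-D pixel coordinates."""
    cam = R @ points_3d.T + t[:, None]        # world -> camera coordinates
    pix = K @ cam
    return (pix[:2] / pix[2]).T               # perspective divide

# toy usage: a logo placed in the scene, projected into each frame with the
# smoothed camera trajectory recovered by structure-from-motion
logo_vertices = np.array([[0, 0, 5.0], [1, 0, 5.0], [1, 1, 5.0], [0, 1, 5.0]])
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
cameras = [(np.eye(3), np.zeros(3))]          # one (R, t) per video frame
for R, t in cameras:
    pts2d = project(logo_vertices, K, R, t)   # overlay these onto the frame
```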
So again these sort of edits we know how to make the user interface very simple, but it actually
requires a massive amount of compute to preprocess the video in order that these edits are easy
to render.
There are still some problems with that. So, for example, this video, I'm very proud to say,
made Bill Gates laugh. At that point. At that point there was a sound from the man opposite me.
So a great thing is happening here. One part of your brain, which does geometry and stuff, is
really happy that this is probably on the whiteboard; that's where you think it might be. Another
part, your object recognition system, which is looking at the relative depth ordering of objects,
suddenly gets nastily broken when the TechFest logo goes into the background. That's because despite
all my protestations about a 2-D overlay on another 2-D thing turning it into 3-D, this rendering has
absolutely no knowledge of the dense structure of the scene.
Having spent lots of time computing the camera positions we're now in a position to spend more
time computing a representation of the scene which looks a bit like this. Now, this is pretty
rubbish, this is fast rubbish depth data where bright points are far away and dark points are near
to the camera. So it's encoding distance.
We obtained this representation by looking at the images themselves, without using a depth
camera. Because the scene is rigid, I can use the frame-to-frame transformations: I can pretend
a pair of successive frames is a stereo view and recover a rubbish depth map like this, but it is
nevertheless enough to generate the clean occlusions that I get in this
sequence. We have the same thing again.
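As a rough illustration of that pretend-stereo idea using OpenCV's block matcher; in practice the successive frames would first have to be rectified using the recovered camera poses, and the filenames here are hypothetical:

```python
import cv2

# two consecutive (rectified) frames of a rigid scene, treated as a stereo pair
left = cv2.imread('frame_0100.png', cv2.IMREAD_GRAYSCALE)    # hypothetical filenames
right = cv2.imread('frame_0101.png', cv2.IMREAD_GRAYSCALE)

# coarse, fast block matching: good enough for a "rubbish" depth map
matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = matcher.compute(left, right).astype('float32') / 16.0  # fixed-point output

# larger disparity = nearer the camera; use it to decide occlusion of the 3-D overlay
cv2.imwrite('rough_depth.png', cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX))
```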
But now it's less disturbing when the video passes the other way. You'll notice it's still not perfect.
A bit of boiling around here and so on. But certainly for home video use or at least in the near
term you'd expect that this would be, this is not as disturbing an output as we had the first time.
By the way, the reason I can use the rubbish depth samples is because I have access to an
algorithm by Antonio [inaudible] and Tony Sharp at Cambridge which they called geodesic
segmentation. But this algorithm is probably not interesting for this workshop because it's
extremely fast.
So you can process a large CT volume and generate these segmentations completely
interactively. So we'll just skate over that.
Okay. So I said that one of the reasons I could recover depth from that sequence was because it
was a rigid scene. You'll notice there weren't any researchers running backwards and forwards in
front of the whiteboard. If we were studying tracking there would have been lots of researchers
but no camera. I've always wanted to process scenes a bit more like this one. So you have an
object like this giraffe which is undulating and articulating as it moves through the scene. We
have a pair of giraffes, one occluding the other. The camera's panning.
I want to know how to deal with nonrigid motion and what the representations are for doing that.
So one of the things, very simple thing that you might want to do if you've got a video, is attach an
overlay but have the overlay look like it's following an object in the scene.
So you know, and clearly the only application any of us can ever think of for anything is
advertising. So let's assume that's what we want to do here. Okay. Now that, come on that's got
to be simple. What am I going to do, I'll click on some point in the giraffe and follow it through
time, moving the object. That's got to be easy.
And indeed it's not bad. So here's the interface, written in MATLAB, which, for anyone who knows,
means it's extremely slow. The user has clicked on the giraffe's eye, and the cyan trajectory
represents a search throughout the video, 200 frames of video. It searched for points matching the
giraffe's eye at 200 to 400 frames a second, and what the user is doing is scrubbing backwards
and forwards through the video and checking that the trajectory is correct, even though the back
giraffe's eye was occluded during the transformation.
So we're kind of looking at the position of that point throughout the video overlaid on one frame.
And you can see that basically at this stage, good check there, the other giraffe's nose came in
front. So the user is happy that he's correctly tracked the video. And in this case it was two
clicks. We have other examples where a 20-minute video was processed in three minutes
including the user interaction.
So this eight-times-real-time figure includes the user, starting from the point before the user
touches the mouse.
The key is that you need to store 24 bytes per pixel to store a KD-tree for every frame. In order
to do this there's a massive amount of backing store. So we precompute on the video and then
we can get at it fast. That's something that we want to deal with.
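As a rough sketch of the precompute-then-query idea: a toy Python example that builds a KD-tree of per-pixel patch descriptors for every frame and then answers a click by nearest-neighbor lookup. The descriptor and sizes are stand-ins, not the real system's 24-byte-per-pixel structure.

```python
import numpy as np
from scipy.spatial import cKDTree

def patch_descriptors(gray, size=3):
    """A toy per-pixel descriptor: the flattened (2*size+1)^2 neighborhood."""
    h, w = gray.shape
    pad = np.pad(gray, size, mode='edge')
    desc = np.stack([pad[dy:dy + h, dx:dx + w]
                     for dy in range(2 * size + 1)
                     for dx in range(2 * size + 1)], axis=-1)
    return desc.reshape(h * w, -1).astype(np.float32)

# offline precompute: one KD-tree per frame (this is the massive backing store)
frames = [np.random.rand(120, 160) for _ in range(50)]        # stand-in video
trees = [cKDTree(patch_descriptors(f)) for f in frames]

# interactive query: the user clicks a pixel in frame 0; look it up in every frame
h, w = frames[0].shape
click_r, click_c = 60, 80
query = patch_descriptors(frames[0])[click_r * w + click_c]
track = []
for tree in trees:
    _, idx = tree.query(query)                                # nearest-neighbor match
    track.append((idx // w, idx % w))                         # (row, col) per frame
```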
Okay. Back to this guy. Well, can you see what I've changed? Oh look a new house has
appeared. That's an easy edit. We could do that in the '90s. What's the representation that
allows us to do that edit?
This is a tricky edit. I'm going to put something on the giraffe's neck, and it should follow the
undulations correctly. We couldn't do it in the '90s; we can do this now, and we're happy with that.
What's the representation that allows me to do that? Well, again, I load up the video. I crunch on
it for several days, and after several days of crunching I end up with a representation of the video
that looks a bit like this. I split it into two static photos, a foreground layer and a background layer.
This giraffe, the one in the background, didn't move enough, so he just got locked into the
background, and when his neck lifted, that didn't get noticed.
So I crunch on the video. I split it into these layers and then I can do something like attach
houses in normal photo editing to this layer and generate the image we just saw.
Okay. So there's the background layer. I just use whatever photo editing technique I like to put
in the houses and then we rerender. We get the sequence we just saw.
And then finally this allows us to deal with more complicated structures. So this guy's face if you
watch his mustache you'll see it appears slowly and flashes on and off. The same sort of
technique. A 3-D object, a nonrigid 3-D object. What Photosynth would do, the Photosynth
philosophy, which I adhere to, is you would take this object and reconstruct a three-dimensional
interpretation of what's happening. Well, what we decided was we would reconstruct a two-
dimensional interpretation of what was happening. Instead of reconstructing into 3-D we did it
into 2-D, pasting all the new information from every frame of video into this representation,
which is called a mosaic.
Now you can draw on the representation itself. So we're drawing, on sort of this one picture, the
eyebrows and so on, and we rerender those, mix them in with the original video, and generate this
effect, which is reasonably convincing, and the mustache disappears correctly as it's occluded by
the subject's face.
Okay. So that's some stuff you can do with video. As we know, it's a splendid source of
embarrassing parallelism, if you can get the video to the cores. So we want to do lots of
computation on a big chunk of video. For many of these computations you'd like to split it up by
frames, but sometimes each core would like to see all the video at once.
But given that you've got the video down at the cores, we can do lots of stuff. Most of the stuff I
want to do turns a 20-meg video into half a terabyte of data. So actually a lot of it is disk
bound rather than RAM bound or CPU bound. But things are getting close.
So that thing happens. And then, as in Photosynth, you get some sort of N-squared gather: do
that on the frames, haul that information back, not a lot of data coming back, let's say 20K of data
per frame, but then you do some horrible gather of size N squared or N cubed.
And that's because of the non-embarrassingly-parallel sequential algorithms that we love in video.
You might think, video: do stuff in every frame, throw away the old answers. Bad idea. If you've
got 99 percent reliability frame to frame, and your video's a thousand frames long, then 0.99 to the
power of a thousand is a small number, about 4 times 10 to the minus 5.
And then of course we've just got the standard stuff. If I want to render this video with 100
different image processing operations per frame, I need to decide how to split that over whatever
cores I have. That's a standard problem. Right. Thank you.
[applause]
>>: Time for a question or two. If there are any questions. Great. Thank you.
>>: There's one over there. Sorry.
>>: So some of this is going to be done very slowly offline, beyond the user's interaction, and then
something fast happens for the user using all that stuff sitting on disk. And so what would be
done, for example, is drawing the mustache after the face had been processed offline.
>>: That's the idea. But something we would love to do: half the time when you run this, it doesn't
get the face exactly the way the user might have believed it should be. One thing: if you're
blinking, it's supposed to make space in this representation for both the open and the closed eye,
so they're both sort of living there. We would love to have user interaction to fix the offline-
computed stuff, and that's just completely inconceivable at the moment, because you would make
an interaction, even a researcher-level interaction, tweak it, run it for another few days, tweak it
again.
So that would be something that would be great to fix.
>>: So you didn't have to make any major changes to get it to work on the person's face.
>>: We didn't use Photosynth. We actually used a different bundler for this. No, sorry, to work on
the face, it can't do it at the moment. Photosynth embeds tracks. You know, imagine if you follow
the eye, you generate a track like the cyan curve that we saw on the giraffe. Photosynth takes
hundreds of thousands of those tracks and embeds them into a 3-D space.
We are taking 100,000 of those tracks and embedding them nonlinearly into a 2-D space, not
close to linearly into a 3-D space. So it's quite a different problem. The mechanics of it are quite
different. It's a multi-dimensional scaling problem, basically.
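As a tiny illustration of the multi-dimensional-scaling flavor of that problem (a toy with scikit-learn on made-up data, nothing like the scale or the solver of the real system):

```python
import numpy as np
from sklearn.manifold import MDS

# stand-in: a few hundred feature tracks, each summarized by some descriptor
rng = np.random.default_rng(0)
tracks = rng.standard_normal((300, 40))

# pairwise dissimilarities between tracks, embedded (nonlinearly) into 2-D
D = np.linalg.norm(tracks[:, None, :] - tracks[None, :, :], axis=-1)
embedding = MDS(n_components=2, dissimilarity='precomputed',
                random_state=0).fit_transform(D)
print(embedding.shape)    # (300, 2): a 2-D, mosaic-like layout of the tracks
```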
>>: So I'm thinking of an app. So there are Avatars in these sort of online interactive world,
would this be usable there.
>>: To recover the texture map.
>>: Get an Avatar to look like you but change running in real time with your interaction.
>>: Yeah, except that even the tracking bit: what you would do is take the texture map and, with
every new frame, rediscover the texture mapping. Our implementation of that is currently like
several minutes per frame.
So, yeah, if you could do that in real time you would have a lot of opportunities. But that's part of
the preprocess, because our output is just the motion vectors.
>>: Thank you.
[applause]
>>: Okay. Amazingly it seemed to like my system instead of giving trouble connecting. My
name is Jim Held, director of [inaudible], which means I'm a cat herder in our central labs at Intel.
We have a program of research that you might characterize as the future of multi-core. It is how
far can we go, how fast, what will we have to do?
It's a large program. Several dozen projects. Couple hundred people at the moment. And
covers everything from software to hardware.
Usually starting from the application level and analysis of applications down. Looking at usage
models as well as the low level primitives and how to best support them in hardware.
So a very comprehensive program. One of the major areas of application is what we call
immersive connected experiences, which I'll define in just a moment. In fact, first I'm going to
introduce the concept as we describe it. It's sort of a combination of social and visual computing.
So it's a nice bridge topic between the two. What are the requirements we see in it, what our
research agenda is. What the various projects are. My goal is to introduce these not to go into
depth on the actual details of the research but vector you to the people who are doing that work.
If you're interested, I'm sure they'll be delighted to correspond with you. Potentially there will be
some collaboration with the folks both at Microsoft Research and at the universities and UPCRC.
Some of the folks are here in the audience. I'll highlight that when they come up. Others,
different parts of the world. I'll mention that as well.
Now, ICE comes really from a combination of trends coming together. Social networking has
been growing tremendously. A very important use of computing for us to be aware of.
One aspect of social networking that really uses compute on the client is the user-generated
content. Now, that can be just the text that someone puts in a blog or types in a chat room, but
people are getting more sophisticated, sometimes creating their own rather interesting videos and,
in the virtual worlds, more complex 3-D content.
The idea is we're moving from a world where the content displayed and used in applications
doesn't just come from professionals, as it does in a traditional PC game, for example.
Broadband connectivity means we can draw on a combination of the client and a rich network of
servers, the so-called cloud, for example. And one of the questions is how best to take
advantage of that computation capability and to deal with the complexity that having a
distributed application implies.
Mobile computing is growing. It won't be too long before desktops begin to disappear or turn
entirely into a local server or specialized work station because so many people value the ability to
take the client with them, whether in the form factor I'm using here or hand-held device or
combination working together as the earlier CBIR presenter illustrated with his combination of
iPhone and his Macintosh computer.
Visual computing is something we're very interested in because it's becoming a major fraction of
the compute use on the clients. PC clients are increasingly being called on to interact in a much
more rich way. Visual computing is part of that.
When I talk about visual computing, I mean much more than graphics, though. Visual computing
in our terminology is a combination of not just photo realistic rendering of images, but also the
combination with modeling and simulation that makes your interaction with that representation
realistic and responsive.
It requires physics. It requires recognition, as well as rendering. It requires a mix of computation,
not just data flow, flop oriented graphics processor, but the more general purpose and actually
the combination of the different kinds of compute come into play with the kind of applications we
think of when we talk about visual computing.
So all of the different kind of very interactive application environments that, yes, include video and
3-D and other visual representation, but mixed in with a representation of the world.
Now, there's one other aspect that comes to what we call connected experience. Why do we say
immersive connected experiences? And that's because many of those visual computing uses
draw on the fact that people want to connect to other people; that we're not just talking about the
client being connected to the server to deliver computes or storage, but we're also talking about
the gaming and collaboration and retailing and all the other ways in which the computer is used
as a tool of communication and interaction.
And so immersive as well as connected environments, in fact we see that the direction that the
web is going, going from what was originally text oriented and static stereotype interaction to add
more and more graphics and video and increasingly 3-D, we see immersive connected
experiences, our name for that combination and the direction that the 3-D Internet is going.
So there are applications that draw on compute and the connection to enhance the actual world
to mix computed images and computed modeling information with video and data from the real
world.
I'm sure you're all aware of the applications that allow you to go to a location, see a map of it,
overlay that map with a satellite picture, draw on another application to actually get a 3-D
representation that you can interact with, see a street level representation.
So we really are taking the actual world and overlaying on it some abstractions that add
information to the raw image that people are seeing. We're also creating completely artificial
worlds, whether in multiplayer games and massively multiplayer games like World of Warcraft,
et cetera, which are becoming very popular, or in completely artificial environments that are
created by the end users, where they interact in unstructured ways or around a structure, but not
in a prefabricated, prepared game.
Those so-called virtual worlds as opposed to an environment created for a predetermined game
are becoming extremely popular. Many millions of users. Last year, 2007, actually, more than a
billion dollars was invested, and in 2008, in an absolutely tough economic environment, there was
still almost $600 million invested. In fact, there were over twice as many different world
investments from different companies as in 2007; the Club Penguin acquisition by Disney
accounts for like 400 million of that billion.
So it's actually broadening and continuing. Most of the interaction is going on without really using
the full potential of a computer to deliver a rich model and a rich visual environment. You see
Habbo and Neopets and Poptropica, which involve interaction through a browser and more of a
two-and-a-half-D animation environment.
They still have the idea of a visual representation, a high level of interactivity, a model of the
world, and in fact commerce. Almost all of them involve the ability for the participants to spend
money to add to their environment. They involve them creating and having possessions in that
environment. Some of them, like Second Life, allow that economic return to flow both ways,
letting dollars move in and out of the environment.
So we're getting a large number of virtual worlds and a lot of participants. It's taking hold in the
youth. Most of the participants, certainly a little more than half, are preteen or younger.
And so people are growing up with this as a way of interacting with each other through virtual
environments.
The last bullet here on the right-hand side I think is particularly significant, as Second Life moved
up in the Nielsen rankings of what they call PC games. These are actually 180,000 measured
households. It's not a survey. It's not self-report. It's like Nielsen ratings on TV: they instrument
systems and track what's used and how.
And Second Life accounted for the second largest fraction of the total minutes played, and people
were spending an average of 680 minutes per week engaged with the virtual world in Second Life.
Second only to World of Warcraft.
So very important. A lot of eyeball time. A lot of participants having their own economy, grabbing
the consumers of tomorrow, and in fact a significant fraction of the money spent today. So
of course it's interesting to Intel as a future usage model.
We also have augmented reality, as opposed to the virtual world, particularly with mobile
devices. The mobile devices are great for carrying around sensors: cameras, acceleration devices
to sense orientation and motion. All kinds of things can be packaged into that device with a
wireless connection to a lot of compute resources. What do you do with it? Well, often that
interaction involves augmenting what's being captured on the device with information that's
centrally stored or centrally available.
We've got some examples there. Being able to use the camera on the device and overlay
instruction on what to do to disassemble the engine, for example, or a translation or identifying
some device.
Another area where we see that this combination of visual computing with connectedness and the
balance of compute is going to really, increasingly, create new usage models that draw on
multi-core computing, not just in the servers but on the hand-helds as well. So ICE is our name
for what we think is a compelling set of new usage models or uses of computers that's going to
drive consumption of compute and gives us an opportunity to add value by responding to it, by
giving the right kind of platforms, the right kind of processors.
What does it need? Well, there are all kinds of challenges if we're going to respond to those
uses. One, of course, traditionally figuring out what the demands are on the server and the client.
Memory bound, compute bound, storage. What kind of compute, what kind of operations are
being done, what kind of network connectivity is required. All of those are our traditional role and
something we do with all of our applications that we analyze, that we get, excuse me, in-house or
build in order to shape what we do.
There's also, though, a lot we need to do in the distributed computing realm, and I'll talk a little bit
more about that in the virtual world space. We think one of the barriers to that living up to its
potential is really the architecture, the distributed system architecture, and we have folks working
on that in our labs.
We need to look at how we deal with the fact that we're going to have hand-held devices as well
as very powerful laptop devices and things in between. How do we deal with that diversity and
the fact that there's always an installed base. You can't design for the latest. You can't design
for one fixed model. How do you avoid a lowest common denominator approach to this
distributed computing client model?
Visual content. We talked about being something that end users were creating. As much as half
or more of the cost of the title for a game is traditionally in the artwork. It's in the content, so to
speak. That just isn't there in these virtual world environments.
When end users are creating their own video and want to create their own 3-D devices, they
come up against a very difficult clumsy environment. The ability to create a video is much
advanced over the ability to create a very rich, complex environment for your virtual world, much
less design an Avatar that suits or represents you the way you want to.
Having the ability for people to create and then move their content where they want to go,
because it is their content, after all, it shouldn't just be a property of each individual place they go,
is another challenge for fulfilling this.
And finally how do we deal with sensors and connectivity when we're talking about the mobile
devices? Now, we've done analysis on platform demands. [Inaudible] and the application
research lab has done analysis of Second Life, the client, and then indirectly from the binary
operation of the server and the network connections.
We find, certainly, virtual worlds like Second Life have tremendous demands on the servers.
Very compute intensive work. 10 times as much demand on the servers as something like the
World of War Craft games, because of the way you interact with a much richer ad hoc
environment.
Clients as well really draw on the CPU and the GPU, that visual computing is a key part of what
the job of the client is in the virtual world space at least. Again, very compute intensive, very high
utilization. Very important to the experience.
And finally the network connectivity, it highlights another challenge. You'll see there are two lines
there. In the virtual world space you can't predistribute content on a DVD and draw on it. It's
being created and shared with people as you come across them into the world.
You move into a different land or world or part of the world and meet people and places and
things that you've never met before. How can they be predistributed. That really places
tremendous demands on the network, the ability to cache that effectively and compress it is
important.
>>: These numbers, are they kind of for like an okay experience or, you know, a great experience?
This is like what we want for a great experience, this is how much more compute you need?
>>: Was it a great experience or was it an average experience? I think it was an average
experience. I don't think this is what we would want.
>>: They want great experiences.
>>: I should say there's a lot of potential to utilize even more to improve it.
A challenge with this aspect of ICE in the virtual world is dealing with the complexity of the
environment. The number of users, the number of objects, the object behavior, the realism of the
simulation all drastically increase the complexity.
It's far from linear in the number of objects. It's far from linear in the complexity, the realism of the
modeling that you do. And right now, because of our inability to really supply all that's desired,
the current implementations are severely limited, simplifying, for example, the shapes of an
Avatar when determining collisions, and suffering latency and delay effects.
It's easy to move too fast in the wrong circumstances and fly right through a wall, to leave behind
certain objects that you desire.
So the challenge is that delays in processing introduce artifacts in display and synchronization and
slow down the interaction. It would be infeasible to present a virtual concert to thousands of
people in a virtual world environment today. It just could not sustain the interaction among that
many people.
>>: So some people have done work in sharing bandwidth. We have a whole bunch of people
standing together with cell phones. They could take the union of their bandwidth by
communicating with one another through bluetooth or something like that. And they've done
some demo where they can actually multiply the bandwidth. Would that be a useful feature here?
>>: The ability to create a richer activity, yes. And the architecture of the distributed system,
making it more scaleable and techniques like that is part of the research that is going on in the
labs.
This is part of what we have to figure out how to deliver. The distributed system involves not just
a given virtual world or environment like the Second Life world, but interaction with others, the
ability to share content back and forth.
It has both global resources to manage as well as regional things to take care of: assets and
simulations in local islands, for example. And it has connections to clients, both powerful,
compute-capable ones and limited but very sensor-rich devices. Let me see if I can make this run.
How do we design for this kind of environment? That's what we're interested in; that's where our
research is going. Each of these pipes is a potential bottleneck. Each of these resource
managers and services is a potential bottleneck, unless you have the ability to replicate and
scale. The ability to dynamically adjust to the client that is being used and to have multiple types
of clients participate is really critical, as is the ability to dynamically share the workload between
the server resources and the client. Those are the requirements that we want to meet.
Okay. So what we see happening, in order to really scale, is we need to move from monolithic
creations to the much more modular horizontal building block approach to designing these
environments.
Essentially, the same kind of thing that the Internet went through. I don't know how many of you
used Prodigy or CompuServe; one of the first beta programs I was on with Microsoft, I reported
everything on CompuServe.
Those evolved to today's web, and the open platform that allows a much richer set of things to
grow, unanticipated things to grow and share. That's what we think needs to happen in order to
fully realize the capabilities and the environment.
In the individual content space, we need to move from professionally created content to the end
user being able to create their own content easily and satisfy their own creativity, and own what
they create and move it around, because it becomes an important resource for them that they
value. It's their identity in many cases.
And then we have to deal with the problem of delivery of content. We can't predistribute. We
have to have the ability to deliver just in time and cache effectively. So we really think that there's
plenty of research necessary in order to make this kind of experience really high quality and
scalable to rich environments: lots of objects, lots of people, very faithful modeling of the
environment.
And the challenges in each area have generated research that's going on. You've seen some of
the initial results, and we're glad to share papers on workload characterization.
We're doing work to understand the platform demands in detail. And we'll carry those forward,
as we do with all of our workload analysis, into optimizing our server and client platforms to
support them.
In the distributed computation space, we're really working with the industry on proposing
modifications to the application architectures. I'll talk a little bit more about that later. And there's
research on dynamic repartitioning of the workload between the different kinds of clients and the
central compute capabilities.
A big effort I'm going to highlight in the slides today because the research is actually in our
Beijing lab, and Yiman Zang wasn't able to be here today; I wanted to show some of that because
he couldn't speak to it himself.
And I'll also point out some of the stuff going on with the distributed computing that Mick
Bowman is leading.
So let's start with that content creation. One of the ways in which we can deal with both the
distribution and the creation of content is to parameterize it. You don't want people sculpting or
drawing these 3-D models.
You don't want them to assemble them from crude components. You want them to be able to
customize the appearance, for example, of their Avatar very easily, just by moving sliders to match
what they want.
You want scripting to be able to interact with a parameterized expression model. So you don't
send images, as this video is showing, to describe someone who is moving their features as well
as changing their expression; you want to send parameters to a model that is predistributed.
That gives you much more flexibility in design. It gives you the ability to send much higher-level
semantic information, and therefore much more compressed information, over your
network.
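To make the "send parameters, not pixels" idea concrete, here is a minimal sketch in Python. It assumes a hypothetical predistributed blendshape-style face model; the class name, the number of parameters, and the sizes are illustrative, not the actual system described in the talk.

    import numpy as np

    # Hypothetical predistributed face model: a neutral mesh plus per-parameter offsets.
    class ExpressionModel:
        def __init__(self, neutral, blendshapes):
            self.neutral = neutral          # (V, 3) vertex positions
            self.blendshapes = blendshapes  # (K, V, 3) offsets, one per expression parameter

        def pose(self, weights):
            """Reconstruct a face mesh from K expression weights in [0, 1]."""
            w = np.asarray(weights).reshape(-1, 1, 1)
            return self.neutral + (w * self.blendshapes).sum(axis=0)

    V, K = 5000, 32
    model = ExpressionModel(np.zeros((V, 3)), 0.01 * np.random.randn(K, V, 3))

    # Per update, the sender transmits only K floats instead of a video frame.
    weights = np.random.rand(K)                        # e.g. "smile" = 0.8, "brow up" = 0.2
    payload_bytes = weights.astype(np.float32).nbytes  # 128 bytes per update
    frame_bytes = 640 * 480 * 3                        # roughly 900 KB uncompressed
    mesh = model.pose(weights)                         # the receiver animates its local copy

The point is the ratio between payload_bytes and frame_bytes: the semantic parameters are orders of magnitude smaller than the imagery they stand in for.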
Now, they're drawing on databases for their research, one from the Beijing University of
Technology and another from the University of Binghamton on facial expressions. I'll show you a
little more how in a minute.
One thing we'd like to be able to do is create an Avatar that represents us better: take a picture,
for example, identify the features within it that correspond to an abstract model,
map that model back to the features in the picture, and take the result and map it to our model,
and therefore be able to manipulate the model.
So what we've done is we've taken a flat 2-D picture of someone, identified the features, mapped
those features to a model from a database of models, and then we can apply the 2-D image as a
texture to that model and manipulate it in three dimensions.
Two dimensions to three dimensions, but moreover it's the person. It's me.
Also, I would want to be able to parameterize a variety of things: the expression on the lower left,
for example, going from neutral to happy, or configuring the gender characteristics of my Avatar
or my characters.
I'd want to be able to parameterize the dynamic interaction and map it to a particular instance of a
person. So capture this kind of change of expression, apply it to an image, any kind of image
where I can identify the features, and then have the resulting expression shown in that face.
Facial features, even ethnicity, can all be varied by use of that 3-D database, treating it as a model
and allowing what we used to call morphing techniques to make it a continuously variable transition.
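As a rough illustration of that continuously variable transition, here is a minimal morphing sketch in Python; it assumes two face meshes that are already registered (same vertex ordering), which is what the 3-D database provides. Real morphable models also blend texture and use a learned basis; this only shows the continuous slider the speaker describes.

    import numpy as np

    def morph(mesh_a, mesh_b, t):
        """Blend two registered meshes; t = 0 gives mesh_a, t = 1 gives mesh_b."""
        return (1.0 - t) * mesh_a + t * mesh_b

    neutral = np.random.randn(5000, 3)                 # stand-in for the "neutral" face
    happy = neutral + 0.05 * np.random.randn(5000, 3)  # stand-in for the "happy" face

    # Sweep the slider from neutral to happy in ten steps.
    frames = [morph(neutral, happy, t) for t in np.linspace(0.0, 1.0, 10)]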
This to me was science fiction when I was growing up, right? You answer the phone with an
Avatar that looks perfect even though you've just gotten up, or maybe you represent yourself
as an Avatar of Brad Pitt or Arnold Schwarzenegger or whoever. You can do that with the
technology that's being developed in the research on content tools in our Beijing lab.
You do want to be able to create 3-D models, not just map textures and expressions onto an
Avatar, and that's a tedious process that's again very expensive to do.
And what they've done is developed a tool that takes pictures around an object, provides view
planning to suggest the best camera positions to capture the object, and does the
image-based modeling. In other words, it tracks and does correspondence analysis across the
images, turns that into a 3-D representation, and extracts the texture, creating, therefore, a 3-D
model that you can use and animate in your virtual world.
So we go from something that is done by experts, expensive and tedious, maybe improved
with a 3-D scanner, to something the ordinary person can do with the appropriate use of their
digital camera.
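As a hedged sketch of the correspondence step inside such a tool, the snippet below matches features between two photos of an object and recovers the relative camera pose with OpenCV. The file names and camera intrinsics are placeholders; view planning, dense reconstruction, and texture extraction, which the actual tool does, are not shown.

    import cv2
    import numpy as np

    img1 = cv2.imread("object_view_1.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder photos
    img2 = cv2.imread("object_view_2.jpg", cv2.IMREAD_GRAYSCALE)

    # Detect and match local features between the two views.
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:500]

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Assumed intrinsics; a real pipeline would calibrate the camera.
    K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
    E, inlier_mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
    # R, t give the relative camera motion between the two shots; triangulating the
    # inlier matches across many such pairs yields the sparse 3-D points that the
    # textured model is then built on.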
We have work at the Intel Research Seattle lab on what we call everyday sensing and perception.
And our goal is to be able to infer context from the sensors in your hand-held device, accurately,
90 percent of the time for 90 percent of your day. 90 percent is actually very hard to do, and it's
something you're not going to be able to do without visual information, without video capture and
vision and recognition.
Moreover, if you're talking about doing it with sensors that are in a hand-held, one of the
fundamental problems is the ability to do preprocessing on the device to reduce the bandwidth
required and the latency, and pass it to the server where the heavy-duty compute goes on, and
then the result comes back.
And so they're very much dealing with the ability to create a synopsis in order to reason on it with
much more compute power in a central compute resource.
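A minimal sketch of that split, under the assumption that a coarse color histogram is an acceptable synopsis (the real descriptor would be richer): the hand-held reduces each frame to a few hundred numbers and ships only those to the server.

    import numpy as np

    def synopsis(frame_rgb, bins=8):
        """Reduce an HxWx3 frame to a small, normalized color histogram."""
        hist, _ = np.histogramdd(
            frame_rgb.reshape(-1, 3), bins=(bins, bins, bins), range=[(0, 256)] * 3
        )
        hist = hist.flatten()
        return (hist / hist.sum()).astype(np.float32)

    frame = np.random.randint(0, 256, size=(480, 640, 3))  # stand-in for a camera frame
    descriptor = synopsis(frame)                            # 8*8*8 = 512 floats, about 2 KB
    # send(descriptor) upstream instead of the raw frame; the server runs the heavy
    # recognition on the compact synopsis and returns a small result.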
We're engaging the virtual world community through OpenSim. It allows us to work with a wide
range of industry partners and a diverse community, Microsoft, IBM, Intel, a wide variety of
participants, on this simulation environment for virtual worlds.
We have been working with partners to try to put together a road map for virtual worlds in order to
try to improve the success rate or the likelihood of success of all these virtual worlds that are
springing up to understand what would make them more available, how people could be more
productive with their investments by taking all of our state-of-the-art industry-wide thinking and
making it available.
We are creating -- oop, I moved too quickly. Here we go. How am I doing for time, Jim?
>>: About five minutes.
>>: Great. Thank you. We wanted a test bed that was computationally interesting and
challenging, and that would give us content and experience with real-world customers.
So we partnered with SIGARCH and IEEE in order to create ScienceSim, a simulation
environment for presenting results, and an interactive environment for Supercomputing
2009.
And so the idea is to have the opportunity for interactive presentation of results in a simulated
environment be part of something the high performance computing community can take
advantage of.
Both a chance to show off as well as a future education tool. And actually the results of our
interaction have been concrete. I think there's been something like a 10x improvement in scaling
so far from architectural changes and suggestions that we've proposed to the OpenSim community.
We've got an IETF MMOX OGP proposal in to create the interface, or actually to use industry-
standard interfaces, to make the virtual world simulation environment more scalable. And I'm
sure Mick would be glad to fill you in on what that exactly entails.
We're actively engaged in this open environment to facilitate what we think is something people
are interested in, would be able to use, and that would exploit the value of the computes we can
deliver, but it needs some architectural leadership from industry players like Intel and Microsoft.
This ICE research actually ties together with our other main initiatives. I mentioned how we have
a tera-scale computing initiative. The performance to support that and the platform
characteristics that allow that kind of compute to be delivered, the balanced platform with the right
acceleration, memory, bandwidth, and compute capabilities, all of those are part of the tera-scale
computing initiative.
We have another large effort we call Carry Small, Live Large, which is the idea of getting
tremendous benefit from our hand-held computers working in concert with clients, peers and the
server world. That involves the use of sensors and wireless connectivity futures, and the
augmented reality research is part of it.
Visual computing is really a big thrust for us. We've been talking about the work underway to
develop the Larrabee architecture and have talked at SIGARC and elsewhere about the
characteristics of Larrabee.
We're making a big investment in that space for future products. That needs to be fed by our
research. We've had a long-standing relationship with Saarland, so we put a large amount of
money into what will be our largest lab research effort in Europe. It will be part of our Intel lab
environment or collection, I'm missing the word, in Europe.
Jim Hurley is part of our CTG visual architecture research lab and our microprocessor group in
CTG. And colleagues at Saarland, who we've again been working with for some time, and
others in Germany are going to be participating in a really rich research agenda that, again,
is much more than just raw graphics or 3-D.
The concept of visual information, recognition capture and the simulation and output, what it
makes possible in terms of interactions, the human computer interaction that's possible given
these modalities.
So again another big investment in what is very much the underlying technologies for the
immersive computing environments.
So we think this is an exciting area of applications. We've sort of come way up from our previous
talks about the enabling algorithms and technologies to a much broader view of the ecosystem
trends and what they're giving rise to in usage models and applications.
We think it's one that's going to draw on all the research we've described so far to deliver
something that end users are already interested in, are already beginning to pay for, and that will
make use of our processor and platform technology.
Any questions?
>>: Thank you.
[applause]
>>: Good afternoon. My name is Phil Chou. I manage the communication and collaboration
systems group here at MSR, and we do a bunch of work in telepresence, and I'm also involved in
a cross group initiative in telepresence here.
Telepresence is a word that was coined by Bill Buxton in the early nineties as part of the Ontario
Telepresence Project. He meant it to be rather broad in interpretation: technology to support a
sense of social proximity despite geographic and/or temporal difference.
Today telepresence in the industry takes a narrower view: that of a video conferencing
experience that creates the illusion that remote participants are in the same room with you. It's
more of a tele-immersion definition, I think.
But the best way to understand what that means in the industry today is to take a look at this
30-second Cisco commercial. [commercial]
>>: So...
>>: Welcome to a network where body language is business language. Welcome to the human
network.
>>: We've just seen a high-definition, full life-sized video conferencing experience that's so
compelling that one participant forgot that his counterpart was actually remote.
This Cisco telepresence system, along with others like it, costs several hundred thousand dollars
per room. And, of course, you need at least two rooms to make it useful.
And that doesn't count the additional tens of thousands per month in operating costs. So it
finds a rather narrow market. But there is a market, popular among executives in particular;
Microsoft itself is purchasing a bunch of these things.
But the general feeling we have at Microsoft is that these are somewhat akin to the mainframes of
the '70s, and so there's a lot of opportunity sort of downstream from that.
So one of the takes we have on telepresence in research is that it could be closer to ubiquitous
computing, which is a notion that Mark Weiser in the computer systems lab at Xerox
PARC promulgated from '88 through '95. And the idea is that instead of computation being delivered
through boxes on your desk or from the back room, computation would be woven
into the fabric of the environment. So they built a bunch of devices at PARC: tabs, pads and
liveboards, which are today's PDAs and tablets and interactive whiteboards.
An example today might be Craig Mundie, our CTO's, conference room over here, in which
every surface of the room is some interactive display, an interactive surface: all the walls,
the table and everything.
So there the room is the computer, and that might be what ubiquitous computing is today.
We're interested more in the communication aspects of ubiquitous computing. This was called
ubiquitous media by Buxton in the early '90s. And it's really driven by sensing and rendering
devices, input and output devices for both audio and video, meaning microphones, cameras,
speakers, displays. So the ubiquity of those things, particularly in cell phones.
All of us have all four of these modalities sitting in our pockets. And there are also more
infrastructure-like things: laptops, desktop monitors, conference room devices, IP cameras.
I can see six different cameras in this room, so luxury rooms are highly equipped. Highways: we
have hundreds of cameras publicly accessible just looking at our highways. Cities like New York
and London have thousands of cameras just connected to the police network.
So these devices are really totally ubiquitous. If you look at a regular office, here's a picture of my
office: there are many devices connected to the Internet, my laptop, my phone, my desktop and
my camera phone, which is taking this picture.
And each of those devices has four or five or more high-bandwidth sensing or rendering devices.
So there are 25 broadband sensing or rendering devices in the picture. Over here, everybody's
bringing laptops and so forth; I'm estimating there are 55 to 70 microphones, loudspeakers,
cameras or displays in that conference room.
So they're really ubiquitous, and the question we ask is how can we use these devices to assist
with better communication between people.
So there are several opportunities. One could be, for example, better audio capture: you get
microphones that are closer to the people who are actually speaking. Similarly, close-up views of
participants: laptops have native cameras that are actually looking at your face. You can use this
ad hoc array of sensing devices to locate people in space in the room and help people remotely
understand what is going on in the room.
For the people in this room, or for people in the remote rooms who are listening in on this
conversation, spatialized audio lets these guys over here get a better sense of what's happening
spatially in this room. And over here, there's flexible use of display real estate, because there are
lots of displays all over.
So there are a lot of opportunities, and I'll go through some of these. This is just showing where all
the 25 devices are. Here's an example of using an ad hoc array of microphones, in this case
distributed on the laptops of people in the meeting room, to do sound source separation.
So we're putting four people, on their four laptops, into separate sound channels. If they're not
separated, then, let's say, laptop No. 2's microphone sounds kind of like a mixture.
>>: [demo].
>>: They're talking over each other. And we want to put each of them into a single channel. So
let's say Channel 2: it basically cuts out the first guy and then just brings in the second guy.
>>: [demo].
>>: And we can build location information into the source separation. So we're using either
time-of-flight-based techniques or energy-based techniques.
Here's an overview picture of people in the room gathered around microphones glued to the
desk. This shows that, using only the sound uttered by the people who are talking, no pinging
sounds or anything in the room, we can jointly locate the active speaker, here at this red dot, and
all of the microphone locations, which are accurate to the ground truth down to a few
centimeters.
And if we use energy-based methods it's maybe a little less accurate, but we don't need the tight
synchronization between devices.
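To give a flavor of the energy-based variant, here is a minimal sketch in Python. It assumes the microphone positions are already known (the talk's system estimates them jointly), that received energy falls off roughly as 1/r^2, and it simply grid-searches for the source position whose predicted energy pattern best matches the measurements.

    import numpy as np

    def locate_by_energy(mic_positions, energies, grid_step=0.05, extent=5.0):
        """Return the (x, y) grid point whose predicted 1/r^2 energies best match."""
        xs = np.arange(0.0, extent, grid_step)
        energies = energies / energies.max()
        best, best_err = None, np.inf
        for x in xs:
            for y in xs:
                d2 = ((mic_positions - np.array([x, y])) ** 2).sum(axis=1) + 1e-6
                predicted = 1.0 / d2
                predicted = predicted / predicted.max()
                err = ((predicted - energies) ** 2).sum()
                if err < best_err:
                    best, best_err = (x, y), err
        return best

    mics = np.array([[0.5, 0.5], [3.5, 0.6], [2.0, 3.0], [0.8, 2.8]])  # assumed layout, meters
    frame_energies = np.array([0.9, 0.2, 0.35, 0.6])                   # per-mic RMS energy
    print(locate_by_energy(mics, frame_energies))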
So what you can do is, once you have the sources separated and located in space, you can help
the people in the other remote rooms understand what's happening spatially.
So instead of the traditional mono signal in a room, which sounds like this.
>>: [demo].
>>: So that's the traditional mono thing, and only the people in the middle can probably hear the
stereo. But if the sources are spatialized, it's easier to tell what's happening in the room, how many
people there are.
>>: [demo].
>>: So it's taking advantage of the cocktail party effect that people have. Of course, if you do the
spatialization tricks, then you have to deal with the acoustic echo cancellation problems that
happen in the room.
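As a rough sketch of the spatialization step, the snippet below does constant-power stereo panning of each separated talker by an assumed azimuth. The real system would use measured positions and possibly HRTFs; the angles and test tones here are purely illustrative.

    import numpy as np

    def pan(mono, azimuth_deg):
        """Constant-power pan a mono signal; -90 is hard left, +90 is hard right."""
        theta = (azimuth_deg + 90.0) / 180.0 * (np.pi / 2.0)  # map azimuth to [0, pi/2]
        left, right = np.cos(theta), np.sin(theta)
        return np.stack([left * mono, right * mono], axis=1)

    fs = 16000
    t = np.arange(fs) / fs
    talker_a = np.sin(2 * np.pi * 220 * t)   # stand-ins for the separated channels
    talker_b = np.sin(2 * np.pi * 330 * t)

    stereo = pan(talker_a, -45.0) + pan(talker_b, 30.0)  # place them left and right
    stereo /= np.abs(stereo).max()                        # normalize to avoid clipping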
If I'm in my office listening to spatialized audio and I'm talking into my microphone, that's being
interfered with by stuff coming from, now, two loudspeakers instead of just one. And there are
some inherent ambiguities that go along with two loudspeakers.
We need to cancel out what's coming out of these loudspeakers so the guys back in the
conference room don't hear what they've just said. But fortunately, because we're able to control
the spatialization, we're able to disambiguate things here and cancel out what comes in
through the microphone. So here's a concocted example of what the microphone may hear.
What you're hearing now is the spatialized audio playing out here, picked up as a mono signal.
>>: [demo].
>>: That's actually a mixture of the spatialized stuff.
>>: [demo].
>>: That was just the near end talker talking. And now you can hear that most of that stuff goes
away except when she talks.
>>: [demo] so that's all for ad hoc microphone arrays and the same thing can be done with
camera arrays. We have cameras on all the laptops sitting in front of you.
One possible thing to do for the remote person looking at the screen is, at every moment, to
select the active speaker, the person who is talking, and show the best camera for that moment.
That's sort of the traditional way of doing conferencing.
You could also show all the cameras at once, maybe in a Hollywood Squares type of layout, so
you could actually see all the faces and how people are reacting to what other people are saying.
But we're inspired by some of the Photosynth work to try to get a better spatial sense of what's
going on in the room, so that when people are looking in one direction, the remote person can
understand that they're looking at the person who is talking, or whatever.
So we have this meeting viewer, which essentially has this 3-D context view and an active
speaker view over here. And I wish I could demo it on this laptop, but I'm just going to have to
show you an old version by video.
And over here the user is trying to manipulate the three-dimensional scene to get the best view
of who's talking. And so these people are kind of arranged in the three-dimensional space.
We're also looking at mobile things, trying to augment phones with something as simple as stereo
ear buds or goggles. I won't be talking about these today, though. We're also looking at what we
call embodied social proxies, essentially George in a box: carry him into a conference room, set
him up, and you can talk to him. And there are related things, other kinds of stand devices,
some robotic-type things, some that sit on your desk. But what I want to go into in a little more
detail is not those things but what we're doing along the lines of immersion.
So we're evolving immersion from things where you're just sitting at your desk to things that take
in wider and wider fields of view and involve you in an immersive experience.
We have the idea that you might, as an individual, just sit down at a desk somewhere, have a very
wide field of view, and then you're talking with your remote counterparts, who are all distributed on
different continents, and you're sitting around a common virtual table. And as more people join
the meeting, the table may grow a little bit bigger; people slide to the side to make room for the
new person. The audio comes from the direction of the person who is talking.
You see the people from different angles, depending on where they are sitting around the table.
We're in a common environment. The environment may have tools that support the meeting
that's going on and so forth. So there are a number of immersive cues that we're taking a look at,
both in audio and video. Peripheral awareness is an important thing, we believe.
That's why we're looking at large fields of view and surround sound. Visual consistency is
important. The consistency between my environment and the remote environment. And if you
have multiple people, the consistency between their environments. And of course sort of spatial
cues.
Andrew showed a nice video where the occlusion cue was kind of messed up. Let me talk about
the last two on this list, what we're doing for motion and stereo.
So motion parallax, which we sometimes call monocular 3-D, because even if you only have one
eye you can get a sense of depth through your motion. So what you see on the left is a video of
how we can change what's displayed on the screen as a function of the position of the viewer, in
order to give a sense of three dimensions.
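A minimal sketch of the geometry behind that, assuming the head tracker gives us the eye position in screen-plane units: project each scene point through the moving eye onto a fixed screen plane, so points at screen depth stay put while deeper points drift, which is the parallax cue.

    import numpy as np

    def project_offaxis(points_xyz, eye_xy, screen_z=1.0):
        """Project 3-D points onto the screen plane z = screen_z for an eye at (ex, ey, 0)."""
        X, Y, Z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
        s = screen_z / Z                   # perspective scale toward the screen plane
        u = X * s + eye_xy[0] * (1.0 - s)  # points at Z == screen_z don't move with the eye
        v = Y * s + eye_xy[1] * (1.0 - s)  # farther points shift more as the head moves
        return np.stack([u, v], axis=1)

    scene = np.array([[0.0, 0.0, 1.0], [0.2, 0.1, 3.0], [-0.3, 0.0, 10.0]])
    print(project_offaxis(scene, eye_xy=(0.15, 0.0)))   # head moved slightly to the right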
>>: I can't help but notice the viewer is moving a little slower. Is that because there are some
frame rate issues there?
>>: Not so much from the generation, but the tracking in this case was done by a magnetometer.
And there's some issues with that.
After this was made we started tracking using face detection and head pose estimation, and
that's actually what you see on the right side. So here Chau is looking at this screen, and this red
rectangle is detecting his face and its size right now and driving his screen.
And we're also now doing a little more sophisticated stuff with head pose estimation. We're also
looking at stereo 3-D. Typically using, for display systems, existing technologies such as auto
stereoscopic displays or displays where you need some kind of goggles.
Even just putting images on these things actually requires a fair amount of work; there's a lot of
data. You scan through the display, putting pixels down, and each one comes from, say, a
different camera.
So upstairs we have a five-view autostereoscopic display, and we have five cameras feeding this
display. For each pixel that we output to the display we have to pick it from one of the cameras,
and at a particular position in that camera's image, depending on how the cameras are
calibrated. So we have to do some bilinear interpolation, for example, on each of those cameras,
and we just couldn't fit it onto our four-core machine, but we squeezed it into the graphics
processor. So there are some issues even doing elementary things here.
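To make the per-pixel work concrete, here is a hedged sketch of composing one frame for a five-view display: each output pixel is assigned to one of the five camera images and sampled with bilinear interpolation. The interleaving rule (view = column mod 5) and the identity mapping from display to camera coordinates are simplifying assumptions; the real mapping comes from the display geometry and camera calibration.

    import numpy as np

    def bilinear(img, x, y):
        """Bilinearly sample an (H, W, 3) image at a fractional (x, y)."""
        h, w = img.shape[:2]
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
        fx, fy = x - x0, y - y0
        top = (1 - fx) * img[y0, x0] + fx * img[y0, x1]
        bot = (1 - fx) * img[y1, x0] + fx * img[y1, x1]
        return (1 - fy) * top + fy * bot

    def compose_display(cameras, out_h, out_w):
        """Interleave five camera images into one autostereoscopic display frame."""
        out = np.zeros((out_h, out_w, 3), dtype=np.float32)
        for yd in range(out_h):
            for xd in range(out_w):
                view = xd % len(cameras)   # which camera view feeds this display pixel
                out[yd, xd] = bilinear(cameras[view], float(xd), float(yd))
        return out

    cams = [np.random.rand(120, 160, 3).astype(np.float32) for _ in range(5)]
    frame = compose_display(cams, 120, 160)   # this per-pixel loop is what moved to the GPU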
That's on the display side. Of course, you need to capture multi view imagery. So image-based
rendering is an important element of that. And so we do spend some time trying to do camera
array stuff and view interpolation. I don't want to go all the way to the end.
In order to put people into a common background, you have to actually remove their existing
background, so foreground/background segmentation and replacement is important.
With enough computational power you can do a pretty good job at it, but it takes many times real
time to run. So we've been looking at things that are much faster, such as using infrared depth
cams to help speed that up. And they do a remarkably good job in terms of real-time
performance, easily getting 30 frames a second here. But they're not so good at hair. So we have
to find the right balance between some of these technologies.
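A minimal sketch of that balance, under assumed thresholds: use the depth map for a cheap initial mask, then flag a thin band around its boundary as unknown so that a slower color-based matting pass (not shown) only has to run there, which is roughly where hair defeats the depth cue.

    import numpy as np

    def box_dilate(mask, r):
        """Dilate a boolean mask with a (2r+1) x (2r+1) box by shifting and OR-ing.
        (np.roll wraps at the image border, which is fine for a sketch.)"""
        out = mask.copy()
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                out |= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
        return out

    def depth_segment(depth_m, person_max_depth=1.5, band=4):
        """Return (confident foreground, unknown band) masks from a depth map in meters."""
        valid = depth_m > 0                    # zero means no depth reading
        fg = valid & (depth_m < person_max_depth)
        unknown = box_dilate(fg, band) & box_dilate(~fg, band)   # pixels near the boundary
        return fg & ~unknown, unknown

    depth = np.full((480, 640), 3.0)    # background about 3 m away
    depth[100:400, 200:440] = 1.0       # a person about 1 m away
    foreground, unknown = depth_segment(depth)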
Another alternative to the image-based rendering approach would be to go more down the graphics
route. And of course we've been looking at using Avatars; they're necessary for representing remote
people who don't have audio and video, who don't have video cameras trained on them, like if you're
on a mobile phone.
And then finally we're trying to put all this together into an experience, building these things. And
this list is just for video processing, there's no audio processing listed here, but just for video
processing there are a lot of different steps that we have to go through. On the sending side, after
camera calibration, there's camera capture, distortion correction, viewpoint interpolation, foreground
and background separation, and the coding and networking stuff. And on the receiving side we do
some face tracking to figure out where the person is looking, trying to determine where his viewpoint
is, and then, after receiving stuff from the network and decoding it, we do some final viewpoint
interpolation and then the scene synthesis.
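A hedged sketch of that chain, with every stage reduced to a placeholder that just records what would happen; the names mirror the list above, and the structure shows the part that matters: one full send-side chain per remote participant, since each participant needs a unique view.

    def make_send_chain(remote_viewpoint):
        """Send-side processing of one captured frame toward one remote viewer."""
        def chain(frame):
            trace = [
                "camera capture + apply calibration",
                "distortion correction",
                "viewpoint interpolation toward " + remote_viewpoint,
                "foreground/background separation",
                "encode + hand to the network",
            ]
            return {"frame": frame, "trace": trace}
        return chain

    def receive_side(packet, local_head_pose):
        """Receive side: track the local viewer, decode, interpolate, synthesize."""
        return {
            "packet": packet,
            "trace": [
                "face tracking / head pose estimation -> " + str(local_head_pose),
                "decode",
                "final viewpoint interpolation",
                "scene synthesis",
            ],
        }

    remote_sites = ["site_A", "site_B", "site_C"]             # illustrative participants
    send_chains = [make_send_chain(s) for s in remote_sites]  # one chain per participant
    results = [chain("frame_0") for chain in send_chains]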
And this is pretty much, pretty close to linear with the number of other participants in the meeting,
because for each other participant, each one of them has a unique view that you're trying to send
to or receive from.
So I wanted to mention that besides get an immersive experience, part of what's important in
communication is actually various kinds of scene analysis. Some of which we've seen already
with like where are people spatially.
But there are others. So, for example, Chau here is using this fish eye camera shown here to
direct where the pan/tilt/zoom camera should be pointing at.
So we have to figure out what are the salient features, what are the important gestures to
capture.
So one reason scene analysis is important, and this is a bridge to the next talk, Lilly's talk, is social
interaction analysis. So you want to know where people are looking. To whom are they looking?
So that maybe the Avatar that you're looking at could be looking back. And you can measure
things like who is influencing whom, who is consistent in their views, who is agreeing with whom
and disagreeing with whom. And you could use these within the context of a single meeting to give
feedback: you're talking too much, this guy isn't saying enough, is the meeting really constructive.
Let's say it's a brainstorming meeting: is somebody dominating, or maybe that's the way it's
supposed to be because it's a presentation.
And then over time, from many, many meetings conducted through these telepresence systems
whose social interactions are being analyzed, you can infer what the groups are, what the social
interaction network is, and what it looks like, from real meetings.
So I will just end by going back to the ubiquitous communication idea. This is just a shot of the
heterogeneity that we live in and will continue to live in and just a statement that there's no single
telepresence or communication device. There will be plenty of things ranging from mobile
phones all the way up to the most sophisticated conference rooms.
And the implication of this for architecture, I guess, is, well, there's a lot of data flowing between
the end points. So data is a huge thing.
And also the heterogeneity means that it's not really clear where the computation is going to go.
Mobile phones won't have the same capabilities as these high end rooms. The cloud isn't even
pictured here. So you can imagine services in the cloud.
Some of the computation could be used to save on the communication bandwidth. There's all
these interrelated issues and I'll just use that to close. Thanks.
[applause]
>>: Any questions?
>>: So when are we going to get this phone? [laughter].
>>: I'm tired of getting on planes.
>>: One of the next things we're trying to look at, we're talking with Dennis, actually, is how to
deal with large scale meetings such as this one or conferences that we go to and are there any
ways to help do a poster remotely, for example.
>>: Do speech recognition technologies fit into this at all?
>>: Yes, and machine translation technology. You would certainly want to be able to bridge
cultural and language barriers like that.
>>: Like making minutes automatically?
>>: Yes, for summarization of meetings, absolutely.
>>: What is the most sophisticated of the things you have today that you've actually used to hold
a meeting?
>>: Well, in terms of actual use, we don't have a picture here but we did -- here's one. This
group, our group, did this RoundTable device, they call it. It's like a Polycom speakerphone, but it
has a stalk on top with five cameras, and it does image stitching. It's 10-year-old research
technology, but it came out just a few years ago as a product, and we actually use it pretty
consistently. It's in most of the rooms in this building.
>>: Sophisticated or not?
>>: Is it sophisticated? 10 years ago maybe, yes. Thanks.
[applause]