>>: All right. If everybody will take your seats, we can get going again. Just one announcement. The reception this evening -- I have maps up here -- is very close to where we are now, literally about a quarter of a mile, so it's an easy five-minute walk. The basic trick is to walk up one side of the parking garage or the other (the shorter way is on that side of the garage), go to the next street, turn left, walk up to the stoplight, and cross the street. You'll see it in front of you. Microsoft has a big new complex of buildings there, with a soccer field and a bunch of places that look like stores. The restaurant is called Spitfire. That's where we'll be at 6:00, and everybody is obviously invited. If you need a map, they're up front; I have plenty of copies. All right, back to the presentations. So there we go. >>: My talk is about work going on at Berkeley and in the UPCRC on applying image recognition techniques to image retrieval. We can start the talk while this gets settled. Fundamentally, the problem we're trying to solve is making digital media actually worth something to people. There's a needle-in-the-haystack problem: once you accumulate a lot of digital media at home, you can't actually find things, so you can't actually use them. I like to say that the incremental value of a new piece of data is sometimes negative, because the more data you accumulate, the less of it you can actually use. That's a problem we need to fix if we're going to keep moving forward. So the problem we're looking at is finding images in consumer databases. These are all pictures of my family, and I would like to find one that I'm interested in. How do I do that? I'd like to talk a little bit about the spectrum I see between image recognition techniques and retrieval. There are already a lot of solutions out there for doing image search of various kinds. At one end, in what has traditionally been called image retrieval, we have very lightweight search problems: metadata search over tagged images. If I'm looking for a picture of Jim in my database, it looks for all pictures tagged with Jim and pulls them up. The spectrum runs all the way down to image recognition techniques, where we go from textual context and image tags to coarse features like the histogram of an image. If I search for an image that looks like a mountain, a lot of mountains have similar color profiles, so that might work -- though of course it fails on a lot of problems. So we keep adding more and finer features until we get to the level of actually trying to understand the semantics of what's going on in the image, and as we do this the computation on an image goes from milliseconds all the way to several minutes. The goal of this project is to show how parallelism can shift the kinds of things we do in image retrieval and bring techniques that have traditionally been far too expensive to apply to image search into a domain where we can actually consider them.
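As a concrete illustration of the coarse-feature end of that spectrum, here is a minimal sketch of histogram-based similarity search; the bin count, the joint RGB histogram, and the histogram-intersection score are illustrative choices, not details from the talk.

```python
import numpy as np

def color_histogram(image, bins=8):
    """image: H x W x 3 uint8 array; returns a normalized joint RGB histogram."""
    hist, _ = np.histogramdd(image.reshape(-1, 3).astype(float),
                             bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def rank_by_histogram(query_image, database_images):
    """Rank database images by histogram intersection with the query (best first)."""
    q = color_histogram(query_image)
    scores = [np.minimum(q, color_histogram(img)).sum() for img in database_images]
    return np.argsort(scores)[::-1]
```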
There are a lot of interesting, computationally intensive algorithms in computer vision, and we saw some great talks this morning on computer vision problems. The ones I'm going to talk about today are high quality image contour detection and some machine learning techniques for classifying images, namely support vector machines. For example, the highest quality image contour detector known today takes minutes per image; with parallelism and by rethinking the algorithms behind it, we take it down to seconds. So we're trying to take techniques that were far too expensive to even consider for search applications and bring them into search. This is the outline of what a lot of retrieval applications look like. We start with a set of images; if we have a database, presumably we're incrementally updating it, and as new images arrive we perform feature extraction on them. Then the user gives us a query -- they ask a question of this database -- and that query, along with the features we have, is used to train a classifier that differentiates between images that are interesting or potentially interesting to the user and those which are not. That classifier is applied to the database, results are returned to the user, and we iterate from there. First I'm going to talk about the image contour detection problem we've been working on, which falls under feature extraction. One of the fundamental steps in understanding objects in an image is figuring out what they are. Image segmentation is a fundamental problem in computer vision, and it's closely allied with image contour detection: once you have a set of good contours for an image, it's fairly straightforward to get a good segmentation of it. What I mean by that is the work to find good contours can be on the order of several minutes, while the work to go from contours to segments can be on the order of one second. So these are closely related problems, and the idea is that once we have segments of an image, we can extract features from each segment, classify the segments, and expose them for searching. The problem is that image segmentation is hard, and high quality segmentations can take minutes per image. So let's talk a little about the image contour detector we're working with. First, it's worth pointing out that, like most computer vision problems, this one is subjective. The actual contours in an image depend on your perspective and maybe even your mood. If I showed you a picture of a koala and asked you for the boundaries of the objects in the scene, some people might say the koala's arm is a separate object from the koala, and they would be right, depending on their perspective. Sometimes you want to differentiate the arm of the koala, and other times you want to say the koala is just one object. So it's actually a very subjective problem. And the surprising thing is that humans actually agree pretty well on it.
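A minimal sketch of the retrieval loop outlined above, with `extract_features` and `train_classifier` as placeholders for the contour and SVM machinery discussed later; this illustrates the structure, not the speaker's actual system.

```python
def build_index(images, extract_features):
    # Offline: run feature extraction as images enter the database.
    return [extract_features(img) for img in images]

def answer_query(index, positives, negatives, train_classifier, top_k=20):
    # Online: train a per-query classifier (interesting vs. not), score the database,
    # return the best hits; the user inspects them and refines the query.
    clf = train_classifier(positives, negatives)
    scores = [clf(features) for features in index]
    ranked = sorted(range(len(index)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]
```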
And so [inaudible], a computer vision professor we're collaborating with at Berkeley, got together a group of undergrads and had them label a set of several hundred images, finding all the boundaries, with multiple labelings of each image; then there's some correspondence problem to figure out which contours are actually which. It turns out that people largely agree on what the important contours of an image are. In this middle image right here we have overlays of many humans' interpretations of the koala. You can see there's only one person who decided the arm was a separate object from the rest of the koala, but it's still an important piece of information. On the right are contours generated by the algorithm we're talking about. So when we talk about accuracy of image contour detection, we're going to base it on ground truth from this test set of hand-labeled images. Our goal is to find the objects that people find interesting. Speaking to that, here's a precision-recall curve, which is related to the false-positive, ROC curves people were showing earlier. For any of these detectors there's a straightforward thing to do: if I were really concerned with never mislabeling a pixel as an edge, I could say this image has no edges in it. I would be right in the sense that I would never label a pixel as an edge when it's not one; however, I wouldn't find any of the pixels which are edge pixels. That would put us in this regime of high precision -- we never claim a pixel is an edge when it's not -- but low recall, in the sense that we didn't actually find any edges. Conversely, we could label every single pixel in the image as an edge, and we would certainly not miss any edge pixels, but it wouldn't be very useful. So there's always a trade-off in these algorithms between precision and recall, and the goal is to be as far up toward the top right as possible. This algorithm we're working with, the global probability of boundary algorithm, was published last summer and is currently the most accurate image contour detector known. This graph shows roughly the past decade of computer vision research: they went through the papers, applied the techniques from those papers to their database, and showed that, yes, the quality of contours is improving, and we've gotten to this red point. If you boil the curve down into a single number that balances precision and recall -- it's called the F metric -- then humans are at 0.79 and this algorithm gets 0.70. Something like the Sobel edge detector or the Canny edge detector would be somewhere around 0.5. The problem with this algorithm, even though it's finding really useful boundaries, is that it's very computationally intensive: it runs about 3.7 minutes per small image, and that's a 0.15 megapixel image. That limits its applicability. If you tried to index all the images on the web, by the time you got through -- even throwing a huge datacenter at the problem -- there would be so many more images that you wouldn't be able to keep up. So this is a really computationally intensive problem.
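For reference, a minimal sketch of how the F metric mentioned above summarizes precision and recall for a pixel labeling; the real benchmark matches boundaries with a small spatial tolerance, which this simplified version ignores.

```python
def f_measure(true_edges, pred_edges):
    """true_edges, pred_edges: sets of (row, col) pixels labeled as boundary."""
    tp = len(true_edges & pred_edges)
    precision = tp / len(pred_edges) if pred_edges else 1.0  # no false alarms if we claim nothing
    recall = tp / len(true_edges) if true_edges else 1.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)     # harmonic mean
```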
So we've been looking at parallel algorithms and implementations for this algorithm, and we have brought the time for image contour detection down from about four minutes to two seconds, using very parallel algorithms on Nvidia graphics processors; I'll talk about that. What I want you to get from this slide is that there are a lot of pieces to it. There are machine learning algorithms like K-means, where we're doing clustering on the image; traditional image processing like convolving with filter banks; computer vision pieces used in other approaches, such as non-max suppression and intervening contour cues; a big generalized eigensolver, which was the dominant part of the computational problem and which I'll talk about in a second; and other image processing like skeletonization. There's a lot going on here, and we've worked on all of it. Just to give you a flavor of the kinds of computations we're looking at: K-means is a clustering problem where you start with unlabeled data and need to cluster it into K clusters. The way that's done is basically the guess-and-check method. You start by assigning random labels to all of your points, then figure out where the centroids of the clusters implied by those labels are, then relabel the points based on which centroid is closest to each point, and iterate and hope you converge somewhere. This is done on the image to find important textures: the image is convolved with a set of universal textures, which gives a feature vector for every pixel about 30 elements long, and then we cluster those into, say, 32 or 64 clusters. That gives us a way of ignoring some of the noise in the image while still finding similarity between pixels. Another thing going on here is gradient computation. We're looking for edges, and this is a way of suppressing noise while doing so. Imagine two half-discs centered at every pixel, each at some orientation and with some radius. If we sum up the responses in each half of the disc and compare how different they are, we get a measure of how strong an edge we have at that orientation, at that radius, at that pixel. We looked at this, and the original implementation actually went through all the work of summing up pixels over each of these half-discs. We decided there was a better way, and changed the algorithm to use integral images. Integral images are basically a way of using parallel prefix sums, which have been around for a long time, to do these computations where we're summing overlapping windows of an image. The way it works is that the value at each point is equal to the sum of all the elements above and to its left. So, for example, this box, if I take the integral image, turns into a single value -- let's see, where is it? -- corresponding to about here in the image, which is the sum of everything in here. That allows us to avoid repeated overlapping summations, which brings down the computational complexity and really helps. It's also good for parallelism, because we've replaced a lot of histogram problems with scans, which have much better data dependencies.
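A minimal sketch of the integral-image trick just described: two prefix-sum passes, after which any axis-aligned box sum is a constant-time lookup. The half-disc sums in the detector need more machinery (rotated images, combinations of boxes); this only shows the core primitive.

```python
import numpy as np

def integral_image(img):
    # Two prefix-sum passes: entry (r, c) holds the sum of img[:r+1, :c+1].
    return np.asarray(img, dtype=np.float64).cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, r0, c0, r1, c1):
    # Sum of img[r0:r1+1, c0:c1+1] recovered in O(1) from four corner lookups.
    total = ii[r1, c1]
    if r0 > 0:
        total -= ii[r0 - 1, c1]
    if c0 > 0:
        total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total
```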
The next thing that was really important computationally was the eigensolver. This is a spectral graph partitioning approach; we had some talks this morning about graph partitioning, and it's a very useful problem. The thing is, there are a lot of different metrics you can apply to graph partitioning. The min-cut metric, the common one in a lot of graph partitioning approaches, is bad for image segmentation and image contour detection because it tends to cut off lots of tiny little pieces everywhere. Min cut says: break the graph while disturbing the fewest edges. But what that does is isolate things that don't have many edges connecting them to the rest of the graph. If you change from a min-cut metric to a normalized-cut metric, where we minimize the number of edges we cut normalized by the sum of all the edges leaving the subgraph we're cutting, that problem is much better suited to image segmentation. However, it's NP-hard, so instead of solving it directly we approximate it with an eigensolver. It's kind of interesting that these problems are related and that we can apply an eigensolver here. Another way to think about it is that this lets us use global information about the image to find boundaries. The gradients I was just talking about are a very local operation, where for every pixel we look within a radius and try to see whether things are different enough to say there's an edge at that pixel. This, instead, is a way of globally understanding the image. Every pixel is related only to a few pixels in its surrounding neighborhood; we turn those relations into a sparse matrix that describes how each pixel relates to some of its neighbors, and when we find the eigenvectors of this matrix they turn out to correspond to regions that should group together. Here's an example of what it looks like: a picture of a guy in a hula skirt, and one of the eigenvectors that comes out of this. Notice that the eigenvector does a good job finding the important boundaries in the image. There's a lot of texture here, and a lot of confusion in the foliage back here -- things that might be edges, but it's hard to say -- and the eigenvector that comes out of this formulation is actually really clean. The eigensolver problem itself is very interesting computationally. In the ParLab we happen to have a number of experts on sparse eigensystems, Jim Demmel being one of them, so we went to Jim and said, we have this eigenproblem, how do we solve it? He helped us do some algorithmic exploration, and we found that a Lanczos algorithm with a not commonly used convergence test, the Cullum-Willoughby test, ended up being the best choice for this problem.
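A minimal sketch of the spectral step described above: build the normalized Laplacian from a sparse pixel-affinity matrix and take a few of its smallest eigenvectors. SciPy's Lanczos-based `eigsh` stands in here for the specialized Lanczos solver with the Cullum-Willoughby test that the speaker describes, and the affinity matrix `W` is assumed to be given.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def spectral_vectors(W, k=16):
    """W: sparse symmetric pixel-affinity matrix (one row/column per pixel)."""
    d = np.asarray(W.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(d))
    L = sp.identity(W.shape[0]) - D_inv_sqrt @ W @ D_inv_sqrt  # normalized Laplacian
    vals, vecs = eigsh(L, k=k, which='SA')                     # smallest eigenpairs (Lanczos)
    return vecs                                                # columns indicate pixel groupings
```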
These are the kinds of results we're seeing. Like I said earlier, the original code ran in about 3.7 minutes per image. We parallelized it over a dual-socket quad-core system using pthreads and got it down to about 30 seconds, and our GPU implementation takes two seconds. So there's a lot of good parallelism there, and it's interesting to see how Amdahl's law bites us in one form or another on these different platforms. Of course, a lot of the reduction in the eigensolver time comes from changing to a more efficient algorithm for this problem. But these results are exciting because this particular image contour detector is very powerful yet not widely used, precisely because it's so computationally intensive. We believe that with these kinds of results, people can start applying it in a lot of situations where they couldn't before. Some other data points about our GPU implementation: we ran it on a number of different GPUs -- the same binary, just running on a different number of cores -- all the way from the two cores in the integrated graphics that are now in Apple's laptops up to the GTX 280, which is the biggest GPU Nvidia sells right now, and things scale pretty well. You'll notice that at 30 cores we have two data points; somehow I dropped a label, but the lower one is the Tesla board, which has lower memory bandwidth, so that roughly measures the memory bandwidth dependence. We can also check scaling behavior in terms of image size, to see what kinds of images we can handle. It turns out that on the Tesla board, with four gigabytes of memory, we're limited to 4.8 megapixel images, because there's a lot of data generated in this process. But the runtime scales fairly well in the number of pixels. Most important is accuracy: if you're speeding something up but throwing accuracy out the window, it's not much of a help. On this benchmark, comparing against ground truth for all of these hand-contoured images, we achieve exactly the same accuracy as the serial version. So, summarizing this segment of my talk, I think we really can use parallelism to take things that were much too computationally intensive to be widely applied and bring them into domains they couldn't have addressed previously, and I'm pretty excited about that. The next thing I want to talk about is classification: the process of analyzing images and classifying them as interesting or not (it can be used for other recognition purposes as well). We have spent some time looking at support vector machines, a widely used technique for classification. In the content-based image retrieval context, using a support vector machine means finding a decision surface which separates image classes, and these classes are defined by the search query itself. The goal of the classifier is to maximize the margin between the classes, which gives us a fairly general classifier that is hopefully resistant to noise. Training an SVM is the process of taking a set of training images for which you know the truth and learning the correct classifier from them. It's a quadratic programming optimization problem, where the number of variables is equal to the number of labeled training images, and the goal is to find a weight for each of those images. The great thing about support vector machines is that they can be nonlinear. You can find a linear classification surface, which is basically just a hyperplane in some feature space separating positive examples from negative examples, or you can use a kernel, which gives you rather nonlinear classifiers; but because of the formulation of the SVM you avoid overfitting to your data, which is the common problem with nonlinear classification approaches. To do this we are using the sequential minimal optimization algorithm, which was invented at Microsoft Research by John Platt. I guess he's probably not here today -- I was hoping to say hi -- but it's a great algorithm, and it's widely used in support vector machine training.
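For context, a minimal sketch of the kernel decision surface such an SVM evaluates with the Gaussian kernel; the weights `alphas` and bias `b` are what the SMO training described next would produce, and the function names are illustrative.

```python
import numpy as np

def rbf_kernel(X, Z, gamma):
    # K[i, j] = exp(-gamma * ||X[i] - Z[j]||^2)
    sq = (X ** 2).sum(1)[:, None] + (Z ** 2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * sq)

def svm_decision(support_vectors, alphas, labels, b, gamma, queries):
    # sign( sum_i alpha_i * y_i * K(x_i, q) + b ) for each query feature vector.
    K = rbf_kernel(support_vectors, queries, gamma)
    return np.sign((alphas * labels) @ K + b)
```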
What it does is this: you may have several thousand, up to hundreds of thousands, of training points for which you know the truth, and the question is how to find a set of weights for all of them. The most expensive approach would be to take your current guess about the weights and update the entire vector of weights every time you take a step -- that's a full traditional quadratic programming method. Sequential minimal optimization goes to the other extreme of updating only two of the weights at a time. So out of your 10,000 weights you only look at two of them. In that case the quadratic programming problem turns into a trivial one-dimensional problem we could have solved back in high school: we have some constraints that describe a box in two dimensions, another constraint that describes a line in two dimensions, and we're maximizing a quadratic function over that line, which is pretty easy. What this means is that the actual optimization step of updating the vector is trivial; the hard part is figuring out which two weights to update. So the real work in the algorithm is computing some [inaudible] optimality conditions, which is done for every training point, and then you do a reduction over all of those to figure out which points you're going to update in the next step, and you keep iterating. We implemented this on Nvidia graphics processors as well, and using the Gaussian kernel, which is fairly widely used, we saw nine to about 35 times speedup over LibSVM, which is the standard package for doing support vector machine training on CPUs. We were pretty excited about that. We published a paper on it and put our software out on the Internet, and I think we have around 250 downloads of it as of today, which makes me pretty happy. So people are actually using it -- and the bad thing about that is that I get bug reports. But -- >>: [inaudible]. >>: Yeah, I guess that's right. I don't actually have to fix them. They just need to tell me about them. >>: [inaudible]. >>: Right. So the other side of using a classifier -- I just talked about training, where you're trying to find a decision surface -- is actually applying that decision surface to a database. For support vector machines that involves evaluating an equation, and it ends up looking like a big matrix multiply plus some other stuff. So we implemented that too and got fairly good results. We had both a pthreads version on the CPU, a two-way Core 2 Duo, and a GPU version using Nvidia's graphics processor from 2006. So these results are kind of dated, but overall our classification results were identical to the serial version: when you took the classifier we trained with our own training method and applied it using our own classification method, it produced the same results.
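Going back to the SMO step just described, here is a minimal sketch of the two-variable analytic update: clip to the box, stay on the constraint line. Working-pair selection, error caching, and the bias update are omitted, so this shows the flavor of the subproblem, not Platt's full algorithm.

```python
import numpy as np

def smo_pair_update(alpha, y, K, errors, i, j, C):
    """alpha: weights; y: +/-1 labels; K: kernel matrix; errors[k] = f(x_k) - y_k."""
    if y[i] != y[j]:
        lo, hi = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        lo, hi = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    eta = K[i, i] + K[j, j] - 2 * K[i, j]             # curvature along the constraint line
    if eta <= 0 or lo == hi:
        return alpha                                   # degenerate pair; choose another
    new_j = np.clip(alpha[j] + y[j] * (errors[i] - errors[j]) / eta, lo, hi)
    new_i = alpha[i] + y[i] * y[j] * (alpha[j] - new_j)  # preserve sum_k alpha_k * y_k
    alpha = alpha.copy()
    alpha[i], alpha[j] = new_i, new_j
    return alpha
```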
I also want to talk briefly about another project we've got going on, kernel dimensionality reduction, which is another technique for performing classification. Here, what we're trying to do is take a high-dimensional dataset -- for example, images with features on them -- from something with really high dimension down to something with much lower dimension, say from 600 dimensions to 20. In doing so, the classification problem becomes a lot easier. I think this picture sums it up really well. This is from a 13-dimensional real-world dataset describing wine: it measures wine in 13 different ways, and then the classification is whether it's any good, or something like that. If you take that real-world dataset and use kernel dimensionality reduction, the three classes we're trying to distinguish boil down into three nearly disjoint regions in the two-dimensional space, whereas other dimensionality reduction techniques tend to leave the data on top of itself, which makes it harder to classify. Mark Murphy in my group at Berkeley is working on this. It's a nonconvex optimization problem, so training it involves simulated annealing and gradient descent, and to cut to the chase, he's implemented a portion of it, it's working pretty well, and he's seeing fairly good speedups. So again, parallelism works here. I think the challenge in all of this work is putting it together into a functional CBIR system that lets people search a database and actually returns what they want to see. We're working on that -- we haven't done all of it yet; we've been working on these components instead -- but we believe that using more sophisticated feature extraction techniques like the contour detection I showed you, as well as more sophisticated classification techniques, will lead to improved performance. And that will let people do things they couldn't do before. That's what we're trying to demonstrate in our work. So that's it. >>: A question here: when the user provides the query to the system, do you solve the open-set problem -- maybe the query simply isn't labeled in the set? What do you do? >>: Right. The way we're looking at that is that the user is going to provide some images that they're interested in seeing. >>: [inaudible]. >>: A couple of them. >>: Figuring out what they want -- koala. >>: Right, this is content-based. The user says, I want an image that looks like this one. >>: So rather than using recognition to label whatever. >>: Right. No, this confusion is natural, because they're related problems. If I were able to go through my database and find all the things that had a koala in them, and then I was given an image with a koala in it -- if I could figure that out, it would make searching a lot better. So it is related to this problem, and we're coming up with a sort of hybrid solution that takes advantage of previous queries -- things it has learned about the database during previous queries -- to make the queries coming from the user stronger. >>: So you have to find some similar kind of thing. >>: That's why we're working on exactly how to do that. But yes, there are CBIR systems that do things both ways, and we're trying to come up with something that makes sense. >>: Question: how fast does it have to be for it to be a product, in the sense that somebody would take it and market it? How do you know when it's fast enough? >>: Right. I think there are two major components of these systems. One is the feature extraction stuff, which is done offline, when the images enter the database, and that needs to be fast enough to keep up with the stream of images entering the database. The second part is the online part, when the user enters a query, and that needs to be user-tolerable. What does a user tolerate for a search? I think a couple of seconds. That needs to be really quick. >>: 300 milliseconds. >>: There you go, 300 milliseconds from Jim Lair.
>>: Let's thank our speaker. [applause] >>: Hello. Am I on? Thanks for fitting me in. Rick talked this morning about things we can do with large collections of images. We all know one very easy way to get a large collection of images is to record some video, so I'll talk about a bunch of work we've done in the area of video. My goal with all these explorations -- my dream -- is to take effects that Hollywood people make today and ensure that my number one customer, the only person I know who actually does anything with video, which is my mother, can play around with them. So this is my mom's laptop. She likes it very much. And I guess one of the funny things, while we've spent all this time saying parallel programming is just around the corner: does anyone know how many cores are in this thing? >>: Four. >>: Is it only four? >>: Two or four. >>: In the GPU. >>: So you have to be careful. I'm using the Intel definition of core, because we have Intel people in the audience; the Intel definition of core is slightly different. They would say 16 or 32. >>: I see. >>: Improve the core efficiency. >>: The main thing is that the programming model is parallel. So we've been saying it's in the future; well, it's already in the past in some sense, because people coding for this thing are coding for a machine that, in the programmer's mind, has many more cores than are in it. I know you didn't like that picture, so I've given you a picture that's much more the kind of thing you like the look of. So what are we going to do? Here's a classic piece of video of my office, and I've performed a traditional video edit on it: I've overlaid a logo, and I've used some extra special graphics to make it nicely 3-D. But it's not really 3-D; it's just a 2-D overlay on a 2-D video. What we might want is something a bit more like this video here, where the 3-D object I've embedded is living in the world of the video rather than superimposed on it. I can do this computation, but the key to doing it -- and we can see the logo beginning to appear again -- is to run Photosynth on the several hundred frames of video to generate 3-D cameras. The same 3-D camera positions you saw before are now a smooth trajectory, and you render the 3-D object with that. We've worked on ways to make it easy to make animations like this one. You thought the blink tag was bad when it came out in '92, you thought WordArt was bad -- well, this is the future. So again, for these sorts of edits we know how to make the user interface very simple, but it actually requires a massive amount of compute to preprocess the video so that these edits are easy to render. There are still some problems with that. For example, this video, I'm very proud to say, made Bill Gates laugh. At that point. At that point there was a sound from the man opposite me. So a great thing is happening here: one part of your brain, which does geometry and stuff, is really happy that this is probably on the whiteboard -- that's where you think it might be. Another part, your object recognition system, which is looking at the relative depth ordering of objects, suddenly gets nastily broken when the TechFest logo goes to the background. That's because, despite all my protestations about turning a 2-D overlay on another 2-D thing into 3-D, this rendering has absolutely no knowledge of the dense structure of the scene.
Having spent lots of time computing the camera positions, we're now in a position to spend more time computing a representation of the scene which looks a bit like this. Now, this is pretty rubbish -- this is fast, rubbish depth data, where bright points are far from the camera and dark points are near it, so it's encoding distance. We obtained this representation by looking at the images themselves, without using a depth camera. Because the scene is rigid, I can use the frame-to-frame transformations -- I can pretend a pair of successive frames is a stereo view -- and recover a rubbish depth map like this. But this is nevertheless enough to correctly occlude, to generate the clean occlusions I get in this sequence. We have the same thing again, but now it's less disturbing when the video passes the other way. You'll notice it's still not perfect -- a bit of boiling around here and so on -- but certainly for home video use, at least in the near term, this is not as disturbing an output as we had the first time. By the way, the reason I can use the rubbish depth samples is that I have access to an algorithm by Antonio [inaudible] and Toby Sharp at Cambridge which they call geodesic segmentation. But that algorithm is probably not interesting for this workshop, because it's extremely fast -- you can process a large CT volume and generate these segmentations completely interactively -- so we'll just skate over that. Okay. I said that one of the reasons I could recover depth from that sequence was that it was a rigid scene. You'll notice there weren't any researchers running backwards and forwards in front of the whiteboard; if we were studying tracking there would have been lots of researchers but no camera. I've always wanted to process scenes a bit more like this one, where you have an object like this giraffe which is undulating and articulating as it moves through the scene. We have a pair of giraffes, one occluding the other, and the camera's panning. I want to know how to deal with nonrigid motion and what the representations are for doing that. So one very simple thing you might want to do, if you've got a video, is attach an overlay but have the overlay look like it's following an object in the scene. And clearly the only application any of us can ever think of for anything is advertising, so let's assume that's what we want to do here. Okay. Now that has got to be simple, right? What am I going to do? I'll click on some point on the giraffe and follow it through time, moving with the object. That's got to be easy. And indeed it's not bad. Here's the interface, written in MATLAB, which for anyone who knows is extremely slow. The user has clicked on the giraffe's eye, and the cyan trajectory represents a search throughout the video -- 200 frames of video. It searched for points matching the giraffe's eye at 200 to 400 frames a second, and what the user is doing is scrubbing backwards and forwards through the video, checking that the trajectory is correct even though the back giraffe's eye was occluded during the transformation. So we're looking at the position of that point throughout the video, overlaid on one frame, and you can see that basically at this stage -- good check there -- the other giraffe's nose came in front. So the user is happy that he's correctly tracked the video, and in this case it was two clicks.
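A minimal sketch of one way the click-and-track search could be organized: precompute a descriptor per pixel for every frame, index each frame for nearest-neighbor lookup, and query with the descriptor of the clicked point. SciPy's `cKDTree` and the descriptor layout here are assumptions, not the speaker's actual data structure.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_frame_indexes(descriptor_frames):
    """descriptor_frames: one (H*W, D) array of per-pixel descriptors per frame."""
    return [cKDTree(d) for d in descriptor_frames]   # precomputed offline, stored on disk

def track_point(indexes, descriptor_frames, clicked_frame, clicked_pixel, width):
    """clicked_pixel: flattened (row-major) pixel index in the clicked frame."""
    query = descriptor_frames[clicked_frame][clicked_pixel]
    track = []
    for tree in indexes:
        _, idx = tree.query(query)                   # nearest descriptor in this frame
        track.append((idx // width, idx % width))    # back to (row, col)
    return track
```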
We have other examples: a 20-minute video that we processed in three minutes, including the user interaction. So this eight-times-real-time figure includes the user, starting from the point before the user touches the mouse. The key is that you need to store 24 bytes per pixel to store a KD tree for every frame, so to do this there's a massive amount of backing store: we precompute on the video, and then we can get at it fast. That's something we want to deal with. Okay. Back to this guy. Well, can you see what I've changed? Oh look, a new house has appeared. That's an easy edit; we could do that in the '90s. What's the representation that allows us to do that edit? Now, this is a tricky edit: I'm going to put something on the giraffe's neck, and it should follow the undulations correctly. We couldn't do that in the '90s; we can do it now, and we're happy with that. What's the representation that allows me to do that? Again, I load up the video and I crunch on it for several days, and after several days of crunching I end up with a representation of the video that looks a bit like this: I split it into two static photos, a foreground layer and a background layer. This giraffe, the one in the background, didn't move enough, so he just got locked into the background; when his neck lifted, that didn't get noticed. So I crunch on the video, I split it into these layers, and then I can do something like attach houses, with normal photo editing, to this layer and generate the image we just saw. Okay. So there's the background layer; I use whatever photo editing technique I like to put in the houses, and then we rerender and get the sequence we just saw. And finally, this allows us to deal with more complicated structures. This guy's face: if you watch his mustache you'll see it appears slowly and flashes on and off. Same sort of technique -- a 3-D object, a nonrigid 3-D object. What Photosynth would do -- the Photosynth philosophy, which I adhere to -- is take this object and reconstruct a three-dimensional interpretation of what's happening. Well, we decided we would reconstruct a two-dimensional interpretation of what was happening. Instead of reconstructing into 3-D we did it into 2-D, pasting all the new information from every frame of video into this representation, which is called a mosaic. Now you can draw on the representation itself -- so we're drawing the eyebrows et al. on this one picture -- and rerender those, mix them in with the original video, and generate this effect, which is reasonably convincing, and the mustache disappears correctly as it's occluded by the subject's face.
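A minimal sketch of re-rendering from the two-layer split just described: edit the background photo however you like, then alpha-composite the foreground layer back over it for each frame. The per-frame warps that place each layer are assumed to have been computed during the offline crunching.

```python
import numpy as np

def rerender_frame(edited_background, fg_rgb, fg_alpha):
    """edited_background, fg_rgb: H x W x 3 float arrays; fg_alpha: H x W in [0, 1]."""
    a = fg_alpha[..., None]
    return a * fg_rgb + (1 - a) * edited_background   # foreground over edited background
```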
Okay. So that's some stuff you can do with video. As we know, it's a splendid source of embarrassing parallelism -- if you can get the video to the cores. We want to do lots of computation on a big chunk of video. For many of these computations you'd like to split it up by frames, but sometimes each core would like to see all of the video at once. Given that you've got the video down at the cores, though, we can do lots of stuff. Most of the stuff I want to do turns a 20-meg video into half a terabyte of data, so actually a lot of it is disk bound rather than RAM bound or CPU bound. But things are getting close. So that happens, and then, as in Photosynth, you get some sort of N-squared gather: you do that on the frames and haul the information back. Not a lot of data comes back -- let's say 20 K of data per frame -- but then you do some horrible gather of size N cubed. And that's because of the non-embarrassingly-parallel sequential algorithms that we love in video. You might think: video, do stuff on every frame, throw away the old answers. Bad idea. If you've got 99 percent reliability frame to frame and your video's a thousand frames long, 0.99 to the power of a thousand is a small number -- about four in a hundred thousand. And then of course we've got the standard stuff: if I want to render this video with 100 different image processing operations per frame, I need to decide how to split that over whatever cores I have. That's a standard problem. Right. Thank you. [applause] >>: Time for a question or two. If there are any questions. Great. Thank you. >>: There's one over there. Sorry. >>: So some of this is done very slowly offline, before the user's interaction, and then something fast happens for the user using all that stuff sitting on disk. So what would be done, for example, is drawing the mustache, after the face had been processed offline. >>: That's the idea. But something we would love to do -- half the time when you run this it doesn't get the face exactly the way the user might have believed it should be. For one thing, if you're blinking, it's supposed to make space in this representation for both the open and the closed eye, so they're both sort of living there. We would love to have user interaction to fix the offline-computed stuff, and that's just completely inconceivable at the moment, because you would make an interaction -- even a researcher-level interaction -- tweak it, run it for another few days, tweak it again. So that would be something that would be great to fix. >>: So you didn't have to make any major changes to get it to work on the person's face. >>: We didn't use Photosynth; we actually used a different bundler for this. No, sorry -- to work on the face, it can't do it at the moment. Photosynth embeds tracks: imagine that if you follow the eye you generate a track like the cyan curve we saw on the giraffe. Photosynth takes hundreds of thousands of those tracks and embeds them into a 3-D space. We are taking 100,000 of those tracks and embedding them nonlinearly into a 2-D space, not close-to-linearly into a 3-D space. So it's quite a different problem; the mechanics of it are quite different. It's a multi-dimensional scaling problem, basically. >>: So I'm thinking of an app. There are Avatars in these sort of online interactive worlds -- would this be usable there? >>: To recover the texture map. >>: Get an Avatar to look like you, but changing, running in real time with your interaction. >>: Yeah, except that even the tracking bit -- once you have the texture map, with every new frame you would recover the texture mapping for that frame. Our implementation of that is currently like several minutes per frame. So yeah, if you could do that in real time you would have a lot of opportunities. But that's part of the preprocess, because our output is just the motion vectors. >>: Thank you. [applause] >>: Okay. Amazingly, it seems to like my system instead of giving me trouble connecting. My name is Jim Held, director of [inaudible], which means I'm cat herder in our central labs at Intel. We have a program of research that you might characterize as the future of multi-core: how far can we go, how fast, what will we have to do?
It's a large program: several dozen projects, a couple hundred people at the moment, covering everything from software to hardware, usually starting from the application level and working down from analysis of applications, looking at usage models as well as the low-level primitives and how best to support them in hardware. So it's a very comprehensive program. One of the major application areas is what we call immersive connected experiences, which I'll define in just a moment. First I'm going to introduce the concept as we describe it -- it's sort of a combination of social and visual computing, so it's a nice bridge topic between the two -- then the requirements we see in it, what our research agenda is, and what the various projects are. My goal is to introduce these, not to go into depth on the actual details of the research, but to vector you to the people who are doing that work. If you're interested, I'm sure they'll be delighted to correspond with you, and potentially there will be some collaboration with the folks at Microsoft Research or at the universities and UPCRC. Some of those folks are here in the audience, and I'll highlight that when they come up; others are in different parts of the world, and I'll mention that as well. Now, ICE really comes from a combination of trends coming together. Social networking has been growing tremendously -- a very important use of computing for us to be aware of. One aspect of social networking that really uses compute on the client is user generated content. That can be just the text someone puts in a blog or texting in a chat room, but people are getting more sophisticated, sometimes creating their own rather interesting videos and, in the virtual worlds, more complex 3-D content. The idea is that we're moving away from a world where the content displayed and used in applications comes only from professionals, as it does in a traditional PC game, for example. Broadband connectivity means we can draw on a combination of the client and a rich network of servers -- the so-called cloud, for example -- and one of the questions is how best to take advantage of that computation capability and deal with the complexity that having a distributed application implies. Mobile computing is growing. It won't be too long before desktops begin to disappear or turn entirely into a local server or specialized workstation, because so many people value the ability to take the client with them, whether in the form factor I'm using here, a hand-held device, or a combination working together, as the earlier CBIR presenter illustrated with his combination of iPhone and Macintosh computer. Visual computing is something we're very interested in because it's becoming a major fraction of the compute use on clients. PC clients are increasingly being called on to interact in a much richer way, and visual computing is part of that. When I talk about visual computing, I mean much more than graphics, though. Visual computing in our terminology is a combination of not just photorealistic rendering of images but also the modeling and simulation that makes your interaction with that representation realistic and responsive. It requires physics; it requires recognition as well as rendering. It requires a mix of computation -- not just a data-flow, flop-oriented graphics processor, but more general purpose compute as well -- and the combination of the different kinds of compute comes into play in the kind of applications we think of when we talk about visual computing.
So all of these very interactive application environments include video and 3-D and other visual representations, but mixed in with a representation of the world. Now, there's one other aspect to what we call a connected experience. Why do we say immersive connected experiences? Because many of those visual computing uses draw on the fact that people want to connect to other people. We're not just talking about the client being connected to a server to deliver compute or storage; we're also talking about gaming and collaboration and retailing and all the other ways the computer is used as a tool of communication and interaction. So, immersive as well as connected environments. In fact, we see this as the direction the web is going: from what was originally text-oriented, static, stereotyped interaction to more and more graphics and video and increasingly 3-D. Immersive connected experiences is our name for that combination and the direction the 3-D Internet is going. So there are applications that draw on compute and connectivity to enhance the actual world, to mix computed images and computed modeling information with video and data from the real world. I'm sure you're all aware of the applications that let you go to a location, see a map of it, overlay that map with a satellite picture, then draw on another application to get a 3-D representation you can interact with, or a street-level representation. So we really are taking the actual world and overlaying on it abstractions that add information to the raw image people are seeing. We're also creating completely artificial worlds, whether in multiplayer games and massively multiplayer games like World of Warcraft, which are becoming very popular, or in completely artificial environments created by the end users, where they interact in unstructured ways, or around a structure, but not in a prefabricated, prepared game. Those so-called virtual worlds, as opposed to environments created for a predetermined game, are becoming extremely popular -- many millions of users. In 2007, actually, more than a billion dollars was invested, and in 2008, in an absolutely tough economic environment, there was still almost $600 million invested, across over twice as many different virtual world companies as in 2007; the Club Penguin acquisition by Disney accounts for something like 400 million of that 2007 billion. So it's actually broadening and continuing. Most of the interaction is going on without really using the full potential of a computer to deliver a rich model and a rich visual environment. You see Habbo and Neopets and Poptropica, which involve interaction through a browser and more of a two-and-a-half-D animation environment. They still have the idea of visual representation, a high level of interactivity, a model of the world, and in fact commerce: almost all of them give the participants the ability to spend money to add to their environment, and they involve creating and having possessions in that environment. Some of them, like Second Life, allow that economic return to flow both ways, with Linden dollars moving in and out of the environment. So we're getting a large number of virtual worlds and a lot of participants, and it's taking hold among youth: certainly a little more than half of the participants are preteen or younger.
And so people are growing up with this as a way of interacting with each other through a virtual environment. The last bullet here on the right-hand side I think is particularly significant: Second Life moved up in the Nielsen rankings of what they call PC games. These are actually 180,000 measured households -- it's not a survey, it's not self-report; like Nielsen ratings for TV, they instrument systems and track what's used and how. And it accounted for the second largest fraction of total minutes played: people were spending on average 680 minutes per week engaged with the virtual world in Second Life, second only to World of Warcraft. So it's very important: a lot of eyeball time, a lot of participants having their own economy, grabbing the consumers of tomorrow -- in fact, a significant fraction of the money spent today. So of course it's interesting to Intel as a future usage model. We also have augmented reality, as opposed to the virtual world, particularly with mobile devices. Mobile devices are great for carrying around sensors: cameras, acceleration devices to sense orientation and motion -- all kinds of things can be packaged into that device, with a wireless connection to a lot of compute resources. What do you do with it? Often that interaction involves augmenting what's being captured on the device with information that's centrally stored or centrally available. We've got some examples there: being able to use the camera on the device and overlay instructions on how to disassemble the engine, for example, or a translation, or identifying some device. It's another area where we see this combination of visual computing with connectedness and the balance of compute increasingly creating new usage models that draw on multi-core computing, not just in the servers but on the hand-helds as well. So ICE is our name for what we think is a compelling set of new usage models -- uses of computers that are going to drive consumption of compute and give us an opportunity to add value by responding to them, by building the right kind of platforms, the right kind of processors. What does it need? Well, there are all kinds of challenges if we're going to respond to those uses. One, of course, is traditional: figuring out what the demands are on the server and the client. Memory bound, compute bound, storage; what kind of compute, what kind of operations are being done, what kind of network connectivity is required. All of that is our traditional role and something we do with all of the applications that we analyze, that we get in-house or build, in order to shape what we do. There's also, though, a lot we need to do in the distributed computing realm, which I'll talk more about in the virtual world space. We think one of the barriers to virtual worlds living up to their potential is really the architecture, the distributed system architecture, and we have folks working on that in our labs. We also need to look at how we deal with the fact that we're going to have hand-held devices as well as very powerful laptop devices and things in between. How do we deal with that diversity, and with the fact that there's always an installed base? You can't design for the latest; you can't design for one fixed model. How do you avoid a lowest common denominator approach to this distributed client computing model? Visual content -- we talked about that being something end users are creating.
As much as half or more of the cost of a game title is traditionally in the artwork -- in the content, so to speak. That just isn't there in these virtual world environments. When end users are creating their own video and want to create their own 3-D objects, they come up against a very difficult, clumsy environment. The ability to create a video is much more advanced than the ability to create a very rich, complex environment for your virtual world, much less design an Avatar that suits or represents you the way you want. Giving people the ability to create and then move their content where they want to go -- because it is their content, after all; it shouldn't just be a property of each individual place they go -- is another challenge. And finally, how do we deal with sensors and connectivity when we're talking about mobile devices? Now, we've done analysis on platform demands. [Inaudible] and the application research lab have done analysis of Second Life: the client directly, and the server and network connections indirectly from the binary operation. We find that virtual worlds like Second Life certainly place tremendous demands on the servers -- very compute intensive work, ten times as much demand on the servers as something like World of Warcraft, because of the way you interact with a much richer, ad hoc environment. Clients, too, really draw on the CPU and the GPU; visual computing is a key part of the client's job, in the virtual world space at least. Again, very compute intensive, very high utilization, very important to the experience. And finally, the network connectivity highlights another challenge. You'll see there are two lines there. In the virtual world space you can't predistribute content on a DVD and draw on it: it's being created and shared with people as you come across them in the world. You move into a different land or part of the world and meet people and places and things you've never met before -- how can they be predistributed? That places tremendous demands on the network, so the ability to cache and compress effectively is important. >>: These numbers, are they for like an okay experience? Or is this, we want a great experience and this is how much more compute you need? >>: Was it a great experience or was it an average experience? I think it was an average experience. I don't think this is what we would want. >>: They want great experiences. >>: I should say there's a lot of potential to utilize even more compute to improve it. The challenge with this aspect of ICE, in the virtual world, is dealing with the complexity of the environment. The number of users, the number of objects, the object behavior, the realism of the simulation all drastically increase the complexity. It's far from linear in the number of objects; it's far from linear in the complexity, the realism of the modeling you do. And right now, because of our inability to really supply all that's desired, the current implementations are severely limited -- simplifying, for example, the shapes of an Avatar when determining collisions -- and there are latency and delay effects. It's easy to move too fast in the wrong circumstances and fly right through a wall, or to leave behind certain objects that you want. So the challenge is that delays in processing introduce artifacts in display and synchronization and slow down the interaction.
It would be infeasible to put on a virtual concert for thousands of people in a virtual world environment today; it just could not sustain the interaction among that many people. >>: So some people have done work in sharing bandwidth. If you have a whole bunch of people standing together with cell phones, they could take the union of their bandwidth by communicating with one another through Bluetooth or something like that, and they've done demos where they can actually multiply the bandwidth. Would that be a useful feature here? >>: The ability to create a richer activity, yes. And the architecture of the distributed system, making it more scalable with techniques like that, is part of the research going on in the labs. This is part of what we have to figure out how to deliver. The distributed system involves not just a given virtual world or environment like the Second Life world, but interaction with others and the ability to share content back and forth. It has global resources to manage as well as regional things to take care of -- assets, simulations in local islands, for example -- and it has connections to powerful clients. Let me see if I can make this run. Compute-capable as well as limited but very sensor-rich devices. How do we design for this kind of environment? That's what we're interested in; that's where our research is going. Each of these pipes is a potential bottleneck. Each of these resource managers and services is a potential bottleneck unless you have the ability to replicate and scale. The ability to dynamically adjust to the client being used, and to have multiple types of clients participate, is really critical, as is the ability to dynamically share the workload between the server resources and the client. Those are the requirements we want to meet. Okay. So what we see happening, in order to really scale, is that we need to move from monolithic creations to a much more modular, horizontal, building-block approach to designing these environments -- essentially the same kind of thing the Internet went through. I don't know how many of you used Prodigy or CompuServe; on one of the first beta programs I was on with Microsoft, I reported everything on CompuServe. Those evolved into today's web, an open platform that allows a much richer set of things to grow -- unanticipated things -- and be shared. That's what we think needs to happen in order to fully realize the capabilities of the environment. In the individual content space, we need to move from professionally created content to end users being able to create their own content easily, satisfy their own creativity, and own what they create and move it around, because it becomes an important resource that they value -- it's their identity in many cases. And then we have to deal with the problem of delivery of content: we can't predistribute, so we have to be able to deliver just in time and cache effectively. So we really think there's plenty of research necessary to make this kind of experience high quality and scalable to rich environments -- lots of objects, lots of people, very faithful modeling of the environment. The challenges in each area have generated research that's going on now. You've seen some of the initial results, and we're glad to share papers on workload characterization. We're doing work to understand the platform demands in detail.
And we'll carry those forward, as we do with all of our workload analysis, to optimizing our server and client platforms to support them. In the distributed computation space, we're really working with the industry on proposing modifications to the application architectures. I'll talk a little bit more about that later, and about research on dynamic repartitioning of the workload between the client, the different kinds of clients, and the central compute capabilities. There's a big effort I'm going to highlight in the slides today because the research is actually in our Beijing lab, and Yimin Zhang wasn't able to be here today; I wanted to show some of that because he couldn't speak for it. And I'll also point out some of the stuff going on with the distributed computing that Mic Bowman is leading. So let's start with that content creation. One of the ways in which we can deal with both the distribution and the creation of content is to parameterize it. You don't want people sculpting or drawing these 3-D models. You don't want them to assemble them from crude components. You want them to be able to customize the appearance, for example, of their avatar very easily, just by moving sliders to match what they want. You want scripting to be able to interact with a parameterized expression model. So you don't send images, as this video is showing, to describe someone who is moving their features as well as changing their expression. You want to send parameters to a model that is predistributed. That gives you much more flexibility in design. It gives you the ability to send much higher semantic information, and therefore much more compressed information, over your network. Now, they're drawing on databases for their research: one from the Beijing University of Technology and another from Binghamton University on facial expressions. I'll show you a little more of how in a minute. One thing we'd like to be able to do is create an avatar that represents us better: to take a picture, for example, identify the features within it that correspond to an abstract model, map that model back to the features on that picture, take the result and map that to our model, and therefore be able to manipulate the model. So what we've done is we've taken a flat 2-D picture of someone, identified the features, mapped those features to a model from a database of models, and then we can apply the 2-D picture as a texture to that model and manipulate it in three dimensions. Two dimensions to three dimensions; but moreover, it's the person. It's me. Also, I would want to be able to parameterize a variety of things. The expression on the lower left, for example, going from neutral to happy, or configuring the gender characteristics of my avatar or my characters. I'd want to be able to parameterize the dynamic interaction and map it to a particular instance of a person. So capture this kind of change of expression, apply it to an image, any kind of image where I can identify the features, and then have the resulting expression shown in that face. Facial features, even ethnicity, all can be done by use of that 3-D database, treating it as a model, allowing what we used to call morphing techniques to make it a continuously variable transition. This to me was science fiction when I was growing up, right?
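A minimal sketch of the kind of parameterized, morphing-style expression control just described, assuming a predistributed neutral mesh plus per-expression offset arrays ("morph targets") and simple linear blending; the names, the toy data and the blending scheme are illustrative assumptions, not the Beijing lab's actual implementation.

```python
import numpy as np

def blend_expression(base_vertices, morph_targets, weights):
    """Blend a neutral face mesh toward one or more expression targets.

    base_vertices : (N, 3) array, the predistributed neutral mesh.
    morph_targets : dict name -> (N, 3) array of per-vertex offsets
                    (target mesh minus neutral mesh).
    weights       : dict name -> float in [0, 1], the "slider" values
                    that would be sent over the network instead of video.
    """
    v = base_vertices.copy()
    for name, w in weights.items():
        v += w * morph_targets[name]   # continuously variable transition
    return v

# Hypothetical usage: send only {"happy": 0.7} to the remote client,
# which already holds the neutral mesh and the "happy" offsets.
neutral = np.zeros((4, 3))                      # toy 4-vertex mesh
targets = {"happy": np.full((4, 3), 0.1)}       # toy per-vertex offsets
frame = blend_expression(neutral, targets, {"happy": 0.7})
```

The point of the design is that a handful of slider values carries much higher semantic information, and far fewer bits, than streaming video of the changing face.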
You answer the phone with an avatar that looks perfect, even though you've just gotten up, or maybe you represent yourself as an avatar of Brad Pitt or Arnold Schwarzenegger or whoever. You can do that with the technology that's being developed in the content tools research in our Beijing lab. You do want to be able to create 3-D models, not just map textures and do the expressions of an avatar, and that's a tedious process that's, again, very expensive to do. And what they've done is develop a tool that takes pictures around an object, provides view planning to suggest what would be the best camera positions to capture the object, and does the image-based modeling. In other words, it tracks and does correspondence analysis across the images, turns that into a 3-D representation and extracts the texture, creating, therefore, a 3-D model that you can use and animate in your virtual world. So we go from something that is done by experts, expensive and tedious, maybe improved with a 3-D scanner, to something the ordinary person can do with the appropriate use of their digital camera. We have work at the Intel Research Seattle lab on what we call everyday sensing and perception. And our goal is to be able to infer context from sensors in your hand-held device that is accurate 90 percent of the time for 90 percent of your day. Ninety percent is actually very hard to do, and it's something you're not going to be able to do without visual information, without video capture and vision and recognition. Moreover, if you're talking about doing it with sensors that are in a hand-held, one of the fundamental problems is the ability to do preprocessing on the device to reduce the bandwidth required and the latency, pass it to the server where the heavy-duty compute goes on, and then get the result back. And so they're very much dealing with the ability to create a synopsis in order to reason on it with much more compute power in a central compute resource. We're engaging the virtual world community through OpenSim. It allows us to work with a wide range of industry partners and a diverse community: Microsoft, IBM, Intel, a wide variety of participants on this simulation environment for virtual worlds. We have been working with partners to try to put together a road map for virtual worlds in order to try to improve the success rate, or the likelihood of success, of all these virtual worlds that are springing up; to understand what would make them more viable, how people could be more productive with their investments, by taking all of our state-of-the-art industry-wide thinking and making it available. We are creating -- oop, I moved too quickly. Here we go. How am I doing for time, Jim? >>: About five minutes. >>: Great. Thank you. We wanted a test bed that was computationally interesting and challenging, that would give us content and experience with real-world customers. So we partnered with SIGARCH and IEEE in order to create ScienceSim, a simulation environment for representing results and an interactive environment for Supercomputing 2009. And so the idea is to have the opportunity for interactive presentation of results in a simulated environment be part of something the high performance computing community could take advantage of. Both a chance to show off as well as a future education tool. And actually the result of our interaction has been concrete.
I think it's been like a 10x improvement in scaling so far from architectural changes and suggestions that we've proposed to the OpenSim community. We've got an IETF MMOX OGP proposal in to create the interface, or actually to use industry-standard interfaces, to make the virtual world simulation environment more scalable. And I'm sure Mic would be glad to fill you in on what that exactly entails. We're actively engaged in this open environment to facilitate what we think is something people are interested in, could be able to use, would exploit the value of the computes we can deliver, but needs some architectural leadership from industry players like Intel and Microsoft. This ICE research actually ties together with our other main initiatives. I mentioned how we have a tera-scale computing initiative. And the performance to support that and the platform characteristics that allow that kind of compute to be delivered, the balanced platform with the right acceleration, memory, bandwidth and compute capabilities, all of those are part of the tera-scale computing initiative. We have another large effort we call Carry Small, Live Large, which is the idea of getting tremendous benefit from our hand-held computers working in concert with clients, peers and the server world; that involves the use of sensors and wireless connectivity futures, and the augmented reality research is part of it. Visual computing is really a big thrust for us. We've been talking about work underway to develop the Larrabee architecture and have talked at SIGGRAPH and elsewhere about the characteristics of Larrabee. We're making a big investment in that space for future products. That needs to be fed by our research -- we've had a long-standing relationship with Saarland, so we put a large amount of money into what will be our largest lab research effort in Europe. It will be part of our Intel lab environment, or collection, I'm missing the word, in Europe. Jim Hurley is part of our CTG visual architecture research lab and our microprocessor group in CTG. And there are colleagues at Saarland who we've, again, been working with for some time, and others in Germany who are going to be participating in a really rich research agenda that, again, is much more than just raw graphics or 3-D. The concept of visual information, recognition, capture, and simulation and output, and what it makes possible in terms of interactions, the human-computer interaction that's possible given these modalities. So again, another big investment in what are very much the underlying technologies for these immersive computing environments. So we think this is an exciting area of applications. We've sort of come way up from our previous talks about the enabling algorithms and technologies to a much broader view of the ecosystem trends and what they're giving rise to in usage models and applications. We think it's one that's going to draw on all the research we've described so far, to deliver something that end users are already interested in, already beginning to pay for, and that will make use of our processor and platform technology. Any questions? >>: Thank you. [applause] >>: Good afternoon. My name is Phil Chou. I manage the communication and collaboration systems group here at MSR, and we do a bunch of work in telepresence, and I'm also involved in a cross-group initiative in telepresence here. Telepresence is a word that was coined by Bill Buxton in the early nineties as part of the Ontario Telepresence Project. He meant it to be rather broad in interpretation.
Technology to support a sense of social proximity despite geographic and/or temporal difference. Today telepresence in the industry takes a narrower view, that of a video conferencing experience that creates the illusion that remote participants are in the same room with you. It's more of a tele-immersion definition, I think. But the best way to understand what that means in the industry today is to take a look at this 30-second Cisco commercial. [commercial] >>: So... >>: Welcome to a network where body language is business language. Welcome to the human network. >>: We've just seen a high-definition, full-sized, life-sized video conferencing experience that's so compelling that one participant forgot that his counterpart was actually remote. This Cisco telepresence system, along with others like it, costs several hundreds of thousands of dollars per room. And, of course, you need at least two rooms to make it useful. And that's not to mention the additional tens of thousands of dollars per month of operating costs. So it finds a rather narrow market. But there is a market, popular among executives in particular. Microsoft itself is purchasing a bunch of these things. But the general feeling we have at Microsoft is that these are somewhat akin to the mainframes of the '70s, and so there's a lot of opportunity sort of downstream from that. So one of the takes we have on telepresence in research is that it could be closer to ubiquitous computing, which is a notion that Mark Weiser in the computer systems lab at Xerox PARC promulgated in '88 through '95. And the idea is that instead of computation being delivered through boxes on your desk or from the back room, computation would actually be woven into the fabric of the environment. So they built a bunch of devices at PARC, tabs, pads and Liveboards, which are today's PDAs and tablets and interactive whiteboards. An example today might be Craig Mundie, our CTO's, conference room over here, in which every surface of the room is some interactive display, an interactive surface. All the walls, the table and everything. So there the room is the computer, and that might be what ubiquitous computing is today. We're interested more in the communication aspects of ubiquitous computing. This was called ubiquitous media by Buxton in the early '90s. And it's really driven by sensing and rendering devices, input and output devices for both audio and video, meaning microphones, cameras, speakers, displays. So the ubiquity of those things, particularly in cell phones. All of us have all four of these modalities sitting in our pockets. And there are also more infrastructure-like things: laptops, desktop monitors, conference room devices, IP cameras. We have these things everywhere; like, I can see six different cameras in this room. So luxury rooms are highly equipped. Highways: we have hundreds of cameras publicly accessible just looking at our highways. Cities like New York and London have thousands of cameras just connected to the police network. So these devices are really totally ubiquitous. If you look at a regular office, here's a picture of my office. There are many devices connected to the Internet: my laptop, my phone, my desktop and my camera phone, which is taking this picture. And each of those devices has four or five or more high-bandwidth sensing or rendering devices. So there are 25 broadband sensing or rendering devices in the picture. Over here, everybody's bringing laptops and so forth.
I'm estimating there are 55 to 70 microphones, loudspeakers, cameras or displays in that conference room. So they're really ubiquitous, and the question we ask is how can we use these devices to assist with better communication between people. So there are several opportunities. One could be, for example, better audio capture. You get microphones that are closer to the people who are actually speaking. Similarly, close-up views of participants: laptops' native cameras are actually looking at your face. We can use this ad hoc array of sensing devices to locate people in space in the room and help people remotely understand what is going on in the room. For the people in this room, or for people in the remote rooms who are listening in on this conversation, spatialized audio lets these guys over here get a better sense of what's happening spatially in this room. Over here, there's flexible use of display real estate because there are lots of displays all over. So there are a lot of opportunities, and I'll go through some of these. This is just showing where all the 25 devices are. Here's an example of using an ad hoc array of microphones, in this case distributed on the laptops of people in the meeting room, to do sound source separation. So we're putting four people on their four laptops into separate sound channels. If they're not separated, then on, let's say, laptop No. 2, its microphone hears kind of a mixture. >>: [demo]. >>: They're talking over each other. And we want to put each of them into a single channel. So, let's say, channel 2 basically cuts out the first guy and then just brings in the second guy. >>: [demo]. >>: And we can build location information into the source separation. So we use either time-of-flight-based techniques or energy-based techniques. Here's an overview picture of people in the room gathered around microphones glued to the desk, and this shows that we can locate things using only the sound that's uttered by the people who are talking. So with no pinging sounds or anything in the room, we can jointly locate the active speaker, here at this red dot, and all of the microphone locations, which are accurate to the ground truth down to a few centimeters. And if we use energy-based techniques it's maybe a little less accurate, but we don't need the tight synchronization between devices. So what you can do is, you get the sources separated and you get the sources located in space. Then you can help the people in the other remote rooms understand what's happening spatially. So instead of the traditional mono signal in a room, which sounds like this. >>: [demo]. >>: So that's the traditional mono thing. And only the people in the middle can probably hear the stereo. But if the sources are spatialized, it's easier to tell what's happening in the room, how many people there are. >>: [demo]. >>: So it's taking advantage of the cocktail party effect that people have. Of course, if you do the spatialization tricks then you have to deal with the acoustic echo cancellation problems that happen in the room. If I'm in my office listening to spatialized audio, and I'm talking into my microphone, that's being interfered with by stuff coming from now two loudspeakers instead of just one. And there are some inherent ambiguities that go on with two loudspeakers. We need to cancel out what's coming out of these loudspeakers so the guys back in the conference room don't hear what they've just said.
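A rough sketch of the kind of echo canceller this implies, assuming a generic normalized-LMS adaptive filter with one filter per loudspeaker reference; this is a textbook structure, not the actual MSR canceller, and the function and parameter names are illustrative. The well-known catch with stereo playback is that the two reference signals are correlated, which makes the two room paths ambiguous; controlling the spatialization, as described next, is one way to break that ambiguity.

```python
import numpy as np

def stereo_aec_nlms(mic, ref_left, ref_right, taps=256, mu=0.1, eps=1e-6):
    """Generic stereo acoustic echo canceller sketch (NLMS).

    mic       : float array picked up by the near-end microphone
                (echo from two loudspeakers plus near-end talker).
    ref_left,
    ref_right : float arrays we sent to the two loudspeakers.
    Returns the echo-reduced signal. One adaptive FIR filter per
    loudspeaker models the room path from that speaker to the mic.
    """
    w = np.zeros((2, taps))                 # filter weights per channel
    out = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = np.stack([ref_left[n - taps:n][::-1],
                      ref_right[n - taps:n][::-1]])   # recent reference blocks
        echo_est = np.sum(w * x)            # predicted echo at the mic
        e = mic[n] - echo_est               # residual: near-end speech + noise
        out[n] = e
        w += mu * e * x / (np.sum(x * x) + eps)        # NLMS weight update
    return out
```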
But fortunately, because we're able to control the spatialization, we're able to disambiguate things here and cancel out what comes in through the microphone. So here's a concocted example of what the microphone may hear. What's playing out here is spatialized audio, and you're hearing a mono signal. >>: [demo]. >>: That's actually a mixture of the spatialized stuff. >>: [demo]. >>: That was just the near-end talker talking. And now you can hear that most of that stuff goes away except when she talks. >>: [demo]. >>: So that's all for ad hoc microphone arrays, and the same thing can be done with camera arrays. We have cameras on all the laptops sitting in front of you. One possible thing to do for the remote person looking at the screen is, at every moment, to select the speaker who is talking, the active speaker, and select the best camera at any possible moment. That's sort of a traditional way of doing conferencing. You could also show all the cameras at once, maybe a Hollywood Squares type of thing. You could actually see all the faces, how people are reacting to what other people are saying. But we're inspired by some of the Photosynth work to try to get a better spatial sense of what's going on in the room. So when people are looking in one direction, the remote person can understand that the person is looking at whoever is talking, or whatever. So we have this meeting viewer, which essentially has this 3-D context view and an active speaker view over here. And I wish I could demo it on this laptop, but I'm just going to have to show you an old version on video. And over here the user is trying to manipulate the three-dimensional scene to try to figure out, well, to get the best view of who's talking. And so these people are kind of arranged in the three-dimensional space. We're also looking at mobile things, trying to augment phones with something as simple as stereo ear buds or goggles. I won't be talking about these today, though. We're also looking at what we call embodied social proxies, essentially George in a box: carry him into a conference room and set him up and you can talk to him. And there are things that are related, other kinds of stand devices, some robotic-type things, some that sit on your desk. But what I want to go into in a little bit more detail is not those things but more what we're doing along the lines of immersion. So how we're evolving immersion from things where you're just sitting at your desk to things that take wider and wider fields of view and involve you in an immersive experience. We have the idea that you might, as an individual, just sit down at a desk somewhere, have a very wide field of view, and then you're talking with your remote counterparts, who are all distributed on different continents, and you're sitting around a common virtual table. And as more people are added to this table, as more people join the meeting, the table may grow a little bit bigger. People slide to the side to make room for the new person. The audio is coming from the person who is talking. You see the people from different angles, depending on where they are sitting around the table. We're in a common environment. The environment may have tools that support the meeting that's going on and so forth. So there are a number of immersive cues that we're taking a look at, both in audio and video. Peripheral awareness is an important thing, we believe.
That's why we're looking at large fields of view and surround sound. Visual consistency is important. The consistency between my environment and the remote environment. And if you have multiple people, the consistency between their environments. And of course sort of spatial cues. Andrew showed a nice video where the occlusion cue was kind of messed up. Let me talk about the last two on this list, what we're doing for motion and stereo. So motion parallax we sometimes call monocular 3-D, because even if you only have one eye you can get a sense of depth through your motion. So what you see on the left is a video of how we can change what's displayed on the screen as a function of the position of the viewer in order to get a sense of three dimensions. >>: I can't help but notice the viewer is moving a little slower. Is that because there are some frame rate issues there? >>: Not so much from the generation, but the tracking in this case was done by a magnetometer. And there are some issues with that. After this was made we started tracking using face detection and head pose estimation, and that's actually what you see on the right side. So here Chau is looking at this screen, and this red rectangle is what's detecting his face and its size right now and driving his screen. And we're also now doing a little more sophisticated stuff with head pose estimation. We're also looking at stereo 3-D, typically using, for display systems, existing technologies such as autostereoscopic displays or displays where you need some kind of goggles. Even just putting images on these things actually requires a fair amount of -- there's a lot of data. So for each pixel that you're trying -- you scan through this thing, putting pixels down on the display, and each one comes from, say, a different camera. So upstairs we have a five-view autostereoscopic display, and we have five cameras feeding this display. For each pixel that we output to the display, we have to pick it from one of the cameras, and at a particular position in that camera's image, depending on how the cameras are calibrated. So we have to do some bilinear interpolation, for example, on each of those cameras, and we just couldn't fit it into our four-core machine. But we squeezed it into the graphics processor. So there are some issues even doing elementary things here. That's on the display side. Of course, you need to capture multi-view imagery. So image-based rendering is an important element of that, and so we do spend some time trying to do camera array stuff and view interpolation. I don't want to go all the way to the end. In order to put people into a common background, you have to actually remove their existing background. So foreground/background segmentation and replacement is important. With enough computational power you can do a pretty good job at it, but that runs at many times real time. So we've been looking at things that are much faster, such as using infrared depth cams to help speed that up. And they do a remarkably good job in terms of real-time performance, easily getting 30 frames a second here. But they're not so good at hair. So we have to kind of find the right balance between some of these technologies. Another alternative to the image-based rendering things would be to go more in the graphics route.
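A minimal sketch of the fast, depth-assisted foreground/background replacement just mentioned, assuming a depth map that is already registered (pixel-aligned) with the color image; the threshold value, the flat replacement background and the function names are illustrative assumptions, not the actual system, which blends depth with image-based matting to handle cases like hair.

```python
import numpy as np

def segment_foreground(color, depth, max_depth_m=1.5):
    """Toy depth-assisted foreground/background segmentation.

    color       : (H, W, 3) uint8 image from the RGB camera.
    depth       : (H, W) float32 depth map in meters, assumed registered
                  with the color image.
    max_depth_m : anything valid and nearer than this is kept as foreground.
    Returns the color image with the background replaced by a flat color.
    """
    mask = (depth > 0) & (depth < max_depth_m)   # valid, near pixels = person
    out = np.full_like(color, 32)                # hypothetical new background
    out[mask] = color[mask]                      # keep foreground pixels
    return out
```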
And of course we've been looking at using avatars, and they're necessary for representing remote people who don't have audio and video, who don't have video cameras trained on them, like if you're on a mobile phone. And then finally, we're trying to put all this together into an experience, building these things. And actually, just for video processing -- there's no audio processing listed here -- there are a lot of different steps that we have to go through on the sending side: camera calibration, camera capture, distortion correction, viewpoint interpolation, foreground and background separation, and the coding and networking stuff. And sort of on the receiving side, we do some face tracking to figure out where the person is looking, trying to determine where his viewpoint is, and then, after receiving stuff from the network, decoding it, doing some final viewpoint interpolation and then the scene synthesis. And this is pretty much, pretty close to linear in the number of other participants in the meeting, because each other participant has a unique view that you're trying to send to or receive from. So I wanted to mention that besides getting an immersive experience, part of what's important in communication is actually various kinds of scene analysis, some of which we've seen already, like where people are spatially. But there are others. So, for example, Chau here is using this fisheye camera shown here to direct where the pan/tilt/zoom camera should be pointing. So we have to figure out what the salient features are, what the important gestures to capture are. So one reason scene analysis is important, and this is a bridge to the next talk, to Lilly's talk, is social interaction analysis. So you want to know where people are looking. To whom are they looking? So that maybe the avatar that you're looking at could be looking back. So you can measure things like who is influencing whom, who is consistent in their views, who is agreeing with whom and disagreeing with whom, and you could use these within the context of a single meeting to give feedback: you're talking too much; this guy isn't saying enough; is the meeting really constructive? Let's say it's a brainstorming meeting: is somebody dominating, or maybe that's the way it's supposed to be because it's a presentation. And then over time, from many, many meetings whose social interactions are being analyzed, you can infer what the groups are, what the social interaction network is, what it looks like, from real meetings conducted through these telepresence systems. So I will just end by going back to the ubiquitous communication idea. This is just a shot of the heterogeneity that we live in and will continue to live in, and just a statement that there's no single telepresence or communication device. There will be plenty of things ranging from mobile phones all the way up to the most sophisticated conference rooms. And the implication of this for architecture, I guess, is, well, there's a lot of data flowing between the endpoints. So data is a huge thing. And also the heterogeneity means that it's not really clear where the computation is going to go. Mobile phones won't have the same capabilities as these high-end rooms. The cloud isn't even pictured here. So you can imagine services in the cloud. Some of the computation could be used to save on the communication bandwidth. There are all these interrelated issues, and I'll just use that to close.
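A structural sketch of the per-participant video pipeline just enumerated, showing why the cost grows roughly linearly with the number of remote participants; the stage names follow the list above, while the function bodies are placeholders and the overall shape is an assumption rather than the actual system's code.

```python
# One send path and one receive path per remote participant, so the work
# per frame is roughly O(number of participants). Stage bodies are stubs.

SEND_STAGES = ["capture", "distortion_correction", "viewpoint_interpolation",
               "foreground_background_separation", "encode_and_send"]
RECEIVE_STAGES = ["receive_and_decode", "final_viewpoint_interpolation",
                  "scene_synthesis"]

def run_stage(name, data, target, pose):
    # Placeholder: a real system would do actual signal processing here.
    return data

def process_frame(local_frame, remote_streams, viewer_pose):
    """One frame of work for one endpoint; remote_streams maps each remote
    participant to their incoming stream, so both loops run once per person."""
    for participant in remote_streams:
        view = local_frame
        for stage in SEND_STAGES:        # unique outgoing view per participant
            view = run_stage(stage, view, target=participant, pose=viewer_pose)
    rendered = []
    for participant, stream in remote_streams.items():
        view = stream
        for stage in RECEIVE_STAGES:     # unique incoming view per participant
            view = run_stage(stage, view, target="local", pose=viewer_pose)
        rendered.append(view)
    return rendered
```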
Thanks. [applause] >>: Any questions? >>: So when are we going to get this phone? [laughter]. >>: I'm tired of getting on planes. >>: One of the next things we're trying to look at, and we're talking with Dennis, actually, is how to deal with large-scale meetings such as this one, or conferences that we go to, and whether there are any ways to help do a poster remotely, for example. >>: Do speech recognition technologies fit into this at all? >>: Yes, and machine translation technology. You would certainly want to be able to bridge cultural and language barriers like that. >>: Like making minutes automatically? >>: Yes, for summarization of meetings, absolutely. >>: What is the most sophisticated thing, you feel, today that you've actually used to hold a meeting? >>: Well, in actual use, we don't have a picture here, but we did -- here's one. Our group did this RoundTable device, they call it. So it's like a Polycom speakerphone, but it has a stalk on top with five cameras and does image stitching. It's 10-year-old research technology, but it came out just a few years ago as a product. Oh, we actually use it pretty consistently. And it's in most of the rooms in this building. >>: Sophisticated or not? >>: Is it sophisticated? 10 years ago maybe, yes. Thanks. [applause]