>> Eyal Ofek: Hello. Good morning. It's my pleasure to invite Raffy Hamid to Microsoft to give a talk. Raffy is a researcher at eBay Research Labs. He finished his PhD at Georgia Tech and was an associate researcher at Disney Research in Pittsburgh. Raffy. >> Raffy Hamid: Thank you very much, and thanks for having me here and for coming to my talk. A lot of my work is about automatic video analysis; that is, given a video, how do we figure out what is going on in it? In today's talk I'll speak about some of the work I've done in this context over the last few years. Let me start by sharing my opinions about why I care about videos and why videos matter. Video is a unique medium that can capture both the sight and sound of an environment in synchrony, which makes it really effective not only for capturing but also for viewing everyday human experiences. It is easy for us today to take video for granted, but at the time it was invented, a little more than 100 years ago, it was nothing short of a miracle. To bring that message home, and also for fun, I wanted to share with you this excerpt from a 1950s movie called The Magic Box, where the character of the inventor of the video camera is shown sharing his great joy at his remarkable achievement. So let's take a look. [video begins] >>: It's a bit like a magic lantern, but instead of one picture at a time, you see eight or more pictures every second and that would [indiscernible] eight pictures every second and they are all merged together into one moving, living picture. See? Of course, there is a bit more to it than that. I'm not saying it's perfect. After all, it works [indiscernible] it works [indiscernible] doesn't it. You can see that. [video ends] >> Raffy Hamid: Starting with these humble beginnings, video has really become an integral part of our day-to-day life. Today we have millions of users uploading and viewing videos that they select on YouTube. We have videos making inroads into social media thanks to [indiscernible] and Instagram. More and more people are realizing that they can capture information about the products they want to sell online using video that they cannot convey using just text and/or images, and being at eBay, this is something that I'm going to talk about in a bit more detail. The entire model of news reporting is undergoing a complete transformation: these days any individual with a digital video camera who is at the right place at the right time can capture a breaking event and submit it to one of the news reporting agencies. Also, in the field of online education, videos are playing a great role; places like Khan Academy and Udacity are making thorough use of videos to disseminate knowledge to the masses. And last but not least, thanks to online and on-demand video, the entire model of entertainment has undergone several transformations over the last few years. With all these really exciting and interesting things happening with and around digital video, it really feels as if we're living in an era that is at least partly defined by the medium of digital video. While we've made some commendable progress in terms of capturing, storing and transmitting videos, I still feel that the so-called elusive Holy Grail remains automatic video analysis, and if we can make some major progress in that direction, then I feel the opportunities can truly be endless.
For example, we could have systems that could be used for automatic healthcare management of senile individuals. With better surveillance systems we could have improved safety and security. Systems that can understand complicated tasks such as medical surgeries could provide us with useful information and thereby help us improve our performance in these complicated tasks. And, last but not least, we could have automatic sports visualization and analysis systems that could help us engage better with different types of entertainment and therefore enjoy our lives in a better way. Again, this is something that I'm going to talk about in more detail in just a bit. So to situate my talk better, let me start by showing you an active environment, a soccer field, where different players are performing different sports-related activities. One of the main challenges for automatic video analysis systems is the wide gap that exists between the low-level perceptual data, the pixel values, and the more interesting high-level semantic stuff, for example whether a player is kicking a ball or not. A natural way to bridge this gap is to have a set of intermediate characterizations that can channel this low-level information in an appropriate way so that it can be used at the high inference level. One such set of intermediate characterizations starts with the assumption that an active environment consists of a set of key objects that define the types of activities that can go on in that environment. In the case of a soccer field, for example, some of these key objects might be the players on the field, the goalposts, the ball and so on. Different interactions among these key objects over a finite duration of time constitute what I call actions; here, for example, I'm showing the action of kicking. Finally, when we put these actions in a specific temporal order they constitute what I call activities. Here, for example, I'm showing you an activity which consists of a player throwing the ball to another player, an opposing-team player intercepting the pass, and the first player diving at the ball in order to save the goal being attempted by the opposing team. In this space of activity dynamics there is a wide range of very difficult and open research challenges, and in this talk I want to speak about some of the work that I've done at each one of these intermediate characterizations. In particular, for the characterization of key objects I'll talk about the problem of localizing and tracking these key objects. For the characterization of actions I'll talk about the problem of spatiotemporal analysis of videos to figure out what the more important or interesting parts of these videos are. And finally, for the characterization of activities, I'll talk about the problem of large-scale activity analysis, particularly using unsupervised or minimally supervised learning methods. So let me start with the characterization of key objects, where the motivating application I want to use for this part of my talk is sports visualization and analysis. I don't know how many of you remember this match between the USA and Algeria in World Cup 2010, but one of the goals scored by team USA was called an offside foul, and this actually resulted in the U.S. team not winning the match.
Wouldn't it be awesome if we could have visualization systems that could figure out where each one of the players involved in the game is, figure out who the key players involved in a particular foul are, and then provide us with some visualization that could help us understand the dynamics of the sport better? This is a project that I was involved in at Disney Research Pittsburgh. It was done for ESPN, where the objective was to track these several players, figure out where the second-to-last defending player is, because that is one of the crucial players for determining whether an offside foul has happened or not, and then draw a virtual offside line, so to speak, underneath that player. Some of the technical challenges that make this problem interesting are tracking under visual occlusions, noisy measurements, and dealing with varying illumination conditions. I'll start by giving you an overview of the computational framework that we used to solve this problem and then I will say a few more words about each one of these steps. Along the way I'll try to identify some of the technical contributions that we made in this project. Starting with the set of input videos, the first step is to extract the foreground, and at the end of this step we have pixels that belong not only to the players on the field but also to the shadows of these players. Since later on we are going to track these players in the image planes of the cameras, these shadows can become distracting, and therefore it's important to remove them. That's exactly what we do in the shadow removal step. Once we have obtained the blobs of these players we can actually track them and also classify whether they belong to the offense team or the defense team. When we have this sort of spatial information available to us from multiple cameras, we can fuse this information together, search for where the second-to-last defending player is, and finally draw the virtual offside line underneath that player. So now I'll say a few more words about each one of these steps. Starting with foreground extraction, we begin by building adaptive color mixture models for each one of the pixels in the scene, and these models are used to classify each pixel as to whether it belongs to the foreground or the background; this is done by thresholding the appearance likelihood. As you saw previously, after this step of foreground extraction we can get pixels not only from the players' bodies but also from the shadows, and since these shadows can be very strong, relying purely on appearance-based methods to remove them generally doesn't work very well. So for our case we relied upon some of the 3-D constraints of our camera system, and the basic intuition we are using here is that shadows cast on planar surfaces are actually view invariant. What that means is that if you warp a subset of these views onto a particular view, then the pixels that belong to shadows are the only ones that correspond to each other at the same point. By using this very basic invariance and some simple thresholding we were able to remove these shadows relatively accurately.
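The talk does not name a specific implementation, but as a minimal sketch of the foreground-extraction and shadow-removal steps just described, the following assumes OpenCV, synchronized frames from two cameras, and a known ground-plane homography H_ba mapping camera B's image onto camera A's; all names and parameters here are illustrative, not the actual system.

    import cv2
    import numpy as np

    # Per-pixel adaptive mixture models; OpenCV's MOG2 stands in here for the
    # adaptive color mixture models described in the talk.
    bg_model_a = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=False)
    bg_model_b = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=False)

    def foreground_mask(bg_model, frame):
        """Label each pixel foreground/background by thresholding its
        appearance likelihood under the per-pixel mixture model."""
        return (bg_model.apply(frame) > 0).astype(np.uint8) * 255

    def remove_ground_shadows(fg_a, fg_b, H_ba):
        """Shadows lie on the (roughly planar) ground, so they are view
        invariant under the ground-plane homography: warping camera B's
        foreground mask into camera A's view makes shadow pixels coincide,
        while pixels on the players' bodies (above the plane) do not.
        Pixels where the two masks agree are treated as shadow."""
        h, w = fg_a.shape
        fg_b_warped = cv2.warpPerspective(fg_b, H_ba, (w, h))
        shadow = cv2.bitwise_and(fg_a, fg_b_warped)   # view-consistent pixels
        return cv2.subtract(fg_a, shadow)             # keep off-plane (player) pixels

    # Hypothetical per-frame usage, given frame_a, frame_b and H_ba:
    # fg_a = foreground_mask(bg_model_a, frame_a)
    # fg_b = foreground_mask(bg_model_b, frame_b)
    # players_a = remove_ground_shadows(fg_a, fg_b, H_ba)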
With the blobs of the players in hand, we can track them using a particle-filter-based tracking mechanism. The basic intuition is that, given the position of a particular player in a particular frame, we use a motion model to hypothesize where this player is going to move in the next frame. Then, given the noisy measurement in the next frame, we figure out how good or bad each one of our hypotheses was, and in a maximum-likelihood sense we can figure out the most likely estimate of where the player is in the next frame; this process is repeated over the entire length of the input video. One of the things that went to our benefit for this project was that by classifying these players across different teams… >>: At what granularity were you doing that prediction, at the frame level? >> Raffy Hamid: Exactly, at the frame level, and I'll talk a little bit more about the data that we captured; it was either 30 FPS or 60 FPS. >>: It's strange that you have to rely on a prediction as opposed to just finding what happens in the next frame. >> Raffy Hamid: Right. So we could do detection at each frame, but then the problem of how you decide which player corresponds to which player in the next frame is a nontrivial problem, so generally in tracking you use temporal information, because there's a lot of velocity or first-order information that you can use to simplify the process. So one of the things that went to our advantage was that the appearance of the players in this problem is by design quite discriminative, because you want the players to wear uniforms from different teams so that you can discriminate the two teams from each other. So for the problem of classifying these blobs we were able to use simple hue-saturation-based cues in a nearest-neighbor sense to achieve quite high classification results for this two-class classification problem. >>: Were you not tracking the goalkeeper? >> Raffy Hamid: We are tracking the goalkeeper, and the goalkeeper had a different template. Yes, that's a good question. So while it was easy for us to classify these players across teams, the flip side was that within a team it's very difficult to tell which measurement belongs to which player, especially when the players come close to each other, and this is really where some of the technical contributions come in. The way that we tried to solve this problem was to throw redundancy at it: the idea was that if you track these players in multiple cameras and some of the trackers in some of the cameras fail, you can still rely on the other cameras to get to an accurate inference. To illustrate this point further, here I'm showing you a top-down illustrative figure where several players are shown in one half of the field, being captured by three partially overlapping cameras. In an ideal world, if we were only interested in player p4 for now, we would get three different measurements, one in the image plane of each of these three cameras, and we could warp these image-plane measurements onto a common ground plane, where they would all correspond to the exact same point and we would know exactly where this player is. However, real life is not so nice to us, and due to several reasons we have different types of noise that keep these three measurements from corresponding to the exact same point. This becomes a really challenging problem when we have multiple players that come very close to each other, and so fundamentally this becomes a correspondence problem: if we could find out which measurements correspond to a particular player, we would be able to fuse them together and come to an accurate inference.
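As a toy version of the per-player tracker outlined above, here is a minimal particle filter with a constant-velocity motion model and a Gaussian measurement likelihood; the specific models, noise levels, and class names are my own assumptions for illustration, not the ones used in the project.

    import numpy as np

    class PlayerParticleFilter:
        """Toy particle filter for one player's 2-D image position.
        State per particle: [x, y, vx, vy]."""

        def __init__(self, init_xy, n_particles=200, vel_std=2.0, meas_std=5.0):
            self.n = n_particles
            self.vel_std = vel_std      # process noise on velocity (pixels/frame)
            self.meas_std = meas_std    # measurement noise (pixels)
            self.particles = np.zeros((self.n, 4))
            self.particles[:, :2] = init_xy

        def predict(self):
            """Motion model: hypothesize where the player moves in the next frame."""
            self.particles[:, :2] += self.particles[:, 2:]                  # x += v
            self.particles[:, 2:] += np.random.randn(self.n, 2) * self.vel_std

        def update(self, measurement_xy):
            """Weight each hypothesis by how well it explains the noisy blob
            position measured in the next frame, then resample."""
            d2 = np.sum((self.particles[:, :2] - measurement_xy) ** 2, axis=1)
            weights = np.exp(-0.5 * d2 / self.meas_std ** 2) + 1e-12
            weights /= weights.sum()
            idx = np.random.choice(self.n, self.n, p=weights)               # resample
            self.particles = self.particles[idx]

        def estimate(self):
            """Point estimate of the player's position in the current frame."""
            return self.particles[:, :2].mean(axis=0)

    # Hypothetical usage, one predict/update per video frame:
    # pf = PlayerParticleFilter(init_xy=np.array([320.0, 240.0]))
    # pf.predict(); pf.update(np.array([324.0, 238.0])); print(pf.estimate())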
There are several solutions to this correspondence problem, and the one that we chose was a graph-theoretic approach using a complete K-partite graph, where the number of partites is equal to the number of cameras we are using; here I'm showing you a graph with three partites because I'm showing you three cameras. The nodes in each partite correspond to the observations of where the players are in that camera's 2-D image plane. For illustrative purposes I'm showing the color of these nodes to be the same color as the players seen from that particular camera. This is an undirected, edge-weighted graph, and the weights on these edges are defined by the pairwise distance between the appearances of any two players. This leads to the question of what correspondence in this graph means, and again you can have several answers; for example, you could have maximal cliques define what correspondence means in this graph. However, for our problem we found that using the notion of maximal-length cycles was sufficient to get a robust enough result. The idea is that if we could figure out that these three nodes in the graph form a maximal-length cycle, then we would know that these three measurements belong to a particular player and be able to fuse them, and similarly, if we could find all of the cycles in this graph, we would be able to find out where each one of these players is quite accurately. It turns out that finding these maximal-length cycles in a K-partite graph is an NP-hard problem for any graph with three or more partites, so we can't really rely on exhaustive search methods, because we needed to do this in pseudo-real-time. So we had to rely on greedy algorithms to do this search, and that presents us with a trade-off between optimality and efficiency. It turns out there's a whole class of algorithms you can use for greedy searching, and one of the contributions of this work was to analyze which of these greedy methods work when, and how well, for our problem. >>: Can you say a few words about where the inconsistency comes from? Is it variation in the ground? >> Raffy Hamid: That's a wonderful question, so I would be more than happy to. Camera calibration is definitely one of the reasons, because while we are making the assumption that this is a planar surface, no surface is perfectly planar, so there are errors in our homography measurements. Also, because of the lens aberrations on the cameras, there are different types of noise that can be added, so that assumption is one of the big reasons. The other big reason is the shadow removal step. Just to -- maybe I should go like this. Just to give you an understanding of that problem, of why this really happens, I'll show you a graphic if I can find it. There we go. Here I'm showing you a scene where I have marked the shadows of these players in green, and here I'm showing you a corresponding picture where the shadows have been removed. For several of these players the foot pixels are still there, but for a lot of them the foot pixels are gone, so essentially, from the system's perspective, they are flying in the air, and when you project them they don't correspond to -- so this is one of the reasons, but yeah, that's a very good question.
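For the three-camera case, a tiny sketch of the greedy flavour of this cycle search: every candidate cycle is simply a triplet with one ground-plane measurement per camera, scored by the sum of its pairwise edge weights, and the cheapest cycles are committed first. This only illustrates the idea; it is not one of the specific greedy algorithms analyzed in the project, and the cost function below uses position only, whereas the talk also mentions appearance.

    import numpy as np
    from itertools import product

    def edge_weight(a, b):
        """Pairwise distance between two measurements; the real system would
        also fold in appearance cues, not just ground-plane distance."""
        return float(np.linalg.norm(a - b))

    def greedy_cycle_correspondence(cam1, cam2, cam3):
        """cam1, cam2, cam3: arrays of shape (n_i, 2) holding per-camera player
        measurements already warped onto the common ground plane.  Returns a
        list of (i, j, k) index triplets, one fused player each, chosen greedily
        in order of increasing cycle cost (the optimal set is NP-hard for K >= 3)."""
        candidates = []
        for i, j, k in product(range(len(cam1)), range(len(cam2)), range(len(cam3))):
            cost = (edge_weight(cam1[i], cam2[j]) +
                    edge_weight(cam2[j], cam3[k]) +
                    edge_weight(cam3[k], cam1[i]))
            candidates.append((cost, i, j, k))
        candidates.sort()                                  # cheapest cycles first

        used1, used2, used3, cycles = set(), set(), set(), []
        for cost, i, j, k in candidates:
            if i in used1 or j in used2 or k in used3:
                continue                                   # each measurement is used once
            cycles.append((i, j, k))
            used1.add(i); used2.add(j); used3.add(k)
        return cycles

    def fuse(cam1, cam2, cam3, cycles):
        """Fuse corresponding measurements into one ground-plane estimate per player."""
        return [np.mean([cam1[i], cam2[j], cam3[k]], axis=0) for i, j, k in cycles]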
Just to give you a visual of what this fusion mechanism looks like when applied to a real video, this shows how we fuse the different measurements as different players move around in the environment. And this leads us to the point of evaluating this framework. Again, as I mentioned, this work was done for ESPN, and one of the things that they were very keen about was to make sure that whatever statements we make about the accuracy and performance of our system are really backed by solid numbers. To do that we actually went through quite a bit of work. I'll present two sets of experiments for the evaluation. One of them was done at the Pixar Studios in California, where they have a soccer field; we actually went there and hired a local soccer team. We asked them to play different types of games for us, with different types of formations and drills. We also asked them to wear jerseys of different colors, and these colors corresponded to the team colors of international teams that take part in the World Cup. We also captured this data for several days, from 9 AM to 6 PM, to make sure that we were able to incorporate all the different types of illumination conditions that can occur. Not only did we test this framework in one setting, but to make sure that our framework is robust across different types of stadiums, we also went down to Orlando, Florida, to the ESPN Wide World of Sports complex, where they have this very large facility where kids from all over the U.S. come to play soccer, among other games. Here, as you can see, one of the variables that we looked at was whether our system can perform under floodlights. Also, you might have noticed that the height of these camera towers is greater; in this case it was 60 feet, whereas in the previous data set it was 40 feet. Another difference was that here we captured data at 720p, whereas in the previous set we captured it at 1080p, so we wanted to see how the resolution of the image affects the performance of our framework. In the interest of time I'm not going to belabor too many numerical details of our work; suffice it to say that our overall accuracy gain was close to ten percent over some of the baseline methods that we looked into. If you are interested in finding out more about the details, please come talk to me and I'll be happy to talk more about it. In summary, I just spoke about posing the fusion problem as a graph-theoretic optimization problem in which I treated correspondence as cycles in K-partite graphs. I also spoke about greedy algorithms that can be used to search for these cycles efficiently, and an important point here is that this is a relatively general approach and can be applied to other types of sports, for example field hockey, basketball, volleyball and so on. On that note, I wanted to share with you some related work that I've been involved in that uses this framework. I'm a big squash junkie and I like to watch a lot of these games on YouTube, so the idea here is: given dozens of videos of matches of a particular player, can we have systems that can understand the general trends of that player, and the different types of techniques that they use against their opponents? For example, one of my favorite players is Rommy [phonetic], whom I'm showing here in the red shirt. The question is, if I were to play against Rommy and I dropped him in the top right corner of the court, is he going to drop me back or is he going to lob me?
These types of questions, I know, are very important from a coaching perspective, so this type of framework can be used to get us closer to automatically figuring out these trends. At the risk of jumping the gun a little bit, I realize that I am not yet at the action detection part of my talk, but since we are talking about sports, I want to say that not only can you look at these sports videos at the macro level, but you can also start looking at them at a less macro level, where each individual action performed by each individual player can start to be classified and recognized, not only when it is captured from one camera, but also from several cameras together, because that's how these activities are usually captured in World Cup or other high-level situations. Finally, so far everything I've talked about is with static cameras, but more and more people are using handheld mobile devices to capture videos, so I wanted to say a few words about this thread that I am currently involved in at eBay Research. This problem, in fact, relates to the image quality problem. eBay has a lot of users who are uploading millions of images online, but they are not very professional photographers, so a lot of the time they capture images that have a lot of background clutter. While we can use techniques like graph cut to try to extract these foreground objects from the image, the form factor of a mobile device is such that it does not necessarily encourage a lot of interaction with the device. So here the objective was to explore whether we can use the medium of video to extract the foreground objects in a less interactive way. The idea is very simple, actually. Suppose you want to sell a toy on eBay and you want to capture it in a certain environment. First you capture a background video using a handheld camera, and then you put the object in that environment and take another video with the handheld camera. Now the question is, given that these two videos are captured with very different camera paths, can we actually align them efficiently, so that we can subtract the background video from the foreground video and come up with an output video which only has the foreground object in it? This is still work in progress. To test the system we applied it to different types of scenarios: foreground objects with different types of geometric complexity, multiple foreground objects, different types of illumination conditions, and also articulated movement of the foreground object. Our initial results tell us that this is an interesting way to go about solving this particular problem. >>: Back to the soccer project: when you talk about the compute environment needed to handle the three streams of HD, were you limited by that at all? >> Raffy Hamid: That's an excellent question. To answer it, let me -- the compute requirement of the project was that we wanted to show it in pseudo-real-time, which means that we wanted to show it in terms of instant replay, so we would get about 3 to 4 seconds to process close to ten seconds of video and we would need to generate results within that. This project was done at the demo level, not at a production level; the code was written by me and it was semi-optimized, not super optimized. Just to give you some understanding of the division of things in terms of time:
I was able to bring it to almost 0.4 seconds per frame, and this is for all three frames together, so you could divide it by three. This is [indiscernible] time, and here you can see that most of the time is taken by the background subtraction part, which is very, very parallelizable, and I was not using any GPUs to get these numbers. I think this number can actually be brought down significantly, but at the point when I finished this project, this was the time we were taking to process the video. >>: And single core? >> Raffy Hamid: I was using multicore, but I was not using multicore in a very efficient way. I was using [indiscernible] things, but I am quite certain that we can optimize it to quite a different level. >>: And what if you had 10 or 50 times the capability, would that make the job easier or make the result any better? >> Raffy Hamid: I think it would help us in two ways. One is it would allow us to try algorithms that are slightly more complex and therefore help us get more accuracy. That's one thing. The other thing is that one of the philosophies of solving this problem was to [indiscernible] and enunciate the problem, so I'm quite confident. I know for a fact that if you have a smaller number of cameras, your accuracy degrades quite steeply, so one thing that having a lot more compute power would give us would be the ability to deal with a far larger number of cameras. Imagine a ring of 50 cameras; now you can get over the problem of visual occlusion much better, and so that is one of the things that would help us. >>: So can you say something about segmenting the [indiscernible] >> Raffy Hamid: Sure, I'd be happy to. The way we did segmentation was really just connected component detection; that's how we initialized our trackers. And then at each step we were performing background subtraction first; the segmentation is really just a connected-components step. >>: First you have to match between the videos, right? [indiscernible] handheld videos… >> Raffy Hamid: Oh, you're talking about this step? I see, I see. Yes, I would be happy to. I don't know why these videos are not playing right now, but… >>: That's okay. >> Raffy Hamid: The steps are relatively simple. We start by finding the SURF features in each frame of the background video and also the SURF features in each frame of the foreground video. Then, for each frame in the foreground video, we find the matching features, using RANSAC to figure out which are the best matches, and once we have found these matches we have a subset of background frames that are good matches for that particular foreground frame. And we use a perspective transform to warp those background frames that have been matched well, and so… >>: And so you assume [indiscernible] >> Raffy Hamid: Yes, exactly. This is very similar to the plane-plus-parallax work that has happened before, and the idea here is that you don't have to build a whole 3-D model for the alignment. >>: The first example that you showed, the blue doll there, obviously didn't have a planar background. >> Raffy Hamid: Yes, that is absolutely true. The scene doesn't have to be planar, but the idea here is -- and I wish this would work. >>: That's fine.
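A minimal sketch of that alignment-and-subtraction step, assuming OpenCV; the talk mentions SURF features and RANSAC, but ORB is substituted here only because it ships with stock OpenCV, and all thresholds are illustrative rather than the project's actual settings.

    import cv2
    import numpy as np

    orb = cv2.ORB_create(nfeatures=2000)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)

    def align_and_subtract(fg_frame, bg_frame, diff_thresh=35):
        """Register one background-video frame onto one foreground-video frame
        with a planar (homography) model, then difference them to expose the
        foreground object."""
        kp_f, des_f = orb.detectAndCompute(fg_frame, None)
        kp_b, des_b = orb.detectAndCompute(bg_frame, None)
        if des_f is None or des_b is None:
            return None

        # Ratio-test matching between the two frames.
        pairs = matcher.knnMatch(des_f, des_b, k=2)
        good = [m for m, n in (p for p in pairs if len(p) == 2)
                if m.distance < 0.75 * n.distance]
        if len(good) < 10:
            return None            # too few matches: fall back to something else

        pts_f = np.float32([kp_f[m.queryIdx].pt for m in good])
        pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])

        # Robust planar alignment: homography estimated with RANSAC.
        H, _ = cv2.findHomography(pts_b, pts_f, cv2.RANSAC, 3.0)
        if H is None:
            return None
        h, w = fg_frame.shape[:2]
        bg_warped = cv2.warpPerspective(bg_frame, H, (w, h))

        # Subtract the aligned background and threshold the difference.
        diff = cv2.cvtColor(cv2.absdiff(fg_frame, bg_warped), cv2.COLOR_BGR2GRAY)
        _, mask = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)
        return mask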
>> Raffy Hamid: But the idea here is that if you first make this planar assumption, then once you have a registration you do a non-rigid transformation using optical flow on the previously [indiscernible] frame, and that takes care of some of the depth in the scene that you are not capturing with the planar assumption. And not just that, you're getting multiple hypotheses, so if you look at… >>: [indiscernible] this will fail if, for example, you have something near a corner and you have two planes that go from near to far? >> Raffy Hamid: Two planes that go from near to far? >>: I'm trying to think of a scenario where it's hard to approximate. >> Raffy Hamid: Yes. There are two situations where we might fail much more often than not. One is if you have a situation where there's so much depth in the scene that the planar assumption we are making completely screws everything up; that's one situation where we know we fail. The second situation is when, for a particular frame in the foreground, you are not able to come up with a lot of good matches in the background video, and we know for a fact that in those situations the algorithm would obviously not be able to work. For those types of situations we fall back on tracking, so we adaptively rely on tracking the pixels that we detected in the previous frame. So I just spent a little bit of time talking about the characterization of key objects; this is some of the work that I've done in this regard. Now I will speak about the characterization of actions, really looking at this problem from the perspective of figuring out which parts of a video are important or interesting, and the practical application I'll use for this part of my talk is video summarization. This problem is becoming increasingly important for eBay because sellers are realizing more and more that they want to use the medium of video to capture information about the products they want to sell online. If you go to eBay and search for YouTube.com, you will be able to find tens of thousands of listings, and what's happening is that these sellers are capturing videos of the products they want to sell, uploading them to YouTube, copying the link, and embedding it in the descriptions of their listings. Just to give you a sense, here is a video that was captured by a seller trying to sell his blue Honda Civic on eBay. You can see that the general quality of the video is relatively poor, and these videos can run all the way to ten minutes, so with no understanding of where each piece of information is, watching a poorly made ten-minute video can be quite cumbersome. The idea behind this project is really to figure out what the important parts of this video are, so you can allow the potential buyer to navigate the video in a more nonlinear manner. A lot of the previous work in video summarization relies upon the content of the video itself, and since the quality of these videos is quite poor, those traditional schemes don't work very well; we know this for sure because we implemented those algorithms.
Thankfully, at eBay we also have millions of images that these sellers are uploading of the products they want to sell online. Here, for example, I'm showing you some of the still images that the same seller uploaded of the same car, and you can see that these still images are actually of much higher quality compared to a randomly sampled frame from the video. The question really becomes: how can we use these images as a prior to help us figure out what the more representative parts of these user-generated videos are? Just to give you a very brief overview of our approach -- by the way, I should mention here that this project was done with one of my interns from last year, [indiscernible], who is doing his internship this year at division group, so if he is watching, hi to him. To give you some overview of the framework that we used: we started with our corpus of unlabeled images and used [indiscernible] clustering to figure out what the canonical viewpoints present in our corpus are. Not only do we have these individual images, but we also have the frames of the videos that these users have uploaded, and incorporating these frames can actually improve the quality of the discovered clusters. So, bootstrapping from the clusters that we found using just the images, we incorporated these frames in a [indiscernible]-type algorithm to directly improve the quality of our discovered clusters. With these clusters discovered, given a test video we can ascribe its frames to the discovered clusters and then come up with the final output summary. A related problem that we explored over the course of this project was how to do large-scale annotation and evaluation of different summarization algorithms. A lot of the previous work in this context has relied heavily upon expert knowledge to figure out whether a summarization algorithm is performing well or not. Clearly this way of solving the problem does not scale very well, so part of the contribution of this work was to figure out how we can obtain multiple summaries from crowdsourcing, such as Amazon Mechanical Turk (AMT), and use each one of these summaries as an instance of ground truth to figure out whether a particular summarization algorithm is performing well or not, and also to compare different summarization algorithms with each other. As with any comparison mechanism, we need some notion of distance between the summaries that we are trying to compare. There are several notions of distance that you can use here, and after exploring quite a few of them we were convinced that using SIFT Flow as the distance is actually a good idea. To give you a visual sense of what the retrieval results look like when using SIFT Flow, here I am showing you in the first column some of the query frames, and on the right some of the retrieved matches, sorted in descending order; as you can see, SIFT Flow can be quite robust to the types of variations in illumination and viewpoint that we actually observe in our data. Yes? >>: So the right column is a match to the left column? What's the relation between the speed of the [indiscernible] and the car? >> Raffy Hamid: Right. First of all, each one of these rows is independent of the other rows, and these are the closest matches and this is the farthest match. You can see that here the picture looks very, very similar, here it is sort of similar, and there it is not similar.
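In spirit, the clustering-based summarization framework described a moment ago reduces to something like the sketch below: cluster descriptors of the image corpus into canonical viewpoints, ascribe each video frame to a cluster, and keep the closest frame per viewpoint. The clustering method, the descriptor, and the image-plus-frame bootstrapping step of the actual project are not reproduced here; KMeans and all names are stand-ins.

    import numpy as np
    from sklearn.cluster import KMeans

    def summarize_video(corpus_features, frame_features, n_clusters=8):
        """corpus_features: (N, d) descriptors of the seller-image corpus.
        frame_features:  (M, d) descriptors of the test video's frames.
        Returns the indices of one representative frame per discovered
        canonical viewpoint that actually appears in the video."""
        # 1. Discover canonical viewpoints from the unlabeled image corpus.
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        km.fit(corpus_features)

        # 2. Ascribe every video frame to its nearest discovered viewpoint.
        labels = km.predict(frame_features)
        dists = np.linalg.norm(frame_features - km.cluster_centers_[labels], axis=1)

        # 3. For each viewpoint present in the video, keep the closest frame.
        summary = []
        for c in range(n_clusters):
            members = np.where(labels == c)[0]
            if len(members):
                summary.append(int(members[np.argmin(dists[members])]))
        return summary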
With this notion of distance in place, our problem really boils down to a question of correspondence: we need to find the correspondence between the summarization frames that an algorithm gives us and the frames in an AMT summary. For that we can use a bipartite graph model, where the weights in the bipartite graph are computed using SIFT Flow as the distance. Now we can use the classic notions of precision and recall to figure out how well or poorly an algorithm is performing, and also to compare different algorithms with each other. >>: So what was the measurement for the Amazon Mechanical Turk [indiscernible] what people do? >> Raffy Hamid: Right. Basically, we showed them a video and asked them to come up with a set of frames from that video that they feel are the most representative of it. For each one of the videos that we used -- I'm going to talk about the particular tests in a bit -- we asked several AMT turkers, filtered a whole bunch of them out, and kept only the top ten turkers, and we used each one of these as an instance of ground truth to figure out whether a particular algorithm is performing well or not. >>: Okay. Because I would argue, and you have an awesome example there with the car: where I want to sell the car, I send a set of nice photos that will do a good job of selling the car, and then actually for me as a buyer, what was interesting in the video was not the main features shown, but maybe the scratches on the back, which were not in the images. >> Raffy Hamid: Right. That is an excellent question and a very fair point. The idea here is that we would only be able to do that at the moment in our algorithm if we had that type of information available in our image corpus. So if some people had shown scratches and taken pictures of their scratches as individual images, only then would we be able to do that; right now we are relying very heavily on the content of the image corpus. You are absolutely right, and one of the things we are looking at going forward is how to add information that is not present in the image corpus, so that's a completely valid point. To say a few words about the data set that we used to explore this problem: we focused on the cars-and-trucks vertical on eBay, because this is the most popular category in which people use videos to describe their products. We used half a million vehicle images as positive examples and the PASCAL 2007 data set as negative examples. We downloaded 180 videos from YouTube of eBay sellers and used 25 of these for training and 155 for testing. Again, in the interest of time, I won't be able to go over too many numerical details of the results, but suffice it to say that our performance gains were greater than ten percent over the several benchmarks that we looked at, and we did both quantitative and qualitative analyses. If you are interested in knowing more about the details of the numbers, please talk to me after the talk and I'll be happy to chat with you.
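A compact sketch of the evaluation step described above: a minimum-cost bipartite matching (the Hungarian algorithm via SciPy stands in for whatever matcher was actually used) pairs an algorithm's summary with one turker's summary, and pairs whose distance falls below a threshold count as hits for precision and recall. The thresholding detail and the distance callable are my assumptions; the talk only specifies SIFT Flow as the distance and precision/recall as the metrics.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def precision_recall(algo_frames, turk_frames, distance, match_thresh):
        """algo_frames: frames picked by a summarization algorithm.
        turk_frames:  frames picked by one AMT worker (one ground-truth instance).
        distance(a, b): a frame-to-frame distance, e.g. SIFT Flow energy."""
        cost = np.array([[distance(a, t) for t in turk_frames] for a in algo_frames])
        rows, cols = linear_sum_assignment(cost)       # min-cost bipartite matching
        hits = int(sum(cost[r, c] < match_thresh for r, c in zip(rows, cols)))
        precision = hits / len(algo_frames)
        recall = hits / len(turk_frames)
        return precision, recall

    # These per-summary numbers would then be averaged over the retained
    # turkers and over the test videos to compare algorithms.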
>>: [indiscernible] numbers just as the task. The task is: take a video; approximately ten percent of your video corresponds to your original pictures associated with that car? >> Raffy Hamid: Right. So the way we are doing the evaluation is in terms of average precision, and the way it works is: you have a particular video and a particular summary from a particular turker, and you compare it to the result given by a particular summarization algorithm; that gives you one instance of the average precision, and then you average it over the entire set of turkers that you have. That's one value, and then you average it over the entire set of 155 testing videos that we used. Then we compared it against different types of summarization algorithms like uniform, random, KMeans, spectral and so on. In summary, I spoke about using web images as a prior to perform large-scale summarization of user-generated videos. I also spoke about the use of crowdsourcing for large-scale evaluation of these summarization algorithms. And again, we'd like to believe that this is a relatively general approach, so one of the things that we are currently trying to look into is whether we can use these types of approaches to summarize a wider variety of user-generated videos online, such as birthday parties and weddings. Any video, or any class of videos, for which you have an image corpus that captures the geometry and general appearance of the scene can in principle be handled with our method. So I just spoke about the characterization of actions, particularly about figuring out which parts of a video are interesting or important. Now I'd like to spend a few minutes talking about the characterization of activities, specifically focusing on the problem of large-scale activity analysis, and the practical problem that I want to use to motivate this part of my talk is automatic video surveillance. Just to give you the big picture of this part of the talk: this was some of the work that I did for my PhD, and in the context of activity analysis, at that time and also today, one of the main assumptions that people make is that the structure of the activity that the system is supposed to detect is known a priori. So, for example, imagine you are making a dish in the kitchen and you want to build a system that can recognize whether you are making that dish, or whether you are making that dish correctly. We have to make the assumption that the system knows the structure in which that dish is made, so if you are making an omelette, we would need to know that first you are supposed to open the fridge, take some eggs out, heat up the pan, and so on and so forth. Now, this assumption works perfectly well if the number of categories that you are looking into is relatively small, but it doesn't really scale well when you have several categories -- in fact, when you don't even know the number of categories a priori. That is the focus of this particular work: we want to be able to figure out the main types of behaviors in an environment in an unsupervised or minimally supervised manner. This was done over the course of five or six years, and in retrospect, if I were to concisely describe the main finding of the work, I would say that it is that we should look at activities as finite sequences of discrete actions.
As soon as you start looking at activities like that, you realize that it is very similar to how researchers in natural language processing have looked at documents, because they look at documents as finite sequences of discrete words. Once you make that connection, you're able to carry over a lot of the representations and algorithms that these folks have been developing over several years. I was the one to identify this and therefore build the bridge between these two research communities, and the main idea here is to be able to learn about these activities without knowing their structure, only by using the statistics of their local structure. On that note, I'll say a few words about the general framework that we used. Starting with a corpus of K activities, we use some representation to extract sequential features from these activity sequences. We can define some notion of distance based on these event subsequences, with which we can find the different classes of activity that exist in that environment. Once we have discovered these classes, we can perform the task of classifying a new activity and also figuring out whether something anomalous or irregular is happening in that activity instance. While I explored several different sequence representations for my PhD, here I will only briefly mention the representation of event n-grams, where an n-gram is a contiguous subsequence of events. The idea is that you find these contiguous subsequences and then, based on the counts of these subsequences, you figure out what's going on. To show you some of the experiments that we did using event n-grams, to figure out how good or bad this representation is, we actually captured some delivery data at the loading dock of a bookstore right next to Georgia Tech. This is a Barnes & Noble bookstore, and please don't ask me how we managed to convince the manager of the bookstore to let us capture data over long periods of time, but in any case we placed two overlapping cameras to capture some of the vantage points of the loading dock area. We captured activities daily from 9 AM to 5 PM, five days a week, over a month, and we were able to capture 195 activity sequences. Just for fun I should mention this was done at the end of the year, and so by the end of this project I was very sick of listening to Christmas songs and couldn't bear to hear one more. >>: [indiscernible] >> Raffy Hamid: Well, for me it certainly did. So using these 195 activities we randomly selected 150 as training activities. >>: What's an activity? >> Raffy Hamid: That's a great question. Here an activity is defined as the span between the time when a delivery vehicle enters the loading dock and the time when it leaves the loading dock. For these types of situations, and also kitchen situations, figuring out the start and end of an activity is relatively easy, but there are several situations where finding this start and end, or segmenting an activity out of a stream of events, is nontrivial. I've done some work on that which I'll be happy to talk about later if you are interested, but it's an open problem here; it's a nontrivial question. So we randomly picked 150 training activities and 45 testing activities, and in this environment there were ten key objects; an example of a key object is the back door of a delivery vehicle. The size of the action vocabulary was 61; for example, an action in this environment would be a person opening the back door of a delivery vehicle.
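As a concrete toy version of this representation (the choice of n and the histogram distance below are illustrative; the thesis explores several such choices), an activity is just a sequence of action labels drawn from the 61-symbol vocabulary, and two activities are compared through their n-gram counts:

    from collections import Counter

    def event_ngrams(activity, n=3):
        """activity: a finite sequence of discrete action labels (e.g. integers
        into the 61-symbol loading-dock vocabulary).  Returns the counts of all
        contiguous length-n subsequences."""
        return Counter(tuple(activity[i:i + n]) for i in range(len(activity) - n + 1))

    def ngram_distance(act_a, act_b, n=3):
        """A simple histogram distance between two activities based on how
        their n-gram counts differ (one of many possible choices)."""
        ca, cb = event_ngrams(act_a, n), event_ngrams(act_b, n)
        return sum(abs(ca[k] - cb[k]) for k in set(ca) | set(cb))

    # Hypothetical example over a tiny 4-action vocabulary {0, 1, 2, 3}:
    # ngram_distance([0, 1, 2, 3, 1, 2], [0, 1, 2, 1, 2, 3])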
Using these 150 training examples, here I'm showing you the adjacency matrix of the unclustered instances of these training examples, and here I'm showing you the same adjacency matrix reordered according to the clusters that we discovered in the data. >>: What are the features? >> Raffy Hamid: Right. As I mentioned, this is using event n-grams as a representation, and the features are really the counts of these n-grams -- which n-gram happened how many times -- so based on the differences in these n-gram counts you can determine the distance between any two instances of activities. >>: And how do you find those n-grams? >> Raffy Hamid: The way we find those n-grams is just by parsing over the entire activity. You know the start of the activity; you know the end of the activity; you know which actions happen. You go and find the first n-gram and count it as one; if you find it in another instance, you add to that count, and so on. >>: So there are 150 training activities. Someone has to manually go out and… >> Raffy Hamid: No, that's not true. For this problem, since the inference we are making is at quite a high level, the only thing provided to the system was the sparse tracks of the locations of these key objects, and based on those we built action detectors that detected these actions automatically. The tracking part was not done automatically for this project, but everything above that was, so all the results that I'm showing you incorporate all the types of errors introduced by each one of these automatic steps. We were able to discover seven cohesive classes in our data, and just to give you some semantic notion of what a class implies in this environment, the most cohesive class that we were able to find consisted of all the activities that were from UPS delivery vehicles. There was no explicit knowledge in our event vocabulary about the fact that it is a UPS truck, so the fact that we were able to automatically group together all the deliveries made by a particular type of vehicle makes us believe that the low-level perceptual bias we are adding to the system is being successfully latched onto at the higher level. There were several activities that did not get clustered into any one of the discovered classes because they were so different from all the other activities, and we treat them as irregularities or anomalies. Just to give you some examples of the detected anomalies: in the first case I'm showing you an anomaly where the back door of the delivery vehicle was left open, which can actually be very dangerous. In the second, I'm showing you an activity where more than the usual number of people were involved in the delivery, and lastly, I'm showing you an example of an anomaly where a janitor is shown cleaning the floor of the loading dock, which is actually a very unusual thing to do there. This brings us to an important point: while I am calling them anomalies, that is a misnomer. They are not really anomalies; they are just unusual occurrences, and so the way in which this type of system could possibly be used is to filter out the activities that are relatively suspicious and then bring a human into the loop who would make the decision as to whether these are just irregularities or actual anomalies.
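A rough sketch of how such classes and outliers could be pulled out of a matrix of pairwise activity distances; agglomerative clustering and the median-based outlier rule here are stand-ins of my own, since the talk does not specify the clustering method or the anomaly criterion.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def discover_classes_and_anomalies(D, n_classes=7, anomaly_factor=2.0):
        """D: (N, N) symmetric matrix of pairwise n-gram distances between
        activities.  Clusters the activities and flags as 'unusual' any
        instance that sits far from the rest of its own cluster."""
        clustering = AgglomerativeClustering(n_clusters=n_classes,
                                             metric="precomputed",
                                             linkage="average")
        labels = clustering.fit_predict(D)

        anomalies = []
        for c in range(n_classes):
            members = np.where(labels == c)[0]
            if len(members) < 2:
                anomalies.extend(members.tolist())       # singleton cluster
                continue
            # Mean distance of each member to the rest of its cluster.
            intra = D[np.ix_(members, members)].sum(axis=1) / (len(members) - 1)
            cutoff = anomaly_factor * np.median(intra)
            anomalies.extend(members[intra > cutoff].tolist())
        return labels, anomalies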
In conclusion: I started my talk by identifying the large gap that exists between the low-level perceptual data and the more interesting high-level inference stage, and I made a case for using key objects, actions and activities as intermediate characterizations to channel this information from the low level all the way to the high level. My hope is that over the course of the talk I was able to somewhat convince you that using these sets of characterizations is at least a useful way to channel this information, particularly from the perspective of some of the problems that I talked about. I should also mention that how to bridge this gap is a very open problem. It is one of the big problems out there, and I'm not at all suggesting that this is the right way to bridge the gap; in fact, much more research needs to be done before we can actually get to that question. I also had the chance to work on several other industrial projects over the last ten years or so, so I wanted to say a few words about them, and if you are interested in talking to me about them afterwards, please do. One of the first things I was involved in at eBay was the eBay Fashion app, and we built this project called eBay Image Swatch as a feature in this app. The idea here is content-based image retrieval: suppose I like your jacket; I take a picture of it and I want to see, really quickly, whether eBay has this type of jacket in its inventory or not. This project was done by a team of three individuals, and I was one of the three; all the way from ideation to productization, we did it all ourselves. In fact, if you download the app right now, part of the code running on the server was written by me. This was a lot of fun, got some traction both inside and outside of eBay, and was featured on several news channels. >>: What kinds of features were used for retrieval? >> Raffy Hamid: That's a great question; unfortunately I cannot talk about it. >>: [indiscernible] >> Raffy Hamid: Yes. So please read the paper. [laughter]. Right. So those are the types of features that we are sharing with other people. >>: I see. I will leave it at that. >> Raffy Hamid: Okay. Sounds good. Another project that I want to mention is that I did one of my internships at Microsoft Research back in 2007, where I had the privilege of working on this project called Microsoft RingCam, where the research problem was really audiovisual speaker detection. While my work on this project lasted only three months, it really gave me perspective on how to do research when you are working on a project that is driven toward a product. I also wanted to say a few words about the project that I did for General Motors before starting my PhD. Here the problem is that, in the case of an accident, we want to decide whether to deploy the airbag or not. The motivation is that in the U.S., for a nontrivial percentage of cases, the fatality of the passenger actually happens because of the impact from the airbag and not the accident itself.
The idea here is that we need to determine in real time whether we should deploy the airbag in the case of an accident or not. I wanted to say a few words about some of the future themes that I am interested in. The big message here is that I am particularly driven by problems that have a particular goal in mind; that is something that I have done in the past and hope to do in the future as well. As far as thoughts about future research themes are concerned: over the last couple of years at eBay I have had the privilege of working with some folks from the large-scale machine learning and large-scale data analysis side of the spectrum, and I was lucky to learn some of this from these people. I am very interested in bringing that knowledge back to the video analysis side and figuring out how we can use those types of representations and learning mechanisms to do inference for problems in video, because in my opinion video is one of the biggest sources of big data out there, and we really need to bridge the gap between these two communities. Another thing that I am super excited about is this notion of mobile media computing. I really strongly feel that this is definitely going to change the way we understand computing. However, I feel that right now there are two trends that people are focusing on very heavily: one is the capture side of the spectrum and the other is the compute side. On the capture side, every year better and better cell phones come out which let you capture the experience in a smarter, faster way. However, we don't know which types of information we are supposed to extract from these capture devices, because there is a wide gap between the capture and the compute sides. It's not clear which pieces of information you should communicate to the compute side, and I really think that figuring out which parts of the information are important and interesting, so that you only communicate those parts, opens up a really interesting set of challenges. Last but not least, I'm very interested in applying vision to robotics, especially in terms of human-robot interaction. A lot of my past work is about detecting and analyzing the activities and actions of humans, and for the field of human-robot interaction it's really important for a robot to understand what the state of the human is and what the likely next state of the human is, so I think that from that perspective I can bring some value to the table when it comes to bridging the gap between the fields of computer vision and human-robot interaction. At the end I'd like to say a few words to thank a lot of the people who helped me do all of the work that I presented today. At eBay Research I collaborated closely with Dennis DeCoste, and I had the pleasure of working with Chih-Jen Lin and Atish Sarma. At Disney Research my postdoc manager was Professor Jessica Hodgins. My PhD advisor was Aaron Bobick. I also had a great time interning at several industrial research places; in particular, Cha Zhang was my mentor at MSR. And I had the privilege of mentoring quite a few wonderful interns and mentees, and they really helped me do a lot of the stuff that I have presented today. Okay, so thank you very much, and at this point I'll take more questions from you guys. >>: Thank you. [indiscernible] ask questions [indiscernible] >>: Did you get a chance to work with [indiscernible] information or video sequences?
>> Raffy Hamid: Not much so far. No, I have not done too much work with RGB-D. [indiscernible] but I am very interested in working with that [indiscernible] >>: Seems like it would help a lot for the software. >> Raffy Hamid: Yeah, if you can get to that level, yeah. >>: [indiscernible] structured domains so you have [indiscernible] >>: I have another question. When you work with videos as big data, one of the major problems I saw with videos was how to dig the data out. I would say the photos are not well organized either, but at least they have tags and sometimes GPS locations. Videos are horrible. Just try to find all the videos that tourists took in Venice, and the amount of noise that you get in your material is enormous. Any thoughts about that? >> Raffy Hamid: Yes, several thoughts. The first thought is: absolutely, I agree with you; it's a very, very difficult question. And I think that there are several ways to look at this, and it really depends on the context. I think that if you are looking at a very explicit context where the environment is known, then some sort of structured information can be added, but in general, from a generic perspective, I very strongly believe in using these images that have tags and geo and textual information available as a prior to figure out which parts of the video they belong to, and if they cover quite a large part of the video, then using that information to classify what the content of the video is. So the particular direction that I am very interested in for the specific problem that you mentioned is to rely upon the image information that we have out there. We have quite good quality image information, much better than the user-generated video, and we also have a very good amount of textual and geo information available with these images. So the direction in which I would like to explore is how to use that information for this problem. I think that our summarization work is a baby step toward that, and I hinted at using that type of approach for other problems as well, like figuring out wedding videos and birthday parties -- things that are really event related. We would be going from very simple product videos to more complex event videos, so those would be some of my thoughts on how to look at that. >>: Okay. Thanks. >> Raffy Hamid: Thank you very much. [applause]