>> Eyal Ofek: Hello. Good morning. And it's my pleasure to invite Raffy Hamid to Microsoft to
give a talk. Raffy is a researcher at eBay Research Labs. He finished his PhD at Georgia Tech
and was an associate researcher at Disney Research in Pittsburgh. Raffy.
>> Raffy Hamid: Thank you very much and thanks for having me here and for coming to my
talk. A lot of my work is about automatic video analysis. That is, given a video how do we
figure out what is going on in it? In today's talk I'll speak about some of the stuff that I've done
in this context over the last few years. Let me start by sharing with you what my opinions are
about why I care about videos and why videos matter. Video is a unique medium that can
capture both the sight and sound of an environment in synchronicity, which makes it really
effective to not only capture, but also to view the everyday human experiences. It is easy for us
today to take video for granted, but at the time that it was invented a little more than 100 years
ago, it was nothing short of a miracle. Just to bring that message home and also for fun I
wanted to share with you this excerpt from a 1950s movie called The Magic Box where the
character of the inventor of the video camera is shown to sort of share his great joy at his
remarkable achievement. So let's take a view. [video begins]
>>: It's a bit like a magic lantern, but instead of one picture at a time, you see eight or more
pictures every second and that would [indiscernible] eight pictures every second and they are
all merged together into one moving, living picture. See? Of course, there is a bit more to it
than that. I'm not saying it's perfect. After all, it works [indiscernible] it works [indiscernible]
doesn't it. You can see that. [video ends]
>> Raffy Hamid: So starting with these humble beginnings, video has really become an integral
part of our day-to-day life. Today we have millions of users uploading and viewing video that
they select on YouTube. We have videos making inroads into the social media circle thanks to
[indiscernible] and Instagram. More and more people are realizing that they can capture
more information about the products that they want to sell online using video than they can convey using just text and/or images, and being at eBay this is something that I'm going to talk about a
little bit more in detail. The entire model of news reporting is undergoing a complete
transformation. These days any individual with a digital video camera who is at the right place
at the right time can capture a breaking event and submit it to one of these news reporting
agencies. Also, in the field of online education, videos are playing a great role in places like
Khan Academy and Udacity are making thorough usage of videos to disseminate knowledge to
the masses. And last but not least, thanks to the online and on-demand video the entire model
of entertainment has undergone several transformations over the last few years. With all these
really exciting and interesting things happening with and around digital video, it really feels as if
we're living in an era which is at least partly defined by the medium of digital video. While
we've made some commendable progress in terms of capturing and storing and transmitting
videos, I still feel that the elusive Holy Grail, so to speak, remains automatic video analysis, and if we can make some major progress in that direction then I feel that the
opportunities can truly be endless. For example, we could have systems that could be used for
automatic healthcare management of senile individuals. With better surveillance systems we
could have improved safety and security, systems that can understand complicated tasks such
as medical surgeries would help us provide useful information which could therefore help us
improve our performance in these complicated tasks. Again, last but not least, we could have
automatic sports visualization and analysis systems that could help us engage better with the
different types of entertainment and therefore be able to enjoy our lives in a better way.
Again, this is something that I'm going to talk about in more detail in just a bit. So to situate my
talk better let me start by showing you an active environment of a soccer field where different
players are performing different sports related activities. One of the main challenges for
automatic video analysis systems is this wide gap that exists between the low level perceptual
data, the pixel values and the more interesting high-level semantic stuff, for example, whether
a player is kicking a ball or not. A natural way to bridge this gap is to have a set of intermediate characterizations that can channel this low-level information in an appropriate way so that it can be used at the high inference level. One such set of intermediate characterizations starts with the assumption that an active environment consists of a set of key objects that define
the type of activities that can go on in that environment. So in the case of a soccer field, for
example, some of these key objects might be the players on the field, the goalposts, the ball
and so on. Different interactions among these key objects over finite durations of time constitute what I call actions; here, for example, I'm showing you the action of kicking. Finally, when we put these actions in a specific temporal order they
constitute what I call activities, so here, for example, I'm showing you an activity which consists
of a player throwing the ball to another player and an opposing team player intercepting the
pass and the first player diving at the ball in order to make a save on the goal attempt by the opposing team. In this space of activity dynamics there is a wide range of very difficult
and open research challenges that still are there, and in this talk I wanted to speak about some
of the work that I've done at each one of these intermediate characterizations. In particular,
for the characterization of key objects I'll talk about the problem of localizing and tracking these key objects. For the characterization of actions I'll talk about the problem of spatiotemporal analysis of videos to figure out what the more important or interesting parts of these videos are. And finally, for the characterization of activities, I'll talk about the problem of large-scale
activity analysis particularly using unsupervised or minimally supervised learning methods. So
let me start with the characterization of key objects where the motivating application I want to
use for this part of my talk is sports visualization and analysis. I don't know how many of you
remember this match between USA and Algeria in World Cup 2010, but one of the goals that
was scored by team USA was considered an off-side foul and this actually resulted in the U.S.
team not winning the match. Wouldn't it be awesome if we could have visualization systems
that could figure out where each one of the players involved in the game is and then figure out
who are the key players that are involved in the particular foul and then provide us with some
visualization information that could help us understand the dynamics of the sport better? This
is the project that I was involved in at Disney Research Pittsburgh. This was done for ESPN
where the objective was to track the several players, figure out where the second-to-last defensive player is, because that is one of the crucial players for determining whether an offside foul
has happened or not, and then draw a virtual offside line, so to speak, underneath that player.
Some of the technical challenges that make this problem interesting are tracking under visual
occlusions, noisy measurements and dealing with varying illumination conditions. I'll start by
giving you an overview of the computation framework that we used to solve this problem and
then I will say a few more words about each one of these steps. Along the way I'll try to identify
some of the technical contributions that we made in this project. Starting with the set of input
videos, the first step we do is extract the foreground and at the end of this step we have pixels
that don't only belong to the players on the field, but also the shadows of these players. Since
later on we are going to track these players in the image planes of the cameras, these shadows
can become distracting, and therefore it's important to remove them. That's exactly what we
do in this step of shadow removal. Once we have obtained the blobs of these players we can
actually track them and also classify whether they belong to the offense team or the defense
team. When we have this sort of spatial information available to us from multiple cameras, we
can fuse this information together, search for where the second to the last defense player is
and finally draw the virtual offside line underneath that player. So now I'll say a few more
words about each one of these steps. Starting with the foreground extraction, we begin by building adaptive color mixture models for each one of the pixels in the scene, and these models are used to classify each pixel as to whether it belongs to the foreground or the background; this is done by thresholding the appearance likelihood.
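Roughly, this per-pixel background modeling step can be sketched with OpenCV's built-in mixture-of-Gaussians background subtractor; the system described in the talk used its own adaptive color mixture models, so the subtractor choice, the parameter values, and the file name below are illustrative assumptions rather than the actual implementation.

import cv2

# A rough sketch of per-pixel background modeling; the exact model and
# thresholds here are assumptions, not the values used in the talk's system.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=False)

capture = cv2.VideoCapture("camera_1.mp4")  # hypothetical input file
while True:
    ok, frame = capture.read()
    if not ok:
        break
    # Each pixel is classified as foreground or background by thresholding
    # its likelihood under the learned color mixture model.
    foreground_mask = subtractor.apply(frame)
    # Light morphological clean-up before blob extraction.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    foreground_mask = cv2.morphologyEx(foreground_mask, cv2.MORPH_OPEN, kernel)
capture.release()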
As you saw previously, after this step of foreground extraction we can get pixels not only from the players' bodies but also their shadows, and since these shadows can be very strong, relying purely on appearance-based methods to remove them doesn't generally work very well. So in our case we relied upon some of the 3-D constraints of our camera system, and the basic intuition that we are using here is that shadows cast on planar surfaces are view invariant. What that means is that if you warp a subset of these views onto a particular view, then the pixels that belong to shadows are the only ones that correspond to each other at the same point. By using this very basic invariance and some simple thresholding we were able to remove these shadows relatively accurately.
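That view-invariance intuition can be sketched roughly as follows: warp the other cameras' foreground masks onto a reference view using ground-plane homographies and treat the pixels that stay consistent across views as shadow. The function, the homographies, and the agreement threshold below are assumptions standing in for the calibrated setup described in the talk.

import cv2
import numpy as np

def remove_plane_shadows(ref_mask, other_masks, homographies, min_agree=2):
    # Shadows lie on the ground plane, so warping the other cameras' masks onto
    # the reference view with ground-plane homographies makes shadow pixels line
    # up, while player pixels (off the plane) do not.
    h, w = ref_mask.shape
    agreement = np.zeros((h, w), dtype=np.uint8)
    for mask, H in zip(other_masks, homographies):
        warped = cv2.warpPerspective(mask, H, (w, h), flags=cv2.INTER_NEAREST)
        agreement += (warped > 0).astype(np.uint8)
    shadow = (ref_mask > 0) & (agreement >= min_agree)
    cleaned = ref_mask.copy()
    cleaned[shadow] = 0          # drop pixels that are consistent across views
    return cleaned, shadow.astype(np.uint8) * 255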
With the blobs of the players in hand, we can also track them using a particle-filter-based tracking mechanism. The basic intuition is that given
the position of a particular player in a particular frame, we use the motion model to
hypothesize where this player is going to move in the next frame. And then given the noisy
measurement in the next frame, we figure out how good or bad each one of our hypotheses
was, and then in a maximum-likelihood sense we can figure out the most likely estimate of where the
player is in the next frame and this process is repeated over the entire length of the input
video. One of the things that went to our benefit for this project was that by classifying these
players across different teams…
>>: At what granularity were you doing that, predicting where the player was going to be -- at the frame level?
>> Raffy Hamid: Exactly at the frame level, and I'll talk a little bit more about the data that we captured; it was either 30 FPS or 60 FPS.
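For readers who want the mechanics spelled out, here is a small sketch of one per-frame predict/update cycle of a constant-velocity particle filter of the kind described above; the motion and measurement noise values are placeholders, not the parameters of the ESPN system.

import numpy as np

def particle_filter_step(particles, velocities, measurement,
                         motion_noise=2.0, measurement_noise=5.0):
    n = len(particles)
    # Predict: hypothesize where each particle moves in the next frame.
    particles = particles + velocities + np.random.randn(n, 2) * motion_noise
    # Update: weight each hypothesis by how well it explains the noisy measurement.
    dists = np.linalg.norm(particles - measurement, axis=1)
    weights = np.exp(-0.5 * (dists / measurement_noise) ** 2)
    weights /= weights.sum() + 1e-12
    # Maximum-likelihood style estimate of the player position for this frame.
    estimate = particles[np.argmax(weights)]
    # Resample particles in proportion to their weights for the next frame.
    idx = np.random.choice(n, size=n, p=weights)
    return particles[idx], velocities[idx], estimate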
>>: It's strange that you have to rely on a prediction as opposed to just finding what happens in
the next frame.
>> Raffy Hamid: Right. So we could do detection at each frame but then the problem of how
do you ascribe which player corresponds to which player in the next frame is a nontrivial
problem, so generally in tracking they use time information because there's a lot of velocity or
first order information that you can use to simplify the process. So one of the things that went to our advantage was that the appearance of the players in this problem is by design quite discriminative, because you want the players of the two teams to wear different uniforms so that you can tell the teams apart. So for the problem of classifying these blobs we were able to use simple hue-saturation-based cues in a nearest-neighbor sense to achieve quite high classification results for this two-class classification problem.
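A rough sketch of what such a hue-saturation nearest-neighbor classifier could look like is below; the histogram bin counts, the Bhattacharyya comparison, and the team template histograms are assumptions, since the talk only says that simple hue-saturation cues were sufficient.

import cv2

def team_of_blob(blob_bgr, team_templates):
    # Compare the blob's hue-saturation histogram against one stored template
    # histogram per team and return the nearest team label.
    hsv = cv2.cvtColor(blob_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [30, 32], [0, 180, 0, 256])
    cv2.normalize(hist, hist)
    distances = {team: cv2.compareHist(hist, tmpl, cv2.HISTCMP_BHATTACHARYYA)
                 for team, tmpl in team_templates.items()}
    return min(distances, key=distances.get)   # smallest distance wins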
>>: Were you not tracking the goalkeeper?
>> Raffy Hamid: We are tracking the goalkeeper and the goalkeeper had a different template.
Yes, that's a good question. So while it was easy for us to classify these players across the
team, the flipside of that was that within the team it's very difficult to detect which
measurement belongs to which player especially when these players come close to each other,
and this was really where some of the technical contributions come in. The way that we tried
to solve this problem was to throw redundancy at the problem, so the idea was if you track
these players in multiple cameras and some of these trackers in some of these cameras fail, you
can still rely on other cameras to actually still get to an accurate inference. To illustrate this
point further here, I'm showing you the top down illustrative figure where several players are
shown in one half of the field that are being captured from three partially overlapping cameras.
In an ideal world if you were only interested in player p4 for now we would get three different
measurements in the image plane from each one of these three cameras and we could warp
these image plane measurements onto a common ground plane and they would all correspond
to the exact same point and we would know exactly where this player is. However, real life is
not so nice to us and due to several reasons we have different types of noises that don't let
these three measurements correspond to the exact same point. This becomes a really
challenging problem as we have multiple players that come very close to each other, and so
fundamentally this becomes a correspondence problem. So that if we could find out what
measurements correspond to a particular player we would be able to fuse it together and come
to an accurate inference. There are several solutions to this problem, and the one that we
chose to use was a graph theoretic approach using a complete K-partite graph where the
number of partites are equal to the number of cameras that we are using, so here I'm showing
you a graph with three partites because I'm showing you three cameras. The nodes in each
partite correspond to the observations of where the players are in the 2-D image plane of that camera. For illustrative purposes, I'm showing the color of these nodes to be the same as the color of the players seen from a particular camera. This is an undirected, edge-weighted graph, and the weights on these edges are defined by the pairwise distance between the appearances of any two players. This leads to the question of what a correspondence in this graph means, and again you can have several solutions. For example, you could have the notion of maximal cliques defining what correspondence means in this
graph. However, for our problem we figured out that using the notion of maximum length
cycles was sufficient to have a robust enough result, so the idea is if you could figure out that
these three nodes in the graph form a maximal length cycle, then we would know that these
three measurements belong to a particular player, you'd be able to fuse them and then
similarly if you could find all of the cycles in this graph you would be able to find out where
each one of these players is quite accurately. It turns out that finding these maximum length
cycles in K-partite graph is an NP hard problem for any graph with number of partites greater
than or equal to three, and so we can't really rely on exhaustive search methods to do that
because we needed to do that in pseudo-real-time. And so we had to rely on greedy algorithms
to do this search and that presents us with a trade-off between optimality and efficiency and it
turns out that there's a whole class of algorithms that you can use for greedy searching and one
of the contributions in this work was to actually analyze which of these greedy methods work when, and how well, for our problem.
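To make the correspondence idea concrete, here is one possible greedy sketch of searching for low-cost cycles in a complete K-partite graph, one partite per camera; the appearance distance, the cost threshold, and this particular greedy strategy are assumptions, since the work compared several greedy variants that are not detailed in the talk.

import numpy as np

def greedy_cycles(observations, distance, max_cost=1.0):
    # observations[c] is the list of measurements from camera c; distance() is
    # the appearance distance on graph edges.  Both, like max_cost, stand in for
    # the actual features and thresholds of the system.
    k = len(observations)
    used = [set() for _ in range(k)]
    cycles = []
    for i, start in enumerate(observations[0]):
        node, path, cost = start, [(0, i)], 0.0
        for cam in range(1, k):
            candidates = [(distance(node, obs), j)
                          for j, obs in enumerate(observations[cam])
                          if j not in used[cam]]
            if not candidates:
                break
            d, j = min(candidates)
            cost += d
            node = observations[cam][j]
            path.append((cam, j))
        else:
            cost += distance(node, start)      # close the cycle back to camera 0
            if cost <= max_cost:
                for cam, j in path:
                    used[cam].add(j)
                cycles.append(path)            # one accepted cycle == one fused player
    return cycles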
>>: Can you say a few words about where the inconsistency comes from? Is it vibration in the ground?
>> Raffy Hamid: That's a wonderful question. So I would be more than happy. Camera
calibration is definitely one of the reasons, because while we are making the assumption that
this is a planar surface, no surface is perfectly planar, so there are errors that we get in our
homography measurements. Also, because of the lens aberrations on the cameras, there are
also different types of noise that can be added, so that assumption is one of the big reasons.
The other big reason is the shadow removal step. Just to -- maybe I should go like this. Just to
give you an understanding of that problem, why this really happens, I'll show you a graphic if I
can find it. There we go. Here I'm showing you a scene where I have the shadows of these
players that I've marked green. And here I'm showing you a corresponding picture where the shadows have been removed. For several of these players the foot pixels are still there, but for a lot of them the foot pixels are gone, so essentially from the system's perspective they are flying in the air, and so when you project them they don't correspond -- so this is one of the
reasons, but yeah, that's a very good question. Just to give you sort of a visual of what this fusion mechanism looks like when you actually apply it to a real video, this is sort of
to give you a sense of how we fuse different measurements as different players move around in
the environment. And this leads us to the point of, you know, evaluating this framework and,
again, as I mentioned this work was done for ESPN and one of the things that they were very
keen about was to make sure that whatever statements we make about the accuracy and
performance of our system are really backed by a solid number. To do that we actually went
through quite a bit of work. I'll present two sets of experiments for the evaluation part of it.
One of them was done at the Pixar Studios in California where they have a soccer field and we
actually went there and hired a local soccer team. We asked them to play different types of games for us with different types of formations and drills. We also asked them to wear jerseys of different colors, and these colors corresponded to the team colors of international teams that take part in the World Cup. We also captured this data for several days
from 9 AM to 6 PM to make sure that we were able to incorporate all the different types of
illumination conditions that can happen. Not only did we test this framework in one setting,
but to make sure that our framework is robust over different types of stadiums, we also went
down to Orlando, Florida, to the ESPN Wide World of Sports, where they have this very large facility where kids from all over the U.S. come to play soccer, amongst other games. Here, as you can see, one of the variables that we looked at was whether our system can perform under floodlights. Also, you might have noticed that the height of these camera towers is greater; in this case it was 60 feet, whereas in the previous data set it was 40 feet. Another difference was that here we captured data at 720p, whereas in the previous set we captured it at 1080p, so we wanted to see how the resolution of the image affects the performance of
our framework. And in the interest of time I'm not going to belabor too many numerical details
of our work. Suffice it to say that our overall accuracy gain was close to ten percent over some of the baseline methods that we looked into. If you
are interested in finding out more about the details, please come talk to me and I'll be happy to
talk more about it. In summary, I just spoke about posing the fusion problem as a graph-theoretic optimization problem where I treated correspondences as cycles in K-partite graphs. I also spoke about greedy algorithms that can be used for efficiently searching for these cycles, and an important point here is that this is a relatively general approach that can be applied to other types of sports, for example field hockey, basketball and volleyball. On that note I wanted to
share with you some of the related stuff that I've been involved in that uses this framework.
I'm a big squash junkie and I like to watch a lot of these games on YouTube, so the idea here is
given dozens of videos of matches of a particular player can we have systems that can
understand what are the general trends of that player? What are the different types of
techniques that they use against their opponents? For example, one of my favorite players is Rommy [phonetic], who I show here in the red shirt. The question is, if I were to play against Rommy and I dropped him in the top right corner of the court, is he going to drop me back or is he going to lob me? These types of questions, I know, are very important from a coaching perspective, so this type of framework can be used to get us closer to automatically figuring
out these trends. At the risk of jumping the gun a little bit, I understand that I am not yet
talking about this action detection part of my talk. But since we are talking about sports, I want
to say that not only can you look at these sports videos at the macro level, but also you can
start looking at it at the relatively less macro level where each individual action performed by
each individual player can start to be classified and recognized, not only when it is being
captured from one camera, but also from several cameras together because that's how usually
these activities are captured in sort of World Cup or high-level situations. Finally, so far
whatever I talked about is with static cameras, but more and more people are using handheld
mobile devices to capture videos, so I wanted to say a few words about this thread that I am
actually currently involved in at eBay Research. This problem, in fact, relates to the image
quality problem. eBay has a lot of users that are uploading millions of images online, but they are not professional photographers, so a lot of the time they capture images that have a lot of background clutter. While we can use techniques like graph cut to try to extract these foreground objects from the image, the form factor of a mobile device is such that it does not necessarily encourage a lot of interaction with the device. So here the objective
was to explore whether we can use the medium of video to extract the foreground objects in a
less interactive way. The idea is very simple actually. Suppose you want to sell a toy on eBay
and you want to capture it in a certain environment. First you capture a background video
using a hand held camera and then you put the object in that environment again and take
another video with the handheld camera. Now the question is, since these two videos are captured along very different camera paths, can we actually align them efficiently so that we can subtract the background video from the foreground video and come up with an output video which only has the foreground object in it? To test the system -- and this is still work in progress -- we actually applied it to different types of scenarios, taking
foreground objects with different types of geometric complexity, multiple foreground objects, different types of illumination conditions, and also different types of movement, including articulated movement of the foreground object. Our initial results
tell us that this is an interesting way to go toward solving this particular problem.
>>: Back to the soccer project, can you talk about the compute environment needed to handle the three streams of HD results, and were you limited by that at all?
>> Raffy Hamid: That's an excellent question. To answer that question let me -- the compute
requirement of the project was that we wanted to show it in pseudo-real-time, which means
that we wanted to show it in terms of instant replay, so we would get about 3 to 4 seconds to
process close to ten seconds of video and we would need to generate results with that. This
project was done at the demo level, so it was not at a production level; the code was written by me and it was semi-optimized, not super optimized. Just to give you some understanding of the breakdown in terms of time, I was able to bring it to almost 0.4 seconds per frame, and this is for all three frames together, so you could divide it by three. This is [indiscernible] time and here you can see that most of the time is taken by the
background subtraction part of it which is very, very parallelizable, so I was not using any GPUs
to get these numbers. I think that this number can actually be significantly brought down, but
at the point when I got finished with this project, this was sort of the time that we were taking
to process video for this.
>>: And single core?
>> Raffy Hamid: I was using multicore, but I was not using multicore in a very efficient way. I
was using [indiscernible] things but I am quite certain that we can optimize it to quite a
different level.
>>: And what if you had the 10, 50 times the capability, would that make the job easier or
make the result any better?
>> Raffy Hamid: I think it would help us in two ways. One is it would allow us to try algorithms
that are slightly more complex and therefore it would help us get more accuracy. That's one
thing. The other thing is one of the solutions, one of the sort of philosophies of solving this
problem was to [indiscernible] and enunciate the problem, so I'm quite confident. I know for a
fact that if you have fewer cameras, your accuracy degrades quite steeply, so one thing that having a lot more compute power would give us is the ability to deal with far more cameras. Imagine a ring of 50 cameras. Now you can get over the
problem of visual occlusion much better, and so that is some of the things that would help us.
>>: So can you say something about segmenting the [indiscernible]
>> Raffy Hamid: Sure. I'd be happy to. The way we did segmentation was actually using a sort of connected-component detector, and that's how we started our trackers. And then at each step we were actually performing background subtraction first. The notion of segmentation here is really just a step of finding connected components.
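A minimal sketch of that segmentation step, using OpenCV's connected-components routine on the background-subtracted mask, might look like this; the blob area limits are assumed values, not numbers from the talk.

import cv2

def player_blobs(foreground_mask, min_area=50, max_area=5000):
    # Connected components of the background-subtracted mask become the blobs
    # that seed the trackers.
    num, labels, stats, centroids = cv2.connectedComponentsWithStats(foreground_mask)
    return [(tuple(centroids[i]), int(stats[i, cv2.CC_STAT_AREA]))
            for i in range(1, num)                    # label 0 is the background
            if min_area < stats[i, cv2.CC_STAT_AREA] < max_area]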
>>: First you have to match between the videos, right? [indiscernible] handheld videos…
>> Raffy Hamid: Oh. You're talking about this step? I see, I see. You are asking me about,
yeah, sure. Absolutely. I thought we were talking about -- I see. Yes. I would be happy to. I
don't know why these videos are not playing right now, but…
>>: That's okay.
>> Raffy Hamid: The steps are relatively simple. We start by finding the SURF features in each frame of the background video, and also the SURF features in each frame of the foreground video. Then, for each frame in the foreground video, we find matches for these SURF features, using RANSAC to figure out what the best matches are, and once we have found these matches we have a subset of background frames that are good matches for that particular frame in the foreground video. We then use a projective transform to transform those background frames that have been matched well, and so…
>>: And so you assume [indiscernible]
>> Raffy Hamid: Yes. Exactly. This is very similar to the plane-plus-parallax work that has happened before, and so the idea here is that you don't have to build a whole 3-D model for the alignment.
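Under those assumptions, the per-frame alignment could be sketched roughly as below: match features between a foreground frame and candidate background frames, fit a homography with RANSAC, warp the best background frame, and subtract. ORB features are used here only because they ship with OpenCV (the talk mentions SURF), and the later optical-flow refinement step is omitted.

import cv2
import numpy as np

def align_and_subtract(fg_frame, bg_frames):
    orb = cv2.ORB_create(1000)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    kp_f, des_f = orb.detectAndCompute(fg_frame, None)

    best = None
    for bg in bg_frames:
        kp_b, des_b = orb.detectAndCompute(bg, None)
        if des_f is None or des_b is None:
            continue
        matches = sorted(matcher.match(des_f, des_b), key=lambda m: m.distance)[:200]
        if len(matches) < 10:
            continue
        # Map background keypoints onto the foreground frame with a homography.
        src = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
        dst = np.float32([kp_f[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
        if H is not None and (best is None or inliers.sum() > best[1]):
            best = (H, inliers.sum(), bg)

    if best is None:
        return None                        # no background frame matched well enough
    H, _, bg = best
    warped_bg = cv2.warpPerspective(bg, H, (fg_frame.shape[1], fg_frame.shape[0]))
    return cv2.absdiff(fg_frame, warped_bg)   # large values roughly mark the foreground object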
>>: The first one that you showed, the blue doll there, it obviously didn't have a planar background.
>> Raffy Hamid: Yes, that is absolutely true. It does not have to; the scene doesn't have to be
planar, but the idea here is -- and I wish this would work.
>>: That's fine.
>> Raffy Hamid: But the idea here is that you first make this planar assumption, and then once you have a registration, you actually do a non-rigid transformation using optical flow on the previously [indiscernible] frame. That takes care of some of the depth in the scene that you are not capturing with the planar assumption. And not just that, you're getting multiple hypotheses, so if you look at…
>>: [indiscernible] this will fail if for example you have something near a corner and you have
two planes that go from near to far?
>> Raffy Hamid: Two planes that go from near to far?
>>: I'm trying to think of the scenario where it's hard to approximate.
>> Raffy Hamid: Yes. For any situation where two things don't happen you might fail much
more than not. One is if you have a situation where there's a lot of depth in the scene to the
point where the planar assumption that we are making is completely screwing up everything,
so that's one situation that we know we fail. The second situation is when for a particular
frame in the foreground you are not able to come up with a lot of good matches in the
background video, and so we know for a fact in those situations the algorithm would obviously
not be able to work. And so for those types of situations when we go to them, we actually rely
on tracking, so we are adaptively relying on tracking the pixels that we are detecting in the
previous frame. So I just spent a little bit of time talking about the characterization of key
objects. This is some of the work that I've done in this regard. And now I will speak about the
characterization of actions, really looking at this problem from the perspective of figuring out what the important or interesting parts of a video are, and the practical application I'll use for this part of my talk is video summarization. This problem is becoming
increasingly important for eBay because sellers are realizing more and more that they want to
use the medium of video to capture information about the products that they want to sell
online. So if you go to eBay and search for YouTube.com you will be able to find tens of
thousands of listings and what's happening is that these sellers are capturing these videos of
the products they want to sell. They are uploading them to YouTube, copying the link and
embedding the description in their listings. Just to give you a sense of a video that was
captured by a seller trying to sell his blue Honda Civic on eBay, here's that video and you can
see that the general quality of the video is relatively poor and these videos can run all the way
to ten minutes, so with no understanding of where each piece of information is, you know,
watching a poorly made ten minute video can be quite cumbersome. The idea behind this
project is really to figure out what are the important parts of this video so you can allow the
potential buyer to navigate this video in a more nonlinear, less slack manner. A lot of the
previous work in video summarization relies upon the content of the video itself, and since the quality of these videos is quite poor, those traditional schemes don't work very well, and we know that for sure because we implemented algorithms for that. Thankfully, at eBay we also
have millions of images that these sellers are uploading of these products that they want to sell
online, so here, for example, I'm showing you some of the still images that the same seller
uploaded of the same car. Here you can see that these still images are actually of much higher quality compared to a randomly sampled frame from this video. The question really becomes how we can use these images as a prior to help us figure out some of the more representative parts of the videos that these users are generating. Just to give you a very brief
overview of our approach -- by the way I should mention here that this project was done with
one of my interns from last year, [indiscernible] one of my interns who is doing his internship
this year at division group, and so if [indiscernible] is watching, hi to him. Just to give you some
overview of the framework that we used, we started with the corpus of our unlabeled images
and we used [indiscernible] clustering to figure out what are the canonical viewpoints present
in our corpus. Not only do we have these individual images, but we also have frames of the
videos that these users have uploaded, and incorporating these frames can actually improve the quality of the discovered clusters. By bootstrapping on the clusters that we found just using
the images, we incorporated these frames in a [indiscernible] type algorithm to directly
improve the quality of our discovered clusters. With these clusters discovered, given a test
video we can ascribe its frames to one of these discovered clusters, and then come up with the final output summary.
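As a minimal sketch of that pipeline, one could cluster features of the image corpus and then pick the video frame closest to each cluster center; plain k-means, the feature representation, and the number of clusters are all stand-in assumptions for the bootstrapped clustering actually used.

import numpy as np
from sklearn.cluster import KMeans

def summarize(video_frame_features, image_corpus_features, n_clusters=8):
    # Cluster the still-image corpus into canonical viewpoints, then pick, per
    # cluster, the test-video frame that sits closest to the cluster center.
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit(image_corpus_features)
    summary = []
    for center in clusters.cluster_centers_:
        dists = np.linalg.norm(video_frame_features - center, axis=1)
        summary.append(int(np.argmin(dists)))       # index of the representative frame
    return sorted(set(summary))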
A related problem that we explored over the course of this project was how to do large-scale annotation and evaluation of different summarization
algorithms. A lot of the previous work in this context has heavily relied upon expert knowledge
to figure out whether a summarization algorithm is performing well or not. Clearly this way of solving the problem does not scale very well, so part of the contribution of this work was to figure out how we can obtain multiple summaries from crowdsourcing platforms such as Amazon
Mechanical Turk, AMT, and use each one of these summaries as an instance of ground truth to
figure out whether a particular summarization algorithm is performing well or not and also to
compare different summarization algorithms with each other. Just like any comparison
mechanism we need some notion of distance between summaries that we are trying to
compare to each other. Then again, there are several notions of distance that you can use here, and after exploring quite a few of those we were convinced that using SIFT Flow as a distance between summaries is actually a good idea. And just to give you a visual sense of what the retrieval results look like when you are using SIFT Flow as the matching mechanism, here I am showing you in the first column some of the query frames and
on the right I'm showing you some of the retrieved matches that have been sorted in
descending order, and as you can see SIFT Flow can be quite robust with the types of variations
in illumination and viewpoint that we actually observe in our data. Yes?
>>: So the right column is a match to the left column? What's the relation between the speed of
the [indiscernible] and the car?
>> Raffy Hamid: Right. First of all, each one of these rows is independent of the other rows, and
these guys are the closest matches and this is the farthest match. You can see that here the
picture looks very, very similar and here it is sort of similar and there it is not similar. With this
notion of distance in place, our problem really sort of boils down to the question of
correspondence. We need to find the correspondence between the summarization frames that an algorithm gives us and the summarization frames that an AMT summary has given us. For that we can use a bipartite graph model where the edge weights are given by the SIFT Flow distance. Now we can use the classic notions of precision and recall to actually figure out how well or poorly an algorithm is performing and
also to compare different algorithms with each other.
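A small sketch of that evaluation step is shown below: a bipartite assignment between the algorithm's summary frames and one Turk summary, scored with precision and recall. The frame_distance callable stands in for the SIFT Flow distance, and the match threshold is an assumed value.

import numpy as np
from scipy.optimize import linear_sum_assignment

def summary_agreement(algo_frames, turker_frames, frame_distance, match_threshold=0.5):
    # Build the bipartite cost matrix, solve the assignment, and count a pair
    # as a hit when its distance falls below the threshold.
    cost = np.array([[frame_distance(a, t) for t in turker_frames]
                     for a in algo_frames])
    rows, cols = linear_sum_assignment(cost)
    hits = sum(cost[r, c] < match_threshold for r, c in zip(rows, cols))
    precision = hits / len(algo_frames)
    recall = hits / len(turker_frames)
    return precision, recall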
>>: So what was the measurement for the Amazon Mechanical Turk [indiscernible] what
people do?
>> Raffy Hamid: Right. So basically we showed them a video and we asked them to come up with a set of frames from the video that they feel are the most representative
of that video. So for each one of the videos, I'm going to talk about the particular tests in a bit,
but for each of the videos that we used we asked several AMT turkers and we filtered a whole
bunch of them out and we kept only the top ten turkers and we used each one of these as an
instance of ground truth to figure out whether the particular algorithms are performing well or
not.
>>: Okay. Because I would argue that, and you have an awesome example there with the car
where I want to sell the car. I sent a set of nice photos that will do a good job to sell the car,
and then actually for me as a buyer the video, what was interesting was not the main features
shown, but maybe the scratches on the back which were not in the images.
>> Raffy Hamid: Right. So that is an excellent question and that is a very fair point. And the
idea here is while those types of -- we would only be able to do that at the moment in our
algorithm if we had that type of information available in our image corpus. So if some people
have shown scratches and have taken pictures of their scratches as individual images, only then would we be able to do that, and so right now we are relying very heavily on the content of the
image corpus. You are absolutely right, and one of the things we are looking at going forward is
how do we add additional information that is not present in the image corpus, so that’s a
completely valid point. To say a few words about the data set that we actually used to explore
this problem, we focused on the vertical of cars and trucks on eBay because this is the most
popular category for which people use videos to describe their products. We used half a million vehicle images as positive examples and we also used the PASCAL 2007 data set as the
negative examples. We downloaded 180 videos from YouTube of eBay sellers and we used 25
of these as training and 155 of them as testing. Again, in the interest of time, I won't be able to
go over too many numerical details of the results that we captured, but suffice to say that our
performance gains were greater than ten percent over the several benchmarks that we actually looked at, and we also did both quantitative and qualitative
analyses and if you are interested in knowing more about the details of the numbers, please
talk to me after the talk and I'll be happy to chat with you.
>>: [indiscernible] numbers just as the task. The task is take a video approximately ten percent
of your video corresponds to your original pictures associated with that car?
>> Raffy Hamid: Right. So the way that we are doing the evaluation is in terms of average precision, and the way it works is that you have a particular video and a particular summarization from a particular turker, and now you compare it to the results that were given by a particular summarization algorithm, and that gives you one instance of the average precision. You then average it over the entire set of turkers that you have; that's one value, and then you average it over the entire set of 155 testing videos that we used. Then we compared it against different types of summarization algorithms like uniform, random, K-Means, spectral and so on. In summary, I spoke about using web images as a prior to perform
large-scale summarization of user generated videos. I also spoke about the use of crowdsourcing for large-scale evaluation of these summarization algorithms. And again, we'd like to believe that this is a relatively general approach, and so one of the things that we are currently trying to look into is whether we can use these types of approaches to summarize a wider variety of user generated videos online, such as birthday parties and weddings. Any video or any class of videos where you have an image corpus that actually captures the geometry and general appearance of the scene can be approached using our method. So I just spoke about the characterization of actions, particularly looking
at figuring out which parts of the video are interesting or important. And
now I'd like to spend a few minutes talking about the characterization of activities, specifically
focusing on the problem of large-scale activity analysis and the practical problem that I want to
use to motivate this part of my talk is automatic video surveillance. Just to give you the big
picture of this part of the talk, this was some of the work that I did for my PhD, and in the
context of activity analysis at that time and also today, one of the main assumptions that
people make is that the structure of the activity that the system is supposed to detect is known
a priori. So for example, imagine you are making a dish in the kitchen and you want to make
a system that can recognize whether you are making that dish in the kitchen or whether you
are making that dish correctly in the kitchen. We have to make an assumption that the system
knows the structure in which that dish is made, so if you are making an omelette we would
need to know that first you are supposed to open the fridge, take some eggs out, heat up the
pan and so on and so forth. Now, this assumption works perfectly well if the number of
categories that you are looking into is relatively small, but it doesn't really scale well when you
have several categories, and in fact when you don't even know the number of categories a priori. That is the focus of this particular work: we want to be able to figure out the
main behaviors, the main types of behaviors in an environment in an unsupervised or minimally
supervised manner. You know, this was done over a course of five or six years and in retrospect
if I look back and were to concisely describe the main finding of the work, I would say
that it is that we should look at activities as finite sequences of discrete actions. As soon as you
start looking at activities like that, you realize that it is very similar to how researchers in natural
language processing have looked at documents, because they look at documents as finite
sequences of discrete words. Once you make that connection, you're able to leverage a lot of the representations and algorithms that these folks have developed over several years. I was able to identify this and therefore build a bridge between these two research communities, and the main idea here is to be able to learn about these
activities without knowing the structure and only by using the statistics of the local structures.
On that note I'll say a few words on the general framework that we used. Starting with a corpus of activities, we use some representation to extract sequential features of these activity sequences. We can define some notion of distance based on these event subsequences, using which we can find the different classes that exist in that environment. Once we have discovered these classes we can perform the task of
classification of a new activity and also figuring out whether something anomalous or irregular
is happening in that activity instance. While I explored several different sequence representations for my PhD, here I will only briefly mention the representation of event n-grams, where an n-gram is a contiguous subsequence of events. The idea here is that you go
find these contiguous subsequences and then based on the counts of these subsequences you
figure out what's going on. Just to show you some of the experiments that we did using the
notion of event n-grams to figure out how good or bad this representation is, we actually
captured some delivery data in one of the loading docks of a bookstore right next to Georgia
Tech. This is a Barnes & Noble bookstore and please don't ask me how we managed to
convince the manager of the bookstore to capture data over long periods of time, but in any
case we sort of placed two overlapping cameras to capture some of the vantage points of the
loading dock area. We captured activities daily from 9 AM to 5 PM for five days a week over a
month and we were able to capture 195 activity sequences. Just for fun I should mention this
was done at the end of the year and so by the end of this project I was very sick of listening to
Christmas songs and I couldn't bear to hear one more Christmas song.
>>: [indiscernible]
>> Raffy Hamid: Well, for me it certainly did. So using these 195 activities we randomly
selected 150 as training activities.
>>: What's an activity?
>> Raffy Hamid: That's a great question. Here an activity is defined as spanning from the time when a delivery vehicle enters the loading dock to the time when it leaves the loading dock. For these types of situations, and also kitchen situations, figuring out where the start and end of an activity are is relatively easy, but there are several situations where, you know, finding this start and end, or segmenting an activity out of a stream of events, is nontrivial, and
so I've done some work which I'll be happy to talk over later if you are interested, but that's an
open problem here; it's a nontrivial question. So we randomly picked 150 training activities and
45 testing activities and in this environment there were ten key objects. So an example of a key
object is a back door of a delivery vehicle, and also the size of the action vocabulary was 61. For
example, an example of an action in this environment would be a person opens the back door
of a delivery vehicle. Using these 150 training examples, here I'm showing you the adjacency
matrix of the un-clustered instances of these training examples and here I'm showing you the
same adjacency matrix reordered according to the clusters that we discovered in the data.
>>: What's the features?
>> Raffy Hamid: Right. So, as I mentioned, this is using event n-grams as the representation, and the features are really the counts of these n-grams. For an activity, the features are the counts of how many times each n-gram happened, and based on the difference in these n-gram counts you can determine the distance between any two instances of activities.
>>: And how do you find those n-grams?
>> Raffy Hamid: The way that we find those n-grams is just by parsing over the entire activity.
So you know the start of the activity; you know the end of the activity. You know which actions
happen. You go and find the first n-gram and count it as one; if you find it again in another instance, you add one to that count.
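For concreteness, here is a minimal sketch of counting event n-grams and turning those counts into a distance between two activities; the n-gram length and the particular normalized-count distance are assumptions, not necessarily the exact measure from the thesis.

from collections import Counter

def ngram_counts(event_sequence, n=3):
    # Count the contiguous event n-grams in one activity sequence.
    return Counter(tuple(event_sequence[i:i + n])
                   for i in range(len(event_sequence) - n + 1))

def activity_distance(seq_a, seq_b, n=3):
    # Histogram-style distance between two activities based on their n-gram counts.
    a, b = ngram_counts(seq_a, n), ngram_counts(seq_b, n)
    keys = set(a) | set(b)
    total = max(sum(a.values()) + sum(b.values()), 1)
    return sum(abs(a[k] - b[k]) for k in keys) / total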
>>: So there's 150 training examples. Someone has to manually go out and…
>> Raffy Hamid: No. That's not true. For this problem, since the inference we are making is at quite a high level, the thing that was provided to the system was only the sparse tracks of the locations of these key objects, and based on that we built action
detectors that in fact detected these actions automatically. The part of tracking was not done
for this project, but everything above that was done, so all the results that I'm showing you
have all types of errors that were incorporated using automatic steps for each one of these
individual steps. We were able to discover seven cohesive classes in our data and just to give
you some semantic notion of what a class implies in this environment, so the most cohesive
class that we were able to find consisted of all activities that were from UPS delivery vehicles.
There was no explicit knowledge about the fact that it is a UPS truck in our event vocabulary, so
the fact that we were able to find all the deliveries automatically that was of a particular type of
vehicle makes us believe that, you know, the low-level perceptual bias that we are adding to
our system is being successfully latched onto at the higher level. There were several activities
that did not get clustered into any one of the discovered classes because they were so different
from any of the other activities and we treat them as irregularities or anomalies. Just to give
you some examples of the detected anomalies: in the first case I'm showing you an anomaly where the back door of the delivery vehicle was left open, which can actually be very dangerous. In the second one I'm showing you an activity where more than the usual number of people were involved in the delivery, and lastly I'm showing you an example of an anomaly where a janitor is shown cleaning the floor of the loading dock, which is actually a very unusual thing to do. This brings us to an important point: while I am calling them anomalies, that is a misnomer. They are not really
anomalies. They are just unusual occurrences and so the idea in which this type of a system
could possibly be used is to just filter out the activities that are relatively suspicious and then
the human should be brought in the loop who would make the decision as to whether these are
just irregularities or actual anomalies. In conclusion, basically I started my talk identifying this
large gap that exists between the low-level perceptual data and the more interesting high-level
inference stage. I basically made a case for using these key objects, actions and activities as
intermediate characterizations to channel this information from the low level all the way to the
high level. My hope is that over the course of the talk I was able to somewhat convince you
that using these sets of key characterizations is at least a useful way to sort of channel this
information, particularly from the perspective of some of the problems that I had and that I
talked about. I should also mention that this is a very open problem, how do you bridge this
gap. This is one of the big problems out there and I'm not at all suggesting that this is the right
way to bridge this gap. In fact, much more research needs to be done before we can actually
get to that question. I also had the chance of working on several other industrial projects over
the last ten years or so and so I wanted to say a few words about them and if you are interested
in talking to me about it afterwards, please do. One of the first things I was involved in at eBay
was this app called eBay fashion app and we built this project called eBay image swatch as a
feature in this app and the idea here is content-based image retrieval, so suppose I like your
jacket. I take a picture of your jacket and I want to see if eBay has this type of jacket in its
inventory or not in really, really quick amounts of time. This project was done by a team of
three individuals and I was one of the three people and all the way from ideation to
productization we did it all ourselves. In fact, if you download the app right now the code that
is written is partly by me which is running on the server. This was a lot of fun, got some traction
both inside and outside of eBay and was featured on several news channels.
>>: What kinds of features were used for retrieval?
>> Raffy Hamid: That's a great question; unfortunately I cannot talk about it.
>>: [indiscernible]
>> Raffy Hamid: Yes. So please read the paper. [laughter]. Right. So those are the types of features that we are sort of sharing with other people.
>>: I see. I will leave it at that.
>> Raffy Hamid: Okay. Sounds good. Another project that I want to mention is from one of my internships at Microsoft Research back in 2007, where I had the privilege of working on this project called Microsoft RingCam, where the research problem was really audiovisual
speaker detection. While my work on this project was only over three months of time, it really
gave me the perspective of how to do research when you are working for a project that is
driven towards a product. And also I wanted to say a few words about the project that I did for
General Motors before actually starting my PhD career. Here the problem is that, in the case of an accident, we want to decide whether to deploy the airbag or not. The motivation is that in the U.S., for a nontrivial percentage of accidents, the fatality of the passenger actually happens because of the impact from the airbag and not the accident itself. The idea
here is we need to determine in real time whether we should deploy the airbag in case of an
accident or not. I wanted to say a few words about some of the future themes that I am
interested in. The big message here is that I am particularly driven by problems that have a
particular goal in mind, and so that is something that I have done in the past and hope to do in
the future as well. As far as some of the thoughts about some future research themes are
concerned, over the last couple of years at eBay I have had the privilege of working with some
folks from large-scale machine learning, large-scale data analysis side of the spectrum and I was
lucky to learn some of the stuff from these people. I am very interested in bringing that
knowledge back to the video analysis side of the spectrum and to figure out how can we use
those types of representations and learning mechanisms to do inference for problems in video,
because in my opinion video is one of the biggest sources of big data out there and we really
need to bridge this gap between these two communities. Another thing that I am super excited
about is this notion of mobile media computing. I really strongly feel that this is definitely going
to change the way that we understand computing. However, I feel that right now there are
really two trends that people are really focusing very heavily on. One is the capture side of the
spectrum and the other is the compute side of the spectrum. On the capture side, every year better and better cell phones come out which let you capture the experience in a smarter, faster way. However, we don't know which types of information you are supposed to extract from these capturing devices, because there is a wide gap that exists between the capture and the compute sides. It's not clear which pieces of information you should
communicate to the compute side of the spectrum and I really think that figuring out what are
important and interesting parts of the information so that you only communicate that part is
opening up really interesting sets of challenges. Last but not least, I'm actually very interested
in applying vision for robotics and especially in terms of human robot interaction. A lot of my
work in the past is about detecting and analyzing activities and actions of humans and for the
field of human robot interaction, it's really important for a robot to understand what the state
of the human is and what the likely next state of the human is, so I think that from that
perspective I can bring some value to the table when it comes to bridging the gap between
the field of computer vision and also human robot interaction. At the end I'd like to say a few
words to thank a lot of the people who helped me do all of this work that I presented today. At
eBay Research I collaborated closely with Dennis DeCoste. I had the pleasure of working with
Chih-Jen Lin and Atish Sarma. At Disney Research my postdoc manager was Professor Jessica
Hodgins. My PhD advisor was Aaron Bobick. Also I had a great time interning at several
industrial research places, especially Cha Zhang was my mentor at MSR and also I had the
privilege of mentoring quite a few wonderful interns and mentees and they really helped me do
a lot of the stuff that I have presented today. Okay. So thank you very much and at this point
I'll take more questions from you guys.
>>: Thank you. [indiscernible] ask questions [indiscernible]
>>: Did you get a chance to work with [indiscernible] information or video sequences?
>> Raffy Hamid: Not much so far. No, I have not done too much work with RGB-D. [indiscernible]
but I am very interested in working with that [indiscernible]
>>: Seems like it would help a lot for the software.
>> Raffy Hamid: Yeah. If you can get to that level, yeah.
>>: [indiscernible] structured domains so you have [indiscernible]
>>: I have another question. When you work with videos as big data, one of the major
problems I saw with videos was how to dig the data out of -- it's not, I would say the photos are
not well-organized but at least they have tags and they have sometimes GPS locations. Videos
are horrible. Just try to find all the videos that maybe tourists took in Venice and you'll get the
amount of noise that you get in your material there is enormous. Any thoughts about that?
>> Raffy Hamid: Yes. Several thoughts. First thought is absolutely, I agree with you. It's a very,
very difficult question. And I think that there are several ways to look at this and it really depends
on the context. I think that if you are looking at very explicit context where the environment is
known, then some sort of structured information can be added, but in general, from a generic
perspective I actually very strongly believe in using these images that have tags and Geo and
textual information available as a prior to actually figure out which part of the video do they
belong to, and if they belong to quite a large part of the video then using that information to
classify what is the content of the video. So I think the particular direction that I am very
interested in for the specific problem that you mentioned is to rely upon the image information
that we have out there. We have quite good quality image information, much better than the
user generated video and we also have very good amount of textual information, view
information available with these images. So the direction in which I would like to explore is
how to use that information for this problem. And I think that our summarization work is sort
of a baby step towards that and I hinted at using that type of approach for other problems as
well like figuring out wedding videos, figuring out birthday partys, things that are really event
related. We are going from very simple product video to a more complex event video, so I think
that would be some of my thoughts as to how to look at that.
>>: Okay. Thanks.
>> Raffy Hamid: Thank you very much. [applause]