
>> Dong Yu: So today we're glad to have Xiaodan Zhuang from USC to give us a talk on
audio-visual event detection.
Xiaodan is a Ph.D. candidate at UIUC. He received his master's degree in electrical engineering
from UIUC in 2007 and his bachelor's degree from Tsinghua University in 2005.
So without further ado.
>> Xiaodan Zhuang: Thank you for coming in the morning. So today I'll talk about modeling
audio and visual cues for real-world event detection. This is the topic I picked for my
dissertation as an umbrella to include a couple of projects I got involved in over the years.
I also prepared a couple of snapshot slides towards the end covering a couple of other
projects I was involved in that might be interesting to this audience.
So audio-visual event detection aims to identify semantically defined events that reveal human
activities. And we will agree here that while speech is the most informative source in the audio
stream, a lot of the non-speech events also tell you what's happening in your
environment.
For example, on the acoustic side, if you have events like door slam, steps, chair moving, key
jingle in this room, they tell you something about what's happening. And, in particular, if we
hear a very loud yawn here at the very first five minutes of the talk, then that's also something
that tells you about the talk.
For video events, it's also very apparent that if you can figure out what's
happening in the video, it is useful for multiple applications, like surveillance, human-computer
interaction, supporting independence and well-being of the aging population, video
annotation, and multimedia retrieval in general.
As useful as these applications are, they are relatively less studied than speech recognition.
These applications also have a shorter history. But they have been explored by different researchers
before. For example, many people worked on detecting restricted highlight events, such as a
gunshot in the street, an explosion in a movie, or goal cheering detection in a sports game.
Many times people leverage lower-level detectors, for example a car detector, a flag detector,
an office detector, and use these for video event detection. And in particular, for a
meeting room scenario like this, people would use a person tracker, a laptop detector, a face
detector, and a door activity estimator.
The benefit is that hopefully these lower-level detectors are transferable between
different tasks. The drawback is that they are usually ad hoc and need more annotation for
training.
So in this work we're trying to see what could be a generalizable and robust way to model these
cues for detection of real-world events. By real-world events we mean events that come
from realistic data that are not collected in a lab setting.
So my talk will be roughly in three parts. The major pieces are actually the first two parts. First
I'll talk about classification and detection of acoustic events. There are a lot of lessons learned from
the speech recognition community; I think this audience will find many of the methods
very similar to what we use in speech recognition, and they work so well because they have survived
decades of work in the speech recognition community.
So detecting and classifying acoustic events is a task of significance according to the CLEAR evaluation
that we were participating in. Activity detection and description is a key functionality of
perceptually aware interfaces working in collaborative human communication environments.
And these events usually help describe human activities. So the particular dataset that we
worked on has over 12 events such as door slam, steps, chair moving, key jingle. They are
difficult because we are talking about a set of general events in a very realistic
environment. We're not talking about one particularly salient event like an explosion or
something like that.
And also the temporal constraint between different events is much looser than in speech. There's
no linguistic constraint, of course. And they don't tend to happen
following a particular sequence.
The SNR is low. The particular dataset and setup we used relied on one far-field
microphone. So we haven't included any microphone-array beamforming in this setup. So the
far-field microphone picks up
background noise and also some background [inaudible] speech.
We propose leveraging statistical models proven effective in the speech recognition literature. In
particular we use a tandem connectionist-HMM approach to combine the sequence modeling
capability of the HMM with the context-dependent discriminative capability of an artificial
neural network.
So here we're talking about one HMM per event. So for 14 events, we'll have 14
HMMs. And in a rescoring stage we'll use a support vector machine with a Gaussian mixture model
supervector approach, which uses kernels that better approximate the KL divergence
between feature distributions in different audio segments. We'll cover that in a while.
So the way that we formulate acoustic event detection is exactly the same as in speech
recognition, in particular as whole-word sequence recognition. So you're basically maximizing
the joint distribution of the observation and the event sequence. And for feature
representation we are looking at a sliding-window-based sequence of local frame-based features.
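In formula form, this is the familiar MAP decoding rule (notation introduced here for clarity): find the event sequence E that maximizes the joint probability of the observed feature sequence O and E,

\[
\hat{E} = \arg\max_{E} P(O, E) = \arg\max_{E} P(O \mid E)\, P(E),
\]

where P(O | E) is given by the event HMMs and P(E) by an event-sequence prior, analogous to the acoustic model and language model in speech recognition.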
So temporal segmentation is one of the major challenges for this problem. People have tried
different ways to address this. For example, you could use sliding-window-based
approaches, basically detection by classification. And people have also come up with ideas similar to
a cascade of different sliding windows to try to improve accuracy. But from what we found, and
also from the results of the evaluation, using the Viterbi algorithm with HMMs to compute a
simultaneously optimal segmentation and classification of the audio stream is still the
best choice, we believe, and we illustrate this in the evaluation.
In particular we believe that the benefit is that the noise in these individual frames will be
alleviated by this learned prior of preferring self-transitions rather than non-self-transitions in the
hidden finite state machine.
And how do you model the temporal context, and does that matter in this particular application?
We explored this question by using an artificial neural network to observe a larger context window
for each hidden state. In particular, the feature vectors are observed in a context window, and
the outputs of the neural network are the posteriors, which are transformed and then fed into a
dynamic Bayesian model, in particular a hidden Markov model.
So this is also learned from the speech recognition community, but it doesn't always work in
speech recognition. I think the lesson people have drawn is that when the task has low
SNR and the models are relatively context independent, this tandem approach
usually gives you a more pronounced improvement.
And these two assumptions do hold true for our application, because the SNR is indeed low, and
also the models are less context dependent, because there's
nothing like [inaudible] speech models where you have a very rigidly defined context.
So basically we'll take a larger window of these frame-based features, take the output of the
neural network, transform it, and use that as input to the HMMs.
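A minimal sketch of this tandem pipeline, with scikit-learn's MLPClassifier standing in for the artificial neural network and hmmlearn's GaussianHMM for the per-event HMMs (both toolkits, the context size, and the layer sizes are illustrative assumptions, not the system described in the talk):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier   # stand-in for the ANN
from sklearn.decomposition import PCA               # decorrelating transform
from hmmlearn.hmm import GaussianHMM                # stand-in for the event HMMs

def stack_context(frames, ctx=4):
    """Concatenate each frame with +/- ctx neighboring frames (edges padded)."""
    padded = np.pad(frames, ((ctx, ctx), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)] for i in range(2 * ctx + 1)])

def train_tandem(frames, labels, n_events, ctx=4):
    """frames: (n_frames, n_dims) acoustic features; labels: per-frame event ids."""
    X = stack_context(frames, ctx)
    mlp = MLPClassifier(hidden_layer_sizes=(500,), max_iter=50).fit(X, labels)
    # log posteriors from the net, then decorrelate, as tandem systems usually do
    post = np.log(mlp.predict_proba(X) + 1e-10)
    pca = PCA(n_components=min(post.shape[1] - 1, 20)).fit(post)
    feats = pca.transform(post)
    # one HMM per event, trained on that event's frames (a true left-to-right
    # topology would additionally constrain the transition matrix; omitted here)
    hmms = {e: GaussianHMM(n_components=3, covariance_type="diag").fit(feats[labels == e])
            for e in range(n_events)}
    return mlp, pca, hmms
```

Decoding would then run Viterbi over the concatenation of these event HMMs to obtain segmentation and labels jointly, as described above.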
Another question is whether the temporal structure is very important or not. Because up to now we
use HMMs to model the events, and in particular the HMMs have a left-to-right topology,
we do want the different state sequences to capture the temporal structure within each proposed
event segment.
However, there's an alternative way to look at these audio segments. The audio segments, they
vary in length, but in essence they are all a set of local frame-based features. So looking at an
audio segment, whatever length it is, is looking at the joint distribution of all these frame-based
feature vectors.
The approach that we used here has also had success in speaker ID and language ID. So the
idea here is to look at the audio segment as a set of these local descriptors, and then we
approximate this joint distribution using a Gaussian mixture model. And then the parameters of
the Gaussian mixture model are used to construct a supervector, which becomes a
uniform-length vector representation for the variable-length audio segment.
And once we have this vector representation, it can be shown that putting it
into a linear kernel actually approximates the KL divergence between the audio segments, and
the kernel can be used in whatever classifier you choose.
So in particular the GMM supervector is the normalized stacked mean vectors from the different
components. And it can be shown that the KL divergence between the feature distributions
can be approximated by the Euclidean distance between supervectors, which is why we can use a linear kernel.
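Written out, the standard GMM-supervector linear kernel he is referring to (following the usual speaker-ID formulation; notation introduced here) is

\[
K(a, b) = \sum_{k=1}^{K} \Big(\sqrt{w_k}\,\Sigma_k^{-1/2}\,\mu_k^{a}\Big)^{\!\top} \Big(\sqrt{w_k}\,\Sigma_k^{-1/2}\,\mu_k^{b}\Big),
\]

where w_k and Sigma_k are the weight and covariance of the k-th shared Gaussian component and mu_k^a, mu_k^b are the means adapted to segments a and b; the Euclidean distance between these normalized stacked means bounds an approximation to the KL divergence between the two adapted GMMs.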
And there are some assumptions behind this being true; in particular we need the different
Gaussian mixture models to be obtained by adapting from a global model. So the scenario here
is that we'll train a UBM, or global model, using the frame-based feature vectors from all audio
segments that we have available in training, regardless of their categorical labels.
And then we'll adapt this global model into each different audio segment. So it's adapting to
each different audio segment instead of adapting to each category.
So in the end you have one Gaussian mixture model for one audio segment. So we're using this
Gaussian mixture model as an approximation to the joint distribution of all local descriptors
within one segment.
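A minimal sketch of the per-segment supervector just described, with scikit-learn's GaussianMixture as the UBM and a mean-only relevance-MAP update (the component count and relevance factor are illustrative assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(all_frames, n_components=64):
    """Global model (UBM) trained on frames pooled from all segments, label-free."""
    return GaussianMixture(n_components=n_components, covariance_type="diag").fit(all_frames)

def segment_supervector(ubm, seg_frames, relevance=16.0):
    """MAP-adapt the UBM means to one segment and stack them, normalized so that
    a plain dot product approximates the KL-divergence-based kernel."""
    post = ubm.predict_proba(seg_frames)                 # (n_frames, K) responsibilities
    n_k = post.sum(axis=0)                               # soft counts per component
    first = post.T @ seg_frames                          # soft first-order statistics
    alpha = (n_k / (n_k + relevance))[:, None]           # data-dependent adaptation weight
    mu_hat = alpha * first / np.maximum(n_k, 1e-8)[:, None] + (1.0 - alpha) * ubm.means_
    scale = np.sqrt(ubm.weights_)[:, None] / np.sqrt(ubm.covariances_)
    return (scale * mu_hat).ravel()                      # the GMM supervector
```

These supervectors then feed a linear-kernel SVM for the rescoring stage described next.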
>>: [inaudible]
>> Xiaodan Zhuang: The segment here we're talking about usually varies from one second to
ten seconds. It could sometimes be shorter depending on the particular event. Footsteps
usually are pretty long, but a door slam, a door shut, or a key jingle is not very long.
>>: So that adaptation [inaudible].
>> Xiaodan Zhuang: Yeah. So for adaptation, we're doing MAP adaptation in this experiment,
with a conjugate prior. Roughly, the more data you have from each segment, the better
the adaptation will be.
So as you may have noticed, for this approach we actually do need some predefined boundaries.
The way that we get around this is to have the first pass be the HMM-based Viterbi
decoding, and then with the hypothesized boundaries we'll be
able to approximate the joint distributions here.
So in the end we'll have MAP decoding, which is Viterbi decoding of the HMMs. And then with the
hypothesized boundaries and event labels we'll do confidence rescoring based on the
classification result out of the GMM supervector.
And in the end, when we're combining the two hypotheses, it also opens the possibility
to refine the output according to the particular metric of your problem.
So in this case, the AED metric I mentioned. The AED metric is basically the F score between
the hypothesized event sequence and the ground truth event sequence. So that gives you
something complementary to what MAP decoding finds for you.
Very quickly, I'll go over some of the results here. This was on the CLEAR 2007 acoustic event
detection data. There are 12 general meeting room non-speech events, such as keyboard typing,
cough, and chair moving. Basically we could improve the baseline result, which is the
HMM-based result that we submitted to the evaluation, by having better temporal
context modeling, namely the tandem model, and by harnessing the complementary information
from the GMM supervector model in the rescoring phase here.
So this is saying even if we don't have this very rigid temporal structure captured by the hidden
state sequence, there is at least some complementary information coming from looking at the audio
segments in a different way. And the two things are kind of additive to each other and can
further improve your result.
So then I'll talk about our experiences in categorization and localization of video events.
>>: Question [inaudible].
>> Xiaodan Zhuang: Oh, the features here -- yeah, I didn't talk about features here. But if you,
say, need some results tomorrow, I'd just recommend using MFCCs or PLPs. The particular
features that we used were kind of engineered towards this task. In particular, we bring together
MFCCs, PLPs, and filter banks with different parameters.
So I build a pool of well-engineered speech features, and then we de-correlate these features and
select a subset of the features out of the pool according to some boosting-based method or
according to the best error for individual features.
>>: [inaudible]
>> Xiaodan Zhuang: We haven't used that. Some other participants actually used that --
>>: [inaudible]
>> Xiaodan Zhuang: Yeah. We haven't explored this in particular. And, frankly, I have to say,
about the reason that we performed better than the other participants: we don't have proof
whether it's on the feature side that we're doing better or not. We believe it's more on the
modeling side, that we're doing everything right and in a reasonable way, that gives us an
advantage.
Yeah. We --
>>: [inaudible]
>> Xiaodan Zhuang: Yeah. Or we kind of derive features out of a pool that consists of
MFCCs, filter banks, and PLPs. The major motivation is that we want to leverage what the
community has accumulated over the decades, and we believe these features are the best choice
if you have limited time.
>>: [inaudible] model and the tandem model, right? Another way to think of either to have
[inaudible] features.
>> Xiaodan Zhuang: Yeah. So we haven't explicitly tried that. I think the only
sense in which we might have some kind of multiscale temporal thing happening here is that in the
first part the HMM has this neural network observing a larger context, whereas in the rescoring
phase we just used the local frame-based features.
But it's complementary in different ways. In one way it does not have this hidden rigid state
sequence. And also, as I said, the GMM supervectors observe the frame-based
features directly, not with this [inaudible]. So let's move on.
So I'll talk about categorization and localization of video events. This is centered around an improved
image and video representation called the Gaussianized vector representation. But as we go on,
I think this audience will find a lot of these things very similar to what we just saw with the GMM
supervector, because we are actually inspired by what's happening in the speech and
audio models, to see whether we can look at images and videos in a similar way.
So in particular the Gaussianized vector representation is our attempt to say can we model
images and videos without explicitly worrying about segmentation or particular object detection
within that image.
So the Gaussianized vector representation adapts a set of Gaussian mixtures according to the set
of patch-based descriptors in an image or video clip. And adaptation will be regularized by a
global Gaussian mixture model, and also the final linear kernel will be improved by within-class
covariance normalization.
And because of the success that we had in doing work like categorization and regression, we're
thinking about what would be a reasonable way to apply this representation to object localization. So
in the end we adopt an efficient branch-and-bound search scheme, and this could potentially be
used to identify regions of interest in the video corresponding to different events.
So one of the major problems in computer vision, probably in any machine learning
problem, is that you want to find the correspondence between data samples. So if we're looking at
different faces here, we could easily see that if we want to compare the different faces, you
probably want to compare eyes to eyes and mouths to mouths.
But if you look at some more realistic images, like a bedroom image or a broadcast news image,
even if we see the bedroom probably has a bed inside, it's less apparent what kind of
particular correspondence we should find between the two images.
And if we look at a video, a broadcast news video, even if this event is categorized as car
entering, we don't necessarily see a complete car in the image. So that makes any effort
based on detecting a particular car very challenging, if doable at all.
So visual cues for real-world events present extra challenges for correspondence compared to
some other tasks like face processing.
So the way that we look at image and video modeling is to look at them as a set of local patch
descriptors. A patch is a very small region in the original spatial domain. There are two
major ways that people have used to extract patches. The first way is that you
do a 2D sliding window, with patches extracted from a dense pixel grid, and then for each
patch you have some descriptor based on that particular patch.
Another way is to have some kind of low-level detector. For example, a SIFT detector would
actually find regions where there is contrast or an edge, basically something where hopefully there
is information.
But at the end of the day, you get a set of local descriptors anyway. So the image or the
video clip becomes a set of local descriptors.
The most popular way to deal with this set of local descriptors is called a histogram of keywords.
Basically you use K-means clustering to establish a codebook, and then the histogram of
counts over the different codebook entries becomes the information that you carry for
that image. So a very complex image like this, or a video clip like this, becomes a
histogram in the end.
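For reference, a minimal sketch of that histogram-of-keywords baseline, using MiniBatchKMeans as the codebook learner (the vocabulary size is an illustrative assumption):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(all_descriptors, vocab_size=500):
    """K-means codebook over local patch descriptors pooled from training images."""
    return MiniBatchKMeans(n_clusters=vocab_size).fit(all_descriptors)

def keyword_histogram(codebook, descriptors):
    """One image or video clip -> normalized histogram of codeword counts."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```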
There are large quantization errors and a loss of discriminability, so we're trying to see what we
can do to improve beyond that.
Oh, and I have to mention that there are improvements over the histogram of keywords itself. What
I present here is a kind of simplified version of different things.
The way that we address this problem is that we still start from this set of local descriptors in the
original feature space. And then, the same as in audio, the information that we need from
the image or video clip is basically the joint distribution of all these local descriptors. To be
able to approximate this joint distribution, and hopefully be robust to noise, we use a Gaussian
mixture model to fit this ensemble of local descriptors.
And then in the end you have a Gaussian mixture model that approximates the joint
distribution here. And then comparing different images is nothing but
comparing the different Gaussian mixture models, each corresponding to one
image.
And in particular you take the normalized means of the Gaussian components to form a
supervector, each corresponding to one image or video clip.
>>: So you also [inaudible].
>> Xiaodan Zhuang: Yeah. So you need to have a global model so that the KL
divergence can take this simple form.
>>: [inaudible] to each image or --
>> Xiaodan Zhuang: Adapt to each image or each video clip. So we use this for different kinds
of scenarios. One is that for each image you have a set of local descriptors. The other is video clip
classification, where you basically have a set of local descriptors extracted from all the images in
that video clip. But we haven't worried about the temporal structure of the image sequence
within the video clip.
So we have a global model that is trained using local descriptors from all kinds of images and videos
available to us, regardless of their categories. And the adaptation is done using local
descriptors extracted from one image or one video clip. The adaptation can be done with a
conjugate prior just to help establish this correspondence with the original global model.
And it can be shown that the distance between two images or video clips can be characterized
by the approximate KL divergence. And in particular you could see that if you write these stacked
means, normalized in the proper way, the KL divergence can be approximated by the Euclidean
distance.
So the kernel function here takes an interesting form. It's linear, but each of these phis here is
actually usually of very high dimension. In our experiments we would easily take 500
Gaussian components for broadcast news video. And maybe for a simpler dataset you could take
a smaller number of Gaussian components. Because the kernel is linear, it's less of a problem
computationally.
To wrap up this part, the Gaussianized vector representation has a couple of advantages
compared to the original histogram of keywords representation. First, the hard assignment to the
different keywords is replaced by a soft assignment to the different Gaussian components
according to the posterior. And then we could leverage the Mahalanobis distance instead of a
Euclidean distance,
although, for this representation, the Mahalanobis distance can be written as a Euclidean
distance if you normalize properly.
And also multiple orders of statistics are taken into consideration: in particular, compared to only
the counts, which are zeroth-order statistics, we could also use the adapted means and take into
consideration the covariance matrices.
So a way to visualize this is in place of the very simplistic histogram of keywords representation,
the Gaussianized vector representation would not only establish the correspondence through the
different Gaussian components [inaudible] visualization, but also according to how well all these
local descriptors are adapted into these different modes and what their distribution is around the
different modes.
So a key point here is that the way we do this, we don't want to
establish any hard segmentation, because that's very hard for realistic data. We're basically
leveraging only the Gaussian components to give us that kind of correspondence.
But for data that is very structured, like human faces, can we use something beyond a mixture of
Gaussians to set up this correspondence? The answer is yes. The most straightforward way to do
this is to use a hidden Markov model.
And, for example, we tried this on face images. If you take a hidden Markov model, even a very
simple left-to-right Markov model, and you arrange your local patch descriptors in a [inaudible]
style such that you observe first the eyes and then the nose and then the mouth, then a very
simple left-to-right HMM would give you some additional performance improvement, because
the hidden states here capture the structure that is very consistent in face
images. And this is not usually true for broadcast news or general video, but for faces it's
true.
Another way to look at why the Gaussianized vector representation has very good practical
performance is the following approximation. First, we can calculate the posterior of an observed
patch coming from the k-th mixture, and then we can
randomly distribute the observed patch into M Gaussian components or classes by a multinomial
trial [inaudible] posterior probability.
So in the real world we're doing MAP adaptation, so it's kind of a weighted summation. But if
you look at it as a multinomial trial, then you would be able to write out all the equations and see
that the final supervector actually follows a standard normal distribution.
This has a benefit because a lot of the later processing in computer vision usually takes this as
an assumption. For example, if people need to do PCA to reduce dimension, or need to
calculate distance as a Euclidean distance, all those processes are assuming that the feature
vector lives in Euclidean space. So this representation partially satisfies that assumption, and
we believe this might be the reason why it gives you a performance boost.
So up to now we have used no categorical supervision for deriving these representations. Of
course you can use the kernels in an SVM and introduce this supervision. But before that we
would like to do something else to leverage the categorical labels that are so expensive to obtain
in the first place.
In particular we use within-class covariance normalization to identify a subspace of the
kernel that has maximum inter-image or inter-video distances within the same categories.
So basically this term here is calculating the distance between different images if they are
coming from the same category. And then in the end you get the covariance
contributed by all these pairs that come from the same category.
The refined Gaussianized vector kernel would suppress this undesired subspace because it tells
us nothing about the target category labels. In particular we can suppress that subspace by
subtracting it from the original identity-covariance kernel matrix here.
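A minimal sketch of within-class covariance normalization on the Gaussianized vectors, following the common recipe of whitening the kernel by the pooled within-class covariance (the talk describes the same idea as suppressing that subspace; the implementation details here are assumptions):

```python
import numpy as np

def wccn_projection(vectors, labels, eps=1e-3):
    """Return a projection B such that the refined linear kernel is (Bx)^T (By)."""
    X, y = np.asarray(vectors), np.asarray(labels)
    dim = X.shape[1]
    W = np.zeros((dim, dim))
    for c in np.unique(y):                       # pool covariance of same-category vectors
        Xc = X[y == c]
        if len(Xc) > 1:
            W += np.cov(Xc, rowvar=False) * (len(Xc) - 1)
    W = W / len(X) + eps * np.eye(dim)           # regularize the within-class covariance
    B = np.linalg.cholesky(np.linalg.inv(W)).T   # B^T B = W^{-1}
    return B

# refined kernel entry between images a and b:  float((B @ phi_a) @ (B @ phi_b))
```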
So we're comparing this with the then state-of-the-art paper on video event
categorization and retrieval. This is the paper that gave the best performance for this particular
task, but it was also a very representative way of how the computer vision community looks at this
problem.
The particular approach here is that they divide the video clip into subclips in a hierarchy.
So maybe this is one video clip, and the other is another one. The
essential problem is how we characterize the distance between the two video clips.
The way many people look at this is: I want to find the correspondence between the two
video clips. So I want to see which frame corresponds to which frame. This is temporal, but
potentially you could also do which local descriptor corresponds to which local descriptor
spatially. But that's the same idea.
So in particular you could establish a kind of hierarchy to look at the video clip at different
temporal resolutions, and then you try to find pairs between the grids in the different video clips
to establish this correspondence based on some optimization. And then once you establish
that correspondence, hopefully when you calculate the distance between the two video clips,
you are comparing apples with apples and pears with pears.
However, we're showing that even if you don't have that explicit correspondence step in the
computation, you could also get reasonable and even better results using something like the
Gaussianized vector representation.
So in particular we're using the Gaussianized vector representation with
different classifiers. We could show that even with a nearest neighbor classifier it's actually
performing pretty well compared with the TAPM approach, where people actually use a support
vector machine.
So if we combine the Gaussianized vector representation with the support vector machine, that
gives you even more pronounced improvement. And also we could get benefits by taking
within-class covariance normalization to refine the kernel in the supervector space.
Another --
>>: [inaudible]
>> Xiaodan Zhuang: This is the TAPM result, the temporally aligned pyramid matching.
>>: So you did [inaudible]?
>> Xiaodan Zhuang: It's a separate thing, because we actually didn't worry about the explicit
alignment between the images. And we believe the alignment they actually did find sometimes
makes sense, but not always.
So I think the idea here is not to say we don't want any alignment in the end, but to say that
if the alignment is very hard to do, can we do something reasonable such that even without
alignment we can perform comparably. The answer seems to be that we could perform even better
on this particular problem. So that's one of the reasons we're
very excited about this result, and we have seen similar findings in different applications.
>>: So basically alignment event [inaudible] your system?
>> Xiaodan Zhuang: No.
>>: So do you think if you one day integrate that on top of your results you can get better --
>> Xiaodan Zhuang: This -- yeah --
>>: Or you already have used some of that concept [inaudible].
>> Xiaodan Zhuang: Yeah. We could have already harnessed some of this information. This is
a similar problem: if we do it our way only using the GMM, or using a multistate
HMM, do you think the multistate HMM would always be better than the single-state HMM?
The answer is maybe not always. So for face processing, where there's a very strict temporal
structure, and this temporal structure is easy to capture, using a left-to-right topology you can
capture part of it.
If that's the case, then probably a more complicated model would actually do you good. But for
broadcast news, from their experience, they found the correspondences are not
always what they intended. So we're not sure -- maybe a version of that, combined with ours,
there is always a chance to improve, but the idea of doing segmentation in the first place
can sometimes be a dangerous one. It's just like asking: do you always want to do speech
recognition by doing phone sequence recognition first and then trying to turn that into a whole
word sequence? It's a similar philosophy.
But we did use some of the other alignment methods that are popular in the computer vision
community. For example, they would do a hierarchical pyramid over the spatial domain and
make some assumptions, for example, that the left or right corner of one image probably also
corresponds to the left or right corner in another image, things like that. And in some applications
we actually do find improvement from that.
>>: [inaudible].
>> Xiaodan Zhuang: No, this -- yeah, yeah. For this one, yes. Yeah. Here we are
essentially assigning labels to each video clip, yes.
>>: [inaudible].
>> Xiaodan Zhuang: Yeah, exactly. Yes. So we do have a side benefit, which is that because we
look at the image and video, as I said, as a set of local descriptors, it's less prone to occlusion. So
even if you occlude -- even if you only use 20 percent of the local descriptors, your performance
doesn't drop much.
But there is a warning here, because this 20 percent is purely randomly sampled. So if you very
carefully occlude all the important local descriptors, then it probably won't work as well.
But just as a note here, if we randomly sample 20 percent of the local descriptors, the result will
be pretty much intact.
>>: [inaudible].
>> Xiaodan Zhuang: Yeah, in this one we're using --
>>: [inaudible]
>> Xiaodan Zhuang: Not for this one, no. Yeah. We used some other features for other tasks,
but the idea of this modeling background is similar.
So we want to see whether we can use this for localization, because it gave us a lot of advantages
in doing categorization and retrieval. But all these things are about global information based on
whole images. So can we do something similar for localization, which is to find a particular object
within an image and then tell how large it is.
So we adopt a branch-and-bound scheme that was proposed by Lampert to allow efficient
object localization that achieves the global optimum on the order of n squared instead of n to the
fourth. And also this scheme has been successfully applied to histograms of keywords, so that
gives us a way to preliminarily compare with what other people have done using similar approaches.
So in particular, for the branch-and-bound scheme, regardless of the detailed nuances here, the
basic idea is that you want to search all these n-to-the-4th rectangles, but not one by one.
You want to rule out sets of them if you have a good bound telling you what the upper
bound for that particular rectangle set is. If a set's bound is worse than the bound of
another set, then you can always start with the set with the highest bound. And if the bound
is good enough, then you won't miss the target rectangle that you're hunting for.
In particular, you initialize an empty queue of rectangle sets and initialize the first
rectangle set to be all rectangles. So if we characterize a rectangle by its top, bottom, left, and
right, the set is initialized with the top, bottom, left, and right each ranging over the complete
scale of its dimension.
Then we obtain two sets by splitting the parameter space along the largest of the four
dimensions, and push these two rectangle sets into the queue with their respective
quality bounds, which we'll talk about in a minute. Then we update R by taking the set
with the highest quality bound. The sets will shrink and shrink, and in the end
a set will contain only one rectangle. That's the one that we're trying to find.
In order to introduce the quality bound, we'll first talk about the quality function, which
gives the confidence of the evaluated subarea being the target object. We're actually
using the output of a binary SVM in the standard form.
The bound should satisfy two conditions. It should be larger than the quality function
evaluated at any of the member rectangles. And if the set has only one member, then
the bound should be equal to the quality function evaluated on that single member.
First we can see the quality function can be rewritten, with approximations, as
the summation of contributions from each local descriptor. In particular we
define a per-feature-vector contribution here, so in the end, if you are to
evaluate a particular rectangle, that's nothing more than evaluating the summation of the
contributions coming from all the member patches within that subarea.
So if we can write it in this linear format, then the quality bound can be written as the
summation of the positive local contributions from the largest rectangle plus the summation of
the negative contributions from the smallest rectangle.
So this is how we evaluate the quality bound for a rectangle set: essentially all the
positives coming from the largest member and all the negatives coming from the smallest
member. And it's easy to verify this satisfies the two conditions that we just set forth.
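A minimal sketch of this branch-and-bound search in the spirit of Lampert's efficient subwindow search, assuming a precomputed 2D map of per-location contributions (positive and negative); integral-image speedups and the tie to the SVM weights are omitted:

```python
import heapq
import numpy as np

def quality_bound(contrib, rect_set):
    """Bound for a set of rectangles given as (top, bottom, left, right) ranges:
    positives summed over the largest member, negatives over the smallest member."""
    (t0, t1), (b0, b1), (l0, l1), (r0, r1) = rect_set
    pos, neg = np.maximum(contrib, 0.0), np.minimum(contrib, 0.0)
    big = pos[t0:b1 + 1, l0:r1 + 1].sum()                        # largest rectangle
    small = neg[t1:b0 + 1, l1:r0 + 1].sum() if (t1 <= b0 and l1 <= r0) else 0.0
    return big + small

def branch_and_bound(contrib):
    """Return the (top, bottom, left, right) rectangle maximizing summed contributions."""
    h, w = contrib.shape
    root = ((0, h - 1), (0, h - 1), (0, w - 1), (0, w - 1))      # full ranges for t, b, l, r
    heap = [(-quality_bound(contrib, root), root)]
    while heap:
        _, rect_set = heapq.heappop(heap)                        # set with the highest bound
        sizes = [hi - lo for lo, hi in rect_set]
        if max(sizes) == 0:                                      # shrunk to one rectangle
            return tuple(lo for lo, _ in rect_set)
        i = int(np.argmax(sizes))                                # split the largest dimension
        lo, hi = rect_set[i]
        mid = (lo + hi) // 2
        for half in ((lo, mid), (mid + 1, hi)):
            child = rect_set[:i] + (half,) + rect_set[i + 1:]
            heapq.heappush(heap, (-quality_bound(contrib, child), child))
```

Here contrib[i, j] plays the role of the per-feature-vector contribution described above, accumulated onto the pixel grid.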
>>: [inaudible]
>> Xiaodan Zhuang: So this is -- so you're talking about how this per feature contribution is
calculated?
>>: [inaudible]
>> Xiaodan Zhuang: For each [inaudible].
>>: Each vector.
>> Xiaodan Zhuang: We use the Gaussian mixture models to approximate one distribution for
one image in the video clip. But there's no class information.
>>: [inaudible]
>> Xiaodan Zhuang: Yeah, there's no class information here. And of course you could say, why
don't we do this based on classification of each local descriptor. But I'm pretty sure
that would not work, because a single local descriptor is very noisy. So if we want to do
classification based on that, it usually doesn't work as well.
So we are using the Gaussian mixture model to approximate the joint distribution within
one image, and we hope this smooth distribution modeling can actually suppress some of the
noise that's coming from the local descriptors. So that's why.
But if you look at this per-feature-vector contribution here -- that's what I was about to say -- the
calculation of all these things is essentially the same as calculating the adaptation of a
global model to this particular image with all the local descriptors.
For the per-feature-vector contributions, you need to do this for all the local
descriptors. But that's the same thing as doing the adaptation, where you already sum up the
adaptations of each feature vector according to the posteriors of the different Gaussian
components. So computationally it's very similar.
Getting the complete set of per-feature-vector contributions is computationally similar
to doing the adaptation for the whole image, just because you can write it and approximate
it in a linear way.
>>: In the previous [inaudible] you're talking about the adaptation, right? And just based on the
comparison between one video clip and another video clip, it seems not very scalable if you have a
lot of labeled [inaudible] different events; you have to compare with all the different events, right?
>> Xiaodan Zhuang: Oh, so the comparison -- so once we get this Gaussianized vector in place,
then we'll basically just calculate the kernel matrix.
>>: [inaudible]
>> Xiaodan Zhuang: Basically you'll have a multiclass --
>>: [inaudible]
>> Xiaodan Zhuang: Yeah. That's the same idea as if you have multiclass SVM for
classification of multiple categories. Right?
>>: [inaudible]
>> Xiaodan Zhuang: Yeah, you need to compute with each of the training ones. Right.
>>: You really don't model an event.
>> Xiaodan Zhuang: You model event only in the kernel space. Yes.
>>: [inaudible]
>> Xiaodan Zhuang: Yeah. With the kernel space you model the events.
>>: So [inaudible] I have, I don't know, [inaudible] event and I have a hundred video clips.
>> Xiaodan Zhuang: For [inaudible].
>>: For [inaudible]. So you have to compare with all the --
>> Xiaodan Zhuang: Yes, that's right.
>>: But in this [inaudible].
>> Xiaodan Zhuang: For localization, you're actually doing the same. So for localization, one
thing that I probably should have emphasized is that it's basically based on a two-class
classification problem. It's either the foreground, the target, or it's the background.
So say you have 500 positive foreground training tokens and 500 negative background images;
you train this binary SVM, and then at testing time you still have to evaluate the
kernel using the one test vector with all [inaudible].
>>: [inaudible] modeling and model this probability and the likelihood of each feature to each
event.
>> Xiaodan Zhuang: Not really [inaudible].
>>: That's confusing part.
>>: [inaudible]
>>: So how will you do that?
>> Xiaodan Zhuang: So the only difference in this regard is a binary SVM versus the
multiclass one. You calculate the kernel matrix, and then it's the training of
the SVM that deals with the categorical supervision. So the only
thing that we change here is how you calculate the distance between two images.
>>: [inaudible]
>> Xiaodan Zhuang: Yeah, so you're still looking at this kernel matrix, which is always looking
at two individual images, one coming from training and one coming from the testing.
>>: In your training you have all the label that the [inaudible].
>> Xiaodan Zhuang: Yeah, yeah.
>>: I see.
>>: So then you SVM.
>> Xiaodan Zhuang: Yeah. SVM is the way where the supervision comes into play.
>>: That's really where you sort of use the [inaudible] foreground objects.
>> Xiaodan Zhuang: Yeah.
>>: And so the feature vectors [inaudible].
>> Xiaodan Zhuang: Yeah, yeah, yeah.
>>: [inaudible]
>> Xiaodan Zhuang: Yeah, yeah, yeah.
>>: [inaudible]
>> Xiaodan Zhuang: Yeah.
>>: So that's [inaudible] you don't really need separate classes, you just -- do you do the SVM
per event, per class, or [inaudible] SVM for another type of event [inaudible]?
>> Xiaodan Zhuang: So the way that we do it practically is that we get this one supervector, this
vector representation for one image. So for each -- let's first talk about the video event
categorization problem. So we have one vector for one image, and then we also know the label
for that vector. Then we throw this into a multiclass SVM training program.
So for localization, we have different images -- some of these are target, some of these are the
background. And each of them would be converted into a vector. And then we throw this into a
binary SVM.
>>: So that's not -- it's not the same as [inaudible].
>> Xiaodan Zhuang: But when you throw it into the binary SVM training, you basically throw
the kernel matrix in. Right.
>>: Sure. I'm just saying that your classifier [inaudible] it's not just [inaudible] to compare --
>> Xiaodan Zhuang: Sure, the classifier itself is supervised. So that's just SVM.
>>: [inaudible] the way you do classification is different from [inaudible].
>> Xiaodan Zhuang: So for localization, the core is a binary classifier. But for localization, the
problem is that you have to do binary classification for many hypothesized rectangles. So the
way that it looks different here is because you organize it in this linear summation so that you
can reuse a lot of the computation, and also you can bound a subset of hypothesized
rectangles and then throw them away if their bound is bad enough.
But at the core, it's actually binary classification. It's to tell whether this subarea is the
foreground or background. So in that sense it's the same as [inaudible].
Okay. Yeah. Here are some examples of how things look for localization. So basically you
identify the area correctly if you collect these locally positive red areas and fewer of these
negative blue areas. And sometimes it goes wrong because the red areas are not as clustered
together as we want, or sometimes we count it as a wrong detection when two objects get
combined into one.
We compare this with a similar work using the histogram of keywords based
on this branch-and-bound scheme. It's consistently improving beyond the histogram of
keywords. And we can also see that doing within-class normalization can further improve the
result. All these Ns denote the within-class covariance normalization.
So I'll quickly go through the third part, which is relatively short anyway. This part is trying to
say that acoustic event detection is very hard, and even if we did reasonably well, the performance
is far from usable from a practical perspective. So people are trying to look at whether we can
engage video to help, just as we engage lips to help speech recognition.
In previous work, people have proposed a detection-by-classification system, and
that system uses a lot of extra information, including audio spatial localization, multiperson
tracking, motion analysis, face recognition, and object detection.
The drawback is that training all these ad hoc visual detectors is very expensive. And we're
trying to see whether we can get around this and do something reasonable such that the
performance is at least comparable.
The way we look at it is that all these non-speech audio-visual events are mostly related to motion,
so we're using visual features based on optical flow. And then we summarize all these
optical flow firings using an overlapping spatial pyramid histogram.
So basically we look at the complete image and calculate a histogram of the local optical flow
firing magnitudes; we also look at a two-by-two grid and extract four histograms,
and then we look at a three-by-one grid and extract three histograms.
And then we stack all these histograms together, do a decorrelation, and use that as the
video feature vector for each frame. And that goes together with the MFCCs or
something like that for the audio stream.
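A minimal sketch of the per-frame visual feature being described, with OpenCV's Farneback dense flow standing in for the optical flow estimator (the grid layout follows the talk; the bin count, magnitude range, and flow parameters are illustrative assumptions, and the decorrelation step is left out):

```python
import numpy as np
import cv2

def flow_magnitude(prev_gray, cur_gray):
    """Dense optical flow between consecutive gray frames -> per-pixel magnitude."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return np.linalg.norm(flow, axis=2)

def pyramid_flow_histogram(mag, n_bins=8, max_mag=10.0):
    """Stack magnitude histograms over the 1x1, 2x2, and 3x1 grids of the frame."""
    h, w = mag.shape
    cells = [mag]                                                    # whole image
    cells += [mag[i*h//2:(i+1)*h//2, j*w//2:(j+1)*w//2]              # 2x2 grid
              for i in range(2) for j in range(2)]
    cells += [mag[i*h//3:(i+1)*h//3, :] for i in range(3)]           # 3x1 grid
    hists = [np.histogram(c, bins=n_bins, range=(0, max_mag))[0] for c in cells]
    feat = np.concatenate(hists).astype(float)
    return feat / max(feat.sum(), 1.0)
```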
To combine these we use the multistream HMM and the coupled HMM. In particular, with the
coupled HMM there's a lot of hassle in initializing it so that the output PDFs become reasonable.
So we tried two different schemes: one is to initialize using a
state-synchronized multistream HMM pair, the other is to initialize using pairs of audio-only
and video-only HMMs.
>>: [inaudible]
>> Xiaodan Zhuang: Yeah. So the coupled HMM -- in the multistream HMM, the audio state
sequence and the video state sequence progress at the same pace. So the first audio state
goes to the second, and at the same time the video goes to the second.
In the coupled HMM, they can go asynchronously from the perspective of the state sequence.
So maybe the audio state has already progressed into the second one while the video state is still
residing in the first one.
But actually you could look at a coupled HMM as an HMM with more states. The equivalent
HMM has a larger state space, because each of the states in the equivalent HMM is defined on both
the audio state variable and the video state variable of the coupled HMM.
So in our experiment we're actually allowing asynchrony of up to one state, so that you don't
grow this topology into a huge structure.
So in the end it's about how you initialize the output PDFs of the coupled HMM or its
equivalent HMM. When I say we initialize using multistream HMMs, we are basically
initializing assuming the audio and video states should always be synchronized at the
[inaudible] level, and then we combine the PDF of the first audio state with the first
video state.
And we also have another state that combines the first audio PDF with the second video PDF.
So in that case we construct this HMM equivalent to the coupled HMM.
Another way to do this is to initialize using an audio-only HMM and a video-only HMM, and then
put their states together. And once we put together this coupled-HMM-equivalent
HMM, we can do re-estimation based on the new topology. That tends to converge more easily;
if you directly train on a [inaudible], it pretty much always fails, and most of the states
never get traversed.
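A minimal sketch of the equivalent-HMM view just described: take the product of the audio and video state chains, keep only pairs whose indices differ by at most one state (the asynchrony limit mentioned), and initialize each product state's output PDFs from single-stream models. The data structures are illustrative:

```python
from itertools import product

def coupled_states(n_audio, n_video, max_async=1):
    """Product states (a, v) of the audio and video chains with limited asynchrony."""
    return [(a, v) for a, v in product(range(n_audio), range(n_video))
            if abs(a - v) <= max_async]

def left_to_right_transitions(states):
    """Allowed transitions: each stream either stays put or advances by one state."""
    allowed = set(states)
    return {(a, v): [s for s in ((a + da, v + dv) for da in (0, 1) for dv in (0, 1))
                     if s in allowed]
            for a, v in states}

def init_output_pdfs(states, audio_pdfs, video_pdfs):
    """Initialize each product state from the corresponding single-stream PDFs,
    taken either from a state-synchronized multistream HMM or from separately
    trained audio-only and video-only HMMs."""
    return {(a, v): (audio_pdfs[a], video_pdfs[v]) for a, v in states}
```

Re-estimation with standard EM on this product topology then proceeds as described.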
So the answer, in terms of whether we can get performance similar to using all these lower-level
detectors, is absolutely yes. We could actually do better than their approach. The benefits
come, of course, from multiple places. First, using a dynamic Bayesian model to do Viterbi
decoding gives you an advantage.
And also, when you characterize the video using this generalizable feature, that can
sometimes even give you an advantage over using the very localized low-level detectors.
However, the two ways to initialize the HMM actually turn out to be not
significantly different. A caveat is that it is important to initialize them reasonably,
but which of the two schemes you use doesn't seem to make much difference here.
Okay. I think I need to wrap up here. Basically I talked about audio event modeling, with a lot
of lessons from speech recognition, and video cue modeling, where we want to ask whether we can
do these jobs reasonably without going through all this very painstaking segmentation. The answer
is yes, and in many cases we actually do better. And then we could improve acoustic event
detection by using audio-visual multimodal modeling.
Thank you. That's a picture of the group, though incomplete. Maybe you would recognize a couple
of faces of people who worked here on this campus for a while.
And I have a few very quick shots, but I think we're going over time. Should I take five more
minutes or --
>> Dong Yu: [inaudible]
>> Xiaodan Zhuang: Oh, okay. Some other projects -- so this is a very
interdisciplinary group, where the speech group and video group work very closely together, and I
also had experience working with phonologists who are interested to see whether a phonological
theory works in a speech recognition setup, or whether we can devise a computational model
according to their theory. So that gave me exposure to a couple of different projects that might be
related to something happening here.
For example, we worked on pronunciation modeling basically following the [inaudible]
phonology theory. According to this theory, a word is pronounced as this word because
there's a unique set of gestural targets behind it, defined on different tract variables. And when
the word is co-articulated with something else or the speech rate changes, there's speech
reduction.
This set of gestural targets is still there. They're just arranged in different ways. They shift
in time. But the targets are still there. And because they overlap with each other in different
ways, the surface form will change. So that's their theory. And we're trying to build a model
according to this theory.
The way that I approach this is: if we define the gestural targets at a local time frame
as a vector, then this 2D representation just becomes a vector sequence problem.
So we say a gestural score, which is a 2D representation defined on tract variables and time, can
be approximated as a sequence of gestural pattern vectors, each vector defining the constriction
targets at that local time.
Once we have this formulation, we want to see whether we can go from the speech observation to a
gestural pattern vector sequence, thus to the gestural score, and therefore to words. The gestural
pattern vector encodes instantaneous patterns of gestural activation across all tract variables in
the gestural score at a particular time.
And the gestural score, approximated as a sequence of gestural pattern vectors, is the ensemble of
gestural activations, which is basically the approximated gestural pattern vector sequence.
And of course at each step we could introduce some confidence scores, the likelihoods. The
interesting part is actually the gestural-score-to-word part, where we use a [inaudible] gestural
timing model, which was originally developed for articulatory phonology, to generate one canonical
gestural score for each word. This is like the canonical pronunciation for this word.
So the words will be distinguishable according to the gestural scores, in the sense that they are
distinguishable according to this ensemble of gestural targets behind them. But they are not
distinguishable according to their particular timings, because when a word is pronounced by
different people in different contexts, the timings tend to change, according to [inaudible]
phonology.
So the computational model that we devise according to this theory is that we have a finite state
machine emitting gestural pattern vector sequences. So you could imagine each path in this
finite state machine emitting a gestural pattern vector sequence, which is essentially a
gestural score with a particular timing.
So if you look at the word "the" here, these are the different tract variables. So this -- they
always have these four gestural activation targets. But they could be arranged in different
timings. And we want to summarize all this into the finite state machine through some recursive
process.
And once we have this finite state machine -- this machine is basically the pronunciation model of
this single word -- we'll be able to do word recognition by
composing a lattice with a pronunciation model, which is roughly the union of all these
different word pronunciations.
>>: [inaudible] looking at the acoustic [inaudible] or by looking at the articulation?
>> Xiaodan Zhuang: We actually look at the canonical speech gestures. So we have this
intergestural timing model which can generate one of these 2D representations for each word.
And then we're saying the pronunciations allowed are all the different possible shiftings of all
these gestural targets. So we're building this recursively: we look at a canonical one,
say the ensemble of gestural targets we have, and then at each time frame we propose
possible combinations of them, weighted by some kind of quality measures to assess
whether this combination is possible or not, from the training data or from the theory.
So each time we do an interstate transition, it's like changing from a particular
local gestural target combination into the next one. And each self-transition says we'll stay in
this particular configuration for another time slot.
So we're building this recursively, like building a tree. And then in the end you can optimize this
finite state machine by pushing and other operations. Once you get this finite state machine, you
can use open source finite state machine toolboxes to optimize it and construct a dictionary.
This is still at the very early stage. It's not -- we don't have any evidence to say this works better
than the conventional approaches. So it's more like an interdisciplinary exploration with the
phonologists.
>>: [inaudible]
>> Xiaodan Zhuang: Oh, the targets.
>>: Yeah.
>> Xiaodan Zhuang: So the part that I worked on was to devise how these things all
fit together. We have collaborators in [inaudible] lab. They are working on how you
recover this underlying 2D representation from the speech surface form. They actually have a
series of works on that, but I think it's probably beyond what I can talk about.
Yeah. Another thing is we did some speech retrieval for an unknown language
for a multimedia evaluation, where the basic idea is that you build indexes for each database entry:
you summarize each database entry as a lattice and then convert that into a finite state
machine-based index.
That index basically allows transitions from the beginning into many of the internal nodes,
instead of requiring a complete traversal of a path through the lattice.
So when you introduce these extra transitions from the beginning to the different nodes, you
associate them with something like a forward probability or backward probability. For that part
we're actually just following the earlier Bell Labs approach.
The reason for doing it that way is that I was given three weeks to build this system, so I wanted to
devise something that requires less tuning. Using a finite state machine was a reasonable
way to do this, and the tuning could mostly be done using standard optimization methods.
For unknown speech, we are talking about model training with different languages,
none of which is the testing language that you retrieve from. So the idea here is that you
want to have models that are hopefully language independent. The way that we address
it is to take all these language-dependent atomic models and do clustering based on their
pairwise distances.
And then the clustering result would hopefully capture information
that is shared across different training languages but not language specific. And hopefully that
information is more transferable to a brand-new, never-seen language at testing time.
Oh, the last one was actually done with Microsoft. The problem was to convert from speech to
lips. Originally we had a baseline that was STS based. In this model you build one triphone
visual model for each phone, and then you can synthesize the lips from the triphone sequence,
which is known beforehand.
So the questions here are, first, can you engage audio to improve, and, second, can you do lips
without knowing the ground truth phone sequence, which can usually be hard to get.
The answer is yes to both questions. You can engage the
audio in the maximum likelihood estimation of the video observations, which
improves beyond this visual-only synthesis.
And the diagram shown here is more on the conversion side, where we don't have this ground
truth phone sequence at all. The idea here is that if we are going from speech to lips, it looks like
it's not always necessary to jump this huge semantic gap to find the underlying speech
content. It may be easier not to go into the speech content just because of how hard it is.
So the way that we address this is to use an audio-visual joint Gaussian mixture model, and then
we select optimized modality weights according to a development set. And then, on the other
hand, we want to find the visual sequence that best mimics the ground truth video sequence. The
maximum likelihood is a good criterion to work with, but it's not necessarily in line with the target
metric, which is human perception.
Human perception is very hard to evaluate directly, so we're mimicking it using the
mean square error of the synthesized trajectories. That can be used to refine the
visual Gaussian mixture model. It's kind of an extension of the minimum generation error training
that Microsoft has worked on for speech synthesis, here in a conversion setup.
Yeah. That's all I have today. Thank you.
[applause]