>> Dong Yu: So today we're glad to have Xiaodan Zhuang from USC to give us a talk on audio-visual event detection. Xiaodan is a Ph.D. candidate at UIUC. He got his master's degree in electrical engineering from UIUC in 2007 and his bachelor's degree from Tsinghua University in 2005. So without further ado. >> Xiaodan Zhuang: Thank you for coming in the morning. So today I'll talk about modeling audio and visual cues for real-world event detection. This is the topic I picked for my dissertation as an umbrella to include a couple of projects I got involved in over the years. I also prepared a couple of snapshot slides towards the end covering a couple of other projects I was involved in that might be interesting to this audience. So audio-visual event detection aims to identify semantically defined events that reveal human activities. And we would agree that while speech is the most informative information source in the audio stream, a lot of the non-speech events also tell you what's happening in your environment. For example, on the acoustic side, if you have events like a door slam, steps, chair moving, or key jingle in this room, they tell you something about what's happening. And, in particular, if we hear a very loud yawn in the very first five minutes of the talk, that also tells you something about the talk. For video events it's also very apparent that if you can figure out what's happening in the video, it is useful for multiple applications, like surveillance, human-computer interaction, supporting independence and well-being of the aging population, video annotation, and multimedia retrieval in general. As useful as these applications are, they are relatively less studied than speech recognition; they also have a shorter history. But they have been explored by different researchers before. For example, many people have worked on detecting restricted highlight events, such as a gunshot in the street, an explosion in a movie, or goal cheering in a sports game. Many times people leverage lower-level detectors, for example a car detector, a flag detector, or an office detector, and use these for video event detection. In particular, for a meeting room scenario like this, people would use a person tracker, a laptop detector, a face detector, and a door activity estimator. The benefit is that hopefully these lower-level detectors are transferable between different tasks. The drawback is that they are usually ad hoc and need more annotation for training. So in this work we're trying to see what could be a generalizable and robust way to model these cues for detection of real-world events. By real-world events we mean events that come from data that are realistic and not collected in a lab setting. So my talk will be roughly in three parts. The major pieces are the first two parts. First I'll talk about classification and detection of acoustic events. There are a lot of lessons learned from the speech recognition community; I think this audience will find many of the methods very similar to what we use in speech recognition, and they work so well because they have survived decades of work in that community. So detecting and classifying acoustic events is a task of significance according to the CLEAR evaluation in which we participated.
The activity detection and description is a key functionality of perceptually aware interfaces working in collaborative human communication environments. And these events usually help describe human activities. So the particular dataset that we worked on has over 12 events such as door slam, steps, chair moving, and key jingle. They are difficult because we are talking about a set of general events in a very realistic environment; we're not talking about one particular highlight event like an explosion or something like that. Also, the temporal constraint between different events is much looser than in speech. There's no linguistic constraint, of course, and they don't tend to happen following a particular sequence. The SNR is low. The particular dataset and setup we used actually had one far-field microphone, so we haven't included any microphone array beamforming in this setup. The far-field microphone picks up background noise and also some background [inaudible] speech. We propose leveraging statistical models proven effective in the speech recognition literature. In particular we use a tandem connectionist-HMM approach to combine the sequence modeling capability of the HMM with the context-dependent discriminative capability of an artificial neural network. Here we're talking about one HMM for one particular event; so for 14 events, we'll have 14 HMMs. And in a rescoring stage we'll use a support vector machine with a Gaussian mixture model supervector approach, using these noise-adaptive kernels to better approximate the KL divergence between feature distributions in different audio segments. We'll cover that in a while. So the way that we formulate acoustic event detection is exactly the same way as in speech recognition, in particular as whole-word sequence recognition: you're basically maximizing the joint distribution of the observation and the event sequence. And for feature representation we are looking at a sliding-window-based sequence of local frame-based features. Temporal segmentation is one of the major challenges for this problem, and people have tried different ways to address it. For example, you could do simple sliding-window-based approaches, basically detection by classification, and people have also come up with ideas similar to a cascade of different sliding windows to try to improve accuracy. But from what we found, and also from the results of the evaluation, using the Viterbi algorithm of HMMs to compute a simultaneously optimal segmentation and classification of the audio stream is still the best choice, we believe, and we illustrate this in the evaluation. In particular we believe that the benefit is that the noise in these individual frames will be alleviated by the learned prior of preferring self-transitions rather than non-self-transitions in the hidden finite state machine. And how do you model the temporal context, and does that matter in this particular application? We explored this question by using an artificial neural network to observe a larger context window for each hidden state. In particular, the feature vectors are observed in a context window, and the outputs of the neural network are posteriors, which are transformed and then fed into a dynamic Bayesian model, in particular a hidden Markov model.
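A minimal sketch of the tandem feature pipeline described above, assuming frame-level acoustic features `frames` and frame-level event labels `labels` are available; the helper names and parameter values are illustrative, not the speaker's actual system:

```python
# Hedged sketch of the tandem connectionist feature pipeline (illustrative only).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA

def stack_context(frames, left=4, right=4):
    """Concatenate each frame with +/- context frames so the ANN sees a larger window."""
    T = frames.shape[0]
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(left + right + 1)])

X = stack_context(frames)                      # (T, 9 * D) context-windowed input
ann = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200)
ann.fit(X, labels)                             # labels: frame-level event identities

post = ann.predict_proba(X)                    # (T, n_events) frame posteriors
tandem = PCA(n_components=min(12, post.shape[1])).fit_transform(np.log(post + 1e-8))
# `tandem`, optionally concatenated with the original frame features, becomes the
# observation stream for the per-event left-to-right HMMs decoded with Viterbi.
```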
So this is also learning from the speech recognition community, but it doesn't always work in speech recognition. I think the lesson people have drawn is that when the task has low SNR and when the models are relatively context independent, this tandem approach usually gives you a more pronounced improvement. And these two assumptions do hold true for our application, because the SNR is indeed low, and the models are less context dependent because there's nothing like in [inaudible] speech models where you have a very rigidly defined context. So basically we take a larger window of these frame-based features, transform the output of the neural network, and use that as input to the HMMs. Another question is: is the temporal structure very important or not? Up to now we use the HMMs to model the events, and in particular the HMMs have a left-to-right topology, so we do want the different state sequences to capture the temporal structure within each proposed event segment. However, there's an alternative way to look at these audio segments. The audio segments vary in length, but in essence each is a set of local frame-based features. So looking at an audio segment, whatever length it is, is looking at the joint distribution of all these frame-based feature vectors. The approach we used here has also had success in speaker ID and language ID. The idea is to look at the audio segment as a set of these local descriptors and then approximate this joint distribution using a Gaussian mixture model. The parameters of the Gaussian mixture model are then used to construct a supervector, which becomes a uniform-length vector representation for audio segments of varying length. And once we have this vector representation, it can be shown that putting it into a linear kernel actually approximates the KL divergence between the audio segments, and the kernel can be used in whatever classifier you choose. So in particular the GMM supervector is the normalized, stacked mean vectors from the different components, and it can be shown that the KL divergence between the feature distributions can be approximated by the Euclidean distance, which is why we can use a linear kernel. There are some assumptions behind this being true; in particular we need the different Gaussian mixture models to be obtained by adapting from a global model. So the scenario here is that we train a UBM, or global model, using the frame-based feature vectors from all audio segments available in training, regardless of their categorical labels. Then we adapt this global model to each different audio segment. So it's adapting to each audio segment instead of adapting to each category; in the end you have one Gaussian mixture model for one audio segment, and we're using this Gaussian mixture model as an approximation to the joint distribution of all local descriptors within one segment. >>: [inaudible] >> Xiaodan Zhuang: The segment here usually varies from one second to ten seconds. It could sometimes be shorter depending on the particular event. Footsteps are usually pretty long, but a door slam or key jingle is not very long. >>: So that adaptation [inaudible]. >> Xiaodan Zhuang: Yeah. So for adaptation, we're doing MAP adaptation in this experiment, with conjugate priors.
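A hedged sketch of the GMM-supervector construction just described, MAP-adapting a global model to each segment and stacking normalized means; the function names and the relevance factor are illustrative assumptions, not the exact system:

```python
# Hedged sketch of UBM training, relevance-MAP adaptation of the means, and
# supervector construction whose linear kernel approximates the KL divergence.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(all_frames, n_components=64):
    """Global model trained on frames from all segments, ignoring their labels."""
    return GaussianMixture(n_components=n_components, covariance_type="diag").fit(all_frames)

def supervector(ubm, segment_frames, relevance=16.0):
    post = ubm.predict_proba(segment_frames)          # (T, K) responsibilities
    n_k = post.sum(axis=0)                            # soft counts per component
    ml_means = post.T @ segment_frames / np.maximum(n_k[:, None], 1e-8)
    alpha = (n_k / (n_k + relevance))[:, None]        # relevance-MAP interpolation
    adapted = alpha * ml_means + (1.0 - alpha) * ubm.means_
    # Normalize so that a plain linear kernel approximates the KL divergence:
    # phi_k = sqrt(w_k) * Sigma_k^{-1/2} * mu_k (diagonal UBM covariances assumed).
    phi = np.sqrt(ubm.weights_)[:, None] * adapted / np.sqrt(ubm.covariances_)
    return phi.ravel()                                # fixed-length representation

# Two segments are then compared with a linear kernel, np.dot(sv_a, sv_b),
# which is what the SVM in the rescoring stage operates on.
```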
Roughly, the more data you have from each segment, the better the adaptation will be. So as you may have noticed, this approach actually needs some predefined boundaries. The way that we get around this is to have the first part being the HMM-based Viterbi decoding, and then with the hypothesized boundaries we'll be able to approximate the joint distributions here. So in the end we have MAP decoding, which is Viterbi decoding of the HMMs, and then with the hypothesized boundaries and event labels we do confidence rescoring based on the classification result from the GMM supervector. And when we're combining the two hypotheses, it also opens the possibility to refine the output according to the particular metric of your problem, in this case the AED metric I mentioned. The AED metric is basically the F score between the hypothesized event sequence and the ground truth event sequence. So it gives you something complementary to what MAP decoding finds. Very quickly, here are some of the results. This was on the CLEAR 2007 acoustic event detection data. There are 12 general meeting room non-speech events, such as keyboard typing, cough, and chair moving. Basically we could improve the baseline result, which is the HMM-based result that we submitted to the evaluation, by having better temporal context modeling, the tandem model, and also by harnessing the complementary information from the GMM supervector model in the rescoring phase. So this is saying that even if we don't have this very rigid temporal structure captured by the hidden state sequence, there is at least some complementary information coming from looking at audio segments in a different way, and the two are kind of additive, so together they further improve the result. So then I'll talk about our experiences in categorization and localization of video events. >>: Question [inaudible]. >> Xiaodan Zhuang: Oh, the features here -- yeah, I didn't talk about features here. But if you say we need some result tomorrow, I'd just recommend using MFCCs or PLPs. The particular features that we used were kind of engineered towards this task. In particular, we bring together MFCCs, PLPs, and filter banks with different parameters. So we build a pool of well-engineered speech features, and then we decorrelate these features and select a subset out of the pool according to a boosting-based method or according to the error of individual features. >>: [inaudible] >> Xiaodan Zhuang: We haven't used that. Other participants actually used that -- >>: [inaudible] >> Xiaodan Zhuang: Yeah. We haven't explored this in particular. And, frankly, I have to say, for the reason that we performed better than the other participants, we don't have proof that it's on the feature side. We believe it's more on the modeling side, that we're doing everything in a reasonable way, that gives us an advantage. Yeah. We -- >>: [inaudible] >> Xiaodan Zhuang: Yeah. We kind of derive features out of a pool that consists of MFCCs, filter banks, and PLPs. The major motivation is that we want to leverage what the community has accumulated over the decades, and we believe these features are the best choice if you have limited time. >>: [inaudible] model and the tandem model, right?
Another way to think of either to have [inaudible] features. >> Xiaodan Zhuang: Yeah. We haven't explicitly tried that. I think the only sense in which we have some kind of multiscale temporal thing happening here is that in the first part the HMM has this neural network observing a larger context, whereas in the rescoring phase we just used the local frame-based features. But it's complementary in different ways. In one way it does not have this rigid hidden state sequence; also, as I said, the GMM supervector observes the frame-based features directly, not with this [inaudible]. So moving on, I'll talk about categorization and localization of video events. This is centered around an improved image and video representation called the Gaussianized vector representation. But as we go on, I think this audience will find a lot of these things very similar to what we just saw with the GMM supervector, because we are actually inspired by what's happening in speech and audio modeling, to see whether we can look at images and videos in a similar way. So in particular the Gaussianized vector representation is our attempt to ask whether we can model images and videos without explicitly worrying about segmentation or particular object detection within the image. The Gaussianized vector representation adapts a set of Gaussian mixtures according to the set of patch-based descriptors in an image or video clip. The adaptation is regularized by a global Gaussian mixture model, and the final linear kernel is improved by within-class covariance normalization. And because of the success that we had in doing work like categorization and regression, we're thinking about what would be a reasonable way to apply this representation for object localization. So in the end we adopt an efficient branch-and-bound search scheme, and this could potentially be used to identify regions of interest in the video corresponding to different events. So one of the major problems in computer vision, probably in any machine learning problem, is that you want to find the correspondence between data samples. If we're looking at different faces here, we can easily see that if we want to compare the different faces, you probably want to compare eyes to eyes and mouths to mouths. But if you look at some more realistic images, like a bedroom image or a broadcast news image, even if the bedroom probably has a bed inside, it's less apparent what kind of particular correspondence we should find between the two images. And if we look at a broadcast news video, even if this event is categorized as a car entering, we don't necessarily see a very complete car in the image. So that makes any effort based on detecting a particular car very challenging, if doable at all. So visual cues for real-world events present extra challenges for correspondence compared to some other tasks like face processing. The way that we look at image and video modeling is to look at them as a set of local patch descriptors. A patch is a very small region in the original spatial domain. There are two major ways that people have used to extract patches. The first is a 2D sliding window, with patches extracted from a dense pixel grid; for each patch you have some descriptor based on that particular patch. Another way is to have some kind of low-level detectors.
For example, a SIFT detector would actually find regions where there is contrast or an edge, basically regions where hopefully there is information. But at the end of the day, you get a set of local descriptors anyway. So the image or the video clip becomes a set of local descriptors. The most popular way to deal with this set of local descriptors is called a histogram of keywords. Basically you use K-means clustering to establish a codebook, and then the histogram of counts for the different codebook entries becomes the information that you carry for that image. So a very complex image or a video clip like this becomes a histogram in the end. There are large quantization errors and a loss of discriminability, so we're trying to see what we can do to improve beyond that. Oh, and I have to mention that there are improvements over the histogram of keywords itself; what I present here is a kind of simplified version of different things. The way that we address this problem is that we still start from this set of local descriptors in the original feature space. And then, same as in audio, the information that we need from the image or video clip is basically the joint distribution of all these local descriptors. So to approximate this joint distribution, and hopefully be robust to noise, we use a Gaussian mixture model to fit this set of local descriptors. Then in the end you have the Gaussian mixture model that approximates the joint distribution. And then looking at different images is nothing but looking at the different Gaussian mixture models, each corresponding to one image. In particular, you take the normalized means of the Gaussian components to form a supervector, each corresponding to one image or video clip. >>: So you also [inaudible]. >> Xiaodan Zhuang: Yeah. So you need to have a global model so that the KL divergence can take this simple form. >>: [inaudible] to each image or -- >> Xiaodan Zhuang: Adapt to each image or each video clip. So we use this for different kinds of scenarios. One is that for each image you have a set of local descriptors. The other is video clip classification, where you basically have a set of local descriptors extracted from all the images in that video clip. But we haven't worried about the temporal structure of the image sequence within the video clip. So we have a global model that is trained using local descriptors from all kinds of images and videos available to us, regardless of their categories. And the adaptation is done using local descriptors extracted from one image or one video clip. The adaptation can be done with a conjugate prior just to help establish this correspondence with the original global model. And it can be shown that the distance between two images or video clips can be characterized by the approximate KL divergence. In particular you can see that if you write this stacked mean, normalized in the proper way, the KL divergence can be approximated by the Euclidean distance. So the kernel function here takes an interesting form. It's linear, but each of these phi here is usually of very high dimension. In our experiments we would easily take 500 Gaussian components for broadcast news video, and maybe for a simpler dataset you could take a smaller number of Gaussian components. Because it's linear, it's less of a problem computationally.
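For reference, the linear kernel on these normalized, stacked means can be written as follows (our notation, not the slide's; w_k and Sigma_k are the global mixture's weights and covariances, and mu_k^(a) are the means adapted to image or clip a):

$$
K(a,b) \;=\; \phi(a)^{\top}\phi(b), \qquad
\phi(a) \;=\; \begin{bmatrix} \sqrt{w_1}\,\Sigma_1^{-1/2}\,\mu_1^{(a)} \\ \vdots \\ \sqrt{w_K}\,\Sigma_K^{-1/2}\,\mu_K^{(a)} \end{bmatrix}
$$

Under the shared-global-model adaptation assumption, the Euclidean distance between two such supervectors approximates (via a bound on) the KL divergence between the two adapted mixtures, which is why a plain linear kernel suffices.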
To wrap up this part, the Gaussianized vector representation has a couple of advantages compared to the original histogram of keywords representation. First, the hard assignment to the different keywords is replaced by a soft assignment to the different Gaussian components according to the posterior. Then we can leverage the Mahalanobis distance instead of a Euclidean distance, although for this representation the Mahalanobis distance can be written as a Euclidean distance if you normalize properly. Also, multiple-order statistics are taken into consideration: compared to only the counts, which are zeroth-order statistics, we also use the adapted means and can take into consideration the covariance matrices. So a way to visualize this is that in place of the very simplistic histogram of keywords representation, the Gaussianized vector representation not only establishes the correspondence through the different Gaussian components [inaudible] visualization, but also according to how well all these local descriptors are adapted into these different modes and what their distribution is around the different modes. A key point here is that we don't want to establish any hard segmentation, because that's very hard for realistic data; we're basically leveraging only the Gaussian components to give us that kind of correspondence. But for data that is very structured, like human faces, can we use something beyond a mixture of Gaussians to set up this correspondence? The answer is yes. The most straightforward way to do this is to use a hidden Markov model. For example, we tried face images: if you take a hidden Markov model, even a very simple left-to-right one, and arrange your local patch descriptors in a [inaudible] style such that you observe first the eyes and then the nose and then the mouth, then a very simple left-to-right HMM gives you some additional performance improvement, because the hidden states capture the temporal structure that is very consistent in face images. This is not usually true for broadcast news or general video, but for faces it is true. Another way to look at why the Gaussianized vector representation has very good practical performance is the following approximation. First we can calculate the posterior of an observed patch coming from the kth mixture component, and then we can randomly distribute the observed patch into the M Gaussian components, or classes, by a multinomial trial [inaudible] posterior probability. In the real world we're doing MAP adaptation, so it's kind of a weighted summation. But if you look at it as a multinomial trial, then you can write out the equations and see that the final supervector actually follows a standard normal distribution. This has a benefit because a lot of the later processing in computer vision usually takes this as an assumption. For example, if people need to do PCA to reduce dimension, or need to calculate distance as a Euclidean distance, all those processes assume that the feature vector lives in Euclidean space. So this representation partially satisfies that assumption, and we believe this might be the reason why it gives you a performance boost. So up to now we have used no categorical supervision for deriving these representations.
Of course you can use the kernels in an SVM and introduce this supervision. But before that we would like to do something else to leverage the categorical labels that are so expensive to obtain in the first place. In particular we use within-class covariance normalization to identify a subspace of the kernel that has maximum inter-image or inter-video distances within the same categories. So basically this term here is calculating the distance between different images if they come from the same category, and in the end you get the covariance contributed by all these pairs that come from the same category. The refined Gaussianized vector kernel suppresses this undesired subspace because it tells us nothing about the target category labels. In particular, we can suppress that subspace by subtracting it from the original identity-covariance kernel matrix here. So we're comparing this with the then state-of-the-art paper on video event categorization and retrieval. This is a paper that gave the best performance for this particular task, and it is also a very representative way of how the computer vision community looks at this problem. The particular approach is that they divide the video clip into subclips in a hierarchy. So this is one video clip, and the other is another one. The essential problem is how we characterize the distance between the two video clips. The way many people look at this is: I want to find the correspondence between the two video clips, so I want to see which frame corresponds to which frame. This is temporal, but potentially you could also ask which local descriptor corresponds to which local descriptor spatially; that's the same idea. So in particular you could establish a hierarchy to look at the video clip at different temporal resolutions, and then you try to find pairs between the grids in the different video clips to establish this correspondence based on some optimization, and then once you establish that correspondence, hopefully when you calculate the distance between the two video clips you are comparing apples with apples and pears with pears. However, we're showing that even if you don't have that explicit step to do this computation, you can get reasonable and even better results using something like the Gaussianized vector representation. So in particular we're using the Gaussianized vector representation with different classifiers. Even with a nearest neighbor classifier, it actually performs pretty well compared with the TAPM approach, where people actually use a support vector machine. If we combine the Gaussianized vector representation with a support vector machine, that gives you an even more pronounced improvement. And we could also get benefits by applying within-class covariance normalization to refine the kernel in the supervector space. Another -- >>: [inaudible] >> Xiaodan Zhuang: This is the TAPM result, temporally aligned pyramid matching. >>: So you did [inaudible]? >> Xiaodan Zhuang: It's a separate thing, because we actually didn't worry about the explicit alignment between the images. And we believe, from what they found, the alignment sometimes makes sense but not always.
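A hedged, feature-space sketch of the within-class covariance normalization described a moment ago; the talk applies the equivalent operation directly on the kernel matrix, and the supervector dimension is assumed small enough here to form the covariance explicitly (all names are illustrative):

```python
# Hedged sketch of WCCN: estimate the covariance of within-class variation and
# whiten it away, suppressing directions that vary inside a category and so
# carry no information about the labels.
import numpy as np

def wccn_projection(X, y, eps=1e-6):
    """X: (N, P) supervectors, y: (N,) class labels. Returns projection B."""
    classes = np.unique(y)
    P = X.shape[1]
    W = np.zeros((P, P))
    for c in classes:
        Xc = X[y == c]
        Xc = Xc - Xc.mean(axis=0)               # deviations within one class
        W += Xc.T @ Xc / len(Xc)
    W = W / len(classes) + eps * np.eye(P)
    # Cholesky of the inverse within-class covariance gives B such that the
    # refined kernel is k(a, b) = (B^T a)^T (B^T b) = a^T W^{-1} b.
    return np.linalg.cholesky(np.linalg.inv(W))

# Usage sketch:
#   B = wccn_projection(X_train, y_train)
#   K_refined = (X_train @ B) @ (X_train @ B).T
```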
So I think the idea here is not to say we don't want any alignment in the end, but to say that if the alignment is very hard to do, can we do something reasonable such that even without alignment we can perform comparably? The answer seems to be that we can perform even better on this particular problem. So that's one of the reasons we're very excited about this result, and we observed similar findings in different applications. >>: So basically alignment event [inaudible] your system? >> Xiaodan Zhuang: No. >>: So do you think if you one day integrate that on top of your results you can get better -- >> Xiaodan Zhuang: This -- yeah -- >>: Or you already have used some of that concept [inaudible]. >> Xiaodan Zhuang: Yeah. We could have already harnessed some of this information. This is a similar question: if we do it in our way using only the GMM, or using a multistate HMM, do you think the multistate HMM would always be better than the single-state HMM? The answer is maybe not always. For face processing, where there's a very strict temporal structure and this temporal structure is easy to capture, using a left-to-right topology you can capture part of it. If that's the case, then probably a more complicated model would actually do you good. But for broadcast news, from their experience, the correspondences they found are not always what they intended. So we're not sure; maybe a version of that, combined with these methods -- there is always a chance to improve, but the idea of doing segmentation in the first place can sometimes be a dangerous one. It's just like asking: do you always want to do speech recognition by doing phone sequence recognition first and then trying to turn that into a whole word sequence? It's a similar philosophy. But we did use some of the other alignment methods that are popular in the computer vision community. For example, they would do a hierarchical pyramid over the spatial domain and make some assumptions, for example that a given corner of one image probably corresponds to the same corner in another image, things like that. And in some applications we actually do find improvement from that. >>: [inaudible]. >> Xiaodan Zhuang: No, this -- yeah, yeah. For this one, yes. This one we are essentially assigning labels to each video clip, yes. >>: [inaudible]. >> Xiaodan Zhuang: Yeah, exactly. Yes. We do have a side benefit, which is that because we look at the image and video as a set of local descriptors, as I said, it's less prone to occlusion. So even if you only use 20 percent of the local descriptors, your performance doesn't drop much. But there is a warning here, because this 20 percent is purely randomly sampled. If you very carefully occlude all the important local descriptors, then it probably won't work as well. But just as a note, if we randomly sample 20 percent of the local descriptors, the result is pretty much intact. >>: [inaudible]. >> Xiaodan Zhuang: Yeah, in this one we're using -- >>: [inaudible] >> Xiaodan Zhuang: Not for this one, no. We used some other features for other tasks, but the idea behind this modeling is similar. So we want to see whether we can use this for localization, because it gave us a lot of advantages in categorization and retrieval. But all these things involve global information based on whole images. So can we do something similar for localization, which is to find a particular object within an image and tell how large it is?
So we adopt a branch-and-bound scheme that was proposed by Lampert to allow efficient object localization, achieving the global optimum in roughly n squared instead of n to the fourth. This scheme has also been successfully applied to histograms of keywords, so that gave us a way to preliminarily compare with what other people have done using similar approaches. So for the branch-and-bound scheme, regardless of the detailed nuances here, the basic idea is that you want to search all these n-to-the-fourth rectangles, but not one by one. You want to rule out sets of them if you have a good bound telling you what the upper bound for that particular rectangle set is. If a set's bound is worse than the bound of another set, you can always start with the set with the highest bound, and if the bound is good enough, you won't miss the target rectangle that you're hunting for. In particular, you initialize an empty queue of rectangle sets and initialize the first rectangle set to be all rectangles. So if we parameterize a rectangle by top, bottom, left, and right, the set is initialized as all rectangles, with the top, bottom, left, and right each covering the complete range of that dimension. Then we obtain two sets by splitting the parameter space along the largest of the four dimensions. You push these two rectangle sets into the queue with their respective quality bounds, which we'll talk about in a minute, and then we update R by looking at the set with the highest quality bound. The sets shrink and shrink, and in the end one becomes a single rectangle; that's the one that we're trying to find. So in order to introduce the quality bound, we'll first talk about the quality function, which gives the confidence of the evaluated subarea being the target object. We're actually using the output of a binary SVM in the standard format. The bound should satisfy two conditions. It should be no smaller than the quality function evaluated at any of the member rectangles, and if the set has only one member, the bound should be equal to the quality function evaluated on that single member. First we can see that the quality function can be rewritten, with approximations, as the summation of the contributions from each local descriptor. In particular, we define a per-feature-vector contribution here, so evaluating a particular rectangle is nothing more than summing the contributions coming from all the member patches within that subarea. If we can write it in this linear format, then the quality bound can be written as the summation of the positive local contributions within the largest rectangle plus the summation of the negative contributions within the smallest rectangle. So this is how we evaluate the quality bound for a rectangle set: essentially all the positives coming from the largest member and all the negatives coming from the smallest member. And it's easy to verify that this satisfies the two conditions we just set forth. >>: [inaudible] >> Xiaodan Zhuang: So you're asking how this per-feature contribution is calculated? >>: [inaudible] >> Xiaodan Zhuang: For each [inaudible]. >>: Each vector. >> Xiaodan Zhuang: We use the Gaussian mixture models to approximate one distribution for one image in the video clip.
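A hedged sketch of the branch-and-bound search just outlined, in the spirit of Lampert's efficient subwindow search; the per-descriptor scores are assumed to come from the linear decomposition of the SVM quality function discussed here, and all names are illustrative:

```python
# Hedged sketch: best-first branch-and-bound over sets of rectangles, where a
# rectangle's quality is the sum of per-descriptor contributions it contains.
import heapq
import numpy as np

def quality(rect, xs, ys, scores):
    """Sum of contributions inside rect = (top, bottom, left, right)."""
    t, b, l, r = rect
    inside = (ys >= t) & (ys <= b) & (xs >= l) & (xs <= r)
    return scores[inside].sum()

def bound(rect_set, xs, ys, scores):
    """Upper bound: positives over the largest member plus negatives over the smallest."""
    (t_lo, t_hi), (b_lo, b_hi), (l_lo, l_hi), (r_lo, r_hi) = rect_set
    largest = (t_lo, b_hi, l_lo, r_hi)                 # union of all member rectangles
    pos = quality(largest, xs, ys, np.maximum(scores, 0.0))
    neg = 0.0
    if t_hi <= b_lo and l_hi <= r_lo:                  # intersection of members is non-empty
        smallest = (t_hi, b_lo, l_hi, r_lo)
        neg = quality(smallest, xs, ys, np.minimum(scores, 0.0))
    return pos + neg

def ess(xs, ys, scores, height, width):
    """Return the rectangle (top, bottom, left, right) with maximum total contribution."""
    full = ((0, height - 1), (0, height - 1), (0, width - 1), (0, width - 1))
    heap = [(-bound(full, xs, ys, scores), full)]
    while heap:
        _, rect_set = heapq.heappop(heap)              # set with the highest bound
        widths = [hi - lo for lo, hi in rect_set]
        if max(widths) == 0:                           # set has shrunk to one rectangle
            return tuple(lo for lo, _ in rect_set)
        i = int(np.argmax(widths))                     # split the widest interval in half
        lo, hi = rect_set[i]
        mid = (lo + hi) // 2
        for part in ((lo, mid), (mid + 1, hi)):
            child = list(rect_set)
            child[i] = part
            child = tuple(child)
            # keep only sets that still contain at least one valid rectangle
            if child[0][0] <= child[1][1] and child[2][0] <= child[3][1]:
                heapq.heappush(heap, (-bound(child, xs, ys, scores), child))
```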
But there's no class information. >>: [inaudible] >> Xiaodan Zhuang: Yeah, there's no class information here. And of course you could ask why we don't do classification based on classifying each local descriptor. But I'm pretty sure that would not work, because a single local descriptor is very noisy, so classification based on that usually doesn't work as well. So we are using the Gaussian mixture model to approximate the joint distribution within one image, and we hope this smooth distribution modeling can actually suppress some of the noise coming from the local descriptors. That's why. But if you look at this per-feature-vector contribution, that's what I was about to say: the calculation of all these things is essentially the same as calculating the adaptation of the global model to this particular image with all the local descriptors. You need to compute the per-feature-vector contribution for all the local descriptors, but that's the same thing as doing the adaptation, where you already sum up the contributions of each feature vector according to the posteriors for the different Gaussian components. So computationally it's very similar: getting the complete set of per-feature-vector contributions is computationally similar to doing the adaptation for the whole image, just because you can write and approximate it in a linear way. >>: In the previous [inaudible] you're talking about the adaptation, right? And just based on the comparison between one video clip and another video clip, it seems not very scalable: if you have a lot of labeled [inaudible] different events, you have to compare with all the different events, right? >> Xiaodan Zhuang: Oh, so the comparison -- once we get this Gaussianized vector in place, we basically just calculate the kernel matrix. >>: [inaudible] >> Xiaodan Zhuang: Basically you'll have a multiclass -- >>: [inaudible] >> Xiaodan Zhuang: Yeah. That's the same idea as having a multiclass SVM for classification of multiple categories. Right? >>: [inaudible] >> Xiaodan Zhuang: Yeah, you need to compare against each of the training ones. Right. >>: You really don't model an event. >> Xiaodan Zhuang: You model the event only in the kernel space. Yes. >>: [inaudible] >> Xiaodan Zhuang: Yeah. With the kernel space you model the events. >>: So [inaudible] I have, I don't know, [inaudible] event and I have a hundred video clips. >> Xiaodan Zhuang: For [inaudible]. >>: For [inaudible]. So you have to compare with all the -- >> Xiaodan Zhuang: Yes, that's right. >>: But in this [inaudible]. >> Xiaodan Zhuang: For localization, you're actually doing the same. One thing that I probably should have emphasized is that localization is basically based on a two-class classification problem: it's either the foreground, the target, or the background. Say you have 500 positive foreground training tokens and 500 negative background images; you train this binary SVM, and then at testing time you still have to evaluate the kernel between the one test vector and all [inaudible]. >>: [inaudible] modeling and model this probability and the likelihood of each feature to each event. >> Xiaodan Zhuang: Not really [inaudible]. >>: That's the confusing part. >>: [inaudible] >>: So how will you do that?
>> Xiaodan Zhuang: So the only difference in this regard is a binary SVM versus a multiclass one. You calculate the kernel matrix, and then it's the training of the SVM that deals with the categorical supervision. So the only thing that we change here is how you calculate the distance between two images. >>: [inaudible] >> Xiaodan Zhuang: Yeah, so you're still looking at this kernel matrix, which is always comparing two individual images, one coming from training and one coming from testing. >>: In your training you have all the labels that the [inaudible]. >> Xiaodan Zhuang: Yeah, yeah. >>: I see. >>: So then you SVM. >> Xiaodan Zhuang: Yeah. The SVM is where the supervision comes into play. >>: That's really where you sort of use the [inaudible] foreground objects. >> Xiaodan Zhuang: Yeah. >>: And so the feature vectors [inaudible]. >> Xiaodan Zhuang: Yeah, yeah, yeah. >>: [inaudible] >> Xiaodan Zhuang: Yeah, yeah, yeah. >>: [inaudible] >> Xiaodan Zhuang: Yeah. >>: So that's [inaudible] you don't really need separate classes, you just -- do you do the SVM per event, per class, or [inaudible] SVM for another type of event [inaudible]? >> Xiaodan Zhuang: The way that we do it practically is that we get this one supervector, this vector representation, for one image. Let's first talk about the video event categorization problem. We have one vector for one image, and we also know the label for that vector. Then we throw this into a multiclass SVM training program. For localization, we have different images -- some of them are the target, some of them are the background. Each of them is converted into a vector, and then we throw this into a binary SVM. >>: So that's not -- it's not the same as [inaudible]. >> Xiaodan Zhuang: But when you throw it into the binary SVM training, you basically throw the kernel matrix in. Right. >>: Sure. I'm just saying that your classifier [inaudible] it's not just [inaudible] to compare -- >> Xiaodan Zhuang: Sure, the classifier itself is supervised. So that's just the SVM. >>: [inaudible] the way you do classification is different from [inaudible]. >> Xiaodan Zhuang: For localization, the core is a binary classifier, but the problem is that you have to do binary classification for many hypothesized rectangles. The reason it looks different here is that you organize it as this linear summation so that you can reuse a lot of the computation, and you can also bound a subset of hypothesized rectangles and throw them away if their bound is bad enough. But at the core it's actually a binary classification: telling whether this subarea is foreground or background. So in that sense it's the same as [inaudible]. Okay. Here are some examples of how things look for localization. Basically you identify the areas correctly if you collect these locally positive red areas and fewer of these negative blue areas. Sometimes it goes wrong because the red areas are not as clustered together as we want, or sometimes we count it as a wrong detection when two objects are combined into one. We compare this with similar work using the histogram of keywords based on this branch-and-bound scheme. Ours consistently improves beyond the histogram of keywords, and we can also see that within-class covariance normalization further improves the result.
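A hedged sketch of how the supervector kernel matrix feeds an off-the-shelf SVM, which, as discussed above, is where the categorical supervision enters; `train_sv`, `test_sv`, and `labels` are assumed inputs:

```python
# Hedged sketch: precomputed linear kernel in supervector space, used either for
# a multiclass SVM (event categorization) or a binary foreground/background SVM
# (localization), depending on the labels supplied.
import numpy as np
from sklearn.svm import SVC

K_train = train_sv @ train_sv.T          # kernel between all training items
K_test = test_sv @ train_sv.T            # each test item against every training item

clf = SVC(kernel="precomputed")
clf.fit(K_train, labels)                 # supervision enters only through SVM training
pred = clf.predict(K_test)
```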
All these Ns denote within-class covariance normalization. So I'll quickly go through the third part, which is relatively short anyway. This part is trying to say that acoustic event detection is very hard, and even if we did reasonably well, the performance is far from usable from a practical perspective. So people are trying to look at whether we can engage video to help, just as we engage lips to help speech recognition. In previous work, people proposed a detection-by-classification system, and that system used a lot of extra information including audio spatial localization, multiperson tracking, motion analysis, face recognition, and object detection. The drawback is that training all these ad hoc visual detectors is very expensive. We're trying to see whether we can go around this and do something reasonable such that the performance is at least comparable. The way we look at it is that all these non-speech audio-visual events are mostly related to motion, so we're using visual features based on optical flow, and we summarize the optical flow using an overlapping spatial pyramid histogram. Basically we look at the complete image and calculate a histogram of the local optical flow magnitude; we also look at a two-by-two grid and extract four histograms, and a three-by-one grid and extract three histograms. Then we stack all these histograms together, do a decorrelation, and use that as the video feature vector for each frame. That goes together with MFCCs or something like that for the audio stream. To combine these we use the multistream HMM and the coupled HMM. In particular, with the coupled HMM there's a lot of hassle to initialize it such that the output PDFs become reasonable. So we tried two different schemes: one initializes using a state-synchronized multistream HMM pair, and the other initializes using pairs of audio-only and video-only HMMs. >>: [inaudible] >> Xiaodan Zhuang: Yeah. So in the multistream HMM, the audio state sequence and the video state sequence progress at the same pace: the first audio state goes to the second, and at the same time the video goes to the second. In the coupled HMM, they can become asynchronous from the perspective of the state sequence. So maybe the audio state has already progressed into the second one while the video state is still residing in the first one. But actually you can look at a coupled HMM as an HMM with more states: the equivalent HMM has more states because each of its states is defined on both the audio state variable and the video state variable of the coupled HMM. In our experiment we're allowing asynchrony of up to one state so that you don't grow this topology into a huge structure. So in the end it's about how you initialize the output PDFs of the coupled HMM, or its equivalent HMM. When I say we initialize using multistream HMMs, we basically initialize assuming the audio and video states should always be synchronized at the [inaudible] level, and then we combine the PDF from the first audio state with the first video state; we also have another state combining the first audio PDF with the second video PDF. In that case we construct this HMM equivalent to the coupled HMM.
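A hedged sketch of this equivalent-HMM view of the coupled HMM: product states over (audio state, video state) with asynchrony limited to one state, and emissions initialized from per-stream PDFs. The PDFs can come either from the state-synchronized multistream HMM just described or from audio-only and video-only HMMs, the alternative discussed next; all names are illustrative:

```python
# Hedged sketch of building the product state space of the coupled HMM's
# equivalent HMM with bounded asynchrony, and seeding its emission PDFs.
from itertools import product

def build_product_states(n_audio, n_video, max_async=1):
    """Product states (i, j) with |i - j| <= max_async, e.g. (0,0), (0,1), (1,0), ..."""
    return [(i, j) for i, j in product(range(n_audio), range(n_video))
            if abs(i - j) <= max_async]

def init_emissions(states, audio_pdfs, video_pdfs):
    # Each equivalent-HMM state emits the audio and video streams with the PDFs
    # of its audio and video component states; re-estimation then refines them.
    return {(i, j): (audio_pdfs[i], video_pdfs[j]) for i, j in states}

# Example: 3 audio states x 3 video states with asynchrony <= 1 gives 7 product
# states instead of 9, keeping the equivalent HMM's topology from exploding.
```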
Another way to do this is to initialize using an audio-only HMM and a video-only HMM and then put their states together. Once we put together this HMM equivalent to the coupled HMM, we can do re-estimation based on the new topology. That tends to converge more easily; if you directly train on a [inaudible], it pretty much always fails, and most of the states never get traversed. So in terms of whether we can get performance similar to using all these lower-level detectors, the answer is absolutely yes; we could actually do better than their approach. The benefits come, of course, from multiple places. First, using a dynamic Bayesian model to do Viterbi decoding gives you an advantage. Also, characterizing the video using this generalizable feature can sometimes even give you an advantage over using the very localized low-level detectors. However, the two ways to initialize the HMM turn out to be not significantly different. The caveat is that it is important to initialize them reasonably, but which of the two you use doesn't seem to make much difference here. Okay, I think I need to wrap up here. Basically I talked about audio event modeling, with a lot of lessons from speech recognition, and video cue modeling, where we asked whether we can do these jobs reasonably without going through all this very painstaking segmentation. The answer is yes, and in many cases we actually do better. And then we could improve acoustic event detection by using audio-visual multimodal modeling. Thank you. That's a picture of the incomplete group. Maybe you'd recognize a couple of faces who worked here on this campus for a while. And I have a few very quick shots, but I think we're going over time. Should I take five more minutes or -- >> Dong Yu: [inaudible] >> Xiaodan Zhuang: Oh, okay. Some other projects -- this is a very interdisciplinary group, where the speech group and the video group work very closely together, and I also had experience working with phonologists who are interested to see whether a phonological theory works in a speech recognition setup, or whether we can devise a computational model according to their theory. So that gave me exposure to a couple of different projects that might be related to something happening here. For example, we worked on pronunciation modeling basically following the [inaudible] phonology theory. According to this theory, a word is pronounced as that word because there's a unique set of gestural targets behind it, defined on different tract variables. When the word is co-articulated with something else or the speech rate changes, there's speech reduction; this set of gestural targets is still there, just arranged in different ways. They shift along time, but the targets are still there. Because they overlap with each other in different ways, the surface form changes. So that's their theory, and we're trying to build a model according to it. The way that I answer this question is: if we define the gestural targets at a local time frame as a vector, then this 2D representation just becomes a vector sequence problem. So we say a gestural score, a 2D representation defined on tract variables and time, can be approximated as a sequence of gestural pattern vectors, where each vector defines the constriction targets at that local time.
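A hedged toy sketch of reading a gestural score as a gestural pattern vector sequence, as just described; the encoding (a tract-variable-by-time array with zeros where no gesture is active) is our illustrative assumption, not the actual system:

```python
# Hedged toy sketch: each column of the gestural score is a gestural pattern vector;
# runs of identical columns map naturally onto self-transitions of one state in the
# finite state machine discussed next.
import numpy as np

def pattern_vector_sequence(score):
    """score: (n_tract_variables, T) array of constriction targets, 0 where inactive."""
    vectors = [tuple(col) for col in score.T]        # one gestural pattern vector per frame
    states, durations = [], []
    for v in vectors:
        if states and v == states[-1]:
            durations[-1] += 1                       # self-transition: stay in the same state
        else:
            states.append(v)                         # inter-state transition: new configuration
            durations.append(1)
    return states, durations
```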
Once we have this formulation, we want to see whether we can go from the speech observation to a gestural pattern vector sequence, thus to the gestural score, and therefore to words. The gestural pattern vector encodes the instantaneous pattern of gestural activations across all tract variables in the gestural score at a particular time, and the gestural score is approximated as the sequence of these gestural pattern vectors. Of course at each step we can introduce some confidence scores, the likelihoods. The interesting part is actually the gestural-score-to-word part, where we use a [inaudible] gestural timing model, originally developed for articulatory phonology, to generate one canonical gestural score for each word. This is like the canonical pronunciation for that word. So words will be distinguishable according to their gestural scores, in the sense that they are distinguishable according to the set of gestural targets behind them, but not according to their particular timings, because when a word is pronounced by different people in different contexts, the timings tend to change, according to [inaudible] phonology. The computational model that we devise according to this theory has a finite state machine emitting gestural pattern vector sequences. You can imagine each path in this finite state machine emitting a gestural pattern vector sequence, which is essentially a gestural score with a particular timing. If you look at the word "the" here, these are the different tract variables. They always have these four gestural activation targets, but the targets can be arranged with different timings, and we want to summarize all of this into the finite state machine through some recursive process. This machine is basically the pronunciation of this single word, and once we have it, we'll be able to do word recognition by composing a lattice with a pronunciation model, which is roughly the union of all these different word pronunciations. >>: [inaudible] looking at the acoustic [inaudible] or by looking at the articulation? >> Xiaodan Zhuang: We actually look at the canonical speech gesture. So we have this intergestural timing model which can generate one of these 2D representations for each word. Then we're saying the pronunciations allowed are all the different possible shiftings of all these gestural targets. We build this recursively: looking at a canonical one, say the set of gestural targets we have, at each time frame we propose possible combinations of them, weighted by some quality measures of whether a combination is possible or not, from the training data or from the theory. Each time we make an inter-state transition, it's like changing from one local gestural target combination to the next one, and each self-transition says we stay in this particular configuration for another time slot. So we build this recursively, like building a tree, and in the end you can optimize the finite state machine by weight pushing and so on. Once you have this finite state machine, you can use the open source finite state machine toolkits to optimize it and construct a dictionary. This is still at a very early stage.
It's not -- we don't have any evidence to say this works better than the conventional approaches, so it's more of an interdisciplinary exploration with the phonologists. >>: [inaudible] >> Xiaodan Zhuang: Oh, the targets. >>: Yeah. >> Xiaodan Zhuang: The part that I worked on was devising how these things all fit together. We have collaborators in [inaudible] lab who are working on how to recover this underlying 2D representation from the speech surface form. They actually have a series of works on that, but I think it's probably beyond what I can talk about here. Yeah. Another thing is that we did some speech retrieval for an unknown language for a multimedia evaluation, where the basic idea is that you build indexes for each database entry: you summarize each entry as a lattice and then convert that into a finite state machine-based index. That index basically allows transitions from the beginning into many of the internal nodes instead of requiring a complete traversal of one path through the lattice. When you introduce these extra transitions from the beginning to the different nodes, you associate them with something like a forward probability or backward probability. For that part we're actually just following the earlier Bell Labs approach. The reason for doing it that way is that I was given three weeks to build this system, so I wanted to devise something that requires less tuning. Using the finite state machine was a reasonable way to do this, and the tuning could mostly be done using standard optimization methods. For unknown speech, we are talking about model training with different languages, none of which is the testing language that you retrieve from, so the idea is that you want models that are hopefully language independent. The way that we address it is to cluster all these language-dependent atomic models based on their pairwise distances. The clustering result hopefully captures information that is shared across the different training languages but is not language specific, and hopefully that information is more transferable to a brand-new, never-seen language at testing time. Oh, the last one was actually done with Microsoft. The problem was to convert from speech to lips. Our original baseline was STS-based: in this model you build one triphone visual model for each phone, and then you can synthesize the lips from the triphone sequence, which is known beforehand. So the questions here are, first, can you engage audio to improve, and second, can you do lips without knowing the ground truth phone sequence, which usually can be hard to get? The answer is yes to both questions. You want to engage the audio in the maximum likelihood estimation of the video observation, which improves beyond the visual-only synthesis. The diagram shown here is more on the conversion side, where we don't have this ground truth phone sequence at all. The idea here is that if we are going from speech to lips, it looks like it's not always necessary to jump a huge semantic gap to find the underlying speech content. It may be easier not to go into the speech content, just because of how hard that is. So the way that we address this is to use an audio-visual joint Gaussian mixture model, and then we select an optimized modality weight according to a development set.
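A hedged sketch of the standard joint-GMM conversion formula, in the spirit of the audio-visual joint Gaussian mixture model just described (the modality weighting and the refinement discussed next are not shown, and parameter names are illustrative assumptions):

```python
# Hedged sketch: with a GMM fit on stacked [audio; visual] vectors (full covariances
# assumed), the visual estimate for an audio frame `a` is the posterior-weighted sum
# of the per-component conditional means E[v | a, k].
import numpy as np
from scipy.stats import multivariate_normal

def convert_audio_to_visual(gmm_weights, gmm_means, gmm_covs, a, d_a):
    """d_a is the dimensionality of the audio part of each joint mean/covariance."""
    post, cond_means = [], []
    for w, mu, cov in zip(gmm_weights, gmm_means, gmm_covs):
        mu_a, mu_v = mu[:d_a], mu[d_a:]
        S_aa, S_va = cov[:d_a, :d_a], cov[d_a:, :d_a]
        post.append(w * multivariate_normal.pdf(a, mean=mu_a, cov=S_aa))
        cond_means.append(mu_v + S_va @ np.linalg.solve(S_aa, a - mu_a))
    post = np.array(post) / np.sum(post)               # component posteriors given audio
    return np.sum(post[:, None] * np.array(cond_means), axis=0)   # E[v | a]
```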
And then on the other hand, we want to find the visual sequence that best mimics the ground truth video sequence. Maximum likelihood is a good criterion to work with, but it's not necessarily in line with the target metric, which is human perception. Human perception is very hard to evaluate directly, so we approximate it using the mean square error of the synthesized trajectories. That can then be used to refine the visual Gaussian mixture model. It's kind of an extension of the minimum generation error idea that Microsoft has worked on for speech synthesis, here in a conversion setup. Yeah. That's all I have today. Thank you. [applause]