>> Emily Fox: Thank you. I want to start by actually introducing my grad student, potential intern here, Raja. Stand up, Raja. He's not working on this kind of stuff. He's working on very cool repulsive processes, scaling up to large datasets. Kind of a counterpoint to the types of things I'll discuss here.
Okay. So in this talk we're going to describe a set of Bayesian nonparametric time series models that are useful in analyzing complex dynamic phenomena. This is joint work with Eric Sudderth, Michael Jordan from Berkeley and Alan Willsky at MIT.
So there is a large class of datasets that can be categorized as exhibiting very complex but patterned behaviors. So, for example, you might be wondering what these datasets have in common. How does the problem of segmenting conference audio into a set of speaker labels relate to the problem of analyzing the dance of honeybees, and what does that have to do with describing human motion, let alone trends in stock data?
The complex dynamics present in these datasets can be modeled using what are called Markov switching processes. For example, honeybees, which appear to have rather chaotic motion, are actually switching between a set of dances, each of which can be described using a simple linear dynamical model.
Previous approaches to analyzing these datasets such as the ones presented in these papers relied on a lot of application-specific information in order to fix the number of dynamic regimes or relied on heuristics in order to infer them from the data.
Instead, we're going to cast the problem within a Bayesian nonparametric framework and show how the Dirichlet process provides a useful prior. The clustering properties induced by the Dirichlet process have been exploited in many standard mixture modeling applications.
So, for example, imagine that you have data generated from some mixture of Gaussians with unknown number of mixture components, which in this case happen to be three.
The Dirichlet process allows you to infer this from the data while still allowing for new components to be added to your model as more data are observed. And the way in which it does this is by defining a countably infinite set of mixture weights in such a way that it leads to this sparse but flexible clustering structure. Okay.
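A minimal illustrative sketch, not the speaker's code, of where those sparse-but-flexible weights come from: the stick-breaking construction of the Dirichlet process. The concentration alpha and truncation level K here are arbitrary choices for the example.

```python
import numpy as np

def stick_breaking_weights(alpha, K, rng=np.random.default_rng(0)):
    """Truncated draw of DP mixture weights via stick breaking (GEM(alpha))."""
    v = rng.beta(1.0, alpha, size=K)                   # stick-breaking proportions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining                               # weights decay rapidly

weights = stick_breaking_weights(alpha=1.0, K=50)
print(weights[:5], weights.sum())                      # most mass on a few components
```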
Well, instead of this type of static data, we're interested in time series. And specifically time series modeled using Markov switching processes. And here the goal is to cluster the observations based on the underlying dynamics of the process.
And one classical example of a Markov switching process is the hidden Markov model, or HMM. So just as a little background, an HMM consists of an underlying discrete-valued state sequence, represented by the z random variables, which is modeled as Markov with respect to some collection of transition distributions pi_k.
So, for example, in the human motion capture application, they might represent a set of action labels such as jumping jacks, squats, side twists and so on. The HMM assumes that, conditioned on the state sequence, the observations are independent emissions from some family of distributions, which in this case we take to be Gaussian. So each observation might represent a vector of body position measurements.
One can view a sample path of the state sequence as a walk through the following state-versus-time lattice. Let's assume at the first time step we start in this second state, which might correspond to jumping jacks, and we have the following set of possible transitions, where the relative weights of these arrows capture the probability of making each one of these transitions, and that's indicated by that state's transition distribution pi_2. Now imagine that we persist in this jumping-jack state; we have the same set of possible transitions, then we transition to the squat state, which maybe has two possible transitions, and so on.
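A small illustrative sketch, not from the talk, of sampling such a state path: the transition matrix below is a made-up K=3 example with heavy diagonal weight, so the walk tends to persist in a state.

```python
import numpy as np

rng = np.random.default_rng(1)
pi0 = np.array([0.2, 0.6, 0.2])                 # initial state distribution
pi = np.array([[0.90, 0.05, 0.05],              # row k is transition distribution pi_k
               [0.05, 0.90, 0.05],
               [0.05, 0.05, 0.90]])

T = 10
z = np.empty(T, dtype=int)
z[0] = rng.choice(3, p=pi0)
for t in range(1, T):
    z[t] = rng.choice(3, p=pi[z[t - 1]])        # transition out of the previous state
print(z)                                         # e.g. a persistent state path
```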
Okay. In everything I've described so far I've assumed that there are capital K different states, like jumping jacks. So this raises the question: What if this isn't known ahead of time, and what if we'd like to be able to add new dynamic regimes to our model as more data are observed?
Well, in such a case an attractive approach is to appeal to Bayesian nonparametrics and let the number of states K tend to infinity. It's been shown that a hierarchical layering of Dirichlet processes allows us to define a collection of countably infinite transition distributions.
So the Dirichlet process is what allows for this unbounded state space.
And the hierarchical layering of Dirichlet processes is what couples together the transition distributions so that there's a shared sparse set of possible states visited.
Well, we developed a generalization of this HDP-HMM called a sticky HDP-HMM that better captures state persistence. The way it does that is by increasing the prior bias towards self-transitions, which I'm indicating here. What this means is that if I'm currently doing a jumping jack, I'm more likely to continue to do a jumping jack than to switch to some new state.
In our formulation, we'll infer this bias towards self-transitions from the data itself.
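A hedged sketch of how that bias can appear in a truncated (weak-limit) approximation to the sticky HDP-HMM prior: a shared global distribution beta couples the rows, and each row pi_k gets extra concentration kappa on its own state. The parameter values and truncation level L are illustrative, not those from the talk.

```python
import numpy as np

def sticky_hdp_transitions(L, gamma, alpha, kappa, rng=np.random.default_rng(2)):
    beta = rng.dirichlet(np.full(L, gamma / L))       # shared global state weights
    pi = np.empty((L, L))
    for k in range(L):
        conc = alpha * beta + kappa * np.eye(L)[k]    # extra mass on self-transition
        pi[k] = rng.dirichlet(conc)
    return beta, pi

beta, pi = sticky_hdp_transitions(L=20, gamma=5.0, alpha=1.0, kappa=10.0)
print(np.diag(pi).mean())                             # self-transition probabilities are inflated
```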
Okay. So just to visualize the difference between this original HDP-HMM and its sticky variant, it's useful to look at draws from the prior.
So here we're showing two sample paths of state sequences drawn from each of these models. And if the labels here are going to represent something such as who is speaking when within a conference audio recording, or which action I'm performing within an exercise routine, well, the dynamics present here much better capture what we'd expect in these real-world applications than this rapid switching between states.
And so we showed in a number of applications that it was really essential to capture this in our prior to have good performance. So we looked at this problem of speaker diarization using the sticky HDP-HMM.
So here the problem is to take conference audio recordings and segment them into speaker-homogeneous regions in the presence of an unknown number of speakers. So we're looking at the NIST speaker diarization database, which consists of 21 recorded meetings with ground truth labels.
These labels are just used to assess our performance. What I'm showing is that the sticky HDP-HMM is able to provide state-of-the-art diarizations. These numbers are diarization error rates, basically a calculation of the percent of incorrect speaker labels.
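For intuition only, here is a simplified label-error computation in that spirit: the fraction of frames misassigned after an optimal one-to-one speaker relabeling. The actual NIST DER also accounts for missed and false-alarm speech and scoring collars, so this is not the official metric.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def simple_label_error(true_labels, est_labels):
    true_ids, est_ids = np.unique(true_labels), np.unique(est_labels)
    overlap = np.array([[np.sum((true_labels == t) & (est_labels == e))
                         for e in est_ids] for t in true_ids])
    rows, cols = linear_sum_assignment(-overlap)       # maximize matched frames
    matched = overlap[rows, cols].sum()
    return 1.0 - matched / len(true_labels)

true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
est  = np.array([5, 5, 5, 3, 3, 3, 7, 7])
print(simple_label_error(true, est))                   # 0.125 for this toy example
```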
And we also compare to the non-sticky model, where we simply set that prior bias toward self-transitions to zero; you see significantly worse diarizations. And here we compare to the performance of the Berkeley ICSI team. They use an agglomerative clustering approach that's highly engineered for this task, and it represented the gold standard at the time we did this analysis.
Okay. Well, just to visualize what our diarizations look like, here's one meeting where we did really well. So on the right-hand side I'm showing the true speaker labels. On the left-hand side our estimated labels with errors shown in red. So very few errors. But here's a meeting where we did significantly worse. Okay.
What you see is that our errors can be attributed to taking one true speaker and splitting it into multiple estimated speakers. That might be a reasonable segmentation; maybe a person sounds different throughout the recording. However, a benefit of taking this Bayesian approach is the fact that we get a posterior distribution over possible segmentations.
So another one of our top five segmentations had significantly lower DER. So to a practitioner, this might be useful. And it's in contrast to the agglomerative clustering approach, which produces a single segmentation.
Okay. Well, in addition to looking at these Bayesian nonparametric hidden Markov models, we also looked at other classes of Markov switching processes, including autoregressive hidden Markov models. And here each observation is modeled as a linear combination, specifically a state-specific linear function of some previous set of observations, plus state-specific noise or innovations. So this better captures temporal correlations in the observations themselves within a given state.
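A minimal sketch, under illustrative assumptions (order-1 AR, two made-up regimes, a fixed state path), of that switching autoregressive observation model.

```python
import numpy as np

rng = np.random.default_rng(3)
d, T = 2, 100
A = {0: 0.95 * np.eye(d), 1: np.array([[0.0, -0.9], [0.9, 0.0]])}   # state-specific lag matrices
Sigma = {0: 0.01 * np.eye(d), 1: 0.1 * np.eye(d)}                   # state-specific innovation covariances
z = np.r_[np.zeros(50, int), np.ones(50, int)]                       # fixed example state path

y = np.zeros((T, d))
for t in range(1, T):
    mean = A[z[t]] @ y[t - 1]                          # linear function of previous observation
    y[t] = rng.multivariate_normal(mean, Sigma[z[t]])  # plus state-specific noise
print(y[:3])
```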
And we also looked at Bayesian nonparametric extensions of switching linear dynamical systems. And there are many other formulations you could consider as well.
Okay. So in addition to the speaker diarization problem, using these other Bayesian nonparametric Markov switching processes, we showed that they're useful in detecting changes in the volatility of a stock index as well as in segmenting these dances of honeybees.
The cool thing here is that in both of these cases we used exactly the same model: in one case we passed in the daily returns of a stock index and in the other case the tracks of a honeybee, and we inferred this underlying structure without building in application-specific information.
Well, up to this point we've assumed we're interested in inferring the dynamic regimes present in a single time series. An interesting question is: What if you have a collection of related time series?
How do you jointly model them? So, for example, imagine you have some set of videos of people performing some sort of exercise routines. How do you transfer knowledge between them, both to improve parameter estimates in the face of limited data and to find interesting structure in how they relate to one another?
Well, if we were to independently model each of these time series using the types of processes described before, this would necessarily define Markov switching processes on completely disjoint countably infinite state spaces. Alternatively, a naive approach to jointly modeling them is to assume that each uses exactly the same state space. However, what we'd really like to capture here is the fact that there's some sharing of states but also some variability.
So, for example, perhaps in one movie a person does side twists but that motion doesn't appear in any of the other movies. Okay. Well, our modeling approach is going to start by modeling each one of these time series using a switching autoregressive process. And just remember what this means: each state has these AR dynamics that are parameterized by a set of state-specific lag matrices and a noise covariance. And so again we're going to introduce an infinite set of possible dynamic regimes, such as jumping jacks, side twists, arm circles and so on. But remember here that we want to capture the fact that not every time series exhibits all of these behaviors. So what we can do is think of these as defining some set of features that are either going to be selected by the time series or not.
So, in particular, we're going to summarize the set of selected behaviors with this binary feature matrix. Every row is a different time series, so one of our different movies, and every column is a different dynamic behavior from this shared library that we defined, so all of these different actions.
Okay. So here the white squares indicate the set of selected behaviors. So every time series has a feature vector, and what that feature vector is going to do is constrain that time series to only switch between its set of selected behaviors. And the way it does that is by operating on a collection of transition distributions to define what we call feature-constrained transition distributions.
So in particular you can imagine that we defined a set of transition distributions in an infinite-dimensional space. We do an element-wise product with that time series' feature vector and renormalize. Now that time series can only switch between its selected behaviors.
Okay. So just a bit more formally: each one of these transition distributions is Dirichlet distributed only along the dimensions specified by the selected features. And again this kappa parameter here, just like in the sticky HDP-HMM, is going to encourage self-transitions. Then our state evolves according to these feature-constrained transition distributions, and that state indexes into this shared library of dynamic behaviors to select which autoregressive process is driving the dynamics at that time step.
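A hedged sketch of that construction under a truncated behavior library: draw unnormalized Dirichlet weights with the extra kappa mass on the self-transition, mask by the binary feature vector via an element-wise product, and renormalize. The parameter values, library size, and feature vector are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
L = 6                                            # truncated behavior library size
f = np.array([1, 0, 1, 1, 0, 0])                 # this time series' binary feature vector
gamma, kappa = 1.0, 5.0

pi = np.empty((L, L))
for j in range(L):
    draw = rng.gamma(gamma + kappa * np.eye(L)[j])   # unnormalized Dirichlet draw, kappa on self
    masked = draw * f                                # element-wise product with feature vector
    pi[j] = masked / masked.sum()                    # renormalize over selected behaviors
print(pi[0])                                          # zero mass on non-selected behaviors
```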
Okay. Well, then the key ingredient is how we're going to define this feature matrix. And for this a Bayesian nonparametric method known as the beta process is going to prove quite useful; it allows us to define a feature space that has an infinite number of possible features, and it's going to encourage the use of a sparse subset of these.
So in particular the predictive distribution on feature assignments that's induced by specifying this beta process prior has this analogy that's referred to as the Indian buffet process. So imagine you have this restaurant and it has this buffet line, with infinitely many possible dishes and customers arrive at this restaurant.
The first customer comes in and just chooses some Poisson number of dishes. The next customer comes in and is more likely to choose the dishes that the first customer chose; he gets to the end of the set of sampled dishes and then chooses some Poisson number of new dishes himself.
And this process continues where every arriving customer comes in, samples dishes in proportion to how many people have sampled them so far and then chooses some new set of dishes.
Okay. Well, here we're showing a feature matrix associated with a draw from this Indian buffet process, where every row corresponds to a different customer and every column to a different dish, and the white squares indicate the set of selected dishes.
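A small illustrative sampler for that analogy; the mass parameter alpha and the number of customers are arbitrary choices here.

```python
import numpy as np

def sample_ibp(num_customers, alpha, rng=np.random.default_rng(5)):
    dish_counts = []                                     # how many customers chose each dish so far
    rows = []
    for n in range(1, num_customers + 1):
        # sample existing dishes in proportion to their popularity
        row = [1 if rng.random() < c / n else 0 for c in dish_counts]
        new_dishes = rng.poisson(alpha / n)              # try some brand-new dishes
        dish_counts = [c + r for c, r in zip(dish_counts, row)] + [1] * new_dishes
        rows.append(row + [1] * new_dishes)
    width = len(dish_counts)
    return np.array([r + [0] * (width - len(r)) for r in rows])   # pad into a binary matrix

print(sample_ibp(num_customers=8, alpha=3.0))            # rows: customers, columns: dishes
```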
So what we see by the structure induced here is that by specifying this beta process prior on our feature space, we're encouraging sharing of dynamic behaviors between the time series while still allowing for some time series-specific variability.
Okay. So we looked at -- let me just note we refer to that model as the BP-AR-HMM, for beta process autoregressive hidden Markov model. Okay.
So we used this BP-AR-HMM to analyze a set of videos taken from the CMU motion capture database. In these videos, these people are dressed up in these mocap suits which record 62 dimensions of body position and joint angle measurements. For our analysis, we're going to look at the 12 that correspond to gross motor movement, because we don't care about things like what the digits are doing.
And our goal here is to discover common behaviors between these time series. And from this plot you can see that we're able to do just that.
Each one of these skeleton plots is an inferred contiguous segment of at least two seconds of motion, and we're grouping together all the skeletons -- so all of the different segments -- that were labeled as having been generated from the same dynamic behavior. And the color of the box indicates the true behavior category.
So we see that we've been able to identify and group together the six examples of jumping jacks that appeared in these different movies. Arm circles -- sorry, side twists in this case, arm circles and squats.
And one nice thing is the fact that we've been able to identify a set of behaviors that appeared in one and only one movie. We did, however, split a couple of motion categories, for example knee raises. But if you look closely at the data, there's actually a reason for this.
In one case this person has significant side-to-side upper-body motion, whereas the other person is doing more of a crunch. And in the case of running, one person is running with their hands in sync with their knees. Not sure why. And the other person is doing a more standard motion. I won't make fun of computer scientists here, but this person is in the middle of doing a jumping jack as they start running. So just generally confused.
But, overall, what you can see is that we've been able to clearly identify a set of behaviors that appeared within a set of time series while still allowing this time-series-specific variability, these unique behaviors, to pop up.
Okay. So we also applied this BP-AR-HMM to the problem of parsing EEG recordings. We're interested in doing this both to standardize and automate the process of reading a patient's EEG and to relate subclinical burst activity to full-blown seizures.
In particular, our data are going to come from these intracranial EEG recordings taken from the grid electrodes shown here. So the collection of channels in this grid provides our collection of time series that we wish to jointly model using this BP-AR-HMM. But just remember that in the BP-AR-HMM, once you condition on the model parameters, the time series evolve independently. So one guy's exercise routine wasn't influenced by the exercise routines of the other guys.
But that's clearly not the case here. So, in particular, in this plot what I'm showing are the residuals in channel voltages after subtracting off the BP-AR-HMM predictions.
What you see is these errors, these residuals have clear spatial correlations. So in order to account for these, while still coping with the dimensionality of the time series, we're going to harness a sparse dependency structure.
In particular, we're going to employ a graphical model that's based on spatial adjacencies within the grid. The other thing we'd like to capture is that the correlations between these channels change over time as the seizure goes through various states. So our modeling approach is going to start by modeling each channel's dynamics using this BP-AR-HMM; however, now instead of having independent innovations driving these switching autoregressions, we're going to couple together the innovations using a Gaussian graphical model. In particular, this very big covariance matrix is going to have a sparse precision specified by the graph structure of this graphical model. And in addition, in order to capture these changing correlations over time, we're going to assume that the covariance matrix itself follows its own Markov process.
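A hedged sketch of the coupling idea, assuming a toy 3x3 electrode grid and a precision matrix built from the grid Laplacian; the actual model places a prior over this structure and lets it evolve, rather than fixing it as done here.

```python
import numpy as np

rows, cols = 3, 3
n = rows * cols
adj = np.zeros((n, n))
for r in range(rows):
    for c in range(cols):
        i = r * cols + c
        if c + 1 < cols: adj[i, i + 1] = adj[i + 1, i] = 1       # horizontal grid edge
        if r + 1 < rows: adj[i, i + cols] = adj[i + cols, i] = 1  # vertical grid edge

precision = np.diag(adj.sum(1) + 1.0) - adj        # sparse, positive-definite (Laplacian + I)
cov = np.linalg.inv(precision)                      # implied dense covariance across channels
rng = np.random.default_rng(6)
innovation = rng.multivariate_normal(np.zeros(n), cov)   # one spatially coupled innovation draw
print(np.count_nonzero(precision), "nonzeros in precision vs", n * n)
```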
So using this type of model, what we can do is produce parsings of intracranial EEG recordings that look like the following. The gray-scale bar indicates the covariance state over time; we have these correlation matrices associated with these covariance states evolving with the seizure, and we also have these channel dynamics.
Okay. So just remember we have a shared library of possible dynamics that these channels can switch between asynchronously.
So using these types of parsings of EEG recordings, we can start looking at the problem of relating subclinical bursts to full-blown clinical seizures. Over the last decade, there have been a lot of discoveries of other types of epileptic events beyond the classical full-blown clinical seizure, and it's been shown that these events tend to occur very frequently in the hours preceding a seizure. There's a lot of interest in trying to figure out what the relationship is between these subclinical bursts, which are fewer than ten seconds long, and these actual clinical seizures.
So here I'm just showing two subclinical bursts, and the onset of a full clinical seizure, but our analysis was really looking jointly at
14 subclinical bursts and the full clinical seizure all from one patient's recordings. And what we showed from this analysis is the fact that the onset of these seizures and the subclinical bursts are extremely similar.
What's different is that in the subclinical bursts there's this disrupting discharge, and the dynamic state present here never appears in this full clinical seizure.
So this is really interesting for clinicians to study; maybe it's some type of false-start seizure, potentially leading to insights in terms of treatments here.
Okay. Well, in summary, we looked at these Bayesian nonparametric variants of classical models like the hidden Markov model, the AR-HMM, and the switching linear dynamical system, and we focused on this idea of relating multiple time series using this feature-based representation, specifically based off of the beta process. And we demonstrated the utility of these processes by looking at the task of speaker diarization, analyzing motion capture videos, and parsing these EEG recordings.
Thank you.
[applause]
>> Ofer Dekel: Okay. We have time for some questions.
>>:
Very interesting topic. How do you [inaudible] stickiness.
>> Emily Fox: Okay. So --
>>: Repeat the question.
>> Emily Fox: How do you choose kappa, your stickiness? That's a very good question. And that was actually the crux of that paper: the fact that we're not going to select it ahead of time. We're going to take this fully Bayesian approach. We're going to specify a prior on it, typically a weakly informative prior, and then we're going to infer it from the data. So in some cases you infer a very low probability of self-transition and the model reduces down to the original HDP-HMM, but in many cases -- most real-world applications -- you actually learn a pretty significant bias towards self-transitions. So the challenging part of that model was really how do you do those posterior computations in a principled way.
>>: Shared between all states?
>> Emily Fox: Yes, in the prior. So the posteriors for the different states can differ, yeah, but, yes, in the prior they're coupled together. There are specifications you can write down to develop a hierarchy and shrink them all towards a common value, but we chose the simplest thing.
>>: More questions?
>>: [inaudible] similar work to this for navigation. [inaudible] understand them as coming from GPs, and it seems like you're sort of adding on BPs; that's kind of the process. Maybe a comparison between the two, how yours goes beyond that.
>> Emily Fox: Jonathan How has work. I actually worked for Jonathan How back when I was an undergrad. So his work -- and he's organizing the session at NIPS also, where I'm speaking. But I guess I don't know exactly which of his work you're referring to. But in terms of their GP dynamics, right -- yes, these are GP dynamics if you assume a band-limited kernel matrix, right? An autoregressive structure.
We're not looking at GP dynamics themselves. We're looking at either independent or autoregressive dynamics. And then there's this question -- the focus here was on the number of dynamic regimes present, assuming some model, assuming an HMM, assuming a switching autoregression. And so the Dirichlet process is a way to allow for an unbounded number of states, whereas the beta process allows for the same thing, but you can think of it more as an L1-type penalty, where within the different time series you can have different dynamic regimes pop up or be penalized in this kind of hard fashion. But that doesn't completely address your question, I realize.
>> Ofer Dekel: Okay. Let's thank the speaker.
[applause]