>> Dan Povey: So I'm happy to introduce Lukas. Lukas is the head of the
speaker ID effort at BUT that has in the last few years become quite well known.
This is due to the success in the speaker ID evaluations.
And I asked Lukas how many of these they actually won. Apparently it's a big no-no
to say that someone won one of these evaluations, it's against the rules I guess
because everyone is a winner. [laughter]. But they did outstandingly I think it's
safe to say.
And Lukas is also in charge of the other kind of research efforts at BUT involving
speech things. And I personally know Lukas through this JHU project where we
were doing the SGMM work. Lukas was kind of in charge of a lot of aspects of
that project and contributions to the math as well.
So he knows his stuff. And he's going to be explaining -- I really don't know what
this talk is about because I'm not a speaker ID guy, but I hope that by the end I
will understand it. So, Lukas.
>> Lukas Burget: Thanks, Dan, for the introduction. So as Dan said, I'm
interested in different problems in the speech field -- speech recognition,
language identification, speaker identification, keyword spotting. But
today's talk will be about speaker recognition, which is what I have been
mainly active in over the past few years.
So what do we have on the agenda today? I will just give some short
introduction into what the task is about, and then I will say a few words about
the main problem in speaker identification, which we call channel
variability. I will explain what I mean by that.
Then I will give some kind of tutorial on the techniques for speaker
identification. I will describe the traditional approach, and then I will move to the
approaches that evolved over the past few years, so I will show you the shift in
the paradigm that happened during the past few years that improved the
systems in terms of both speed and performance. And then at the end, I will say a
few words about my most recent work, which is a discriminative approach to
speaker identification, something really new in speaker identification but
quite usual in speech recognition --
>>: [inaudible].
>> Lukas Burget: Yes?
>>: [inaudible].
>> Dan Povey: Do you have your mic -- I think he's not supposed to be
broadcasting, it's just a recording.
>>: We're having trouble hearing.
>> Lukas Burget: Okay. I can definitely speak louder.
And so this slide just lists some of the speaker recognition applications, just to
give an idea of what the technology is really good for. And we were mainly involved in
projects that deal with the security and defense aspects, where you can use
speaker identification for searching for a suspect in large quantities of audio, and
the other line, link analysis and speaker clustering. Link analysis is
actually something that the defense people use to search for things like -- I mean,
the problem is you would have lots of recordings of various people and you would
really like to make some link between where the same speaker speaks and whether
he speaks from the same phone and things like that. So you really need to search
in large quantities of audio and find the same speakers, clusters of speakers.
So quite a difficult problem if you have millions of recordings.
Then, maybe for the more conventional applications, you can think of
applications for access control to computer networks.
Nowadays speaker identification is used for transaction authentication. So
telephone banking uses speaker identification not really to verify that it's the
speaker or not, but at least to get some idea. And eventually you don't need to
ask for that many passwords if, from the recording, it already looks like this is
probably the right person.
Then also you can think of other applications like the Web -- voice-Web or device
customizations, your device can -- I mean your mobile phone can recognize that
it's really you and adjust accordingly.
Maybe for the speech recognition people here, you can think of using this for
ASR -- clustering recordings to adapt the model for a particular speaker on the
recordings of the same or even a similar speaker. This is something that people
have been trying recently and which seems to help quite a lot. So you could
think of really searching for the same person in all the recordings that
come from the Bing system and trying to adapt the speaker model for the current
recording on the similar data. And so on.
And also I think that Microsoft could be also interested in the search in audio
archives and the -- I think there was some work on really indexing the speech
and accessing the speech so you could be interested in searching for particular
talk of some -- of the same speaker that you have been watching right now,
things like that.
>>: [inaudible].
>> Lukas Burget: I just tried to point out maybe some of those that we are mainly
interested in and maybe what Microsoft could be interested in but really not for
any particular reason.
So there has been a lot of progress in the past, say, five years. There was a massive
speed-up in the technology that allows us to use it on a much larger
scale. I will speak about the things that are described in the bullets, like the
development of the techniques that allow you to create
low-dimensional voiceprints for a recording; you then search for the
speakers just by comparing these voiceprints. The voiceprints can be
obtained something like 150 times faster than real time, so this is something that
you can easily do online when you record speech.
And then once you have these low-dimensional voiceprints, you can compare
the voiceprints, running billions of these verification trials, billions of
comparisons, in a few seconds.
So this is one aspect: we sped up the technology quite a lot. And the other
aspect is that we improved the accuracy a lot. So over the past
five years we improved the technology by a factor of more than five,
so the error rates are more than five times lower. You will see that in the graph
that I will be showing in a second.
>>: [inaudible] compare the five times, that was what kind of system at the time?
>> Lukas Burget: I will talk about that. I mean, really, compared to the state of the art in
like 2006 -- you will see what I mean.
So one thing that allowed us to improve the technology is the Bayesian
modeling approaches that we are using nowadays. I will mention those. And
mainly the development of channel compensation approaches like Joint
Factor Analysis and Eigenchannel adaptation. Again, I will introduce those.
So first of all, what the task in speaker recognition is. We will be mainly
interested just in text-independent speaker verification, where the task is: given an
example recording of some speaker, detect recordings of the same speaker in
other data. Or, equivalently, we can say: given a pair of recordings of unrestricted
speech, decide whether these two recordings come from the same speaker or from two
different speakers.
So these are two questions which are actually equivalent. But the form in which you ask
this question somehow corresponds to two verification approaches. The
first one is the more traditional one, which I will talk about in more detail shortly, where
you train a speaker model on some enrollment utterance and then you have some
background model which is trained on a large amount of recordings from many
speakers. And then to make a decision for a test utterance you just compare the
likelihoods between these two models.
The other approach would be to have two models. Both these models would
take both recordings as their input, and one model would say how likely it
is that these two recordings come from the same speaker, and the other one
how likely it is that they come from two different speakers, and you
would compare these likelihoods.
I already mentioned the problem with channel variability in speaker
identification. This is really the most important problem there. What do we mean by
channel variability? Just imagine that for each verification trial, for
each pair of recordings, we have these two recordings. They are usually just a few
seconds or maybe a few minutes long, at most. And you are supposed to say
whether they come from the same speaker or from different speakers. But each of
them could have been recorded under different conditions: a different
telephone, different handset, different microphone, different background noise,
a different mood of the speaker -- or he could be sick, and his voice can be different
because of that.
So this is what makes the problem difficult. We have just two
recordings. They can each sound quite different, maybe just because of the
background noise, and you are still supposed to say that it is the same
speaker in both of them.
>>: [inaudible] variability if you are focusing on speaker variability in addition to
channel and session, I thought the more important variability [inaudible]
difference, since you are dealing with text independent.
>> Lukas Burget: Well, you will see that -- well, the phonetic
difference is also unwanted variability, but usually at least we have a recording
which kind of covers all the phonemes, so you have information about all the sounds
possible. But it doesn't have to be the truth. So, I mean, in fact, the
difference in the content spoken is of course one of the
problems here.
But with the data that we usually deal with, which are say about two minutes long,
they cover all the phonetic information, so this is not really
an issue. The more difficult part is the differing channel, and you will see
that on the next slide.
Anyway, so we have -- we should call this like channel variability session,
channel or session variability, inter-session variability. But I'm going to refer to
this kind of variability simply as channel. But I will mean by that even this
phonetic variability, for example.
What I'm showing on this slide is how important it is to apply these channel
compensation techniques. And it was probably in the 2006 NIST speaker
recognition evaluations that it was recognized that the channel compensation
techniques are important.
We can see that on the graph -- I should say first what the graph is actually
showing. So this is a DET curve, which is the usual graph showing the
performance for a speaker identification task. I mean, I don't know whether you
can read what is on the graph, but this axis is the false alarm probability and this
one is the miss probability. The graph is pretty similar to what people
would know as an ROC curve, a receiver operating characteristic curve. Usually on
the ROC curve this axis would be flipped, so you would be showing the accuracy
rather than the miss probability. And it also differs in the scale of the axes. The
ROC curve would use just a linear scale from zero to one hundred percent
probability of error. Here the scale is warped for a reason: if the distributions of
the target and non-target scores are Gaussian, then the curve is not really a
curve but just a straight line, so it is easier to compare the differences between
the lines.
Anyway, so what the graph shows is the probability of false alarm and the
probability of miss, and it actually shows us the performance at different operating
points, where you make a tradeoff between these two kinds of error.
Either you make too many detections, but then you would
have a high false alarm probability and a low probability that you miss
something, or you go for the low false alarm region, which is actually where
the defense applications usually are, what the defense people are usually
interested in.
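Just to make the description of the plot concrete, here is a minimal NumPy/SciPy sketch of how such DET points could be computed; the score arrays are hypothetical, not data from the evaluations:

```python
import numpy as np
from scipy.stats import norm

def det_points(target_scores, nontarget_scores):
    # Sweep a decision threshold over all observed scores.
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    return p_miss, p_fa

# The DET plot uses probit (inverse Gaussian CDF) warped axes, so Gaussian
# score distributions show up as straight lines.
target = np.random.normal(2.0, 1.0, 1000)        # hypothetical target scores
nontarget = np.random.normal(0.0, 1.0, 100000)   # hypothetical non-target scores
p_miss, p_fa = det_points(target, nontarget)
x = norm.ppf(np.clip(p_fa, 1e-6, 1 - 1e-6))      # warped false-alarm axis
y = norm.ppf(np.clip(p_miss, 1e-6, 1 - 1e-6))    # warped miss axis
```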
>>: Each curve is [inaudible].
>> Lukas Burget: Well, I -- [laughter] I can't say that, but -- anyway. These are
the curves from the 2006 evaluations. Dan said that we can't say who was first,
who was second. But we are actually allowed to show the DET curves of all the
participants. These are the black curves, and we can say that ours are the two
in color.
>>: [inaudible].
>> Lukas Burget: Sorry?
>>: How many systems, how many --
>>: Each line is a separate --
>> Lukas Burget: Each line is a separate site. The best system from the site -- or the
selected system from the site -- so there were like 30 sites participating in the
evals. And we have two curves just because we participated -- well, we had two
submissions. One was the Brno University of Technology submission, and we
were also part of a consortium with Spescom DataVoice, who are our friends
from South Africa, TNO from the Netherlands, and the University of Stellenbosch, also
from South Africa.
So why am I showing these graphs -- you can clearly see that these systems
are clearly better than the other systems, and this success of our system was
really because we implemented the channel compensation technique called
Eigenchannel adaptation, which I will introduce shortly.
There weren't any channel compensation techniques like this before, so some of the
participants, like MIT Lincoln Labs, for example -- and I can't say which curve
belongs to them -- had something that was called feature mapping, which was a
different kind of adaptation. You had to first recognize what channel it is; you
had to classify the channels. And after classifying the channels you could apply
the compensation, while the techniques that we are using
currently work in a kind of unsupervised way, where you recognize
the channel along the way. So we will see what that actually means.
This is another slide from the NIST evaluations two years later, and you can actually
see that -- well, our curve here is the black one, so it's still -- we still perform quite
well. But you can see that the difference is not all that large. Why was that?
Because all the other sites, of course, realized that they had to implement these
techniques too, so they kind of managed to catch up with --
>>: Was this evaluation more difficult --
>> Lukas Burget: Yeah. Yes, it was. I mean, every time they are
making it more difficult by introducing different channels and restricting things:
people have to make the conversations using different telephones, actually, so
nowadays you are not allowed to make two calls from the same telephone
because that's considered to be too --
>>: [inaudible].
>> Lukas Burget: Yes. And the --
>>: [inaudible].
>> Lukas Burget: Right. So the difference is mainly really in making
the task more difficult. Probably also there were
recordings of non-native speakers, which makes a difference, because we trained
our system only, or mainly, on native speakers. And the
non-native speakers introduce some other channel differences, in the
microphone, things like that.
So yes, this task -- it's every -- every NIST evaluation tries to be more and more
complicated and difficult.
>>: Was Joint Factor Analysis [inaudible].
>> Lukas Burget: In our system, there was Joint Factor Analysis. And why I'm
showing this graph is for two reasons. Of course I want to show that
we are doing really well in these evals, but also that
the technology that I'm going to describe in the following slides is not just a
system built at some site; it's really the state-of-the-art technology, and what you
are going to hear about today is the top technology in speaker identification.
>>: So this joint analysis -- Joint Factor Analysis -- is it in the feature domain or in the model
domain?
>> Lukas Burget: In the model domain. So you will see what I mean by that
shortly.
And yes, so what I'm trying to say here also is that usually most of the sites
combine many different systems based on different features and even different
modeling techniques to get the best performance. Here we could really achieve
this best performance just using a single system based on this Joint Factor
Analysis. So a pretty simple system that still provides this good
performance.
>>: How many speakers generally, ballpark in these evaluations?
>> Lukas Burget: You mean how many --
>>: Speakers in the database.
>> Lukas Burget: Like a few hundred speakers. Of course, I mean, there
would be lots of trials. You always make a trial out of a pair of recordings, and
there would be a limited number of same-speaker trials and lots of
different-speaker trials. But let's say there would be a few hundred recordings --
the last evaluations were maybe a few thousand
recordings, a few thousand speakers, sorry, and a few hundred thousand
trials that you are supposed to test.
>>: What is the duration of the [inaudible].
>> Lukas Burget: All the results that I'm showing here are on what we call the
one-conversation training condition, which is about two and a half minutes. It's a
five-minute conversation; from two to two and a half minutes of it is real speech
from one of the people. We have two --
>>: [inaudible].
[brief talking over].
>> Lukas Burget: Both. Both are two and a half minutes. I mean, there are
different conditions in the NIST evaluations, but we are usually interested
in this one and I am showing the performance on this one. This is like the main
condition that people would be looking at.
>>: How do number of speakers may affect the performance?
>> Lukas Burget: Which speakers?
>>: The number of speakers. How does the number of speakers --
>> Lukas Burget: Well, the number of speakers doesn't really affect the
performance, because you do verification. You don't train the
models on those speakers. Actually, when
you do a verification trial, when you get the two recordings, you are not allowed
to use any of the other recordings from the set. So you don't know anything
about the other speakers. So how good the
performance is is not really affected by the number of speakers in the --
>>: [inaudible] verification.
>> Lukas Burget: These are verifications, right.
>>: Verification results?
>> Lukas Burget: Yes. So it doesn't get simpler or more difficult if you have
more or fewer trials. You would just see that the DET curve would become steppy,
so it would get more noisy if you don't have enough trials to test it on, and it would
just --
>>: The entire evaluation has nothing to do with the identification task, where --
>> Lukas Burget: No, no, you don't have -- here you don't have information
about the -- yeah. This is pure verification. You don't have information about this --
>>: Is this different --
>> Lukas Burget: The models would be about the same. It just becomes
more difficult, and usually for an identification task you would need many more
recordings to make the results significant; it would just become too
simple. So they are just trying to make it more difficult so that the results --
the differences between systems -- are really significant. And also, I mean,
this is the task that the government people are really interested in, so.
So even if you use all the channel compensation techniques, the channel
difference still remains an issue, which we can see on this graph. This is a
graph from the 2008 evaluation; I'm using one of the other conditions, where the
recordings are recorded over microphone, so these are actually not telephone
recordings, but interviews recorded over microphone.
The other curve here is where you can actually see what happens if you
don't have this problem, because here we limited ourselves just to data
recorded over the same microphone. We still make many verification trials, but we
only compare recordings, of the same or a different speaker, that were recorded
using the same microphone. You can see that the
performance actually gets about three times better just because you don't have this
problem with channel mismatch.
Otherwise, they are the same recordings. They are the same noisy or clean
recordings recorded over the same microphones, just without the problem of
channel mismatch.
So there's still room to improve the techniques.
The traditional approach. I already said that in the traditional approach what we
do is first train something that we call the universal background model, using
recordings of many speakers.
Then, for each enrollment utterance, we train a speaker
model on the enrollment utterance, and then for verification
we take the test utterance, calculate the likelihood using both models, compare
those, and decide whether this is the same speaker or not. So basically the score
that we base the verification on is a log likelihood ratio between these
two likelihoods.
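A minimal sketch of that scoring, assuming diagonal-covariance GMMs; the parameter layout here is an assumption of this sketch, not code from the system described:

```python
import numpy as np
from scipy.special import logsumexp

def gmm_loglik(frames, weights, means, variances):
    # frames: (T, D); diagonal-covariance GMM with C components:
    # weights (C,), means (C, D), variances (C, D). Returns (T,) log-likelihoods.
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)          # (C,)
    diff = frames[:, None, :] - means[None, :, :]                        # (T, C, D)
    log_exp = -0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)     # (T, C)
    return logsumexp(np.log(weights) + log_norm + log_exp, axis=1)

def verification_score(test_frames, ubm, speaker_model):
    # ubm and speaker_model are (weights, means, variances) tuples.
    llr = gmm_loglik(test_frames, *speaker_model) - gmm_loglik(test_frames, *ubm)
    return llr.mean()   # compare to a threshold to accept/reject the claimed speaker
```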
>>: You use the higher order of MFCC compared to ASR to [inaudible].
>> Lukas Burget: This is just what [inaudible] -- not really. I mean, what we use
really are MFCCs -- the features that we are currently using are actually MFCCs with
20 coefficients, which seems to perform well -- but, I mean, people would
normally use the same features as for speech recognition. This is something
that we did just recently to have a higher resolution of the spectrum. And then we
also use deltas and double deltas, the normal thing that people would do for
speech recognition.
It's not necessary. We just saw that we are getting some gains by having larger
dimensional features. We don't really know at the moment whether it's because
we provide more information or whether we really need to see this fine spectral
detail, things like that. But, yeah.
So what we would use here as the model for a speaker would be
simply Gaussian mixture models. The universal background model would be a
Gaussian mixture model trained on a whole bunch of features -- we use the
standard MFCC features, where you get a feature vector for every 10 milliseconds
of speech -- and for training the universal background model we would simply
pool all the data together and train a Gaussian mixture model on that. People
have tried hidden Markov models and all kinds of things and never really got
any benefit from doing that. So it is just: simply pool all the data, train a
Gaussian mixture model on that to get the UBM. And then for the speaker model,
having some adaptation data for it -- so this would be just an example, two
Gaussians in a two-dimensional space -- having some training data for the
speaker, we apply this relevance MAP adaptation, which is based just on this
formula, which is the usual thing that people would do to adapt models also for
speech recognition. The simple thing -- I mean, this is exactly what people would
have been doing five years back; they would build a speaker model this way.
>>: So there's a different kind of adaptation [inaudible].
>> Lukas Burget: I mean, this is the thing that you actually do in Joint
Factor Analysis to some extent. But let me talk about that later.
People have tried Eigenvoice adaptation, ML adaptations. The MAP
worked the best. All the other techniques usually perform much worse than MAP --
but that is because of the problem with the channel, and we will see that later.
Once we start dealing with the channel and start modeling it, Eigenvoices
actually help a lot.
So here you can see that the adaptation is really simple: we
adapt only the means, and the means of the adapted model are just a linear
combination of the means from the UBM for each component with what
would happen if you retrained the UBM using maximum likelihood retraining,
which would give you this model. And then basically the final model is just
one where the means move somewhere along the way towards the
maximum likelihood trained model. The weights are the occupation counts: the
more data you have for a particular Gaussian, the closer it would move towards the
data. And why is that called MAP? We will talk about point MAP estimates later on
with our other probabilistic models, but here the reason this simple technique is
MAP adaptation at all is that it corresponds to getting a point MAP estimate of the
mean parameters using some ad hoc priors derived from the universal background model.
We do not adapt the weights and covariance matrices, which means that the speaker
model is fully characterized just by the means. We can stack the means -- this is
what we are going to do next -- into one supervector, and the supervector of means
would be the speaker model. The rest is shared
among all the speakers, taken from the UBM.
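A minimal sketch of this relevance MAP adaptation of the means and the stacking into a supervector, assuming diagonal-covariance GMMs; the relevance factor of 10 matches the value mentioned in the exchange just below, everything else is illustrative:

```python
import numpy as np

def relevance_map_means(frames, ubm_weights, ubm_means, ubm_vars, r=10.0):
    # Posterior (occupation) probabilities of each UBM component per frame.
    diff = frames[:, None, :] - ubm_means[None, :, :]
    log_g = (-0.5 * (diff ** 2 / ubm_vars[None, :, :]
                     + np.log(2 * np.pi * ubm_vars[None, :, :])).sum(axis=2)
             + np.log(ubm_weights))
    gamma = np.exp(log_g - log_g.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)                 # (T, C) responsibilities
    n = gamma.sum(axis=0)                                     # occupation counts (C,)
    f = gamma.T @ frames                                      # first-order stats (C, D)
    ml_means = f / np.maximum(n, 1e-10)[:, None]              # ML re-estimated means
    alpha = (n / (n + r))[:, None]                            # data-dependent interpolation
    return alpha * ml_means + (1.0 - alpha) * ubm_means       # adapted means only

def supervector(adapted_means):
    # The speaker model is fully characterized by the stacked means.
    return adapted_means.reshape(-1)
```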
>>: In speech recognition when people use MAP, then how -- is it quite
different from the number --
>> Lukas Burget: Well, it would be about 10. It would be
about 10. And for speaker identification it actually doesn't matter that
much. I mean, if you used 50 it would probably perform about the same. It doesn't
matter very much.
So now I would like to say what the problem of channel variability is. I mean,
I already told you when I introduced the channel variability, but what does it mean in
the model space? Say we have this UBM model -- just for
simplicity we have one Gaussian and two-dimensional data, to keep things simple.
We train the UBM model, and then we get training data for some
particular speaker and we adapt the UBM to this data to get the speaker model.
Then we get another recording, and I would ask: is this recording from the same
speaker or does it come from a different speaker? And maybe here you would say,
well, it probably comes from a different speaker, because it's closer to the UBM
model than to the speaker model.
But I tell you no, this is another recording of the same speaker. So we train
another model for this speaker, and you get another model of the same speaker
which is quite different. Then you do that with many recordings and you get a
whole cloud of models for that speaker. You can do that for another speaker;
you get the whole cloud of models for another speaker, for yet another
speaker.
So now we look at all these models, which are just single Gaussian models
for simplicity here. We would see that there is some direction with large session
variability -- here all the models differ just because we took different recordings of
the same speaker. And there would be some direction with large speaker
variability, where the means of the speakers are really different when going from one
speaker to another one.
So the idea of channel compensation techniques in general, and particularly
of the Eigenchannel adaptation that we used for the first time in 2006, is that when we
get the UBM model and the speaker model and the test data, then before we
evaluate the likelihoods, we are allowed to adapt
both these models by moving them in this direction of large session variability to
match the data best, in the sense of maximum likelihood, for example, and only then
we ask about the likelihoods. So now it would be clear that the data
fits the speaker model better than the UBM. Right.
So far it was simple -- it was just for a single Gaussian -- but in real applications
we deal with Gaussian mixture models. So I'm actually reusing a slide that I used the
first time for explaining the subspace GMM models, which were introduced
by Dan and which are related to this problem.
>>: [inaudible] also used the current [inaudible].
>> Lukas Burget: They did that for a different approach, for
systems based on -- well, either for the channel compensation they did that at
some point, and then they used it with systems based on support vector machines.
But I would dare to say that these systems are -- maybe they
wouldn't agree with me, but I would say that these are becoming obsolete; the
performance of those wouldn't be all that good anymore. They would object that it
still has a chance to work if you have many recordings of the
same speaker. Then this cumulative training done in that way would probably
help, and then you can make use of the --
>>: [inaudible].
>> Lukas Burget: We never saw that it would help in any way to adapt the --
different people tried that and it never helped. It's probably just too dangerous to
play with the covariance matrices, because the likelihood score is then too
dependent on -- I mean, just more sensitive to changes in the [inaudible].
Anyway, so what do we have here? We have the Gaussian mixture model --
let's say this thing would be the speaker model or the UBM model that we want to
adapt to the channel -- where we have already stacked the mean vectors into the
supervector. So let's say these two coefficients correspond to the two-dimensional
mean vector of the red Gaussian, these are for the blue Gaussian, and so on.
Now, if we want to adapt the speaker model to the channel, we need the subspace
which corresponds to large session variability. So let's say that this would be the
matrix which corresponds to the subspace, and then we have some coefficients
which describe a low-dimensional representation of the channel. You can think
of them as weighting coefficients which tell you how to weight these different bases,
how to linearly combine these different bases and add them to the speaker model
to get the final adapted model.
So, you know, maybe just to get some intuitive understanding of the whole thing: if
we think of changing this factor, it means that we are scaling the first basis
vector and adding it to the speaker model, where these two coefficients actually
correspond to some direction in which this Gaussian starts moving, and these two
coefficients correspond to the direction in which these two Gaussians start moving.
And now if we start making this coefficient larger or smaller, the
Gaussians start moving along those directions. When [inaudible] saw the animation
the first time he said that it's better than Avatar, and I promised him that I would have it
in 3D next time, and I still have it just in 2D, so I apologize for that.
Okay. So this would be -- I mean, one coefficient allows you to move the Gaussians in a
certain direction. Then if we start moving a coefficient like this one, then they would
possibly -- I'm sorry, they would possibly start moving in different directions.
So just by changing these coefficients we can get lots of different
configurations of the Gaussian mixture model, but at the same time they are
constrained to move in some low-dimensional subspace of the model parameter
space, right? So we can quite robustly estimate these few coefficients,
which would normally be something like 50 or a hundred coefficients to estimate per
recording, and that allows us to adapt the Gaussian mixture model towards the
channel of the [inaudible].
>>: [inaudible] estimate [inaudible].
>> Lukas Burget: So -- right. So the trick now is just like what we saw on the
-- let me go here. Just like what we saw on this slide, where I allowed both these
Gaussians, the speaker and the UBM, to move along this direction; it
would do the same thing. So the coefficient just corresponds to how far we are
moving the Gaussian. We do exactly the
same thing, just with a Gaussian mixture model, where we have possibly multiple
directions, and we move the Gaussian mixture model in the subspace of
the large channel variability. Okay?
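A minimal sketch of what this adaptation could look like in code, working from zeroth- and first-order statistics collected against the model; the closed-form solve (including an identity term that acts as a standard-normal prior on the factors) and the shapes are assumptions of this sketch, not the exact recipe from the talk:

```python
import numpy as np

def estimate_channel_factors(n, f, model_means, ubm_vars, U):
    # n: (C,) occupation counts; f: (C, D) first-order statistics of the utterance;
    # model_means: (C, D) means of the model being adapted (speaker model or UBM);
    # ubm_vars: (C, D) diagonal covariances; U: (C*D, R) channel subspace.
    C, D = model_means.shape
    prec = (1.0 / ubm_vars).reshape(-1)                         # (C*D,) precisions
    n_rep = np.repeat(n, D)                                     # counts per mean dimension
    f_centered = (f - n[:, None] * model_means).reshape(-1)     # centered first-order stats
    A = np.eye(U.shape[1]) + U.T @ ((prec * n_rep)[:, None] * U)
    b = U.T @ (prec * f_centered)
    return np.linalg.solve(A, b)                                # channel factors x

def channel_adapted_means(model_means, U, x):
    # Shift the model means within the channel subspace: M_adapted = M + U @ x.
    return model_means + (U @ x).reshape(model_means.shape)
```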
>>: Can you go back to that one before, please. There.
>> Lukas Burget: There. Okay.
>>: So UBM or MAP adapted speaker model --
>> Lukas Burget: Yes. So --
>>: Wouldn't it make sense to do the same trick for --
>> Lukas Burget: Yes, and -- we will get there, right. It is the next thing, which is Joint
Factor Analysis. So, yes. So basically yes. I mean here -- but this works
actually pretty well. So you can still get the MAP adapted speaker model and
just do the channel compensation in this way. But of course now the idea is that we
could also restrict the speaker to live in some low-dimensional space, right?
>>: It has to be speaker adapted model. If you use the UBM in that vector does
it [inaudible]? So let's go back to the previous slide. I think I missed that a little
bit.
>> Lukas Burget: Okay.
>>: This whole question that [inaudible] was asking.
>> Lukas Burget: Right.
>>: So that M vector is --
>> Lukas Burget: The M vector is one time the speaker model and the other time it
would be the UBM. So, I mean, just like on this slide, you adapt both. You adapt
the speaker model and you adapt the UBM to the channel. So you need to do
that for both. You have the UBM model, you have the speaker model. For
each of those you estimate the channel factors, eventually a different set of
channel factors for --
>>: So you do that twice?
>> Lukas Burget: You do that twice.
>>: I see.
>> Lukas Burget: And you -- we will see that you actually don't have to do that
twice. You can actually estimate it just using the UBM and reuse it for the
speaker model, which doesn't make much sense, but it seems to work very well.
And --
>>: You use UBM to --
>> Lukas Burget: You can use the UBM just to estimate the channel factors, and it's
going to work about the same if you make some other approximations. But I will
get to that. So I mean --
>>: The UBM model with the speaker adapted model and then to make a
speaker twice [inaudible].
>> Lukas Burget: Yes. Well, you estimate it with the UBM, but yes, to
synthesize both models you can kind of think of stacking one under the other. But
it actually doesn't work on its own. It needs some other approximation to make it
work in this way.
Anyway, this just summarizes what I just said. You
train the speaker model in the usual way using MAP adaptation, this one, and
then for every verification trial you take the UBM and you take the speaker model
and you adapt them to the channel of the utterance just by finding the
appropriate X vector, which we call the channel factors. We just
search for those to maximize the likelihood of the test utterance, and then
using the UBM and the speaker model we calculate the log likelihoods,
compare the log likelihoods, and this would be the verification score.
So this is something that was already used in the 2004 NIST evaluations by Nico
Brummer, who was actually our partner in the
2006 evaluations. At the time he didn't manage to make it work -- well, it actually
worked pretty well, because he could get about the same performance as other
systems that combined lots of techniques with just this single system. Two years
later we just made it work much better compared to --
>>: How does it differ from the -- suppose this is similar to the Joint Factor
Analysis --
>> Lukas Burget: I'm just getting there, so --
>>: [inaudible].
>> Lukas Burget: Yes, the [inaudible]. Yes. So this is -- this is just coming in
two slides later.
Anyway, now I told you how to adapt the speaker to the channel, assuming
that we already have these directions. Now the question is how do we get
the directions. There is a simple trick to do that, which is just to estimate each
speaker model -- so let's have just a two-dimensional model, a single
Gaussian in 2D, for example. Then each speaker model would be just one point in this
two-dimensional space, and then we can just subtract the mean of each of the
groups, the groups with the same color corresponding to the same speaker. We
get just the data -- well, we can now calculate a covariance matrix of this data
and find the Eigenvector with the largest
eigenvalue, or a few directions with the largest eigenvalues, and find the subspace this
way.
This is actually what we did in 2006 and it works pretty well. We can use a more
principled way by estimating it using maximum likelihood, which is what we did for
the Joint Factor Analysis that is just coming on the next slide.
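A minimal sketch of that simple trick, assuming we already have per-recording supervectors grouped by speaker; the input format is an assumption of this sketch:

```python
import numpy as np

def estimate_eigenchannels(supervectors_by_speaker, n_channels=50):
    # supervectors_by_speaker: {speaker_id: (num_recordings, C*D) array}.
    residuals = []
    for sv in supervectors_by_speaker.values():
        residuals.append(sv - sv.mean(axis=0, keepdims=True))  # remove each speaker's mean
    X = np.vstack(residuals)                                   # within-speaker variation only
    # Eigenvectors of the within-speaker covariance with the largest eigenvalues.
    # (For 200,000-dimensional supervectors one would decompose X @ X.T instead.)
    cov = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:n_channels]]                      # (C*D, n_channels) subspace
```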
Now just to show how these techniques perform. This actually shows
the performance of the systems that we developed for the 2006 evals, but still
showing it on 2005 data. Again, telephone trials. This would be the
system which is based just on the techniques that I described, just the relevance
MAP. No channel adaptation in the old ones.
The blue one would be what happens if you apply the channel adaptation
techniques like feature mapping that were known before the subspace channel
adaptations. You can see that you could get quite some improvement --
significant, but not that large -- compared to the subspace channel compensation
technique, which is the red one, the Eigenchannel adaptation, where we had 50
Eigenchannels, in other words a fixed 50-dimensional subspace with the largest
session variability.
>>: So [inaudible] anything like CMN or anything like that to do a first order --
you know, channel --
>> Lukas Burget: No -- it's there. In fact, I mean, this has all the techniques
that we would have there. So there would be the MFCCs; feature mapping is
what I -- actually, sorry, feature mapping comes here. But the feature
warping is actually similar to mean and variance normalization -- this is actually
normalizing things, CMN but done within a window. I mean, that would be
warping things into a Gaussian distribution; that's just too expensive, you don't
have to do that. But you have to take about a three-second window and
do just a mean normalization in the three-second window. We don't do any
variance normalization. It doesn't seem to be helping.
Okay. So yes, I mean, those are the basic tricks. So we can see
that the channel adaptation actually gave us a quite significant
performance gain. For example -- I'm trying to
find some good point where we can compare the performances -- where this
crosses some line, I don't know, like at 0.5, the error drops from
something like 40 percent down to 15 percent. So I mean this is already a lot
of improvement. Maybe it doesn't look that dramatic on the curve, but you
can see that the error rate reductions here were about 50 percent or
60 percent. So it's not like what people are used to in speech recognition,
where you fight for every percent; here in just two years people got a 50
percent relative improvement. Then another two years, with the JFA, we got
another 50 percent relative improvement.
So the nice thing about the speaker ID field is that it's a really active field, a live field,
where you get lots of improvements every --
>>: [inaudible].
>> Lukas Burget: Yes. The --
>>: So this --
>> Lukas Burget: The dimensionality of this is 50. We have 50 of these bases.
>>: The number should be roughly the same as your estimated different
channels. Is it sort of the [inaudible] type of --
>> Lukas Burget: Well, with the previous technique, with the feature mapping,
you would recognize something like -- well, we had 14 different classes. But it
wasn't really necessary. I mean, if you had just like four or five, it would already
do the same thing.
But the problem with the previous compensation technique, where you
had to estimate a class, is that you can't combine any effects. I
mean, what if there is the effect of having this particular microphone and an effect of
having this particular type of background noise you want to deal with? You would need
to train a separate class for this channel plus this background noise. Here,
with the channel compensation, you can hopefully have one direction accounting for
the channel differences and another direction accounting for the noise difference,
and things like that. We don't really understand what the directions mean, right.
>>: So you find a different kind of task on this, you know, [inaudible] sort of if
they have more diverse in terms of [inaudible].
>> Lukas Burget: You would need more factors. I guess it's more given by
the amount of data that you have for adapting. So you probably can't --
again, I will show you some example of that, but most likely you can't
increase it to infinity, because the estimates will just get less robust. But you
can impose a prior on that and get it regularized, like getting some point
MAP estimates with priors and things like that.
So now we will move towards the Joint Factor Analysis. And we will add the
thing that Jeff was just missing in our model.
So instead of just modeling the channel differences with a subspace
and adapting the speaker model by moving it in the subspace U, we will also
create the speaker model by moving the UBM mean in the directions of large
speaker variability. So I mean, like on this graph, if you have the directions with
speaker variability and channel variability, why wouldn't you use the information about
what the subspace with the directions of large speaker variability is?
And so this model -- it's still not Joint Factor Analysis, but it's close to that -- would
have two subspaces: V for large speaker variability, U for large
channel variability. And now the information about the speaker in each enrollment
utterance would be compressed into the low-dimensional vector Y, which would be
roughly 300 numbers. I mean, this is what we saw to give the best
performance.
So the adaptation now is really similar to what people in speech recognition
would call Eigenvoice adaptation. We also call the directions of this
subspace Eigenvoices. But the problem in speaker ID was that when people
tried it before, it didn't work. And it only works if we also model the
channel subspace. Once we started modeling the channel variability subspace it
started to be effective. Without that it fails to work.
I mean, this is also the system that was kind of inspiration for Dan's subspace
GMMs that he used for large [inaudible] speech recognition.
So how do we use it now for verification? It's, I think, kind of obvious. This is the
whole equation that summarizes the whole model, where we have the speaker
and channel subspaces. Now, for verification, on the enrollment
utterance we estimate both the speaker and channel factors using the enrollment
data, and for the test we estimate just the channel factors, to adapt the model to the
channel of the test utterance. Again, we get the adapted model of the speaker.
We adapt the UBM the same way, compare the likelihoods, and that is the
verification.
Now, these parameters of the model can actually be estimated
using an EM algorithm, where iteratively we estimate the parameters that are
shared by all the speakers and the parameters that are speaker specific or
channel specific; we alternate between them on training data, where we
further constrain the Y to be the same for all the recordings of the same speaker,
while X can vary from one recording to another. So we would have one Y
for each speaker, one X vector for each recording.
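A minimal sketch of the model just described, with one Y per speaker and one X per recording; the dimensions follow the numbers mentioned in the talk, everything else is illustrative:

```python
# m_ubm: (C*D,) UBM mean supervector; V: (C*D, 300) speaker subspace;
# U: (C*D, 100) channel subspace; y: one vector per speaker; x: one per recording.
import numpy as np

def synthesize_supervector(m_ubm, V, y, U, x):
    return m_ubm + V @ y + U @ x

# Enrollment: estimate y (and a nuisance x) from the enrollment utterance and
# keep y as the speaker representation. Test: estimate a new x for the test
# utterance, rebuild the adapted speaker model and adapted UBM with it, and
# compare their likelihoods on the test frames, as on the previous slides.
```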
And now, finally, to end up with the whole Joint Factor Analysis. The Joint Factor
Analysis is a fully probabilistic model. So far we have limited ourselves to constraining
the speaker model to live in some subspaces, the subspaces of large speaker
variability and channel variability. But we didn't care about the variability itself in
these subspaces. So what we can do further is to model the variability in the
subspaces: how much speaker variability is there in this particular direction?
So if we assume that Y and X are standard normal
distributed random variables, then writing this equation gives us a Gaussian
distribution on the mean, right? So this would be the UBM mean, these would be
standard normal Gaussian distributed random
variables, and these would be the subspaces. If we just look at the distribution
of this term, that would be Gaussian distributed. It would be distributed in a subspace;
it wouldn't cover the whole space of the parameters.
But then we get basically a vector which is Gaussian distributed in the
parameter space and which tells us what the variability in different
directions is. Specifically, the M would be the mean of this distribution, and
V V-transposed and U U-transposed would be the covariance
matrices that correspond to the amounts of variability in the directions of the
subspaces V and U.
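Written out with the symbols from the slides, the claim in this paragraph is roughly the following (a sketch, using the same V, U notation):

```latex
M = m + Vy + Ux, \qquad y \sim \mathcal{N}(0, I), \; x \sim \mathcal{N}(0, I)
\quad\Longrightarrow\quad
M \sim \mathcal{N}\!\left(m,\; VV^{\top} + UU^{\top}\right).
```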
And now we use this as a prior to estimate the parameters. So now we have
the distribution of the speaker models, a prior distribution of the
speaker models. And given some test data or verification -- the enrollment data --
we can actually derive the posterior distribution of the model
parameters. Right? So this is what that would be good for. I will come
to that again on the next slide.
And this still wouldn't be the full factor analysis model. Those of you who know
what factor analysis is -- it's a technique for modeling variability -- will see that there is
still something missing here. We have just some subspaces. To end up with the
full factor analysis model, we would add one more term, which would be this
D matrix, and z would again be a standard normal
distributed random variable, and D would be a diagonal matrix. In this case it
would actually be a huge matrix whose dimensionality is the same as the number
of parameters in our model, typically 200,000, but it would be just a diagonal
matrix. And it would model the residual variability, the remaining variability in
the model parameter space.
>>: So you [inaudible] group all these [inaudible] variability with [inaudible].
>> Lukas Burget: Well, it would probably also be in the U term. This is kind of --
I mean, the subspaces describe where most of the variability is, but of
course there would be some remaining variability in the model space
which is described by these diagonal terms.
>>: [inaudible] you can force that U to deal with particular [inaudible] admission
you actually took the amount data with the metadata that's specified [inaudible]
files those [inaudible].
>> Lukas Burget: Yes. I mean, you again train these
subspaces. So you train the V, U, and D; you train them on the training data by
saying these are recordings of the same speaker, these are
recordings of another speaker. So this way you train the variability. Normally
the D would account for the residual speaker variability. There could
actually be one more term accounting for the residual channel variability.
But, I mean, the basic idea of factor analysis is: given some two-dimensional
Gaussian distribution which has some mean, to model the full covariance matrix
you find the direction with most of the variability and you model the distribution
along this direction, and then you model the residual variability, which would be the
variability in the other directions. And together, if you sum these two terms, you get
the overall covariance matrix -- which wouldn't make much sense in a two-dimensional
space, but it saves a lot of parameters in large-dimensional spaces.
But anyway, we don't need these speaker models to have the freedom to live in
the full space. We can really live just with these two subspaces, and this additional
term is not even helping the performance. So this is just to explain why it's called
factor analysis, because this model is actually the factor analysis model that
statisticians would know as factor analysis. And we use this factor analysis to model
the distribution of speaker models in the model parameter space.
So this is what was introduced by Patrick Kenny, the model that he actually
introduced sometime, I think, in 2003, but it was only in 2008 that we
implemented it in our system, where we showed that it actually performs better
than what we had done before with the --
>>: [inaudible] that group as I understand --
>> Lukas Burget: Well --
>>: [inaudible].
[brief talking over].
>> Lukas Burget: Patrick would have had it implemented, I think, in 2005, but his system
didn't perform so well that people would recognize the importance of this technique.
But, yeah. I mean, we closely collaborate with him. In the last NIST evaluations we
actually had him on our team. We teamed up with [inaudible].
>>: [inaudible] estimation were there a lot of free parameters --
>> Lukas Burget: Yes. Right. So, I mean, this is of course the most
expensive part, but still things can be done pretty quickly based on some other
assumptions that I will talk about.
>>: It sounds like in 2005, 2008, the same basic technique existed, it just
[inaudible].
>> Lukas Burget: Right. Right.
>>: [inaudible].
>> Lukas Burget: I mean, right. Even the Eigenchannel adaptation -- I mean, this
wasn't something that we developed. We used it in 2006, but
somebody had used it already in 2004. We just had more time to --
>>: [inaudible].
>> Lukas Burget: Well, we just had the system better tuned, using better
features. Many people say that our advantage is mainly
in having lots of machines to run the experiments on, and I mean, for the 2008
evals we would just switch on all the computer labs in the university and run things
on a few hundred -- like seven hundred cores. And Patrick Kenny would
run it on his machine that sits on his table, right. [laughter]. I mean, so then you
have a better chance to tune things up.
But, on the other hand, then you can really show that the technology works, that --
>>: So how many parameters can be tuned for this [inaudible].
>> Lukas Burget: The number of parameters that you have to estimate is given by the
dimensionality, so the dimensionality of this matrix would be 300 times about
200,000, and this one would be 100 times 200,000, something like
that. Then you probably want to train a few tens of systems, or maybe a few hundreds
of systems, to figure out which features work the best and all the other
parameters and things like that, right?
So, as I just said on the previous slide, we have introduced the probabilistic
framework. Using this equation we can now model the prior
distribution of the speaker models, of these mean supervectors which are
the speaker models.
Now, what can this distribution be good for? The first thing is that, given some
enrollment data and test data, we can obtain point MAP estimates of the speaker
factors and the channel factors: since we have a prior distribution on those and we
get some training data, we can calculate the posterior distribution of those and we
can go for the point MAP estimates, the most likely estimates, which are kind of
just regularized versions of the estimates. And it seems to help.
But more importantly, it actually allows us to use the Bayesian approach to
speaker verification; it allows us to use Bayesian model comparison to do the
speaker verification, to get the score for the speaker trial.
What is that about? Given the two recordings, O1 and O2, we want to compare
the likelihoods for two hypotheses. The first hypothesis is: do these two recordings
come from the same speaker? The other hypothesis is: do they come from
different speakers? So we actually want to build two models, one that takes
these two recordings and calculates the likelihood for the one hypothesis, and the
other model that calculates the likelihood for the other hypothesis, and we compare
these. This would be, like, theoretically -- the Bayesian
would say that this is the proper way of evaluating the model.
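In equation form, the comparison just described is roughly the following ratio (a sketch; O1 and O2 are the two recordings and y the speaker factors with a standard normal prior):

```latex
\mathrm{score}
= \log\frac{p(O_1, O_2 \mid \text{same speaker})}{p(O_1, O_2 \mid \text{different speakers})}
= \log\frac{\int p(O_1 \mid y)\, p(O_2 \mid y)\, p(y)\, dy}
           {\int p(O_1 \mid y)\, p(y)\, dy \;\int p(O_2 \mid y)\, p(y)\, dy}.
```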
>>: So the reason why these prior estimates for the speaker and channel
[inaudible] is that you build some kind of Gaussian distribution, or --
>> Lukas Burget: No, so these parameters -- well, the final distribution is actually
going to be the distribution of the speaker models. And this distribution is
specified by these U and V matrices, and those you train using maximum
likelihood on training data.
>>: I was referring to the next slide, where you talk about regularizing with the priors on Y
and X.
>> Lukas Burget: Yes.
>>: The previous one didn't have it; it was kind of maximum likelihood --
>> Lukas Burget: No, so these things are estimated using maximum likelihood --
now, in JFA, these things are estimated using maximum likelihood. But these
guys have Gaussian priors.
>>: That's what I was asking.
>> Lukas Burget: Yes. So we still estimate these
parameters using maximum likelihood, and we hope that we have enough data
to estimate those subspaces -- which is probably not true, and we will see that with
the later technique. We could of course impose some
priors on those and regularize them; we never actually tried that. But these
have priors, and we can get posterior estimates of these.
>>: If you just renormalize so that they stay Gaussian?
>> Lukas Burget: They stay Gaussian by definition. I mean, this is -- what you
get here is --
>>: I mean [inaudible].
>> Lukas Burget: This is a standard normal distributed random variable, right? If
you transform it with some matrix, just a linear transformation, you get a Gaussian
distribution with zero mean. And then if you add some mean, you get a Gaussian
distribution.
>>: What's the maximum likelihood being used? Is that maximum likelihood
[inaudible] in some kind of Bayesian sense where you're integrating over
something?
>> Lukas Burget: Well, when you estimate the U -- I didn't want to get into such
details, but when you estimate the U and V, in every iteration you
obtain posterior distributions of Y and X, and then you integrate over those to
calculate the likelihood of the normal --
>>: [inaudible] Y and X [inaudible].
>> Lukas Burget: Yes. Well, they --
>>: Well [inaudible] I mean if you --
>> Lukas Burget: Well, now, I mean, this is just what comes out of it: the Gaussian
distribution is the conjugate prior for the mean.
So just by definition, when you calculate the posterior distributions of the Ys, you
automatically get a Gaussian distribution as the posterior distribution with this model.
Okay. So we have this new scheme, which I am also showing on the next slide.
So now, how can this model be constructed from the Joint Factor Analysis?
We want to calculate the verification score as the log likelihood ratio coming from
these two models. And you can actually see -- so I mean, here, again, I'm just
saying we want to calculate the log-likelihood ratio between these two
hypotheses, so the log likelihood of these two utterances given that they are from
the same speaker or from two different speakers. So we have the corresponding
models in the numerator and denominator when calculating this likelihood ratio.
And you can see here that once we have the Joint Factor
Analysis, then given some speaker factors, the speaker factors would just
define a Gaussian mixture model, which we would say is the speaker
model. So if we have a particular point estimate of the Ys, we have the speaker
model.
Now we say we don't have this particular estimate of the Ys, but we have the prior
distribution over the Ys, which is normal. So again, if we had the particular estimate of Y,
then the likelihood of one utterance generated by the speaker
described by Y would be just this term, right: the likelihood of this data given this
speaker, and the likelihood of this data given the same speaker.
Now, we don't know the speaker. We have just the distribution of those. So we
consider each possible speaker given by the prior -- we integrate over the prior
distribution of the speakers -- and for each possible speaker we calculate the
likelihood that both utterances are generated by the same speaker. This gives
us the likelihood of the hypothesis that both utterances -- yes?
>>: [inaudible] or just maximize it?
>> Lukas Burget: If the posterior distribution of the speaker factors is very sharp -- saying there is really just one possibility of what the speaker factors would look like -- then maximizing does the same thing. And I can tell you that it doesn't matter much in the case of Joint Factor Analysis, and I will be showing that, but it is going to make a lot of difference with the most recent models that we are using at the moment.
So I'm just showing what would be the proper way of getting the log-likelihood ratio. Yes?
>>: So the Ys are speaker-specific priors? The speaker factors -- the priors are trained to be speaker specific?
>> Lukas Burget: The prior is just the standard normal distribution. Right? This is just the standard normal distribution. The variability in the subspace, in the large speaker variability subspace, is hidden in these matrices. Right? So once we start trying all these Ys, which are standard normal distributed, we are actually trying all the possible speaker models ranging over the speaker variability subspace -- trying all the possible models that can live in the speaker variability subspace. But in this integral the Y is just a standard normal distributed variable, nothing more. And in the denominator, you can see that each of these terms is basically -- I mean, if you compute this integral, you get just the marginal distribution of the data.
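[Illustration: a minimal Monte Carlo sketch of the verification score just described -- the log-likelihood ratio between the same-speaker and different-speaker hypotheses, integrating the speaker factors over their standard normal prior. The function loglik_given_y is a hypothetical placeholder for evaluating an utterance against the GMM defined by a particular Y; as discussed below, this integral is not tractable in practice for JFA.]

```python
import numpy as np
from scipy.special import logsumexp

def llr_same_vs_different(utt1, utt2, loglik_given_y, dim_y, n_samples=1000, seed=0):
    """Monte Carlo approximation of the Bayesian verification score.

    loglik_given_y(utt, y) is a hypothetical callable returning
    log p(utt | speaker factors y); y has a standard normal prior.
    """
    rng = np.random.default_rng(seed)
    ys = rng.standard_normal((n_samples, dim_y))   # samples from the prior p(y) = N(0, I)

    ll1 = np.array([loglik_given_y(utt1, y) for y in ys])
    ll2 = np.array([loglik_given_y(utt2, y) for y in ys])

    # Numerator: both utterances share the same (unknown) speaker factors.
    log_num = logsumexp(ll1 + ll2) - np.log(n_samples)
    # Denominator: each utterance has its own speaker, so the marginals factorize.
    log_den = (logsumexp(ll1) - np.log(n_samples)) + (logsumexp(ll2) - np.log(n_samples))
    return log_num - log_den
```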
>>: [inaudible] the Y is estimated from JFA, and it just gives you one factor. So you have to impose, you know, some hyperparameters [inaudible] of one.
>> Lukas Burget: You need to estimate the hyperparameters on the training data. Once you have the hyperparameters, you have the JFA model, which already describes the variability -- but in the speaker model space. And then to calculate the proper score, you need to integrate over the --
>>: You haven't talked about how to estimate the prior parameters here.
>> Lukas Burget: Oh, for the U and Y? For the U and V?
>>: Distribution of Y.
>> Lukas Burget: The distribution of Y is standard normal by definition. You just say that the Y and X have standard normal distributions, right?
>>: So you go through all the training data for each training -- for each --
>> Lukas Burget: No, no.
>>: You have [inaudible] you take the sample, sample distribution of the Y.
>> Lukas Burget: No, so -- the distribution of the speaker model is defined by this equation. You have standard normal priors imposed on the Y and X, and then what the distribution of the speaker model would be depends just on the U and V.
>>: [inaudible] factor analysis gives you UV and also Y and X [inaudible] iteration
at the end you get X and Y.
>> Lukas Burget: In every iteration it gives you -- in one iteration you estimate U and V given the posterior distributions of Y and X. You calculate these posterior distributions based on the standard normal priors. So you say X and Y have standard normal priors; in each iteration you get the posterior distributions of Y and X, and based on the posterior distributions you integrate over all the possible models to get maximum likelihood estimates of U and V.
>>: I think, as always, it's overparameterized. You could have an infinite number of solutions by putting a constant in V and the inverse of that constant in Y, right?
>> Lukas Burget: You can actually -- you can decide on any prior -- any
Gaussian prior on this thing. It's overparameterized [inaudible].
>>: So you have training procedure to get that prior.
>> Lukas Burget: Yes.
>>: [inaudible].
>>: Is there any kind of scale involved here, like scaling [inaudible] on the priors?
>> Lukas Burget: I mean, not really. I tried to play with things like the acoustic scaling in speech recognition. It never really helped, but I think there is a different problem. I mean, basically when we estimate the posterior distributions of speaker factors and channel factors, they suffer from this model not really being the best possible -- the correct model. So there would probably be other problems than just the scaling. Hopefully you could do something with the scaling, but anyway, I'm just introducing the Bayesian scoring where we integrate over things. We actually don't do that in the real models. We just get the point estimates, and especially for the speaker factors the point estimate would probably be just fine.
Anyway, so this is the way Bayesians would like to score things. We have just the speaker factors here. You may wonder where the channel factors are. I just assume that they already got integrated out of each of these terms. I'm showing the complete equation on the next slide, but it just gets more complicated.
And one thing that you should note on this slide is the symmetrical role of the scoring. I mean, before, we trained the speaker model on one utterance and tested the speaker model on the other utterance. Why don't we take the second utterance, train the model on that, and test it on the other one? We would get a different score. Which one is actually better? Normally it's better to take the longer utterance for training and the shorter for testing, because we would probably get better point estimates of the speaker model on the longer utterance. But this approach doesn't suffer from that. It's completely symmetrical. No matter which segment comes as the first one and which comes as the second one, you just get the proper score. So at least theoretically this scoring is the better way of getting the log likelihood score.
So this is actually what we would get if we also include the channel factors: each of the terms now has to be integrated also over the channel factors. And this just becomes too complicated. So the problem with JFA and with this scoring is that these integrals are intractable. Even though we integrate over this Gaussian prior, each of these terms -- each of the terms of O, depending on Y and X -- is a Gaussian mixture model, and that just makes it impossible to carry out the whole integration.
One can resort to approximations like variational Bayes, and they seem to provide some improvement, but not much. It usually helps you somewhat if you deal with very short utterances, like two seconds of speech; then this integration provides some improvement. For longer utterances it doesn't help. It doesn't help at all.
So, in fact, what I'm trying to say here is that the old-fashioned scoring -- where we just obtain the point MAP estimates for the factors Y and X, and then do the standard scoring, just evaluating the likelihood with the speaker model and the likelihood with the UBM and comparing the likelihoods -- works just fine, and you don't need to run this integration.
And there is another problem: to get good performance with Joint Factor Analysis, and even with the Eigenchannel adaptation before, we need to apply some normalization techniques on the final score. So we get the score from the system, from comparing the likelihoods of the model and the UBM, but then we need to do something with the score. Normally we would apply something that's called ZT-norm, where you also obtain a set of other scores for some cohort of other models and a cohort of other utterances, and normalize by the mean and variance of scores from this cohort to get the final score, and then it starts working. Which is just a big pain, because to perform a verification trial you don't score just a single model, you need to score maybe 200 models. And that of course makes the whole procedure much slower and so on.
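[Illustration: a hedged sketch of the ZT-norm idea just mentioned -- Z-normalize the raw score with cohort impostor-utterance scores for the model, then T-normalize with cohort impostor-model scores for the test utterance. The cohort arrays and the exact order of operations are assumptions; implementations differ in detail.]

```python
import numpy as np

def zt_norm(raw_score, model_vs_impostor_utts, impostor_models_vs_test):
    """Very rough ZT-norm sketch.

    model_vs_impostor_utts: scores of this speaker model against a cohort of
        impostor utterances (Z-norm statistics).
    impostor_models_vs_test: scores of a cohort of impostor models against this
        test utterance (in a full implementation these would themselves be
        Z-normalized before being used for the T-norm step).
    """
    z = (raw_score - np.mean(model_vs_impostor_utts)) / np.std(model_vs_impostor_utts)
    t = (z - np.mean(impostor_models_vs_test)) / np.std(impostor_models_vs_test)
    return t
```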
So anyway, this slide shows the performance of Joint Factor Analysis versus the Eigenchannel adaptation. You can see that for the 2006 data we got a quite nice gain from Joint Factor Analysis, which is the black line, compared to the Eigenchannel adaptation. And this slide also shows what happens if you go from 50 Eigenchannels -- a 50-dimensional subspace for the channel variability -- to a hundred-dimensional subspace for the channel variability. So this is the question you were also asking about.
And for the Eigenchannel adaptation we actually got degradation. So it looks like with too large a subspace we can't estimate these parameters reliably. But for the Joint Factor Analysis we actually improved. So after explaining the speaker variability with the Eigenvoices, we can get even more benefit from adding more Eigenchannels -- modeling the channel variability more precisely with a larger channel variability subspace.
And this is just another graph showing the performance of Joint Factor Analysis on a different database, and it compares it with the baseline. The magenta line is the Eigenchannel adaptation, the red line is Joint Factor Analysis. We didn't get all that much improvement from Joint Factor Analysis here, but still the improvements are significant.
And these are actually the most recent data, from the 2010 evaluation, and I will be showing the other results on the same graph. So you can compare all the techniques --
>>: [inaudible].
>> Lukas Burget: Well, the -- you can -- well, probably T zero is there in infinity, and probably T hundred is there in infinity. But the --
>>: [inaudible].
>> Lukas Burget: But, yeah, I mean, the reason why I'm showing this thing is, one, that it gets more noisy here and over there, and the other thing is this is actually what NIST is becoming interested in, and our sponsors like the US government and these applications --
>>: [inaudible].
>> Lukas Burget: Exactly. I mean, this false alarm is what's killing them. They
can't apply this technology if they have too many false alarms.
>>: [inaudible]. [laughter].
>>: So [inaudible] performance now is about five times better -- an error rate five times lower -- than five, six years ago. And JFA is about the technology of five years ago, right?
>> Lukas Burget: Well, JFA is the technology that showed its benefits in 2008. So like three years ago. So, yes, it was proposed something like five years ago [inaudible], but it was just 2008 when we actually used it for the first time in NIST evaluations, and then everybody else was using it during the last evaluations.
Anyway, so let me speed up a little bit, because I see that I'm already running out of time, and I would like to get to the final part about the discriminative training.
Anyway, so there were lots of simplifications that you could do with Joint Factor Analysis: they speeded the whole thing up and they also actually provided improvements in performance. I just mentioned the first thing, how the parameters should be trained -- you can't evaluate the integrals analytically and you need to resort to some variational Bayes techniques. But then for the verification itself:
The UBM is used to align the frames to the Gaussians. So you do that once for each test utterance using the UBM, and you can reuse it for all the speaker models. It doesn't hurt at all. It allows you to collect statistics once and reuse these statistics for all the other estimation, which speeds things up a lot.
We use the point MAP estimates of Y -- this is what I said -- to represent the speaker model. The channel factors can be estimated using this universal background model for the test utterance. So for each verification trial you don't need to train a speaker model and then estimate the channel factors for that speaker model. You can pre-estimate the channel factors for the test utterances and then reuse them for all the speaker models.
And finally there is something that is called linear scoring -- you can find more details in a paper that we wrote with Ondrej Glembek, my colleague. The thing is, from the approximations above and from another approximation that follows from actually using a linear approximation to the log likelihood score, instead of the quadratic score that you would otherwise get, you can get an evaluation of the score which can be extremely fast. You can precompute these speaker factors, which are like a 300-dimensional vector for each utterance, which would represent the model of the speaker. And you can precompute another low-dimensional vector of the same dimensionality which would represent kind of the statistics of the utterance for testing. And then the log likelihood ratio can be obtained just as the dot product between these two vectors. It's still asymmetric: you still have to choose which utterance is going to be the training and which is the test, select the model and the statistics based on that, and then compute the dot product to get the final score. And this asymmetry is what we are going to get rid of right on the next slide.
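[Illustration: a minimal sketch of the linear scoring shape -- one precomputed low-dimensional vector per enrollment speaker, one per test utterance, and the score as their dot product. The particular way the test vector is formed here (centered first-order statistics projected through the UBM precisions into the speaker subspace) is an assumption consistent with the description; the exact expressions in the paper with Glembek differ in details.]

```python
import numpy as np

def enrollment_vector(y):
    """Low-dimensional representation of the enrolled speaker:
    here simply the point-estimated speaker factors (e.g. ~300-dim)."""
    return y

def test_vector(V, Sigma_inv, f_centered):
    """Low-dimensional representation of the test utterance: centered
    first-order statistics projected into the speaker subspace."""
    return V.T @ (Sigma_inv @ f_centered)

def linear_score(enroll_vec, test_vec):
    # Log-likelihood-ratio approximation as a simple dot product.
    return float(enroll_vec @ test_vec)
```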
So anyway, there were lots of simplifications that allowed us to run the verification really fast -- the linear scoring, which is a really crude approximation to how the score is computed, even improved the performance a little bit, which suggests that the model is actually not correct. I mean, if you can come up with so many approximations and it still performs well, then probably the model is not really correct.
On the other hand, the point estimates of Y, the speaker factors that we are getting, allow you to get pretty good verification performance, which means they have to contain the information about the speaker, right? Each of them has to contain the information about the speaker from the utterance. Can we actually take the Y vectors and use them as features for another classifier? We were just wondering -- it must be enough to compare the Y vectors; they contain the information about the speaker. But what is the right way of comparing the Y vectors to get the verification score?
>>: Y shouldn't be factor for the speaker [inaudible] the speaker.
>> Lukas Burget: No, I mean --
>>: [inaudible].
>> Lukas Burget: And if I'm saying that I can extract this Y vector from the enrollment utterance and then use it as the speaker representation to build the speaker model and test the test utterance, the Y has to contain the relevant information about the speaker from the utterance, right? So then possibly it should be enough just to take one set of speaker factors from one utterance and the other one from the other utterance, and compute the verification score just based on a comparison of those.
>>: The condition Y requires that you put all the speaker data together. So each
Y actually sort of gives you the space --
>> Lukas Burget: No. You -- each Y is going to be estimated just on the
utterance itself.
>>: So for each utterance --
>> Lukas Burget: So you have all the utterances, no matter whether they are enrollment or test utterances, and you would just estimate the Y for each utterance individually. So I mean we have to distinguish two parts. One is to train the Joint Factor Analysis model, the U and V subspaces, the speaker and channel subspaces. For that we use some set of data.
>>: Whole set of --
>> Lukas Burget: I mean, the whole set of data, but different from those that we would test it on. Right? And now we have another bunch of test data where we have the enrollment utterances and test utterances. And now for each enrollment utterance and each test utterance we would just estimate its Y. And we can compare just these Ys and say whether they come from the same speaker or different speakers.
>>: It's a little bit like using [inaudible] matrices as a feature --
>> Lukas Burget: Right. Right. In fact, yes, it's quite similar. Yeah. But the question is what would be the right classifier? You can feed it into support vector machines, which would be the standard technique that people use, and I'm just saying this becomes obsolete and you just don't do that. It's not really the right classifier.
This is actually what I thought I'm going to say right now.
So anyway, we could use Y as a fixed-length, low-dimensional representation of the speaker for the utterance. So instead of having all the frames, the whole sequence of features, now we would have just the Y as a low-dimensional representation of the speaker for the utterance, and we can perform verification using that.
I was actually the leader of the group during the Johns Hopkins University summer workshop in 2008, and one of the things that we played with was to see whether these Ys can be used as features. These are experiments by Najim Dehak, who is now with [inaudible], and he tried to apply support vector machine classifiers -- which is basically a technique where you train a different support vector machine classifier for each speaker as the speaker model and then you test it against the test utterance. And you train it using a single utterance, a single example of the speaker, and a bunch of other background examples as the imposter examples. So you take these examples, train the support vector machine, and then test it against the test utterance.
So -- and at least to me it seems a little silly to try to classify discriminatively using a single positive example, right? There is something wrong about it. But, anyway, this is something that people have done for a long time.
And so what he got was the following. This was the performance of Joint Factor Analysis -- now I'm showing the numbers as equal error rate. If you use the speaker factors as the features for support vector machines, he got much worse performance. But the interesting thing here was that he also tried to use the channel factors as the features for recognizing the speaker, and the channel factors are not supposed to contain any information about the speaker; they are supposed to contain information about the channel. And he still got 20 percent equal error rate. Well, it's quite poor -- I wouldn't sell such a system for speaker recognition -- but still it's much better than chance, so it means that there is speaker information in the channel factors.
Then why should we even bother with training the speaker subspace and the channel subspace if Joint Factor Analysis does such a poor job in separating the speaker and channel information and extracting all the speaker information into the Y vector? So instead we are going to train a much simpler system where, instead of the Eigenvoices and Eigenchannels, we would have just a single subspace which accounts for all the variability, and we would extract something that we call the iVector, which is one vector instead of speaker and channel factors, and we are going to use this as the input to another classifier.
And when he did that, he actually got a better result than with the speaker factors themselves. So it looks like this is the right way to go. Now we need to find a better classifier than support vector machines, which I don't like anyway -- I mean, for this task.
>>: [inaudible].
>> Lukas Burget: Say again.
>>: [inaudible].
>> Lukas Burget: Yes. Yes. So again, all of them -- Y and X and even this I -- are going to be point MAP estimates of the channel and speaker factors, and here of the general factors.
>>: [inaudible] since when the single feature [inaudible] in training [inaudible].
>> Lukas Burget: So in this case, yes. So in this case, you would get the -- no, I mean the posterior estimates are based on the sentence, right? So you have --
>>: Right. [inaudible].
>> Lukas Burget: I see. No, so I mean there would be just a point MAP estimate. So there would be the most likely values of Y and X, the means of the posterior --
>>: [inaudible].
>> Lukas Burget: Single example, yes. Single example. You don't use information about the variability in --
>>: [inaudible].
>> Lukas Burget: No, in this case it was really -- I mean, always binary classification, but in this case you train a support vector machine for each speaker, using the single positive example of the speaker and a bunch of negative examples. And then for each test utterance you test it for the task: is it the speaker or is it somebody else, right?
>>: [inaudible] iVector?
>> Lukas Burget: Sorry?
>>: Was the machine [inaudible] iVector.
>> Lukas Burget: It would be -- yeah, I mean we had typically 300 here and 100 here, so we put 400 here. I mean, right now we see that larger dimensionality would even help, but then it's slower to estimate those, and 400 seems to be [inaudible].
>>: [inaudible] recognize the channel using the channel vectors.
>> Lukas Burget: You could use them to recognize the channel. But the thing is, I mean, would you want to recognize what the channel is, and you don't really have -- I mean, of course --
>>: [inaudible] specific speaker models.
>> Lukas Burget: Sure. I mean, one can do that. But the thing is whether that would be really helpful -- I mean, you mean a channel-specific speaker model for speaker recognition? One could certainly train the speaker model using a large amount of data, and eventually, by knowing that this is still the same channel, get a more robust estimate of the speaker by knowing what the channel is and what the channel factor is.
>>: So I'd like to challenge your assumption that the factors contain the same information as you are using in JFA. It seems to me like your factors, especially your iVector, would include information about the channel --
>> Lukas Burget: Sure.
>>: And speaker like you said.
>> Lukas Burget: Sure.
>>: But also the text.
>> Lukas Burget: Sure. Just like before, just like the combination of these two vectors before. Sure. I mean, they now contain information about other things. I'm not saying that --
>>: No, but --
>> Lukas Burget: Yeah?
>>: When you adapt your UBM and get your JFA and then evaluate it on a segment of speech, I can see that being good for text-independent speaker verification. But when you extract out that adaptation parameter, your factors, that should be more dependent on the actual text that was in your query than adapting the model and evaluating on the query, because then you're matched.
>> Lukas Burget: And the thing is, in both cases we have to deal with this variability somehow. So, sure, in Y there would be information about the text, there would be information about the channel. Hopefully there would also be enough information about the speaker. And we have to deal with it; we have to build a model like the Joint Factor Analysis that we built before. We have to somehow deal with this variability, but we are going to do that on the level of these vectors. And we hope that these vectors now contain enough information about the speaker, and we are going to separate again this information about channel and speaker by now dealing with just a low-dimensional fixed-length vector, which will make the task much simpler. I don't know if that answered your question, but --
>>: We'll talk about it --
>> Lukas Burget: Okay.
>>: But still JFA is doing better than everything else, right?
>> Lukas Burget: No. No.
>>: [inaudible].
>> Lukas Burget: So far, yes. So far, yes. But we will see that we can actually
-- we can deal with it some more.
So to summarize: now we have the iVector extractor, which is a system that uses just a single subspace for both channel and speaker variability, and the vector I as the representation of the recording in general. What that means is the system becomes simpler. We have just a single subspace to train, and it's actually simple to train; we don't have to train the U and V subspaces independently. We don't need any speaker labels for the training now. I mean, we need utterances of many speakers to train this subspace, and we need utterances where each utterance contains a single speaker, so each utterance would be a representation of a particular speaker in a particular session. But we don't need any labels for the data. So we can actually train it in an unsupervised way on a large amount of recordings, which hopefully will make the whole model more robust.
Again, we assume standard normal priors on I, so we can obtain the iVectors as point MAP estimates of these vectors. We would now extract this iVector for every utterance, and the iVector for every utterance would be the low-dimensional representation of that utterance. Now we have no more sequences for utterances, we have just these low-dimensional vectors, and we are going to perform speaker identification from that.
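[Illustration: a hedged numpy sketch of extracting the iVector as the point MAP (posterior-mean) estimate from zeroth- and first-order Baum-Welch statistics collected with the UBM, under the standard normal prior. The bookkeeping (stacking and centering of the statistics) is an assumption, not taken from the slides.]

```python
import numpy as np

def extract_ivector(T, Sigma_inv_diag, N_c, F_centered):
    """Posterior-mean (point MAP) iVector for one utterance.

    T              : (C*D, R) total-variability matrix (R ~ 400).
    Sigma_inv_diag : (C*D,) stacked inverse diagonal UBM covariances.
    N_c            : (C,) zeroth-order occupation counts per UBM component.
    F_centered     : (C*D,) first-order statistics centered by the UBM means.
    """
    C = N_c.shape[0]
    D = F_centered.shape[0] // C
    R = T.shape[1]
    # Expand the per-component counts to the supervector dimension.
    N_sv = np.repeat(N_c, D)                                    # (C*D,)
    # Posterior precision: I + T' diag(N) Sigma^-1 T
    L = np.eye(R) + T.T @ ((N_sv * Sigma_inv_diag)[:, None] * T)
    # Posterior mean: L^-1 T' Sigma^-1 F_centered -- this is the iVector.
    b = T.T @ (Sigma_inv_diag * F_centered)
    return np.linalg.solve(L, b)
```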
So now what would be the right classifier for it? We still have the problem that the I still contains all the information about both speaker and channel. But now we know that we have just a single low-dimensional vector, and we actually assume the standard normal distribution as the prior for I, so it would be reasonable to assume that the distribution of the I is probably Gaussian.
So we can come up with a model which is just like Joint Factor Analysis but with just a single Gaussian -- a single Gaussian model, having just a single vector per utterance, not a sequence -- and apply such a model.
This is something that is known as Probabilistic Linear Discriminant Analysis, and we end up with the same equation that we just had before. So it looks just like the Joint Factor Analysis, but now you have to realize that this is not the supervector of speaker model parameters -- now this is the observed vector, right? So now we are really modeling the distribution of the observed vector, where again the U and V would be some subspaces with large channel and speaker variability in the space of these vectors, the Is.
>>: I is -- Eigen is not same as iVector?
>> Lukas Burget: The I is the iVector.
>>: [inaudible].
>> Lukas Burget: Yes. The I is the iVector. Yes.
>>: IVector is the one that lumps all these together like the one you showed
earlier, how could you --
>> Lukas Burget: The iVector, the I, is this low-dimensional vector such that if you take the vector and multiply it by the basis T, you get the Gaussian mixture model for that utterance. But the I is now just a 400-dimensional vector --
>>: [inaudible] different and --
>> Lukas Burget: I'm sorry. Yeah. No, the M is different. I'm sorry. So because
this is the supervector, yeah, that's my mistake.
>>: [inaudible].
>> Lukas Burget: So this is just the mean in there, so I should have used a different symbol here. Yup. So now for the V and U I'm still using V and U, but these are not the parameters in the supervector space; these are all in the low-dimensional space now. So maybe I should have used different letters for all of these.
Again, we can add one more term which accounts for the residual noise in the data, just like in the Joint Factor Analysis. But, anyway.
>>: What are the dimensions of Y and X here?
>> Lukas Burget: 400. Oh, well, that depends. So I mean you can decide. This has a dimensionality of 400. You can decide on lower dimensionalities for these. But I will show on the next slide that normally what we would do is use the full dimensionality for these guys. So then you can actually forget about the epsilon, because these subspaces cover the whole variability in the space. And, in fact, we use even less: we would typically reduce the dimensionality of the iVectors to just about 200, just by LDA, and it still works about the same.
>>: And then Y and X are 200 too? Are they still --
>> Lukas Burget: That would be clear from the next slide. Okay? So again, all the parameters can be estimated using the EM algorithm. It actually gets much simpler in this case.
This slide just explains what the PLDA model is about, and it's nicely seen on the examples from face recognition, where it was actually introduced by Simon Prince. What you can do is treat each face just as an example: you can vectorize each example, and this would be the vector representing the face. And then you can train the Probabilistic Linear Discriminant Analysis for that, where each I in our case would represent the face, the vectorized version of the face. The M would be the mean face -- so this is the example of the mean face. And then there are the linear combinations of the directions with large -- in this case it would be called not speaker variability but between-individual variability. So if we move from the mean in the directions with large between-individual variability, we are getting pictures that look like different people. So this would be moving from the mean in the first, second, and third most important directions of between-individual variability.
If we move in the directions of within-individual variability, we are getting pictures that look like about the same person, but probably with different lighting conditions and things like that. And this would be the residual term, which is just the standard deviation of how much variability we are not able to reconstruct using the subspaces.
These slides actually demonstrate what the PLDA model would be. We further assume that at least the U can be modeled with a full rank matrix, and in that case we can rewrite the equations that we had before: this equation, with this normal prior imposed on the Y, can be rewritten this way. So we would have the speaker variability, which has a certain meaning -- this is the across-class covariance matrix, where the across-class covariance matrix would be just V times V transposed. And then, given the latent vector representing the speaker, we add another part, which is the noise part, the channel variability part. So the observed iVector given the speaker's latent vector would again be Gaussian distributed, with the mean given by the latent vector and with the within-class covariance matrix as the covariance.
So basically these equations and these equations are equivalent. Maybe it's easier to see what's going on here. You can clearly see this is a model that makes LDA-like assumptions. You assume that there is some global mean, there is some across-class covariance matrix, there is some within-class covariance matrix. And having these examples -- let's say, if we had two-dimensional iVectors now, this would be data from one speaker with the mean for the speaker -- we would see the across-class distribution and the within-class distribution. And we can already intuitively think of doing the verification. If I ask you whether these two iVectors come from the same speaker or different speakers, you would say, well, they are probably the same speaker, because they are distributed along [inaudible] large session variability.
If I ask you whether these two iVectors, which are much closer to each other, are from the same speaker, you would say no, they are probably different speakers. Right? So this is what you assume the model would provide you.
So again, this model now becomes just a linear Gaussian model that you can use to compute exactly the same log likelihood ratio as before, right? So we again want to calculate the --
>>: [inaudible] using this subspace trick twice now, right, once to represent the --
>> Lukas Burget: The iVector.
>>: And the second time to represent --
>> Lukas Burget: But the first time in --
>>: [inaudible].
>> Lukas Burget: Yeah. The first time in an unsupervised way, just to reduce the dimensionality. So just to go from the Gaussian mixture model into a Gaussian distributed thing. And the other time -- and I'm even saying I'm not even using subspaces; here I can actually model the full covariance matrices for the across-class and within-class covariance. Eventually the across-class covariance matrix doesn't have to be full rank, so it can be represented as a product of lower rank matrices -- just saying I'm restricting the speaker model to live in some lower dimensional subspace of the full iVector space. But, yes, I mean the idea is: the first time, in an unsupervised way, use the subspace to reduce the dimensionality, to convert the sequence of features into a low-dimensional vector -- not to have a Gaussian mixture model but just to have Gaussian distributed iVectors. And the second time you use the JFA-like [inaudible] to model the session and speaker variability in this low-dimensional vector.
And so now if we want to calculate these likelihoods, we actually find out that these terms are not Gaussian mixture models anymore but just convolutions of Gaussians, and that can be evaluated analytically. So, in fact, I did the dirty work for you, and you can find out that the numerator amounts to evaluating this Gaussian and then [inaudible] evaluating such a Gaussian distribution, where here I stacked the two iVectors, one above the other. But you can see that again the score is symmetric: it doesn't matter whether I switch these two or these two, you are still going to get the same score. All right? So now it can be very easily evaluated. In fact, doing some more manipulation with this Gaussian, just expressing it explicitly, you find out that the final score can be calculated just like this, where this is some bilinear product -- just a product of the first and second iVector with some low-dimensional matrix in between. This is the most expensive part. All the other things are just constants calculated for one iVector; you can really precompute this constant for each segment.
And then there are these lambda and gamma parameters that are calculated this way, but it doesn't really matter -- this is just what you would obtain; this is what we need to know. The thing is, the score can really be calculated by this simple formula. And, in fact, you can even take a decomposition of the lambda and precalculate it into the iVector, so that at the end the score can really be calculated just as the dot product of two vectors. That's it.
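[Illustration: a hedged sketch of the PLDA verification score in its directly evaluable form -- stack the two iVectors and compare a joint Gaussian that ties them through the across-class covariance (same speaker) against one that treats them as independent (different speakers). The bilinear form with lambda and gamma mentioned on the slide is an algebraic simplification of this same quantity.]

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(i1, i2, m, B, W):
    """log p(i1, i2 | same speaker) - log p(i1, i2 | different speakers).

    B: across-class covariance (V V'), W: within-class covariance.
    """
    T = B + W                                  # total covariance of a single iVector
    mean = np.concatenate([m, m])
    x = np.concatenate([i1, i2])
    # Same speaker: the two iVectors are correlated through the shared speaker.
    cov_same = np.block([[T, B], [B, T]])
    # Different speakers: no correlation between the two iVectors.
    cov_diff = np.block([[T, np.zeros_like(B)], [np.zeros_like(B), T]])
    return (multivariate_normal.logpdf(x, mean, cov_same)
            - multivariate_normal.logpdf(x, mean, cov_diff))
```

Note that swapping i1 and i2 leaves the score unchanged, which is the symmetry pointed out above.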
And there is good news: this iVector and PLDA approach doesn't need any ZT normalization. It works just fine as it is.
Another thing to point out: we can actually calculate a whole bunch of scores very quickly, because, as I just told you, the score is in fact just a dot product between two vectors. So if we just rewrite it in terms of having sets of iVectors -- let's say enrollment iVectors and test iVectors, but again, we don't have to distinguish enrollment and test data, the role of both iVectors is completely symmetric --
then, having these sets of iVectors, you can calculate the whole matrix of scores, scoring each enrollment iVector against each test iVector, just by calculating this product of three matrices, or actually a product of two matrices if we pre-process the iVectors. So once we have the iVectors, it's extremely fast -- there is nothing simpler that would give you the score, you just multiply two matrices of iVectors.
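[Illustration: scoring every enrollment iVector against every test iVector in one shot, essentially as one matrix product for the bilinear part plus per-segment constants. The per-segment constants are left abstract here; they collect the gamma/c/k terms mentioned above.]

```python
import numpy as np

def score_matrix(E, T, Lambda, const_e, const_t):
    """All trials at once.

    E: (n_enroll, R) enrollment iVectors, T: (n_test, R) test iVectors,
    Lambda: (R, R) bilinear scoring matrix,
    const_e, const_t: per-segment constants precomputed for each iVector.
    Returns an (n_enroll, n_test) matrix of verification scores.
    """
    bilinear = E @ Lambda @ T.T                 # the expensive part: matrix products
    return bilinear + const_e[:, None] + const_t[None, :]
```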
And I see that I'm running out of time. I can actually cut it before I get to the
discriminative training.
>>: So does he know that he's late? [laughter].
>> Lukas Burget: I apologize. I didn't expect so many questions from the
audience, and I --
>>: [inaudible].
[brief talking over].
>>: Could we hold the questions until much later, maybe after the talk.
>> Lukas Burget: Okay.
>>: We just flip through --
>> Lukas Burget: Actually, I can go really, really fast to the end if you just let me
finish the talk.
So anyway, this slide shows the improvement. We can actually see that by doing this trick with PLDA, you get lots of improvement, especially in the low false alarm region, which is what we saw. So you can see that, again compared to JFA, for this false alarm rate you would be going down from about 50 percent to 30 percent. So a significant improvement.
>>: [inaudible] from my office [inaudible] start to look for me. [laughter].
>> Lukas Burget: And still it can be [inaudible] so fast, right? I have yet another picture which shows another system based on exactly the same idea, which is a whole lot better. In this case, when I was comparing this performance, I was comparing with what our features and our UBM were in 2008. If we apply a new UBM that we are using at the moment, where we use full covariance matrices rather than diagonal matrices, and if we use some other tricks that we learned just recently -- reduce the dimensionality by LDA, normalize the length of the iVectors to unit length, and tricks like that -- you can get another quite significant gain from that.
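[Illustration: a rough sketch of the iVector post-processing tricks just mentioned -- reduce the dimensionality by LDA (to roughly 200) and normalize each iVector to unit length. The exact recipe (centering, ordering of the steps) is an assumption about one common variant.]

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def preprocess_ivectors(train_ivecs, train_speaker_ids, ivecs, out_dim=200):
    """LDA projection trained on labelled iVectors, followed by length normalization."""
    lda = LinearDiscriminantAnalysis(n_components=out_dim)
    lda.fit(train_ivecs, train_speaker_ids)
    projected = lda.transform(ivecs)
    # Normalize each projected iVector to unit Euclidean length.
    return projected / np.linalg.norm(projected, axis=1, keepdims=True)
```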
So this is just the relevance MAP adapted baseline. But, in fact, it already uses better features than we had, for example, in 2006. So I think it's kind of fair to compare this line, or maybe a line that would be just a little bit better, with this performance that we have right now. And you can see how much gain in performance we were getting from about the year 2006 to 2011 -- the performance for this false alarm region dropped from about 80 percent down to about 12 percent. Right?
So it would be --
>>: [inaudible].
>>: Hold the questions.
>>: I think -- the first green line is where things existed in the literature before.
>> Lukas Burget: I mean, in fact, this would be the system that existed in the year 2000, but which was still kind of the state of the art in 2005, 2006.
>>: The last two lines are things that you developed at BUT; is that correct?
>> Lukas Burget: Well, I wouldn't say that exactly -- this of course is based on ideas like PLDA that existed in face recognition and things like that. And there are other people that were working on the same things at the same time. You know, at the JHU workshop we found we can use these things as features, and everybody started implementing it; it was a straightforward extension. So there have been other labs doing the same research at the same time. But, yeah, basically these are systems that we have developed at BUT in the past two or three years, and this improvement is what we got really in the past two years.
Yeah, so the final slides -- and I can make it really short -- are on the discriminative training, just showing that now that we have this framework, we can actually retrain the parameters discriminatively and still get quite some gains. There were some efforts on what people would call discriminative training in speaker ID, and it was mainly the thing that I already described as support vector machines, where you would train a separate support vector machine for each speaker. But I don't consider that to be really the true discriminative training for the task that we deal with in speaker identification, because in speaker identification we really want to address the binary task of discriminating between a same-speaker trial and a different-speaker trial. And we should train the system for this binary task.
And we have done some work in this direction during the JHU workshop in 2008, where we tried to train the hyperparameters of Joint Factor Analysis, the subspaces, discriminatively. But we were not too successful with that; there were probably too many parameters to estimate discriminatively. And there was still the problem with the score normalization: the discriminative training actually worked very well without the normalization, but once we added the normalization, we were getting just minor gains from the whole discriminative training.
Anyway, right now we have tried to apply the discriminative training to the PLDA system, to retrain the PLDA parameters discriminatively, and it seems to work quite well.
So this is actually the first time this kind of discriminative training was really successful for speaker identification. So about now is when the discriminative training techniques, I think, will start moving into the speaker identification field. There was nothing like real discriminative training so far.
So what we propose here is to calculate the verification score based on the same functional form as what we have used for PLDA, but instead of training its parameters using maximum likelihood, we are going to train it discriminatively. Specifically, I will be showing here that we use the cross-entropy function to train the parameters discriminatively. Again, as I said, the PLDA model is well suited for the discriminative training just because we need no score normalization anymore and the functional form is really very simple. And you will see that it actually leads to just a linear classifier.
So again, writing down the equation for this score and realizing that we can rewrite this bilinear form in terms of just a dot product of the vectorized matrix A, which is here in between, and the vectorized outer product of the vectors X and Y, we can actually rewrite this score function in the following form. We can just vectorize all the parameters -- lambda, gamma, c and k -- and we can compute the dot product with such a vector where we actually calculate all these outer products between the two iVectors in the trial. So we do some kind of non-linear expansion of the pair of iVectors in the trial and form a vector out of it. And then we can calculate the score just as the dot product between the vector of weights and the non-linear expansion of the iVector pair, right? And the resulting score can be interpreted as a log-likelihood ratio. This is what it gives us.
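[Illustration: a hedged sketch of the non-linear expansion of an iVector pair and the score as a dot product with a single weight vector collecting the vectorized lambda and gamma matrices, the c vector, and the constant k. The particular symmetrization is an assumption consistent with a symmetric score.]

```python
import numpy as np

def expand_pair(i1, i2):
    """Map a trial (pair of iVectors) to a fixed expansion vector."""
    return np.concatenate([
        np.outer(i1, i2).ravel() + np.outer(i2, i1).ravel(),  # goes with vec(Lambda)
        np.outer(i1, i1).ravel() + np.outer(i2, i2).ravel(),  # goes with vec(Gamma)
        i1 + i2,                                               # goes with c
        [1.0],                                                 # goes with the constant k
    ])

def pair_score(w, i1, i2):
    # Verification score (interpretable as a log-likelihood ratio) as a dot product.
    return float(w @ expand_pair(i1, i2))
```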
So what we are going to do here is train these parameters discriminatively, and we are going to train them using logistic regression, which is again a technique that assumes that these dot products can be interpreted as log-likelihood ratios. Or we use a support vector machine as the classifier.
So we are basically training just a linear classifier discriminatively, either a linear support vector machine or logistic regression. And I'm going to skip this slide, which just says what logistic regression is about; those that don't know it, I would just refer to some standard book to learn about what it is.
But the problem in -- the -- I assume that everybody [laughter].
>>: [inaudible].
>> Lukas Burget: Yeah, I'm sorry. But, well --
>>: Oh, my goodness.
>> Lukas Burget: Everybody was waiting for a thank you or something like that. I'm sorry. And I did that on purpose, because I thought you would be nervous [inaudible] anyway. So the thing is, we have to realize that for the discriminative training, each training example is now going to be a pair of iVectors, and we can create our training examples ourselves. We have a bunch of training data and we can create our training examples: we have to create the same-speaker trials out of iVectors that correspond to recordings of the same speaker, and we can easily create lots of imposter examples, different-speaker examples, by just combining any iVectors that correspond to utterances of different speakers.
And in our case, we have about 2,000 training examples for females, 16K for males, so we could create almost a billion trials for training our systems. Now one could suggest that we should probably sample from those -- that would be too many. But, in fact, we don't have to, because if we want to calculate the gradient for the logistic regression we can use exactly the same tricks as what we used for calculating the scores. And you can actually figure out that the gradient can be calculated just by computing products of matrices over all our training trials. And the G is just the matrix of all the scores calculated by this simple dot product, with some simple nonlinearity applied to that.
So we train using almost a billion trials and we calculate the gradient in a few seconds, which allows us to train the system pretty fast. And hopefully we could train it on even much larger amounts of data. Yes?
>>: So do you run into any class imbalance issues? Like, it feels like --
>> Lukas Burget: Yes, you have to -- I mean, sure. That is what I would otherwise have explained on the previous slides, where we introduce some scaling factors for the trials and give more importance to target trials than to non-target trials.
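[Illustration: a hedged numpy sketch of the weighted cross-entropy objective and its gradient, here only with respect to the lambda (bilinear) part, computed over all pairs of training iVectors at once via matrix products -- the same trick used for fast scoring. The per-trial weights alpha handle the target/non-target imbalance; the gamma, c, and k parts would follow the same pattern and are omitted.]

```python
import numpy as np

def cross_entropy_and_grad_lambda(X, Lambda, labels, alpha):
    """X: (n, R) training iVectors; labels: (n, n) matrix with +1 for same-speaker
    trials and -1 for different-speaker trials; alpha: (n, n) per-trial weights.

    Only the bilinear (Lambda) part of the score is modeled here.
    """
    S = 2.0 * (X @ Lambda @ X.T)                # bilinear scores for all trials at once
    margins = labels * S
    # Weighted logistic (cross-entropy) loss summed over all n*n trials.
    loss = np.sum(alpha * np.logaddexp(0.0, -margins))
    # dloss/dS for each trial.
    G = alpha * (-labels) / (1.0 + np.exp(margins))
    # Gradient w.r.t. Lambda accumulated over all trials via one matrix product.
    grad_Lambda = 2.0 * (X.T @ G @ X)
    return loss, grad_Lambda
```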
So this table just compares the performance gains from the discriminative training, and you can see that the PLDA already performs quite well, but just retraining things discriminatively can give you another nice gain. We are actually getting somewhat better performance with support vector machines, but that was most likely just because we didn't use any regularization in the case of logistic regression; that would probably help also.
So this is the final slide, I promise. So, yeah, I'm just summarizing what I was telling you. We have seen that over the past few years we have speeded up the technology quite a lot -- the speedup factor would be orders of magnitude -- and we have improved the accuracy, as you could see, by a factor of five and more. And you could see that for this particular operating point we went from about 80 percent error down to 12 percent error. So these are really huge gains.
And you could see the shift in the paradigm: we are moving from the paradigm where we were training the speaker model and then testing it on the test utterances, to something that just compares the pair of utterances in a more principled way, which is similar to the Bayesian approaches for model comparison. And the new paradigm based on the iVectors actually opens new research directions. I have just been showing you that we train the PLDA discriminatively, but there are other possibilities that we work on currently. You can easily combine the low-dimensional representations: you could have different systems for different features, for example, and get different iVectors for the different systems; you can simply stack the iVectors and train the system on that. So they allow us to combine the information from different systems in a very easy way. Or you can use the same idea of extracting the low-dimensional iVector based on other models. Instead of the Gaussian mixture model, we have recently used a Multinomial Subspace model, which is quite similar to how we model the weights of the GMM, and something that we have been using before for --
>>: I was going to suggest doing that.
>> Lukas Burget: Yeah. But in this case it was just the weights, actually; it's still not the Gaussian mixture model. But we use it for modeling prosodic features, and now we got huge gains compared to what has been done before on prosodic features and things like that.
Yeah. So that's pretty much everything I wanted to say.
[applause].
>> Dan Povey: I guess it's lunch time.