>> Dan Povey: So I'm happy to introduce Lukas. Lukas is the head of the
speaker ID effort at BUT that has in the last few years become quite well known.
This is due to the success in the speaker ID evaluations.
And I asked Lukas how many of these they actually won. Apparently it's a big no-no
to say that someone won one of these evaluations, it's against the rules I guess
because everyone is a winner. [laughter]. But they did outstandingly I think it's
safe to say.
And Lukas is also in charge of the other kind of research efforts at BUT involving
speech things. And I personally know Lukas through this JHU project where we
were doing the SGMM work. Lukas was kind of in charge of a lot of aspects of
that project and contributions to the math as well.
So he knows his stuff. And he's going to be explaining -- I really don't know what
this talk is about because I'm not a speaker ID guy, but I hope that by the end I
will understand it. So, Lukas.
>> Lukas Burget: Thanks, Dan, for the introduction. So as Dan said, I'm
interested in different problems in the speech field -- speech recognition,
language identification, speaker identification, keyword spotting. But
today's talk will be about speaker recognition, which is what I have been
mainly active in over the past few years.
So what do we have on the agenda today? I will just give some short
introduction into what the task is about, and then I will say a few words about
the main problem in speaker identification, which we call channel
variability. I will explain what I mean by that.
Then I will give some kind of tutorial on the techniques for speaker
identification. I will describe the traditional approach, and then I will move to the
approaches that evolved over the past few years, so I will show you the shift in
the paradigm that happened during the past few years that improved the
systems in terms of both speed and performance. And then at the end, I will say a
few words about my most recent work, which is a discriminative approach to
speaker identification, something really new in speaker identification but
quite usual in speech recognition --
>>: [inaudible].
>> Lukas Burget: Yes?
>>: [inaudible].
>> Dan Povey: Do you have your mic -- I think he's not supposed to be
broadcasting, it's just a recording.
>>: We're having trouble hearing.
>> Lukas Burget: Okay. I can definitely speak louder.
And so this slide just lists some of the speaker recognition applications, just to
give an idea of what the technology is really good for. And we were mainly involved in
projects that deal with the security and defense aspects, where you can use
speaker identification for searching for a suspect in large quantities of audio, and
the other line, link analysis and speaker clustering. Link analysis is
actually something that the defense people use to search for things like -- I mean,
the problem is you would have lots of recordings of various people and you would
really like to make some link between where the same speaker speaks and whether
he speaks from the same phone and things like that. So you really need to search
in large quantities of audio and find the same speakers, clusters of speakers.
So quite a difficult problem if you have millions of recordings.
Then, maybe for the more conventional applications, you can think of
applications for access control to computer networks.
Nowadays speaker identification is used for transaction authentication. So
telephone banking uses speaker identification not really to verify that it's the
speaker or not, but at least to get some idea. And eventually you don't need to
ask for that many passwords if, from the recording, it already looks like this is
probably the right person.
Then also you can think of other applications like the Web -- voice-Web or device
customizations, your device can -- I mean your mobile phone can recognize that
it's really you and adjust accordingly.
Maybe for the speech recognition people here, you can think of using this for
ASR -- clustering recordings to adapt the model for a particular speaker on the
recordings of the same or even a similar speaker. This is something that people
have been trying recently and which seems to help quite a lot. So you could
think of really searching for the same person in all the recordings that
come from the Bing system and trying to adapt the speaker model for the current
recording on the similar data. And so on.
And also I think that Microsoft could be also interested in the search in audio
archives and the -- I think there was some work on really indexing the speech
and accessing the speech so you could be interested in searching for particular
talk of some -- of the same speaker that you have been watching right now,
things like that.
>>: [inaudible].
>> Lukas Burget: I just tried to point out maybe some of those that we are mainly
interested in and maybe what Microsoft could be interested in but really not for
any particular reason.
So there has been a lot of progress in the past, say, five years. There was a massive
speed-up in the technology that allows us to use it on a much larger
scale. I will speak about the things that are described in the bullets, like the
development of the techniques that allow you to create
low-dimensional voiceprints for a recording; you then search for the
speakers just by comparing these voiceprints. The voiceprints can be
obtained something like 150 times faster than real time, so this is something that
you can easily do online when you record speech.
And then once you have these low-dimensional voiceprints, you can compare
the voiceprints, running billions of these verification trials, billions of
comparisons, in a few seconds.
So this is one aspect: we sped up the technology quite a lot. And the other
aspect is that we improved the accuracy a lot. So over the past
five years we improved the technology by a factor of more than five,
so the error rates are more than five times lower. You will see that in the graph
that I will be showing in a second.
>>: [inaudible] compare the five times, that was what kind of system at the time?
>> Lukas Burget: I will talk about that. I mean, really, compared to the state of the art in
like 2006 -- you will see what I mean.
So one thing that allowed us to improve the technology is the Bayesian
modeling approaches that we are using nowadays. I will mention those. And
mainly the development of channel compensation approaches like Joint
Factor Analysis and Eigenchannel adaptation. Again, I will introduce those.
So first of all, what the task in speaker recognition is. We will be mainly
interested just in text-independent speaker verification, where the task is: given an
example recording of some speaker, detect recordings of the same speaker in
other data. Or, equivalently, we can say: given a pair of recordings of unrestricted
speech, decide whether these two recordings come from the same speaker or from two
different speakers.
So these are two questions which are actually equivalent. But the form in which you ask
this question somehow corresponds to two verification approaches. The
first one is the more traditional one, which I will talk about in more detail shortly, where
you train a speaker model on some enrollment utterance and then you have some
background model which is trained on a large amount of recordings from many
speakers. And then to make a decision for a test utterance you just compare the
likelihoods between these two models.
The other approach would be to have two models. Both these models would
take both recordings as their input, and one model would say how likely it
is that these two recordings come from the same speaker, and the other one
how likely it is that they come from two different speakers, and you
would compare these likelihoods.
I already mentioned the problem with channel variability in speaker
identification. This is really the most important problem there. What do we mean by
channel variability? Just imagine that for each verification trial, for
each pair of recordings, we have these two recordings. They are usually just a few
seconds or maybe a few minutes long, at most. And you are supposed to say
whether they come from the same speaker or from different speakers. But each of
them could have been recorded under different conditions: a different
telephone, different handset, different microphone, different background noise,
a different mood of the speaker -- or he could be sick, and his voice can be different
because of that.
So this is what makes the problem difficult. We have just two
recordings. They can each sound quite different, maybe just because of the
background noise, and you are still supposed to say that it is the same
speaker in both of them.
>>: [inaudible] variability if you are focusing on speaker variability in addition to
channel and session, I thought the more important variability [inaudible]
difference, since you are dealing with text independent.
>> Lukas Burget: Well, you will see that -- well, the phonetic
difference is also unwanted variability, but usually at least we have a recording
which kind of covers all the phonemes, so you have information about all the sounds
possible. But it doesn't have to be the truth. So, I mean, in fact, the
difference in the content spoken is of course one of the
problems here.
But with the data that we usually deal with, which are say about two minutes long,
they cover all the phonetic information, so this is not really
an issue. The more difficult part is the differing channel, and you will see
that on the next slide.
Anyway, so we have -- we should call this like channel variability session,
channel or session variability, inter-session variability. But I'm going to refer to
this kind of variability simply as channel. But I will mean by that even this
phonetic variability, for example.
What I'm showing on this slide is how important it is to apply these channel
compensation techniques. And it was probably in the 2006 NIST speaker
recognition evaluations that it was recognized that the channel compensation
techniques are important.
We can see that on the graph -- I should say first what the graph is actually
showing. So this is a DET curve, which is the usual graph showing the
performance for a speaker identification task. I mean, I don't know whether you
can read what is on the graph, but this axis is the false alarm probability and this
one is the miss probability. The graph is pretty similar to what people
would know as an ROC curve, a receiver operating characteristic curve. Usually on
the ROC curve this axis would be flipped, so you would be showing the accuracy
rather than the miss probability. And it also differs in the scale of the axes. The
ROC curve would use just a linear scale from zero to one hundred percent
probability of error. Here the scale is warped for a reason: if the distributions of
the target and non-target scores are Gaussian, then the curve is not really a
curve but just a straight line, so it is easier to compare the differences between
the lines.
Anyway, so what the graph shows is the probability of false alarm and the
probability of miss, and it actually shows us the performance at different operating
points, where you make a tradeoff between these two kinds of error.
Either you make too many detections, but then you would
have a high false alarm probability and a low probability that you miss
something, or you go for the low false alarm region, which is actually where
the defense applications usually are, what the defense people are usually
interested in.
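Just to make the description of the plot concrete, here is a minimal NumPy/SciPy sketch of how such DET points could be computed; the score arrays are hypothetical, not data from the evaluations:

```python
import numpy as np
from scipy.stats import norm

def det_points(target_scores, nontarget_scores):
    # Sweep a decision threshold over all observed scores.
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    return p_miss, p_fa

# The DET plot uses probit (inverse Gaussian CDF) warped axes, so Gaussian
# score distributions show up as straight lines.
target = np.random.normal(2.0, 1.0, 1000)        # hypothetical target scores
nontarget = np.random.normal(0.0, 1.0, 100000)   # hypothetical non-target scores
p_miss, p_fa = det_points(target, nontarget)
x = norm.ppf(np.clip(p_fa, 1e-6, 1 - 1e-6))      # warped false-alarm axis
y = norm.ppf(np.clip(p_miss, 1e-6, 1 - 1e-6))    # warped miss axis
```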
>>: Each curve is [inaudible].
>> Lukas Burget: Well, I -- [laughter] I can't say that, but -- anyway. These are
the curves from the 2006 evaluations. Dan said that we can't say who was first,
who was second. But we are actually allowed to show the DET curves of all the
participants. These are the black curves, and we can say that ours are the two
in color.
>>: [inaudible].
>> Lukas Burget: Sorry?
>>: How many systems, how many --
>>: Each line is a separate --
>> Lukas Burget: Each line is a separate site. The best system from the site -- or the
selected system from the site -- so there were like 30 sites participating in the
evals. And we have two curves just because we participated -- well, we had two
submissions. One was the Brno University of Technology submission, and we
were also part of a consortium with Spescom DataVoice, who are our friends
from South Africa, TNO from the Netherlands, and the University of Stellenbosch, also
from South Africa.
So why am I showing these graphs -- you can clearly see that these systems
are clearly better than the other systems, and this success of our system was
really because we implemented the channel compensation technique called
Eigenchannel adaptation, which I will introduce shortly.
There weren't any channel compensation techniques like this before, so some of the
participants, like MIT Lincoln Labs, for example -- and I can't say which curve
belongs to them -- had something that was called feature mapping, which was a
different kind of adaptation. You had to first recognize what channel it is; you
had to classify the channels. And after classifying the channels you could apply
the compensation, while the techniques that we are using
currently work in a kind of unsupervised way, where you recognize
the channel along the way. So we will see what that actually means.
This is another slide from the NIST evaluations two years later, and you can actually
see that -- well, our curve here is the black one, so it's still -- we still perform quite
well. But you can see that the difference is not all that large. Why was that?
Because all the other sites, of course, realized that they had to implement these
techniques too, so they kind of managed to catch up with --
>>: Was this evaluation more difficult --
>> Lukas Burget: Yeah. Yes, it was. I mean, every time they are
making it more difficult by introducing different channels and restricting things:
people have to make the conversations using different telephones, actually, so
nowadays you are not allowed to make two calls from the same telephone
because that's considered to be too --
>>: [inaudible].
>> Lukas Burget: Yes. And the --
>>: [inaudible].
>> Lukas Burget: Right. So the difference is mainly really in making
the task more difficult. Probably also there were
recordings of non-native speakers, which makes a difference, because we trained
our system only, or mainly, on native speakers. And the
non-native speakers introduce some other channel differences, in the
microphone, things like that.
So yes, this task -- it's every -- every NIST evaluation tries to be more and more
complicated and difficult.
>>: Was Joint Factor Analysis [inaudible].
>> Lukas Burget: In our system, there was Joint Factor Analysis. And why I'm
showing this graph is for two reasons. Of course I want to show that
we are doing really well in these evals, but also that
the technology that I'm going to describe in the following slides is not just a
system built at some site; it's really the state-of-the-art technology, and what you
are going to hear about today is the top technology in speaker identification.
>>: So this joint analysis -- Joint Factor Analysis -- is it in the feature domain or in the model
domain?
>> Lukas Burget: In the model domain. So you will see what I mean by that
shortly.
And yes, so what I'm trying to say here also is that usually most of the sites
combine many different systems based on different features and even different
modeling techniques to get the best performance. Here we could really achieve
this best performance just using a single system based on this Joint Factor
Analysis. So a pretty simple system that still provides this good
performance.
>>: How many speakers generally, ballpark in these evaluations?
>> Lukas Burget: You mean how many --
>>: Speakers in the database.
>> Lukas Burget: Like a few hundred speakers. Of course, I mean, there
would be lots of trials. You always make a trial out of a pair of recordings, and
there would be a limited number of same-speaker trials and lots of
different-speaker trials. But let's say there would be a few hundred recordings --
the last evaluations were maybe a few thousand
recordings, a few thousand speakers, sorry, and a few hundred thousand
trials that you are supposed to test.
>>: What is the duration of the [inaudible].
>> Lukas Burget: All the results that I'm showing here are on what we call the
one-conversation training condition, which is about two and a half minutes. It's a
five-minute conversation; from two to two and a half minutes of it is real speech
from one of the people. We have two --
>>: [inaudible].
[brief talking over].
>> Lukas Burget: Both. Both are two and a half minutes. I mean, there are
different conditions in the NIST evaluations, but we are usually interested
in this one and I am showing the performance on this one. This is like the main
condition that people would be looking at.
>>: How do number of speakers may affect the performance?
>> Lukas Burget: Which speakers?
>>: The number of speakers. How does the number of speakers --
>> Lukas Burget: Well, the number of speakers doesn't really affect the
performance, because you do verification. You don't train the
models on those speakers. Actually, when
you do a verification trial, when you get the two recordings, you are not allowed
to use any of the other recordings from the set. So you don't know anything
about the other speakers. So how good the
performance is is not really affected by the number of speakers in the --
>>: [inaudible] verification.
>> Lukas Burget: These are verifications, right.
>>: Verification results?
>> Lukas Burget: Yes. So it doesn't get simpler or more difficult if you have
more or fewer trials. You would just see that the DET curve would become steppy,
so it would get more noisy if you don't have enough trials to test it on, and it would
just --
>>: The entire evaluation has nothing to do with the identification task, where --
>> Lukas Burget: No, no, you don't have -- here you don't have information
about the -- yeah. This is pure verification. You don't have information about this --
>>: Is this different --
>> Lukas Burget: The models would be about the same. It just becomes
more difficult, and usually for an identification task you would need many more
recordings to make the results significant; it would just become too
simple. So they are just trying to make it more difficult so that the results --
the differences between systems -- are really significant. And also, I mean,
this is the task that the government people are really interested in, so.
So even if you use all the channel compensation techniques, the channel
difference still remains an issue, which we can see on this graph. This is a
graph from the 2008 evaluation; I'm using one of the other conditions, where the
recordings are recorded over microphone, so these are actually not telephone
recordings, but interviews recorded over microphone.
The other curve here is where you can actually see what happens if you
don't have this problem, because here we limited ourselves just to data
recorded over the same microphone. We still make many verification trials, but we
only compare recordings, of the same or a different speaker, that were recorded
using the same microphone. You can see that the
performance actually gets about three times better just because you don't have this
problem with channel mismatch.
Otherwise, they are the same recordings. They are the same noisy or clean
recordings recorded over the same microphones, just without the problem of
channel mismatch.
So there's still room to improve the techniques.
The traditional approach. I already said that in the traditional approach what we
do is first train something that we call the universal background model, using
recordings of many speakers.
Then, for each enrollment utterance, we train a speaker
model on the enrollment utterance, and then for verification
we take the test utterance, calculate the likelihood using both models, compare
those, and decide whether this is the same speaker or not. So basically the score
that we base the verification on is a log likelihood ratio between these
two likelihoods.
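A minimal sketch of that scoring, assuming diagonal-covariance GMMs; the parameter layout here is an assumption of this sketch, not code from the system described:

```python
import numpy as np
from scipy.special import logsumexp

def gmm_loglik(frames, weights, means, variances):
    # frames: (T, D); diagonal-covariance GMM with C components:
    # weights (C,), means (C, D), variances (C, D). Returns (T,) log-likelihoods.
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)          # (C,)
    diff = frames[:, None, :] - means[None, :, :]                        # (T, C, D)
    log_exp = -0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)     # (T, C)
    return logsumexp(np.log(weights) + log_norm + log_exp, axis=1)

def verification_score(test_frames, ubm, speaker_model):
    # ubm and speaker_model are (weights, means, variances) tuples.
    llr = gmm_loglik(test_frames, *speaker_model) - gmm_loglik(test_frames, *ubm)
    return llr.mean()   # compare to a threshold to accept/reject the claimed speaker
```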
>>: You use the higher order of MFCC compared to ASR to [inaudible].
>> Lukas Burget: This is just what [inaudible] -- not really. I mean, what we use
really are MFCCs -- the features that we are currently using are actually MFCCs with
20 coefficients, which seems to perform well -- but, I mean, people would
normally use the same features as for speech recognition. This is something
that we did just recently to have a higher resolution of the spectrum. And then we
also use deltas and double deltas, the normal thing that people would do for
speech recognition.
It's not necessary. We just saw that we are getting some gains by having larger
dimensional features. We don't really know at the moment whether it's because
we provide more information or whether we really need to see this fine spectral
detail, things like that. But, yeah.
So what we would use here as the model for a speaker would be
simply Gaussian mixture models. The universal background model would be a
Gaussian mixture model trained on a whole bunch of features -- we use the
standard MFCC features, where you get a feature vector for every 10 milliseconds
of speech -- and for training the universal background model we would simply
pool all the data together and train a Gaussian mixture model on that. People
have tried hidden Markov models and all kinds of things and never really got
any benefit from doing that. So it is just: simply pool all the data, train a
Gaussian mixture model on that to get the UBM. And then for the speaker model,
having some adaptation data for it -- so this would be just an example, two
Gaussians in a two-dimensional space -- having some training data for the
speaker, we apply this relevance MAP adaptation, which is based just on this
formula, which is the usual thing that people would do to adapt models also for
speech recognition. The simple thing -- I mean, this is exactly what people would
have been doing five years back; they would build a speaker model this way.
>>: So there's a different kind of adaptation [inaudible].
>> Lukas Burget: I mean, this is the thing that you actually do in Joint
Factor Analysis to some extent. But let me talk about that later.
People have tried Eigenvoice adaptation, ML adaptations. The MAP
worked the best. All the other techniques usually perform much worse than MAP --
but that is because of the problem with the channel, and we will see that later.
Once we start dealing with the channel and start modeling it, Eigenvoices
actually help a lot.
So here you can see that the adaptation is really simple: we
adapt only the means, and the means of the adapted model are just a linear
combination of the means from the UBM for each component with what
would happen if you retrained the UBM using maximum likelihood retraining,
which would give you this model. And then basically the final model is just
one where the means move somewhere along the way towards the
maximum likelihood trained model. The weights are the occupation counts: the
more data you have for a particular Gaussian, the closer it would move towards the
data. And why is that called MAP? We will talk about point MAP estimates later on
with our other probabilistic models, but here the reason this simple technique is
MAP adaptation at all is that it corresponds to getting a point MAP estimate of the
mean parameters using some ad hoc priors derived from the universal background model.
We do not adapt the weights and covariance matrices, which means that the speaker
model is fully characterized just by the means. We can stack the means -- this is
what we are going to do next -- into one supervector, and the supervector of means
would be the speaker model. The rest is shared
among all the speakers, taken from the UBM.
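A minimal sketch of this relevance MAP adaptation of the means and the stacking into a supervector, assuming diagonal-covariance GMMs; the relevance factor of 10 matches the value mentioned in the exchange just below, everything else is illustrative:

```python
import numpy as np

def relevance_map_means(frames, ubm_weights, ubm_means, ubm_vars, r=10.0):
    # Posterior (occupation) probabilities of each UBM component per frame.
    diff = frames[:, None, :] - ubm_means[None, :, :]
    log_g = (-0.5 * (diff ** 2 / ubm_vars[None, :, :]
                     + np.log(2 * np.pi * ubm_vars[None, :, :])).sum(axis=2)
             + np.log(ubm_weights))
    gamma = np.exp(log_g - log_g.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)                 # (T, C) responsibilities
    n = gamma.sum(axis=0)                                     # occupation counts (C,)
    f = gamma.T @ frames                                      # first-order stats (C, D)
    ml_means = f / np.maximum(n, 1e-10)[:, None]              # ML re-estimated means
    alpha = (n / (n + r))[:, None]                            # data-dependent interpolation
    return alpha * ml_means + (1.0 - alpha) * ubm_means       # adapted means only

def supervector(adapted_means):
    # The speaker model is fully characterized by the stacked means.
    return adapted_means.reshape(-1)
```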
>>: In speech recognition when people use MAP, then how -- is it quite
different from the number --
>> Lukas Burget: Well, it would be about 10. It would be
about 10. And for speaker identification it actually doesn't matter that
much. I mean, if you used 50 it would probably perform about the same. It doesn't
matter very much.
So now I would like to say what the problem of channel variability is. I mean,
I already told you when I introduced the channel variability, but what does it mean in
the model space? Say we have this UBM model -- just for
simplicity we have one Gaussian and two-dimensional data, to keep things simple.
We train the UBM model, and then we get training data for some
particular speaker and we adapt the UBM to this data to get the speaker model.
Then we get another recording, and I would ask: is this recording from the same
speaker or does it come from a different speaker? And maybe here you would say,
well, it probably comes from a different speaker, because it's closer to the UBM
model than to the speaker model.
But I tell you no, this is another recording of the same speaker. So we train
another model for this speaker, and you get another model of the same speaker
which is quite different. Then you do that with many recordings and you get a
whole cloud of models for that speaker. You can do that for another speaker;
you get the whole cloud of models for another speaker, for yet another
speaker.
So now we look at all these models, which are just single Gaussian models
for simplicity here. We would see that there is some direction with large session
variability -- here all the models differ just because we took different recordings of
the same speaker. And there would be some direction with large speaker
variability, where the means of the speakers are really different when going from one
speaker to another one.
So the idea of channel compensation techniques in general, and particularly
of the Eigenchannel adaptation that we used for the first time in 2006, is that when we
get the UBM model and the speaker model and the test data, then before we
evaluate the likelihoods, we are allowed to adapt
both these models by moving them in this direction of large session variability to
match the data best, in the sense of maximum likelihood, for example, and only then
we ask about the likelihoods. So now it would be clear that the data
fits the speaker model better than the UBM. Right.
So far it was simple -- it was just for a single Gaussian -- but in real applications
we deal with Gaussian mixture models. So I'm actually reusing a slide that I used the
first time for explaining the subspace GMM models, which were introduced
by Dan and which are related to this problem.
>>: [inaudible] also used the current [inaudible].
>> Lukas Burget: They did that for a different approach, for
systems based on -- well, either for the channel compensation they did that at
some point, and then they used it with systems based on support vector machines.
But I would dare to say that these systems are -- maybe they
wouldn't agree with me, but I would say that these are becoming obsolete; the
performance of those wouldn't be all that good anymore. They would object that it
still has a chance to work if you have many recordings of the
same speaker. Then this cumulative training done in that way would probably
help, and then you can make use of the --
>>: [inaudible].
>> Lukas Burget: We never saw that it would help in any way to adapt the --
different people tried that and it never helped. It's probably just too dangerous to
play with the covariance matrices, because the likelihood score is then too
dependent on -- I mean, just more sensitive to changes in the [inaudible].
Anyway, so what do we have here? We have the Gaussian mixture model --
let's say this thing would be the speaker model or the UBM model that we want to
adapt to the channel -- where we have already stacked the mean vectors into the
supervector. So let's say these two coefficients correspond to the two-dimensional
mean vector of the red Gaussian, these are for the blue Gaussian, and so on.
Now, if we want to adapt the speaker model to the channel, we need the subspace
which corresponds to large session variability. So let's say that this would be the
matrix which corresponds to the subspace, and then we have some coefficients
which describe a low-dimensional representation of the channel. You can think
of them as weighting coefficients which tell you how to weight these different bases,
how to linearly combine these different bases and add them to the speaker model
to get the final adapted model.
So, you know, maybe just to get some intuitive understanding of the whole thing: if
we think of changing this factor, it means that we are scaling the first basis
vector and adding it to the speaker model, where these two coefficients actually
correspond to some direction in which this Gaussian starts moving, and these two
coefficients correspond to the direction in which these two Gaussians start moving.
And now if we start making this coefficient larger or smaller, the
Gaussians start moving along those directions. When [inaudible] saw the animation
the first time he said that it's better than Avatar, and I promised him that I would have it
in 3D next time, and I still have it just in 2D, so I apologize for that.
Okay. So this would be -- I mean, one coefficient allows you to move the Gaussians in a
certain direction. Then if we start moving a coefficient like this one, then they would
possibly -- I'm sorry, they would possibly start moving in different directions.
So just by changing these coefficients we can get lots of different
configurations of the Gaussian mixture model, but at the same time they are
constrained to move in some low-dimensional subspace of the model parameter
space, right? So we can quite robustly estimate these few coefficients,
which would normally be something like 50 or a hundred coefficients to estimate per
recording, and that allows us to adapt the Gaussian mixture model towards the
channel of the [inaudible].
>>: [inaudible] estimate [inaudible].
>> Lukas Burget: So -- right. So the trick now is just like what we saw on the
-- let me go here. Just like what we saw on this slide, where I allowed both these
Gaussians, the speaker and the UBM, to move along this direction; it
would do the same thing. So the coefficient just corresponds to how far we are
moving the Gaussian. We do exactly the
same thing, just with a Gaussian mixture model, where we have possibly multiple
directions, and we move the Gaussian mixture model in the subspace of
the large channel variability. Okay?
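A minimal sketch of what this adaptation could look like in code, working from zeroth- and first-order statistics collected against the model; the closed-form solve (including an identity term that acts as a standard-normal prior on the factors) and the shapes are assumptions of this sketch, not the exact recipe from the talk:

```python
import numpy as np

def estimate_channel_factors(n, f, model_means, ubm_vars, U):
    # n: (C,) occupation counts; f: (C, D) first-order statistics of the utterance;
    # model_means: (C, D) means of the model being adapted (speaker model or UBM);
    # ubm_vars: (C, D) diagonal covariances; U: (C*D, R) channel subspace.
    C, D = model_means.shape
    prec = (1.0 / ubm_vars).reshape(-1)                         # (C*D,) precisions
    n_rep = np.repeat(n, D)                                     # counts per mean dimension
    f_centered = (f - n[:, None] * model_means).reshape(-1)     # centered first-order stats
    A = np.eye(U.shape[1]) + U.T @ ((prec * n_rep)[:, None] * U)
    b = U.T @ (prec * f_centered)
    return np.linalg.solve(A, b)                                # channel factors x

def channel_adapted_means(model_means, U, x):
    # Shift the model means within the channel subspace: M_adapted = M + U @ x.
    return model_means + (U @ x).reshape(model_means.shape)
```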
>>: Can you go back to that one before, please. There.
>> Lukas Burget: There. Okay.
>>: So UBM or MAP adapted speaker model --
>> Lukas Burget: Yes. So --
>>: Wouldn't it make sense to do the same trick for --
>> Lukas Burget: Yes, and -- we will get there, right. It is the next thing, which is Joint
Factor Analysis. So, yes. So basically yes. I mean here -- but this works
actually pretty well. So you can still get the MAP adapted speaker model and
just do the channel compensation in this way. But of course now the idea is that we
could also restrict the speaker to live in some low-dimensional space, right?
>>: It has to be speaker adapted model. If you use the UBM in that vector does
it [inaudible]? So let's go back to the previous slide. I think I missed that a little
bit.
>> Lukas Burget: Okay.
>>: This whole question that [inaudible] was asking.
>> Lukas Burget: Right.
>>: So that M vector is --
>> Lukas Burget: The M vector is one time the speaker model and the other time it
would be the UBM. So, I mean, just like on this slide, you adapt both. You adapt
the speaker model and you adapt the UBM to the channel. So you need to do
that for both. You have the UBM model, you have the speaker model. For
each of those you estimate the channel factors, eventually a different set of
channel factors for --
>>: So you do that twice?
>> Lukas Burget: You do that twice.
>>: I see.
>> Lukas Burget: And you -- we will see that you actually don't have to do that
twice. You can actually estimate it just using the UBM and reuse it for the
speaker model, which doesn't make much sense, but it seems to work very well.
And --
>>: You use UBM to --
>> Lukas Burget: You can use the UBM just to estimate the channel factors, and it's
going to work about the same if you make some other approximations. But I will
get to that. So I mean --
>>: The UBM model with the speaker adapted model and then to make a
speaker twice [inaudible].
>> Lukas Burget: Yes. Well, you estimate it with the UBM, but yes, to
synthesize both models you can kind of think of stacking one under the other. But
it actually doesn't work on its own. It needs some other approximation to make it
work in this way.
Anyway, this just summarizes what I just said. You
train the speaker model in the usual way using MAP adaptation, this one, and
then for every verification trial you take the UBM and you take the speaker model
and you adapt them to the channel of the utterance just by finding the
appropriate X vector, which we call the channel factors. We just
search for those to maximize the likelihood of the test utterance, and then
using the UBM and the speaker model we calculate the log likelihoods,
compare the log likelihoods, and this would be the verification score.
So this is something that was already used in the 2004 NIST evaluations by Nico
Brummer, who was actually our partner in the
2006 evaluations. At the time he didn't manage to make it work -- well, it actually
worked pretty well, because he could get about the same performance as other
systems that combined lots of techniques with just this single system. Two years
later we just made it work much better compared to --
>>: How does it differ from the -- suppose this is similar to the Joint Factor
Analysis --
>> Lukas Burget: I'm just getting there, so --
>>: [inaudible].
>> Lukas Burget: Yes, the [inaudible]. Yes. So this is -- this is just coming in
two slides later.
Anyway, now I told you how to adapt the speaker to the channel, assuming
that we already have these directions. Now the question is how do we get
the directions. There is a simple trick to do that, which is just to estimate each
speaker model -- so let's have just a two-dimensional model, a single
Gaussian in 2D, for example. Then each speaker model would be just one point in this
two-dimensional space, and then we can just subtract the mean of each of the
groups, the groups with the same color corresponding to the same speaker. We
get just the data -- well, we can now calculate a covariance matrix of this data
and find the Eigenvector with the largest
eigenvalue, or a few directions with the largest eigenvalues, and find the subspace this
way.
This is actually what we did in 2006 and it works pretty well. We can use a more
principled way by estimating it using maximum likelihood, which is what we did for
the Joint Factor Analysis that is just coming on the next slide.
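A minimal sketch of that simple trick, assuming we already have per-recording supervectors grouped by speaker; the input format is an assumption of this sketch:

```python
import numpy as np

def estimate_eigenchannels(supervectors_by_speaker, n_channels=50):
    # supervectors_by_speaker: {speaker_id: (num_recordings, C*D) array}.
    residuals = []
    for sv in supervectors_by_speaker.values():
        residuals.append(sv - sv.mean(axis=0, keepdims=True))  # remove each speaker's mean
    X = np.vstack(residuals)                                   # within-speaker variation only
    # Eigenvectors of the within-speaker covariance with the largest eigenvalues.
    # (For 200,000-dimensional supervectors one would decompose X @ X.T instead.)
    cov = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:n_channels]]                      # (C*D, n_channels) subspace
```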
Now just to show how these techniques perform. This actually shows
the performance of the systems that we developed for the 2006 evals, but still
showing it on 2005 data. Again, telephone trials. This would be the
system which is based just on the techniques that I described, just the relevance
MAP. No channel adaptation in the old ones.
The blue one would be what happens if you apply the channel adaptation
techniques like feature mapping that were known before the subspace channel
adaptations. You can see that you could get quite some improvement --
significant, but not that large -- compared to the subspace channel compensation
technique, which is the red one, the Eigenchannel adaptation, where we had 50
Eigenchannels, in other words a fixed 50-dimensional subspace with the largest
session variability.
>>: So [inaudible] anything like CMN or anything like that to do a first order --
you know, channel --
>> Lukas Burget: No -- it's there. In fact, I mean, this has all the techniques
that we would have there. So there would be the MFCCs; feature mapping is
what I -- actually, sorry, feature mapping comes here. But the feature
warping is actually similar to mean and variance normalization -- this is actually
normalizing things, CMN but done within a window. I mean, that would be
warping things into a Gaussian distribution; that's just too expensive, you don't
have to do that. But you have to take about a three-second window and
do just a mean normalization in the three-second window. We don't do any
variance normalization. It doesn't seem to be helping.
Okay. So yes, I mean, those are the basic tricks. So we can see
that the channel adaptation actually gave us a quite significant
performance gain. For example -- I'm trying to
find some good point where we can compare the performances -- where this
crosses some line, I don't know, like at 0.5, the error drops from
something like 40 percent down to 15 percent. So I mean this is already a lot
of improvement. Maybe it doesn't look that dramatic on the curve, but you
can see that the error rate reductions here were about 50 percent or
60 percent. So it's not like what people are used to in speech recognition,
where you fight for every percent; here in just two years people got a 50
percent relative improvement. Then another two years, with the JFA, we got
another 50 percent relative improvement.
So the nice thing about the speaker ID field is that it's a really active field, a live field,
where you get lots of improvements every --
>>: [inaudible].
>> Lukas Burget: Yes. The --
>>: So this --
>> Lukas Burget: The dimensionality of this is 50. We have 50 of these bases.
>>: The number should be roughly the same as your estimated different
channels. Is it sort of the [inaudible] type of --
>> Lukas Burget: Well, with the previous technique, with the feature mapping,
you would recognize something like -- well, we had 14 different classes. But it
wasn't really necessary. I mean, if you had just like four or five, it would already
do the same thing.
But the problem with the previous compensation technique, where you
had to estimate a class, is that you can't combine any effects. I
mean, what if there is the effect of having this particular microphone and an effect of
having this particular type of background noise you want to deal with? You would need
to train a separate class for this channel plus this background noise. Here,
with the channel compensation, you can hopefully have one direction accounting for
the channel differences and another direction accounting for the noise difference,
and things like that. We don't really understand what the directions mean, right.
>>: So you find a different kind of task on this, you know, [inaudible] sort of if
they have more diverse in terms of [inaudible].
>> Lukas Burget: You would need more factors. I guess it's more given by
the amount of data that you have for adapting. So you probably can't --
again, I will show you some example of that, but most likely you can't
increase it to infinity, because the estimates will just get less robust. But you
can impose a prior on that and get it regularized, like getting some point
MAP estimates with priors and things like that.
So now we will move towards the Joint Factor Analysis. And we will add the
thing that Jeff was just missing in our model.
So instead of just modeling the channel differences with a subspace
and adapting the speaker model by moving it in the subspace U, we will also
create the speaker model by moving the UBM mean in the directions of large
speaker variability. So I mean, like on this graph, if you have the directions with
speaker variability and channel variability, why wouldn't you use the information about
what the subspace with the directions of large speaker variability is?
And so this model -- it's still not Joint Factor Analysis, but it's close to that -- would
have two subspaces: V for large speaker variability, U for large
channel variability. And now the information about the speaker in each enrollment
utterance would be compressed into the low-dimensional vector Y, which would be
roughly 300 numbers. I mean, this is what we saw to give the best
performance.
So the adaptation now is really similar to what people in speech recognition
would call Eigenvoice adaptation. We also call the directions of this
subspace Eigenvoices. But the problem in speaker ID was that when people
tried it before, it didn't work. And it only works if we also model the
channel subspace. Once we started modeling the channel variability subspace it
started to be effective. Without that it fails to work.
I mean, this is also the system that was kind of inspiration for Dan's subspace
GMMs that he used for large [inaudible] speech recognition.
So how do we use it now for verification? It's, I think, kind of obvious. This is the
whole equation that summarizes the whole model, where we have the speaker
and channel subspaces. Now, for verification, on the enrollment
utterance we estimate both the speaker and channel factors using the enrollment
data, and for the test we estimate just the channel factors, to adapt the model to the
channel of the test utterance. Again, we get the adapted model of the speaker.
We adapt the UBM the same way, compare the likelihoods, and that is the
verification.
Now, these parameters of the model can actually be estimated
using an EM algorithm, where iteratively we estimate the parameters that are
shared by all the speakers and the parameters that are speaker specific or
channel specific; we alternate between them on training data, where we
further constrain the Y to be the same for all the recordings of the same speaker,
while X can vary from one recording to another. So we would have one Y
for each speaker, one X vector for each recording.
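A minimal sketch of the model just described, with one Y per speaker and one X per recording; the dimensions follow the numbers mentioned in the talk, everything else is illustrative:

```python
# m_ubm: (C*D,) UBM mean supervector; V: (C*D, 300) speaker subspace;
# U: (C*D, 100) channel subspace; y: one vector per speaker; x: one per recording.
import numpy as np

def synthesize_supervector(m_ubm, V, y, U, x):
    return m_ubm + V @ y + U @ x

# Enrollment: estimate y (and a nuisance x) from the enrollment utterance and
# keep y as the speaker representation. Test: estimate a new x for the test
# utterance, rebuild the adapted speaker model and adapted UBM with it, and
# compare their likelihoods on the test frames, as on the previous slides.
```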
And now, finally, to end up with the whole Joint Factor Analysis. The Joint Factor
Analysis is a fully probabilistic model. So far we have limited ourselves to constraining
the speaker model to live in some subspaces, the subspaces of large speaker
variability and channel variability. But we didn't care about the variability itself in
these subspaces. So what we can do further is to model the variability in the
subspaces: how much speaker variability is there in this particular direction?
So if we assume that Y and X are standard normal
distributed random variables, then writing this equation gives us a Gaussian
distribution on the mean, right? So this would be the UBM mean, these would be
standard normal Gaussian distributed random
variables, and these would be the subspaces. If we just look at the distribution
of this term, that would be Gaussian distributed. It would be distributed in a subspace;
it wouldn't cover the whole space of the parameters.
But then we get basically a vector which is Gaussian distributed in the
parameter space and which tells us what the variability in different
directions is. Specifically, the M would be the mean of this distribution, and
V V-transposed and U U-transposed would be the covariance
matrices that correspond to the amounts of variability in the directions of the
subspaces V and U.
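Written out with the symbols from the slides, the claim in this paragraph is roughly the following (a sketch, using the same V, U notation):

```latex
M = m + Vy + Ux, \qquad y \sim \mathcal{N}(0, I), \; x \sim \mathcal{N}(0, I)
\quad\Longrightarrow\quad
M \sim \mathcal{N}\!\left(m,\; VV^{\top} + UU^{\top}\right).
```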
And now we use this as a prior to estimate the parameters. So now we have
the distribution of the speaker models, a prior distribution of the
speaker models. And given some test data or verification -- the enrollment data --
we can actually derive the posterior distribution of the model
parameters. Right? So this is what that would be good for. I will come
to that again on the next slide.
And this still wouldn't be the full factor analysis model. Those of you who know
what factor analysis is -- it's a technique for modeling variability -- will see that there is
still something missing here. We have just some subspaces. To end up with the
full factor analysis model, we would add one more term, which would be this
D matrix, and z would again be a standard normal
distributed random variable, and D would be a diagonal matrix. In this case it
would actually be a huge matrix whose dimensionality is the same as the number
of parameters in our model, typically 200,000, but it would be just a diagonal
matrix. And it would model the residual variability, the remaining variability in
the model parameter space.
>>: So you [inaudible] group all these [inaudible] variability with [inaudible].
>> Lukas Burget: Well, it would probably also be in the U term. This is kind of --
I mean, the subspaces describe where most of the variability is, but of
course there would be some remaining variability in the model space
which is described by these diagonal terms.
>>: [inaudible] you can force that U to deal with particular [inaudible] admission
you actually took the amount data with the metadata that's specified [inaudible]
files those [inaudible].
>> Lukas Burget: Yes. I mean, you again train these
subspaces. So you train the V, U, and D; you train them on the training data by
saying these are recordings of the same speaker, these are
recordings of another speaker. So this way you train the variability. Normally
the D would account for the residual speaker variability. There could
actually be one more term accounting for the residual channel variability.
But, I mean, the basic idea of factor analysis is: given some two-dimensional
Gaussian distribution which has some mean, to model the full covariance matrix
you find the direction with most of the variability and you model the distribution
along this direction, and then you model the residual variability, which would be the
variability in the other directions. And together, if you sum these two terms, you get
the overall covariance matrix -- which wouldn't make much sense in a two-dimensional
space, but it saves a lot of parameters in large-dimensional spaces.
But anyway, we don't need these speaker models to have the freedom to live in
the full space. We can really live just with these two subspaces, and this additional
term is not even helping the performance. So this is just to explain why it's called
factor analysis, because this model is actually the factor analysis model that
statisticians would know as factor analysis. And we use this factor analysis to model
the distribution of speaker models in the model parameter space.
So this is what was introduced by Patrick Kenny, the model that he actually
introduced sometime, I think, in 2003, but it was only in 2008 that we
implemented it in our system, where we showed that it actually performs better
than what we had done before with the --
>>: [inaudible] that group as I understand --
>> Lukas Burget: Well --
>>: [inaudible].
[brief talking over].
>> Lukas Burget: Patrick would have had it implemented, I think, in 2005, but his system
didn't perform so well that people would recognize the importance of this technique.
But, yeah. I mean, we closely collaborate with him. In the last NIST evaluations we
actually had him on our team. We teamed up with [inaudible].
>>: [inaudible] estimation were there a lot of free parameters --
>> Lukas Burget: Yes. Right. So, I mean, this is of course the most
expensive part, but still things can be done pretty quickly based on some other
assumptions that I will talk about.
>>: It sounds like in 2005, 2008, the same basic technique existed, it just
[inaudible].
>> Lukas Burget: Right. Right.
>>: [inaudible].
>> Lukas Burget: I mean, right. Even the Eigenchannel adaptation -- I mean, this
wasn't something that we developed. We used it in 2006, but
somebody had used it already in 2004. We just had more time to --
>>: [inaudible].
>> Lukas Burget: Well, we just had the system better tuned, using better
features. Many people say that our advantage is mainly
in having lots of machines to run the experiments on, and I mean, for the 2008
evals we would just switch on all the computer labs in the university and run things
on a few hundred -- like seven hundred cores. And Patrick Kenny would
run it on his machine that sits on his table, right. [laughter]. I mean, so then you
have a better chance to tune things up.
But, on the other hand, then you can really show that the technology works, that --
>>: So how many parameters can be tuned for this [inaudible].
>> Lukas Burget: The number of parameters that you have to estimate is given by the
dimensionality, so the dimensionality of this matrix would be 300 times about
200,000, and this one would be 100 times 200,000, something like
that. Then you probably want to train a few tens of systems, or maybe a few hundreds
of systems, to figure out which features work the best and all the other
parameters and things like that, right?
So, as I just said on the previous slide, we have introduced the probabilistic
framework. Using this equation we can now model the prior
distribution of the speaker models, of these mean supervectors which are
the speaker models.
Now, what can this distribution be good for? The first thing is that, given some
enrollment data and test data, we can obtain point MAP estimates of the speaker
factors and the channel factors: since we have a prior distribution on those and we
get some training data, we can calculate the posterior distribution of those and we
can go for the point MAP estimates, the most likely estimates, which are kind of
just regularized versions of the estimates. And it seems to help.
But more importantly, it actually allows us to use the Bayesian approach to
speaker verification; it allows us to use Bayesian model comparison to do the
speaker verification, to get the score for the speaker trial.
What is that about? Given the two recordings, O1 and O2, we want to compare
the likelihoods for two hypotheses. The first hypothesis is: do these two recordings
come from the same speaker? The other hypothesis is: do they come from
different speakers? So we actually want to build two models, one that takes
these two recordings and calculates the likelihood for the one hypothesis, and the
other model that calculates the likelihood for the other hypothesis, and we compare
these. This would be, like, theoretically -- the Bayesian
would say that this is the proper way of evaluating the model.
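In equation form, the comparison just described is roughly the following ratio (a sketch; O1 and O2 are the two recordings and y the speaker factors with a standard normal prior):

```latex
\mathrm{score}
= \log\frac{p(O_1, O_2 \mid \text{same speaker})}{p(O_1, O_2 \mid \text{different speakers})}
= \log\frac{\int p(O_1 \mid y)\, p(O_2 \mid y)\, p(y)\, dy}
           {\int p(O_1 \mid y)\, p(y)\, dy \;\int p(O_2 \mid y)\, p(y)\, dy}.
```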
>>: So the reason why these prior estimates for the speaker and channel
[inaudible] is that you build some kind of Gaussian distribution, or --
>> Lukas Burget: No, so these parameters -- well, the final distribution is actually
going to be the distribution of the speaker models. And this distribution is
specified by these U and V matrices, and those you train using maximum
likelihood on training data.
>>: I was referring to the next slide, where you talk about regularizing with the priors on Y
and X.
>> Lukas Burget: Yes.
>>: The previous one didn't have it; it was kind of maximum likelihood --
>> Lukas Burget: No, so these things are estimated using maximum likelihood --
now, in JFA, these things are estimated using maximum likelihood. But these
guys have Gaussian priors.
>>: That's what I was asking.
>> Lukas Burget: Yes. So we still estimate these
parameters using maximum likelihood, and we hope that we have enough data
to estimate those subspaces -- which is probably not true, and we will see that with
the later technique. We could of course impose some
priors on those and regularize them; we never actually tried that. But these
have priors, and we can get posterior estimates of these.
>>: If you just renormalize so that they stay Gaussian?
>> Lukas Burget: They stay Gaussian by definition. I mean, this is -- what you
get here is --
>>: I mean [inaudible].
>> Lukas Burget: This is a standard normal distributed random variable, right? If
you transform it with some matrix, just a linear transformation, you get a Gaussian
distribution with zero mean. And then if you add some mean, you get a Gaussian
distribution.
>>: What's the maximum likelihood being used? Is that maximum likelihood
[inaudible] in some kind of Bayesian sense where you're integrating over
something?
>> Lukas Burget: Well, when you estimate the U -- I didn't want to get into such
details, but when you estimate the U and V, in every iteration you
obtain posterior distributions of Y and X, and then you integrate over those to
calculate the likelihood of the normal --
>>: [inaudible] Y and X [inaudible].
>> Lukas Burget: Yes. Well, they --
>>: Well [inaudible] I mean if you --
>> Lukas Burget: Well, now, I mean, this is just what comes out of it: the Gaussian
distribution is the conjugate prior for the mean.
So just by definition, when you calculate the posterior distributions of the Ys, you
automatically get a Gaussian distribution as the posterior distribution with this model.
Okay. So we have this new scheme, which I am also showing on the next slide.
So now, how can this model be constructed from the Joint Factor Analysis?
We want to calculate the verification score as the log likelihood ratio coming from
these two models. And you can actually see -- so I mean, here, again, I'm just
saying we want to calculate the log-likelihood ratio between these two
hypotheses, so the log likelihood of these two utterances given that they are from
the same speaker or from two different speakers. So we have the corresponding
models in the numerator and denominator when calculating this likelihood ratio.
And you can see here that once we have the Joint Factor
Analysis, then given some speaker factors, the speaker factors would just
define a Gaussian mixture model, which we would say is the speaker
model. So if we have a particular point estimate of the Ys, we have the speaker
model.
Now we say we don't have this particular estimate of the Ys, but we have the prior
distribution over the Ys, which is normal. So again, if we had the particular estimate of Y,
then the likelihood of one utterance generated by the speaker
described by Y would be just this term, right: the likelihood of this data given this
speaker, and the likelihood of this data given the same speaker.
Now, we don't know the speaker. We have just the distribution of those. So we
consider each possible speaker given by the prior -- we integrate over the prior
distribution of the speakers -- and for each possible speaker we calculate the
likelihood that both utterances are generated by the same speaker. This gives
us the likelihood of the hypothesis that both utterances -- yes?
>>: [inaudible] or just maximize it?
>> Lukas Burget: If the posterior distribution of the speaker factors is very sharp -- saying there is really just one possibility of what the speaker factors would look like -- then maximizing does the same thing. And I can tell you that it doesn't matter much in the case of Joint Factor Analysis, and I will be showing that, but it is going to make a lot of difference with the most recent models that we are using at the moment.
So I'm just showing what would be the proper way of getting the log-likelihood ratio. Yes?
>>: So the Ys are speaker-specific priors? The speaker factors -- the priors are trained to be speaker specific?
>> Lukas Burget: The prior is just the standard normal distribution. Right? This is just the standard normal distribution. The variability in the subspace, in the large speaker variability subspace, is hidden in these matrices. Right? So once we start trying all these Ys, which are standard normal distributed, we are actually trying all the possible speaker models ranging over the speaker variability subspace -- trying all the possible models that can live in the speaker variability subspace. But in this integral the Y is just a standard normal distributed variable, nothing more. And in the denominator, you can see that each of these terms is basically -- I mean, if you compute this integral, you get just the marginal distribution of the data.
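[Illustration: a minimal Monte Carlo sketch of the verification score just described -- the log-likelihood ratio between the same-speaker and different-speaker hypotheses, integrating the speaker factors over their standard normal prior. The function loglik_given_y is a hypothetical placeholder for evaluating an utterance against the GMM defined by a particular Y; as discussed below, this integral is not tractable in practice for JFA.]

```python
import numpy as np
from scipy.special import logsumexp

def llr_same_vs_different(utt1, utt2, loglik_given_y, dim_y, n_samples=1000, seed=0):
    """Monte Carlo approximation of the Bayesian verification score.

    loglik_given_y(utt, y) is a hypothetical callable returning
    log p(utt | speaker factors y); y has a standard normal prior.
    """
    rng = np.random.default_rng(seed)
    ys = rng.standard_normal((n_samples, dim_y))   # samples from the prior p(y) = N(0, I)

    ll1 = np.array([loglik_given_y(utt1, y) for y in ys])
    ll2 = np.array([loglik_given_y(utt2, y) for y in ys])

    # Numerator: both utterances share the same (unknown) speaker factors.
    log_num = logsumexp(ll1 + ll2) - np.log(n_samples)
    # Denominator: each utterance has its own speaker, so the marginals factorize.
    log_den = (logsumexp(ll1) - np.log(n_samples)) + (logsumexp(ll2) - np.log(n_samples))
    return log_num - log_den
```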
>>: [inaudible] the Y is estimated from JFA, and it just gives you one factor. So you have to impose, you know, some hyperparameters [inaudible] of one.
>> Lukas Burget: You need to estimate the hyperparameters on the training data. Once you have the hyperparameters, you have the JFA model, which already describes the variability -- but in the speaker model space. And then to calculate the proper score, you need to integrate over the --
>>: You haven't talked about how to estimate the prior parameters here.
>> Lukas Burget: Oh, for the U and Y? For the U and V?
>>: Distribution of Y.
>> Lukas Burget: The distribution of Y is standard normal by definition. You just say that the Y and X have standard normal distributions, right?
>>: So you go through all the training data for each training -- for each --
>> Lukas Burget: No, no.
>>: You have [inaudible] you take the sample, sample distribution of the Y.
>> Lukas Burget: No, so -- the distribution of the speaker model is defined by this equation. You have standard normal priors imposed on the Y and X, and then what the distribution of the speaker model would be depends just on the U and V.
>>: [inaudible] factor analysis gives you UV and also Y and X [inaudible] iteration
at the end you get X and Y.
>> Lukas Burget: In every iteration it gives you -- in one iteration you estimate U and V given the posterior distributions of Y and X. You calculate these posterior distributions based on the standard normal priors. So you say X and Y have standard normal priors; in each iteration you get the posterior distributions of Y and X, and based on the posterior distributions you integrate over all the possible models to get maximum likelihood estimates of U and V.
>>: I think, as always, it's overparameterized. You could have an infinite number of solutions by putting a constant in V and the inverse of that constant in Y, right?
>> Lukas Burget: You can actually -- you can decide on any prior -- any
Gaussian prior on this thing. It's overparameterized [inaudible].
>>: So you have training procedure to get that prior.
>> Lukas Burget: Yes.
>>: [inaudible].
>>: Is there any kind of scale involved here, like scaling [inaudible] on the priors?
>> Lukas Burget: I mean, not really. I tried to play with things like the acoustic scaling in speech recognition. It never really helped, but I think there is a different problem. I mean, basically when we estimate the posterior distributions of speaker factors and channel factors, they suffer from this model not really being the best possible -- the correct model. So there would probably be other problems than just the scaling. Hopefully you could do something with the scaling, but anyway, I'm just introducing the Bayesian scoring where we integrate over things. We actually don't do that in the real models. We just get the point estimates, and especially for the speaker factors the point estimate would probably be just fine.
Anyway, so this is the way Bayesians would like to score things. We have just the speaker factors here. You may wonder where the channel factors are. I just assume that they already got integrated out of each of these terms. I'm showing the complete equation on the next slide, but it just gets more complicated.
And one thing that you should note on this slide is the symmetrical role of the scoring. I mean, before, we trained the speaker model on one utterance and tested the speaker model on the other utterance. Why don't we take the second utterance, train the model on that, and test it on the other one? We would get a different score. Which one is actually better? Normally it's better to take the longer utterance for training and the shorter for testing, because we would probably get better point estimates of the speaker model on the longer utterance. But this approach doesn't suffer from that. It's completely symmetrical. No matter which segment comes as the first one and which comes as the second one, you just get the proper score. So at least theoretically this scoring is the better way of getting the log likelihood score.
So this is actually what we would get if we also include the channel factors: each of the terms now has to be integrated also over the channel factors. And this just becomes too complicated. So the problem with JFA and with this scoring is that these integrals are intractable. Even though we integrate over this Gaussian prior, each of these terms -- each of the terms of O, depending on Y and X -- is a Gaussian mixture model, and that just makes it impossible to carry out the whole integration.
One can resort to approximations like variational Bayes, and they seem to provide some improvement, but not much. It usually helps you somewhat if you deal with very short utterances, like two seconds of speech; then this integration provides some improvement. For longer utterances it doesn't help. It doesn't help at all.
So, in fact, what I'm trying to say here is that the old-fashioned scoring -- where we just obtain the point MAP estimates for the factors Y and X, and then do the standard scoring, just evaluating the likelihood with the speaker model and the likelihood with the UBM and comparing the likelihoods -- works just fine, and you don't need to run this integration.
And there is another problem: to get good performance with Joint Factor Analysis, and even with the Eigenchannel adaptation before, we need to apply some normalization techniques on the final score. So we get the score from the system, from comparing the likelihoods of the model and the UBM, but then we need to do something with the score. Normally we would apply something that's called ZT-norm, where you also obtain a set of other scores for some cohort of other models and a cohort of other utterances, and normalize by the mean and variance of scores from this cohort to get the final score, and then it starts working. Which is just a big pain, because to perform a verification trial you don't score just a single model, you need to score maybe 200 models. And that of course makes the whole procedure much slower and so on.
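[Illustration: a hedged sketch of the ZT-norm idea just mentioned -- Z-normalize the raw score with cohort impostor-utterance scores for the model, then T-normalize with cohort impostor-model scores for the test utterance. The cohort arrays and the exact order of operations are assumptions; implementations differ in detail.]

```python
import numpy as np

def zt_norm(raw_score, model_vs_impostor_utts, impostor_models_vs_test):
    """Very rough ZT-norm sketch.

    model_vs_impostor_utts: scores of this speaker model against a cohort of
        impostor utterances (Z-norm statistics).
    impostor_models_vs_test: scores of a cohort of impostor models against this
        test utterance (in a full implementation these would themselves be
        Z-normalized before being used for the T-norm step).
    """
    z = (raw_score - np.mean(model_vs_impostor_utts)) / np.std(model_vs_impostor_utts)
    t = (z - np.mean(impostor_models_vs_test)) / np.std(impostor_models_vs_test)
    return t
```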
So anyway, this slide shows the performance of Joint Factor Analysis versus the Eigenchannel adaptation. You can see that for the 2006 data we got a quite nice gain from Joint Factor Analysis, which is the black line, compared to the Eigenchannel adaptation. And this slide also shows what happens if you go from 50 Eigenchannels -- a 50-dimensional subspace for the channel variability -- to a hundred-dimensional subspace for the channel variability. So this is the question you were also asking about.
And for the Eigenchannel adaptation we actually got degradation. So it looks like with too large a subspace we can't estimate these parameters reliably. But for the Joint Factor Analysis we actually improved. So after explaining the speaker variability with the Eigenvoices, we can get even more benefit from adding more Eigenchannels -- modeling the channel variability more precisely with a larger channel variability subspace.
And this is just another graph showing the performance of Joint Factor Analysis on a different database, and it compares it with the baseline. The magenta line is the Eigenchannel adaptation, the red line is Joint Factor Analysis. We didn't get all that much improvement from Joint Factor Analysis here, but still the improvements are significant.
And these are actually the most recent data, from the 2010 evaluation, and I will be showing the other results on the same graph. So you can compare all the techniques --
>>: [inaudible].
>> Lukas Burget: Well, the -- you can -- well, probably T zero is there in infinity, and probably T hundred is there in infinity. But the --
>>: [inaudible].
>> Lukas Burget: But, yeah, I mean, the reason why I'm showing this thing is, one, that it gets more noisy here and over there, and the other thing is this is actually what NIST is becoming interested in, and our sponsors like the US government and these applications --
>>: [inaudible].
>> Lukas Burget: Exactly. I mean, this false alarm is what's killing them. They
can't apply this technology if they have too many false alarms.
>>: [inaudible]. [laughter].
>>: So [inaudible] performance now is about five times better -- an error rate five times lower -- than five, six years ago. And JFA is about the technology of five years ago, right?
>> Lukas Burget: Well, JFA is the technology that showed its benefits in 2008. So like three years ago. So, yes, it was proposed something like five years ago [inaudible], but it was just 2008 when we actually used it for the first time in NIST evaluations, and then everybody else was using it during the last evaluations.
Anyway, so let me speed up a little bit, because I see that I'm already running out of time, and I would like to get to the final part about the discriminative training.
Anyway, so there were lots of simplifications that you could do with Joint Factor Analysis: they speeded the whole thing up and they also actually provided improvements in performance. I just mentioned the first thing, how the parameters should be trained -- you can't evaluate the integrals analytically and you need to resort to some variational Bayes techniques. But then for the verification itself:
The UBM is used to align the frames to the Gaussians. So you do that once for each test utterance using the UBM, and you can reuse it for all the speaker models. It doesn't hurt at all. It allows you to collect statistics once and reuse these statistics for all the other estimation, which speeds things up a lot.
We use the point MAP estimates of Y -- this is what I said -- to represent the speaker model. The channel factors can be estimated using this universal background model for the test utterance. So for each verification trial you don't need to train a speaker model and then estimate the channel factors for that speaker model. You can pre-estimate the channel factors for the test utterances and then reuse them for all the speaker models.
And finally there is something that is called linear scoring -- you can find more details in a paper that we wrote with Ondrej Glembek, my colleague. The thing is, from the approximations above and from another approximation that follows from actually using a linear approximation to the log likelihood score, instead of the quadratic score that you would otherwise get, you can get an evaluation of the score which can be extremely fast. You can precompute these speaker factors, which are like a 300-dimensional vector for each utterance, which would represent the model of the speaker. And you can precompute another low-dimensional vector of the same dimensionality which would represent kind of the statistics of the utterance for testing. And then the log likelihood ratio can be obtained just as the dot product between these two vectors. It's still asymmetric: you still have to choose which utterance is going to be the training and which is the test, select the model and the statistics based on that, and then compute the dot product to get the final score. And this asymmetry is what we are going to get rid of right on the next slide.
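[Illustration: a minimal sketch of the linear scoring shape -- one precomputed low-dimensional vector per enrollment speaker, one per test utterance, and the score as their dot product. The particular way the test vector is formed here (centered first-order statistics projected through the UBM precisions into the speaker subspace) is an assumption consistent with the description; the exact expressions in the paper with Glembek differ in details.]

```python
import numpy as np

def enrollment_vector(y):
    """Low-dimensional representation of the enrolled speaker:
    here simply the point-estimated speaker factors (e.g. ~300-dim)."""
    return y

def test_vector(V, Sigma_inv, f_centered):
    """Low-dimensional representation of the test utterance: centered
    first-order statistics projected into the speaker subspace."""
    return V.T @ (Sigma_inv @ f_centered)

def linear_score(enroll_vec, test_vec):
    # Log-likelihood-ratio approximation as a simple dot product.
    return float(enroll_vec @ test_vec)
```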
So anyway, there were lots of simplifications that allowed us to run the verification really fast -- the linear scoring, which is a really crude approximation to how the score is computed, even improved the performance a little bit, which suggests that the model is actually not correct. I mean, if you can come up with so many approximations and it still performs well, then probably the model is not really correct.
On the other hand, the point estimates of Y, the speaker factors that we are getting, allow you to get pretty good verification performance, which means they have to contain the information about the speaker, right? Each of them has to contain the information about the speaker from the utterance. Can we actually take the Y vectors and use them as features for another classifier? We were just wondering -- it must be enough to compare the Y vectors; they contain the information about the speaker. But what is the right way of comparing the Y vectors to get the verification score?
>>: Y shouldn't be factor for the speaker [inaudible] the speaker.
>> Lukas Burget: No, I mean --
>>: [inaudible].
>> Lukas Burget: And if I'm saying that I can extract this Y vector from the enrollment utterance and then use it as the speaker representation to build the speaker model and test the test utterance, the Y has to contain the relevant information about the speaker from the utterance, right? So then possibly it should be enough just to take one set of speaker factors from one utterance and the other one from the other utterance, and compute the verification score just based on a comparison of those.
>>: The condition Y requires that you put all the speaker data together. So each
Y actually sort of gives you the space --
>> Lukas Burget: No. You -- each Y is going to be estimated just on the
utterance itself.
>>: So for each utterance --
>> Lukas Burget: So you have all the utterances, no matter whether they are enrollment or test utterances, and you would just estimate the Y for each utterance individually. So I mean we have to distinguish two parts. One is to train the Joint Factor Analysis model, the U and V subspaces, the speaker and channel subspaces. For that we use some set of data.
>>: Whole set of --
>> Lukas Burget: I mean, the whole set of data, but different from those that we would test it on. Right? And now we have another bunch of test data where we have the enrollment utterances and test utterances. And now for each enrollment utterance and each test utterance we would just estimate its Y. And we can compare just these Ys and say whether they come from the same speaker or different speakers.
>>: It's a little bit like using [inaudible] matrices as a feature --
>> Lukas Burget: Right. Right. In fact, yes, it's quite similar. Yeah. But the question is what would be the right classifier? You can feed it into support vector machines, which would be the standard technique that people use, and I'm just saying this becomes obsolete and you just don't do that. It's not really the right classifier.
This is actually what I thought I'm going to say right now.
So anyway, we could use Y as a fixed-length, low-dimensional representation of the speaker for the utterance. So instead of having all the frames, the whole sequence of features, now we would have just the Y as a low-dimensional representation of the speaker for the utterance, and we can perform verification using that.
I was actually the leader of the group during the Johns Hopkins University summer workshop in 2008, and one of the things that we played with was to see whether these Ys can be used as features. These are experiments by Najim Dehak, who is now with [inaudible], and he tried to apply support vector machine classifiers -- which is basically a technique where you train a different support vector machine classifier for each speaker as the speaker model and then you test it against the test utterance. And you train it using a single utterance, a single example of the speaker, and a bunch of other background examples as the imposter examples. So you take these examples, train the support vector machine, and then test it against the test utterance.
So -- and at least to me it seems a little silly to try to classify discriminatively using a single positive example, right? There is something wrong about it. But, anyway, this is something that people have done for a long time.
And so what he got was the following. This was the performance of Joint Factor Analysis -- now I'm showing the numbers as equal error rate. If you use the speaker factors as the features for support vector machines, he got much worse performance. But the interesting thing here was that he also tried to use the channel factors as the features for recognizing the speaker, and the channel factors are not supposed to contain any information about the speaker; they are supposed to contain information about the channel. And he still got 20 percent equal error rate. Well, it's quite poor -- I wouldn't sell such a system for speaker recognition -- but still it's much better than chance, so it means that there is speaker information in the channel factors.
Then why should we even bother with training the speaker subspace and the channel subspace if Joint Factor Analysis does such a poor job in separating the speaker and channel information and extracting all the speaker information into the Y vector? So instead we are going to train a much simpler system where, instead of the Eigenvoices and Eigenchannels, we would have just a single subspace which accounts for all the variability, and we would extract something that we call the iVector, which is one vector instead of speaker and channel factors, and we are going to use this as the input to another classifier.
And when he did that, he actually got a better result than with the speaker factors themselves. So it looks like this is the right way to go. Now we need to find a better classifier than support vector machines, which I don't like anyway -- I mean, for this task.
>>: [inaudible].
>> Lukas Burget: Say again.
>>: [inaudible].
>> Lukas Burget: Yes. Yes. So again, all of them -- Y and X and even this I -- are going to be point MAP estimates of the channel and speaker factors, and here of the general factors.
>>: [inaudible] since when the single feature [inaudible] in training [inaudible].
>> Lukas Burget: So in this case, yes. So in this case, you would get the -- no, I mean the posterior estimates are based on the sentence, right? So you have --
>>: Right. [inaudible].
>> Lukas Burget: I see. No, so I mean there would be just a point MAP estimate. So there would be the most likely values of Y and X, the means of the posterior --
>>: [inaudible].
>> Lukas Burget: Single example, yes. Single example. You don't use information about the variability in --
>>: [inaudible].
>> Lukas Burget: No, in this case it was really -- I mean, always binary classification, but in this case you train a support vector machine for each speaker, using the single positive example of the speaker and a bunch of negative examples. And then for each test utterance you test it for the task: is it the speaker or is it somebody else, right?
>>: [inaudible] iVector?
>> Lukas Burget: Sorry?
>>: Was the machine [inaudible] iVector.
>> Lukas Burget: It would be -- yeah, I mean we had typically 300 here and 100 here, so we put 400 here. I mean, right now we see that larger dimensionality would even help, but then it's slower to estimate those, and 400 seems to be [inaudible].
>>: [inaudible] recognize the channel using the channel vectors.
>> Lukas Burget: You could use them to recognize the channel. But the thing is, I mean, would you want to recognize what the channel is, and you don't really have -- I mean, of course --
>>: [inaudible] specific speaker models.
>> Lukas Burget: Sure. I mean, one can do that. But the thing is whether that would be really helpful -- I mean, you mean a channel-specific speaker model for speaker recognition? One could certainly train the speaker model using a large amount of data, and eventually, by knowing that this is still the same channel, get a more robust estimate of the speaker by knowing what the channel is and what the channel factor is.
>>: So I'd like to challenge your assumption that the factors contain the same information as you are using in JFA. It seems to me like your factors, especially your iVector, would include information about the channel --
>> Lukas Burget: Sure.
>>: And speaker like you said.
>> Lukas Burget: Sure.
>>: But also the text.
>> Lukas Burget: Sure. Just like before, just like the combination of these two vectors before. Sure. I mean, they now contain information about other things. I'm not saying that --
>>: No, but --
>> Lukas Burget: Yeah?
>>: When you adapt your UBM and get your JFA and then evaluate it on a segment of speech, I can see that being good for text-independent speaker verification. But when you extract out that adaptation parameter, your factors, that should be more dependent on the actual text that was in your query than adapting the model and evaluating on the query, because then you're matched.
>> Lukas Burget: And the thing is, in both cases we have to deal with this variability somehow. So, sure, in Y there would be information about the text, there would be information about the channel. Hopefully there would also be enough information about the speaker. And we have to deal with it; we have to build a model like the Joint Factor Analysis that we built before. We have to somehow deal with this variability, but we are going to do that on the level of these vectors. And we hope that these vectors now contain enough information about the speaker, and we are going to separate again this information about channel and speaker by now dealing with just a low-dimensional fixed-length vector, which will make the task much simpler. I don't know if that answered your question, but --
>>: We'll talk about it --
>> Lukas Burget: Okay.
>>: But still JFA is doing better than everything else, right?
>> Lukas Burget: No. No.
>>: [inaudible].
>> Lukas Burget: So far, yes. So far, yes. But we will see that we can actually
-- we can deal with it some more.
So to summarize: now we have the iVector extractor, which is a system that uses just a single subspace for both channel and speaker variability, and the vector I as the representation of the recording in general. What that means is the system becomes simpler. We have just a single subspace to train, and it's actually simple to train; we don't have to train the U and V subspaces independently. We don't need any speaker labels for the training now. I mean, we need utterances of many speakers to train this subspace, and we need utterances where each utterance contains a single speaker, so each utterance would be a representation of a particular speaker in a particular session. But we don't need any labels for the data. So we can actually train it in an unsupervised way on a large amount of recordings, which hopefully will make the whole model more robust.
Again, we assume standard normal priors on I, so we can obtain the iVectors as point MAP estimates of these vectors. We would now extract this iVector for every utterance, and the iVector for every utterance would be the low-dimensional representation of that utterance. Now we have no more sequences for utterances, we have just these low-dimensional vectors, and we are going to perform speaker identification from that.
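[Illustration: a hedged numpy sketch of extracting the iVector as the point MAP (posterior-mean) estimate from zeroth- and first-order Baum-Welch statistics collected with the UBM, under the standard normal prior. The bookkeeping (stacking and centering of the statistics) is an assumption, not taken from the slides.]

```python
import numpy as np

def extract_ivector(T, Sigma_inv_diag, N_c, F_centered):
    """Posterior-mean (point MAP) iVector for one utterance.

    T              : (C*D, R) total-variability matrix (R ~ 400).
    Sigma_inv_diag : (C*D,) stacked inverse diagonal UBM covariances.
    N_c            : (C,) zeroth-order occupation counts per UBM component.
    F_centered     : (C*D,) first-order statistics centered by the UBM means.
    """
    C = N_c.shape[0]
    D = F_centered.shape[0] // C
    R = T.shape[1]
    # Expand the per-component counts to the supervector dimension.
    N_sv = np.repeat(N_c, D)                                    # (C*D,)
    # Posterior precision: I + T' diag(N) Sigma^-1 T
    L = np.eye(R) + T.T @ ((N_sv * Sigma_inv_diag)[:, None] * T)
    # Posterior mean: L^-1 T' Sigma^-1 F_centered -- this is the iVector.
    b = T.T @ (Sigma_inv_diag * F_centered)
    return np.linalg.solve(L, b)
```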
So now what would be the right classifier for it? We still have the problem that the I still contains all the information about both speaker and channel. But now we know that we have just a single low-dimensional vector, and we actually assume the standard normal distribution as the prior for I, so it would be reasonable to assume that the distribution of the I is probably Gaussian.
So we can come up with a model which is just like Joint Factor Analysis but with just a single Gaussian -- a single Gaussian model, having just a single vector per utterance, not a sequence -- and apply such a model.
This is something that is known as Probabilistic Linear Discriminant Analysis, and we end up with the same equation that we just had before. So it looks just like the Joint Factor Analysis, but now you have to realize that this is not the supervector of speaker model parameters -- now this is the observed vector, right? So now we are really modeling the distribution of the observed vector, where again the U and V would be some subspaces with large channel and speaker variability in the space of these vectors, the Is.
>>: I is -- Eigen is not same as iVector?
>> Lukas Burget: The I is the iVector.
>>: [inaudible].
>> Lukas Burget: Yes. The I is the iVector. Yes.
>>: IVector is the one that lumps all these together like the one you showed
earlier, how could you --
>> Lukas Burget: The iVector, the I, is this low-dimensional vector such that if you take the vector and multiply it by the basis T, you get the Gaussian mixture model for that utterance. But the I is now just a 400-dimensional vector --
>>: [inaudible] different and --
>> Lukas Burget: I'm sorry. Yeah. No, the M is different. I'm sorry. So because
this is the supervector, yeah, that's my mistake.
>>: [inaudible].
>> Lukas Burget: So this is just the mean in there, so I should have used a different symbol here. Yup. So now for the V and U I'm still using V and U, but these are not the parameters in the supervector space; these are all in the low-dimensional space now. So maybe I should have used different letters for all of these.
Again, we can add one more term which accounts for the residual noise in the data, just like in the Joint Factor Analysis. But, anyway.
>>: What are the dimensions of Y and X here?
>> Lukas Burget: 400. Oh, well, that depends. So I mean you can decide. This has a dimensionality of 400. You can decide on lower dimensionalities for these. But I will show on the next slide that normally what we would do is use the full dimensionality for these guys. So then you can actually forget about the epsilon, because these subspaces cover the whole variability in the space. And, in fact, we use even less: we would typically reduce the dimensionality of the iVectors to just about 200, just by LDA, and it still works about the same.
>>: And then Y and X are 200 too? Are they still --
>> Lukas Burget: That would be clear from the next slide. Okay? So again, all the parameters can be estimated using the EM algorithm. It actually gets much simpler in this case.
This slide just explains what the PLDA model is about, and it's nicely seen on the examples from face recognition, where it was actually introduced by Simon Prince. What you can do is treat each face just as an example: you can vectorize each example, and this would be the vector representing the face. And then you can train the Probabilistic Linear Discriminant Analysis for that, where each I in our case would represent the face, the vectorized version of the face. The M would be the mean face -- so this is the example of the mean face. And then there are the linear combinations of the directions with large -- in this case it would be called not speaker variability but between-individual variability. So if we move from the mean in the directions with large between-individual variability, we are getting pictures that look like different people. So this would be moving from the mean in the first, second, and third most important directions of between-individual variability.
If we move in the directions of within-individual variability, we are getting pictures that look like about the same person, but probably with different lighting conditions and things like that. And this would be the residual term, which is just the standard deviation of how much variability we are not able to reconstruct using the subspaces.
These slides actually demonstrate what the PLDA model would be. We further assume that at least the U can be modeled with a full rank matrix, and in that case we can rewrite the equations that we had before: this equation, with this normal prior imposed on the Y, can be rewritten this way. So we would have the speaker variability, which has a certain meaning -- this is the across-class covariance matrix, where the across-class covariance matrix would be just V times V transposed. And then, given the latent vector representing the speaker, we add another part, which is the noise part, the channel variability part. So the observed iVector given the speaker's latent vector would again be Gaussian distributed, with the mean given by the latent vector and with the within-class covariance matrix as the covariance.
So basically these equations and these equations are equivalent. Maybe it's easier to see what's going on here. You can clearly see this is a model that makes LDA-like assumptions. You assume that there is some global mean, there is some across-class covariance matrix, there is some within-class covariance matrix. And having these examples -- let's say, if we had two-dimensional iVectors now, this would be data from one speaker with the mean for the speaker -- we would see the across-class distribution and the within-class distribution. And we can already intuitively think of doing the verification. If I ask you whether these two iVectors come from the same speaker or different speakers, you would say, well, they are probably the same speaker, because they are distributed along [inaudible] large session variability.
If I ask you whether these two iVectors, which are much closer to each other, are from the same speaker, you would say no, they are probably different speakers. Right? So this is what you assume the model would provide you.
So again, this model now becomes just a linear Gaussian model that you can use to compute exactly the same log likelihood ratio as before, right? So we again want to calculate the --
>>: [inaudible] using this subspace trick twice now, right, once to represent the --
>> Lukas Burget: The iVector.
>>: And the second time to represent --
>> Lukas Burget: But the first time in --
>>: [inaudible].
>> Lukas Burget: Yeah. The first time in an unsupervised way, just to reduce the dimensionality. So just to go from the Gaussian mixture model into a Gaussian distributed thing. And the other time -- and I'm even saying I'm not even using subspaces; here I can actually model the full covariance matrices for the across-class and within-class covariance. Eventually the across-class covariance matrix doesn't have to be full rank, so it can be represented as a product of lower rank matrices -- just saying I'm restricting the speaker model to live in some lower dimensional subspace of the full iVector space. But, yes, I mean the idea is: the first time, in an unsupervised way, use the subspace to reduce the dimensionality, to convert the sequence of features into a low-dimensional vector -- not to have a Gaussian mixture model but just to have Gaussian distributed iVectors. And the second time you use the JFA-like [inaudible] to model the session and speaker variability in this low-dimensional vector.
And so now if we want to calculate these likelihoods, we actually find out that these terms are not Gaussian mixture models anymore but just convolutions of Gaussians, and that can be evaluated analytically. So, in fact, I did the dirty work for you, and you can find out that the numerator amounts to evaluating this Gaussian and then [inaudible] evaluating such a Gaussian distribution, where here I stacked the two iVectors, one above the other. But you can see that again the score is symmetric: it doesn't matter whether I switch these two or these two, you are still going to get the same score. All right? So now it can be very easily evaluated. In fact, doing some more manipulation with this Gaussian, just expressing it explicitly, you find out that the final score can be calculated just like this, where this is some bilinear product -- just a product of the first and second iVector with some low-dimensional matrix in between. This is the most expensive part. All the other things are just constants calculated for one iVector; you can really precompute this constant for each segment.
And then there are these lambda and gamma parameters that are calculated this way, but it doesn't really matter -- this is just what you would obtain; this is what we need to know. The thing is, the score can really be calculated by this simple formula. And, in fact, you can even take a decomposition of the lambda and precalculate it into the iVector, so that at the end the score can really be calculated just as the dot product of two vectors. That's it.
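[Illustration: a hedged sketch of the PLDA verification score in its directly evaluable form -- stack the two iVectors and compare a joint Gaussian that ties them through the across-class covariance (same speaker) against one that treats them as independent (different speakers). The bilinear form with lambda and gamma mentioned on the slide is an algebraic simplification of this same quantity.]

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(i1, i2, m, B, W):
    """log p(i1, i2 | same speaker) - log p(i1, i2 | different speakers).

    B: across-class covariance (V V'), W: within-class covariance.
    """
    T = B + W                                  # total covariance of a single iVector
    mean = np.concatenate([m, m])
    x = np.concatenate([i1, i2])
    # Same speaker: the two iVectors are correlated through the shared speaker.
    cov_same = np.block([[T, B], [B, T]])
    # Different speakers: no correlation between the two iVectors.
    cov_diff = np.block([[T, np.zeros_like(B)], [np.zeros_like(B), T]])
    return (multivariate_normal.logpdf(x, mean, cov_same)
            - multivariate_normal.logpdf(x, mean, cov_diff))
```

Note that swapping i1 and i2 leaves the score unchanged, which is the symmetry pointed out above.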
And there is good news: this iVector and PLDA approach doesn't need any ZT normalization. It works just fine as it is.
Another thing to point out: we can actually calculate a whole bunch of scores very quickly, because, as I just told you, the score is in fact just a dot product between two vectors. So if we just rewrite it in terms of having sets of iVectors -- let's say enrollment iVectors and test iVectors, but again, we don't have to distinguish enrollment and test data, the role of both iVectors is completely symmetric --
then, having these sets of iVectors, you can calculate the whole matrix of scores, scoring each enrollment iVector against each test iVector, just by calculating this product of three matrices, or actually a product of two matrices if we pre-process the iVectors. So once we have the iVectors, it's extremely fast -- there is nothing simpler that would give you the score, you just multiply two matrices of iVectors.
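[Illustration: scoring every enrollment iVector against every test iVector in one shot, essentially as one matrix product for the bilinear part plus per-segment constants. The per-segment constants are left abstract here; they collect the gamma/c/k terms mentioned above.]

```python
import numpy as np

def score_matrix(E, T, Lambda, const_e, const_t):
    """All trials at once.

    E: (n_enroll, R) enrollment iVectors, T: (n_test, R) test iVectors,
    Lambda: (R, R) bilinear scoring matrix,
    const_e, const_t: per-segment constants precomputed for each iVector.
    Returns an (n_enroll, n_test) matrix of verification scores.
    """
    bilinear = E @ Lambda @ T.T                 # the expensive part: matrix products
    return bilinear + const_e[:, None] + const_t[None, :]
```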
And I see that I'm running out of time. I can actually cut it before I get to the
discriminative training.
>>: So does he know that he's late? [laughter].
>> Lukas Burget: I apologize. I didn't expect so many questions from the
audience, and I --
>>: [inaudible].
[brief talking over].
>>: Could we hold the questions until much later, maybe after the talk.
>> Lukas Burget: Okay.
>>: We just flip through --
>> Lukas Burget: Actually, I can go really, really fast to the end if you just let me
finish the talk.
So anyway, this slide shows the improvement. We can actually see that by doing this trick with PLDA, you get lots of improvement, especially in the low false alarm region, which is what we saw. So you can see that, again compared to JFA, for this false alarm rate you would be going down from about 50 percent to 30 percent. So a significant improvement.
>>: [inaudible] from my office [inaudible] start to look for me. [laughter].
>> Lukas Burget: And still it can be [inaudible] so fast, right? I have yet another picture which shows another system based on exactly the same idea, which is a whole lot better. In this case, when I was comparing this performance, I was comparing with what our features and our UBM were in 2008. If we apply a new UBM that we are using at the moment, where we use full covariance matrices rather than diagonal matrices, and if we use some other tricks that we learned just recently -- reduce the dimensionality by LDA, normalize the length of the iVectors to unit length, and tricks like that -- you can get another quite significant gain from that.
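[Illustration: a rough sketch of the iVector post-processing tricks just mentioned -- reduce the dimensionality by LDA (to roughly 200) and normalize each iVector to unit length. The exact recipe (centering, ordering of the steps) is an assumption about one common variant.]

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def preprocess_ivectors(train_ivecs, train_speaker_ids, ivecs, out_dim=200):
    """LDA projection trained on labelled iVectors, followed by length normalization."""
    lda = LinearDiscriminantAnalysis(n_components=out_dim)
    lda.fit(train_ivecs, train_speaker_ids)
    projected = lda.transform(ivecs)
    # Normalize each projected iVector to unit Euclidean length.
    return projected / np.linalg.norm(projected, axis=1, keepdims=True)
```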
So this is just the relevance MAP adapted baseline. But, in fact, it already uses better features than we had, for example, in 2006. So I think it's kind of fair to compare this line, or maybe a line that would be just a little bit better, with this performance that we have right now. And you can see how much gain in performance we were getting from about the year 2006 to 2011 -- the performance for this false alarm region dropped from about 80 percent down to about 12 percent. Right?
So it would be --
>>: [inaudible].
>>: Hold the questions.
>>: I think -- the first green line is where things existed in the literature before.
>> Lukas Burget: I mean, in fact, this would be the system that existed in the year 2000, but which was still kind of the state of the art in 2005, 2006.
>>: The last two lines are things that you developed at BUT; is that correct?
>> Lukas Burget: Well, I wouldn't say that exactly -- this of course is based on ideas like PLDA that existed in face recognition and things like that. And there are other people that were working on the same things at the same time. You know, at the JHU workshop we found we can use these things as features, and everybody started implementing it; it was a straightforward extension. So there have been other labs doing the same research at the same time. But, yeah, basically these are systems that we have developed at BUT in the past two or three years, and this improvement is what we got really in the past two years.
Yeah, so the final slides -- and I can make it really short -- are on the discriminative training, just showing that now that we have this framework, we can actually retrain the parameters discriminatively and still get quite some gains. There were some efforts on what people would call discriminative training in speaker ID, and it was mainly the thing that I already described as support vector machines, where you would train a separate support vector machine for each speaker. But I don't consider that to be really the true discriminative training for the task that we deal with in speaker identification, because in speaker identification we really want to address the binary task of discriminating between a same-speaker trial and a different-speaker trial. And we should train the system for this binary task.
And we have done some work in this direction during the JHU workshop in 2008, where we tried to train the hyperparameters of Joint Factor Analysis, the subspaces, discriminatively. But we were not too successful with that; there were probably too many parameters to estimate discriminatively. And there was still the problem with the score normalization: the discriminative training actually worked very well without the normalization, but once we added the normalization, we were getting just minor gains from the whole discriminative training.
Anyway, right now we have tried to apply the discriminative training to the PLDA system, to retrain the PLDA parameters discriminatively, and it seems to work quite well.
So this is actually the first time this kind of discriminative training was really successful for speaker identification. So about now is when the discriminative training techniques, I think, will start moving into the speaker identification field. There was nothing like real discriminative training so far.
So what we propose here is to calculate the verification score based on the same functional form as what we have used for PLDA, but instead of training its parameters using maximum likelihood, we are going to train it discriminatively. Specifically, I will be showing here that we use the cross-entropy function to train the parameters discriminatively. Again, as I said, the PLDA model is well suited for the discriminative training just because we need no score normalization anymore and the functional form is really very simple. And you will see that it actually leads to just a linear classifier.
So again, writing down the equation for this score and realizing that we can rewrite this bilinear form in terms of just a dot product of the vectorized matrix A, which is here in between, and the vectorized outer product of the vectors X and Y, we can actually rewrite this score function in the following form. We can just vectorize all the parameters -- lambda, gamma, c and k -- and we can compute the dot product with such a vector where we actually calculate all these outer products between the two iVectors in the trial. So we do some kind of non-linear expansion of the pair of iVectors in the trial and form a vector out of it. And then we can calculate the score just as the dot product between the vector of weights and the non-linear expansion of the iVector pair, right? And the resulting score can be interpreted as a log-likelihood ratio. This is what it gives us.
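[Illustration: a hedged sketch of the non-linear expansion of an iVector pair and the score as a dot product with a single weight vector collecting the vectorized lambda and gamma matrices, the c vector, and the constant k. The particular symmetrization is an assumption consistent with a symmetric score.]

```python
import numpy as np

def expand_pair(i1, i2):
    """Map a trial (pair of iVectors) to a fixed expansion vector."""
    return np.concatenate([
        np.outer(i1, i2).ravel() + np.outer(i2, i1).ravel(),  # goes with vec(Lambda)
        np.outer(i1, i1).ravel() + np.outer(i2, i2).ravel(),  # goes with vec(Gamma)
        i1 + i2,                                               # goes with c
        [1.0],                                                 # goes with the constant k
    ])

def pair_score(w, i1, i2):
    # Verification score (interpretable as a log-likelihood ratio) as a dot product.
    return float(w @ expand_pair(i1, i2))
```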
So what we are going to do here is train these parameters discriminatively, and we are going to train them using logistic regression, which is again a technique that assumes that these dot products can be interpreted as log-likelihood ratios. Or we use a support vector machine as the classifier.
So we are basically training just a linear classifier discriminatively, either a linear support vector machine or logistic regression. And I'm going to skip this slide, which just says what logistic regression is about; those that don't know it, I would just refer to some standard book to learn about what it is.
But the problem in -- the -- I assume that everybody [laughter].
>>: [inaudible].
>> Lukas Burget: Yeah, I'm sorry. But, well --
>>: Oh, my goodness.
>> Lukas Burget: Everybody was waiting for a thank you or something like that. I'm sorry. And I did that on purpose, because I thought you would be nervous [inaudible] anyway. So the thing is, we have to realize that for the discriminative training, each training example is now going to be a pair of iVectors, and we can create our training examples ourselves. We have a bunch of training data and we can create our training examples: we have to create the same-speaker trials out of iVectors that correspond to recordings of the same speaker, and we can easily create lots of imposter examples, different-speaker examples, by just combining any iVectors that correspond to utterances of different speakers.
And in our case, we have about 2,000 training examples for females, 16K for males, so we could create almost a billion trials for training our systems. Now one could suggest that we should probably sample from those -- that would be too many. But, in fact, we don't have to, because if we want to calculate the gradient for the logistic regression we can use exactly the same tricks as what we used for calculating the scores. And you can actually figure out that the gradient can be calculated just by computing products of matrices over all our training trials. And the G is just the matrix of all the scores calculated by this simple dot product, with some simple nonlinearity applied to that.
So we train using almost a billion trials and we calculate the gradient in a few seconds, which allows us to train the system pretty fast. And hopefully we could train it on even much larger amounts of data. Yes?
>>: So do you run into any class imbalance issues? Like, it feels like --
>> Lukas Burget: Yes, you have to -- I mean, sure. That is what I would otherwise have explained on the previous slides, where we introduce some scaling factors for the trials and give more importance to target trials than to non-target trials.
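[Illustration: a hedged numpy sketch of the weighted cross-entropy objective and its gradient, here only with respect to the lambda (bilinear) part, computed over all pairs of training iVectors at once via matrix products -- the same trick used for fast scoring. The per-trial weights alpha handle the target/non-target imbalance; the gamma, c, and k parts would follow the same pattern and are omitted.]

```python
import numpy as np

def cross_entropy_and_grad_lambda(X, Lambda, labels, alpha):
    """X: (n, R) training iVectors; labels: (n, n) matrix with +1 for same-speaker
    trials and -1 for different-speaker trials; alpha: (n, n) per-trial weights.

    Only the bilinear (Lambda) part of the score is modeled here.
    """
    S = 2.0 * (X @ Lambda @ X.T)                # bilinear scores for all trials at once
    margins = labels * S
    # Weighted logistic (cross-entropy) loss summed over all n*n trials.
    loss = np.sum(alpha * np.logaddexp(0.0, -margins))
    # dloss/dS for each trial.
    G = alpha * (-labels) / (1.0 + np.exp(margins))
    # Gradient w.r.t. Lambda accumulated over all trials via one matrix product.
    grad_Lambda = 2.0 * (X.T @ G @ X)
    return loss, grad_Lambda
```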
So this table just compares the performance gains from the discriminative training, and you can see that the PLDA already performs quite well, but just retraining things discriminatively can give you another nice gain. We are actually getting somewhat better performance with support vector machines, but that was most likely just because we didn't use any regularization in the case of logistic regression; that would probably help also.
So this is the final slide, I promise. So, yeah, I'm just summarizing what I was telling you. We have seen that over the past few years we have speeded up the technology quite a lot -- the speedup factor would be orders of magnitude -- and we have improved the accuracy, as you could see, by a factor of five and more. And you could see that for this particular operating point we went from about 80 percent error down to 12 percent error. So these are really huge gains.
And you could see the shift in the paradigm: we are moving from the paradigm where we were training the speaker model and then testing it on the test utterances, to something that just compares the pair of utterances in a more principled way, which is similar to the Bayesian approaches for model comparison. And the new paradigm based on the iVectors actually opens new research directions. I have just been showing you that we train the PLDA discriminatively, but there are other possibilities that we work on currently. You can easily combine the low-dimensional representations: you could have different systems for different features, for example, and get different iVectors for the different systems; you can simply stack the iVectors and train the system on that. So they allow us to combine the information from different systems in a very easy way. Or you can use the same idea of extracting the low-dimensional iVector based on other models. Instead of the Gaussian mixture model, we have recently used a Multinomial Subspace model, which is quite similar to how we model the weights of the GMM, and something that we have been using before for --
>>: I was going to suggest doing that.
>> Lukas Burget: Yeah. But in this case it was just the weights, actually; it's still not the Gaussian mixture model. But we use it for modeling prosodic features, and now we got huge gains compared to what has been done before on prosodic features and things like that.
Yeah. So that's pretty much everything I wanted to say.
[applause].
>> Dan Povey: I guess it's lunch time.