>> Geoffrey Zweig: Okay. So it's a pleasure to introduce Mickael
Rouvier today. Mickael comes to us from Marseille in France. He has had
a long trip. Prior to his post-doc position at the University of
Marseille, he obtained a Ph.D. at the University of Avignon. While he
was there he developed a super elegant way of doing diarization that
basically reduces the problem to 0-1 integer linear programming, and today
he is going to tell us about that and then some extensions that use
embeddings to do diarization.
>> Mickael Rouvier: Thank you. Thank you, Geoffrey. I want to thank you
for the invitation. I am very happy to be here. So today I will speak
about speaker diarization: I will speak about automatic clustering using
[indiscernible], and after that I will speak about speaker embeddings.
So the outline of my presentation: first I will speak about me, about my
projects and my background. After that I will speak about speaker
diarization -- I will just review what speaker diarization is -- and I
will describe speaker diarization applied to broadcast news, not to
meetings. There are some differences.
So during all my talk I will present speaker diarization applied to
broadcast news, and after that I will speak about speaker clustering. I
will present the ILP clustering method. The ILP clustering was very much
appreciated by the community. In fact, when I submitted the paper to the
conference, Jim and Patrick sent me an email: Jim proposed that I come to
MIT in order to present this method, and Patrick wanted to reproduce all
my results. And after the conference, many people asked me to release
the report about ILP clustering. After that I will speak about speaker
clustering and how to exploit relevant information in speech segments
using deep neural networks, and I will present the method that I call
speaker embeddings.
So what about me? My name is Mickael. I am 33 years old. I come from
France. France is near England, Germany, Spain.
[laughter.]
>> Mickael Rouvier: I apologize in advance because my English is not
perfect. It is not perfect, it is very bad. If you don't understand,
please stop me and ask me to repeat. You can stop me during the talk to
ask me any question.
So I obtained my Ph.D. in 2011 in Avignon. I worked on video
summarization, automatic video summarization. Then I spent two years in
computer science at Le Mans, where I worked on automatic speech
recognition and speaker diarization. And after that I moved to the
University of Marseille, where I work on extracting relevant information
using deep neural networks.
Sorry, I just forgot to say: this is Avignon, Le Mans is located here and
Marseille is located here.
So I will just present two projects, two interesting projects that may be
interesting to Microsoft. The first one is the REPERE project. This
project consists in identifying the persons in a TV show. We can
identify a person in the video, but also in the audio. And here, for
example, if you want to identify this guy, this guy is Jean Macero, you
can use face ID or speaker ID, but you can also use, for example, this
information: the name of the guy is written here, and the name of the guy
was spoken in the last segment. And you can propagate all this
information in the video.
So during this project I worked on automatic speech recognition, on
speaker diarization and on speaker [indiscernible] but I also work in the
propagation of all the information around all the modalities.
So the second project is the BOLT project. The core of this project is
machine-mediated conversation. The main idea was to go beyond the simple
pipeline of automatic speech recognition and machine translation. One of
the goals of this project was to detect errors: the machine must detect
the error and ask the user to repeat the sentence. For example, if the
user says "I like your beret" and the system doesn't know the word
"beret", the system says: You like what? The user says: Your hat. The
system merges the first sentence and the second sentence, tries to create
the sentence "I like your hat" and translates it as "J'aime votre
chapeau."
So I worked in a lot of fields, but one of my goals was to extract
relevant information from documents.
So I work in speech, image, natural language processing, machine learning
and information retrieval.
The number that you see here is the number of publications in each of the
fields.
And now I try to work in extracting features using neural network.
So let's -- so what about speaker diarization? The goal of speaker
diarization is to answer the question: who spoke when? So given an audio
stream such as broadcast news, we can solve this problem in three steps.
The first step is to detect the speech and nonspeech segments. After
that we take the speech segments and we do a segmentation, so that each
segment contains the voice of only one speaker. And after that, we do
clustering: we try to say, [indiscernible], is this a different speaker
or the same speaker?
In my presentation I focus on the last step, speaker clustering, and I
propose a new method for optimal clustering. In speaker clustering there
are two kinds of approaches. The first approach is hierarchical
clustering, like bottom-up or top-down, which consists, at each
iteration, of increasing or decreasing the number of clusters.
The second kind of method is partitioning clustering, like K-means or
mean-shift, which consists of refining the partition at each iteration.
So concerning the bottom-up: the bottom-up algorithm is an iterative
algorithm that merges the two most similar clusters, but there are limits
to this approach. First, it is a greedy algorithm. By greedy I mean
that at each iteration it picks the best local solution, but we are not
sure that this solution is still the best solution at the end of the
process. And the second problem is that the bottom-up algorithm never
reconsiders a decision. I just want to say that if an error is made,
this error is propagated through the whole process.
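To make that greedy behavior concrete, here is a minimal sketch (not the actual implementation used in the talk; the pairwise distance matrix is assumed precomputed, and single-linkage is chosen only for illustration). Once a merge is made it is never reconsidered:

```python
import numpy as np

def bottom_up_cluster(dist, stop_threshold):
    """Greedy agglomerative (bottom-up) clustering over a precomputed
    distance matrix: at every iteration, merge the two closest clusters."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > stop_threshold:
            break                                   # stopping criterion
        clusters[a] = clusters[a] + clusters[b]     # greedy merge, never undone
        del clusters[b]
    return clusters
```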
So the second algorithm is partitioning clustering, for example the
K-means algorithm. The K-means algorithm tries to minimize an objective
function, for example the distance [indiscernible]. There are some
problems. The first is that the number of clusters is fixed, and in the
case of speaker diarization this is a problem because we don't know the
number of speakers and we don't know who is speaking in the audio stream.
So there is a problem here. And the second problem is the initialization
of the clusters: if we [indiscernible] the clusters at the beginning, we
can fall into a local solution and not the [indiscernible] solution.
So I like the idea of partitioning clustering, because there is an
objective function, but we must solve two problems: the first problem is
the number of clusters, and the second problem is the initialization of
the cluster centers.
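As a concrete illustration of those two problems, a small scikit-learn example (the matrix X of i-vectors is a random stand-in; this is not the system from the talk): n_clusters must be fixed in advance, and different random initializations can converge to different local optima.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(50, 200)   # stand-in for 50 segment i-vectors of dimension 200

# Problem 1: the number of clusters must be fixed up front,
# but in diarization the number of speakers is unknown.
km_a = KMeans(n_clusters=4, n_init=1, random_state=0).fit(X)

# Problem 2: a different random initialization can converge to a different
# local optimum, i.e. a different partition of the same segments.
km_b = KMeans(n_clusters=4, n_init=1, random_state=1).fit(X)
print(km_a.inertia_, km_b.inertia_)   # the objective values may differ
```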
So in order to tackle the second problem, I propose to use integer linear
programming. Integer linear programming is a mathematical model where,
given an objective function and some constraints, we can obtain the
optimal solution of the problem.
So how does that work? We give an objective function that we want to
minimize or maximize; the objective function is a linear function with
some integer variables, some [indiscernible] variables, and we give some
constraints that are also linear functions. After that we can use a
solver. This algorithm will prune the fruitless candidates during the
search for the best solution.
So one of the goals is to express the clustering problem as an ILP
problem. For that, we assume that the center of a cluster is necessarily
one of the points of the problem.
So how do we express this problem with binary variables? I want to say
that this segment and this segment are in the same cluster. First we
enumerate all the points, one, two, three, four, five, up to ten, and
then we say that segment four is the center of the blue cluster and that
segment nine is the center of the red cluster. And we assume that all
the points that are within this threshold belong to this cluster.
So in order to express this with binary variables, we create a matrix,
this kind of matrix. So the rows --
>>: Is picking the center going to be part of the problem, or are you
going to take the center separately, for example by considering each
center and computing the mean distance to all the others?
>> Mickael Rouvier: No, no, we --
>>: So you couldn't do that? You are going to have to discover what the
center is?
>> Mickael Rouvier: Maybe we will answer this question in two or three
slides.
>>: Okay.
>> Mickael Rouvier: I will explain the distance in the next slide. Here
I want to explain how to express this problem with binary variables, and
after that I will speak about the distance.
So we want to fill this matrix. So these are the segments and these are
the clusters. There are as many clusters as segments, because we can say
that each segment is potentially a cluster. Okay?
So if I want to say that this segment belongs to this cluster, I just put
a one here. So segment four belongs to cluster four, and the same for
nine: segment nine belongs to cluster nine.
And if I want to say that this segment belongs to this cluster, I just
put a one here and a one here for this segment.
And everywhere else I put zeros in the matrix. Okay.
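A toy version of the assignment matrix being filled in here, for the ten-segment example with centers 4 and 9. The 0-based indexing and the particular memberships are assumptions for illustration, and the slide may orient rows and columns the other way; x[k, n] = 1 means segment n is assigned to the cluster centered on segment k.

```python
import numpy as np

N = 10                           # ten segments, so a 10 x 10 matrix of binary variables
x = np.zeros((N, N), dtype=int)

# Segments 4 and 9 are chosen as cluster centers: x[k, k] = 1.
x[4, 4] = 1
x[9, 9] = 1

# Segments assigned to the cluster centered on 4 ...
for n in (1, 2, 3, 5):
    x[4, n] = 1
# ... and to the cluster centered on 9 (the membership here is arbitrary,
# just to show the encoding).
for n in (0, 6, 7, 8):
    x[9, n] = 1

# Each segment belongs to exactly one cluster: every column sums to one.
assert (x.sum(axis=0) == 1).all()
```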
So we have expressed the clustering in terms of binary variables. Now I
will present the ILP clustering model, but I will begin from the end: I
will present the constraints first, and after that I will present the
objective function.
So this constraint just says that a segment can belong to only one
cluster: if we take the sum of that, the sum must be equal to one,
because one segment can belong to only one cluster.
After that I just say that if segment five belongs to cluster four, then
segment four must belong to cluster four. For example, if I put a 1
here, it is obligatory to put a 1 here. And the last and important
constraint says that the distance between this segment and this segment
must be below the threshold. This is very important, because if the
value is very high we can end up saying there is only one cluster, and if
the value is very small we can end up saying that every segment is its
own cluster. So we have to search for the best value. And after that,
the objective function tries to minimize the number of clusters and to
minimize the intra-class dispersion --
>>: -- the representation is some kind of feature space and D is the
distance?
>> Mickael Rouvier: Yes. So the W, okay, so W is an i-vector. I will
speak about that, but you can use what you want -- an i-vector,
[indiscernible], for example. You can use what you want.
>>: [speaker away from microphone.]
>> Mickael Rouvier: What is most important is the distance between the
two segments. And X is the binary variable.
>>: And in cluster N, is K the center or the points of your cluster?
>> Mickael Rouvier: So this is the segment. This is the cluster. Does
that answer your question?
>>: I'm trying to understand that. So the segment and the cluster, that
-- it is a one. It doesn't have to be small, does it?
>> Mickael Rouvier: It is like filling in the matrix.
>>: In your example, what did the X_KN minus -- X_KN is less than zero,
because you gave the example of row six, column four. So K is 6, 4 is
column N. And then you said that requires that row 4, column 4.
>> Mickael Rouvier:
Yes.
>> Audience: That would be the -[overlapping speech.]
>>: KXNN, not X.
[overlapping speech.]
>> Mickael Rouvier: No, sorry, for your question it is the opposite. So
this is the cluster and this is the segment.
>> Audience: Oh.
>> Mickael Rouvier: I say if the segment is in the cluster, we, so --
>> Audience: They are not matrix indices.
[overlapping speech.]
[laughter.]
>> Mickael Rouvier: Yeah, it's the opposite.
>> Audience: [indiscernible]
>> Mickael Rouvier: So we just minimize the number of clusters and the
[indiscernible] of each cluster.
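Putting the variables, constraints and objective together, here is a hedged sketch of the 0-1 ILP using the PuLP modelling library. The distance matrix dist and the threshold delta are assumed given; the exact objective weighting and the pruning of variables above the threshold follow my reading of the talk, not the published formulation, and PuLP's default solver is used rather than GLPK.

```python
import pulp

def ilp_diarization(dist, delta):
    """0-1 ILP clustering sketch: x[k, n] = 1 means segment n is assigned to
    the cluster whose center is segment k (so x[k, k] = 1 marks a center)."""
    N = len(dist)
    prob = pulp.LpProblem("ilp_clustering", pulp.LpMinimize)

    # Only create variables for pairs closer than delta (the speed-up
    # discussed below); center variables x[k, k] always exist.
    x = {}
    for k in range(N):
        for n in range(N):
            if k == n or dist[k][n] < delta:
                x[k, n] = pulp.LpVariable(f"x_{k}_{n}", cat="Binary")

    # Objective: number of clusters plus intra-cluster dispersion.
    prob += (pulp.lpSum(x[k, k] for k in range(N))
             + pulp.lpSum(dist[k][n] * x[k, n] for (k, n) in x if k != n))

    # Each segment belongs to exactly one cluster.
    for n in range(N):
        prob += pulp.lpSum(x[k, n] for k in range(N) if (k, n) in x) == 1
    # A segment may only join cluster k if segment k is a center.
    for (k, n) in x:
        if k != n:
            prob += x[k, n] <= x[k, k]

    prob.solve()
    return {n: k for (k, n) in x if pulp.value(x[k, n]) > 0.5}
```

A smaller delta prunes more variables and tends to force more clusters, which is exactly the trade-off discussed in the questions around this slide.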
Yes?
>> Audience: What properties do you have to assume for your distance
measure? Does it have to be a true distance measure -- obeying the
triangle inequality and so on?
>> Mickael Rouvier: This measure? In fact, it is the PLDA for i-vectors.
>> AUDIENCE: I'm asking, can you substitute other distance measures in
there? Is it part of the derivation that it has to be a distance
measure, or can it be a pseudo-distance measure?
>> Mickael Rouvier: I don't understand. Sorry, I don't understand your
question. It's my English.
>> Audience: So you can have functions that behave somewhat like a
distance measure but don't obey the triangle inequality property?
>> Mickael Rouvier: Oh, yes. Oh, the PLDA? I think you are saying: if
there are three segments, A, B, C, and you calculate the distance between
A-B and B-C, is it equal to A-C -- is that your question or not?
>> Audience:
Does it require that?
>> Mickael Rouvier: Okay, okay, it does not require that, because I think
the PLDA does not follow this inequality, but I am not sure about that.
>> Audience: What is required for this derivation to work, for this
system to work for D?
[overlapping speech.]
>> Mickael Rouvier: You can use whatever distance measure you want.
>> Audience: Okay.
>> Mickael Rouvier: So I use PLDA, but I also use other distance metrics
like cosine and [indiscernible], different metrics.
>> Audience: The property --
[coughing.] [indiscernible]
>> Mickael Rouvier: You put whatever thing you want. You put whatever --
yes, you can even use whatever distance measure you want.
>> Audience: So is that the third constraint kind of redundant to the
minimization criterion?
>> Mickael Rouvier:
Sorry?
>> Audience: So the third constraint? Is that what is going to determine
sort of how many clusters you have? Because if you set the --
>> Mickael Rouvier: So you assume, say, that the threshold is very high.
>>: If you set D small.
>> Mickael Rouvier: Small? So there is --
>>: Then you.
[overlapping speech.]
>> Mickael Rouvier: -- cluster.
>>: No, wait. Now I'm confused. The third one might not be satisfiable,
I believe, right?
>> Mickael Rouvier: So I can say the distance of two segments.
>>: All right. So if I make D -- it is hard when D is small.
>> Mickael Rouvier: Yes.
>>: As I make D smaller.
>> Mickael Rouvier: Yes, because --
[overlapping speech.]
>>: Then this may force more clusters. But isn't that redundant to
what's going on in the minimization there? Where if you put a very -yeah, if you put a very high weight on that second term in the
minimization, it would effectively do the same thing as the third
constraint. Why do you need that hard constraint there?
>> Mickael Rouvier: You can remove the [indiscernible] and remove the
[indiscernible] of the program. You can remove all the this ->>:
Padlocks.
>> Mickael Rouvier: You can do that. And in fact, when you use speaker
diarization cluster, but cluster I just want to say that we close the
diarization of the single [indiscernible] but make it show and we try to
detect the regular speaker and when we do this, we remove this constraint
and we modify the program in order to speed up the process.
>>: But the bulk of putting the test [indiscernible]
>> Mickael Rouvier: Yeah, yeah.
>>: So is this constraint only --
[overlapping speech.]
>>: -- or is it also aimed at improving accuracy?
>> Mickael Rouvier: Yes, accuracy is the same.
>>: Oh, so the.
[overlapping speech.]
>> Mickael Rouvier: We can just remove that and the accuracy is the same.
>>: Okay, so that's fair enough. Okay.
>> Mickael Rouvier: This is just to speed up, all right. So
[indiscernible]
>>: I don't understand why the accuracy would be the same. I think the
final number of clusters will be determined by the delta you choose,
right?
>>: Will be constrained by the delta.
>> Mickael Rouvier: It is not the number of clusters. The number of
clusters is [indiscernible] by this threshold.
>>: Right. So if you choose a different threshold, then the number of
clusters you get is different, right?
>> Mickael Rouvier: It is [indiscernible]
>>: If you change the value of --
>> Mickael Rouvier: Yes, yes, the number of clusters is different.
>>:
[overlapping speech.]
>>: Therefore, if you remove that constraint entirely, the number of
clusters will be different.
>> Mickael Rouvier:
Yes.
>>:
Therefore, the accuracy will be different.
>>:
Well, I don't think --
>>: You are saying the constraint is not really needed to get best
possible accuracy. It is just needed to make things faster?
>> Mickael Rouvier: No, this constraint -- if we remove this constraint,
it is only to speed up the process. But we need to remove all the values
in the [indiscernible]. Okay?
>>: Oh, I see.
>> Mickael Rouvier: We cannot remove this constraint without removing
this variable.
>>: Okay. How will.
[overlapping speech.]
>> Mickael Rouvier: For example, I just want to -- so if each
[indiscernible] is above this threshold, we remove all these variables
from the program.
>>:
Right, right, yeah.
>> Mickael Rouvier: And we are saying that this gives the same accuracy.
But we remove multiple constraints, so the resolution is faster.
>>:
Okay.
>>: Oh, but you change, you change the sums, the sums are no longer over
everything.
>> Mickael Rouvier: Yes, we can change.
[overlapping speech.]
>> Mickael Rouvier:
And change now the whole framework.
>>: I got it, yes, okay.
So lots of variations here.
>> Mickael Rouvier: So this is my experiment. We use our own
[indiscernible] of broadcast news; it is French broadcast news. The
evaluation set is REPERE 2013. It is classical: we use a UBM of 1024
Gaussians, i-vectors of dimension 200, length normalization, PLDA scoring
[indiscernible]
So the metric is the diarization error rate, which is in fact a
combination of three errors: missed speech, false alarm and speaker
error. And for the speaker diarization, for the ILP, I use the GLPK
toolkit.
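For reference, the diarization error rate scored here is the standard NIST time-weighted combination of those three errors; this formula is the general definition, not something specific to this system:

```latex
\mathrm{DER} = \frac{T_{\text{miss}} + T_{\text{false alarm}} + T_{\text{speaker error}}}{T_{\text{total scored speech}}}
```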
So first we compare the bottom-up algorithm for -- [indiscernible] sorry,
I just want to mention these are the names of the shows in the corpus and
this is the diarization error rate. We compare the bottom-up
[indiscernible] with the bottom-up with i-vectors. We obtain a gain, and
after that we compare bottom-up with ILP clustering and we obtain a new
gain with ILP.
>>: I have a quick question about using the i-vector to compare these two
classes. Is that because your test set has longer segments, so you can
get a reliable i-vector score? Whereas in conversation each segment is
too short?
>> Mickael Rouvier: Okay. So in broadcast news the duration of the
segments is around, between one and two seconds.
>>: Aha. Short.
>> Mickael Rouvier: Yes, it is very short.
>>: You are basing your i-vector on one to two seconds worth of speech?
>> Mickael Rouvier: Yes, it is very short.
We obtain [indiscernible]
>>: The i-vector is computed, there is a separate i-vector for each
segment?
>> Mickael Rouvier: Yes, yes.
>>: Yes. So I can understand what is the meaning of HAC there?
>> Mickael Rouvier: HAC.
>>: Hierarchical.
>>: Hierarchical cluster ...
>> Mickael Rouvier: It is bottom up.
>>: But there are other popular clustering algorithms there.
>> Mickael Rouvier: This is the classical approach for broadcast news,
the baseline.
>>: No, there are many other clustering algorithms?
>> Mickael Rouvier: Yes, yes, for broadcast news the [indiscernible]
approach is to use the bottom-up algorithm and CNN for broadcast news.
>>: Okay. So what is the training criteria? I'm sorry, the evaluation
criteria here?
>>: Is this the three error rate?
>> Mickael Rouvier: Diarization error rate.
>>: Diarization error rate, okay.
>>: I forget.
[coughing.]
>>: I should probably know, but does the diarization error rate include
the speech accuracy?
>> Mickael Rouvier: Yes, yes, it is speech [indiscernible], but in fact
in this kind of diarization the segmentation is the same, because we just
change the speaker clustering; we just change the method of speaker
clustering. So yes, the diarization error rate combines the three
errors, but in fact we are comparing just the one metric.
>>:
[overlapping speech.]
>>:
Segmentation is --
>> Mickael Rouvier: The clustering, we compare in fact the speaker error
clustering. But yeah, I just indicate the diarization error rate.
>>:
Okay.
>>: So for that clustering, to achieve this [indiscernible] variable, do
you need to adjust the [indiscernible] in the constraint?
>> Mickael Rouvier: Adjust, sorry?
>>: You have the constraint with the delta, right? The threshold -- you
have the threshold? How did you set the threshold?
>> Mickael Rouvier: Sorry, I don't understand.
>>: The third constraint that we talked about? Do you need to adjust the
delta for this result?
>> Mickael Rouvier: Yes, yes. I didn't mention it, but we have a
development set (indiscernible) and we optimize the results on the
development corpus. The threshold is optimized on the development
[indiscernible], and then we take this threshold and we apply it on the
test corpus. Is that your question?
>>:
Yes.
>>:
Yes.
>> Mickael Rouvier: Yeah, I forgot to -- I have a curve and I forgot to
present the curve. But yes, I have a curve where we indicate the
diarization error rate as a function of the threshold.
>>: Okay. So my question is, if you adjust the number of clusters in
the other approaches owe.
[overlapping speech.]
>> Mickael Rouvier: So for this, for the ILP in fact, I try to search
the optimal threshold. It is exactly the same as the development
purpose.
>>: No, I am not talking about the -- I'm talking about HAC i-vector.
>> Mickael Rouvier: Yes. I don't know for that.
>>: You did it to [indiscernible]
>> Mickael Rouvier: Yes, I don't try to search for the best solution on
the test corpus. I just --
[overlapping speech.]
>>: -- development.
>> Mickael Rouvier: Yes, it is optimized on the development.
[overlapping speech.]
>>: On the development set.
>> Mickael Rouvier: Yes, but your question was to optimize on the
threshold was the best set?
>>:
No, no.
>> Mickael Rouvier:
No?
>>: Because you have these three parameters. The first of the
parameters will affect the final result. And so in the other approach
you can also adjust for example, the maximum distance between two
clusters and since you do that, you can also choose the number of
clusters you can generate.
So then that will also affect the results in your baselines.
>> Mickael Rouvier:
Yes.
>>: So I'm wondering why you don't do the same for.
[overlapping speech.]
>> Mickael Rouvier: No, no, they are also optimized on the development
corpus.
>>:
All right.
>>:
So it is a fair comparison.
>> Mickael Rouvier:
[chuckling.]
And as, sorry, on the same development corpus ...
>>: You need an additional segmentation, right? So you need to generate
the segments that you submit to the clustering. How did you generate
those? And were they the same for all your different methods?
>> Mickael Rouvier: Because the segmentation was run by the LIUM speaker
diarization. So there is --
>>: Sorry, is that some kind of -- it's just a speaker change detector,
right?
>> Mickael Rouvier:
Yes.
>>: You basically look for speaker changes and then any, you know,
whatever comes between two changes is one of the segments that you will
feed to your.
[overlapping speech.]
>> Mickael Rouvier: Yes, I do that in speaker clustering.
>>: And it was the same for all?
>> Mickael Rouvier: Yeah, yeah.
>>: So you didn't do this method like in the [indiscernible] diarization
system where you generate clusters and then you use those clusters to resegment the data iteratively. Right? So you have some segmentation at
some stage in your algorithm.
>> Mickael Rouvier:
Yeah.
>>: You generated HMM with cluster models and you used the
[indiscernible] to segment, re-segment the data. Then you redo the
clustering and so forth. So you iteratively re-segment and re-cluster.
You didn't do anything like that? You just found the speaker boundaries
once and then you simply do the clustering based on that?
>> Mickael Rouvier:
Yeah, yeah.
I know that in meetings you do that.
>>: For broadcast, that is enough? I mean, that's sufficient? You don't
have to do the re-segmentation?
>> Mickael Rouvier: Yeah, you don't need to do that in broadcast news.
>>: I see. Okay.
>> Mickael Rouvier: But it is true that in meetings you do that.
>>: Okay, okay.
>> Mickael Rouvier: So yes, to finish: the training is optimized on the
development corpus. So now I will speak about speaker embeddings. So
one of the limits that --
>>: Where has that gone? Are you using it now? Is it adopted at
Marseille? Or at Avignon?
>> Mickael Rouvier: So the ILP plus [indiscernible] was developed in Le
Mans.
>>: Who is using it now?
>> Mickael Rouvier: The ILP clustering?
>>: Yes.
>> Mickael Rouvier: There are some guys at IBM. LIUM uses that, Orange,
and I know that Najim uses that and Patrick also uses that.
>>: -- maybe try it.
[chuckling.]
>>:
The possibility of ... you go back to the last slide?
>> Mickael Rouvier:
Yes.
>>: So the first question: based on your initial speaker change in the
[indiscernible], what is the best error rate you can get?
>> Mickael Rouvier: So if --
>>: Based on your speaker change, using your speaker change, what is the
best error rate you can get? What is the best achievable error rate? If
you had, like, an Oracle to tell you the --
>> Mickael Rouvier: Okay.
>>: The cluster rate.
>> Mickael Rouvier: Yeah, yeah.
>> Mickael Rouvier: Each step has an error -- I don't know. I did that
test one time, but I never reproduced the test on this corpus.
>>:
That's the question.
>> Mickael Rouvier: I know that if we obtain the best clustering with
this segmentation on this corpus, I think it is around 10 percent
diarization error rate. But it is 10 percent because there are some
problems with overlapping speech in this kind of corpus that the system
does not take into account. So there are some problems. Is that your
question?
>>: Yes. The second question is actually, how fast is the runtime
compared between column two and column three?
>> Mickael Rouvier: ILP is very fast. It is faster than the
[indiscernible] algorithm and bottom-up. With the [indiscernible]
bottom-up and i-vectors, you must merge, you must recalculate the
i-vectors at each step. With ILP you just calculate all the i-vectors
one time and you do the clustering.
>>: What package do you use to solve your integer program?
>> Mickael Rouvier: [indiscernible]. It is online, it is free. It is
very fast. And ... we did some tests on [indiscernible], around 150
hours of shows, and it takes around ten minutes. I'm not sure about the
time, but it is around ten minutes to solve this kind of problem. It's
very fast because just ...
>>:
[speaker away from microphone.]
>> Mickael Rouvier: Because we remove this constraint. We remove these
binary variables. And third, we do not solve the problem as one big
problem: we try to separate the problem into subproblems.
I don't explain that, but we can, I can explain later if you want.
>>: So what is the scaling with the number of clusters? So if you double
the number of clusters of initial segments, sorry, if you double the
number of segments, how does the runtime of the clustering --
>> Mickael Rouvier: Because it is very, very fast.
>>: Right. Okay. But I mean theoretically speaking, is it quadratic or
linear?
>> Mickael Rouvier: I don't know. No, it is linear. It is not relative.
>>: But the number of variables is quadratic, right?
>> Mickael Rouvier: Yes, but since you use the PLDA metric you can remove
a lot of binary variables. I agree that in this matrix, if you increase
the number of speakers [indiscernible] the number of segments, it is
quadratic, but since you use a discriminant distance you can remove a lot
of binary variables and just focus on the important binary variables. So
I say it is not quadratic. It is a bit more than linear.
>>:
Oh.
>> Mickael Rouvier: So now I want to speak about speaker embeddings. One
of the problems with i-vectors: i-vector PLDA obtains very good results
in the NIST evaluation campaigns, but we know that --
>>: Which evaluation was that? Broadcast news again?
>> Mickael Rouvier: No, no, I am speaking generally about speaker
verification. This is just an overview.
>>: Speaker identification? Oh, yeah, yeah, yeah.
>> Mickael Rouvier: But we observe that on short segments the performance
degrades rapidly. So there is a lot of work that tries to tackle this
kind of problem, for example in the PLDA scoring, to normalize the
i-vector with respect to the duration of the segment. And me, I propose
-- because the i-vector is extracted from the [indiscernible] space -- I
want to extract the features in the speaker space, the [indiscernible] in
the speaker space. So in order to do that, I propose to learn this
feature representation with a deep neural network. The task of the deep
neural network is the speaker identification task. So the input is the
first-order [indiscernible] statistics: we take a segment, we take a UBM,
we compute the Baum-Welch statistics, and after that we normalize by the
mean and the covariance of the UBM. We give that as the input of the
deep neural network, and the output is the speaker identity. For
example, if the segment is speaker ID 3 we say 001, 000, [indiscernible].
And after that I propose in fact to use one of the hidden layers as the
new representation. So for example, if I say this hidden layer is the
new representation, I reduce the number of neurons and I take this hidden
layer as the new representation.
And I call this the speaker embedding.
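A minimal PyTorch sketch of that idea: a feed-forward network trained for speaker identification on normalized statistics, with one hidden layer read out as the embedding. The layer sizes, activation and statistics dimension here are illustrative assumptions, not the Kaldi configuration used in the talk.

```python
import torch
import torch.nn as nn

class SpeakerEmbeddingNet(nn.Module):
    """Train on speaker identification; reuse a hidden layer as the embedding."""

    def __init__(self, stats_dim=10000, emb_dim=500, n_speakers=1000):
        super().__init__()
        self.hidden1 = nn.Sequential(nn.Linear(stats_dim, 2048), nn.Sigmoid())
        self.hidden2 = nn.Sequential(nn.Linear(2048, emb_dim), nn.Sigmoid())
        self.classifier = nn.Linear(emb_dim, n_speakers)  # one output per training speaker

    def forward(self, stats):
        # stats: normalized statistics for a segment (or a batch of segments).
        h = self.hidden2(self.hidden1(stats))
        return self.classifier(h)      # trained with cross-entropy on speaker labels

    def embed(self, stats):
        # At test time, drop the classifier and keep the hidden activation
        # as the speaker embedding used for clustering.
        with torch.no_grad():
            return self.hidden2(self.hidden1(stats))
```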
So just to -- these are the speaker embeddings. Normally speaker
embeddings are vectors, but for the presentation I had to compress the
vectors, and we see that there are some patterns that are in common
between the train and test corpus; we see that there are similar patterns
between the training and test corpus. Just for information.
So one of my goals is to replace the i-vector in the i-vector PLDA
pipeline. One of the problems is that the speaker embeddings are built
without any Gaussian assumption.
>>:
Why is that a problem?
>> Mickael Rouvier: Yes, it is a problem because PLDA assumes that the
[indiscernible] must be Gaussian. So here, every color represents a
speaker. I take four speakers, and every point represents a segment.
And we observe that there is a radial distribution of the data. So it is
clear that the Gaussian scoring, [indiscernible] scoring -- Gaussian
scoring, sorry -- focuses only on the direction of the [indiscernible].
So the scoring must be adapted to this kind of task.
But I tried a technique that was used in speaker verification. It is an
iterative whitening [indiscernible] normalization. This technique
consists of artificially Gaussianizing the data: we remove the mean and
multiply by the whitening matrix.
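A hedged numpy sketch of that whitening step, assuming the embeddings are stacked one per row; whether it is applied once or iteratively, re-estimating mean and covariance after each pass, is what the one, two or three iteration results below refer to.

```python
import numpy as np

def whiten(E, eps=1e-10):
    """Whiten a matrix of embeddings E (one row per segment): remove the mean,
    then multiply by the inverse square root of the covariance matrix so that
    the transformed data has (approximately) identity covariance."""
    mu = E.mean(axis=0)
    C = np.cov(E - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(C)
    W = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T
    return (E - mu) @ W
```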
So for the experiments I use the same background training data as in the
previous experiments, but I use the ETAPE 2012 data set. I use the same
UBM. The i-vectors are of dimension 200, but the speaker embeddings are
of dimension 500. The i-vectors are [indiscernible] with two iterations,
the [indiscernible] with one iteration. For the speaker diarization we
use the same toolkit. For the DNN we use the Kaldi toolkit. The
activation function is [indiscernible] and we use three hidden layers.
First I want to compare -- so I take the same different shows, and this
is the overall result on the test data set.
In the first experiment I just compare i-vector PLDA with speaker
embeddings and cosine scoring. We observe that there is a gain.
After that I try to normalize the speaker embeddings with the whitening
normalization and use these normalized speaker embeddings with PLDA.
Here are the results: if I use speaker embeddings with no normalization
and PLDA, we obtain worse results. If we use speaker embeddings with one
whitening iteration, we obtain a new gain. And with two or three
iterations it doesn't work.
>>:
Why -- [indiscernible] is different from the previous slide?
>> Mickael Rouvier: No, because the last slide I used the
[indiscernible] and in this slide I used PLDA metrics.
>>:
Oh.
[speaker away from microphone.]
>> Mickael Rouvier: So PLDA doesn't work because there is no Gaussian
assumption.
>>: Did you find anything in the i-vectors?
>> Mickael Rouvier: Yeah, yeah, the i-vectors have whitening
normalization.
>>: Already?
>> Mickael Rouvier: Yeah. I don't -- they are normalized.
So for the conclusion and perspectives: for the ILP clustering, one of
the conclusions is that ILP clustering explores more clustering solutions
than the greedy bottom-up algorithm.
The systems that we developed with ILP were submitted to the ETAPE 2012
and the REPERE 2012 and 2013 evaluations, and we obtained the best
results in single and cross show.
One of my perspectives with ILP clustering is to make a joint model of
speaker clustering and speaker identification. By joint model I mean,
for example, that when you do the clustering you also do identification
in order to know the names of the people; you can do that jointly, at the
same time. And maybe we can do that with the ILP clustering.
For the speaker embeddings, the features are extracted in the speaker
space. I don't present this work, but we tried them in automatic speech
recognition: when we train the acoustic models, we also add the i-vector;
we replace the i-vector with the speaker embeddings, and it works better
than the i-vector. We also tried speaker embeddings in the speaker
identification task, and they work better than i-vectors. And they also
work very well in the speaker diarization task.
One of my perspectives is to investigate the use of different input
features. We can, for example, use the [indiscernible] probabilities
given by the DNN and use this as the input vector of the DNN. And the
last perspective is to create a DNN that automatically learns the input
features, and one approach is to use convolutional neural networks, as in
text classification.
Thank you.
[applause.]
>> Geoffrey Zweig:
So we have time for a couple questions.
>>: So in the [indiscernible] case... There are a few parameters that
will kind of empirically set how many clusters, will favor some number of
clusters. I guess the question is, is there a way in the framework to
incorporate prior knowledge in this way? To present, without having to
[indiscernible]
So, for example, let's say you have some collection of data where you
know that the typical number of speakers is, cluster, you should have
between five and 15 and there's another data set where it's between one
and five, but you don't want to -- each time you don't have to go retune
all the, you don't want to have -- for each time you have a case like
that, you don't want to have a data set with hyper tune parameters.
So is there, if you have a generative model approach, like clustering,
there may be a way to put a prior kind of pin or something like that in
the [indiscernible]
>> Mickael Rouvier: I'm not sure to understand your question. So you
have some data set with one through five speakers and there is a data set
with ten to 15 speakers. And you want to adapt your systems?
>>: Is it possible to get the best performance on one set, you would
have one set of hyper parameters.
>> Mickael Rouvier:
Yeah, yeah.
>>: And a different set, you have a different set of hyper parameters,
but you don't have to do that every time. Is there a way to --
>> Mickael Rouvier: If you know the [indiscernible] number of clusters,
you can add that in the constraints of the problem. You can say, okay,
for this kind of problem there are between one and five clusters. And so
you can add this kind of constraint in the ILP.
But for me the threshold that you learn ... for me, this threshold in
fact represents an average within-speaker distance. So for me it is
optimal whatever kind of corpus you use.
>>:
So that should be independent of --
>> Mickael Rouvier:
>>:
Yeah.
Of the factors, since those are overall cluster [indiscernible]
>> Mickael Rouvier:
Yes, it is a mean of the speaker, in fact.
>>: So do you observe that in your tests? You have maybe, you have ten
different shows, right? Did you optimize that for every show
individually?
>> Mickael Rouvier: Yes. In fact, among all these shows, this kind of
show is very different because it is a show about pop stars. That
explains why we obtained very bad results. In fact, I tried to optimize
my threshold on all the corpus without this one, and the opposite, but in
fact the threshold is practically the same.
>>:
Okay.
>> Mickael Rouvier: So I take all the corpus and I apply the same
thresholds on all the corpus.
>>: So it is not the threshold, so why is that one so much worse? Is it
the i-vectors or not?
>> Mickael Rouvier: No, it is the same rotation. There is a lot of
music. The format is five minutes, a lot of music. And it is very -- it
is not usual, like --
>>: Is it that there's a lot of back and forth between people, short
turns?
>> Mickael Rouvier:
Yeah.
>>: Because this goes back to the question, would this work on like
meetings?
>> Mickael Rouvier:
Yeah.
>>: Where you have a lot shorter turns and you don't have segments that
are long enough to estimate stable i-vectors?
>>:
[indiscernible]
>> Mickael Rouvier: No.
[overlapping speech.]
>> Mickael Rouvier: I removed it from the test [indiscernible] corpus.
Sorry, I can show you, but the segments are very short and there is a lot
of music. So in fact, given the segmentation -- the segmentation is very
bad: it says that there is a segment, but with a lot of speakers inside,
and sometimes there is a segment, and also we have some problems with
speech and nonspeech, saying okay, there is nonspeech, but the speakers
speak and ...
>>: So on a different topic, this is a very general way of doing
clustering, the ILP insight that they had.
>> Mickael Rouvier: I hope that is considered [indiscernible]
>>: So what else have people been clustering with ILP?
>> Mickael Rouvier: What else?
>>: Is this the first application of ILP to clustering? Or has it been
used, like, to cluster documents previously? Or used to --
>> Mickael Rouvier: No, no. I don't think it has been used on documents.
So yeah, I tried this kind of approach and it has never been used in
other things.
>>:
It never has been used in other fields?
>> Mickael Rouvier:
Yeah, yeah, it is only in speaker diarization.
>>: Okay. So that's a real innovation. When -- have you seen it since
applied to other areas? Or do you think it would be applicable to other
areas?
I mean, it seems like anything that you currently say K-means clustering,
we can use this.
>>:
No, I doubt it do much better than [indiscernible] in many cases.
>> Mickael Rouvier: So when I explained the REPERE project and when I did
the publication on the identities, I used the ILP in order to propagate
all the identities. And it is based on my ILP clustering. So we can do
-- we can apply it to [indiscernible] tasks like, I don't know.
>>:
Yeah, yeah, but it is not impossible.
>>:
It can be limitation, it requires distance, two distance.
>>: But does HAC essentially do K-means clustering, merging the closest?
>>: No, it is hierarchical.
>>: Because you don't know the number of clusters.
>>: You just merge.
>>: Right, you don't iterate, you go only to [indiscernible]
>> Mickael Rouvier: It is [indiscernible] clustering, you just merge.
You don't iterate.
>>: So it is very -[overlapping speech.]
>>: You have the stopping problem, right? In the hierarchy, you have to
have a separate criterion for knowing when to stop clustering. What is
the corresponding parameter in your system? Right? So the problem --
the general problem in hierarchical clustering is knowing when to stop
clustering.
>> Mickael Rouvier:
Yes.
>>: Whether you know you have the right number of clusters, which of
course is somehow related to the number of true speakers in the case of
diarization. So in your system, with ILP, what is the corresponding
parameter? How do you control how aggressively you want to cluster
things together?
>> Mickael Rouvier: By the threshold.
>>: Is it the delta?
>> Mickael Rouvier: Yeah, yeah, yeah.
>>: It's the delta.
[overlapping speech.]
>>:
-- the goal?
>>:
You're right.
>>:
That's the ambiguity I was talking about before.
>>:
Okay.
>>: I think you can get rid of the third thing. There's the weight
parameter, the F parameter. You just use Fs.
>>: That's an efficiency. They both -- the delta is to rule out, to
eliminate the two variables.
>>:
[speaker away from microphone.]
>>:
Exactly.
>> Mickael Rouvier: To eliminate, you just set up, but you can eliminate
this constraint, but you eliminate all the binary variables and you
eliminate [indiscernible] data. So you must to track the [indiscernible]
parameter.
And you compare the cluster with that.
>>:
And you have a separate, you have a separate data set between those?
>> Mickael Rouvier: Yeah, yeah. Different set to choose this parameter
and after applies that on the test.
>>: And have you tried to re-optimize it? How far from the optimal
value were you when you, after you tuned those parameters? Have you
[indiscernible] out?
>> Mickael Rouvier: So for the two corpora, the ETAPE and the EPAC, in
fact the optimal parameter on the test and the development corpus is the
same.
>>:
Really?
Okay.
>> Mickael Rouvier: But maybe in the corpus they develop.
[overlapping speech.]
>>: It would be interesting, like to, English broadcast news or
something, to see if the same parameters would show.
>> Mickael Rouvier:
Yes.
>>: This criterion to minimize. What you minimize is a very generic
criterion. I think it's very similar to all the other clustering
approaches, right? It's a cluster plus some distance thingy.
So your method is not about the criterion, but it is about finding the
correct solution? Or it is an optimization problem you are solving? Is
it the criterion that needs to be solved?
>> Mickael Rouvier: Good question. For me it is to optimize this -- I
want to optimize this -- and this criterion is just there to constrain,
to say that the segment, to ... hmm, like, create the model, the speaker
model, to say that --
>>: Yes, I understand.
>> Mickael Rouvier: -- all the speakers are the same.
>>: My question is, the same criterion, the first formula, you could
also optimize that with K-means, for example, right? I think.
>> Mickael Rouvier: Yeah, yeah, we can do that, but there is a problem
with the initialization of the K-means. And this is a problem. I tested
K-means, and I ran K-means several times.
[overlapping speech.] from [indiscernible], but with the ILP we obtain
the same results each time.
>>:
You actually did that experiment?
>> Mickael Rouvier: Yes, I did. And after that I tried different
objective functions. I tried to minimize [indiscernible] and to maximize
the [indiscernible] dispersion, and I tried an objective function where I
maximize the number of speakers and minimize the dispersion
[indiscernible]
But the best results were obtained with this kind of approach.
>>: So you can also, I think, apply the same criterion for HAC, right?
The hierarchical clustering?
>> Mickael Rouvier: Yes, we can, but the problem with [indiscernible] is
that if we take the decision one time, it is propagated.
>>: Right. So what you're saying is that your solution is really a
search algorithm. That is what you are solving. You have the best way
of solving the same equation with the same criterion.
>>:
[speaker away from microphone.]
>>: So can you say a little more about this, about how -- anything in
particular about the cross-show diarization? Is that done by just
putting two shows into one big broadcast, one after the other?
>> Mickael Rouvier: Yeah. So it's an [indiscernible] type. So in single
show, you run the program like that, like I explained. And in
cross-show, you take show A and show B, and in fact you want to detect
the recurrent speakers. For example, if you treat show A and B
separately, you cannot detect that this speaker and this speaker are the
same, but you can, for example, detect that this speaker and this speaker
are the same, like that. But they are not the same speaker.
>>:
So it tends to be mislabeled?
>>: It is still.
[overlapping speech.]
>>: When you say same identity, you mean the same label.
[overlapping speech.]
>> Mickael Rouvier: The same identity is the same label.
>>: They are both speaker one?
>> Mickael Rouvier: Yes, yes, but in cross show you have a new step, a
fourth step. It's a clustering, but a global clustering on all the
shows. So you treat the shows separately, and after that you add a
fourth step, a global clustering that treats all the data. Now you can
detect the recurrent speakers and say that this speaker and this speaker
are not the same.
We tried to do that.
>>: But you are not really examining the identity decisions you make
within the show.
>> Mickael Rouvier:
Sorry?
>>: So when you do the cross show clustering with ILP, do you reconsider
identity within the same show?
>> Mickael Rouvier: Yes.
>>: Yes.
>> Mickael Rouvier: We try to fix the clustering and we try to
reorganize.
>>: Okay.
>> Mickael Rouvier: In fact, not to fix but to reorganize works better,
because I think that in the first clustering we miss some --
>>: Sure.
>> Mickael Rouvier: So we tried the experiments.
>>: [speaker away from microphone.]
>>: By the way, do you have an explanation why you have this radial
distribution? I don't understand that.
>> Mickael Rouvier: Sorry?
>>: Radial distribution.
>> Mickael Rouvier: So on the radial --
>>: Radial distribution.
>>: The radial distribution of the i-vectors. Do you know why? Do you
have an intuition for it?
>>: Can you show the picture?
>>: Yes.
>> Mickael Rouvier: So this is the radial distribution of the speaker
embeddings.
>>: So have you seen that before? [indiscernible] gave a talk.
>>: This is not some.
[overlapping speech.]
>>: Some years ago, and he said -- it was like showing word embeddings or
document embeddings. He made a big deal about, oh, look at this, it's
radial.
>>: Okay, so why is it radial?
>>: Wait, this is the distribution per class.
[overlapping speech.]
>> Mickael Rouvier: I don't know. I don't know why it is a radial
distribution, but we examined that. In fact, the first experiment was to
know whether the speaker embeddings were Gaussian or not. And in the end
we observed that it is a radial distribution. But we don't know why.
>>: But this is --
>> Mickael Rouvier: This is the radial distribution of the speaker
embeddings.
>>: But this is like the first two dimensions? Is this a projected --
>> Mickael Rouvier: Yeah, yeah. We take four speakers, only four
speakers, with their segments, and we use a PCA to project to two
dimensions.
>>: Doesn't this -- isn't this just to incorporate normalization
directly in the objective function term?
>> Mickael Rouvier: It is the objective function of the deep neural
network?
>>: Yes.
>> Mickael Rouvier: Yes, maybe.
>>: Also, does the length have to do with the length of the segment? And
how far you are from the center?
>>: Maybe.
>>: Is that related to the length of the segments?
>> Mickael Rouvier: I don't know, but maybe we can have some
normalization between the [indiscernible]
>>: No, because the input is length and variance, right? So the input to
your network, those are the Baum-Welch statistics; it's actually the old
i-vector statistics.
>> Mickael Rouvier: It is exactly the same.
>>: Basically, that is why they converge to the same point. You have a
point.
[overlapping speech.]
>>: But it is -- it's got to be some physical correlate to the distance
from the origin point.
>>:
Is that zero?
>> Mickael Rouvier: I am not sure. No, it is not zero, but I don't
know, maybe ... I am to indicate to the [indiscernible]
>> Geoffrey Zweig: Okay. Let's thank the speaker again.
[applause.]
[End of file.]