>> Geoffrey Zweig: Okay. So it's a pleasure to introduce Mickael Rouvier today. Mickael comes to us from Marseille in France. He has had a long trip. Prior to his post-doc position at the University of Marseille, he obtained a Ph.D. at the University of Avignon. While he was there he developed a super elegant way of doing diarization that basically reduces the problem to 0-1 integer linear programming, and today he is going to tell us about that and then some extensions that use embeddings to do diarization.

>> Mickael Rouvier: Thank you. Thank you, Geoffrey. I want to thank you for the invitation; I am very happy to be here. So today I will speak about speaker diarization: first about speaker clustering using integer linear programming, and after that about speaker embeddings. So, the outline of my presentation. First I will speak about myself, my projects and my background. After that I will speak about speaker diarization: I will briefly review what speaker diarization is, and I will describe speaker diarization applied to broadcast news, not to meetings; there are some differences. So throughout this talk I present speaker diarization applied to broadcast news. Then I will speak about speaker clustering and present the ILP clustering method. The ILP clustering was very much appreciated by the community. In fact, when I submitted the paper to the conference, Jim and Patrick sent me emails: Jim proposed that I come to MIT to present this method, and Patrick wanted to reproduce all my results. And after the conference, many people asked me to release the report about ILP clustering. After that I will come back to speaker clustering and how to extract relevant information from speech segments using a deep neural network, a method I call speaker embeddings.

So, what about me? My name is Mickael, I am 33 years old, and I come from France. France is near England, Germany and Spain.

[laughter.]

>> Mickael Rouvier: I apologize because my English is not perfect. It is very bad. If you don't understand, please stop me and ask me to repeat. You can stop me during the talk to ask any question. So, I obtained my Ph.D. in 2011 in Avignon, where I worked on automatic video summarization. Then I spent two years in the computer science lab at Le Mans, working on automatic speech recognition and speaker diarization. After that I moved to the University of Marseille, where I work on extracting relevant information using deep neural networks. Sorry, I forgot to say: this is Avignon, Le Mans is located here, and Marseille is located here.

So I will present two projects that may be interesting to Microsoft. The first one is the REPERE project. This project consists in identifying the persons in a TV show. We can identify a person from the video, but also from the audio. Here, for example, if you want to identify this guy, you can use face ID or speaker ID, but you can also use other information: for example, the name of the guy is written here, or the name of the guy was spoken in the previous segment, and you can propagate all that information across the video. So during this project I worked on automatic speech recognition, on speaker diarization and on speaker identification, but I also worked on the propagation of information across all the modalities. So the second project is the BOLT project.
The core of this project is machine-mediated translation of conversations. The main idea was to go beyond the simple pipeline of automatic speech recognition followed by machine translation. One of the goals of this project was error detection: the machine must detect errors and ask the user to repeat the sentence. For example, if the user says "I like your beret" and the system doesn't know the word "beret", the system says: "You like what?" The user says: "Your hat." The system merges the first sentence and the second sentence, creates the sentence "I like your hat" and translates it as "J'aime votre chapeau." So I have worked in a lot of fields, but one of my goals has always been to extract relevant information from documents. I have worked in speech, image, natural language processing, machine learning and information retrieval; the numbers that you see here are the numbers of publications in each field. And now I try to work on extracting features using neural networks.

So, what about speaker diarization? The goal of speaker diarization is to answer the question: who spoke and when? Given an audio stream such as broadcast news, we can solve this problem in three steps. The first step is to detect the speech and non-speech segments. Then we take the speech segments and do a segmentation, so that each segment contains the voice of only one speaker. After that, we do clustering: we try to decide whether two segments come from different speakers or from the same speaker. In my presentation I focus on the last step, speaker clustering, and I propose a new method for optimal clustering.

In speaker clustering there are two kinds of approaches. The first is hierarchical clustering, like bottom-up or top-down, which at each iteration increases or decreases the number of clusters. The second kind is partitioning clustering, like K-means or mean-shift, which refines the partition at each iteration. Concerning the bottom-up approach: the bottom-up algorithm is an iterative algorithm that merges the two most similar clusters, but this approach has limits. First, it is a greedy algorithm: at each iteration it picks the locally best merge, but we are not sure that this locally best choice leads to the best solution at the end of the process. The second problem is that the bottom-up algorithm never reconsiders a decision: if an error is made, this error is propagated through the whole process.

The second family is partitioning clustering, for example the K-means algorithm. K-means tries to minimize an objective function, for example the distance between each point and its cluster center. There are some problems. The first is that the number of clusters is fixed, and in speaker diarization that is a problem because we do not know the number of speakers in the audio stream. The second problem is the initialization of the cluster centers: with a bad initialization we can fall into a local solution rather than the global solution. So I like the idea of partitioning clustering because there is an objective function, but we must tackle two problems: the first is the number of clusters, and the second is the initialization of the cluster centers.
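Before the ILP formulation that follows, here is a toy sketch of the greedy bottom-up baseline just described, with made-up vectors and Euclidean average linkage standing in for the PLDA/cosine scores used in the talk. The point it illustrates is that each merge is chosen locally and never reconsidered.

```python
import numpy as np

def greedy_bottom_up(segment_vectors, threshold):
    """Plain greedy agglomerative (bottom-up) clustering.

    Each segment starts as its own cluster; at every iteration the two
    closest clusters (average linkage) are merged.  Merges are never
    reconsidered, which is the weakness discussed above.
    """
    clusters = [[i] for i in range(len(segment_vectors))]

    def dist(a, b):
        # average linkage between two clusters
        return np.mean([np.linalg.norm(segment_vectors[i] - segment_vectors[j])
                        for i in a for j in b])

    while len(clusters) > 1:
        # find the closest pair of clusters (the greedy step)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:            # stopping criterion
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge, never undone
        del clusters[j]
    return clusters

# toy usage: six two-dimensional "i-vectors"
X = np.array([[0., 0.], [0.1, 0.], [0.2, 0.1],
              [3., 3.], [3.1, 3.], [3., 3.2]])
print(greedy_bottom_up(X, threshold=1.0))   # -> [[0, 1, 2], [3, 4, 5]]
```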
So, in order to tackle these problems, I propose to use integer linear programming. Integer linear programming (ILP) is a mathematical model: given an objective function and some constraints, we can obtain the optimal solution of the problem. So how does it work? We give an objective function that we want to minimize or maximize; the objective function is a linear function over integer variables, and we give some constraints that are also linear functions. After that we can use a solver, and the solver prunes fruitless candidates during the search for the best solution. So one of the goals is to express the clustering problem as an ILP problem. We assume that the center of a cluster is necessarily one of the points of the problem. So how do we express this problem with binary variables? I want to be able to say that these segments are in the same cluster and those segments are in another cluster. First we enumerate all the points, one, two, three, ... up to ten, and then we say, for example, that segment 4 is the center of the blue cluster and that segment 9 is the center of the red cluster, and we assume that all the points within a given threshold of a center belong to that center's cluster. So, in order to express this with binary variables, we create a matrix, this kind of matrix. So the rows --

>>: Is picking the center going to be part of the problem, or are you going to take the center separately, for example by considering each center and computing the mean distance to all the other --

>> Mickael Rouvier: No, no.

>>: So you couldn't do that? You are going to have to discover what the center is?

>> Mickael Rouvier: Maybe we will answer this question in two or three slides.

>>: Okay.

>> Mickael Rouvier: I will explain about the distance in the next slide. Here I want to explain how to express this problem with binary variables, and after that I will speak about the distance. So we want to fill this matrix. Here, this dimension is the segments and this one is the clusters. There are as many clusters as segments, because each segment could be a cluster on its own. Okay? So if I want to say that a segment is the center of a cluster, I just put a one here: segment four belongs to cluster four, and the same for nine, segment nine belongs to cluster nine. And if I want to say that this segment belongs to this cluster, I fill in a one here and a one here for that segment. And then I put zeros in the rest of the matrix. Okay, so we have expressed the clustering in terms of binary variables. Now I will present the ILP clustering model, but I will begin with the end: I will present the constraints first, and after that the objective function. The first constraint just says that a segment can belong to only one cluster: if we take the sum over the clusters, the sum must equal one, because one segment can belong to only one cluster. The second says that if segment five belongs to cluster four, then segment four must belong to cluster four; for example, if I put a 1 here, it is obligatory to put a 1 here. And the last and important constraint says that the distance between a segment and the center of its cluster must be below a threshold. This threshold is very important: if its value is very high, we can end up with only one cluster; if it is very small, every segment becomes its own cluster. So we have to search for the best value.
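Spelled out, the binary-variable encoding and the three constraints just described take roughly the following form; this is a reconstruction from the verbal description, so the notation on the slides may differ. Here \(x_{j,k}=1\) means segment \(j\) is assigned to the cluster whose center is segment \(k\), \(d(j,k)\) is the distance between segments \(j\) and \(k\), and \(\delta\) is the threshold:

\[
\begin{aligned}
& x_{j,k} \in \{0,1\} \\
& \sum_{k} x_{j,k} = 1 \quad \forall j && \text{(each segment belongs to exactly one cluster)}\\
& x_{j,k} \le x_{k,k} \quad \forall j,k && \text{(a segment may only be assigned to a segment that is a center)}\\
& d(j,k)\, x_{j,k} \le \delta \quad \forall j,k && \text{(a segment must lie within } \delta \text{ of its center)}
\end{aligned}
\]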
And then the objective function tries to minimize the number of clusters and to minimize the intra-class dispersion --

>>: -- so the representation is some kind of feature space and D is the distance?

>> Mickael Rouvier: Yes. The W here is an i-vector. I will speak about that later, but you can use whatever you want, an i-vector for example, or something else. What matters most is the distance between two segments. And X is a binary variable.

>>: And in the cluster, is K the center or the points of your cluster?

>> Mickael Rouvier: So this index is the segment and this one is the cluster. Does that answer your question?

>>: I'm trying to understand. So the segment and the cluster -- it is a one. It doesn't have to be small, does it?

>> Mickael Rouvier: It is like filling in the matrix.

>>: In your example, what makes XKN less than ... because you gave the example of row six, column four. So K is 6, 4 is column N. And then you said that requires row 4, column 4 ...

>> Mickael Rouvier: Yes.

>> Audience: That would be the --

[overlapping speech.]

>>: X at N, N, not X at K, N.

[overlapping speech.]

>> Mickael Rouvier: No, sorry, for your question it is the opposite: this is the cluster and this is the segment.

>> Audience: Oh.

>> Mickael Rouvier: I say that if the segment is in the cluster, then --

>> Audience: They are not matrix indices.

[overlapping speech.]

[laughter.]

>> Mickael Rouvier: Yeah, it's the opposite.

>> Audience: [indiscernible]

>> Mickael Rouvier: So we just minimize the number of clusters and the dispersion within each cluster. Yes?

>> Audience: What properties do you have to assume for your distance measure? Does it have to be a true distance measure, obeying the triangle inequality and so on?

>> Mickael Rouvier: This measure? In fact, it is the PLDA score for i-vectors.

>> Audience: I'm asking, can you substitute other distance measures in there? Is it part of the derivation that it has to be a distance measure, or can it be a pseudo-distance measure?

>> Mickael Rouvier: I don't understand. Sorry, I don't understand your question. It's my English.

>> Audience: You can have functions that behave somewhat like a distance measure but don't obey the triangle inequality property.

>> Mickael Rouvier: Oh, yes. The PLDA? I think you are asking: if there are three segments, A, B, C, and you calculate the distances A-B and B-C, does that constrain A-C? Is that your question or not?

>> Audience: Does it require that?

>> Mickael Rouvier: Okay, it does not require that, because I think the PLDA score does not follow this inequality, but I am not sure about that.

>> Audience: What is required for this derivation to work, for this system to work, for D?

[overlapping speech.]

>> Mickael Rouvier: You can use whatever distance measure you want.

>> Audience: Okay.

>> Mickael Rouvier: So I use the PLDA, but I have also used other distance metrics like cosine and others.

>> Audience: The property --

[coughing.] [indiscernible]

>> Mickael Rouvier: You can put in whatever you want. Yes, you can use whatever distance measure you want.

>> Audience: So is the third constraint kind of redundant with the minimization criterion?

>> Mickael Rouvier: Sorry?

>> Audience: The third constraint? Isn't that going to determine how many clusters you have, because if you set the --

>> Mickael Rouvier: So assume the threshold is very high.
>>: If you set D small --

>> Mickael Rouvier: Small? So there is --

>>: Then you ...

[overlapping speech.]

>> Mickael Rouvier: -- more clusters.

>>: No, wait, now I'm confused. With D small, the third constraint might not be satisfiable, right?

>> Mickael Rouvier: So I can say the distance between two segments ... yes.

>>: It gets harder as I make D smaller.

>> Mickael Rouvier: Yes, because --

[overlapping speech.]

>>: Then this may force more clusters. But isn't that redundant with what's going on in the minimization? If you put a very high weight on that second term in the minimization, it would effectively do the same thing as the third constraint. Why do you need that hard constraint there?

>> Mickael Rouvier: You can remove the constraint and remove the corresponding variables from the program. You can remove all of this --

>>: [indiscernible]

>> Mickael Rouvier: You can do that. And in fact, when we do cross-show speaker diarization -- by cross-show I mean that we do the diarization of several shows together and we try to detect the recurring speakers -- when we do this, we remove this constraint and we modify the program in order to speed up the process.

>>: But the point of putting the test [indiscernible]

>> Mickael Rouvier: Yeah, yeah.

>>: So is this constraint only --

[overlapping speech.]

>>: -- or is it also aimed at improving accuracy?

>> Mickael Rouvier: The accuracy is the same.

>>: Oh, so the --

[overlapping speech.]

>> Mickael Rouvier: We can just remove it and the accuracy is the same.

>>: Okay, so that's fair enough. Okay.

>> Mickael Rouvier: It is just to speed up the resolution.

>>: I don't understand why the accuracy would be the same. The final number of clusters will be determined by the delta you choose, right?

>>: It will be constrained by the delta.

>> Mickael Rouvier: It is not the number of clusters -- well, the number of clusters is determined by this threshold.

>>: Right. So if you choose a different threshold, then the number of clusters you get is different, right?

>> Mickael Rouvier: Yes, if you change the value, the number of clusters is different.

>>: [overlapping speech.]

>>: Therefore, if you remove that constraint entirely, the number of clusters will be different.

>> Mickael Rouvier: Yes.

>>: Therefore, the accuracy will be different.

>>: Well, I don't think --

>>: You are saying the constraint is not really needed to get the best possible accuracy, it is just needed to make things faster?

>> Mickael Rouvier: No. Removing this constraint is only to speed up the process, but then we also need to remove the corresponding variables from the program. Okay?

>>: Oh, I see.

>> Mickael Rouvier: We cannot remove this constraint without removing those variables.

>>: Okay. How will --

[overlapping speech.]

>> Mickael Rouvier: For example, if the distance for a pair is above the threshold, we remove that variable from the program.

>>: Right, right, yeah.

>> Mickael Rouvier: And then the accuracy is the same, but we have removed many constraints, so the resolution is faster.

>>: Okay.

>>: Oh, but you change the sums; the sums are no longer over everything.

>> Mickael Rouvier: Yes, we change --

[overlapping speech.]

>> Mickael Rouvier: -- the whole formulation.

>>: I got it, yes, okay.
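Putting the preceding discussion together, here is a minimal sketch of the ILP model in Python using the PuLP modelling library. The talk used the GLPK solver directly; the 1/δ weight on the dispersion term and the pruning of pairs whose distance exceeds δ are illustrative choices rather than the exact recipe from the slides.

```python
import numpy as np
import pulp

def ilp_cluster(D, delta):
    """D: (N, N) symmetric matrix of distances between segment i-vectors.
       delta: maximum allowed distance between a segment and its cluster center."""
    N = D.shape[0]
    prob = pulp.LpProblem("speaker_clustering", pulp.LpMinimize)

    # Binary variable x[j, k] == 1 iff segment j is assigned to the cluster
    # whose center is segment k.  As discussed above, pairs with
    # D[j, k] > delta are simply never created, which shrinks the problem.
    x = {}
    for j in range(N):
        for k in range(N):
            if j == k or D[j, k] <= delta:
                x[j, k] = pulp.LpVariable(f"x_{j}_{k}", cat="Binary")

    # Objective: number of cluster centers plus weighted intra-cluster
    # dispersion; the 1/delta weight is one common choice, not the only one.
    prob += (pulp.lpSum(x[k, k] for k in range(N))
             + (1.0 / delta) * pulp.lpSum(float(D[j, k]) * x[j, k]
                                          for (j, k) in x if j != k))

    # Each segment belongs to exactly one cluster.
    for j in range(N):
        prob += pulp.lpSum(x[j, k] for k in range(N) if (j, k) in x) == 1

    # A segment may only be assigned to k if segment k is a cluster center.
    for (j, k) in x:
        if j != k:
            prob += x[j, k] <= x[k, k]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))   # or pulp.GLPK_CMD() if installed
    centers = [k for k in range(N) if x[k, k].value() > 0.5]
    labels = {j: next(k for k in centers
                      if (j, k) in x and x[j, k].value() > 0.5)
              for j in range(N)}
    return labels
```

With the labels in hand, segments sharing a center index form one speaker cluster; keeping the pruned variable set while dropping the explicit distance constraint corresponds to the speed-up variant discussed in the exchange above.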
>>: So lots of variations here.

>> Mickael Rouvier: So, now my experiments. We use our own corpus of French broadcast news. The evaluation set is REPERE 2013. The setup is classical: we use a UBM of 1024 Gaussians, i-vectors of dimension 200, length normalization and PLDA scoring. The metric is the diarization error rate, which is in fact a combination of three errors: missed speech, false alarm and speaker error. For the speaker diarization I use our own toolkit, and for the ILP I use the GLPK toolkit. So first we compare the bottom-up algorithm -- sorry, I just want to mention that these are the names of the shows in the corpus and this is the diarization error rate. We compare the baseline bottom-up system with the bottom-up using i-vectors and we obtain a gain, and then we compare the bottom-up with ILP clustering and we obtain a further gain with ILP.

>>: I have a quick question about using the i-vectors to compare these two classes. Is that because your test corpus has longer segments, so you can get a reliable i-vector score? In meeting conversations each segment is too short.

>> Mickael Rouvier: Okay. So in broadcast news the durations of the segments are around one to two seconds.

>>: Aha. Short.

>> Mickael Rouvier: Yes, it is very short.

>>: You are basing your i-vector on one to two seconds' worth of speech?

>> Mickael Rouvier: Yes, it is very short. We obtain [indiscernible].

>>: The i-vector is computed -- there is a separate i-vector for each segment?

>> Mickael Rouvier: Yes. Yes, yes.

>>: So can I understand what the meaning of HAC is there?

>> Mickael Rouvier: HAC?

>>: Hierarchical.

>>: Hierarchical agglomerative clustering ...

>> Mickael Rouvier: It is bottom-up.

>>: But there are other popular clustering algorithms there.

>> Mickael Rouvier: This is the classical approach for broadcast news, the baseline.

>>: No, there are many other clustering algorithms?

>> Mickael Rouvier: Yes, yes, but for broadcast news the standard approach is to use the bottom-up algorithm.

>>: Okay. So what is the training criterion? I'm sorry, the evaluation criterion here?

>>: Is this the three-error rate?

>> Mickael Rouvier: Diarization error rate.

>>: Diarization error rate, okay.

>>: I forget. [coughing.] I should probably know, but does the diarization error rate include the speech detection accuracy?

>> Mickael Rouvier: Yes, yes, it includes the speech detection, but in fact, in this comparison the segmentation is the same because we only change the speaker clustering method. So yes, the diarization error rate combines the three errors, but in fact we are only comparing one of them.

>>: [overlapping speech.]

>>: Segmentation is --

>> Mickael Rouvier: For the clustering, we are in fact comparing the speaker error. But yes, I report the diarization error rate.

>>: Okay.

>>: So for that clustering to achieve this, do you need to adjust the delta in the constraint?

>> Mickael Rouvier: Adjust, sorry?

>>: You have the constraint with the delta, right? The threshold.

>> Mickael Rouvier: Sorry, I don't understand.

>>: The third constraint that we talked about. Do you need to adjust the delta for this result?

>> Mickael Rouvier: Yes, yes. I didn't mention it, but we have a development set and we also optimize the threshold on the development data.
The thresholds are optimized on the development corpus, and then we apply those thresholds on the test corpus. Is that your question?

>>: Yes.

>>: Yes.

>> Mickael Rouvier: Yeah, I have a curve and I forgot to present it, but yes, I have a curve that shows the diarization error rate as a function of the threshold.

>>: Okay. So my question is, did you adjust the number of clusters in the other approaches --

[overlapping speech.]

>> Mickael Rouvier: For the ILP, in fact, I search for the optimal threshold on the development corpus.

>>: No, I am not talking about that -- I'm talking about the HAC i-vector system.

>> Mickael Rouvier: Yes.

>>: Did you tune it too?

>> Mickael Rouvier: Yes. I did not try to search for the best solution on the test corpus, I just --

[overlapping speech.]

>>: -- development.

>> Mickael Rouvier: Yes, it is optimized on the development set.

[overlapping speech.]

>>: On the development set.

>> Mickael Rouvier: Yes. But your question was whether the threshold was optimized on the test set?

>>: No, no.

>> Mickael Rouvier: No?

>>: Because you have these parameters, and the values of the parameters will affect the final result. In the other approach you can also adjust, for example, the maximum distance between two clusters, and by doing that you can also control the number of clusters you generate. So that will also affect the results in your baselines.

>> Mickael Rouvier: Yes.

>>: So I'm wondering why you don't do the same for --

[overlapping speech.]

>> Mickael Rouvier: No, no, all the systems have their parameters optimized on the development corpus.

>>: All right.

>>: So it is a fair comparison.

>> Mickael Rouvier: [chuckling.] Yes, on the same development corpus.

>>: You need an initial segmentation, right? You need to generate the segments that you submit to the clustering. How did you generate those, and were they the same for all your different methods?

>> Mickael Rouvier: The segmentation was produced by the same speaker diarization front-end. So there is --

>>: Sorry, is that some kind of -- it's just a speaker change detector, right?

>> Mickael Rouvier: Yes.

>>: You basically look for speaker changes, and whatever comes between two changes is one of the segments that you feed to your --

[overlapping speech.]

>> Mickael Rouvier: Yes, and then I do the speaker clustering.

>>: And it was the same for all?

>> Mickael Rouvier: Yeah, yeah.

>>: So you didn't do what is done in some diarization systems, where you generate clusters and then use those clusters to re-segment the data iteratively, right? So you have some segmentation at some stage in your algorithm.

>> Mickael Rouvier: Yeah.

>>: You build an HMM with the cluster models and use it to re-segment the data, then you redo the clustering and so forth, so you iteratively re-segment and re-cluster. You didn't do anything like that? You just found the speaker boundaries once and then you simply do the clustering based on that?

>> Mickael Rouvier: Yeah, yeah. I know that in meetings you do that.

>>: For broadcast, that is enough? I mean, that's sufficient? You don't have to redo the segmentation?

>> Mickael Rouvier: Yeah, you don't need to do that in broadcast news.

>>: I see. Okay.

>> Mickael Rouvier: But it is fair to say that in meetings you do that.

>>: Okay, okay.
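For reference, the diarization error rate used in all of these comparisons combines the three error types mentioned earlier in the usual way:

\[
\mathrm{DER} \;=\; \frac{T_{\text{missed speech}} + T_{\text{false alarm}} + T_{\text{speaker error}}}{T_{\text{total reference speech}}}
\]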
>> Mickael Rouvier: So yes, to finish: the thresholds were optimized on the development corpus. Now I will speak about speaker embeddings. One of the limits is that --

>>: Where has that gone?

>>: Are you using it now? Is it adopted at Marseille? Or at Avignon?

>> Mickael Rouvier: The ILP clustering was developed at Le Mans.

>>: Who is using it now?

>> Mickael Rouvier: The ILP clustering? There are some guys at IBM, LIUM uses it, Orange, and I know that Najim uses it and Patrick also uses it.

>>: -- maybe try it. [chuckling.]

>>: The possibility of ... can you go back to the last slide?

>> Mickael Rouvier: Yes.

>>: So the first question: based on your initial speaker change detection, what is the best error rate you can get?

>> Mickael Rouvier: So if --

>>: Based on your speaker change detection, what is the best error rate you can get?

>>: What is the best achievable?

>> Mickael Rouvier: Okay.

>>: If you had an oracle to tell you the --

>> Mickael Rouvier: Yeah, yeah.

>>: The error rate. The cluster rate.

>> Mickael Rouvier: Each step has an error -- I don't know. I did that test one time, but I never reproduced it on this corpus.

>>: That's the question.

>> Mickael Rouvier: I know that if we obtained the best clustering with this segmentation on this corpus, I think it would be around 10 percent diarization error rate. It is 10 percent because there are problems of speech overlap in this kind of corpus that the system does not take into account. So there are some problems. Is that your question?

>>: Yes. The second question is actually, how fast is the runtime compared between column two and column three?

>> Mickael Rouvier: ILP is very fast. It is faster than the bottom-up algorithm. With bottom-up and i-vectors, you must recalculate the i-vectors at each merge, while with ILP you calculate all the i-vectors once and then you do the clustering.

>>: What package do you use to solve your program?

>> Mickael Rouvier: GLPK. It is online, it is free, and it is very fast. We did some tests on around 150 hours of shows and it took around ten minutes. I'm not sure about the exact time, but it is around ten minutes to solve this kind of problem. It's very fast because ...

>>: [speaker away from microphone.]

>> Mickael Rouvier: Because we remove this constraint, we remove these binary variables, and third, we do not solve the problem in one piece: we separate the problem into subproblems. I don't explain that here, but I can explain later if you want.

>>: So what is the scaling with the number of clusters? Sorry, if you double the number of segments, how does the runtime of the clustering --

>> Mickael Rouvier: Okay. It is very, very fast.

>>: Right. But I mean theoretically speaking, is it quadratic or linear?

>> Mickael Rouvier: I don't know. No, it is linear.

>>: But the number of variables is quadratic, right?

>> Mickael Rouvier: Yes, but as you use the PLDA metric you can remove a lot of binary variables. I agree that this matrix is quadratic in the number of segments, but as you use a discriminant distance you can remove a lot of binary variables and just focus on the important ones. So I would say it is not quadratic; it is more like linear.

>>: Oh.
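A quick numerical illustration of the pruning just mentioned, with cosine distance standing in for the PLDA scores and random vectors as stand-in data: the candidate (segment, center) variables grow quadratically with the number of segments, but only the pairs within the threshold δ need to be instantiated in the program.

```python
import numpy as np

def surviving_pairs(ivectors, delta):
    # cosine distance between length-normalized vectors
    X = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
    D = 1.0 - X @ X.T
    kept = np.count_nonzero(D <= delta)   # variables actually created
    total = D.size                        # the naive quadratic count
    return kept, total

rng = np.random.default_rng(0)
ivecs = rng.normal(size=(500, 200))       # 500 segments, 200-dim vectors
print(surviving_pairs(ivecs, delta=0.3))  # only a small fraction survives
```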
>> Mickael Rouvier: So now I want to speak about speaker embeddings. One problem with i-vectors: i-vectors with PLDA obtain very good results in the NIST evaluation campaigns, but we know that --

>>: Which evaluation was that? Broadcast news again?

>> Mickael Rouvier: No, no, I am speaking generally about speaker verification. Just an overview.

>>: Speaker identification? Oh, yeah, yeah, yeah.

>> Mickael Rouvier: But we observe that on short segments the performance degrades rapidly. There is a lot of work that tries to tackle this problem, for example in the PLDA scoring, by normalizing the i-vectors with respect to the duration of the segment. The i-vector is extracted from the total variability space; I propose instead to extract features in the speaker space. In order to do that, I propose to learn this feature representation with a deep neural network, where the task of the deep neural network is speaker identification. The input is the first-order Baum-Welch statistics: we take a segment, we take a UBM, we compute the statistics and we normalize them by the mean and covariance of the UBM. We give that as the input of the deep neural network, and the output is the speaker identity. For example, if the segment belongs to speaker 3, the target is 0, 0, 1, 0, 0, ... Then I propose to use one of the hidden layers as the new representation: I reduce the number of neurons in this hidden layer and take it as the representation, and I call this the speaker embedding.

Just to illustrate: normally speaker embeddings are vectors, but for the presentation I display them compactly, and we can see that there are some patterns in common between the training and test corpora. Just for information. So one of my goals is to replace the i-vector in the i-vector PLDA pipeline. One problem is that the speaker embedding is built without any Gaussian assumption.

>>: Why is that a problem?

>> Mickael Rouvier: It is a problem because PLDA assumes that the data are Gaussian. Here, every color represents a speaker -- I take four speakers -- and every point represents a segment, and we observe a radial distribution of the data. So it is clear that cosine scoring, which focuses only on the direction of the vectors, works, but the scoring must be adapted to this kind of data. So I tried a technique that was used in speaker verification: whitening normalization. This technique makes the data artificially Gaussian by removing the mean and multiplying by the whitening matrix.

For the experiments I use the same background training data as in the previous experiments, but the evaluation set is ETAPE 2012. I use the same UBM; the i-vectors are of dimension 200, and the speaker embeddings are of dimension 500. The i-vectors are whitened with two iterations, the speaker embeddings with one iteration. For the speaker diarization we use the same toolkit; for the DNN we use the Kaldi toolkit.
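A minimal sketch of the embedding extractor and whitening just described, written in PyTorch. The layer sizes, sigmoid activations, optimizer and toy data are illustrative assumptions; the actual system was built with Kaldi on normalized first-order Baum-Welch statistics.

```python
import torch
import torch.nn as nn

class SpeakerEmbeddingNet(nn.Module):
    """MLP trained for speaker identification; the last hidden layer
    is kept as the 'speaker embedding'."""
    def __init__(self, stat_dim, num_speakers, hidden=500):
        super().__init__()
        self.hidden_layers = nn.Sequential(
            nn.Linear(stat_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),  # last hidden layer = embedding
        )
        self.classifier = nn.Linear(hidden, num_speakers)

    def forward(self, stats):
        return self.classifier(self.hidden_layers(stats))

    def embed(self, stats):
        return self.hidden_layers(stats)              # speaker embedding

def whiten(embeddings):
    """Whitening as described in the talk: subtract the mean and multiply by
    the inverse square root of the covariance before PLDA scoring."""
    mu = embeddings.mean(dim=0)
    centered = embeddings - mu
    cov = centered.t() @ centered / (embeddings.shape[0] - 1)
    eigval, eigvec = torch.linalg.eigh(cov)
    W = eigvec @ torch.diag(eigval.clamp(min=1e-8).rsqrt()) @ eigvec.t()
    return centered @ W

# toy training step on random data, just to show the mechanics
stat_dim, num_speakers = 1024 * 39, 100   # e.g. 1024 Gaussians x 39 features (assumed)
model = SpeakerEmbeddingNet(stat_dim, num_speakers)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

stats = torch.randn(32, stat_dim)         # a batch of normalized segment statistics
labels = torch.randint(0, num_speakers, (32,))
opt.zero_grad()
loss = loss_fn(model(stats), labels)
loss.backward()
opt.step()

embeddings = whiten(model.embed(stats).detach())   # whitened embeddings for clustering
```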
The activation function is [indiscernible] and we use three hidden layers. First I compare on the different shows, and this is the overall result on the test data set. In the first experiment I just compare i-vector PLDA with speaker embeddings and cosine scoring, and we observe a gain. Then I normalize the speaker embeddings with the whitening normalization and use these normalized speaker embeddings with PLDA: same results. If I use speaker embeddings with no normalization and PLDA, we obtain worse results. If we use speaker embeddings with one whitening iteration, we obtain a further gain; with two or three iterations it doesn't work.

>>: Why is the [indiscernible] different from the previous slide?

>> Mickael Rouvier: Because on the last slide I used cosine scoring and on this slide I used the PLDA metric.

>>: Oh. [speaker away from microphone.]

>> Mickael Rouvier: So PLDA doesn't work here because there is no Gaussian assumption.

>>: Did you apply the whitening to the i-vectors?

>> Mickael Rouvier: Yeah, yeah, the i-vectors have the whitening normalization.

>>: Already?

>> Mickael Rouvier: Yeah, they are normalized.

So, for the conclusions and perspectives. For the ILP clustering, one conclusion is that ILP clustering explores more clustering solutions than the greedy bottom-up algorithm. The systems I developed with ILP were submitted to the ETAPE 2012 and the REPERE 2012 and 2013 evaluations, and we obtained the best results in single-show and cross-show diarization. One of my perspectives with ILP clustering is to make a joint model of speaker clustering and speaker identification. By joint model I mean that when you do the clustering, you also do identification in order to know the names of the speakers, and you do both jointly at the same time; maybe we can do that with the ILP clustering. For the speaker embeddings: the features are extracted in the speaker space. I don't present the work here, but we also tested them in automatic speech recognition: when we train the acoustic models we usually add the i-vector; we replaced the i-vector with the speaker embedding, and it works better than the i-vector. We also tried speaker embeddings in the speaker identification task, and they work better than i-vectors, and they also work very well for the speaker diarization task. One of my perspectives is to investigate the use of different input features: for example, we can use posterior probabilities given by a DNN instead of the UBM to build the input vector. And the last perspective is to create a DNN that automatically learns the input representation; one approach is to use convolutional neural networks, as in text classification. Thank you.

[applause.]

>> Geoffrey Zweig: So we have time for a couple of questions.

>>: So in the ILP case ... there are a few parameters that empirically favor some number of clusters. I guess the question is, is there a way in the framework to incorporate prior knowledge of that kind?
Without having to [indiscernible]. So, for example, let's say you have some collection of data where you know that the typical number of speakers per cluster is between five and 15, and there's another data set where it's between one and five, but each time you have a case like that, you don't want to have to retune all the hyperparameters on a separate data set. If you had a generative model approach to clustering, there might be a way to put a prior on that. Is there something like that in the ILP?

>> Mickael Rouvier: I'm not sure I understand your question. So you have one data set with one to five speakers and another data set with ten to 15 speakers, and you want to adapt your system?

>>: To get the best performance on one set you would have one set of hyperparameters.

>> Mickael Rouvier: Yeah, yeah.

>>: And on a different set you would have a different set of hyperparameters, but you don't want to do that every time. Is there a way to --

>> Mickael Rouvier: If you know a range for the number of clusters, you can add that to the constraints of the problem. You can say, okay, for this kind of problem there are between one and five clusters, and you can add this kind of constraint in the ILP. But for me, the threshold that you learn in fact represents an average distance between speakers, so for me it is close to optimal whatever corpus you use.

>>: So that should be independent of --

>> Mickael Rouvier: Yeah.

>>: -- of the corpus, since it is an overall property of the clusters?

>> Mickael Rouvier: Yes, it is a property of the speakers, in fact.

>>: So do you observe that in your tests? You have maybe ten different shows, right? Did you optimize that for every show individually?

>> Mickael Rouvier: Yes. In fact, among all these shows, this one is very different because it is a pop-star show, which explains why we obtained very bad results on it. I tried to optimize my threshold on all the corpora except this one, and the other way around, and in fact the threshold is practically the same.

>>: Okay.

>> Mickael Rouvier: So I take all the corpora and I apply the same threshold to all of them.

>>: So it is not the threshold; then why is that one so much worse? Is it the i-vectors or not?

>> Mickael Rouvier: No, it is the same configuration. There is a lot of music. The format is five minutes, a lot of music, and it is not usual like --

>>: Is it that there's a lot of back and forth between people, short turns?

>> Mickael Rouvier: Yeah.

>>: Because this goes back to the question, would this work on meetings?

>> Mickael Rouvier: Yeah.

>>: Where you have much shorter turns and you don't have segments that are long enough to estimate stable i-vectors?

>>: [indiscernible]

>> Mickael Rouvier: No.

[overlapping speech.]

>> Mickael Rouvier: I removed this test from the [indiscernible]. Sorry, I can show you, but the segments are very short and there is a lot of music. In fact, the segmentation is very bad: it says that there is one segment but it contains several speakers, and sometimes we also have problems with speech/non-speech detection, saying that there is non-speech while a speaker is speaking.
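Returning to the exchange above about priors on the number of speakers: in the ILP framework such knowledge could be expressed directly as an extra linear constraint on the center indicators, for example

\[
5 \;\le\; \sum_{k} x_{k,k} \;\le\; 15
\]

for a corpus known to contain between five and fifteen speakers (the bounds here are just the hypothetical values from the question).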
>>: So on a different topic: this is a very general way of doing clustering, this ILP insight.

>> Mickael Rouvier: I hope it is considered [indiscernible].

>>: So what else have people been clustering with ILP?

>> Mickael Rouvier: What else?

>>: Is this the first application of ILP to clustering? Or has it been used, say, to cluster documents previously?

>> Mickael Rouvier: No, no, I don't think it has been used for documents. I tried this kind of approach and I have not seen it used in other things.

>>: It never has been used in other fields?

>> Mickael Rouvier: Yeah, yeah, it is only in speaker diarization.

>>: Okay. So that's a real innovation. Have you seen it applied to other areas since, or do you think it would be applicable to other areas? It seems like anything where you currently use K-means clustering, you could use this.

>>: No, I doubt it would do much better than [indiscernible] in many cases.

>> Mickael Rouvier: When I explained the REPERE work and when I did the publication on identification, I used the ILP in order to propagate all the identities, and it is based on my ILP clustering. So we can apply it to other tasks like that, I don't know.

>>: Yeah, yeah, it is not impossible.

>>: It can be a limitation that it requires a distance, a true distance.

>>: But isn't HAC essentially K-means clustering with merges?

>>: No, it is hierarchical.

>>: Because you don't know the number of clusters.

>>: You just merge.

>>: Right, you don't iterate, you only go up [indiscernible].

>> Mickael Rouvier: In hierarchical clustering you don't iterate, you only merge.

>>: So it is very --

[overlapping speech.]

>>: You have the stopping problem, right? With the hierarchy, you have to have a separate criterion for knowing when to stop clustering. The general problem in hierarchical clustering is knowing when to stop, whether you have the right number of clusters, which is somehow related to the true number of speakers in the case of diarization. So in your system, with ILP, what is the corresponding parameter? How do you control how aggressively you want to cluster things together?

>> Mickael Rouvier: With the threshold.

>>: Is it the delta?

>> Mickael Rouvier: Yeah, yeah, yeah.

>>: It's the delta.

[overlapping speech.]

>>: -- the goal?

>>: You're right.

>>: That's the ambiguity I was talking about before.

>>: Okay.

>>: I think you can get rid of the third thing. There's the weight parameter.

>>: That's an efficiency thing. You just use the weight; the delta is to rule out, to eliminate the variables. They both ...

>>: [speaker away from microphone.]

>>: Exactly.

>> Mickael Rouvier: To eliminate -- you can eliminate this constraint, but then you eliminate all the corresponding binary variables and you eliminate [indiscernible]. So you must still tune the delta parameter, and you compare the clusters with that.

>>: And you have a separate data set to tune those?

>> Mickael Rouvier: Yeah, yeah. A different set to choose this parameter, and after that we apply it on the test.

>>: And have you tried to re-optimize it? How far from the optimal value were you after you tuned those parameters? Have you [indiscernible] it out?

>> Mickael Rouvier: So for the two corpora, ETAPE and EPAC, in fact the best parameter on the test and on the development corpus is the same.

>>: Really? Okay.

>> Mickael Rouvier: But maybe on another corpus it would differ.

[overlapping speech.]
>>: It would be interesting to try it on English broadcast news or something, to see if the same parameters would hold.

>> Mickael Rouvier: Yes.

>>: About this criterion you minimize: what you minimize is a very generic criterion. I think it's very similar to all the other clustering approaches, right? It's a number-of-clusters term plus some distance term. So is your method not about the criterion, but about finding the correct solution? Is it the optimization that matters, or the criterion itself?

>> Mickael Rouvier: Good question. For me it is the optimization: I want to optimize this objective, and the criterion is just there to constrain, to say that the segments assigned to a speaker model --

>>: Yes, I understand.

>> Mickael Rouvier: -- all belong to the same speaker.

>>: My question is: the same criterion, the first formula, you could also optimize with K-means, for example, right? I think.

>> Mickael Rouvier: Yeah, yeah, we can do that, but there is the problem of the initialization of K-means, and this is a problem. I tested K-means, and I ran it several times and got different results [overlapping speech.] from the initialization, but with the ILP we obtain the same result every time.

>>: You actually did that experiment?

>> Mickael Rouvier: Yes, I did. And after that I tried different objective functions: I tried to minimize the intra-class dispersion and maximize the inter-class dispersion, and I tried an objective function that maximizes the number of speakers and minimizes the dispersion [indiscernible]. But the best results were obtained with this kind of objective.

>>: So you could also, I think, apply the same criterion for HAC, right? The hierarchical clustering?

>> Mickael Rouvier: Yes, we can, but the problem with hierarchical clustering is that if we take a bad decision at one point, it is propagated.

>>: Right. So what you're saying is that your solution is really a search algorithm. That is what you are solving: you have a better way of solving the same equation with the same criterion.

>>: [speaker away from microphone.]

>>: So can you say a little more about the cross-show diarization? Is that done by just putting two shows together as one big broadcast, one after the other?

>> Mickael Rouvier: Yeah. So it is done in several steps. In single-show you run the program as I explained. In cross-show, you take show A and show B and in fact you want to detect the recurring speakers. For example, if you treat show A and show B separately, you cannot detect that this speaker and this speaker are the same; or you might, for example, label this speaker and this speaker the same, but they are not the same speaker.

>>: So it tends to be mislabeled?

>>: It is still ...

[overlapping speech.]

>>: When you say same identity, you mean the same label.

[overlapping speech.]

>> Mickael Rouvier: The same identity is the same label.

>>: They are both speaker one?

>> Mickael Rouvier: Yes, yes. So in cross-show you have a new step, a fourth step: it is a clustering, but a global clustering over all the shows. So you treat the shows separately, and after that you add a fourth step, a global clustering that takes all the data. Then you can detect the recurring speakers, and say whether this speaker and that speaker are the same or not. We tried to do that.

>>: But you are not really re-examining the identity decisions you make within the show.

>> Mickael Rouvier: Sorry?
>>: So when you do the cross-show clustering with ILP, do you reconsider the identities within the same show?

>> Mickael Rouvier: Yes. In fact, not to fix them but to reorganize; reorganizing works better.

>>: Okay.

>> Mickael Rouvier: We try to fix the clustering and we try to reorganize some --

>>: Sure.

>> Mickael Rouvier: So we tried both experiments, because I think that in the first clustering we miss some things.

>>: [speaker away from microphone.]

>>: By the way, do you have an explanation why you have this radial distribution? I don't understand that.

>> Mickael Rouvier: Sorry?

>>: The radial distribution.

>> Mickael Rouvier: So on the radial --

>>: The radial distribution of the i-vectors. Do you know why? Do you have an intuition for it?

>>: Can you show the picture?

>>: Yes.

>> Mickael Rouvier: So this is the radial distribution of the speaker embeddings.

>>: So have you seen that before? [indiscernible] gave a talk.

>>: This is not some ...

[overlapping speech.]

>>: Some years ago, and it was like showing word embeddings or document embeddings. He made a big deal about, oh, look at this, it's radial.

>>: Okay, so why is it radial?

>>: Wait, this is the distribution per class.

[overlapping speech.]

>> Mickael Rouvier: I don't know. I don't know why it is a radial distribution, but we examined it. In fact, the first experiment was to know whether the speaker embeddings were Gaussian or not, and in the end we observed this radial distribution. But we don't know why.

>>: But this is --

>> Mickael Rouvier: This is the radial distribution of the speaker embeddings.

>>: But this is like the first two dimensions? Is this a projection?

>> Mickael Rouvier: Yeah, yeah. We take four speakers and their segments and we use PCA to project to two dimensions.

>>: Doesn't this -- isn't this just from not incorporating the normalization directly in the objective function?

>> Mickael Rouvier: In the objective function of the deep neural network?

>>: Yes.

>> Mickael Rouvier: Yes, maybe.

>>: Also, does the length have to do with the length of the segment? How far you are from the center?

>>: Maybe.

>>: Is that related to the length of the segments?

>> Mickael Rouvier: I don't know, but maybe we could have some normalization between the [indiscernible].

>>: No, because the input is mean- and variance-normalized, right? The input to your network is the Baum-Welch statistics; it's actually the same statistics used for i-vector extraction.

>> Mickael Rouvier: It is exactly the same.

>>: Basically, that is why they converge to the same point. You have a point ...

[overlapping speech.]

>>: But there has got to be some physical correlate to the distance from the origin point.

>>: Is that zero?

>> Mickael Rouvier: I am not sure. No, it is not zero, but I don't know, maybe ... I would have to look at the [indiscernible].

>> Geoffrey Zweig: Okay. Let's thank the speaker again.

[applause.]

(End of file.)