>> Ran Gilad-Bachrach: So it’s a great pleasure to introduce a …
>>: [indiscernible]
>> Ran Gilad-Bachrach: … Emmanuel Dupoux [indiscernible] working with Microsoft. Emmanuel received
his PhD in cognitive psychology from the Ecole Normale Supérieure in Paris, but he also has an
engineering degree in telecom, and therefore his research kind of bridges between cognition and
communication, and he focuses on language acquisition in little kids. He’s published many dozens of papers; I
didn’t—you know—count [indiscernible]; it was too big of a number for me—but
one interesting thing that I found out is that his book has been translated into English, Chinese, Greek,
Italian, Japanese, Portuguese, and Spanish, but if you want to, you can also read it in French in its
original form. [laughter] Don’t worry about it. So without further ado, I’ll hand it over to you.
>> Emmanuel Dupoux: Okay, thank you very much. So thank you for organizing all this and inviting me
here. So yeah, so I would like to talk to you about … it’s an old project of mine that I have been pursuing
for a long time, but most of my career was done in studying infant language acquisition, but I always
wanted to try to make the link between this field, which is a sub-part of cognitive science, and
engineering, which was sort of a part of my initial training. So the motivation is like this: if you look at how
machines learn, typically, in the standard paradigm of machine learning, you start with some
data—okay—you have an algorithm, and then you basically are using human experts to generate, on the
same data, lots and lots of labels, which you use then to train your algorithms to reproduce the target,
okay? So that’s the standard paradigm in machine learning. If you look at how humans are doing
learning, it’s quite different. You still get lots of data—okay—even, maybe, more
data, but basically, everybody is confronted with that data, and humans are
interacting with one another, and they somehow manage to learn together, okay? One other thing
that’s happening on top there is that they do exchange information—sort of feedback information—
but it’s much, much less … the amount of data that you get—the bitrate—is much lower than what you
get when you do the machine learning part. So that means that if you want to build a machine that
would learn just like the human, you would have to radically change the paradigm and try to basically
build systems that would learn in a sort of weakly-supervised way. I’m not saying it’s completely
unsupervised … I’m just saying it’s weakly supervised; and what you get is basically ambiguous and sparse
signals to learn from.
Okay, so I’m going to focus on the case of speech recognition—okay—because that’s what I’ve been
studying mostly—how babies are learning languages. And so in a typical HMM-plus-deep-network architecture, that’s
what you get: you get speech—hours of speech—and then you encode that into some kind of features;
you train your system provided with a lot of labels, okay? Now, of course, that’s—even from an
engineering point of view—it’s a problem if you move to languages that have low resources, okay? So
this is the rank of languages with the number of speakers, and so these languages are pretty well-studied; there are lots of resources; you get TIMIT, Switchboard, and LibriSpeech—lots of hours that have
been carefully annotated—but then you go to most other languages, and you don’t have that; in fact, half of
the languages in the world don’t even have an orthography to start with. So that’s a problem if you want to
build speech technology for these languages, and there’s another problem. Of course, here I noted that
to gather high-quality linguistic annotation is costly, but also, it’s not even certain that it’s done right. In
fact, when phoneticians annotate speech or build dictionaries, are we sure that they are really doing the
right thing? We don’t know, actually. In some areas of linguistics, there’s a bit of a
debate about which labels you should use, et cetera, et cetera. So that’s another issue; it’s really not
much discussed, but I think it’s an issue.
So what we wanted to do was take the extreme case—completely extreme case—on the other end of
the spectrum: learning speech from scratch, okay? So that’s the so-called zero-resource challenge; zero
resource doesn’t mean that you have no … nothing at all; you have the speech—okay—but you have no
label, okay? And so for a completely unknown language, what you have to do is to learn acoustic
models and learn the pronunciation lexicon, okay? So those are the two tracks. Now, it may seem a little
bit extreme, but we know that human infants are doing something like that during the first year of life,
okay? So during the first year of life, they don’t really talk much—in terms of uttering words
and sentences—so they don’t get a lot of feedback from their parents about whether what they are saying is
correct or not, but if you look at what we know now in terms of what they understand, experiments
have shown that they start to understand … to basically build acoustic models for the vowels and
consonants in their language, and this starts around six months, and it seems
to be almost over—not completely over, but certainly well advanced—at the end of that first
year of life. They also learn a lot of prosody, and they also start to recognize isolated words—okay—
recognize words. So they do a little bit of language modelling as well. All of this is taking place in sort of
… in a parallel way, which is a bit strange when you think of it, but that’s the data we have. So the
idea is that we should try to model that with a system that tries to learn from raw speech, okay?
So why would we like to do that? Well, the idea is that if you do this sort of weakly-supervised
learning, you will gather new ideas about new architectures that haven’t been tried before in
this area of automatic speech recognition; you may gain flexible, adaptive algorithms, or
systems that can bootstrap your ASR in an unknown language, for instance. You can also probably
be better at doing tasks like keyword spotting; imagine you have a large library of sounds—of
recordings—in an unknown language, and you want to retrieve documents that contain a given word;
this kind of task can be done even if you don’t have any labels, so that’s the kind of task that you
could do if you were to make improvements on this problem, okay? Now, the other thing that you would
gain if you were to succeed in this enterprise is a model of how infants could be learning language,
which would have applications in clinical research, for instance.
Okay, so what I’m going to talk to you about today is the challenge that we organized and presented at
the last InterSpeech; it’s called the Zero Resource Speech Challenge. And so as I said before, we have two
tracks; the first one is to learn acoustic representations, okay? Now, this is something that has actually
been around for quite a while. The base was this Kohonen paper that I read when I was a student,
and I was very impressed; I said, “Wow, you can learn”—this is the phonetic
typewriter, right—“you can learn a representation of speech from the raw signal with just these self-organizing maps.” That was the paper, and then nothing came out of it, and then, all of these different
other ideas have been around; people have been trying other things, like, for instance, trying to
develop speech features in speech technology inspired by human physiology; then people have been
devising features that are even closer to what we know the brain—or the auditory nerve—is doing,
okay? So this is more in the area of cognitive science; this is more in speech engineering; and then
people, more recently, have been trying deep autoencoders; they have been trying to use HMMs—
unsupervised HMMs—and the MIT group has been playing with hierarchical non-parametric Bayesian
clustering. All of these are trying to build good acoustic representations, and by good acoustic
representations, I mean representations that would basically have a linguistic use—that is, if you take two
instances of the syllable “ba,” they would be put together very
closely, and if you have a “ba” and a “ga,” they would be separated, okay? So that’s
the notion of good acoustic representations. Now, what’s striking to me, when I started to review
this, is that—by the way, this is not at all a complete list, okay? A huge number of people
have been interested, at some point in their career, in discovering units in speech automatically, and they
have tried many different ideas. And so what was striking to me is that none of them used the same
way to evaluate whether the systems were working or not; each paper is using its own dataset, its own
evaluation procedure; they don’t cite the others—so none of these people are citing the
others and vice versa.
So it’s an extremely fragmented field that really needs to have some kind of framework … I mean, if we want
to have some progress in that, we will need to basically be able to evaluate all of these different ideas in
a common framework, and that was my main motivation in setting up the challenge, really:
to stop having this completely chaotic process of people trying things and not relating to others. So to
evaluate an acoustic representation, what people typically do—because they are in the frame of mind of
supervised learning—is they would train a classifier. Okay, you have a representation of speech—let’s say it’s
auditory features or whatever, or posteriorgrams, or something—then you train a phone classifier, and
you compute the phone error rate. That’s what people do. Now, of course, I am not very happy with
that, and why is that? Because, in fact, there are many different types of classifiers that you could train,
so you have to specify a hypothesis space; you have a problem of optimization—some algorithms are better
than others; some of them have free parameters; and then you need to worry about
over-fitting. All these problems arise when you try to do supervised learning, and they are
going to be affected by the number of dimensions in your representation—all that stuff. So what we
decided to do was to go back to the basics and try to do a task that involves no learning, okay? We
remain faithful to our idea of unsupervised system, so we do it also for the evaluation. So this is a task
that’s very well-known in human psychophysics: I give you three tokens—so “ba,” “ga,” and then another
one that could be either “ba” or “ga”—and that could be the same speaker or not, okay? And your task is to
say whether X is closer to A or to B, okay? And so we set up the task like this, by constructing a big
list of minimal pairs—pairs of syllables or words that only differ in one particular phoneme.
Why do we do that? Because we know that if a system wants to recognize words at the
end, at the least, it will have to distinguish between “buh” and “guh,” okay? Yes?
>>: [indiscernible] at minus ten signal-to-noise ratio, you must convey useful information?
>> Emmanuel Dupoux: Sorry … I mean, yeah, yeah.
>>: Could you repeat the question?
>> Emmanuel Dupoux: So there … yeah, yeah, I didn’t comment on this graph, but I’m going to repeat
the question when it’s there. So with this kind of task—okay—so you can test this kind of ta … so this
graph here was obtained with a speech dataset that’s very simple; it’s called the Articulation
Index; it only has syllables—isolated syllables—that have been produced in the lab, and people have to
recognize them in various amounts of noise. And so that’s basically the curve that you get if you do this
kind of task—the ABX task—on humans, and that’s the performance that you get if you use some of
the standard features in speech recognition, like MFCC, for instance, or PLP. So RASTA-PLP is actually
getting better results. Now, if you do supervised HMM, you are around here, and really, the issue is: can
we bring unsupervised techniques as close as possible to the supervised technique?
Okay, yeah, the only thing I didn’t say is that in order to run this task, what you have to provide—if you
want to participate in the challenge—is speech features for your syllables. Okay, I
gave you syllables in waveform, and you have to translate that into acoustic features—okay—which is
basically going to be a matrix of values, and then, the other thing you have to provide is a distance—
alright—because I want to be able to compute the distance between “ba” and “ga,” okay? So the two
things you provide are the features and the distance, and then I will evaluate your
features with this minimal-pair ABX task and give you a number. Yeah?
>>: I’m just wondering how you control the signal to noise ratio, and you can set it.
>> Emmanuel Dupoux: Okay, so that … there—in this experiment—we just added noise.
>>: Oh.
>> Emmanuel Dupoux: Yeah, yeah. We added … I think some of the noises that are used
in one of these noise challenges; I don’t remember which one it is—CHiME or something. Yeah?
>>: And how well would humans do in this?
>> Emmanuel Dupoux: So that’s how humans do.
>>: Oh, this is human. Oh, this is human. Okay, okay.
>> Emmanuel Dupoux: This is human. This is human, and these are the features that have not been
trained, so these are just the acoustic features, and we try to devise features that would get close to us,
the humans.
>>: So … but then how do they do that discrimination in this case?
>> Emmanuel Dupoux: Oh, okay, so that’s the interesting thing: in this study, we actually
cheated; we didn’t run the ABX in this study; we have it, but I didn’t put it on this graph. There,
the humans have to transcribe, and we use the posteriorgram of their transcription as features, okay?
>>: Okay.
>> Emmanuel Dupoux: So it’s not exactly the … we use it as if they were machines trying to make
posteriorgram, okay.
>>: How many pairs do you have?
>> Emmanuel Dupoux: How many …?
>>: Pairs. You have “ba” and “ga;” how many pairs do you …?
>> Emmanuel Dupoux: Well, in this dataset, you have the entire set of syllables in English—CV
syllables—you also have VC syllables and some of the CVC syllables; there are a lot of them. And it’s
spoken by forty speakers. So it’s a big dataset that has been used quite a bit in psycholinguistics.
>>: So I assume that … the rest of the papers here—unsupervised learning—what they did is that they
automatically come up with the speech units.
>> Emmanuel Dupoux: Yes, but that’s … that … yeah.
>>: [indiscernible] the supervised learning.
>> Emmanuel Dupoux: Well, the idea is to have an unsupervised system and evaluate it
without doing any learning. So with this ABX, you can evaluate any model, okay? Any of these models
can be evaluated with the ABX minimal pair task, and it requires no
learning.
>>: I see.
>> Emmanuel Dupoux: It just requires to take the features and compute the distance—that’s all you
need to do. So that’s the only way we found to compare completely …
>>: Still [indiscernible] this is my diagram; I don’t know [indiscernible] How are these guys doing this?
>> Emmanuel Dupoux: So this one?
>>: Yeah. So you get the original data in there, and then the hidden layer here—the bottleneck layer—gives
you, maybe, thirty bits or something.
>> Emmanuel Dupoux: Right, so that’s the …
>>: How they do that? How do they actually get the number?
>> Emmanuel Dupoux: Well, they determine this with back-prop. I mean, this is … it’s a [indiscernible]
>>: Well, I know, but …
>>: But how do you compute the similarity [indiscernible]
>>: No, I just want to know how you get the curve—the number representing it, the zero or the one.
>> Emmanuel Dupoux: Ah, okay. So if you do that, you take the …
>>: Thirty-two bits, for example.
>> Emmanuel Dupoux: Thirty-two bits, and you do that [indiscernible] for each slice of your speech.
Now you basically construct a time matrix, and what you do is you do it for “ba”; you do it for “ga”; you
compute the distance …
>>: Okay.
>> Emmanuel Dupoux: … and you compute the error rate in assigning …
>>: Oh, I see … for the new one, you use the same Euclidean distance, for example, to see which one is
closest [indiscernible]
>> Emmanuel Dupoux: Yeah, yeah, yeah. Yeah, we use a cosine distance [indiscernible] So for each of
these representations, you can … you just have to define a distance, and then you can extract a value
and put it on the graph, okay?
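For readers following along, here is a minimal sketch of how an ABX score of this kind could be computed from submitted features, assuming only numpy; the frame-wise cosine distance, the DTW path averaging, and the crude length normalization below are illustrative assumptions, not necessarily what the challenge’s actual evaluation code does.

```python
import numpy as np

def cosine_dist(u, v):
    """Cosine distance between two feature frames."""
    denom = np.linalg.norm(u) * np.linalg.norm(v) + 1e-12
    return 1.0 - np.dot(u, v) / denom

def dtw_distance(A, B):
    """Average frame-wise cosine distance along the best DTW path between
    two (n_frames, n_dims) feature matrices of possibly different lengths."""
    n, m = len(A), len(B)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_dist(A[i - 1], B[j - 1])
            acc[i, j] = d + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m] / (n + m)  # crude length normalization

def abx_error_rate(triplets):
    """triplets: list of (A, B, X) feature matrices, where A and X are tokens
    of the same category (e.g. two 'ba's) and B is the contrast (e.g. 'ga').
    A triplet counts as an error when X does not end up closer to A."""
    errors = sum(dtw_distance(X, A) >= dtw_distance(X, B) for A, B, X in triplets)
    return errors / len(triplets)

# Toy usage with random "features" (each token is a frames-by-dims matrix),
# so the outcome here is arbitrary; real tokens come from the challenge data.
rng = np.random.default_rng(0)
ba1, ba2, ga = (rng.normal(size=(12, 13)), rng.normal(size=(10, 13)),
                rng.normal(size=(11, 13)))
print(abx_error_rate([(ba1, ga, ba2)]))  # 0.0 or 1.0 for a single triplet
```

As described in the talk, the score for one triplet is zero or one, and the reported number aggregates over many such minimal-pair triplets.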
>>: Am I correct here—and I think I’m missing something, right? If this is ABX, can I assume—if I’m a
system—can I assume that A and B are from different phonemes?
>> Emmanuel Dupoux: Yeah, that’s a very good point. It … so you …
>>: So assuming that’s the case, then I have a lot of supervised signal, and …
>> Emmanuel Dupoux: [laughs] Yeah, that’s an excellent point, and in fact, none of our models use that
signal, but I agree that it could be that … I mean, that’s also why we run it in humans like that. When
you run humans in the true ABX task, the humans could actually also learn about what the task is and
improve the performance. There, we use the humans … they are just transcribing their … the syllables,
so they don’t know that they are in an ABX task, but that’s an excellent point. I think they had a
question.
>>: Yeah, so [indiscernible] resources does each group get to use? So I’m a bit confused about the
definition of zero resource … are you able to access a lot of unsupervised audio signal, or do they
have …?
>> Emmanuel Dupoux: No.
>>: [indiscernible] so they had just ABX—the signal?
>> Emmanuel Dupoux: So the ABX is for the evaluation, okay? But for the training—I haven’t actually
… I haven’t shown you some of the models …
>>: Okay.
>> Emmanuel Dupoux: … that do the learning, but they just have speech, and that’s all they have.
>>: They have speech, okay.
>> Emmanuel Dupoux: And in principle, they shouldn’t use the ABX evaluation to make their system
better, which some of the participants did, but that’s okay. [laughter] It’s always like that a little bit.
Okay, so that’s track one—okay—track one, the task is basically: find a
representation, and then I’m going to evaluate it in terms of this ABX, okay? It should be as close as
possible to the human. Now, track two: spoken term discovery. So spoken term discovery … yeah?
>>: So a correct performance is merely that you pick whichever of the two—A or B—is closer to the …
>> Emmanuel Dupoux: That’s right, yeah.
>>: … vector encoding X. That’s it; doesn’t matter how far apart they are?
>> Emmanuel Dupoux: No, you just take that and …
>>: Just closer.
>> Emmanuel Dupoux: Yeah, and they’re … and you are given an error rate, which in that case—you just
have one triplet—is going to be zero or one, but of course, we aggregate over many, many triplets like
that.
>>: I have another high-level question.
>> Emmanuel Dupoux: Yes?
>>: Are the acoustic signals the original training data? Are they just—you know—recordings of
speech?
>> Emmanuel Dupoux: Mmhmm.
>>: I was wondering: what’s your thought on, like … when a child actually learns, there is a lot
more information than just the speech signal.
>> Emmanuel Dupoux: Yeah.
>>: There is a visual [indiscernible]
>> Emmanuel Dupoux: Yeah, yeah.
>>: So that is not considered in this [indiscernible]
>> Emmanuel Dupoux: Not in this challenge, but it might be in a future challenge, which I will come to
after that. Yes, but that’s also an excellent question. [laughs] Okay, so track two: track two, you have to
discover word-like units, okay? So there, our notion of a language model is very simple; you just have to
listen to speech, and then you have to find patterns that match, okay? So if I say twice the word
“dog,” then you’d basically discover that these two things are matching. So there have been, again … I
mean, there’s a large number of people who have tried that—again, with different evaluations each
time. So what we tried to do was to conceptually break down this process of spoken term discovery into
parts and then evaluate each of them separately—okay, a little bit like unit testing. So what people
do, typically, in spoken term discovery is that they will build a big similarity matrix: imagine that you
have your TIMIT corpus here on one axis and the same corpus there on the other axis, okay? And then you
try to find matching patterns, which are going to be diagonals in your similarity matrix—so this is a very
brute-force thing, okay? So now, you find matching pairs, okay? And then, typically, what people do is
they will cluster the pairs in a … with some kind of graph clustering algorithm, and then, they will use
this cluster to go back to the original signal, and parse it, and say, “Well, this is the word: my cluster one,
followed by cluster three, followed by cluster one.” Okay, so these are basically the three steps that
people do when they do spoken term discovery automatically, and so we have developed a bunch
of metrics to measure these three steps. I’m not going to bother you with the details, but
typically, these metrics are what is used in NLP; when people do word segmentation
based on text, they are using this kind of F-score metric, and this part is specific to the fact that it’s speech.
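As a rough illustration of the first of those three steps, here is a toy sketch that looks for matching stretches as long, high-similarity runs along the diagonals of a frame-level similarity matrix; the cosine similarity, the threshold, and the fixed minimum run length are assumptions for illustration, and real systems use segmental DTW and much more careful candidate filtering.

```python
import numpy as np

def frame_similarity(F1, F2):
    """Cosine similarity matrix between two (n_frames, n_dims) feature arrays.
    For matching a corpus against itself, pass the same array twice (and then
    ignore the trivial main diagonal)."""
    A = F1 / (np.linalg.norm(F1, axis=1, keepdims=True) + 1e-12)
    B = F2 / (np.linalg.norm(F2, axis=1, keepdims=True) + 1e-12)
    return A @ B.T

def find_diagonal_matches(sim, threshold=0.9, min_len=20):
    """Return (start_i, start_j, length) for runs along diagonals where the
    similarity stays above `threshold` for at least `min_len` frames
    (e.g. 20 frames is about 200 ms at a 10 ms frame shift)."""
    n, m = sim.shape
    matches = []
    for offset in range(-n + 1, m):
        diag = np.diagonal(sim, offset=offset)
        run_start = None
        for k, val in enumerate(np.append(diag, -np.inf)):  # sentinel ends runs
            if val >= threshold and run_start is None:
                run_start = k
            elif val < threshold and run_start is not None:
                if k - run_start >= min_len:
                    i = run_start + max(0, -offset)
                    j = run_start + max(0, offset)
                    matches.append((i, j, k - run_start))
                run_start = None
    return matches
```

The matching pairs found this way would then be handed to the clustering and parsing steps described above.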
>>: The very first step with the matching, is that done on the feature? That’s on the waveform or that’s
on the feature?
>> Emmanuel Dupoux: That’s … so if you have a system and you are competing in
this thing, I give you the waveform; you can transform that into whatever you want, except that
it shouldn’t be using the fact that you know that the language is English. Like, for instance, you
could do this kind of stuff on posteriorgrams of a phone recognizer; that would be cheating, because
you would be using the fact that you know the language—English, for instance.
>>: But the work in track one could plug right into track two.
>> Emmanuel Dupoux: Exactly, exactly, and vice versa.
>>: Okay.
>> Emmanuel Dupoux: Yep?
>>: What is the frame size in step one?
>> Emmanuel Dupoux: The what?
>>: The size of the frames used. You’d imagine “dog” would be a smaller frame than “octopus.”
>> Emmanuel Dupoux: Yeah, so …
>>: [indiscernible] choose a fixed frame …
>> Emmanuel Dupoux: So that’s for this challenge?
>>: Yeah, for this match.
>> Emmanuel Dupoux: Well, I mean, typically, you have frames here:
you have one frame every ten milliseconds. So “dog” is going to be, anyway, a sequence of
such vectors, so the word “dog” is going to be a portion of the matrix. So in fact, imagine
that you have found a repetition of “dog” here and here; they’re going to have slightly different durations in
the end, because you may speak faster or something.
>>: [indiscernible] like that.
>> Emmanuel Dupoux: So these algorithms are doing DTW, and they take care of
these differences in time.
Okay, data sets—so that was a question about data sets—so we used datasets that are not child-directed,
because we wanted to have very detailed annotation of the phonemes to evaluate, and we
couldn’t find any child-directed one that was of good enough quality, so we used smartphone-recorded read speech in
a language that’s not very well-known—it is Xitsonga, not very well-studied—and then we
added English with harder speech—conversational speech. This is the Buckeye
corpus, for those of you who are doing speech. So that’s what we had, and then we provided baselines
and toplines. So for ABX—for the first track—the baseline was just MFCC features; so already, with the
MFCC features, you get this performance in ABX—okay—which is around—I mean—twenty,
yeah, thirty percent error. And for the topline, we used the posteriorgram of an HMM-GMM trained on English, okay? So this is the cheating thing, okay? So somehow, we expected that the
systems would be somewhere in the middle, okay? Yeah, then on track two, for our baseline, we used a
spoken term discovery system developed at Johns Hopkins, and for the topline, we used the
transcription—the phoneme transcription—and we used a segmenter called adaptor grammars that is
sort of the state of the art in text segmentation, okay? So this is, of course, completely cheating; it’s
going to be extremely difficult to beat that topline, but we had that for comparison, because
these metrics are novel; we wanted to have some way to compare them.
>>: So for ABX topline, that means they use this posterior …
>> Emmanuel Dupoux: That’s right.
>>: … gram as the feature.
>> Emmanuel Dupoux: You use the [indiscernible]
>>: And then the same procedure just to compare [indiscernible]
>> Emmanuel Dupoux: That’s right; that’s right. So …
>>: And when you train this topline, you train it in a supervised manner; you start with an alignment.
>> Emmanuel Dupoux: Yeah, yeah. You start with a transcription and then do forced alignment and
whatever. I mean, I’m not saying that this is the best possible topline; we took Kaldi; we
tweaked it; and we tried to get a good performance; but we didn’t spend months optimizing it on the
dataset. This is a very difficult dataset, I must say—the English one.
>>: Because you could have actually [indiscernible] the GMM training in a completely unsupervised
manner, also.
>> Emmanuel Dupoux: Yeah, yeah, but that’s actually what competitors are going to do. So we would …
we just provide them this, because the competitors had to basically provide systems that would be
unsupervised. Okay, so participants: we got twenty-eight registrations; a number of people didn’t end
up doing the challenge, but we had nine accepted papers at InterSpeech that were presented; and a
number of institutions participated, so we were quite happy that people from completely different
labs—some of them I had never heard of before—participated. Interestingly, there were people
working on under-resourced languages, like the group in South Africa, where they have
lots and lots of languages with very few resources, and so they were very interested to participate in
this.
>>: Just a comment … I think it’s Jim Glass …
>> Emmanuel Dupoux: Yeah, so Jim Glass actually, in the end, didn’t send anything, [laughs] but he was
… he registered at the beginning.
>>: Oh [indiscernible]
>> Emmanuel Dupoux: I know; I know. He was at the very beginning of this—all these ideas.
>>: That’s what I thought, yeah.
>> Emmanuel Dupoux: But somehow, the system was not ready for the competition.
>>: Which group at Stanford participated?
>> Emmanuel Dupoux: So who …?
>>: Which group at Stanford participated?
>> Emmanuel Dupoux: So it’s actually people from cognitive science—Mike Frank. Mmhmm?
>>: Was the test set blind, or did they have access to the data?
>> Emmanuel Dupoux: So they had access to the speech—only to the speech …
>>: Only to the speech.
>> Emmanuel Dupoux: Yeah, and no labels—no labels. We released a sort of dev set
with a fraction of the English with labels at the beginning, but people also had the evaluation, so they
could actually … one team that, in the end, was not selected for InterSpeech had
really engineered their thing by using our evaluation. So next time we do that, we won’t
give out the evaluation publicly; we’ll just keep it and register everything so that we know what happens.
Okay, so there are a lot of papers; I just wanted to summarize for you the
main ideas that came out. So basically, in track one, there were three systems submitted that
had roughly the same kind of intuitions. What they wanted to do is basically … you could say it’s density
estimation somehow. So you are given speech; that’s all you have, so basically, what you try to do is to
estimate the probability of that speech with some reduced model, okay? So one way to do that was to
do dictionary learning; so these guys were doing the bottleneck autoencoders that we saw before; and
this group did a generative model with a Dirichlet process GMM, which actually was the best model of
all—which, for me, was completely unexpected—but they got extremely good results with a fairly simple
idea: you just build an HMM … not an HMM, a mixture of Gaussians—a large one—using a Dirichlet
process to find the number of components that you need, and you take the posteriorgram of that, and
that actually gives very good ABX results. This autoencoder, however, was interesting, because it
could beat the MFCC with only six bottleneck features—so that’s extreme compression of speech—
and it also could do quite well with binary features—you just binarize them, zero and one—so you
have twelve binary features, and you can beat the MFCC on the Xitsonga corpus. So these were sort of
unexpected and interesting results. So as I said, this one had very, very good results …
>>: But how do you get a posteriorgram if you don’t have phonetic information?
>> Emmanuel Dupoux: Well, you just …
>>: Posteriorgram is based on the phones.
>> Emmanuel Dupoux: No, it’s just based … you take … so you have all these Gaussians—a full, diagonal
mixture of Gaussians …
>>: Oh [indiscernible] of the components.
>> Emmanuel Dupoux: Of the … that’s right.
>>: I see, and that becomes the feature …
>> Emmanuel Dupoux: That becomes the feature, exactly.
>>: … on which they compute the Euclidean distance.
>> Emmanuel Dupoux: Yeah, yeah.
>>: Oh, okay.
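To make the posteriorgram idea concrete, here is a minimal sketch using scikit-learn: a Bayesian, Dirichlet-process-style mixture with diagonal covariances is fit on unlabeled frames, and the per-component posterior probabilities of each frame become the new features. The library, the number of components, and the exact settings are my assumptions for illustration, not the setup of the system discussed above.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Stand-in for MFCC frames pooled from the unlabeled corpus (n_frames, n_dims);
# in reality you would extract these from the audio.
rng = np.random.default_rng(0)
frames = rng.normal(size=(5000, 13))

# Dirichlet-process-style mixture: many components, most of which the model
# is free to leave effectively unused.
dpgmm = BayesianGaussianMixture(
    n_components=100,
    covariance_type="diag",
    weight_concentration_prior_type="dirichlet_process",
    max_iter=200,
    random_state=0,
)
dpgmm.fit(frames)

def posteriorgram(token_frames):
    """Map a (n_frames, n_dims) token to its (n_frames, n_components)
    posteriorgram: each row is the posterior over mixture components."""
    return dpgmm.predict_proba(token_frames)

# These posteriorgrams are then compared with the same ABX machinery
# (frame-wise distance plus DTW) as any other submitted feature.
print(posteriorgram(frames[:20]).shape)  # (20, 100)
```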
>> Emmanuel Dupoux: Alright. Then there’s a group from Carnegie Mellon who did articulatory
features. So they were actually using some side information, because they had trained an acoustic-to-articulation system, and they used that as features. They didn’t do very well, but I thought it was
very interesting to try to use articulatory features, because in fact, we know that babies probably have …
I mean, they do babbling, and so they could have access to some articulation. So I think—even though
they didn’t do very well—it’s a very interesting idea to continue exploring.
Now, the two other systems were using the same idea: the idea that you use
track two to help track one, okay? So imagine that you have found some words—candidates—then you
can use them to teach yourself better phone representations. Wait, I’m going to explain that a
little bit more. So that’s the system we submitted—I mean, a part of our team was organizing the
competition; another part was trying to submit something; we tried to put a wall
between them. [laughs] So the main idea is: because the lexicon is sparse in
phonetic space, it’s easier to find matching words than to find matching phonemes, okay? Phonemes
are very influenced by co-articulation, so if you look in the acoustic space, the phonemes are going to be
very, very different. But words are long, and they don’t have a lot of neighbors, meaning that if
you take two words at random, they are going to differ in most of their phonemes, so now you can
accumulate the distance, and you have a much better separation. Okay, so that’s the idea. The intuition
is: you find the lexicon using track two, and then you use this Siamese network
architecture, in which what you do first is you take your two words; imagine that you
found one instance of “rhinoceros” spoken by the mother, another one by the father, okay? What
you do is you align them using DTW, and now you take each part of that word—okay—and you present that
to a neural network, and because it’s aligned now, you can say that this part of “rhinoceros” is going to
be aligned with this part; so even though the “o” of the mother and the “o” of the father are different
acoustically, now the system is trying to match … to merge them, okay? Because the cost function there
is basically based on the cosine of the outputs of these two networks; these two networks have the
same weights, okay? They are Siamese; they share everything; they don’t share the input. You
take the inputs; you compute the cosine distance between the two outputs; and you try to make the
outputs orthogonal if the words are different, like “rhinoceros” and “grapefruit,” and you
try to make them collinear if it’s the same word.
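Here is a minimal PyTorch-style sketch of that Siamese idea, assuming the DTW alignment has already produced pairs of frames together with a same-word/different-word flag; the network sizes and the exact cosine-based loss (pushing same-word pairs toward collinear outputs and different-word pairs toward orthogonal ones) are illustrative stand-ins for the ABnet setup described here, not its actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """One tower of the Siamese network; both inputs go through the same
    weights (the two towers share everything except their inputs)."""
    def __init__(self, n_in=39, n_hidden=500, n_out=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_out))

    def forward(self, x):
        return self.net(x)

def siamese_cosine_loss(emb_a, emb_b, same):
    """Push embeddings of DTW-aligned frames from the same word toward
    cosine = 1, and frames from different words toward cosine = 0."""
    cos = F.cosine_similarity(emb_a, emb_b, dim=1)
    return torch.where(same, 1.0 - cos, cos.abs()).mean()

# Toy training step on random "aligned frame pairs".
encoder = SiameseEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
frames_a = torch.randn(32, 39)            # frames from token A
frames_b = torch.randn(32, 39)            # DTW-aligned frames from token B
same = torch.randint(0, 2, (32,)).bool()  # True if A and B are the same word

opt.zero_grad()
loss = siamese_cosine_loss(encoder(frames_a), encoder(frames_b), same)
loss.backward()
opt.step()
# After training, encoder(frames) is the representation submitted to the ABX evaluation.
```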
>>: Yes, so the A and B here are the same word uttered by different people.
>> Emmanuel Dupoux: That’s right.
>>: That’s why it [indiscernible]
>> Emmanuel Dupoux: That’s right.
>>: How does it do with rhyming words with the same number of syllables?
>> Emmanuel Dupoux: It will do badly. [laughs] Yeah?
>>: Just a …
>> Emmanuel Dupoux: Yeah, of course, and there are minimal pairs in languages. There are cases where
you have—I don’t know—“dog” and “doll,” okay? They are almost the same, except for one phone …
but the thing is that if you take two words randomly, the percentage of shared
phonemes is going to be only ten or twenty percent on average. So you are going to make some errors,
but it’s only going to be on a fraction of the words; the rest is going to be fine.
>>: That is kind of interesting given your interest in the way kids acquire language, because in many
cases, people try to rhyme to kids.
>> Emmanuel Dupoux: Yes. Well, maybe—I dunno—maybe they can ignore that. I don’t know; I
mean, rhyming is—I never thought of that—it’s an interesting, interesting point.
>>: Do the two tracks [indiscernible] share the same data?
>> Emmanuel Dupoux: Sorry?
>>: Task one and two, do they share the same data?
>> Emmanuel Dupoux: Yeah, yeah. So they were based on the same data. So—
yes—this is the performance that we got in a previous paper; in the previous paper, we ran this on
TIMIT, and there, the words were actually gold words; we were
cheating; we were using the lexicon to do that. But in this test, we found that—so this is the
performance of the filter bank on the ABX; this is the posteriorgram of a supervised system; and
this is what we get with this sort of weak or side information from the lexicon, which shows you that you can
really improve on the input representation and be almost as good as a supervised system. So this was
done on TIMIT.
>>: Yes, so your system is a DNN, and so where did you extract features? [indiscernible] can you go back to
the previous slide? Sorry, really curious.
>> Emmanuel Dupoux: Yeah.
>>: Yeah, so from this one, how did you … what features did you extract from here to do the comparison
for ABX?
>> Emmanuel Dupoux: Oh, okay, so the feature we get is this layer.
>>: The top layer.
>> Emmanuel Dupoux: The top layer.
>>: Okay, I see.
>> Emmanuel Dupoux: That’s right. Mmhmm, yeah. Okay, so that’s previous work we did last year with
TIMIT, and now we did it here, using the JHU spoken term discovery baseline, which is the
unsupervised system that was provided in the challenge. So we use that to give us the words—okay—that
we then use to train our Siamese network, okay? So now the words that are found are
completely unsupervised; they’re found by the spoken term discovery, which works with DTW, et
cetera; so we find the words, and then we train the system, and then we compare on the ABnet … on
the ABX. And so basically, our system was beating the MFCC in the two languages, and we were
actually beating the topline in one of the metrics, okay? So it’s getting pretty close to the
topline on English.
>>: How well was it doing versus the DPGMMs?
>> Emmanuel Dupoux: So the … it was just a couple of points. [laughter]
>>: The DPGMMs were … you mentioned, like, they were better than the topline.
>> Emmanuel Dupoux: Yeah, so they are … they were, like, eleven point nine.
>>: In both tasks; in English …
>> Emmanuel Dupoux: Mmhmm. Yeah, yeah, I have the results somewhere.
>>: Yeah, give me some sense of why that GMM could do so much better than …
>> Emmanuel Dupoux: No, it … so there’s something that they did …
>>: Yeah.
>> Emmanuel Dupoux: … which I discovered after the fact: they did speaker normalization on the input
features.
>>: Oh, okay.
>> Emmanuel Dupoux: And I think that could help a lot to …
>>: Okay, but you can do the same thing for [indiscernible]
>> Emmanuel Dupoux: Yes, of course, but we didn’t think of it, you know?
>>: It’s harder to do by syllable.
>> Emmanuel Dupoux: The thing is that these things are … I mean, it’s really a new field, and I
think what’s interesting is that these are sort of old ideas that people said, “Well, that’s not really
useful.” But the game is completely different when you do unsupervised learning, and basically, we
have to restart exploring all the ideas and then try to put them together. So that was my main
lesson there. Yeah, I’m not going to talk much about track two; for track two, we had only two
papers; there again, there were very interesting ideas—completely novel ideas, like, for instance, using
syllabic rhythm to try to find word boundaries, which turns out to work very well—a surprising idea, I
guess. But in brief, I think the time is ripe for doing this kind of comparative research; there had
been lots of ideas around, but they hadn’t been evaluated in the same way previously, and I think
there’s a lot to be done after that to try to combine the ideas in a system that would work in a
coherent way. The challenge is still open, and people are still actually registering, and downloading, and
trying to tinker with the system. There’s going to be a special session at SLTU and perhaps a
CLSP workshop for those interested.
So for the future, the future for me is: try to continue this logic by pushing the technology even further,
and I would like to have your feedback, actually, on what would be both interesting and feasible. One
idea that was floated was to scale up to have larger datasets, have more evaluations—like prosody,
language modelling, lexical semantics—or to go multimodal—like adding the video, for instance. If you
train associations between pictures and descriptions in speech in a language you don’t know,
and now you try to find more pictures that match, for instance—this kind of thing—that would be very
tough.
Of course, we want to go back to infants, and there are already applications of this kind of zero-resource technology that we’ve started to work on. For instance, what we did was to use a
dataset that we could have access to only in a special place, so you have to go there. I mean, most of
the infant datasets are not free, and they are not open, so to do experiments, you have to go to
this place; you run your software; and then you go back with your table of results, but you can’t
distribute anything, so that’s very annoying. But we showed that, for instance, infant-directed
speech—contrary to what everybody thinks—is not clearer than adult-directed speech; it’s actually
more difficult, because parents are using this [indiscernible] and super-exaggerated intonation, and
when they do that, they are making the acoustic tokens actually more variable. And so if you try
to learn an automatic classifier or representation, you are actually going to have a harder time. So that’s
interesting. What I’m saying is that all of these techniques are basically giving the linguistics
community quantitative tools that can help to analyze the data.
>>: Can you say a little bit more about what you mean by clearer [indiscernible]
>> Emmanuel Dupoux: Okay, so in that case, we just ran the ABX evaluation on a bunch of
features, like MFCC and these other features, et cetera, on triplets of things like: “ba,” “ga,” “ba.”
>>: Child or adult?
>> Emmanuel Dupoux: Child-directed or adult-directed, and matched for phonetic content and
everything. And we found that the ABX performance was actually lower in many cases.
>>: So that’s actually very counter-intuitive. You would think parents articulate—over-articulate.
>> Emmanuel Dupoux: Yes, so they over-articulate, but they also over-vary, so the
variance is increasing more than the mean, and in the end, you have more overlap, at least in
the dataset we had, which was in Japanese—a Japanese dataset collected in the lab. So I mean, that
brings me basically to my last point. Yes?
>>: Yeah, it’s also not clear that the function of the exaggerated prosody has anything to do with
improving phonetics.
>> Emmanuel Dupoux: Absolutely. I totally agree with that, but people have been assuming that
because parents are doing this thing, it must be for a pedagogic reason—it’s for the good of the baby.
>>: It wouldn’t necessarily be phonetic pedagogy—I mean—if it entrains attention in a special way,
then that might be its function.
>> Emmanuel Dupoux: Yeah, completely. So that must be what’s happening; there may be other explanations,
like, for instance, if you are basically trying to modulate your emotion specifically, you may
have less attention to devote to your articulators; I mean, there may be a bunch of explanations for this
effect, but I think it’s interesting—for the first time—to be able to really do a large-scale analysis on
this kind of dataset. Okay, so that brings me to the very important question for me, which is knowing
more about how infants learn language in terms of the input they have. So what is the variation in
amount, quantity, and quality of speech? Or what is the effect of feedback? Et cetera, et cetera. All
these questions are super-important; people have been saying lots of things about them; but not a lot of
quantitative analysis has been done, because we simply don’t have the datasets.
>>: You’re talking … it occurs to me kind of … you’re talking just about adult speech, and how adults talk
to infants, and how they talk to adults.
>> Emmanuel Dupoux: Right. Yes, that’s right.
>>: Even in that thousand hours of speech, it’s all just adults.
>> Emmanuel Dupoux: Yes, yes. We haven’t … I mean, it would be interesting to quantify also the
amount of speech they receive from siblings, but that would also completely co-vary with, for instance,
socio-economic status; I mean, in some families you have a large number of siblings and in others not.
All these questions could be looked at in a quantitative fashion, but for this, we need to have the data.
So there is a big community of people starting to gather data, using these kinds of devices, where you have a
small microphone you put in the baby’s outfit, and then you can collect large amounts of speech data in
various settings. So right now, there’s a consortium of ten labs working together; we already have a
thousand hours; but it’s in completely different languages and stuff like that; and this is going to grow.
So I think there are going to be masses and masses of data arriving, and what we need—and that’s also
why I’ve sort of started to contact people like you and other labs—we need some help with trying to
automate the annotation of these, ‘cause it’s a mess. If you just have the raw data, there’s nothing
you can do with it; you need to be able to do things like—okay—how much of this is child-directed versus non-child-directed and this kind of stuff.
>>: It’d be real nice to have mics on the parents, too.
>> Emmanuel Dupoux: Yes. [laughter]
>>: And exactly it [indiscernible]
>> Emmanuel Dupoux: So that’s the second one; the second phase is the Speechome-type project … so you all
know about Deb Roy’s gigantic experiment where he hooked up cameras—videos—in his home for
three years. So this is fanta …
>>: Did he release all this to the community?
>> Emmanuel Dupoux: Yes, so no. What’s your question?
>>: Did he … did he release all this so that it can …
>> Emmanuel Dupoux: No, no, ‘cause the data—look—the data … I mean, I’ve seen it; it’s a pile of
hard drives that are in a box and locked; no one has access to that.
>>: Oh.
>> Emmanuel Dupoux: Okay, so that’s a big problem, because it was a huge effort; there was a lot of
annotation done; but he didn’t think in advance about how he was going to use it and open it to the
scientific community. So that’s why I don’t want to start doing this before I have thought through all the steps,
okay? So the steps are going to be—okay—we need, first of all, better sensors, because with this video—
apparently, she told me—it was impossible to find objects in the video; it was difficult; they tried to do
automatic extraction; and it was too difficult, so they just didn’t do it. So I think that, with Kinect
sensors and things like this, we get better abilities to reconstruct the scene. So basically, the aim will be
to reconstruct the sensory input available to the child—okay—as if it were like—you know—a little
movie. You can see: okay, that’s the life of my baby, over time. Then we need to do
semi-automatic annotation to have metadata, like speech activity, object and person tracking, et cetera,
that’s robust to what’s happening in the house. So we tried to play with the Kinect SDK, and as soon as you
move behind a table, it’s finished; the baby is actually also invisible—that’s very funny. You have the—
you know—the skeleton tracking, and then the baby is not there; [laughs] it’s not seen by the model.
Okay, so all of this needs to be fixed, and then, the final question, which is extremely important, is the
ability to open this dataset while preserving the privacy, and there are a number of different
ideas that have been floated around, like having a server on which you have the data, and you cannot
access the data directly; what you could do is send algorithms to the server that would run your analysis
and extract only summary tables, but nothing of the content, okay? So all of this has to be thought
through, and if you are interested or you know people who could be interested in this kind of enterprise, I
think—given the scale of it—it needs to be done in the right way this time, not ending up sitting with
piles of hard drives doing nothing. So I’m very happy to have feedback on that. So I guess I’m
done. Thank you very much. [applause]
>>: So just one … I mean, would it be okay … would it be enough to have the parents sign a privacy
disclosure that they won’t [indiscernible] for anything they’re wanting to do? Would that be
enough to get around the privacy issue, or are there other problems?
>> Emmanuel Dupoux: I’m not sure I would be happy with that. I mean, that’s what people do in this
DARCLE community; they have people sign that; but the device is only there for one day or a week, and
then there are, like, ten recordings, and that’s all. So you can imagine that parents could sort of review
them and ask for deletion of certain parts. If you have three years of life recorded
continuously, it’s going to take you three years to find the things you want to remove. So I think … I
mean, my experience with that is that I had a post-doc who wanted to do that, okay? So
he started to wire his house with two Kinects and stuff like that, and after a while, he said, “Well,
no, I don’t want to release that, because I’m talking about colleagues in the lab, for instance.” [laughs]
You know? And sometimes they are talking about their credit card number and whatever. So
I think it’s going to be tough. Probably, we could do things to try to help parents
find the places which they could delete; like, for instance, you could do
emotional analysis, and if things become heated, you could say, “Well, maybe you want to delete
that.” So I think it’s still an option to try to make a sanitized version of it; of course, the sanitized
version would then be a biased sample of what the baby hears, but it still could be
immensely useful for research. And the other possibility—yeah—is to have everything, but
then restrict the access in this way. So yeah, there was a question, then the …
>>: Have you come across Cynthia Dwork’s work on privacy and database searches?
>> Emmanuel Dupoux: So who?
>>: Cynthia Dwork—D-W-O-R-K.
>> Emmanuel Dupoux: Okay.
>>: Basically, it’s way too easy to triangulate [indiscernible] and zone in on individual people.
>> Emmanuel Dupoux: Yeah.
>>: And she’s try … she’s fighting this beast with some success.
>> Emmanuel Dupoux: Okay, yeah, I should look that up. Yeah, I know it’s super difficult to really
prevent extraction of data for …
>>: Differential privacy.
>> Emmanuel Dupoux: Yeah, differential privacy.
>>: Yeah, that’s right.
>> Emmanuel Dupoux: Yeah, I mean, for instance, I was thinking the other day of something very
simple, like: imagine you train a neural network with, like, several
families’ worth of speech; then can you extract the weights? Maybe the weights contain
information—private information—you don’t know, right? Because … I mean, I don’t know if you’ve
seen these networks where you can make the network dream, and it will reconstruct what it has seen.
So you could do the same thing here, so it’s scary, right? [laughs] But it’s a neat problem. Okay, so we
had … yeah?
>>: Yeah, I … go ahead, okay.
>>: So here, you’re actually handling two problems at the same time: the language acquisition problem
and the knowledge of the world. You know, like, as a kid, your issue might be just like learning about
the world.
>> Emmanuel Dupoux: Right.
>>: Not exactly language, but do you, like, look at other domains, like second language
acquisition?
>> Emmanuel Dupoux: Uh-huh.
>>: Where it’s actually … you are conditioning over certain knowledge of the world—like I already know
the world as an adult, but I’m learning a new language.
>> Emmanuel Dupoux: Mmhmm.
>>: Maybe the data would be easier to collect; it’s not as controversial.
>> Emmanuel Dupoux: Yeah, yeah, sure, but think …
>>: And it’s maybe similar problem, but it’s not exactly the same problem as you’re solving.
>> Emmanuel Dupoux: I agree; I agree it’s an interesting problem; I think it’s not the same
problem, because you already have a language model; you have world knowledge; and basically, it
now becomes a problem of domain transfer. And a lot of people who are doing under-resourced languages, actually, that’s the approach they take. So, for instance, if you have one or
two African languages, and then you don’t have this particular one, well, you use these: you use the
phone recognizers; you concatenate them; and you get some features that are not too bad for the new
language, okay? That’s the kind of thing that people are doing, and it’s similar—actually—it’s
interestingly similar in … to what we do as adults when we learn a second language.
>>: But, for example, is there any project to kind of, like … okay, if there is this person who’s going to
learn a new language or is traveling to a new country, just have a microphone and record the
experience, because it would be the same.
>> Emmanuel Dupoux: You know, the thing is that when you learn a second language—in most
cases—I mean, they are very, very different experiences, but at least when we—in our
economically-advanced societies—learn a second language, we go to school; we take a book;
and we learn the words, the lexicon; we learn the rules of grammar. And so it’s very, very top-down. So
in fact, we learn it basically the way machine learning does it; [laughs] it’s through supervision. And
it’s—I think—not actually very successful, because we are very bad at learning in this kind of
situation. [laughs] But yeah, I think it’s very … so I think I had one more question that …
>>: Throughout the world, it must not be the case that most second language acquisition’s like that.
>> Emmanuel Dupoux: Sorry?
>>: Throughout the world, it must not be like that.
>> Emmanuel Dupoux: In fact, there are other cases of second language acquisition in the kids
themselves. In fact, there are many families—maybe most families—where they learn several
languages at once. So that’s not the situation we set up in our challenge; I’m working with a
[indiscernible] with a PhD student on trying to basically use language ID technology to distinguish—I
mean, to know—how many languages are spoken around you. Imagine you are just born; you hear people
talking; how many languages are there? So I think it’s possible to have an answer to that with some of
these techniques. That would help, because otherwise, you would be mixing everything; it would be
a disaster. Yeah?
>>: So I just had a follow-up point. On the Speechome, have you considered having
people do the editing of the speech and video recordings, like, in real time? Like, is that too much
overhead? If you had, like, a big red button in the house, where as you’re talking, decide: oh, I don’t
want the last half hour recorded, so you just kind of hit the button, and it goes away.
>> Emmanuel Dupoux: Yeah, yeah. So—yeah—we built a little [indiscernible] little data
acquisition system; we each had a button like this. So you have a button to turn it off, and then you
have another button to basically remove, or at least mark, what just happened.
>>: Yeah.
>> Emmanuel Dupoux: And then you can review.
>>: Right.
>> Emmanuel Dupoux: But that would help; that certainly would help to the …
>>: So a slightly easier way to do the same thing: you could have keywords that are very easy to
identify.
>> Emmanuel Dupoux: Ah, I see.
>>: [indiscernible] the process starts, and “I process now something, end,” but this doesn’t [laughter] …
they’d have to remember to say that.
>> Emmanuel Dupoux: Yes.
>>: They’d have to remember to say “end” specially.
>> Emmanuel Dupoux: Right, right, right, right. So I have a question here.
>>: Yeah, I have a [indiscernible] very fundamental question here. So my understanding is that this
community—right, the zero-resource—I mean, there was an earlier community that was doing low
resource.
>> Emmanuel Dupoux: Mmhmm, yeah.
>>: Like that’s typically [indiscernible]
>> Emmanuel Dupoux: Right, right, right. That’s right.
>>: [indiscernible] a key, and they have it. So that’s ultimately leading to the dialect … the recognition.
For example—you know—in some of these challenges, they do have all these
languages, but it just doesn’t make sense; to link with them is—you know—it’s not viable.
That’s manifold speech recognition comparison, for example. So I thought that unsupervised
learning, maybe semi-supervised, would actually help to save the cost of—you know—
doing the labeling, but … and I was thinking carefully about what you are talking about here, and it …
doesn’t address the issue at all. I’ll tell you why.
>> Emmanuel Dupoux: Yeah, yeah, yeah.
>>: I don’t know whether you’ll agree with me.
>> Emmanuel Dupoux: Yeah, yeah, yeah.
>>: So you say this is unsupervised learning, but in order to produce the text coming out of that language,
you still have to know whether it’s “ga” or “da.”
>> Emmanuel Dupoux: That’s right.
>>: And that’s supervision. You have to label that in order for that to take place.
>> Emmanuel Dupoux: Yeah, yeah, you have to learn how to read; that’s the problem. [laughs]
>>: Yeah, exactly. So it really doesn’t address that practical issue that …
>> Emmanuel Dupoux: I guess, I guess.
>>: So are there any ideas in this community that might address the issue … to make this research more
practically useful [indiscernible]
>> Emmanuel Dupoux: Well, yeah, for these kinds of cases … so I’ve talked to people, and they were not
very … I mean, they didn’t basically want to go too far into the Babel—the Babel thing. But one thing I
was thinking of was … you could imagine turning the challenge this way: I give you
a thousand hours of speech and then twenty minutes of annotation.
>>: Correct.
>> Emmanuel Dupoux: And now you have to learn. That would be …
>>: Correct. So that [indiscernible] doesn’t solve the problem, because it … so the only thing you do is
that you extract features.
>> Emmanuel Dupoux: Right. No.
>>: And the reason you don’t need to do—you know—labels is because there’s no prior to learn.
>> Emmanuel Dupoux: Yeah, yeah, yeah.
>>: And there’s … there’ll be a reason, but it doesn’t address the issue of doing transcription.
>> Emmanuel Dupoux: But you could; so imagine that you have found good representations—invariant
representations—for phonemes; then it would be reasonably easy to map them to orthography,
even with a very, very small data set—just twenty minutes or even ten minutes, maybe.
>>: Okay.
>> Emmanuel Dupoux: So that would be the way to link it: set up a situation where you just have a
minimal amount of annotation—just very, very short—and then you have to basically transcribe the
rest. But we haven't done it; we don't know whether it's feasible.
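To make that concrete, here is a minimal sketch—not something from the talk or from the challenge—of the kind of final mapping step being imagined: a small linear classifier trained on top of frozen, unsupervised frame-level features, with a tiny labeled set standing in for the "twenty minutes of annotation." It assumes scikit-learn is available; the feature matrix, label set, and sizes are random placeholders for real data.

    # Hypothetical sketch: map frozen, unsupervised speech features to phone labels
    # using only a tiny annotated set (~20 minutes of frames). Random placeholders
    # stand in for the real features and labels.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    feats = rng.normal(size=(12000, 40))      # ~20 min of 100 Hz frames, 40-d features
    labels = rng.integers(0, 30, size=12000)  # ~30 phone classes

    # Fit a small linear classifier on the frozen features; the unsupervised
    # feature extractor itself is never retrained.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(feats[:10000], labels[:10000])
    print(accuracy_score(labels[10000:], clf.predict(feats[10000:])))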
>>: I see. Well, there are two things: well, one is that it's obvious imposs … [indiscernible] I—you
know—without the labels, that it's really hard to come up with some of the features that may be relevant
directly to [indiscernible]
>> Emmanuel Dupoux: Well, we don’t know. I mean, that’s … maybe, maybe.
>>: Yeah, we don't know that part. I mean, all of the things that you listed there, they're really rubbish
in my book. I mean, if you put these features into recognizers, they're not going to be producing good
results.
>> Emmanuel Dupoux: Not …
>>: Compared with the [indiscernible] result with supervision, so we know that they're not good
enough. And the second thing is that, even if you come up with an efficient thing in order to produce a—
you know—text transcription, you still have to do some kind of learning, so …
>> Emmanuel Dupoux: You have to learn …
>>: Otherwise, this … so this is really … to me, it's not realistic … okay, maybe in a sense, it
is—in the sense of ABX testing, you get a feature—but in a practical sense, this just doesn't happen.
>> Emmanuel Dupoux: But if you have a … I mean, I think if you want to have a practical system, you will
have to go all the way, like not only challenge two, but challenge three or challenge five, where basically,
you learn the … an entire dialogue system from scratch.
>>: Yeah, I … that’s … yeah.
>> Emmanuel Dupoux: That would be the direction.
>>: I agree with you. Yeah, okay, so this is the very beginning of …
>> Emmanuel Dupoux: But this is the beginning. So I think that we are still … we are here, really, at
the beginning of something that could become the next technology in ten or twenty years.
>>: Yeah, but there are some ideas that could be viable, for example.
>> Emmanuel Dupoux: That can be recy …
>>: In this stuff, you never use the prior information about how phoneme sequences are …
>> Emmanuel Dupoux: Mmhmm, mmhmm.
>>: But if you use that, there may be some hope. [indiscernible] yeah, I have some ideas about
[indiscernible]
>> Emmanuel Dupoux: Okay, well, we can talk about it, okay.
>>: So the prior is missing here. I think that’s the main problem.
>>: So two points: one is that it's no problem in software: instead of saying, "stop recording now," you can say,
"stop recording half an hour ago." [laughter] So when you know …
>> Emmanuel Dupoux: That's good.
>>: Of course, that really means don’t wipe out the last half hour.
>> Emmanuel Dupoux: Yeah, you keep everything, and then you …
>>: And it'll destroy it unless …
>> Emmanuel Dupoux: Correct.
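For illustration only—this is a sketch, not the data-acquisition system described earlier—the "keep everything, then wipe out the last half hour" behavior could look something like the rolling buffer below; all class and method names are made up.

    # Sketch (made-up names): keep every audio chunk in a rolling buffer and let a
    # button press retroactively drop the most recent span.
    import time
    from collections import deque

    class RetroactiveRecorder:
        def __init__(self):
            self.chunks = deque()                 # (timestamp, audio_chunk) pairs

        def append(self, audio_chunk):
            """Called by the capture loop for each new chunk of audio."""
            self.chunks.append((time.time(), audio_chunk))

        def forget_last(self, seconds):
            """Button handler: discard everything newer than `seconds` ago."""
            cutoff = time.time() - seconds
            while self.chunks and self.chunks[-1][0] >= cutoff:
                self.chunks.pop()

    rec = RetroactiveRecorder()
    rec.append(b"\x00" * 320)                     # pretend 10 ms of 16-bit, 16 kHz audio
    rec.forget_last(30 * 60)                      # "stop recording half an hour ago"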
>>: Other thing is …
>> Emmanuel Dupoux: That’s excellent.
>>: … there's an experiment in some ways quite similar to infant learning, which is mirror … migrant
communities where the mother learns the language from the children, because the children figure
out from the other kids in the street how to speak the language, and certainly not from their parents,
who just don't know the language. It's happening now in Europe with a few hundred thousand
people.
>> Emmanuel Dupoux: Mmhmm, mmhmm.
>>: And you can ask the kids in their own language what the hell is going on. They will … then they will
explain. [laughter]
>> Emmanuel Dupoux: Alright, sorry. Yes, that’s right.
>>: I had another question on this unsupervised versus semi-supervised versus supervised.
>>: They do that now with texting.
>> Emmanuel Dupoux: Sorry?
>>: Texting?
>>: Kids do that already with texting. [laughs] Parent has to ask what the kid just said.
>> Emmanuel Dupoux: Yep.
>>: And kind of following on Lee's point—so what I'm going to say—oh, so if you have … if you had … so
a thousand hours of unsupervised is never going to be as good as if it was a thousand hours of supervised,
obviously. And I like your example of just having, say, twenty minutes of supervised, just to do some final
kind of touch-up. But there's potential, of course, if you have—you know—a million hours of
unsupervised—since we don't have a million hours of supervised—that it could do better. So have you
compared sort of the features that come out, after you then run them through a supervised system, to
see how they do versus, like, MFCCs or whatever on a supervised system?
>> Emmanuel Dupoux: Well, sure, sure. That's one of the toplines we set up.
>>: The topline was MFCCs or …?
>> Emmanuel Dupoux: So … well, so we had … we have two toplines; we had a topline with MFCC and a
topline which was the …
>>: Yeah, I know; then it was well. So that’s why.
>> Emmanuel Dupoux: … which was—where is it? Let me see …
>>: But the topline's the kind of thing that other people are doing …
>> Emmanuel Dupoux: Yeah, on this one, we had a … we … the posteriorgram of MFCC, and this one was
the final layer of a DNN.
>>: And was yours purely unsupervised, or was yours then …
>> Emmanuel Dupoux: So this one was done with the words—the words of the sentences.
>>: Right, but the … but you never—oh, the … okay—but you never did a final sort of phase of
supervision on the features?
>> Emmanuel Dupoux: No, no.
>>: Like, right. So have you … I think that'd be interesting to try, like, now that you've gotten these new
features [indiscernible]
>> Emmanuel Dupoux: [indiscernible] you try to improve on that—I mean—to see whether, with these
features, starting from that, we can beat this one.
>>: Yeah, exactly.
>> Emmanuel Dupoux: That’s it, okay. Yeah, yeah, that’s a good idea.
>>: Yeah, 'cause that's sort of … I mean, that's not the point, but that would be an advantage of doing the
unsupervised learning: that you're trying to produce better features.
>> Emmanuel Dupoux: Yeah, because you could … basically, you could do that as a pre-training.
>>: Yeah, yeah, mmhmm.
>> Emmanuel Dupoux: You pre-train your system on millions of hours, and then you only need a little
bit of fine-tuning, which would not be very costly.
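A minimal sketch of that pre-train-then-fine-tune recipe, assuming PyTorch and using a simple autoencoder as a stand-in for whatever unsupervised objective one would actually use; the tensors are random placeholders for the unlabeled hours and the small labeled set, and the architecture is deliberately tiny.

    # Sketch (assumes PyTorch): pre-train an encoder on "unlabeled" frames with a
    # reconstruction loss, then cheaply fine-tune a small classifier head on a tiny
    # labeled set. All data here are random placeholders.
    import torch
    import torch.nn as nn

    dim_in, dim_hid, n_phones = 40, 128, 30
    encoder = nn.Sequential(nn.Linear(dim_in, dim_hid), nn.ReLU())
    decoder = nn.Linear(dim_hid, dim_in)
    head = nn.Linear(dim_hid, n_phones)

    unlabeled = torch.randn(5000, dim_in)          # stands in for the unlabeled hours
    labeled_x = torch.randn(200, dim_in)           # stands in for ~20 min of labels
    labeled_y = torch.randint(0, n_phones, (200,))

    # 1) Unsupervised pre-training: no labels, just reconstruction.
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    for _ in range(50):
        opt.zero_grad()
        loss = nn.functional.mse_loss(decoder(encoder(unlabeled)), unlabeled)
        loss.backward()
        opt.step()

    # 2) Cheap supervised fine-tuning: train only the small head, encoder frozen.
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    for _ in range(50):
        opt.zero_grad()
        with torch.no_grad():
            z = encoder(labeled_x)
        loss = nn.functional.cross_entropy(head(z), labeled_y)
        loss.backward()
        opt.step()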
>>: Right, and you could also see where the crossover point is. So if you take the purely
supervised, and then your unsupervised with some amount of supervised data, you can see how
much supervision … like, with no supervision, you'll beat it, and with, like, full supervision, maybe
it'll beat you, but you could kind of try to push the crossover point farther and farther so that your
advantage happens with … with more and more data.
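A rough sketch of that crossover-point comparison, under the assumption that one simply trains the same classifier on growing labeled budgets, once on baseline features (MFCC-like) and once on features from an unsupervised system. The feature matrices are random placeholders, so the printed scores are meaningless; only the shape of the sweep is the point.

    # Sketch of the crossover comparison: same classifier, growing labeled budgets,
    # baseline features vs. (hypothetical) unsupervised features.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_train, n_test = 20000, 4000
    labels = rng.integers(0, 30, size=n_train + n_test)
    baseline = rng.normal(size=(n_train + n_test, 39))     # placeholder "MFCC" features
    pretrained = rng.normal(size=(n_train + n_test, 128))  # placeholder learned features

    def score(feats, budget):
        clf = LogisticRegression(max_iter=200).fit(feats[:budget], labels[:budget])
        return clf.score(feats[n_train:], labels[n_train:])

    for budget in (500, 2000, 8000, 20000):                # growing amounts of supervision
        print(budget, score(baseline, budget), score(pretrained, budget))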
>> Emmanuel Dupoux: Yeah, yeah, I think that's what … I think we sti … we're still actually … I think we
can improve this, really, because I mean, each of these models I presented were using
different ideas, and as you say, there are many other ideas, like prior information, that were not used. So
I think one other thing would be: try to combine them, see how far we can get, and then we may
actually already be competitive with the supervised system on these measures. Of course—to
be useful in the supervised setting—you need to go back to supervision, and that's where the problem is.
>>: Sure, but …
>> Emmanuel Dupoux: But I agree that that's the way to …
>>: But if you start at this point, you may do even better.
>> Ran Gilad-Bachrach: Let's try to … so 'Manuel is gonna be around for the next couple of days. If you
want to discuss more things with him—he still has some open time, I think, in his calendar—let me know,
or go directly to him. But I sugge … at this point, we'll take the discussion offline. Let's thank him
once again. [applause]
>> Emmanuel Dupoux: Thank you very much.