>> Ran Gilad-Bachrach: So it’s a great pleasure to introduce a …
>>: [indiscernible]
>> Ran Gilad-Bachrach: … Emmanuel Dupoux [indiscernible] working with Microsoft. Emmanuel received
his PhD in cognitive psychology from the Ecole Normale Supérieure in Paris, but he also has an
engineering degree in telecom, and therefore his research kind of bridges between cognition and
communication, and he focuses on language acquisition in little kids. He’s published many dozens of papers; I
didn’t—you know—count [indiscernible]; it was too big of a number for me—but
one interesting thing that I found out is that his book has been translated into English, Chinese, Greek,
Italian, Japanese, Portuguese, and Spanish, but if you want to, you can also read it in French in its
original form. [laughter] Don’t worry about it. So without further ado, I’ll hand it over to you.
>> Emmanuel Dupoux: Okay, thank you very much. So thank you for organizing all this and inviting me
here. So yeah, so I would like to talk to you about … it’s an old project of mine that I have been pursuing
for a long time, but most of my career was done in studying infant language acquisition, but I always
wanted to try to make the link between this field, which is a sub-part of cognitive science, and
engineering, which was sort of a part of my initial training. So the motivation is like this: if you look at how
machines learn, typically, in the standard paradigm of machine learning, you start with some
data—okay—you have an algorithm, and then you basically are using human experts to generate, on the
same data, lots and lots of labels, which you use then to train your algorithms to reproduce the target,
okay? So that’s the standard paradigm in machine learning. If you look at how humans are doing
learning, it’s quite different. You still get lots of data—okay—even, maybe, more
data, but basically, everybody is confronted with that data, and humans are
interacting with one another, and they somehow manage to learn together, okay? One other thing
that’s happening on top there is that they do exchange information—sort of feedback information—
but it’s much, much less … the amount of data that you get—the bitrate—is much lower than what you
get when you do the machine learning part. So that means that if you want to build a machine that
would learn just like the human, you would have to radically change the paradigm and try to basically
build systems that would learn in a sort of weakly-supervised way. I’m not saying it’s completely
unsupervised … I’m just saying it’s weakly supervised; and what you get is basically ambiguous and sparse
signals to learn from.
Okay, so I’m going to focus on the case of speech recognition—okay—because that’s what I’ve been
studying mostly—how babies are learning languages. And so in a typical HMM-plus-deep-network architecture, that’s
what you get: you get speech—hours of speech—and then you encode that into some kind of features;
you train your system provided with a lot of labels, okay? Now, of course, that’s—even from an
engineering point of view—it’s a problem if you move to languages that have low resources, okay? So
this is the rank of languages with the number of speakers, and so these languages are pretty well-studied; there are lots of resources; you get TIMIT, Switchboard, and LibriSpeech—lots of hours that have
been carefully annotated—but then you go to most other languages, and you don’t have that; in fact, half of
the languages in the world don’t even have an orthography to start with. So that’s a problem if you want to
build speech technology for these languages, and there’s another problem. Of course, here I noted that
to gather high-quality linguistic annotation is costly, but also, it’s not even certain that it’s done right. In
fact, when phoneticians annotate speech or build dictionaries, are we sure that they are really doing the
right thing? We don’t know, actually. In some areas of linguistics, there’s a bit of a
debate about which labels you should use, et cetera, et cetera. So that’s another issue; it’s really not
much discussed, but I think it’s an issue.
So what we wanted to do was take the extreme case—completely extreme case—on the other end of
the spectrum: learning speech from scratch, okay? So that’s the so-called zero-resource challenge; zero
resource doesn’t mean that you have no … nothing at all; you have the speech—okay—but you have no
label, okay? And so for a completely unknown language, what you have to do is to learn acoustic
models and learn the pronunciation lexicon, okay? So those are the two tracks. Now, it may seem a little
bit extreme, but we know that human infants are doing something like that during the first year of life,
okay? So during the first year of life, they don’t really talk much—in terms of uttering words
and sentences—so they don’t get a lot of feedback from their parents about whether what they are saying is
correct or not, but if you look at what we know now in terms of what they understand, experiments
have shown that they start to understand … to basically build acoustic models for the vowels and
consonants in their language, and this starts around six months, and it seems
to be almost over—not completely over, but certainly well advanced—at the end of that first
year of life. They also learn a lot of prosody, and they also start to recognize isolated words—okay—
recognize words. So they do a little bit of language modelling as well. All of this is taking place in sort of
… in a parallel way, which is a bit strange when you think of it, but that’s the data we have. So the
idea is that we should try to model that with a system that tries to learn from raw speech, okay?
So why would we like to do that? Well, the idea is that if you do this sort of weakly-supervised
learning, you will gather new ideas about new architectures that haven’t been tried before in
this area of automatic speech recognition; you may gain flexible, adaptive algorithms, or
systems that can bootstrap your ASR in an unknown language, for instance. You can also probably
be better at doing tasks like keyword spotting; imagine you have a large library of sounds—of
recordings—in an unknown language, and you want to retrieve documents that contain a given word;
this kind of task can be done even if you don’t have any labels, so that’s the kind of task that you
could do if you were to make improvements on this problem, okay? Now, the other thing that you would
gain if you were to succeed in this enterprise is a model of how infants could be learning language,
which would have applications in clinical research, for instance.
Okay, so what I’m going to talk to you about today is the challenge that we organized and presented at
the last InterSpeech; it’s called the Zero Resource Speech Challenge. And so as I said before, we have two
tracks; the first one is to learn acoustic representations, okay? Now, this is something that has actually
been around for quite a while. The base was this Kohonen paper that I read when I was a student,
and I was very impressed; I said, “Wow, you can learn”—this is the phonetic
typewriter, right—“you can learn a representation of speech from the raw signal with just these self-organizing maps.” That was the paper, and then nothing came out of it, and then, all of these different
other ideas have been around; people have been trying other things, like, for instance, trying to
develop speech features in speech technology inspired by human physiology; then people have been
devising features that are even closer to what we know the brain—or the auditory nerve—is doing,
okay? So this is more in the area of cognitive science; this is more in speech engineering; and then
people, more recently, have been trying deep autoencoders; they have been trying to use HMMs—
unsupervised HMMs—and the MIT group has been playing with hierarchical non-parametric Bayesian
clustering. All of these are trying to build good acoustic representations, and by good acoustic
representations, I mean representations that would basically have a linguistic use—that is, if you take two
instances of the syllable “ba,” they would be put together very
closely, and if you have a “ba” and a “ga,” they would be separated, okay? So that’s
the notion of good acoustic representations. Now, what’s striking to me, when I started to review
this, is that—by the way, this is not at all a complete list, okay? A huge number of people
have been interested, at some point in their career, in discovering units in speech automatically, and they
have tried many different ideas. And so what was striking to me is that none of them used the same
way to evaluate whether the systems were working or not; each paper is using its own dataset, its own
evaluation procedure; they don’t cite the others—so none of these people are citing the
others and vice versa.
So it’s an extremely fragmented field that really needs to have some kind of framework … I mean, if we want
to have some progress in that, we will need to basically be able to evaluate all of these different ideas in
a common framework, and that was my main motivation in setting up the challenge, really:
to stop having this completely chaotic process of people trying things and not relating to others. So to
evaluate an acoustic representation, what people typically do—because they are in the frame of mind of
supervised learning—is they would train a classifier. Okay, you have a representation of speech—let’s say it’s
auditory features or whatever, or posteriorgrams, or something—then you train a phone classifier, and
you compute the phone error rate. That’s what people do. Now, of course, I am not very happy with
that, and why is that? Because, in fact, there are many different types of classifiers that you could train,
so you have to specify a hypothesis space; you have a problem of optimization—some algorithms are better
than others; some of them have free parameters; and then you need to worry about
over-fitting. All these problems arise when you try to do supervised learning, and they are
going to be affected by the number of dimensions in your representation—all that stuff. So what we
decided to do was to go back to the basics and try to do a task that involves no learning, okay? We
remain faithful to our idea of unsupervised system, so we do it also for the evaluation. So this is a task
that’s very well-known in human psychophysics: I give you three tokens—so “ba,” “ga,” and then another
one that could be either “ba” or “ga”—and that could be the same speaker or not, okay? And your task is to
say whether X is closer to A or to B, okay? And so we set up the task like this, by constructing a big
list of minimal pairs—pairs of syllables or words that only differ in one particular phoneme.
Why do we do that? Because we know that if a system wants to recognize words at the
end, at the least, it will have to distinguish between “buh” and “guh,” okay? Yes?
>>: [indiscernible] at minus ten signal-to-noise ratio, you must convey useful information?
>> Emmanuel Dupoux: Sorry … I mean, yeah, yeah.
>>: Could you repeat the question?
>> Emmanuel Dupoux: So there … yeah, yeah, I didn’t comment on this graph, but I’m going to repeat
the question when it’s there. So with this kind of task—okay—so you can test this kind of ta … so this
graph here was obtained with a speech dataset that’s very simple; it’s called the Articulation
Index; it only has syllables—isolated syllables—that have been produced in the lab, and people have to
recognize them in various amounts of noise. And so that’s basically the curve that you get if you do this
kind of task—the ABX task—on humans, and that’s the performance that you get if you use some of
the standard features in speech recognition, like MFCC, for instance, or PLP. So RASTA-PLP is actually
getting better results. Now, if you do supervised HMM, you are around here, and really, the issue is: can
we bring unsupervised techniques as close as possible to the supervised technique?
Okay, yeah, the only thing I didn’t say is that in order to run this task, what you have to provide—if you
want to participate in the challenge—is speech features for your syllables. Okay, I
gave you syllables in waveform, and you have to translate that into acoustic features—okay—which is
basically going to be a matrix of values, and then, the other thing you have to provide is a distance—
alright—because I want to be able to compute the distance between “ba” and “ga,” okay? So the two
things you provide are the features and the distance, and then I will evaluate your
features with this minimal-pair ABX task and give you a number. Yeah?
>>: I’m just wondering how you control the signal to noise ratio, and you can set it.
>> Emmanuel Dupoux: Okay, so that … there—in this experiment—we just added noise.
>>: Oh.
>> Emmanuel Dupoux: Yeah, yeah. We added … I think some of the noises that are used
in one of these noise challenges; I don’t remember which one it is—CHiME or something. Yeah?
>>: And how well would humans do in this?
>> Emmanuel Dupoux: So that’s how humans do.
>>: Oh, this is human. Oh, this is human. Okay, okay.
>> Emmanuel Dupoux: This is human. This is human, and these are the features that have not been
trained, so these are just the acoustic features, and we try to devise features that would get close to us,
the humans.
>>: So … but then how do they do that discrimination in this case?
>> Emmanuel Dupoux: Oh, okay, so that’s the interesting thing: in this study, we actually
cheated; we didn’t run the ABX in this study; we have it, but I didn’t put it on this graph. There,
the humans have to transcribe, and we use the posteriorgram of their transcription as features, okay?
>>: Okay.
>> Emmanuel Dupoux: So it’s not exactly the … we use it as if they were machines trying to make
posteriorgram, okay.
>>: How many pairs do you have?
>> Emmanuel Dupoux: How many …?
>>: Pairs. You have “ba” and “ga;” how many pairs do you …?
>> Emmanuel Dupoux: Well, in this dataset, you have the entire set of syllables in English—CV
syllables—you also have VC syllables and some of the CVC syllables; there are a lot of them. And it’s
spoken by forty speakers. So it’s a big dataset that has been used quite a bit in psycholinguistics.
>>: So I assume that … the rest of the papers here—unsupervised learning—what they did is that they
automatically come up with the speech units.
>> Emmanuel Dupoux: Yes, but that’s … that … yeah.
>>: [indiscernible] the supervised learning.
>> Emmanuel Dupoux: Well, the idea is to have an unsupervised system and evaluate it
without doing any learning. So with this ABX, you can evaluate any model, okay? Any of these models
can be evaluated with the ABX minimal pair task, and it requires no
learning.
>>: I see.
>> Emmanuel Dupoux: It just requires to take the features and compute the distance—that’s all you
need to do. So that’s the only way we found to compare completely …
>>: Still [indiscernible] this is my diagram; I don’t know [indiscernible] How are these guys doing this?
>> Emmanuel Dupoux: So this one?
>>: Yeah. So you get the original data in there, and then the hidden layer here—the bottleneck layer—gives
you, maybe, thirty bits or something.
>> Emmanuel Dupoux: Right, so that’s the …
>>: How they do that? How do they actually get the number?
>> Emmanuel Dupoux: Well, they determine this with back-prop. I mean, this is … it’s a [indiscernible]
>>: Well, I know, but …
>>: But how do you compute the similarity [indiscernible]
>>: No, I just want to know how you get the curve—the number representing it, the zero or the one.
>> Emmanuel Dupoux: Ah, okay. So if you do that, you take the …
>>: Thirty-two bits, for example.
>> Emmanuel Dupoux: Thirty-two bits, and you do that [indiscernible] for each slice of your speech.
Now you basically construct a time matrix, and what you do is you do it for “ba”; you do it for “ga”; you
compute the distance …
>>: Okay.
>> Emmanuel Dupoux: … and you compute the error rate in assigning …
>>: Oh, I see … for the new one, you use the same Euclidean distance, for example, to see which one is
closest [indiscernible]
>> Emmanuel Dupoux: Yeah, yeah, yeah. Yeah, we use a cosine distance [indiscernible] So for each of
these representations, you can … you just have to define a distance, and then you can extract a value
and put it on the graph, okay?
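For readers following along, here is a minimal sketch of how an ABX score of this kind could be computed from submitted features, assuming only numpy; the frame-wise cosine distance, the DTW path averaging, and the crude length normalization below are illustrative assumptions, not necessarily what the challenge’s actual evaluation code does.

```python
import numpy as np

def cosine_dist(u, v):
    """Cosine distance between two feature frames."""
    denom = np.linalg.norm(u) * np.linalg.norm(v) + 1e-12
    return 1.0 - np.dot(u, v) / denom

def dtw_distance(A, B):
    """Average frame-wise cosine distance along the best DTW path between
    two (n_frames, n_dims) feature matrices of possibly different lengths."""
    n, m = len(A), len(B)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_dist(A[i - 1], B[j - 1])
            acc[i, j] = d + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m] / (n + m)  # crude length normalization

def abx_error_rate(triplets):
    """triplets: list of (A, B, X) feature matrices, where A and X are tokens
    of the same category (e.g. two 'ba's) and B is the contrast (e.g. 'ga').
    A triplet counts as an error when X does not end up closer to A."""
    errors = sum(dtw_distance(X, A) >= dtw_distance(X, B) for A, B, X in triplets)
    return errors / len(triplets)

# Toy usage with random "features" (each token is a frames-by-dims matrix),
# so the outcome here is arbitrary; real tokens come from the challenge data.
rng = np.random.default_rng(0)
ba1, ba2, ga = (rng.normal(size=(12, 13)), rng.normal(size=(10, 13)),
                rng.normal(size=(11, 13)))
print(abx_error_rate([(ba1, ga, ba2)]))  # 0.0 or 1.0 for a single triplet
```

As described in the talk, the score for one triplet is zero or one, and the reported number aggregates over many such minimal-pair triplets.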
>>: Am I correct here—and I think I’m missing something, right? If this is ABX, can I assume—if I’m a
system—can I assume that A and B are from different phonemes?
>> Emmanuel Dupoux: Yeah, that’s a very good point. It … so you …
>>: So assuming that’s the case, then I have a lot of supervised signal, and …
>> Emmanuel Dupoux: [laughs] Yeah, that’s an excellent point, and in fact, none of our models use that
signal, but I agree that it could be that … I mean, that’s also why we run it in humans like that. When
you run humans in the true ABX task, the humans could actually also learn about what the task is and
improve the performance. There, we use the humans … they are just transcribing their … the syllables,
so they don’t know that they are in an ABX task, but that’s an excellent point. I think they had a
question.
>>: Yeah, so [indiscernible] resources does each group get to use? So I’m a bit confused about the
definition of zero resource … are you able to access a lot of unsupervised audio signal, or do they
have …?
>> Emmanuel Dupoux: No.
>>: [indiscernible] so they had just ABX—the signal?
>> Emmanuel Dupoux: So the ABX is for the evaluation, okay? But for the training—I haven’t actually
… I haven’t shown you some of the models …
>>: Okay.
>> Emmanuel Dupoux: … that do the learning, but they just have speech, and that’s all they have.
>>: They have speech, okay.
>> Emmanuel Dupoux: And in principle, they shouldn’t use the ABX evaluation to make their system
better, which some of the participants did, but that’s okay. [laughter] It’s always like that a little bit.
Okay, so that’s track one—okay—track one, the task is basically: find a
representation, and then I’m going to evaluate it in terms of this ABX, okay? It should be as close as
possible to the human. Now, track two: spoken term discovery. So spoken term discovery … yeah?
>>: So a correct performance is merely that you pick whichever of the two—A or B—is closer to the …
>> Emmanuel Dupoux: That’s right, yeah.
>>: … vector encoding X. That’s it; doesn’t matter how far apart they are?
>> Emmanuel Dupoux: No, you just take that and …
>>: Just closer.
>> Emmanuel Dupoux: Yeah, and they’re … and you are given an error rate, which in that case—you just
have one triplet—is going to be zero or one, but of course, we aggregate over many, many triplets like
that.
>>: I have another high-level question.
>> Emmanuel Dupoux: Yes?
>>: Are the acoustic signals the original training data? Are they just—you know—recordings of
speech?
>> Emmanuel Dupoux: Mmhmm.
>>: I was wondering: what’s your thought on, like … when a child actually learns, there is a lot
more information than just the speech signal.
>> Emmanuel Dupoux: Yeah.
>>: There is a visual [indiscernible]
>> Emmanuel Dupoux: Yeah, yeah.
>>: So that is not considered in this [indiscernible]
>> Emmanuel Dupoux: Not in this challenge, but it might be in a future challenge, which I will come to
after that. Yes, but that’s also an excellent question. [laughs] Okay, so track two: track two, you have to
discover word-like units, okay? So there, our notion of a language model is very simple; you just have to
listen to speech, and then you have to find patterns that match, okay? So if I say twice the word
“dog,” then you’d basically discover that these two things are matching. So there have been, again … I
mean, there’s a large number of people who have tried that—again, with different evaluations each
time. So what we tried to do was to conceptually break down this process of spoken term discovery into
parts and then evaluate each of them separately—okay, a little bit like unit testing. So what people
do, typically, in spoken term discovery is that they will build a big similarity matrix: imagine that you
have your TIMIT corpus here on one axis and the same corpus there on the other axis, okay? And then you
try to find matching patterns, which are going to be diagonals in your similarity matrix—so this is a very
brute-force thing, okay? So now, you find matching pairs, okay? And then, typically, what people do is
they will cluster the pairs in a … with some kind of graph clustering algorithm, and then, they will use
this cluster to go back to the original signal, and parse it, and say, “Well, this is the word: my cluster one,
followed by cluster three, followed by cluster one.” Okay, so these are basically the three steps that
people do when they do spoken term discovery automatically, and so we have developed a bunch
of metrics to measure these three steps. I’m not going to bother you with the details, but
typically, these metrics are what is used in NLP; when people do word segmentation
based on text, they are using this kind of F-score metric, and this part is specific to the fact that it’s speech.
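As a rough illustration of the first of those three steps, here is a toy sketch that looks for matching stretches as long, high-similarity runs along the diagonals of a frame-level similarity matrix; the cosine similarity, the threshold, and the fixed minimum run length are assumptions for illustration, and real systems use segmental DTW and much more careful candidate filtering.

```python
import numpy as np

def frame_similarity(F1, F2):
    """Cosine similarity matrix between two (n_frames, n_dims) feature arrays.
    For matching a corpus against itself, pass the same array twice (and then
    ignore the trivial main diagonal)."""
    A = F1 / (np.linalg.norm(F1, axis=1, keepdims=True) + 1e-12)
    B = F2 / (np.linalg.norm(F2, axis=1, keepdims=True) + 1e-12)
    return A @ B.T

def find_diagonal_matches(sim, threshold=0.9, min_len=20):
    """Return (start_i, start_j, length) for runs along diagonals where the
    similarity stays above `threshold` for at least `min_len` frames
    (e.g. 20 frames is about 200 ms at a 10 ms frame shift)."""
    n, m = sim.shape
    matches = []
    for offset in range(-n + 1, m):
        diag = np.diagonal(sim, offset=offset)
        run_start = None
        for k, val in enumerate(np.append(diag, -np.inf)):  # sentinel ends runs
            if val >= threshold and run_start is None:
                run_start = k
            elif val < threshold and run_start is not None:
                if k - run_start >= min_len:
                    i = run_start + max(0, -offset)
                    j = run_start + max(0, offset)
                    matches.append((i, j, k - run_start))
                run_start = None
    return matches
```

The matching pairs found this way would then be handed to the clustering and parsing steps described above.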
>>: The very first step with the matching, is that done on the feature? That’s on the waveform or that’s
on the feature?
>> Emmanuel Dupoux: That’s … so if you have a system and you are competing in
this thing, I give you the waveform; you can transform that into whatever you want, except that
it shouldn’t be using the fact that you know that the language is English. Like, for instance, you
could do this kind of stuff on posteriorgrams of a phone recognizer; that would be cheating, because
you would be using the fact that you know the language—English, for instance.
>>: But the work in track one could plug right into track two.
>> Emmanuel Dupoux: Exactly, exactly, and vice versa.
>>: Okay.
>> Emmanuel Dupoux: Yep?
>>: What is the frame size in step one?
>> Emmanuel Dupoux: The what?
>>: The size of the frames used. You’d imagine “dog” would be a smaller frame than “octopus.”
>> Emmanuel Dupoux: Yeah, so …
>>: [indiscernible] choose a fixed frame …
>> Emmanuel Dupoux: So that’s for this challenge?
>>: Yeah, for this match.
>> Emmanuel Dupoux: Well, I mean, typically, you have frames here:
you have one frame every ten milliseconds. So “dog” is going to be, anyway, a sequence of
such vectors, so the word “dog” is going to be a portion of the matrix. So in fact, imagine
that you have found a repetition of “dog” here and here; they’re going to have slightly different durations in
the end, because you may speak faster or something.
>>: [indiscernible] like that.
>> Emmanuel Dupoux: So these algorithms are doing DTW, and they take care of
these differences in time.
Okay, data sets—so that was a question about data sets—so we used datasets that are not child-directed,
because we wanted to have very detailed annotation of the phonemes to evaluate, and we
couldn’t find any child-directed one that was of good enough quality, so we used smartphone-recorded read speech in
a language that’s not very well-known—it is Xitsonga, not very well-studied—and then we
added English with harder speech—conversational speech. This is the Buckeye
corpus, for those of you who are doing speech. So that’s what we had, and then we provided baselines
and toplines. So for ABX—for the first track—the baseline was just MFCC features; so already, with the
MFCC features, you get this performance in ABX—okay—which is around—I mean—twenty,
yeah, thirty percent error. And for the topline, we used the posteriorgram of an HMM-GMM trained on English, okay? So this is the cheating thing, okay? So somehow, we expected that the
systems would be somewhere in the middle, okay? Yeah, then on track two, for our baseline, we used a
spoken term discovery system developed at Johns Hopkins, and for the topline, we used the
transcription—the phoneme transcription—and we used a segmenter called adaptor grammars that is
sort of the state of the art in text segmentation, okay? So this is, of course, completely cheating; it’s
going to be extremely difficult to beat that topline, but we had that for comparison, because
these metrics are novel; we wanted to have some way to compare them.
>>: So for ABX topline, that means they use this posterior …
>> Emmanuel Dupoux: That’s right.
>>: … gram as the feature.
>> Emmanuel Dupoux: You use the [indiscernible]
>>: And then the same procedure just to compare [indiscernible]
>> Emmanuel Dupoux: That’s right; that’s right. So …
>>: And when you train this topline, you train it in a supervised manner; you start with an alignment.
>> Emmanuel Dupoux: Yeah, yeah. You start with a transcription and then do forced alignment and
whatever. I mean, I’m not saying that this is the best possible topline; we took Kaldi; we
tweaked it; and we tried to get a good performance; but we didn’t spend months optimizing it on the
dataset. This is a very difficult dataset, I must say—the English one.
>>: Because you could have actually [indiscernible] the GMM training in a completely unsupervised
manner, also.
>> Emmanuel Dupoux: Yeah, yeah, but that’s actually what competitors are going to do. So we would …
we just provide them this, because the competitors had to basically provide systems that would be
unsupervised. Okay, so participants: we got twenty-eight registrations; a number of people didn’t end
up doing the challenge, but we had nine accepted papers at InterSpeech that were presented; and a
number of institutions participated, so we were quite happy that people from completely different
labs—some of them I had never heard of before—participated. Interestingly, there were people
working on under-resourced languages, like the group in South Africa, where they have
lots and lots of languages with very few resources, and so they were very interested to participate in
this.
>>: Just a comment … I think it’s Jim Glass …
>> Emmanuel Dupoux: Yeah, so Jim Glass actually, in the end, didn’t send anything, [laughs] but he was
… he registered at the beginning.
>>: Oh [indiscernible]
>> Emmanuel Dupoux: I know; I know. He was at the very beginning of this—all these ideas.
>>: That’s what I thought, yeah.
>> Emmanuel Dupoux: But somehow, the system was not ready for the competition.
>>: Which group at Stanford participated?
>> Emmanuel Dupoux: So who …?
>>: Which group at Stanford participated?
>> Emmanuel Dupoux: So it’s actually people from cognitive science—Mike Frank. Mmhmm?
>>: Was the test set blind, or did they have access to the data?
>> Emmanuel Dupoux: So they had access to the speech—only to the speech …
>>: Only to the speech.
>> Emmanuel Dupoux: Yeah, and no labels—no labels. We released a sort of dev set
with a fraction of the English with labels at the beginning, but people also had the evaluation, so they
could actually … one team that, in the end, was not selected for InterSpeech had
really engineered their thing by using our evaluation. So next time we do that, we won’t
give out the evaluation publicly; we’ll just keep it and register everything so that we know what happens.
Okay, so there are a lot of papers; I just wanted to summarize for you the
main ideas that came out. So basically, in track one, there were three systems submitted that
had roughly the same kind of intuitions. What they wanted to do is basically … you could say it’s density
estimation somehow. So you are given speech; that’s all you have, so basically, what you try to do is to
estimate the probability of that speech with some reduced model, okay? So one way to do that was to
do dictionary learning; so these guys were doing the bottleneck autoencoders that we saw before; and
this group did a generative model with a Dirichlet process GMM, which actually was the best model of
all—which, for me, was completely unexpected—but they got extremely good results with a fairly simple
idea: you just build an HMM … not an HMM, a mixture of Gaussians—a large one—using a Dirichlet
process to find the number of components that you need, and you take the posteriorgram of that, and
that actually gives very good ABX results. This autoencoder, however, was interesting, because it
could beat the MFCC with only six bottleneck features—so that’s extreme compression of speech—
and it also could do quite well with binary features—you just binarize them, zero and one—so you
have twelve binary features, and you can beat the MFCC on the Xitsonga corpus. So these were sort of
unexpected and interesting results. So as I said, this one had very, very good results …
>>: But how do you get a posteriorgram if you don’t have phonetic information?
>> Emmanuel Dupoux: Well, you just …
>>: Posteriorgram is based on the phones.
>> Emmanuel Dupoux: No, it’s just based … you take … so you have all these Gaussians—a full, diagonal
mixture of Gaussians …
>>: Oh [indiscernible] of the components.
>> Emmanuel Dupoux: Of the … that’s right.
>>: I see, and that becomes the feature …
>> Emmanuel Dupoux: That becomes the feature, exactly.
>>: … on which they compute the Euclidean distance.
>> Emmanuel Dupoux: Yeah, yeah.
>>: Oh, okay.
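To make the posteriorgram idea concrete, here is a minimal sketch using scikit-learn: a Bayesian, Dirichlet-process-style mixture with diagonal covariances is fit on unlabeled frames, and the per-component posterior probabilities of each frame become the new features. The library, the number of components, and the exact settings are my assumptions for illustration, not the setup of the system discussed above.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Stand-in for MFCC frames pooled from the unlabeled corpus (n_frames, n_dims);
# in reality you would extract these from the audio.
rng = np.random.default_rng(0)
frames = rng.normal(size=(5000, 13))

# Dirichlet-process-style mixture: many components, most of which the model
# is free to leave effectively unused.
dpgmm = BayesianGaussianMixture(
    n_components=100,
    covariance_type="diag",
    weight_concentration_prior_type="dirichlet_process",
    max_iter=200,
    random_state=0,
)
dpgmm.fit(frames)

def posteriorgram(token_frames):
    """Map a (n_frames, n_dims) token to its (n_frames, n_components)
    posteriorgram: each row is the posterior over mixture components."""
    return dpgmm.predict_proba(token_frames)

# These posteriorgrams are then compared with the same ABX machinery
# (frame-wise distance plus DTW) as any other submitted feature.
print(posteriorgram(frames[:20]).shape)  # (20, 100)
```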
>> Emmanuel Dupoux: Alright. Then there’s a group from Carnegie Mellon who did articulatory
features. So they were actually using some side information, because they had trained an acoustic-to-articulation system, and they used that as features. They didn’t do very well, but I thought it was
very interesting to try to use articulatory features, because in fact, we know that babies probably have …
I mean, they do babbling, and so they could have access to some articulation. So I think—even though
they didn’t do very well—it’s a very interesting idea to continue exploring.
Now, the two other systems were using the same idea: the idea that you use
track two to help track one, okay? So imagine that you have found some words—candidates—then you
can use them to teach yourself better phone representations. Wait, I’m going to explain that a
little bit more. So that’s the system we submitted—I mean, a part of our team was organizing the
competition; another part was trying to submit something; we tried to put a wall
between them. [laughs] So the main idea is: because the lexicon is sparse in
phonetic space, it’s easier to find matching words than to find matching phonemes, okay? Phonemes
are very influenced by co-articulation, so if you look in the acoustic space, the phonemes are going to be
very, very different. But words are long, and they don’t have a lot of neighbors, meaning that if
you take two words at random, they are going to differ in most of their phonemes, so now you can
accumulate the distance, and you have a much better separation. Okay, so that’s the idea. The intuition
is: you find the lexicon using track two, and then you use this Siamese network
architecture, in which what you do first is you take your two words; imagine that you
found one instance of “rhinoceros” spoken by the mother, another one by the father, okay? What
you do is you align them using DTW, and now you take each part of that word—okay—and you present that
to a neural network, and because it’s aligned now, you can say that this part of “rhinoceros” is going to
be aligned with this part; so even though the “o” of the mother and the “o” of the father are different
acoustically, now the system is trying to match … to merge them, okay? Because the cost function there
is basically based on the cosine of the outputs of these two networks; these two networks have the
same weights, okay? They are Siamese; they share everything; they don’t share the input. You
take the inputs; you compute the cosine distance between the two outputs; and you try to make the
outputs orthogonal if the words are different, like “rhinoceros” and “grapefruit,” and you
try to make them collinear if it’s the same word.
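Here is a minimal PyTorch-style sketch of that Siamese idea, assuming the DTW alignment has already produced pairs of frames together with a same-word/different-word flag; the network sizes and the exact cosine-based loss (pushing same-word pairs toward collinear outputs and different-word pairs toward orthogonal ones) are illustrative stand-ins for the ABnet setup described here, not its actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """One tower of the Siamese network; both inputs go through the same
    weights (the two towers share everything except their inputs)."""
    def __init__(self, n_in=39, n_hidden=500, n_out=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_out))

    def forward(self, x):
        return self.net(x)

def siamese_cosine_loss(emb_a, emb_b, same):
    """Push embeddings of DTW-aligned frames from the same word toward
    cosine = 1, and frames from different words toward cosine = 0."""
    cos = F.cosine_similarity(emb_a, emb_b, dim=1)
    return torch.where(same, 1.0 - cos, cos.abs()).mean()

# Toy training step on random "aligned frame pairs".
encoder = SiameseEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
frames_a = torch.randn(32, 39)            # frames from token A
frames_b = torch.randn(32, 39)            # DTW-aligned frames from token B
same = torch.randint(0, 2, (32,)).bool()  # True if A and B are the same word

opt.zero_grad()
loss = siamese_cosine_loss(encoder(frames_a), encoder(frames_b), same)
loss.backward()
opt.step()
# After training, encoder(frames) is the representation submitted to the ABX evaluation.
```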
>>: Yes, so the A and B here are the same word uttered by different people.
>> Emmanuel Dupoux: That’s right.
>>: That’s why it [indiscernible]
>> Emmanuel Dupoux: That’s right.
>>: How does it do with rhyming words with the same number of syllables?
>> Emmanuel Dupoux: It will do badly. [laughs] Yeah?
>>: Just a …
>> Emmanuel Dupoux: Yeah, of course, and there are minimal pairs in languages. There are cases where
you have—I don’t know—“dog” and “doll,” okay? They are almost the same, except for one phone …
but the thing is that if you take two words randomly, the percentage of shared
phonemes is going to be only ten or twenty percent on average. So you are going to make some errors,
but it’s only going to be on a fraction of the words; the rest is going to be fine.
>>: That is kind of interesting given your interest in the way kids acquire language, because in many
cases, people try to rhyme to kids.
>> Emmanuel Dupoux: Yes. Well, maybe—I dunno—maybe they can ignore that. I don’t know; I
mean, rhyming is—I never thought of that—it’s an interesting, interesting point.
>>: Do the two tracks [indiscernible] share the same data?
>> Emmanuel Dupoux: Sorry?
>>: Task one and two, do they share the same data?
>> Emmanuel Dupoux: Yeah, yeah. So they were based on the same data. So—
yes—this is the performance that we got in a previous paper; in the previous paper, we ran this on
TIMIT, and there, the words were actually gold words; we were
cheating; we were using the lexicon to do that. But in this test, we found that—so this is the
performance of the filter bank on the ABX; this is the posteriorgram of a supervised system; and
this is what we get with this sort of weak or side information from the lexicon, which shows you that you can
really improve on the input representation and be almost as good as a supervised system. So this was
done on TIMIT.
>>: Yes, so your system is a DNN, and so where did you extract features? [indiscernible] can you go back to
the previous slide? Sorry, really curious.
>> Emmanuel Dupoux: Yeah.
>>: Yeah, so from this one, how did you … what features did you extract from here to do the comparison
for ABX?
>> Emmanuel Dupoux: Oh, okay, so the feature we get is this layer.
>>: The top layer.
>> Emmanuel Dupoux: The top layer.
>>: Okay, I see.
>> Emmanuel Dupoux: That’s right. Mmhmm, yeah. Okay, so that’s previous work we did last year with
TIMIT, and now we did it here, using the JHU spoken term discovery baseline, which is the
unsupervised system that was provided in the challenge. So we use that to give us the words—okay—that
we then use to train our Siamese network, okay? So now the words that are found are
completely unsupervised; they’re found by the spoken term discovery, which works with DTW, et
cetera; so we find the words, and then we train the system, and then we compare on the ABnet … on
the ABX. And so basically, our system was beating the MFCC in the two languages, and we were
actually beating the topline in one of the metrics, okay? So it’s getting pretty close to the
topline on English.
>>: How well was it doing versus the DPGMMs?
>> Emmanuel Dupoux: So the … it was just a couple of points. [laughter]
>>: The DPGMMs were … you mentioned, like, they were better than the topline.
>> Emmanuel Dupoux: Yeah, so they are … they were, like, eleven point nine.
>>: In both tasks; in English …
>> Emmanuel Dupoux: Mmhmm. Yeah, yeah, I have the results somewhere.
>>: Yeah, give me some sense of why that GMM could do so much better than …
>> Emmanuel Dupoux: No, it … so there’s something that they did …
>>: Yeah.
>> Emmanuel Dupoux: … which I discovered after the fact: they did speaker normalization on the input
features.
>>: Oh, okay.
>> Emmanuel Dupoux: And I think that could help a lot to …
>>: Okay, but you can do the same thing for [indiscernible]
>> Emmanuel Dupoux: Yes, of course, but we didn’t think of it, you know?
>>: It’s harder to do by syllable.
>> Emmanuel Dupoux: The thing is that these things are … I mean, it’s really a new field, and I
think what’s interesting is that these are sort of old ideas that people said, “Well, that’s not really
useful.” But the game is completely different when you do unsupervised learning, and basically, we
have to restart exploring all the ideas and then try to put them together. So that was my main
lesson there. Yeah, I’m not going to talk much about track two; for track two, we had only two
papers; there again, there were very interesting ideas—completely novel ideas, like, for instance, using
syllabic rhythm to try to find word boundaries, which turns out to work very well—a surprising idea, I
guess. But in brief, I think the time is ripe for doing this kind of comparative research; there had
been lots of ideas around, but they hadn’t been evaluated in the same way previously, and I think
there’s a lot to be done after that to try to combine the ideas in a system that would work in a
coherent way. The challenge is still open, and people are still actually registering, and downloading, and
trying to tinker with the system. There’s going to be a special session at SLTU and perhaps a
CLSP workshop for those interested.
So for the future, the future for me is: try to continue this logic by pushing the technology even further,
and I would like to have your feedback, actually, on what would be both interesting and feasible. One
idea that was floated was to scale up to have larger datasets, have more evaluations—like prosody,
language modelling, lexical semantics—or to go multimodal—like adding the video, for instance. If you
train associations between pictures and descriptions in speech in a language you don’t know,
and now you try to find more pictures that match, for instance—this kind of thing—that would be very
tough.
Of course, we want to go back to infants, and there are already applications of this kind of zero-resource technology that we’ve started to work on. For instance, what we did was to use a
dataset that we could have access to only in a special place, so you have to go there. I mean, most of
the infant datasets are not free, and they are not open, so to do experiments, you have to go to
this place; you run your software; and then you go back with your table of results, but you can’t
distribute anything, so that’s very annoying. But we showed that, for instance, infant-directed
speech—contrary to what everybody thinks—is not clearer than adult-directed speech; it’s actually
more difficult, because parents are using this [indiscernible] and super-exaggerated intonation, and
when they do that, they are making the acoustic tokens actually more variable. And so if you try
to learn an automatic classifier or representation, you are actually going to have a harder time. So that’s
interesting. What I’m saying is that all of these techniques are basically giving the linguistics
community quantitative tools that can help to analyze the data.
>>: Can you say a little bit more about what you mean by clearer [indiscernible]
>> Emmanuel Dupoux: Okay, so in that case, we just ran the ABX evaluation on a bunch of
features, like MFCC and these other features, et cetera, on triplets of things like: “ba,” “ga,” “ba.”
>>: Child or adult?
>> Emmanuel Dupoux: Child-directed or adult-directed, and matched for phonetic content and
everything. And we found that the ABX performance was actually lower in many cases.
>>: So that’s actually very counter-intuitive. You would think parents articulate—over-articulate.
>> Emmanuel Dupoux: Yes, so they over-articulate, but they also over-vary, so the
variance is increasing more than the mean, and in the end, you have more overlap, at least in
the dataset we had, which was in Japanese—a Japanese dataset collected in the lab. So I mean, that
brings me basically to my last point. Yes?
>>: Yeah, it’s also not clear that the function of the exaggerated prosody has anything to do with
improving phonetics.
>> Emmanuel Dupoux: Absolutely. I totally agree with that, but people have been assuming that
because parents are doing this thing, it must be for a pedagogic reason—it’s for the good of the baby.
>>: It wouldn’t necessarily be phonetic pedagogy—I mean—if it entrains attention in a special way,
then that might be its function.
>> Emmanuel Dupoux: Yeah, completely. So that must be what’s happening; there may be other explanations,
like, for instance, if you are basically trying to modulate your emotion specifically, you may
have less attention to devote to your articulators; I mean, there may be a bunch of explanations for this
effect, but I think it’s interesting—for the first time—to be able to really do a large-scale analysis on
this kind of dataset. Okay, so that brings me to the very important question for me, which is knowing
more about how infants learn language in terms of the input they have. So what is the variation in
amount, quantity, and quality of speech? Or what is the effect of feedback? Et cetera, et cetera. All
these questions are super-important; people have been saying lots of things about them; but not a lot of
quantitative analysis has been done, because we simply don’t have the datasets.
>>: You’re talking … it occurs to me kind of … you’re talking just about adult speech, and how adults talk
to infants, and how they talk to adults.
>> Emmanuel Dupoux: Right. Yes, that’s right.
>>: Even in that thousand hours of speech, it’s all just adults.
>> Emmanuel Dupoux: Yes, yes. We haven’t … I mean, it would be interesting to quantify also the
amount of speech they receive from siblings, but that would also completely co-vary with, for instance,
socio-economic status; I mean, in some families you have a large number of siblings and in others not.
All these questions could be looked at in a quantitative fashion, but for this, we need to have the data.
So there is a big community of people starting to gather data, using these kinds of devices, where you have a
small microphone you put in the baby’s outfit, and then you can collect large amounts of speech data in
various settings. So right now, there’s a consortium of ten labs working together; we already have a
thousand hours; but it’s in completely different languages and stuff like that; and this is going to grow.
So I think there are going to be masses and masses of data arriving, and what we need—and that’s also
why I’ve sort of started to contact people like you and other labs—we need some help with trying to
automate the annotation of these, ‘cause it’s a mess. If you just have the raw data, there’s nothing
you can do with it; you need to be able to do things like—okay—how much of this is child-directed versus non-child-directed and this kind of stuff.
>>: It’d be real nice to have mics on the parents, too.
>> Emmanuel Dupoux: Yes. [laughter]
>>: And exactly it [indiscernible]
>> Emmanuel Dupoux: So that’s the second one; the second phase is the Speechome-type project … so you all
know about Deb Roy’s gigantic experiment where he hooked up cameras—videos—in his home for
three years. So this is fanta …
>>: Did he release all this to the community?
>> Emmanuel Dupoux: Yes, so no. What’s your question?
>>: Did he … did he release all this so that it can …
>> Emmanuel Dupoux: No, no, ‘cause the data—look—the data … I mean, I’ve seen it; it’s a pile of
hard drives that are in a box and locked; no one has access to that.
>>: Oh.
>> Emmanuel Dupoux: Okay, so that’s a big problem, because it was a huge effort; there was a lot of
annotation done; but he didn’t think in advance about how he was going to use it and open it to the
scientific community. So that’s why I don’t want to start doing this before I have thought through all the steps,
okay? So the steps are going to be—okay—we need, first of all, better sensors, because with this video—
apparently, she told me—it was impossible to find objects in the video; it was difficult; they tried to do
automatic extraction; and it was too difficult, so they just didn’t do it. So I think that, with Kinect
sensors and things like this, we get better abilities to reconstruct the scene. So basically, the aim will be
to reconstruct the sensory input available to the child—okay—as if it were like—you know—a little
movie. You can see: okay, that’s the life of my baby, over time. Then we need to do
semi-automatic annotation to have metadata, like speech activity, object and person tracking, et cetera,
that’s robust to what’s happening in the house. So we tried to play with the Kinect SDK, and as soon as you
move behind a table, it’s finished; the baby is actually also invisible—that’s very funny. You have the—
you know—the skeleton tracking, and then the baby is not there; [laughs] it’s not seen by the model.
Okay, so all of this needs to be fixed, and then, the final question, which is extremely important, is the
ability to open this dataset while preserving the privacy, and there are a number of different
ideas that have been floated around, like having a server on which you have the data, and you cannot
access the data directly; what you could do is send algorithms to the server that would run your analysis
and extract only summary tables, but nothing of the content, okay? So all of this has to be thought
through, and if you are interested or you know people who could be interested in this kind of enterprise, I
think—given the scale of it—it needs to be done in the right way this time, not ending up sitting with
piles of hard drives doing nothing. So I’m very happy to have feedback on that. So I guess I’m
done. Thank you very much. [applause]
>>: So just one … I mean, would it be okay … would it be enough to have the parents sign a privacy
disclosure that they won’t [indiscernible] for anything they’re wanting to do? Would that be
enough to get around the privacy issue, or are there other problems?
>> Emmanuel Dupoux: I’m not sure I would be happy with that. I mean, that’s what people do in this
DARCLE community; they have people sign that; but the device is only there for one day or a week, and
then there are, like, ten recordings, and that’s all. So you can imagine that parents could sort of review
them and ask for deletion of certain parts. If you have three years of life recorded
continuously, it’s going to take you three years to find the things you want to remove. So I think … I
mean, my experience with that is that I had a post-doc who wanted to do that, okay? So
he started to wire his house with two Kinects and stuff like that, and after a while, he said, “Well,
no, I don’t want to release that, because I’m talking about colleagues in the lab, for instance.” [laughs]
You know? And sometimes they are talking about their credit card number and whatever. So
I think it’s going to be tough. Probably, we could do things to try to help parents
find the places which they could delete; like, for instance, you could do
emotional analysis, and if things become heated, you could say, “Well, maybe you want to delete
that.” So I think it’s still an option to try to make a sanitized version of it; of course, the sanitized
version would then be a biased sample of what the baby hears, but it still could be
immensely useful for research. And the other possibility—yeah—is to have everything, but
then restrict the access in this way. So yeah, there was a question, then the …
>>: Have you come across Cynthia Dwork’s work on privacy and database searches?
>> Emmanuel Dupoux: So who?
>>: Cynthia Dwork—D-W-O-R-K.
>> Emmanuel Dupoux: Okay.
>>: Basically, it’s way too easy to triangulate [indiscernible] and zone in on individual people.
>> Emmanuel Dupoux: Yeah.
>>: And she’s try … she’s fighting this beast with some success.
>> Emmanuel Dupoux: Okay, yeah, I should look that up. Yeah, I know it’s super difficult to really
prevent extraction of data for …
>>: Differential privacy.
>> Emmanuel Dupoux: Yeah, differential privacy.
>>: Yeah, that’s right.
>> Emmanuel Dupoux: Yeah, I mean, for instance, I was thinking the other day of something very
simple, like: imagine you train a neural network with, like, several
families’ worth of speech; then can you extract the weights? Maybe the weights contain
information—private information—you don’t know, right? Because … I mean, I don’t know if you’ve
seen these networks where you can make the network dream, and it will reconstruct what it has seen.
So you could do the same thing here, so it’s scary, right? [laughs] But it’s a neat problem. Okay, so we
had … yeah?
>>: Yeah, I … go ahead, okay.
>>: So here, you’re actually handling two problems at the same time: the language acquisition problem
and the knowledge of the world. You know, like, as a kid, your issue might be just like learning about
the world.
>> Emmanuel Dupoux: Right.
>>: Not exactly language, but do you, like, look at other domains, like second language
acquisition?
>> Emmanuel Dupoux: Uh-huh.
>>: Where it’s actually … you are conditioning over certain knowledge of the world—like I already know
the world as an adult, but I’m learning a new language.
>> Emmanuel Dupoux: Mmhmm.
>>: Maybe the data would be easier to collect; it’s not as controversial.
>> Emmanuel Dupoux: Yeah, yeah, sure, but think …
>>: And it’s maybe similar problem, but it’s not exactly the same problem as you’re solving.
>> Emmanuel Dupoux: I agree; I agree it’s an interesting problem; I think it’s not the same
problem, because you already have a language model; you have world knowledge; and basically, it
now becomes a problem of domain transfer. And a lot of people who are doing under-resourced languages, actually, that’s the approach they take. So, for instance, if you have one or
two African languages, and then you don’t have this particular one, well, you use these: you use the
phone recognizers; you concatenate them; and you get some features that are not too bad for the new
language, okay? That’s the kind of thing that people are doing, and it’s similar—actually—it’s
interestingly similar in … to what we do as adults when we learn a second language.
>>: But, for example, is there any project to kind of, like … okay, if there is this person who’s going to
learn a new language or is traveling to a new country, just have a microphone and record the
experience, because it would be the same.
>> Emmanuel Dupoux: You know, the thing is that when you learn a second language—in most
cases—I mean, they are very, very different experiences, but at least when we—in our
economically-advanced societies—learn a second language, we go to school; we take a book;
and we learn the words, the lexicon; we learn the rules of grammar. And so it’s very, very top-down. So
in fact, we learn it basically the way machine learning does it; [laughs] it’s through supervision. And
it’s—I think—not actually very successful, because we are very bad at learning in this kind of
situation. [laughs] But yeah, I think it’s very … so I think I had one more question that …
>>: Throughout the world, it must not be the case that most second language acquisition’s like that.
>> Emmanuel Dupoux: Sorry?
>>: Throughout the world, it must not be like that.
>> Emmanuel Dupoux: In fact, there are other cases of second language acquisition in the kids
themselves. In fact, there are many families—maybe most families—where they learn several
languages at once. So that’s not the situation we set up in our challenge; I’m working with a
[indiscernible] with a PhD student on trying to basically use language ID technology to distinguish—I
mean, to know—how many languages are spoken around you. Imagine you are just born; you hear people
talking; how many languages are there? So I think it’s possible to have an answer to that with some of
these techniques. That would help, because otherwise, you would be mixing everything; it would be
a disaster. Yeah?
>>: So I just had a follow-up point. On the Speechome, have you considered having
people do the editing of the speech and video recordings, like, in real time? Like, is that too much
overhead? If you had, like, a big red button in the house, where as you’re talking, decide: oh, I don’t
want the last half hour recorded, so you just kind of hit the button, and it goes away.
>> Emmanuel Dupoux: Yeah, yeah. So—yeah—we built a little [indiscernible] little data
acquisition system; we each had a button like this. So you have a button to turn it off, and then you
have another button to basically remove, or at least mark, what just happened.
>>: Yeah.
>> Emmanuel Dupoux: And then you can review.
>>: Right.
>> Emmanuel Dupoux: But that would help; that certainly would help to the …
>>: So a slightly easier way to do the same thing: you could have keywords that are very easy to
identify.
>> Emmanuel Dupoux: Ah, I see.
>>: [indiscernible] the process starts, and “I process now something, end,” but this doesn’t [laughter] …
they’d have to remember to say that.
>> Emmanuel Dupoux: Yes.
>>: They’d have to remember to say “end” specially.
>> Emmanuel Dupoux: Right, right, right, right. So I have a question here.
>>: Yeah, I have a [indiscernible] very fundamental question here. So my understanding is that this
community—right, the zero-resource—I mean, there was an earlier community that was doing low
resource.
>> Emmanuel Dupoux: Mmhmm, yeah.
>>: Like that’s typically [indiscernible]
>> Emmanuel Dupoux: Right, right, right. That’s right.
>>: [indiscernible] a key, and they have it. So that’s ultimately leading to the dialect … the recognition.
For example—you know—in some of these challenges, they do have all these
languages, but it just doesn’t make sense; to link with them is—you know—it’s not viable.
That’s manifold speech recognition comparison, for example. So I thought that unsupervised
learning, maybe semi-supervised, would actually help to save the cost of—you know—
doing the labeling, but … and I was thinking carefully about what you are talking about here, and it …
doesn’t address the issue at all. I’ll tell you why.
>> Emmanuel Dupoux: Yeah, yeah, yeah.
>>: I don’t know whether you’ll agree with me.
>> Emmanuel Dupoux: Yeah, yeah, yeah.
>>: So you say this is unsupervised learning, but in order to produce the text coming out of that language,
you still have to know whether it’s “ga” or “da.”
>> Emmanuel Dupoux: That’s right.
>>: And that’s supervision. You have to label that in order for that to take place.
>> Emmanuel Dupoux: Yeah, yeah, you have to learn how to read; that’s the problem. [laughs]
>>: Yeah, exactly. So it really doesn’t address that practical issue that …
>> Emmanuel Dupoux: I guess, I guess.
>>: So are there any ideas in this community that might address the issue … to make this research more
practically useful [indiscernible]
>> Emmanuel Dupoux: Well, yeah, for these kinds of cases … so I’ve talked to people, and they were not
very … I mean, they didn’t basically want to go too far into the Babel—the Babel thing. But one thing I
was thinking of was … you could imagine turning the challenge this way: I give you
a thousand hours of speech and then twenty minutes of annotation.
>>: Correct.
>> Emmanuel Dupoux: And now you have to learn. That would be …
>>: Correct. So that [indiscernible] doesn’t solve the problem, because it … so the only thing you do is
that you extract features.
>> Emmanuel Dupoux: Right. No.
>>: And the reason you don’t need to do—you know—labels is because there’s no prior to learn.
>> Emmanuel Dupoux: Yeah, yeah, yeah.
>>: And there’s … there’ll be a reason, but it doesn’t address the issue of doing transcription.
>> Emmanuel Dupoux: But you could; so imagine that you have found good representations—invariant
representations—for phonemes; then it would be reasonably easy to map them to orthography,
even with a very, very small data set—just twenty minutes or even ten minutes, maybe.
>>: Okay.
>> Emmanuel Dupoux: So that would be the way to link it: set up a situation where you just have a
minimal amount of annotation—just very, very short—and then you have to basically transcribe the
rest. But we haven't done it; we don't know whether it's feasible.
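To make that concrete, here is a minimal sketch—not something from the talk or from the challenge—of the kind of final mapping step being imagined: a small linear classifier trained on top of frozen, unsupervised frame-level features, with a tiny labeled set standing in for the "twenty minutes of annotation." It assumes scikit-learn is available; the feature matrix, label set, and sizes are random placeholders for real data.

    # Hypothetical sketch: map frozen, unsupervised speech features to phone labels
    # using only a tiny annotated set (~20 minutes of frames). Random placeholders
    # stand in for the real features and labels.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    feats = rng.normal(size=(12000, 40))      # ~20 min of 100 Hz frames, 40-d features
    labels = rng.integers(0, 30, size=12000)  # ~30 phone classes

    # Fit a small linear classifier on the frozen features; the unsupervised
    # feature extractor itself is never retrained.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(feats[:10000], labels[:10000])
    print(accuracy_score(labels[10000:], clf.predict(feats[10000:])))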
>>: I see. Well, there are two things: well, one is that it's obvious imposs … [indiscernible] I—you
know—without the labels, that it's really hard to come up with some of the features that may be relevant
directly to [indiscernible]
>> Emmanuel Dupoux: Well, we don’t know. I mean, that’s … maybe, maybe.
>>: Yeah, we don't know that part. I mean, all of the things that you listed there, they're really rubbish
in my book. I mean, if you put these features into recognizers, they're not going to be producing good
results.
>> Emmanuel Dupoux: Not …
>>: Compared with the [indiscernible] result with supervision, so we know that they're not good
enough. And the second thing is that, even if you come up with an efficient thing in order to produce a—
you know—text transcription, you still have to do some kind of learning, so …
>> Emmanuel Dupoux: You have to learn …
>>: Otherwise, this … so this is really … to me, it's not realistic … okay, maybe in a sense, it
is—in the sense of ABX testing, you get a feature—but in a practical sense, this just doesn't happen.
>> Emmanuel Dupoux: But if you have a … I mean, I think if you want to have a practical system, you will
have to go all the way, like not only challenge two, but challenge three or challenge five, where basically,
you learn the … an entire dialogue system from scratch.
>>: Yeah, I … that’s … yeah.
>> Emmanuel Dupoux: That would be the direction.
>>: I agree with you. Yeah, okay, so this is the very beginning of …
>> Emmanuel Dupoux: But this is the beginning. So I think that we are still … we are here, really, at
the beginning of something that could become the next technology in ten or twenty years.
>>: Yeah, but there are some ideas that could be viable, for example.
>> Emmanuel Dupoux: That can be recy …
>>: In this stuff, you never use the prior information about how phoneme sequences are …
>> Emmanuel Dupoux: Mmhmm, mmhmm.
>>: But if you use that, there may be some hope. [indiscernible] yeah, I have some ideas about
[indiscernible]
>> Emmanuel Dupoux: Okay, well, we can talk about it, okay.
>>: So the prior is missing here. I think that’s the main problem.
>>: So two points: one is that it's no problem in software: instead of saying, "stop recording now," you can say,
"stop recording half an hour ago." [laughter] So when you know …
>> Emmanuel Dupoux: That's good.
>>: Of course, that really means don’t wipe out the last half hour.
>> Emmanuel Dupoux: Yeah, you keep everything, and then you …
>>: And it'll destroy it unless …
>> Emmanuel Dupoux: Correct.
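For illustration only—this is a sketch, not the data-acquisition system described earlier—the "keep everything, then wipe out the last half hour" behavior could look something like the rolling buffer below; all class and method names are made up.

    # Sketch (made-up names): keep every audio chunk in a rolling buffer and let a
    # button press retroactively drop the most recent span.
    import time
    from collections import deque

    class RetroactiveRecorder:
        def __init__(self):
            self.chunks = deque()                 # (timestamp, audio_chunk) pairs

        def append(self, audio_chunk):
            """Called by the capture loop for each new chunk of audio."""
            self.chunks.append((time.time(), audio_chunk))

        def forget_last(self, seconds):
            """Button handler: discard everything newer than `seconds` ago."""
            cutoff = time.time() - seconds
            while self.chunks and self.chunks[-1][0] >= cutoff:
                self.chunks.pop()

    rec = RetroactiveRecorder()
    rec.append(b"\x00" * 320)                     # pretend 10 ms of 16-bit, 16 kHz audio
    rec.forget_last(30 * 60)                      # "stop recording half an hour ago"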
>>: Other thing is …
>> Emmanuel Dupoux: That’s excellent.
>>: … there's an experiment in some ways quite similar to infant learning, which is mirror … migrant
communities where the mother learns the language from the children, because the children figure
out from the other kids in the street how to speak the language, and certainly not from their parents,
who just don't know the language. It's happening now in Europe with a few hundred thousand
people.
>> Emmanuel Dupoux: Mmhmm, mmhmm.
>>: And you can ask the kids in their own language what the hell is going on. They will … then they will
explain. [laughter]
>> Emmanuel Dupoux: Alright, sorry. Yes, that’s right.
>>: I had another question on this unsupervised versus semi-supervised versus supervised.
>>: They do that now with texting.
>> Emmanuel Dupoux: Sorry?
>>: Texting?
>>: Kids do that already with texting. [laughs] Parent has to ask what the kid just said.
>> Emmanuel Dupoux: Yep.
>>: And kind of following on Lee's point—so what I'm going to say—oh, so if you have … if you had … so
a thousand hours of unsupervised is never going to be as good as if it was a thousand hours of supervised,
obviously. And I like your example of just having, say, twenty minutes of supervised, just to do some final
kind of touch-up. But there's potential, of course, if you have—you know—a million hours of
unsupervised—since we don't have a million hours of supervised—that it could do better. So have you
compared sort of the features that come out, after you then run them through a supervised system, to
see how they do versus, like, MFCCs or whatever on a supervised system?
>> Emmanuel Dupoux: Well, sure, sure. That's one of the toplines we set up.
>>: The topline was MFCCs or …?
>> Emmanuel Dupoux: So … well, so we had … we have two toplines; we had a topline with MFCC and a
topline which was the …
>>: Yeah, I know; then it was well. So that’s why.
>> Emmanuel Dupoux: … which was—where is it? Let me see …
>>: But the topline's the kind of thing that other people are doing …
>> Emmanuel Dupoux: Yeah, on this one, we had a … we … the posteriorgram of MFCC, and this one was
the final layer of a DNN.
>>: And was yours purely unsupervised, or was yours then …
>> Emmanuel Dupoux: So this one was done with the words—the words of the sentences.
>>: Right, but the … but you never—oh, the … okay—but you never did a final sort of phase of
supervision on the features?
>> Emmanuel Dupoux: No, no.
>>: Like, right. So have you … I think that'd be interesting to try, like, now that you've gotten these new
features [indiscernible]
>> Emmanuel Dupoux: [indiscernible] you try to improve on that—I mean—to see whether, with these
features, starting from that, we can beat this one.
>>: Yeah, exactly.
>> Emmanuel Dupoux: That’s it, okay. Yeah, yeah, that’s a good idea.
>>: Yeah, 'cause that's sort of … I mean, that's not the point, but that would be an advantage of doing the
unsupervised learning: that you're trying to produce better features.
>> Emmanuel Dupoux: Yeah, because you could … basically, you could do that as a pre-training.
>>: Yeah, yeah, mmhmm.
>> Emmanuel Dupoux: You pre-train your system on millions of hours, and then you only need a little
bit of fine-tuning, which would not be very costly.
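A minimal sketch of that pre-train-then-fine-tune recipe, assuming PyTorch and using a simple autoencoder as a stand-in for whatever unsupervised objective one would actually use; the tensors are random placeholders for the unlabeled hours and the small labeled set, and the architecture is deliberately tiny.

    # Sketch (assumes PyTorch): pre-train an encoder on "unlabeled" frames with a
    # reconstruction loss, then cheaply fine-tune a small classifier head on a tiny
    # labeled set. All data here are random placeholders.
    import torch
    import torch.nn as nn

    dim_in, dim_hid, n_phones = 40, 128, 30
    encoder = nn.Sequential(nn.Linear(dim_in, dim_hid), nn.ReLU())
    decoder = nn.Linear(dim_hid, dim_in)
    head = nn.Linear(dim_hid, n_phones)

    unlabeled = torch.randn(5000, dim_in)          # stands in for the unlabeled hours
    labeled_x = torch.randn(200, dim_in)           # stands in for ~20 min of labels
    labeled_y = torch.randint(0, n_phones, (200,))

    # 1) Unsupervised pre-training: no labels, just reconstruction.
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    for _ in range(50):
        opt.zero_grad()
        loss = nn.functional.mse_loss(decoder(encoder(unlabeled)), unlabeled)
        loss.backward()
        opt.step()

    # 2) Cheap supervised fine-tuning: train only the small head, encoder frozen.
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    for _ in range(50):
        opt.zero_grad()
        with torch.no_grad():
            z = encoder(labeled_x)
        loss = nn.functional.cross_entropy(head(z), labeled_y)
        loss.backward()
        opt.step()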
>>: Right, and you could also see where the crossover point is. So if you take the purely
supervised, and then your unsupervised with some amount of supervised data, you can see how
much supervision … like, with no supervision, you'll beat it, and with, like, full supervision, maybe
it'll beat you, but you could kind of try to push the crossover point farther and farther so that your
advantage happens with … with more and more data.
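A rough sketch of that crossover-point comparison, under the assumption that one simply trains the same classifier on growing labeled budgets, once on baseline features (MFCC-like) and once on features from an unsupervised system. The feature matrices are random placeholders, so the printed scores are meaningless; only the shape of the sweep is the point.

    # Sketch of the crossover comparison: same classifier, growing labeled budgets,
    # baseline features vs. (hypothetical) unsupervised features.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_train, n_test = 20000, 4000
    labels = rng.integers(0, 30, size=n_train + n_test)
    baseline = rng.normal(size=(n_train + n_test, 39))     # placeholder "MFCC" features
    pretrained = rng.normal(size=(n_train + n_test, 128))  # placeholder learned features

    def score(feats, budget):
        clf = LogisticRegression(max_iter=200).fit(feats[:budget], labels[:budget])
        return clf.score(feats[n_train:], labels[n_train:])

    for budget in (500, 2000, 8000, 20000):                # growing amounts of supervision
        print(budget, score(baseline, budget), score(pretrained, budget))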
>> Emmanuel Dupoux: Yeah, yeah, I think that's what … I think we sti … we're still actually … I think we
can improve this, really, because I mean, each of these models I presented were using
different ideas, and as you say, there are many other ideas, like prior information, that were not used. So
I think one other thing would be: try to combine them, see how far we can get, and then we may
actually already be competitive with the supervised system on these measures. Of course—to
be useful in the supervised setting—you need to go back to supervision, and that's where the problem is.
>>: Sure, but …
>> Emmanuel Dupoux: But I agree that that's the way to …
>>: But if you start at this point, you may do even better.
>> Ran Gilad-Bachrach: Let's try to … so 'Manuel is gonna be around for the next couple of days. If you
want to discuss more things with him—he still has some open time, I think, in his calendar—let me know,
or go directly to him. But I sugge … at this point, we'll take the discussion offline. Let's thank him
once again. [applause]
>> Emmanuel Dupoux: Thank you very much.