>> Ivan Tashev: Good afternoon, everyone, those who are in the room and
those who are watching us remotely. It's my pleasure to have professor Bryan
Pardo here. He is the head of Northwestern University Interactive Audio Lab,
and his talk will be about crowdsourcing audio production interfaces.
Without further ado, Bryan, you have the floor.
>> Bryan Pardo: Okay. Thank you. So here we are, crowdsourcing audio
production interfaces. And let me just tell you about where I work. I work
in the Interactive Audio Lab. And the Interactive Audio Lab is at
Northwestern University. And Northwestern University is in Chicago,
Illinois.
And I felt required to put in the next slide, given that I am in the Pacific
Northwest, because some of you here at Microsoft may be wondering how it is that
Northwestern University is in Chicago instead of in Seattle. And so for
historical reasons Northwestern is called Northwestern. Specifically back in
the 1790s there was a territory called the Northwest Territory long before
Seattle was even -- this area was even part of the United States, and
Northwestern University was named after the Northwest Territory. So we had
the name before you guys did, and we're not letting it go without a fight.
But you're not here to get a history lesson on American geography. You're
here to find out about what I'm doing and what goes on in my lab. So I work
in computer audition, and computer audition in particular is something that
combines a lot of stuff. And some of the things that it combines are
crowdsourcing, linguistics, music cognition, or at least the way I do it, and
we also bring in machine learning, information retrieval, and signal
processing. And we take all of these things together and apply them to
problems in computer audition. And you'll see a lot of that today.
So I want to start out with a question, which is: How do we humans
communicate ideas about sound? And I'm going to propose to you that a
composer named Mark Applebaum really kind of nailed it in this piece that he
wrote called Pre-Composition. And this is a sort of taped piece about his
process of making a piece of sound art. So let me play this for you.
>> Audio Playing: Let's get back to this -- to where we were. So we got
this -- there's this like nasal pulsing thing. Does anyone remember that
[making noise]? And then suddenly there's going to be like this kind of wind
sound, like [making noise]. And then, I don't know, what do you think should
happen next then? It could get like distorted and static-y, like [making
noise]. Oh, yeah, I like that. Oh, yeah, and then we could like send it
around like --
>> Bryan Pardo: Okay. So what was that? That was actually a pretty apt
kind of description or snippet of the kinds of conversations -- now, here he
had the conversation with himself, essentially, but the kind of conversations
that happen between artists and producers when they are making music.
And some of the things I want to point out here are the approaches that they
were using to communicate, or he was using to communicate in his own head.
One is a use of natural language terms such as distorted or static-y.
Another is examples, in this case made with his mouth [making noise]. And
finally evaluative feedback, like, oh, yeah, I like that.
At no point did you hear people talk about, say, poles and zeros of filters,
about constant Q, about decibels. At no point did anybody use those terms.
And that is typical when people are discussing their artistic creative
process when dealing with sound.
Now I'm going to give you an example task, which is a real task. So here we
go. I have a recording made in the 21st century. It sounds like this.
[music playing]. Now, this is specifically in an older style of American
music, and I would like it to sound tinny, like it's coming out of an
old-style radio, something more like this [music playing], as opposed to the
original [music playing].
So how do I do this? Now, of course, your first answer might be why don't
you just play it out of a 1940s radio and rerecord it? And that is an
excellent solution, but not necessarily one that I can implement if it is,
say, midnight and I have inspiration in my home and I don't happen to own a
1940s radio. But what I do have is Pro Tools, the industry-standard tool for
mixing, mastering, editing your software -- or editing your audio, I should
say.
So great. Let's take a look at a Pro Tools session. What you see on the
screen is a typical Pro Tools session, and up on the screen right now is the
tool that I would use to make it sound tinny. And my first question is: Is
it immediately obvious to you which tool that is? Have a look. Think
quietly to yourself which one you would pick.
>> Ivan Tashev: The question is [inaudible]?
>> Bryan Pardo: Okay. I'm going to make it easier on you. That's the tool.
Now I'm going to blow it up for you. This is a parametric equalizer, and a
parametric equalizer is not necessarily something labeled with words like
tinny or static-y or distorted or warm. It's got a bunch of things that help
you control gain, center frequencies, maybe width of filters. And then
there's something over here that if you're not someone steeped in audio you
may not know what this display is all about. This is actually frequency and
this is boost and cut.
And if you think about this as a search problem in an N-dimensional space,
you have N knobs. Somewhere in here is a region that we would call tinny.
All you have to do is turn the knobs until you discover the region. Look at
how many knobs there are and ask yourself if there were even 10 knobs with 10
settings each, do the numbers, that's 10 to the 10 combinations you would
need to try. But there's more than 10 knobs here.
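To make that arithmetic concrete, here is a rough sketch; the knob and step counts are illustrative rather than taken from any particular plug-in:

```python
# A rough sketch of the search-space arithmetic; knob and step counts are illustrative.
knobs = 10          # controls on the equalizer
steps = 10          # discrete settings we allow per control
combinations = steps ** knobs
print(f"{combinations:,} settings to audition")          # 10,000,000,000

# Even auditioning one setting per second, that is lifetimes of listening.
seconds_per_year = 60 * 60 * 24 * 365
print(f"~{combinations / seconds_per_year:.0f} years at one setting per second")
```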
Now, this interface bears some thought: Why is it like this? And the answer
is actually a cultural one. Before we had software tools to do this, we had
hardware tools. And as you can see, this is really an emulation of a
hardware tool with -- this is kind of a new feature. But in general they are
trying to stick to an interface that would have made a 1970s music producer
happy, someone used to the knobs and switches that were on their analog
equipment in 1972.
And when the transition to digital happened, these plug-ins, these audio
plug-ins were expensive. They might be 500 or $1,000. So their only market
was going to be people who were already professional engineers, who already
understood this interface.
So when the time came, they duplicated the interface so that there was
seamless transition from their point of view. And that made sense for that
generation of people. But it is not 1979. It is not even 2009. It is 2016.
And why are we continuing to emulate this hardware? The hardware was built
this way because of the constraints that they were dealing with when building
physical hardware. But we no longer have these constraints.
So what is the first thing that people did to make this easier to use? Well,
presets. So there'd be a drop-down menu that you could click on and see
presets that are going to solve your problem. So, okay, great. Here, this
isn't an equalizer now. This is a reverberation tool, something that makes
echoes. This, I just went this morning and I opened up a standard
reverberator and clicked on its drop-down list of presets, and some of these
words make sense to me. Like maybe under the bridge, I might have some vague
idea of what an echo of under the bridge might be like.
But what about memory space or corner verbation? And what is a bitter
hallway? Presets have historically been selected by the tool builders with
no actual thought about whether or not anyone other than the original maker
of the tool has any idea what these words mean. And as the number -- this is
something with a small number of presets. You can find tools that will have
50 or 100 presets with descriptive names like space laser Frisbee. And when
you have that, you are back to random search through a space; this time a
random search through presets.
It's not just EQ, it's not just reverberators. Here's a traditional --
here's a simplified traditional synthesizer controller. Again, lots of
knobs. Now, they've given up on trying to emulate the exact look, but
the same underlying thought: I'm going to have knobs that can correspond to
features, and I'm going to set myself in a feature space.
We tried some user experiments where we would generate a synthesizer sound
using this interface, then we would hand the interface to a novice, a
musician, but who wasn't one experienced with synthesizers, and say we
promise you that this sound was generated with this synthesizer and this
interface. All you have to do is set the knobs until you can get that sound
again. Here's a typical response: I give up. It's as if you put me in the
control room of an airplane and I can't take off.
This has effects, this choice of interfaces has effects. Here's a longtime
musician on trying to learn to produce music: I've been playing guitar for
30 years. I bought the interface software. I'm 48, work as a carpenter, and
I'm just too tired all the time to learn this stuff. There is so much to
learn at the same time. I don't know the terminology. I have given up for
now. Sad, because I have lots of ideas.
So for all of you who are watching this right now and think right to yourself
I have no problem with these interfaces, I don't see what the big deal is, I
bit the bullet and learned them, for every one of you, there are many other
people who tried, became overwhelmed and just gave up.
What we're going to work on, what I'm going to talk about today is how we've
been using crowdsourcing and a new sort of way of thinking about the issue to
provide an alternative set of interfaces that the kind of person who doesn't
naturally think in these terms of these existing tools can use. I'm not
trying to take away your existing knobs; I'm just trying to provide an
alternative.
You know, put another way, let's say the world was all pianos and I
introduced the trombone. The trombone can do things that pianos can't do;
pianos can do things trombones can't do. I'm not saying the piano has to go
away; I'm just providing you an alternate kind of interface, a different way
of dealing with the issue.
Specifically, we want to build interfaces that support natural ways of
communicating ideas, and we've defined natural as the way people talk to each
other about it. Natural language terms, examples, evaluation. Things I gave
you in that other example. The Mark Applebaum piece.
Okay. So here's the first interface. We called it iQ. Now, I see at least
two people, three, many people in the audience with glasses. And if you
think about the interface for selecting the right settings for your glasses,
you do not necessarily -- the traditional approach has not been you walk into
an eyeglasses store and say I would like something minus 1.75 spherical
diopters in my right eye minus 2.5 on my left eye with an axis of astigmatism
of 75 degrees off the vertical. Many of you probably don't even know your
exact prescription.
What happens is you go to the eye doctor, and the eye doctor says, okay, what
do you think of this? Is it better A or B? You are doing an evaluative
paradigm. And you quickly get to the eyeglasses setting that you want.
We decided to do this with equalization. And notice you don't have to know
terms like diopter to get your glasses prescription. You just have to be
able to say whether you like one thing more than another thing.
So that's the idea. So how do we make it go? Well, simply put, we take a
sound and we know that you're going to want to change it to -- you're going
to want to reEQ it, you're going to want to raise some frequencies and lower
some frequencies to make a sound more along the lines of what you want.
Okay. So what we do is we play you different EQ curves. These are different
EQ curves. Here's boost and cut. Here's high frequency to low frequency.
We take the sound, we run it through an EQ curve. You're thinking of the
effect you want, and then you rate it. How much do I like this? Is this
pretty tinny or not?
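One minimal way to picture how ratings could become a curve -- this is only a sketch, not the exact iQ/SocialEQ machinery, and all the numbers are invented -- is to correlate the boost applied in each frequency band with the ratings the listener gave:

```python
import numpy as np

rng = np.random.default_rng(0)

n_probes, n_bands = 25, 40          # probe EQ curves and frequency bands (illustrative sizes)
probes = rng.uniform(-12, 12, size=(n_probes, n_bands))   # random boost/cut in dB per band

# Pretend the listener's hidden notion of "tinny" boosts the top ten bands.
hidden_concept = np.zeros(n_bands)
hidden_concept[-10:] = 1.0

# The listener hears each probe applied to the sound and rates how "tinny" it is.
ratings = probes @ hidden_concept + rng.normal(0, 2, n_probes)

# Per-band correlation between the boost applied and the rating received:
# bands that track the ratings get boosted, bands that don't stay near zero.
learned_curve = np.array([np.corrcoef(probes[:, b], ratings)[0, 1] for b in range(n_bands)])
print(np.round(learned_curve, 2))   # high weights concentrate in the last ten bands
```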
And so let me show you this in action. I use dark, and I'm going to teach it
the word dark on a piano sound [music playing]. Maybe I'll use tinny just
because tinny was the example I've been using as a running example. Never
mind. Anyway, same piano sound, tinny. So this is the original sound. And
I sincerely hope that this works. I'm downloading -- this is all a Web demo.
This one is running in Flash; the next one will be node.js [music playing].
Here's the sound. Do I think it's tinny? I don't. Nah, I don't think that
one's tinny either. It's closer to what I think of as tinny.
So what I'm doing here is I'm -- as something is closer to my idea of tinny,
I'm going over here and pulling it to the right, which is where it says very
tinny, and if it's further, let's see what I get. I got an EQ curve [music
playing]. Well, it's more of a bright. Now what have I done? This is the
EQ curve I learned. And just for fun, we've got lots and lots of EQ curves
other people have taught it. And over here I've got the five nearest words
and the five farthest words. And so I can take a look and say somebody
taught it aggravating or sharp or bright. So I'm in a certain space of
words.
And so this first -- and so let me -- let me say what's happening here. This
first interface -- oh, we don't need to see that. This first interface was
one where we wanted to see whether we could use evaluative interaction. And
now I -- what I did was I just did it with eight. If I do it with 25 or 30,
I get a very nice capturing of my concept.
But then how did I do anything at all with eight? Well, the issue here,
why -- and why am I just choosing to do eight? Well, when I went ahead to do
this, even though we'd done some studies that showed that when you hand
somebody that first interface with all the knobs, that they get lost and they
never even finish. But when we handed them something with evaluative
interfaces where they had to answer 25, 30 questions before we gave them
something, they became annoyed that we asked them 30 questions. Even though
we could show that if we just handed them the original interface they'd never
actually get to the end. So the problem was now we were solving the problem
but we were still annoying people. How could we annoy people less? You ask
too many questions.
Well, one way is perhaps to use prior user data. Here are three EQ settings
out of a much larger set that we would normally play to someone if we didn't have
prior user data. Here are three people -- Sue, Jim, and Bob -- and each of
them had some concept in their head that they were going for. And if we
play -- when we train the system, if we play a manipulated audio example with
a certain EQ curve and say, well, how warm was that, Sue? And she gives an
answer. How dark was that, Jim? Mmm, Jim liked it, 1 is the strongest
positive thing you can say. And maybe down here this other example, Bob, who
is going for something called phat, really didn't like it.
Now we have a space where these EQ curves could be looked at in different
ways. They could be looked at in the space of how similar the curves
themselves are. We could also look at them in the space of how similar
different user ratings of the curves are, for certain curves may look a bit
different but maybe effectively people think of them as, eh, roughly
equivalent.
So if we take these EQ curves and let's say I'm going to take the responses,
Sue's evaluation of curve 2 and curve 3 where her goal was to make it sound
warm, I could take her concept now and put it in a space of evaluation. So
the concept is Sue's warm, and in the space of what she rated those two
examples as, it's here. And these other ones could be placed in this space
as well.
So I've just done a transformation from which space I'm looking at things in.
Before I had EQ curves; now I have concepts, user concepts. And now it -- the EQ curve, all my audio data has disappeared and now I just talk about the
ratings of the EQ curves, these underlying things, which could be anything.
Now, if I have some new concept I'm teaching the system, I can say, well, in
this space of user ratings, prior user ratings, who am I closest to? And
instead of having you evaluate lots and lots of EQ curves to try and figure
out what exactly I should do to manipulate all the different frequencies, I
ask you to evaluate a few EQ curves and then compare your answers to prior
users' answers on those EQ curves, prior users who did answer all the
questions, who did evaluate all of the different EQ settings.
And then I can say, well, rather than just hand you something arbitrary or
hand you something that required lots of answers, you answer a few questions,
we figure out who you're most like, and we use a weighted average weighted by
your distance to this new concept in this space. And instead of 25 questions
maybe I could take it down to a much smaller number. And, in fact, in this
case I showed you an example where we did it with eight questions.
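A minimal sketch of that transfer idea, with made-up ratings, curves, and sizes (the real system's data set and weighting scheme differ): each concept is a vector of ratings over shared probe curves, and the new concept's EQ curve is a distance-weighted average of the prior concepts' curves.

```python
import numpy as np

# Hypothetical prior data: each prior user rated the same shared probe EQ curves
# while teaching their own word, and each word ended up with a full EQ curve.
prior_ratings = np.array([   # rows: prior concepts (e.g. Sue-"warm", Jim-"dark", Bob-"phat")
    [ 0.9, -0.2,  0.5],      # ratings of probe curves 1..3
    [ 0.7,  0.1,  0.4],
    [-0.8,  0.6, -0.9],
])
prior_curves = np.array([    # the EQ curve each prior concept converged to (4 bands here)
    [ 3.0,  1.0, -2.0, -4.0],
    [ 2.0,  0.5, -1.0, -3.0],
    [-4.0,  2.0,  3.0,  1.0],
])

# A new user answers only the same few probe questions for their new word.
new_ratings = np.array([0.8, -0.1, 0.45])

# Distance in rating space tells us which prior concepts the new one resembles.
d = np.linalg.norm(prior_ratings - new_ratings, axis=1)
w = 1.0 / (d + 1e-6)
w /= w.sum()

# Predict the new concept's EQ curve as a distance-weighted average of prior curves.
predicted_curve = w @ prior_curves
print(np.round(predicted_curve, 2))
```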
And I might want to ask, well, which eight questions do I ask? If I have a
bunch of examples that people had to rate in the past before they figured out
what they were doing, maybe what I should do is figure out what the most
informative question is. And in this case we decided that the most
informative question or the most informative EQ curve to ask would be the one
that caused the largest disagreement amongst prior raters.
And if you think about this, let's say that I was trying to -- let's pick
something dumb. I was trying to determine your political party. And one
question that I have on the questionnaire is do I like ice cream and the
other one is do I like Obamacare. Now, probably most people will say they
like ice cream and so it doesn't have a lot of discriminative power. But
maybe do I like Obamacare would have a lot of discriminative power and I
would want to use that question first.
In this case, this example turned out to have the largest variance in user
ratings. And so if I was going to have to pick one dimension along which to
rate you or to judge you, I think I should ask this question first. And
amongst these three I would say I'll ask this one, then this one, then that
one.
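In code, that question-selection heuristic is just "play the probe whose prior ratings disagree the most"; the numbers below are invented for illustration:

```python
import numpy as np

# Hypothetical matrix of prior ratings: rows are prior users/concepts,
# columns are candidate probe EQ curves we could play next.
prior = np.array([
    [ 1.0,  0.9,  0.2],
    [ 0.9,  0.8, -0.7],
    [ 1.0,  0.7,  0.9],
    [ 0.8,  0.9, -0.9],
])

# The "ice cream" question: everyone agrees, so it tells us little.
# The probe with the largest disagreement (variance) separates users best.
variances = prior.var(axis=0)
ask_first = int(np.argmax(variances))
print(np.round(variances, 3), "-> ask probe", ask_first)   # the third probe wins
```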
And the result is social EQ, which is the thing that I just demoed for you
where instead of answering 25, 30 questions I answer about eight, and these
questions are really designed to locate me in a space of prior users who have
answered these questions before and then say, okay, we think you're most like
a weighted combination of these individuals; let's give you something like
that.
And that was pretty good. And then we -- and now you might be asking how did
we get this prior user data? And the answer is as it is for all
crowdsourcing, Amazon Mechanical Turk. If it were about five years ago, I'd
be telling you a story about how we were going to gamify this whole problem.
But as many of you who work in crowdsourcing know, it turns out those of us
who are good researchers aren't necessarily good game designers. The problem
of making a really great game that people will come back to and play over and
over again is really different from the problem of doing good research in
machine learning or interface design -- well, closer to interface design.
And if you think you can get the answer by paying $1,000 to get it on
Mechanical Turk, you should do that rather than spend a year designing a game
to get as much data as you would have by paying $1,000 from Mechanical Turk.
Which is what we did.
So we got that prior set of people by going onto Mechanical Turk and having
them teach us a word, for which they were paid between $1 and $1.50,
depending on how reliable they were. And words were
selected by the contributors. So we didn't tell them what word to choose.
We said we just want to learn a word that's an adjective about sound. You
pick the word and then show us what it means. And in this case we had them
do 40 rated examples per word.
And we had some inclusion criteria: Words existed in an English or Spanish
dictionary -- one thing I should mention is we did this both in English and
Spanish -- and that they self-report on listening on good speakers in a quiet
room and that they took longer than one minute to complete the task of rating
40 different examples in light of a particular goal. If you could rate 40
examples in less than a minute, we figured you were probably just after the
money. And, finally, we wanted people with consistency greater than one
standard deviation below the mean consistency.
What do I mean by consistency? We actually had 15 repeated examples in our
set of 40. So that means that if you heard an example and said, yeah, that's
a 1 and then the next time you heard the example and said that's a negative
1, you're not being consistent. So we wanted to make sure that independent
of the context in which it was heard, you would still say that EQ setting was
warm or whatever word you were going for.
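A sketch of that consistency filter, assuming consistency is measured as the correlation between a worker's two passes over the repeated examples (the study's exact measure may have been computed differently):

```python
import numpy as np

def consistency(first_pass, repeat_pass):
    """Agreement between a worker's ratings of the repeated examples,
    here just the correlation between the two passes."""
    return np.corrcoef(first_pass, repeat_pass)[0, 1]

# Hypothetical pool of workers, each with two passes over 15 repeated examples.
rng = np.random.default_rng(1)
workers = [(rng.uniform(-1, 1, 15),) * 2 for _ in range(8)]                       # consistent
workers += [(rng.uniform(-1, 1, 15), rng.uniform(-1, 1, 15)) for _ in range(2)]   # random

scores = np.array([consistency(a, b) for a, b in workers])
threshold = scores.mean() - scores.std()       # one standard deviation below the mean
keep = scores >= threshold
print(np.round(scores, 2), "keep:", keep)
```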
After the inclusion criterion, we ended up with 932 English words, 676
Spanish words of which 388 and 384 were unique. And here's a thing to
mention, the learning curve. So what is this learning curve about? Here's
the number of user ratings. And what do I mean by that? I have a sound that
I would like to have EQ'd somehow. Now, the machine hands me an EQ curve and
I rate it. That's one user rating. Based on this, it's going to try and
figure out what my rating of the next curve will be. Okay?
I rate the next one, and it sees how far off it is. Now I'm in a 2-space. I've got
two ratings, and we can try and build a model of me based on prior users who
answered questions like I did on two examples. We build a new model. We try
and predict your answer to question three. That is machine-to-user
correlation. The better the machine is at predicting your answer, the more
correlated the user and the machine are.
By the time we get to about 25, we can predict your answer, and we're kind of
done. This is our training curve without prior data. This is our training
curve in the end with prior data. And so we moved that learning curve up,
which means that prior users are, in fact, helpful.
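The y-axis of that learning curve, machine-to-user correlation, is just the correlation between the predicted and the actual ratings; here is a toy illustration with simulated predictions that sharpen as more probes get rated (the numbers are made up, not the study's data):

```python
import numpy as np

def machine_user_correlation(predicted, actual):
    """Correlation between the machine's predicted ratings and the ratings
    the user actually gave -- the y-axis of the learning curve."""
    return np.corrcoef(predicted, actual)[0, 1]

rng = np.random.default_rng(2)
actual = rng.uniform(-1, 1, 30)                # the user's true ratings (simulated)
for n_rated in (5, 15, 25):
    noise = 1.0 / n_rated                      # pretend the model sharpens with data
    predicted = actual + rng.normal(0, noise, actual.size)
    print(n_rated, round(machine_user_correlation(predicted, actual), 3))
```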
And a thing I don't actually have on this slide is, of course, if we
constrain it -- once we have enough data, if we constrain it to prior users
who used a word like your word -- that is, if I was training tinny, we only
use prior users who were tinny or a synonym drawn from WordNet like bright or
sharp or something like -- well, those aren't synonyms from WordNet, but we
know they're synonyms. You can do even better.
I mentioned English and Spanish, and one of the things that I think is
interesting about this work is we're actually doing a translation problem.
And we started out thinking of it as a translation problem between the
control parameter space or knobs or feature space, which is sort of a
perceptual thing, and perhaps a word space. I want to -- I'm leery about
saying a semantic space because a lot of times engineers throw around the
word semantic without knowing what it means, which is great because semantic
is about meaning.
But we were doing this translation. We started to think about, well, what
could this translation be useful for. It initially started out as a
translation between this complicated space of the tool and the user. It
turned out this translation was a great one between users and audio
engineers. And by user I now mean acoustic musicians and audio engineers
because we had this -- we were actually reviewed in Sound on Sound magazine
by a famous audio engineer for the initial version of this tool who said that
this solved a problem that audio engineers have been having since time
immemorial, which is you get a customer, a client who comes in and says make
my sound buttery, and the audio engineer says there's no buttery knob. I
have no idea what you're talking about. You go through this teaching
process, and we can discover whether buttery is in fact related to EQ at all
or some other tool, and if it is related to EQ what the EQ curve would be.
Then we got to thinking about translation in another sense, which is what's
the right word, como se dice, a warm sound. How do you say a warm sound in
Spanish? Would it be a sound that's calido? Is that the right term? And it
occurred to us we have this really interesting possibility here. Rather than
do the kind of machine translation that we're all used to, which is you have
paired texts and you find statistical correlations, we're going to do a
mapping between the perceptual space and the word space as follows.
You have people train -- here's an example. Rate how warm it is and give a
rating. Then you got someone else in another language. Here's an example.
Rate how calido it is. They give a rating. Examples where lots of people
give high ratings to this word in English and that word in Spanish, that
might be a good translation for that descriptive term.
>> Ivan Tashev: What about if the -- because warm is abstract here, it's not
temperature-wise warm. What about in Spanish for the same sound to use a
different abstract --
>> Bryan Pardo: Exactly.
>>: That's a problem more than --
>> Bryan Pardo: Yeah, yeah. So what we do is this. We don't actually
translate from the word to the word. We have words that get associated EQ
curves. So I have a word in English and a word in Spanish, and each of them
has an EQ curve that has been taught to it by the users.
Then if I take an English word, chunky, which has an EQ curve associated with
it, I go looking for the EQ curve from the Spanish users that looks most
similar to the EQ curve for chunky, and I call that a likely translation.
And, of course, these translations are different from paired texts, and at
least for audio, maybe better descriptions.
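A sketch of that curve-matching translation, with an entirely made-up miniature vocabulary and made-up curves: the translation of a word is the word in the other language whose learned EQ curve is nearest.

```python
import numpy as np

# Hypothetical learned curves (boost/cut per band) for words in each language.
english = {"chunky": np.array([ 4.0,  2.0, -1.0, -3.0]),
           "tinny":  np.array([-5.0, -2.0,  3.0,  6.0])}
spanish = {"calido":    np.array([ 3.5,  2.2, -0.8, -2.5]),
           "profundo":  np.array([ 6.0,  1.0, -4.0, -6.0]),
           "brillante": np.array([-4.0, -1.5,  2.8,  5.5])}

def translate(word, src, dst):
    """Return the word in dst whose learned EQ curve is closest to src[word]."""
    curve = src[word]
    return min(dst, key=lambda w: np.linalg.norm(dst[w] - curve))

print(translate("chunky", english, spanish))   # closest curve, not closest text
print(translate("tinny", english, spanish))
```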
So putting this in more concrete terms, here we go. What this is now is an
EQ curve but it's sort of also a probability distribution. And don't worry
about the units of this. Basically this is boost and that's cut. Here's 0.
And the brighter it is, statistically speaking, for the set of people, the
more likely it is that this amount of boost or cut was what people would say
it should have to teach it the word tinny. That's a spectral curve for
tinny.
Here's a spectral curve for light. Again, this is learned from six people in
this case, but you can kind of see this sort of. And one of the things we
discovered is you might have a word that you wouldn't expect to be, for
example, the antonym, the opposite of this. Here's hard. There's light.
There's hard.
Whoop. Here's warm. Here's profundo. It's not calido, it's more like deep.
So now we've got this interesting thing. You can translate between
nonexperts and experts using a tool like this. What is a buttery sound?
What's a whatever sound? You can also translate between different linguistic
groups like this. And this obviously isn't limited to audio per se. I mean,
I'm an audio guy, so that's what I did. We could do this -- the same game
could be played with color words, for example. Or maybe flavors. Having said
all that.
We also -- we didn't -- once we did this and we thought, well, it's not just
equalization, of course, again, I'm a sound guy, so I just said color words
and flavors, but I'm not interested in that. I'm interested in sound. So I
went ahead and did it again for reverberation. And we went out and collected
a lot of examples of reverberation. And so then we had this big dataset,
prior dataset of reverberation effects and the words associated with them,
equalization effects and the words associated with them. And we had this
philosophy which was the less teaching a human has to do, the better.
Now, we'd taken it from our interfaces from the starting point, which was
those hard-to-understand knobs, hard to understand by someone who isn't an
expert who already understands the signal processing or whatever, into here's
a teaching paradigm. Essentially, I evaluate -- it asks me questions. What
about this? That's pretty good. What about that? Better. But that
interaction, you know, we took it from maybe 25, 30 examples down to about
eight, and that was still pretty good.
But there's a reason that we use language as humans. It's a shortcut.
Right? When I say the word dog, you all picture something in your heads.
It's probably, you know, this tall, it's got four legs, its whole body is
covered with hair, it's got a wet nose. Imagine if I had to teach you what a
dog was every time we interacted. That would be very slow.
So since we'd been collecting this vocabulary, why don't we use this
vocabulary as a starting point and then only fine tune if the word that we
use isn't quite working out, which gets us to the interface for Audealize.
What you have here is an interface for -- I'm not sure if you can see this on
your tiny video screens for those of you who are watching at home, but EQ and
reverb, equalization and reverb. We've learned vocabularies for both of
them.
>> Ivan Tashev: Actually, the other video is small on their screens but the
slides are larger.
>> Bryan Pardo: Oh, okay. Great. So now the EQ and reverb, we've got
vocabularies for both. And we even have a reverberation setting and an EQ
setting that somehow we have some confidence in for each one. Now we can
take these words and place them in a space. In the case of reverberation,
there's five control parameters that we're using here. So you can imagine
each of these words is -- and a sixth is gain, but we'll say five-dimensional.
You can imagine each of these words with a reverb setting in a
five-dimensional space, and we're projecting this down into 2D to provide you
a word map.
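One way to get such a map -- a sketch only, since the actual Audealize embedding may be computed differently -- is to project each word's learned reverb setting from its handful of control dimensions down to 2-D, for example with plain PCA:

```python
import numpy as np

# Hypothetical learned reverb settings: one 5-dimensional parameter vector per word.
words = ["underwater", "cave", "church", "crisp", "dry"]
settings = np.array([
    [0.9, 0.8, 0.2, 0.7, 0.6],
    [0.8, 0.9, 0.3, 0.6, 0.5],
    [0.6, 0.7, 0.5, 0.5, 0.4],
    [0.2, 0.1, 0.8, 0.2, 0.3],
    [0.1, 0.1, 0.9, 0.1, 0.2],
])

# Project to 2-D with PCA so that words with similar-sounding settings land near
# each other on the map.
centered = settings - settings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:2].T          # (x, y) position for each word
for w, (x, y) in zip(words, coords):
    print(f"{w:>10}: ({x:+.2f}, {y:+.2f})")
```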
So what this means is that words that are closer to each other probably have
reverb effects that are more similar sounding and words that are further from
each other more different sounding. Now, your first thought might be, wait,
aren't I just getting back to that preset list? I thought you said that
presets were a problem. And they were a problem. They were a problem when
the presets used a vocabulary that was arbitrary, defined by the tool
builder, with no real thought about whether or not anyone else would
understand what the words are. The preset lists were a problem when they
were organized in some way that's not intuitive. Why were they in the order
they were in? It's hard to know how to search that list. Here close things
sound like each other; far things sound different from each other. So let us
say -- oh, and the other thing is, these things are a social construct in
some sense.
Let me tell you what I mean. When I was describing this work, I was talking
to the bass player who had gone on tour with Liz Phair, who is a Chicago -- she came up out of Chicago in the '90s, and she's a relatively well-known
indie rock artist. And she was talking to her engineer. He remembered a
particular conversation where she said I want you to make my guitar sound
underwater. This was an electric guitar. Now, if you think about it for
just an instant, if I walked over to Lake Michigan and took my electric
guitar and dropped it in the water, the sound that would happen would be a
splash and maybe if it was shorted out, maybe if it was turned on, there
would be a shorting out of something.
Here's what underwater sounds like. Here's a guitar first without underwater
[music playing]. Here's underwater [music playing]. Now, I'm not sure if
you can see this, but down here it tells you for each word -- underwater in
this case was learned from 95 contributors. Down here for each word it tells
you how many people it learned an example from.
>> Ivan Tashev: [inaudible] reverberation as underwater.
>> Bryan Pardo: Yes. So a certain reverberation setting was labeled as
underwater by 95 people, and therefore it made its way to be one of the
bigger words on our map. Or you could have, coming back to this [music
playing], maybe that's too much and I want something that's like crisp or
clean. Maybe I can't find the word on the map so I type it into this search
box. Got to cave.
This interface, if you compare exploring this interface to exploring that
first interface with all the knobs on it and just, again, don't think like an
engineer for a moment, think like you're an acoustic musician, you'd like to
make some -- it's midnight, you're at home, you're playing stuff into Garage
Band or whatever it is, and now you'd like to make it sound like I'm playing
my guitar and then I plunge underwater, it would be nice to know what reverb
effect would not just be underwater to me but give an impression, a general
impression to an average person of underwater. This kind of gives you that.
Gets you in the ballpark.
But let's say that you don't exactly like what you heard. Maybe underwater
is overdone. Whoop. Let's get that going again [music playing]. You could
always go back to the traditional interface. We provide you the controls if
you want them. You can search around in this space. You can go back to the
controls in the old parameter-based space. Or if you want to go back to that
evaluative thing, maybe there's a word like fitzleplink, which is not going
to be on my map. It tells me to try teaching it, and then we drop back to
the teaching interface that you saw earlier.
So the idea here now is that we started from interfaces, coming way back,
wherever it is, that look like this, right, the interface that comes -- ships
with a standard equalizer, parametric equalizer, and we turned it into an
interface that looks more like this. Here's the EQ tab. Where you think of
the word that would describe what you're after, and then you click on it.
And if you want to see -- and you want to get an idea for generally what
would happen if I went from here to over here, you very easily just drag
across. If you want to get back to the old interface, it's down there, but
you don't have to use it. And if you like just teaching it, there it is.
So now we've managed to cover two of our approaches. So one of the approaches
that I mentioned was an evaluative interface, and we got that. The next one
was natural language using words that regular people know. We went out and
asked lots of regular people on Mechanical Turk what word does that sound
make you think of, and we got that answer and we created an interface like
this.
But there was one other kind of interface that I'd mentioned we'd use or we
wanted to explore, which was the example based. Sometimes I say it's wind,
and you go wind, like this? And we use our example, and it's like no. And
then I try and go through the learning and no, it's still no, could you give
me an example yourself?
So Mark Cartwright, a doctoral student in my group, became very interested in
that. And so he came up with something called SynthAssist. And let me tell
you what the problem was. You might remember back earlier I said that we
gave an example problem to some novices, which was here's the sound that a
music synthesizer made. Here's the knob-based interface that someone used to
program the synthesizer. All we want from you is to get that sound back.
And this is a typical problem that you have with synthesizers if you're a
musician. Maybe you -- this is something a friend of mine had to do a lot.
She worked in a wedding band as a keyboard player. And so you've got your
keyboard, and the bride has asked you to do the latest Katy Perry hit. So
you need to make it sound like the Katy Perry. So you listen to the Katy
Perry, hear that keyboard sound, and then you go looking for it on your
interface. Oh, if only there would be something that would help me with this
search.
So Mark Cartwright went ahead and made a thing. And now this one I don't
have a live demo for because it actually -- you have to hook it in with his
synthesizer and stuff, but Mark was kind enough to make me a movie to show
you. So let me show you this movie.
>> Video playing: SynthAssist is an audio production tool that allows users
to interact with synthesizers in a more natural way. Instead of navigating
the synthesizer space using knobs and sliders that control
difficult-to-understand synthesizer parameters, users can search using
examples like existing recordings or vocal imitations. They then listen and
rate the machine's suggestions to give feedback during the interactive search
process. Let's see an example using a synthesizer database with 10,000
randomly generated sounds [making sounds].
>> Bryan Pardo: Let me just pause it there. So what happened? That first
sound was --
>> Ivan Tashev: What we have in mind.
>> Bryan Pardo: What he has in mind. The second sound was the best he could
do imitated with his mouth. So we record that imitation -- not what he has
in mind but the imitation, because you can't actually get what he has in
mind -- give it to the system [making sounds]. Let me pause it for a moment
and tell you what's happening here.
The system has produced a bunch of examples. In each of these examples,
along some dimension it's measuring, is a highly similar example. Now, what
the user is doing is clicking on them. And if you click on them and drag it
towards the center, towards the target, you're telling it that this example
is in some way similar. If you right click to delete, you're telling it it's
completely off. So dragging it towards the center means, yeah, this is
something good; away means something not so good; and click to delete means
terrible.
Whoop. I didn't mean to restart that. Is there a way for me to restart it
without -- nope, going to have to do that.
>> Video Playing: Let's see an example using a synthesizer database with
10,000 randomly generated sounds.
>> Bryan Pardo: We'll do it that way [making sounds]. So it searches.
Here's his initial example that he provided. It's close but not that close
he thinks [making noises]. And so now he's done that and now he wants to
re-run the search. You have to re-run the search now that you've done this
to give it an idea what are the important features [making sounds]. And he
found his sound.
Now -- oh, let's get past this. Okay. So how does this thing work? You
have some examples you could load up, whether it's your vocalization or
something prerecorded. Features are extracted. You rate some results.
Based on this, it's going to update its features. And we're going to have
this sort of loop through the dataset each time you're updating what's
closer, what's further. And it's learning the distance metric. Basically
it's learning the importance of various dimensions until presumably we find
something close enough.
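A rough sketch of that loop, with a crude reweighting rule standing in for SynthAssist's actual distance-metric learning (all names and numbers below are illustrative): features on which the highly rated candidates sit close to the query get more weight on the next pass.

```python
import numpy as np

def rerank(query_dists, ratings=None):
    """query_dists: (n_candidates, n_features) per-feature distances to the query.
    ratings: optional (n_candidates,) relevance feedback in [0, 1]; higher = more similar.
    Returns candidate indices sorted from closest to farthest under weighted distance."""
    n_feat = query_dists.shape[1]
    weights = np.ones(n_feat) / n_feat                     # start with equal weights
    if ratings is not None:
        # Crude reweighting: trust the features on which relevant candidates are close.
        relevant = query_dists[ratings > 0.5]
        if len(relevant):
            weights = 1.0 / (relevant.mean(axis=0) + 1e-6)
            weights /= weights.sum()
    scores = query_dists @ weights
    return np.argsort(scores)

rng = np.random.default_rng(3)
dists = rng.uniform(0, 1, size=(6, 4))                     # 6 candidates, 4 features
print(rerank(dists))                                       # first pass, equal weights
print(rerank(dists, ratings=np.array([1, 0, 0, 1, 0, 0]))) # after feedback, reweighted
```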
Now, you say to yourself this is similar to something like Youflik, which is
in some sense true. One of the things that's sort of innovative about what
we're doing is in the audio domain, well, there's a couple of differences.
In the audio domain, we work under certain constraints that they don't in the
visual domain, one of which is when you're comparing individual pictures,
it's very, very fast. Humans are just very good at very quickly taking in
pictures. Boom. You're done. You've already seen it.
With sound, one of the issues we always have to deal with is the real-time
issue. We can't present you ten sounds at once. Like one of the things you
can do in a lot of visual comparison tasks is we present you ten pictures at
once and say find the one that sticks out and you can just grab it. We can
grasp so much more at once. And anything time based you have to listen to
it.
The other thing is in a lot of prior work when you're talking about what
features are in here, a lot of times there's been this tendency to use what's
called a bag of frames. And what I mean by bag of frames is that they don't
think too hard about the timing relationships, the sequence of features. And
there's a big difference to a human between [making noises]. But to a
bag-of-frames representation, these two things are identical.
A last thing I'll mention, relative versus absolute. Now, maybe you can
imagine that there's a sound that goes [making noise]. Maybe I can't whistle
so I go [making noise]. Obviously in absolute terms, these are very
different sounds. But they both have this characteristic along a certain
dimension that if you measure the change over time, they both have this it's
going like this, even though, absolute terms, if we grabbed the frequency,
the center frequency of what I was doing, these would not be the same.
So we know that the humans are going to produce some sound. The range of
features they have to work with are limited. We may remap onto -- we may
translate down. We may remap onto a different feature but still somehow get
this change. So when we started looking at things, we grabbed features which
were typical -- pitch, loudness, how harmonic is it, clarity, spectral
centroid, how peaky is it, et cetera.
And we're actually using dynamic time warping distance calculated on each of 14
time series independently. And by that we mean both the absolute and the
relative versions of them. And for those of you don't know what dynamic time
warping is, dynamic time warping is an algorithm that is used to take
sequence one and sequence two and find the mapping between them, the time
stretching between them that would make them best aligned.
So we do a time -- dynamic time warping between two sequences in each of the
different dimensions and then from this we get a similarity measure in each
of the different dimensions, both the absolute and the relative versions of
them. Then we give you some top candidates to show you. You rate the top
candidates along each of the dimensions and we learn whether it's the
absolute pitch that matters or maybe it's the relative pitch contour that
matters.
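A minimal sketch of that comparison: a textbook dynamic time warping distance applied to an absolute pitch track and to its first difference (the "relative" version), with made-up pitch tracks. The real system uses more features and a more careful weighting.

```python
import numpy as np

def dtw(a, b):
    """Plain dynamic-time-warping distance between two 1-D feature tracks."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A rising whistle and a rising hum: very different absolute pitch,
# similar pitch contour. The values are invented for illustration.
whistle = np.linspace(2000, 3000, 20)       # Hz
hum     = np.linspace(200, 300, 25)         # Hz

print("absolute pitch DTW:", round(dtw(whistle, hum), 1))                    # large
print("relative pitch DTW:", round(dtw(np.diff(whistle), np.diff(hum)), 1))  # far smaller
```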
And as I said, initially we weighed all dimensions equally. And as we move
along, we change the weightings. So here's a preliminary experiment. We're
going to be doing a bigger one a little later. This is the underlying
interface for a synthesizer. And the question is can a user find a specified
sound faster with SynthAssist, the interface we just showed you, or a
traditional synthesizer interface. In fact, the interface that was used to
make the sound in the first place.
And the task was as follows. Here's a sound. Play it as much as you want.
Once you start, you have five minutes to match the sound as closely as you
can with this interface. And then we switch interfaces. Here's a new sound.
Listen to it as much as you want, then you've got five minutes to
try and match the sound.
>> Ivan Tashev: How do you measure the progress?
>> Bryan Pardo: Excellent question. This, in this preliminary thing, it was
actually straightforward. The user is actually rating themselves how close
they think it is. Now, why are we doing that? Well, if we used an absolute
measure, the problem is if we already knew the right absolute measure to use,
we wouldn't need to go through this whole weighting procedure. We would know
exactly which was the closest thing, and SynthAssist would just hand it to
them. So we turned back to the users and said, okay, you hear the sound, you
hear the result of what you're doing, you tell us how close it is.
Now, what are we looking at here? Each red -- okay, so there were three
trials per user. So this is just -- this is very preliminary, just so you
know. Three trials per user. Here's the red line is the experienced user;
the blue line is the novice. Every minute we ask them to rate how well
they're doing, how close have they gotten.
And then what you see here are on -- well, actually, I said three, but
there's clearly four dots here. So, hmm, maybe there was four trials. I'll
have to go back and double-check. In any event, yeah, there's four here as
well. So let me back up. Four trials per user.
The important thing here is to notice this. Down here is the traditional
interface. And by experienced user, we went out and sought someone who was
used to programming synthesizers with the knobs. And by novice, we went and
found a musician who was an acoustic musician and not programming
synthesizers for their living. And we asked them to rate their progress as
they went. And the reason we don't have dots at the end for many of the
trials is because the user gave up. Up here, so take a look at these
lines --
>> Ivan Tashev: [inaudible] on top of each other, this is the reason why you
have three or four --
>> Bryan Pardo: Yeah, this would be the reason for the experienced one. But
I know for a fact that the novice gave up on at least one of these examples
and I think maybe two at the end.
>> Ivan Tashev: And you quoted.
>> Bryan Pardo: Sorry?
>> Ivan Tashev: You quoted --
>> Bryan Pardo: I quoted him earlier, yeah. Again, this is very preliminary
and this is me waving my hands in the air and going believe me, from just
looking at two users, but we really feel like there is something here; that
if you've got -- there's the blue line for the novice and the red line for
the experienced. This is our SynthAssist interface, how close they thought
they were getting over time. This is the traditional interface.
And basically what you're seeing here is that, on average, the experienced
person went from, oh, on a Likert 7-point scale up to, you know, what is
that, 4, 5 1/2-ish. And up here, you know, 1, 2, 3, 4, 5 1/2 -- the
experienced one, uh, we can't really talk about statistics, but they weren't
doing a whole lot worse with SynthAssist, but the novice suddenly went from I
give up or I'm doing terribly to I got somewhere.
>> Ivan Tashev: Would you say that the novices are less critical to the
results of the experienced users?
>> Bryan Pardo: Well, I don't know. Down here they were very critical of
their own results. And, again, so few a data point -- this is really me.
This is something that -- you know, we ran this in the lab. We don't -- I'm
not talking like -- all the other results I've been showing you are we have
500 users, we had 75 users, we had big -- for psychological experiments, big
numbers of users. This is a couple of users. So this is me basically just
trying to convince you that we think we're on to something, but we don't
actually have the data yet to back that up.
But what I will say is we're trying to do something a little bit different,
and this is the punch line I want you to think about when you walk out of
here. We're not rethinking sound interfaces. We're rethinking -- we're
teaching interfaces to rethink themselves. With the first interface, the
evaluative interface, social EQ, or iQ as it started out, all we do is we
hand you examples and have you rate them. The result comes out like the
result comes out.
The second one we went to the Web and we asked lots of people to give us
examples. We asked lots of people to do ratings. We collected the
vocabulary from them without trying to constrain the vocabulary or see the
vocabulary. We got back the vocabulary and made Audealize so we could use
natural language vocabulary.
In fact, Audealize is running up on the Web now. And if you want to teach it
words, if enough people teach it the same word, it runs each night, processes
and thinks about what it's learned from user interactions. And if a dozen of
you got on and decided to collude and teach us the word fizzjibit, but you'd
have to give it the same audio concept, if you manage to do that, fizzjibit
will end up on the word map.
And then SynthAssist does the thing I was just describing. Again, this is --
this combines two ideas, which are that the user provides an example and then
there's an interactive evaluative paradigm that goes with it. My student
Mark had a -- we had a philosophical disagreement about whether using words
would be a good idea, and so Mark leaned towards a word-free interface and I
sort of pushed towards a word-heavy interface. And, again, you know, I think
of this as two alternatives. One could have gone the other route with
SynthAssist and done like we did with the other stuff and had lots of people
name things to give them meaningful names. Here he wanted to say some things
maybe we don't have a good language to describe, there's nothing even
remotely universal about [making sound], but maybe you'd want to make that
sound. And so he thought examples in evaluation was a better way to go with
this. Which I guess I agree with.
So if the stuff you think you saw -- the stuff you think you saw. If you
think the stuff you saw is interesting, then please go check it out at
music.eecs.northwestern.edu.
I need to acknowledge the people who actually did all the hard work of
building stuff because, of course, I'm a university professor, I very lightly
touch the code, if at all, and it's the students who really do the heavy
lifting.
Zafar Rafii worked on having something learn to -- learn reverberation
effects from user interactions. Bongjun Kim is the person that sped up the
equalization learning from having to use 25 or 30 examples down to eight by
using transfer learning from prior users and active learning. Prem
Seetharaman is the one who came up with the Audealize word map. Mark
Cartwright did SynthAssist that you just heard about.
And, by the way, Mark's graduating this year, so if anyone out in the
audience is looking for a really excellent person with a Ph.D. who knows
about machine learning, audio processing --
>> Ivan Tashev: [inaudible] interaction [inaudible]?
>> Bryan Pardo: Right here. Andy Sabin was the one who did the initial iQ.
That was the first evaluative interface. And these two are both now working
in industry, and these three are still with me. But, of course, Mark is
looking for that job, so anyone on the Web, want to reach out to me and I'll
put you in touch with Mark.
That's the talk.
So thank you very much those of you who came.
[applause]
>> Ivan Tashev: We are not supposed to have direct interaction, but
questions please.
>>: I have a quick question. The -- through all of these there's a mapping
of perception to some parameters based on a very specific implementation of
EQ or reverb, or whatever it is. So you guys have done a bunch of learning
already of how perception maps to those parameters. Let's say that I were --
I were interested in this technology and I wanted to make a new product. Is
your idea that, hey, you'll have a -- you may have a very different
implementation, you have your own implementation of a reverb, is your idea
that you would be able to use the training you've already done, or is it
that, hey, here is the model and design of the interfaces, if you implement
it in this way, then you can do your own training with your own users?
>> Bryan Pardo: Right. That is an excellent question. So Jesse
Bowman -- can I just brag and say he's also a former student in my lab? -- asked a
very insightful question. So there's this idea of the control space, which
is the parameters that control a specific tool, like the reverberator that we
used, actually, ba-ba-ba, here's a little diagram describing the reverberator
that -- of the underlying reverberator that we used.
And so he's asking the question, well, there's a bunch of control parameters
for this reverberator and that's great and you learn that, but now I don't
want to use your reverberator, I want to use a different reverberator, maybe
one that's an impulse response-based reverberator. Do I have to start my
learning over from scratch?
And the answer is going to be that depends because there's two different ways
that we can do this thing. One way is we learn a mapping onto the control
feature space in which case, yes, you would have to learn it over. Another
thing that I didn't talk about, some work that Zafar Rafii did, was we also
did a version, which did not end up getting incorporated to Audealize, but we
can -- I can send a paper on it, where you use descriptive statistics on the
resulting audio, on the resulting impulse response function such as RT60 or
spectral centroid and blah, blah, blah.
And in this case if you learn the mapping between these descriptive
parameters of an impulse response function and the words, then you can
happily swap out any other reverberator as long as you know when you turn the
knobs this way it will end up with this kind of impulse response function.
So you would have to learn that mapping yourself, but then all of our user
learning could be dropped in with no problem.
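A sketch of two such descriptive statistics, RT60 (via Schroeder backward integration) and spectral centroid, computed directly from an impulse response; the exact descriptors and estimation details used in that work may differ.

```python
import numpy as np

def rt60(ir, sr, db_lo=-5.0, db_hi=-25.0):
    """Rough RT60 estimate from an impulse response via Schroeder backward
    integration, extrapolated from the -5 dB .. -25 dB stretch of the decay.
    A sketch of the idea, not the exact descriptor from the paper."""
    edc = np.cumsum(ir[::-1] ** 2)[::-1]                    # energy decay curve
    edc_db = 10 * np.log10(edc / edc[0] + 1e-12)
    t = np.arange(len(ir)) / sr
    mask = (edc_db <= db_lo) & (edc_db >= db_hi)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)         # dB per second
    return -60.0 / slope

def spectral_centroid(ir, sr):
    """Center of mass of the magnitude spectrum, in Hz."""
    mag = np.abs(np.fft.rfft(ir))
    freqs = np.fft.rfftfreq(len(ir), 1.0 / sr)
    return float((freqs * mag).sum() / mag.sum())

# Hypothetical exponentially decaying impulse response, just to exercise the code.
sr = 16000
t = np.arange(sr) / sr                                      # one second
ir = np.random.default_rng(4).normal(size=sr) * np.exp(-t / 0.2)
print("RT60 ~", round(rt60(ir, sr), 2), "s")
print("spectral centroid ~", round(spectral_centroid(ir, sr), 1), "Hz")
```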
So it depends how you did it. Now, the way we did the equalizer, we can do
that. Because we describe EQ curves and it's very straightforward to map
between an EQ curve and a parametric equalizer or graphical equalizer. With
the reverberator, we went straight to the control parameter space, which kind
of ties us to this reverberator. But of course you could do this again
mapped onto this feature space, and then you're free.
>>: And I was thinking with the synthesizer as well, you know, it's a single
oscillator amplitude envelope of -- what if I have six LFOs and --
>> Bryan Pardo: Ah, well, SynthAssist actually was not a mapping into the
feature space -- or, sorry, not a mapping into the control space [making
noise]. It was a mapping into a feature space that described the output.
These are the features that we were using, these seven plus the absolute
version of them, absolute pitch and relative pitch, as in change of pitch
over time and inharmonicity, change of inharmonicity over time.
So SynthAssist is one where you could, in fact, swap out the underlying
synthesizer with a sampler or a granular synthesizer or anything you want.
>>: How [inaudible] SynthAssist in the example it was what you were
searching for, like a very short sound and it's time bound. So could a
similar technique be used to be able to select something maybe distinguished
between a string sound and a trumpet sound? If I have a sample and I want to
give it an example, what kind of instrument I want to find, and I use my
voice to give an example, and could it help me find that?
>> Bryan Pardo: Well, this is kind of an interesting question. What we're
trying to do is make something that could go ideally either way, ideally what
we would hope is this thing would let you find either something about the
underlying timbre or -- and this is somewhere we were a little more
focused -- the overall shape of the note.
So maybe you could imagine a string going [making noise] or a trumpet doing a
fall off [making noise], and we would like it if what would happen -- this is
our ideal -- is that when I go [making noise], let's say that that was a
great imitation of a trumpet falling off, then what we would want is that top
set of examples, one of them might be the string sound that also did the
falling off and the other would be a trumpet, and then in that first round of
ratings you would tell it, well, what matters, was it the underlying timbre
or was it more this sort of pitch contour? And so our hope is that this
technology when appropriately used will allow you to do either kind of
search.
But our medium to long term with this is now we talked about evaluative
interfaces, we've talked about examples, and we've talked about words. What
we haven't shown you is an interface that incorporates all of them. And
we're hoping that the search for this evaluative interface could be
constrained. If I said it's a string sound that kind of goes [making noise],
then it goes, great, I'm going to search in my set of strings. I'm going to
give you things that have some sort of -- well, he sang kind of low, so we'll
give him low strings. He had sort of a falling away; maybe we'll give him a
couple of violins that are high but they kind of have this -- you know, they
did this. And the overall thing will allow you to use all of the tools that
you would use to describe it to another person if you didn't happen to have a
violin in your hand. That's our goal. I'm not claiming we've integrated
them all yet, but that's what we're hoping for.
>> Ivan Tashev: How much can we [inaudible] the approach with the words and
descriptions to other areas which are also subjective and [inaudible] emotion,
sound quality, spatiality, stuff like this. We have our experience with
[inaudible], really painful. I doubt the colloquial -- normal people here are
using slightly different meanings than what we have clearly described in the
literature, but the data were enormously noisy. What do you think, is it
better to let them use their own words, the judges, and then try to map to
the scientific meaning, or is it a corpus thing?
>> Bryan Pardo: Well, I think we all know this about language.
Colloquial -- there's a reason that math was invented. There's a reason that
specific technical terms were invented. It's because we don't all
necessarily exactly agree. When I say the word love I know just what I mean,
but you probably have a slightly different idea of what love is.
My hope would be the following. I haven't shown you a thing, but I will show
it to you now. Where is it? Ah. This is -- we did multidimensional
scaling. Do not worry about what these dimensions mean. We took
higher dimensional space and shoved it in a lower dimensional one. What I
want you to see is this: everywhere you see the word underwater in this
feature space, somebody labeled that reverberation setting with underwater at
least once. The bigger the word is, the more often that reverberation setting
was labeled with underwater.
The reason I say this, the reason I'm showing this to you is this is what,
from our data, a tight distribution looks like. This is a word we can count
on more or less, more than average.
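A minimal sketch of the kind of plot being described, assuming scikit-learn's
MDS and matplotlib; the data layout (a feature matrix plus per-setting label
counts) is an assumption, not the study's actual pipeline.

```python
import numpy as np
from sklearn.manifold import MDS
import matplotlib.pyplot as plt

def word_map(features, counts, word):
    """features: (n_settings, n_dims) reverberation-setting features;
    counts[i]: how often setting i was labeled with `word` (0 if never)."""
    xy = MDS(n_components=2, random_state=0).fit_transform(features)
    plt.scatter(xy[:, 0], xy[:, 1], s=0)        # invisible; just sets the axes
    for (x, y), c in zip(xy, counts):
        if c > 0:
            plt.text(x, y, word, fontsize=6 + 2 * c)  # bigger = labeled more often
    plt.title('Settings labeled "%s"' % word)
    plt.show()
```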
This is the word warm done the same way. Again, these were reverberation
settings. And what happens is we play a reverberation setting to someone.
For this test, what we did was we played a reverberation setting and we said
what word would you use to describe that. Someone says warm. And then we
took all of the events where a reverberation setting was labeled warm. Each
point is a reverberation setting. Again, size is how many people label that
setting warm.
But what you can see about this distribution compared to that one: I could,
sure, take the average of these, and I would get a point here, which is of
course a place that nobody, nobody would call warm. What this says to me is
that there might be multiple regions, in each of which there is some group of
people who think, yeah, that's what I mean when I say warm.
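One simple way to quantify how tight or spread out a word's distribution is,
purely as an illustration, is a count-weighted mean distance from the
weighted centroid; the measure below is my assumption, not one used in the
work.

```python
import numpy as np

def spread(points, counts):
    """points: (n, 2) embedded locations of settings labeled with the word;
    counts: (n,) how many people applied the word to each setting."""
    w = counts / counts.sum()
    centroid = (points * w[:, None]).sum(axis=0)
    return float((np.linalg.norm(points - centroid, axis=1) * w).sum())

# Expectation from the two slides: spread(underwater_pts, underwater_counts)
# would come out much smaller than spread(warm_pts, warm_counts).
```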
Our current interface does not -- we decided for the interface we were
building for this that we would sort of hide that complexity. We didn't want
to give them here's a warm and here's a warm and here's a warm. We could
have. Instead we decided to kind of go with the most frequent one and say,
ah, we're going to have to pick.
Now, knowing the shape of these distributions gives us some possibilities.
If you have enough data from enough people, we can answer questions for
particular words. Is this a technical term that most people seem to agree
with the experts on? Is this a term that generates the most -- we'll call it
disagreement, but maybe more generality? For the reverberation problem, every
single reverberation effect got labeled by at least one person with the word
echo.
>> Ivan Tashev: [inaudible].
[laughter]
>> Bryan Pardo: Which makes echo a useless word for describing
reverberation. Why? Because by definition all reverberations are echoes.
If you had seen the echo distribution, it's everywhere. But now what we have
is we can do the following. We can have -- if you gave me enough money to
hire enough experts and poll them, we could get their definitions for words.
We can crowdsource our definitions and ask how much overlap there is and then
do something.
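As an illustration of why a word like echo is uninformative here, a simple
IDF-style score goes to zero when a word covers every setting. This
particular score is an assumption, not a measure from the talk.

```python
import math

def usefulness(n_settings_with_word, n_settings_total):
    # Fraction of all settings that received this word at least once.
    coverage = n_settings_with_word / n_settings_total
    return math.log(1.0 / coverage)   # 0.0 when the word covers everything

# "echo" was applied to every reverberation setting, so usefulness(n, n) == 0.0
```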
But let me -- since you've asked these questions, I've saved a couple of
slides. No, no, no. Now, for the moment, please grant me that the person
who wrote, say, Adobe Audition, or one of the people who wrote Adobe
Audition -- this is why I was hoping a certain somebody would be here --
>> Ivan Tashev: [inaudible]
>> Bryan Pardo: Let's grant that they're experts and that the words they use
in their drop-down menus are the words experts would use -- and we can talk
about why that's probably not true. What we did is kind of an interesting
thing. Here's the vocabulary. The size of a circle correlates with the size
of the vocabulary. In Ableton Live, their EQ had this many drop-down menu
preset labels. Audition had that many. Audealize, the one we learned from
the crowd -- we decided we trusted 365 words. Actually we trusted more than
365. It's 365 plus 18 plus 5 plus 4.
Okay. These are the overlapping regions.
And so from this, what do we gather? It turns out between the preset list in
this and the preset list in that and the crowd, there are exactly five words
that overlapped amongst all three. Wow. Here is one of them. Clear. And
this is the preset that got labeled with clear for Audition. This is the
preset that got -- or, sorry, for Ableton. This is the Audition one. And
here's the one we learned from the crowd. The blue is the one we learned
from the crowd.
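The overlap counts could be computed with plain set intersections; the word
lists below are placeholders, not the real preset vocabularies.

```python
# Placeholder vocabularies; the real lists come from the two preset menus and
# from the crowdsourced Audealize data.
ableton = {"clear", "warm", "bright", "boomy"}
audition = {"clear", "warm", "vocal", "presence"}
audealize = {"clear", "warm", "underwater", "tinny"}

shared_by_all_three = ableton & audition & audealize  # five words in the real data
crowd_only = audealize - ableton - audition
print(sorted(shared_by_all_three), sorted(crowd_only))
```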
>> Ivan Tashev: [inaudible] how he came up with that.
>> Bryan Pardo: Well, at this point, who knows. Maybe his presets were
replaced by somebody else's.
>>: Yeah.
>> Bryan Pardo: And between these two I would say the crowd pretty much
agrees with the Ableton one. Like these have qualitatively similar shapes in
pretty strong disagreement with the Adobe Audition one.
So here's an example, what maybe the experts don't exactly agree on what the
word clear, for example, means. And so it turns out to be a linguistic
problem all around.
And there was actually a study -- and you would expect this between England
and the United States. There was a study done on organ sounds where they
asked people -- oh, I'm forgetting the descriptive word now. I think it was
warm. They took audio engineers in New Jersey -- that's in the United
States -- and audio engineers in London -- and by that I mean London,
England, not London, Ontario -- and said make this organ sound warm. And so
they did EQ things. And there was broad agreement amongst the London audio
engineers and broad agreement amongst the New Jersey audio engineers. But if
you looked at the EQ curves between London and New Jersey, they were not the
same.
So this language stuff gets tough. Right? So there's, oh, on this
particular word the crowd agrees with Ableton; on another word, maybe they
agree with the definition in Audition; maybe the person who wrote Ableton or
the person who did this preset has a whole community that all agrees with
him; and maybe this person also has a whole community that agrees with him.
You're only going to know if you go out and gather enough data, which a
company like Microsoft can do more easily than I can.
You know, I can do this to get a thousand people, and then I kind of run out
of both research dollars and, you know, time and all that other stuff. And a
company can do something at a scale that I never could. So maybe one day you
will find this out and tell me which words are actually truly broadly agreed
upon by experts and by nonexperts. Or we could work together and you can
spend a few thousand dollars and find out together.
So I see it's almost 5:00. So I figure we have four minutes.
>> Ivan Tashev: Any more questions? We have more time, but we can also --
>>: One more quick thing. Have you considered the dependency of the
meanings of word like clear, for example, for equalizers on the sound you
apply it to?
>> Bryan Pardo: One of the things that I didn't talk to you about is how
certain words end up on the map and other words don't. Let's pick an obvious
pair. I have a tuba which plays really low notes and I have a piccolo which
plays really high notes. Now, if I ask people to evaluate -- what should I
do to the equalization of 100 hertz for a piccolo note? Well, there's no
energy at 100 hertz. So obviously their responses will only be meaningful in
the frequencies in which the piccolo is actually making noise. And if it's
not making noise at a frequency, whatever evaluations or things we learned
from people will be meaningless.
To get onto the map, we also considered the frequency bands which were
represented in the underlying audio upon which the people did their labeling.
So we had multiple audio files -- drums, voices, guitars, et cetera -- and
only words that held up across those sounds got to be a big word on the map.
We also took into account the underlying distribution of spectral energy
amongst the sounds. So if you're a big word like underwater or clear, that
means that we had a variety of sounds that covered the entire -- not the
entire, but a large part of the -- range of frequencies we expect someone to
have sounds in, and that the word was consistently used among them, and
that's what makes it on. There are a lot of details in the paper that I
have to sort of gloss over, but it's a very, very important point.
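A minimal sketch of that quality-control idea, assuming librosa for the
spectrogram; the band edges, threshold, and function name are assumptions
rather than the paper's actual method.

```python
import numpy as np
import librosa

def trusted_bands(path, band_edges_hz, rel_threshold=0.01, sr=22050):
    """Return, for each band, whether the sound has enough energy there for
    people's EQ judgments about that band to mean anything."""
    y, sr = librosa.load(path, sr=sr)
    S = np.abs(librosa.stft(y)) ** 2            # power spectrogram
    freqs = librosa.fft_frequencies(sr=sr)      # bin center frequencies
    energy = S.mean(axis=1)                     # average power per bin
    total = energy.sum()
    ok = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        band_energy = energy[(freqs >= lo) & (freqs < hi)].sum()
        ok.append(band_energy > rel_threshold * total)
    return np.array(ok)   # e.g. False for the 100 Hz band of a piccolo note
```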
And, you know, we could do it another way, which is -- I said frequency.
What happens when someone hands me a sound they'd like to apply reverb to
that already has reverb on it? This is one we haven't been able to deal well
with. With the frequency one we can obviously look at the frequencies and
say there's no energy here, but when we start to allow people, as we do now
with Audealize, to upload their own sound and teach it a word, we can play
these games of quality control with equalization, but we don't really know a
good way of telling that there's reverb already on the sound or -- we're also
starting to work in compression -- that the sound is already compressed, and
that will obviously change a lot of things about how they perceive when we
add more reverb.
>>: You are able to tell whether or not people agree on a word, right, and
then you know if the word is generally agreed upon, what kind of EQ, for
example, belongs to the word. So it would be interesting to also be able to
tell whether or not the word is independent, more independent of the source
sound or if it's specific to certain --
>> Bryan Pardo: Yeah, okay, I see what you're saying. We have not looked --
>>: [inaudible]
>> Bryan Pardo: We have not looked for words that seem to only be used to
describe, say, animal cries versus machine sounds or something like this.
All of the data that I've described was collected on isolated instrumental
and vocal sounds recorded in a studio that were dry. That means no reverb
added. And this is true for both equalization and reverb.
We do have datasets at -- we do have some datasets available on the Web. If
you wanted to download some of this data and have a look at it and see if you
saw any correlations, I would be very happy for you to do so and get back to
me. And that would be -- that would be excellent. So let me know.
>> Ivan Tashev: Let's thank our speaker again. Thank you, Bryan.
[applause]