>> Ivan Tashev: Good afternoon, everyone, those who are in the room and those who are watching us remotely. It's my pleasure to have professor Bryan Pardo here. He is the head of Northwestern University Interactive Audio Lab, and his talk will be about crowdsourcing audio production interfaces. Without further ado, Bryan, you have the floor. >> Bryan Pardo: Okay. Thank you. So here we are, crowdsourcing audio production interfaces. And let me just tell you about where I work. I work in the Interactive Audio Lab. And the Interactive Audio Lab is at Northwestern University. And Northwestern University is in Chicago, Illinois. And I felt required to put in the next slide, given that I am in the Pacific Northwest, because some of you here at Microsoft may be wondering how is it that Northwestern University is in Chicago instead of in Seattle. And so for historical reasons Northwestern is called Northwestern. Specifically, back in the 1790s there was a territory called the Northwest Territory, long before Seattle was even -- this area was even part of the United States, and Northwestern University was named after the Northwest Territory. So we had the name before you guys did, and we're not letting it go without a fight. But you're not here to get a history lesson on American geography. You're here to find out about what I'm doing and what goes on in my lab. So I work in computer audition, and computer audition in particular is something that combines a lot of stuff. And some of the things that it combines are crowdsourcing, linguistics, music cognition, or at least the way I do it, and we also bring in machine learning, information retrieval, and signal processing. And we take all of these things together and apply them to problems in computer audition. And you'll see a lot of that today. So I want to start out with a question, which is: How do we humans communicate ideas about sound? And I'm going to propose to you that a composer named Mark Applebaum really kind of nailed it in this piece that he wrote called Pre-Composition. And this is a sort of taped piece about his process of making a piece of sound art. So let me play this for you. >> Audio Playing: Let's get back to this -- to where we were. So we got this -- there's this like nasal pulsing thing. Does anyone remember that [making noise]? And then suddenly there's going to be like this kind of wind sound, like [making noise]. And then, I don't know, what do you think should happen next then? It could get like distorted and static-y, like [making noise]. Oh, yeah, I like that. Oh, yeah, and then we could like send it around like -- >> Bryan Pardo: Okay. So what was that? That was actually a pretty apt kind of description or snippet of the kinds of conversations -- now, here he had the conversation with himself, essentially, but the kind of conversations that happen between artists and producers when they are making music. And some of the things I want to point out here are the approaches that they were using to communicate, or he was using to communicate in his own head. One is a use of natural language terms such as distorted or static-y. Another is examples, in this case made with his mouth [making noise]. And finally evaluative feedback, like, oh, yeah, I like that. At no point did you hear people talk about, say, poles and zeros of filters, about constant Q, about decibels. At no point did anybody use those terms. And that is typical when people are discussing their artistic creative process when dealing with sound. 
Now I'm going to give you an example task, which is a real task. So here we go. I have a recording made in the 21st century. It sounds like this. [music playing]. Now, this is specifically in an older style of American music, and I would like it to sound tinny, like it's coming out of an old-style radio, something more like this [music playing], as opposed to the original [music playing]. So how do I do this? Now, of course, your first answer might be why don't you just play it out of a 1940s radio and rerecord it? And that is an excellent solution, but not necessarily one that I can implement if it is, say, midnight and I have inspiration in my home and I don't happen to own a 1940s radio. But what I do have is Pro Tools, the industry-standard tool for mixing, mastering, editing your software -- or editing your audio, I should say. So great. Let's take a look at a Pro Tools session. What you see on the screen is a typical Pro Tools session, and up on the screen right now is the tool that I would use to make it sound tinny. And my first question is: Is it immediately obvious to you which tool that is? Have a look. Think quietly to yourself which one you would pick. >> Ivan Tashev: The question is [inaudible]? >> Bryan Pardo: Okay. I'm going to make it easier on you. That's the tool. Now I'm going to blow it up for you. This is a parametric equalizer, and a parametric equalizer is not necessarily something labeled with words like tinny or static-y or distorted or warm. It's got a bunch of things that help you control gain, center frequencies, maybe width of filters. And then there's something over here that if you're not someone steeped in audio you may not know what this display is all about. This is actually frequency and this is boost and cut. And if you think about this as a search problem in an N dimensional space, you have N knobs. Somewhere in here is a region that we would call tinny. All you have to do is turn the knobs until you discover the region. Look at how many knobs there are and ask yourself if there were even 10 knobs with 10 settings each, do the numbers, that's 10 to the 10 combinations you would need to try. But there's more than 10 knobs here. Now, this interface bears some thought: Why is it like this? And the answer is actually a cultural one. Before we had software tools to do this, we had hardware tools. And as you can see, this is really an emulation of a hardware tool with -- this is kind of a new feature. But in general they are trying to stick to an interface that would have made a 1970s music producer happy, someone used to the knobs and switches that were on their analog equipment in 1972. And when the transition to digital happened, these plug-ins, these audio plug-ins were expensive. They might be 500 or $1,000. So their only market was going to be people who were already professional engineers, who already understood this interface. So when the time came, they duplicated the interface so that there was seamless transition from their point of view. And that made sense for that generation of people. But it is not 1979. It is not even 2009. It is 2016. And why are we continuing to emulate this hardware? The hardware was built this way because of the constraints that they were dealing with when building physical hardware. But we no longer have these constraints. So what is the first thing that people did to make this easier to use? Well, presets. So there'd be a drop-down menu that you could click on and see presets that are going to solve your problem. 
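To make that knob-search arithmetic concrete, here is a tiny Python sketch. The knob and band counts are illustrative assumptions, not taken from any particular plug-in; the point is only how quickly the grid of settings explodes.

# A sketch of the search-space arithmetic above; the knob counts are
# illustrative, not taken from any particular plug-in.
settings_per_knob = 10

# The talk's back-of-the-envelope case: 10 knobs with 10 settings each.
print(f"{settings_per_knob ** 10:,} combinations")        # 10,000,000,000

# A hypothetical 4-band parametric EQ with roughly 3 knobs per band
# (center frequency, boost/cut, bandwidth) is even worse.
knobs = 4 * 3
print(f"{settings_per_knob ** knobs:,} combinations")      # 1,000,000,000,000

Preset drop-down menus were the first attempt to shortcut that search.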
So, okay, great. Here, this isn't an equalizer now. This is a reverberation tool, something that makes echoes. This, I just went this morning and I opened up a standard reverberator and clicked on its drop-down list of presets, and some of these words make sense to me. Like maybe under the bridge, I might have some vague idea of what an echo of under the bridge might be like. But what about memory space or corner verbation? And what is a bitter hallway? Presets have historically been selected by the tool builders with no actual thought about whether or not anyone other than the original maker of the tool has any idea what these words mean. And as the number -- this is something with a small number of presets. You can find tools that will have 50 or 100 presets with descriptive names like space laser Frisbee. And when you have that, you are back to random search through a space; this time a random search through presets. It's not just EQ, it's not just reverberators. Here's a traditional -here's a simplified traditional synthesizer controller. Again, lots of knobs. Now, they've given up the whole trying to emulate the exact look, but the same underlying thought: I'm going to have knobs that can correspond to features, and I'm going to set myself in a feature space. We tried some user experiments where we would generate a synthesizer sound using this interface, then we would hand the interface to a novice, a musician, but who wasn't one experienced with synthesizers, and say we promise you that this sound was generated with this synthesizer and this interface. All you have to do is set the knobs until you can get that sound again. Here's a typical response: I give up. It's as if you put me in the control room of an airplane and I can't take off. This has effects, this choice of interfaces has effects. Here's a longtime musician on trying to learn to produce music: I've been playing guitar for 30 years. I bought the interface software. I'm 48, work as a carpenter, and I'm just too tired all the time to learn this stuff. There is so much to learn at the same time. I don't know the terminology. I have given up for now. Sad, because I have lots of ideas. So for all of you who are watching this right now and think right to yourself I have no problem with these interfaces, I don't see what the big deal is, I bit the bullet and learned them, for every one of you, there are many other people who tried, became overwhelmed and just gave up. What we're going to work on, what I'm going to talk about today is how we've been using crowdsourcing and a new sort of way of thinking about the issue to provide an alternative set of interfaces that the kind of person who doesn't naturally think in these terms of these existing tools can use. I'm not trying to take away your existing knobs; I'm just trying to provide an alternative. You know, put another way, let's say the world was all pianos and I introduced the trombone. The trombone can do things that pianos can't do; pianos can do things trombones can't do. I'm not saying the piano has to go away; I'm just providing you an alternate kind of interface, a different way of dealing with the issue. Specifically, we want to build interfaces that support natural ways of communicating ideas, and we've defined natural as the way people talk to each other about it. Natural language terms, examples, evaluation. Things I gave you in that other example. The Mark Applebaum piece. Okay. So here's the first interface. We called it iQ. 
Now, I see at least two people, three, many people in the audience with glasses. And if you think about the interface for selecting the right settings for your glasses, you do not necessarily -- the traditional approach has not been you walk into an eyeglasses store and say I would like something minus 1.75 spherical diopters in my right eye minus 2.5 on my left eye with an axis of astigmatism of 75 degrees off the vertical. Many of you probably don't even know your exact prescription. What happens is you go to the eye doctor, and the eye doctor says, okay, what do you think of this? Is it better A or B? You are doing an evaluative paradigm. And you quickly get to the eyeglasses setting that you want. We decided to do this with equalization. And notice you don't have to know terms like diopter to get your glasses prescription. You just have to be able to say whether you like one thing more than another thing. So that's the idea. So how do we make it go? Well, simply put, we take a sound and we know that you're going to want to change it to -- you're going to want to reEQ it, you're going to want to raise some frequencies and lower some frequencies to make a sound more along the lines of what you want. Okay. So what we do is we play you different EQ curves. These are different EQ curves. Here's boost and cut. Here's high frequency to low frequency. We take the sound, we run it through an EQ curve. You're thinking of the effect you want, and then you rate it. How much do I like this? Is this pretty tinny or not? And so let me show you this in action. I use dark, and I'm going to teach it the word dark on a piano sound [music playing]. Maybe I'll use tinny just because tinny was the example I've been using as a running example. Never mind. Anyway, same piano sound, tinny. So this is the original sound. And I sincerely hope that this works. I'm downloading -- this is all a Web demo. This one is running in Flash; the next one will be node.js [music playing]. Here's the sound. Do I think it's tinny? I don't. Nah, I don't think that one's tinny either. It's closer to what I think of as tinny. So what I'm doing here is I'm -- as something is closer to my idea of tinny, I'm going over here and pulling it to the right, which is where it says very tinny, and if it's further, let's see what I get. I got an EQ curve [music playing]. Well, it's more of a bright. Now what have I done? This is the EQ curve I learned. And just for fun, we've got lots and lots of EQ curves other people have taught it. And over here I've got the five nearest words and the five farthest words. And so I can take a look and say somebody taught it aggravating or sharp or bright. So I'm in a certain space of words. And so this first -- and so let me -- let me say what's happening here. This first interface -- oh, we don't need to see that. This first interface was one where we wanted to see whether we could use evaluative interaction. And now I -- what I did was I just did it with eight. If I do it with 25 or 30, I get a very nice capturing of my concept. But then how did I do anything at all with eight? Well, the issue here, why -- and why am I just choosing to do eight? Well, when I went ahead to do this, even though we'd done some studies that showed that when you hand somebody that first interface with all the knobs, that they get lost and they never even finish. 
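Here is a minimal sketch of the evaluative loop just demonstrated. The estimator shown, a rating-weighted combination of the probe EQ curves, is a simplification for illustration and not necessarily the exact model inside iQ; the curves and ratings are made up.

# A minimal sketch of the evaluative loop: play the sound through several probe
# EQ curves, collect a rating for each ("how tinny was that?"), and combine the
# probes into one curve for the user's concept.  The rating-weighted average
# below is a simplification for illustration, not necessarily iQ's exact model.
import numpy as np

rng = np.random.default_rng(0)
n_probes, n_bands = 8, 40                 # 8 probe curves over 40 frequency bands

probe_curves = rng.normal(0, 6, size=(n_probes, n_bands))        # boost/cut in dB
ratings = np.array([-1.0, -0.5, 0.2, 1.0, 0.7, -0.8, 0.1, 0.9])  # user ratings in [-1, 1]

# Probes the user rated as "very tinny" pull the estimate toward themselves;
# probes rated as "not tinny" push it away.
learned_curve = (ratings @ probe_curves) / np.abs(ratings).sum()

print(learned_curve.round(1))             # the estimated "tinny" EQ curve, in dB per band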
But when we handed them something with evaluative interfaces where they had to answer 25, 30 questions before we gave them something, they became annoyed that we asked them 30 questions. Even though we could show that if we just handed them the original interface they'd never actually get to the end. So the problem was now we were solving the problem but we were still annoying people. How could we annoy people less? You ask too many questions. Well, one way is perhaps to use prior user data. Here are three EQ settings out of a much larger set that we would normally play to someone if we didn't have prior user data. Here are three people -- Sue, Jim, and Bob -- and each of them had some concept in their head that they were going for. And if we play -- when we train the system, if we play a manipulated audio example with a certain EQ curve and say, well, how warm was that, Sue? And she gives an answer. How dark was that, Jim? Mmm, Jim liked it; 1 is the strongest positive thing you can say. And maybe down here this other example, Bob, who is going for something called phat, really didn't like it. Now we have a space where these EQ curves could be looked at in different ways. They could be looked at in the space of how similar the curves themselves are. We could also look at them in the space of how similar different user ratings of the curves are, for certain curves may look a bit different but maybe effectively people think of them as, eh, roughly equivalent. So if we take these EQ curves and let's say I'm going to take the responses, Sue's evaluation of curve 2 and curve 3 where her goal was to make it sound warm, I could take her concept now and put it in a space of evaluation. So the concept is Sue's warm, and in the space of what she rated those two examples as, it's here. And these other ones could be placed in this space as well. So I've just done a transformation from which space I'm looking at things in. Before I had EQ curves; now I have concepts, user concepts. And now the EQ curve, all my audio data, has disappeared and now I just talk about the ratings of the EQ curves, these underlying things, which could be anything. Now, if I have some new concept I'm teaching the system, I can say, well, in this space of user ratings, prior user ratings, who am I closest to? And instead of having you evaluate lots and lots of EQ curves to try and figure out what exactly I should do to manipulate all the different frequencies, I ask you to evaluate a few EQ curves and then compare your answers to prior users' answers on those EQ curves, prior users who did answer all the questions, who did evaluate all of the different EQ settings. And then I can say, well, rather than just hand you something arbitrary or hand you something that required lots of answers, you answer a few questions, we figure out who you're most like, and we use a weighted average weighted by your distance to this new concept in this space. And instead of 25 questions maybe I could take it down to a much smaller number. And, in fact, in this case I showed you an example where we did it with eight questions. And I might want to ask, well, which eight questions do I ask? If I have a bunch of examples that people had to rate in the past before they figured out what they were doing, maybe what I should do is figure out what the most informative question is. 
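A sketch of the prior-user shortcut just described: place the new concept in the space of ratings of shared probe curves, blend the prior users' learned curves weighted by closeness, and pick which few probes to ask about by how much prior raters disagreed on them (the selection heuristic the talk turns to next). All the data and sizes here are synthetic and illustrative.

# A sketch of using prior user data: the new listener rates only a few probe
# curves, we find which prior concepts rated those same probes similarly, and
# we blend the prior users' full EQ curves weighted by that similarity.
import numpy as np

rng = np.random.default_rng(1)
n_probes, n_bands, n_prior = 40, 40, 200

prior_ratings = rng.uniform(-1, 1, size=(n_prior, n_probes))  # every prior concept rated every probe
prior_curves  = rng.normal(0, 6, size=(n_prior, n_bands))     # the EQ curve learned for each prior concept

# Which probes to ask the new user about first: the ones prior raters disagreed
# on most (largest rating variance), since a probe everyone rates the same says
# little about who the new user resembles.
ask_order = np.argsort(prior_ratings.var(axis=0))[::-1]
asked = ask_order[:8]                                         # the eight questions actually asked

new_ratings = rng.uniform(-1, 1, size=len(asked))             # the new user's eight answers

# Distance from the new user to each prior concept, measured only on the
# probes that were actually asked.
dists = np.linalg.norm(prior_ratings[:, asked] - new_ratings, axis=1)
weights = 1.0 / (dists + 1e-6)                                # closer prior concepts count more
weights /= weights.sum()

estimated_curve = weights @ prior_curves                      # distance-weighted blend of prior curves
print(estimated_curve.round(1))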
And in this case we decided that the most informative question or the most informative EQ curve to ask would be the one that caused the largest disagreement amongst prior raters. And if you think about this, let's say that I was trying to -- let's pick something dumb. I was trying to determine your political party. And one question that I have on the questionnaire is do I like ice cream and the other one is do I like Obamacare. Now, probably most people will say they like ice cream and so it doesn't have a lot of discriminative power. But maybe do I like Obamacare would have a lot of discriminative power and I would want to use that question first. In this case, this example turned out to have the largest variance in user ratings. And so if I was going to have to pick one dimension along which to rate you or to judge you, I think I should ask this question first. And amongst these three I would say I'll ask this one, then this one, then that one. And the result is social EQ, which is the thing that I just demoed for you where instead of answering 25, 30 questions I answer about eight, and these questions are really designed to locate me in a space of prior users who have answered these questions before and then say, okay, we think you're most like a weighted combination of these individuals; let's give you something like that. And that was pretty good. And then we -- and now you might be asking how did we get this prior user data? And the answer is as it is for all crowdsourcing, Amazon Mechanical Turk. If it were about five years ago, I'd be telling you a story about how we were going to gamify all of this problem. But as many of you who work in crowdsourcing know, gamification -- it turns out those of us who are good researchers aren't necessarily good game designers. The problem of making a really great game that people will come back to and do over and over again is really different from the problem of doing good research in machine learning or interface design -- well, closer to interface design. And if you think you can get the answer by paying $1,000 to get it on Mechanical Turk, you should do that rather than spend a year designing a game to get as much data as you would have by paying $1,000 from Mechanical Turk. Which is what we did. So we got that prior set of people by going onto Mechanical Turk and having them teach us a word, in which case they were paid between $1 and $1.50, depending on how reliable they were, to teach us a word. And words were selected by the contributors. So we didn't tell them what word to choose. We said we just want to learn a word that's an adjective about sound. You pick the word and then show us what it means. And in this case we had them do 40 rated examples per word. And we had some inclusion criteria: Words existed in an English or Spanish dictionary -- one thing I should mention is we did this both in English and Spanish -- and that they self-report listening on good speakers in a quiet room and that they took longer than one minute to complete the task of rating 40 different examples in light of a particular goal. If you could rate 40 examples in less than a minute, we figured you were probably just after the money. And, finally, we wanted people with consistency greater than one standard deviation below the mean consistency. What do I mean by consistency? We actually had 15 repeated examples in our set of 40. 
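A sketch of that consistency screen, on synthetic ratings: score each contributor by how similarly they rated the two presentations of each repeated example, and drop anyone more than one standard deviation below the mean. The particular scoring function below is one reasonable choice, not necessarily the one used in the study.

# A sketch of the consistency screen on crowdsourced ratings: some of the 40
# presentations are repeats, so a contributor is scored by how similarly they
# rate the two presentations of the same EQ curve.  All data here is synthetic.
import numpy as np

rng = np.random.default_rng(2)
n_workers, n_repeats = 50, 15

first_pass  = rng.uniform(-1, 1, size=(n_workers, n_repeats))               # first rating of each repeated example
second_pass = first_pass + rng.normal(0, 0.3, size=(n_workers, n_repeats))  # noisy re-rating of the same example

# One simple consistency score: negative mean absolute difference between the
# two ratings of each repeated example (higher = more consistent).
consistency = -np.mean(np.abs(first_pass - second_pass), axis=1)

# Keep contributors no more than one standard deviation below mean consistency.
threshold = consistency.mean() - consistency.std()
keep = consistency > threshold
print(f"kept {keep.sum()} of {n_workers} contributors")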
Those repeated examples mean that if you heard an example and said, yeah, that's a 1, and then the next time you heard the same example said that's a negative 1, you're not being consistent. So we wanted to make sure that independent of the context in which it was heard, you would still say that EQ setting was warm or whatever word you were going for. After the inclusion criteria, we ended up with 932 English words, 676 Spanish words, of which 388 and 384 were unique. And here's a thing to mention, the learning curve. So what is this learning curve about? Here's the number of user ratings. And what do I mean by that? I have a sound that I would like to have EQ'd somehow. Now, the machine hands me an EQ curve and I rate it. That's one user rating. Based on this, it's going to try and figure out what my rating of the next curve will be. Okay? I rate the next one, and it sees how far off it was. Now I'm in a 2-space. I've got two ratings, and we can try and build a model of me based on prior users who answered questions like I did on two examples. We build a new model. We try and predict your answer to question three. That is machine-to-user correlation. The better the machine is at predicting your answer, the more correlated the user and the machine are. By the time we get to about 25, we can predict your answer, and we're kind of done. This is our training curve without prior data. This is our training curve in the end with prior data. And so we moved that learning curve up, which means that prior users are, in fact, helpful. And a thing I don't actually have on this slide is, of course, if we constrain it -- once we have enough data, if we constrain it to prior users who used a word like your word -- that is, if I was training tinny, we only use prior users whose word was tinny or a synonym drawn from WordNet like bright or sharp or something like that -- well, those aren't synonyms from WordNet, but we know they're synonyms. You can do even better. I mentioned English and Spanish, and one of the things that I think is interesting about this work is we're actually doing a translation problem. And we started out thinking of it as a translation problem between the control parameter space or knobs or feature space, which is sort of a perceptual thing, and perhaps a word space. I want to -- I'm leery about saying a semantic space because a lot of times engineers throw around the word semantic without knowing what it means, which is great because semantic is about meaning. But we were doing this translation. We started to think about, well, what could this translation be useful for. It initially started out as a translation between this complicated space of the tool and the user. It turned out this translation was a great one between users and audio engineers. And by users I now mean acoustic musicians and audio engineers, because we had this -- we were actually reviewed in Sound on Sound magazine by a famous audio engineer for the initial version of this tool who said that this solved a problem that audio engineers have been having since time immemorial, which is you get a customer, a client who comes in and says make my sound buttery, and the audio engineer says there's no buttery knob. I have no idea what you're talking about. You go through this teaching process, and we can discover whether buttery is in fact related to EQ at all or some other tool, and if it is related to EQ what the EQ curve would be. Then we got to thinking about translation in another sense, which is what's the right word, como se dice, a warm sound. 
How do you say a warm sound in Spanish? Would it be a sound that's calido? Is that the right term? And it occurred to us we have this really interesting possibility here. Rather than do the kind of machine translation that we're all used to, which is you have paired texts and you find statistical correlations, we're going to do a mapping between the perceptual space and the word space as follows. You have people train -- here's an example. Rate how warm it is and give a rating. Then you've got someone else in another language. Here's an example. Rate how calido it is. They give a rating. Examples where lots of people give high ratings to this word in English and that word in Spanish, that might be a good translation for that descriptive term. >> Ivan Tashev: What about if the -- because warm is abstract here, it's not temperature-wise warm. What about in Spanish for the same sound to use a different abstract -- >>: Exactly. That's a problem more than -- >> Bryan Pardo: Yeah, yeah. So what we do is this. We don't actually translate from the word to the word. We have words that get associated EQ curves. So I have a word in English and a word in Spanish, and each of them has an EQ curve that has been taught to it by the users. Then if I take an English word, chunky, which has an EQ curve associated with it, I go looking for the EQ curve from the Spanish users that looks most similar to the EQ curve for chunky, and I call that a likely translation. And, of course, these translations are different from paired texts, and at least for audio, maybe better descriptions. So putting this in more concrete terms, here we go. What this is now is an EQ curve but it's sort of also a probability distribution. And don't worry about the units of this. Basically this is boost and that's cut. Here's 0. And the brighter it is, statistically speaking, for the set of people, the more likely it is that this amount of boost or cut was what people would say it should have to teach it the word tinny. That's a spectral curve for tinny. Here's a spectral curve for light. Again, this is learned from six people in this case, but you can kind of see this, sort of. And one of the things we discovered is you might have a word that you wouldn't expect to be, for example, the antonym, the opposite of this. Here's hard. There's light. There's hard. Whoop. Here's warm. Here's profundo. It's not calido, it's more like deep. So now we've got this interesting thing. You can translate between nonexperts and experts using a tool like this. What is a buttery sound? What's a whatever sound? You can also translate between different linguistic groups like this. And this obviously isn't limited to audio per se. I mean, I'm an audio guy, so that's what I did. We could do this -- the same game could be played with color words, for example. Or maybe flavors. Having said all that -- we also -- once we did this, we thought, well, it's not just equalization. Of course, again, I'm a sound guy, so I just said color words and flavors, but I'm not interested in that. I'm interested in sound. So I went ahead and did it again for reverberation. And we went out and collected a lot of examples of reverberation. And so then we had this big dataset, prior dataset of reverberation effects and the words associated with them, equalization effects and the words associated with them. And we had this philosophy which was the less teaching a human has to do, the better. 
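Here is a sketch of the cross-language mapping described a moment ago: each word in each language carries a crowd-taught EQ curve, and a candidate translation for an English word is the Spanish word whose curve is nearest. The word lists and curves below are invented placeholders, not the real crowdsourced data.

# A sketch of the cross-language mapping: each word in each language has a
# crowd-taught EQ curve, and a candidate translation is the foreign word whose
# curve is most similar.  Words and curves here are invented placeholders.
import numpy as np

rng = np.random.default_rng(4)
n_bands = 40

english = {w: rng.normal(0, 6, n_bands) for w in ["tinny", "warm", "chunky"]}
spanish = {w: rng.normal(0, 6, n_bands) for w in ["brillante", "calido", "profundo"]}

def likely_translation(word, source, target):
    """Return the target-language word whose EQ curve is closest to `word`'s."""
    query = source[word]
    return min(target, key=lambda w: np.linalg.norm(target[w] - query))

print("chunky ->", likely_translation("chunky", english, spanish))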
Now, we'd taken our interfaces from the starting point, which was those hard-to-understand knobs -- hard to understand by someone who isn't an expert who already understands the signal processing or whatever -- to here's a teaching paradigm. Essentially, I evaluate -- it asks me questions. What about this? That's pretty good. What about that? Better. But that interaction, you know, we took it from maybe 25, 30 examples down to about eight, and that was still pretty good. But there's a reason that we use language as humans. It's a shortcut. Right? When I say the word dog, you all picture something in your heads. It's probably, you know, this tall, it's got four legs, its whole body is covered with hair, it's got a wet nose. Imagine if I had to teach you what a dog was every time we interacted. That would be very slow. So since we'd been collecting this vocabulary, why don't we use this vocabulary as a starting point and then only fine-tune if the word that we use isn't quite working out, which gets us to the interface for Audealize. What you have here is an interface for -- I'm not sure if you can see this on your tiny video screens for those of you who are watching at home, but EQ and reverb, equalization and reverb. We've learned vocabularies for both of them. >> Ivan Tashev: Actually, the other video is small on their screens but the slides are larger. >> Bryan Pardo: Oh, okay. Great. So now the EQ and reverb, we've got vocabularies for both. And we even have a reverberation setting and an EQ setting that somehow we have some confidence in for each one. Now we can take these words and place them in a space. In the case of reverberation, there's five control parameters that we're using here. So you can imagine each of these words is -- well, a sixth one is a gain, but we'll say five-dimensional -- you can imagine each of these words, with its reverb setting, living in a five-dimensional space, and we're projecting this down into 2D to provide you a word map. So what this means is that words that are closer to each other probably have reverb effects that are more similar sounding and words that are further from each other more different sounding. Now, your first thought might be, wait, aren't I just getting back to that preset list? I thought you said that presets were a problem. And they were a problem. They were a problem when the presets used a vocabulary that was arbitrary, defined by the tool builder, with no real thought about whether or not anyone else would understand what the words are. The preset lists were a problem when they were organized in some way that's not intuitive. Why were they in the order they were in? It's hard to know how to search that list. Here close things sound like each other; far things sound different from each other. So let us say -- oh, and the other thing is, these things are a social construct in some sense. Let me tell you what I mean. When I was describing this work, I was talking to the bass player who had gone on tour with Liz Phair, who is a Chicago -- she came up out of Chicago in the '90s, and she's a relatively well-known indie rock artist. And she was talking to her engineer. He remembered a particular conversation where she said I want you to make my guitar sound underwater. This was an electric guitar. 
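A sketch of how a word map like this can be laid out: every word has a learned reverberation setting (five control parameters, as in the talk), and those settings are projected down to 2D so that words with similar-sounding settings land near each other. Scikit-learn's MDS is used here as a stand-in for whatever projection Audealize actually uses, and the words and settings are synthetic.

# A sketch of the word-map layout: each word has a learned reverberation
# setting, and the settings are projected to 2D so that similar-sounding words
# land near each other.  MDS is a stand-in; the data is synthetic.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(5)
words = ["underwater", "cave", "boomy", "crisp", "church", "bright"]
reverb_settings = rng.uniform(0, 1, size=(len(words), 5))   # 5 reverb parameters per word

xy = MDS(n_components=2, random_state=0).fit_transform(reverb_settings)

for word, (x, y) in zip(words, xy):
    print(f"{word:>10s}: ({x:+.2f}, {y:+.2f})")   # 2D coordinate for the word map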
Now, if you think about that underwater request for just an instant: if I walked over to Lake Michigan and took my electric guitar and dropped it in the water, the sound that would happen would be a splash and maybe, if it was turned on, there would be a shorting out of something. Here's what underwater sounds like. Here's a guitar first [music playing]. Here's underwater [music playing]. Now, you can see this: down here it tells you, for each word, how many people it learned examples from -- in this case the word was learned from 95 contributors. >> Ivan Tashev: With underwater -- I'm not sure if -- for each word it tells [inaudible] labeled that reverberation as underwater? >> Bryan Pardo: Yes. So a certain reverberation setting was labeled as underwater by 95 people, and therefore it made its way to be one of the bigger words on our map. Or you could have, coming back to this [music playing], maybe that's too much and I want something that's like crisp or clean. Maybe I can't find the word on the map so I type it into this search box. Got to cave. This interface, if you compare exploring this interface to exploring that first interface with all the knobs on it and just, again, don't think like an engineer for a moment, think like you're an acoustic musician, you'd like to make some -- it's midnight, you're at home, you're playing stuff into GarageBand or whatever it is, and now you'd like to make it sound like I'm playing my guitar and then I plunge underwater, it would be nice to know what reverb effect would not just be underwater to me but give an impression, a general impression to an average person of underwater. This kind of gives you that. Gets you in the ballpark. But let's say that you don't exactly like what you heard. Maybe underwater is overdone. Whoop. Let's get that going again [music playing]. You could always go back to the traditional interface. We provide you the controls if you want them. You can search around in this space. You can go back to the controls in the old parameter-based space. Or if you want to go back to that evaluative thing, maybe there's a word like fitzleplink, which is not going to be on my map. It tells me to try teaching it, and then we drop back to the teaching interface that you saw earlier. So the idea here now is that we started from interfaces, coming way back, wherever it is, that look like this, right, the interface that comes -- ships with a standard equalizer, parametric equalizer, and we turned it into an interface that looks more like this. Here's the EQ tab, where you think of the word that would describe what you're after, and then you click on it. And if you want to see -- and you want to get an idea for generally what would happen if I went from here to over here, you very easily just drag across. If you want to get back to the old interface, it's down there, but you don't have to use it. And if you like just teaching it, there it is. So now we've managed to cover two of our approaches. So one of the approaches that I mentioned was an evaluative interface, and we got that. The next one was natural language using words that regular people know. We went out and asked lots of regular people on Mechanical Turk what word does that sound make you think of, and we got that answer and we created an interface like this. But there was one other kind of interface that I'd mentioned we'd use or we wanted to explore, which was the example-based one. Sometimes I say it's wind, and you go wind, like this? 
And we use our example, and it's like no. And then I try and go through the learning and no, it's still no, could you give me an example yourself? So Mark Cartwright, a doctoral student in my group, became very interested in that. And so he came up with something called SynthAssist. And let me tell you what the problem was. You might remember back earlier I said that we gave an example problem to some novices, which was here's the sound that a music synthesizer made. Here's the knob-based interface that someone used to program the synthesizer. All we want from you is to get that sound back. And this is a typical problem that you have with synthesizers if you're a musician. Maybe you -- this is something a friend of mine had to do a lot. She worked in a wedding band as a keyboard player. And so you've got your keyboard, and the bride has asked you to do the latest Katy Perry hit. So you need to make it sound like the Katy Perry. So you listen to the Katy Perry, hear that keyboard sound, and then you go looking for it on your interface. Oh, if only there would be something that would help me with this search. So Mark Cartwright went ahead and made a thing. And now this one I don't have a live demo for because it actually -- you have to hook it in with his synthesizer and stuff, but Mark was kind enough to make me a movie to show you. So let me show you this movie. >> Video Playing: SynthAssist is an audio production tool that allows users to interact with synthesizers in a more natural way. Instead of navigating the synthesizer space using knobs and sliders that control difficult-to-understand synthesizer parameters, users can search using examples like existing recordings or vocal imitations. They then listen and rate the machine's suggestions to give feedback during the interactive search process. Let's see an example using a synthesizer database with 10,000 randomly generated sounds [making sounds]. >> Bryan Pardo: Let me just pause it there. So what happened? That first sound was -- >> Ivan Tashev: What we have in mind. >> Bryan Pardo: What he has in mind. The second sound was the best he could do, imitated with his mouth. So we record that imitation -- not what he has in mind but the imitation, because you can't actually get what he has in mind -- give it to the system [making sounds]. Let me pause it for a moment and tell you what's happening here. The system has produced a bunch of examples. In each of these examples, along some dimension it's measuring, is a highly similar example. Now, what the user is doing is clicking on them. And if you click on them and drag it towards the center, towards the target, you're telling it that this example is in some way similar. If you right-click to delete, you're telling it it's completely off. So dragging it towards the center means, yeah, this is something good; away means something not so good; and click to delete means terrible. Whoop. I didn't mean to restart that. Is there a way for me to restart it without -- nope, going to have to do that. >> Video Playing: Let's see an example using a synthesizer database with 10,000 randomly generated sounds. >> Bryan Pardo: We'll do it that way [making sounds]. So it searches. Here's his initial example that he provided. It's close but not that close, he thinks [making noises]. And so now he's done that and now he wants to re-run the search. You have to re-run the search now that you've done this to give it an idea of what the important features are [making sounds]. And he found his sound. 
Now -- oh, let's get past this. Okay. So how does this thing work? You have some examples you could load up, whether it's your vocalization or something prerecorded. Features are extracted. You rate some results. Based on this, it's going to update its features. And we're going to have this sort of loop through the dataset each time you're updating what's closer, what's further. And it's learning the distance metric. Basically it's learning the importance of various dimensions until presumably we find something close enough. Now, you say to yourself this is similar to something like Youflik, which is in some sense true. One of the things that's sort of innovative about what we're doing is in the audio domain, well, there's a couple of differences. In the audio domain, we work under certain constraints that they don't in the visual domain, one of which is when you're comparing individual pictures, it's very, very fast. Humans are just very good at very quickly taking in pictures. Boom. You're done. You've already seen it. With sound, one of the issues we always have to deal with is the real-time issue. We can't present you ten sounds at once. Like one of the things you can do in a lot of visual comparison tasks is we present you ten pictures at once and say find the one that sticks out and you can just grab it. We can grasp so much more at once. And anything time-based, you have to listen to it. The other thing is in a lot of prior work when you're talking about what features are in here, a lot of times there's been this tendency to use what's called a bag of frames. And what I mean by bag of frames is that they don't think too hard about the timing relationships, the sequence of features. And there's a big difference to a human between [making noises]. But to a bag-of-frames representation, these two things are identical. A last thing I'll mention, relative versus absolute. Now, maybe you can imagine that there's a sound that goes [making noise]. Maybe I can't whistle so I go [making noise]. Obviously in absolute terms, these are very different sounds. But they both have this characteristic along a certain dimension that if you measure the change over time, they both have this contour -- it's going like this -- even though, in absolute terms, if we grabbed the frequency, the center frequency of what I was doing, these would not be the same. So we know that the humans are going to produce some sound. The range of features they have to work with is limited. We may remap onto -- we may translate down. We may remap onto a different feature but still somehow get this change. So when we started looking at things, we grabbed features which were typical -- pitch, loudness, how harmonic is it, clarity, spectral centroid, how peaky is it, et cetera. And we're actually using dynamic time warping distance calculated on each of 14 time series independently. And by that we mean both the absolute and the relative versions of them. And for those of you who don't know what dynamic time warping is, dynamic time warping is an algorithm that is used to take sequence one and sequence two and find the mapping between them, the time stretching between them that would make them best aligned. So we do a time -- dynamic time warping between two sequences in each of the different dimensions and then from this we get a similarity measure in each of the different dimensions, both the absolute and the relative versions of them. Then we give you some top candidates to show you. 
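Here is a sketch of that search loop: per-dimension dynamic time warping distances between the vocal imitation and each candidate, combined with per-dimension weights that are re-estimated from the user's ratings. The plain DTW, the feature matrices, and the weight update are simplified stand-ins for whatever SynthAssist actually does; all data is synthetic.

# A sketch of a SynthAssist-style search loop with synthetic data.
import numpy as np

def dtw(a, b):
    """Plain O(len(a)*len(b)) dynamic time warping distance between two 1-D series."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[len(a), len(b)]

rng = np.random.default_rng(6)
n_dims, n_frames, n_candidates = 14, 30, 50       # e.g. 7 features, absolute and relative versions

query = rng.normal(size=(n_dims, n_frames))                      # features of the vocal imitation
candidates = rng.normal(size=(n_candidates, n_dims, n_frames))   # features of synth patches

# Per-dimension DTW distance from the query to every candidate.
dists = np.array([[dtw(query[d], c[d]) for d in range(n_dims)] for c in candidates])

weights = np.ones(n_dims) / n_dims                # start by weighting all dimensions equally
scores = dists @ weights                          # weighted distance, lower = more similar
shown = np.argsort(scores)[:5]                    # top candidates to play for the user

ratings = rng.uniform(0, 1, size=len(shown))      # user drags good ones toward the target

# Crude weight update: trust a dimension more if the candidates the user rated
# highly were close to the query along that dimension, then re-run the search.
avg_dist = (ratings @ dists[shown]) / ratings.sum()
relevance = 1.0 / (1.0 + avg_dist)
weights = relevance / relevance.sum()
print(weights.round(3))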
You rate the top candidates along each of the dimensions and we learn whether it's the absolute pitch that matters or maybe it's the relative pitch contour that matters. And as I said, initially we weighed all dimensions equally. And as we move along, we change the weightings. So here's a preliminary experiment. We're going to be doing a bigger one a little later. This is the underlying interface for a synthesizer. And the question is can a user find a specified sound faster with SynthAssist, the interface we just showed you, or a traditional synthesizer interface. In fact, the interface that was used to make the sound in the first place. And the task was as follows. Here's a sound. Play it as much as you want. Once you start, you have five minutes to match the sound as closely as you can with this interface. And then we switch interfaces. Here's a new sound. Listen to it as much as you want, then you've got five minutes to try and match the sound. >> Ivan Tashev: How do you measure the progress? >> Bryan Pardo: Excellent question. This, in this preliminary thing, it was actually straightforward. The user is actually rating themselves how close they think it is. Now, why are we doing that? Well, if we used an absolute measure, the problem is if we already knew the right absolute measure to use, we wouldn't need to go through this whole weighting procedure. We would know exactly which was the closest thing, and SynthAssist would just hand it to them. So we turned back to the users and said, okay, you hear the sound, you hear the result of what you're doing, you tell us how close it is. Now, what are we looking at here? Each red -- okay, so there were three trials per user. So this is just -- this is very preliminary, just so you know. Three trials per user. Here, the red line is the experienced user; the blue line is the novice. Every minute we ask them to rate how well they're doing, how close have they gotten. And then what you see here are on -- well, actually, I said three, but there's clearly four dots here. So, hmm, maybe there were four trials. I'll have to go back and double-check. In any event, yeah, there's four here as well. So let me back up. Four trials per user. The important thing here is to notice this. Down here is the traditional interface. And by experienced user, we went out and sought someone who was used to programming synthesizers with the knobs. And by novice, we went and found a musician who was an acoustic musician and not programming synthesizers for their living. And we asked them to rate their progress as they went. And the reason we don't have dots at the end for many of the trials is because the user gave up. Up here, so take a look at these lines -- >> Ivan Tashev: [inaudible] on top of each other, this is the reason why you have three or four -- >> Bryan Pardo: Yeah, this would be the reason for the experienced one. But I know for a fact that the novice gave up on at least one of these examples and I think maybe two at the end. >> Ivan Tashev: And you quoted. >> Bryan Pardo: Sorry? >> Ivan Tashev: You quoted -- >> Bryan Pardo: I quoted him earlier, yeah. Again, this is very preliminary and this is me waving my hands in the air and going believe me, from just looking at two users, but we really feel like there is something here; that if you've got -- there's the blue line for the novice and the red line for the experienced. This is our SynthAssist interface, how close they thought they were getting over time. This is the traditional interface. 
And basically what you're seeing here is that on average the experienced person went from, oh, on a Likert 7-point scale up to, you know, what is that, 4, 5 1/2-ish. And up here, you know, 1, 2, 3, 4, 5 1/2 -- the experienced one, uh, we can't really talk about statistics, but they weren't doing a whole lot worse with SynthAssist, but the novice suddenly went from I give up or I'm doing terribly to I got somewhere. >> Ivan Tashev: Would you say that the novices are less critical of the results than the experienced users? >> Bryan Pardo: Well, I don't know. Down here they were very critical of their own results. And, again, so few data points -- this is really me. This is something that -- you know, we ran this in the lab. We don't -- I'm not talking like -- all the other results I've been showing you are we have 500 users, we had 75 users, we had big -- for psychological experiments, big numbers of users. This is a couple of users. So this is me basically just trying to convince you that we think we're on to something, but we don't actually have the data yet to back that up. But what I will say is we're trying to do something a little bit different, and this is the punch line I want you to think about when you walk out of here. We're not rethinking sound interfaces. We're rethinking -- we're teaching interfaces to rethink themselves. With the first interface, the evaluative interface, social EQ, or iQ as it started out, all we do is we hand you examples and have you rate them. The result comes out like the result comes out. The second one we went to the Web and we asked lots of people to give us examples. We asked lots of people to do ratings. We collected the vocabulary from them without trying to constrain the vocabulary or see the vocabulary. We got back the vocabulary and made Audealize so we could use natural language vocabulary. In fact, Audealize is running up on the Web now. And if you want to teach it words, if enough people teach it the same word, it runs each night, processes and thinks about what it's learned from user interactions. And if a dozen of you got on and decided to collude and teach us the word fizzjibit, but you'd have to give it the same audio concept, if you manage to do that, fizzjibit will end up on the word map. And then SynthAssist does the thing I was just describing. Again, this combines two ideas, which was the user provides an example and then there's an interactive evaluative paradigm that goes with it. My student Mark had a -- we had a philosophical disagreement about whether using words would be a good idea, and so Mark leaned towards a word-free interface and I sort of pushed towards a word-heavy interface. And, again, you know, I think of this as two alternatives. One could have gone the other route with SynthAssist and done like we did with the other stuff and had lots of people name things to give them meaningful names. Here he wanted to say some things maybe we don't have a good language to describe, there's nothing even remotely universal about [making sound], but maybe you'd want to make that sound. And so he thought examples and evaluation were a better way to go with this. Which I guess I agree with. So if you think the stuff you saw is interesting, then please go check it out at music.eecs.northwestern.edu. 
I need to acknowledge the people who actually did all the hard work of building stuff because, of course, I'm a university professor, I very lightly touch the code, if at all, and it's the students who really do the heavy lifting. Zafar Rafii worked on having something learn to -- learn reverberation effects from user interactions. Bongjun Kim is the person that sped up the equalization learning from having to use 25 or 30 examples down to eight by using transfer learning from prior users and active learning. Prem Seetharaman is the one who came up with the Audealize word map. Mark Cartwright did SynthAssist that you just heard about. And, by the way, Mark's graduating this year, so if anyone out in the audience is looking for a really excellent person with a Ph.D. who knows about machine learning, audio processing -- >> Ivan Tashev: [inaudible] interaction [inaudible]? >> Bryan Pardo: Right here. Andy Sabin was the one who did the initial iQ. That was the first evaluative interface. And these two are both now working in industry, and these three are still with me. But, of course, Mark is looking for that job, so anyone on the Web, want to reach out to me and I'll put you in touch with Mark. That's the talk. So thank you very much, those of you who came. [applause] >> Ivan Tashev: We are not supposed to have direct interaction, but questions please. >>: I have a quick question. The -- through all of these there's a mapping of perception to some parameters based on a very specific implementation of EQ or reverb, or whatever it is. So you guys have done a bunch of learning already of how perception maps to those parameters. Let's say that I were interested in this technology and I wanted to make a new product. Is your idea that, hey, you'll have a -- you may have a very different implementation, you have your own implementation of a reverb, is your idea that you would be able to use the training you've already done, or is it that, hey, here is the model and design of the interfaces, if you implement it in this way, then you can do your own training with your own users? >> Bryan Pardo: Right. That is an excellent question. So Jesse Bowman -- can I just brag and say he's also a former student in my lab? -- asked a very insightful question. So there's this idea of the control space, which is the parameters that control a specific tool, like the reverberator that we used, actually, ba-ba-ba, here's a little diagram describing the reverberator that -- of the underlying reverberator that we used. And so he's asking the question, well, there's a bunch of control parameters for this reverberator and that's great and you learn that, but now I don't want to use your reverberator, I want to use a different reverberator, maybe one that's an impulse response-based reverberator. Do I have to start my learning over from scratch? And the answer is going to be that depends because there's two different ways that we can do this thing. One way is we learn a mapping onto the control feature space in which case, yes, you would have to learn it over. Another thing that I didn't talk about, some work that Zafar Rafii did, was we also did a version, which did not end up getting incorporated into Audealize, but we can -- I can send a paper on it, where you use descriptive statistics on the resulting audio, on the resulting impulse response function such as RT60 or spectral centroid and blah, blah, blah. 
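A sketch of the kind of descriptive statistics just mentioned, computed straight from an impulse response so a word-to-reverb mapping need not be tied to one reverberator's knobs. The RT60 estimate is a simplified Schroeder-style backward integration, and the impulse response below is a synthetic decaying noise burst; neither is claimed to match the paper's exact method.

# Tool-independent descriptors computed from an impulse response, as a sketch.
import numpy as np

fs = 44100
t = np.arange(0, 2.0, 1 / fs)
rng = np.random.default_rng(7)
ir = rng.normal(size=t.size) * np.exp(-t / 0.4)         # fake impulse response, ~0.4 s amplitude decay

def rt60(ir, fs):
    """Rough RT60 via Schroeder backward integration of the squared IR."""
    energy = np.cumsum(ir[::-1] ** 2)[::-1]
    edc_db = 10 * np.log10(energy / energy[0])           # energy decay curve in dB
    # Fit the -5 dB to -25 dB portion of the decay and extrapolate to -60 dB.
    idx = np.where((edc_db <= -5) & (edc_db >= -25))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)
    return -60.0 / slope

def spectral_centroid(ir, fs):
    """Magnitude-weighted mean frequency of the impulse response."""
    mag = np.abs(np.fft.rfft(ir))
    freqs = np.fft.rfftfreq(ir.size, 1 / fs)
    return (freqs * mag).sum() / mag.sum()

print(f"RT60 ~ {rt60(ir, fs):.2f} s, centroid ~ {spectral_centroid(ir, fs):.0f} Hz")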
And in this case if you learn the mapping between these descriptive parameters of an impulse response function and the words, then you can happily swap out any other reverberator as long as you know when you turn the knobs this way it will end up with this kind of impulse response function. So you would have to learn that mapping yourself, but then all of our user learning could be dropped in with no problem. So it depends how you did it. Now, the way we did the equalizer, we can do that. Because we describe EQ curves and it's very straightforward to map between an EQ curve and a parametric equalizer or graphical equalizer. With the reverberator, we went straight to the control parameter space, which kind of ties us to this reverberator. But of course you could do this again mapped onto this feature space, and then you're free. >>: And I was thinking with the synthesizer as well, you know, it's a single oscillator amplitude envelope of -- what if I have six LFOs and -- >> Bryan Pardo: Ah, well, SynthAssist actually was not a mapping into the feature space -- or, sorry, not a mapping into the control space [making noise]. It was a mapping into a feature space that described the output. These are the features that we were using, these seven plus the relative versions of them -- absolute pitch and relative pitch, as in change of pitch over time; inharmonicity and change of inharmonicity over time. So SynthAssist is one where you could, in fact, swap out the underlying synthesizer with a sampler or a granular synthesizer or anything you want. >>: How [inaudible] SynthAssist -- in the example, what you were searching for was like a very short sound, and it's time-bound. So could a similar technique be used to select something, maybe distinguish between a string sound and a trumpet sound? If I have a sample and I want to give it an example of what kind of instrument I want to find, and I use my voice to give an example, could it help me find that? >> Bryan Pardo: Well, this is kind of an interesting question. What we're trying to do is make something that could go ideally either way. Ideally what we would hope is this thing would let you find either something about the underlying timbre or -- and this is somewhere we were a little more focused -- the overall shape of the note. So maybe you could imagine a string going [making noise] or a trumpet doing a fall off [making noise], and we would like it if what would happen -- this is our ideal -- is that when I go [making noise], let's say that that was a great imitation of a trumpet falling off, then what we would want is that top set of examples, one of them might be the string sound that also did the falling off and the other would be a trumpet, and then in that first round of ratings you would tell it, well, what matters, was it the underlying timbre or was it more this sort of pitch contour? And so our hope is that this technology when appropriately used will allow you to do either kind of search. But our medium-to-long-term goal with this is -- now, we talked about evaluative interfaces, we've talked about examples, and we've talked about words. What we haven't shown you is an interface that incorporates all of them. And we're hoping that the search in this evaluative interface could be constrained. If I said it's a string sound that kind of goes [making noise], then it goes, great, I'm going to search in my set of strings. I'm going to give you things that have some sort of -- well, he sang kind of low, so we'll give him low strings. 
He had sort of a falling away; maybe we'll give him a couple of violins that are high but they kind of have this -- you know, they did this. And the overall thing will allow you to use all of the tools that you would use to describe it to another person if you didn't happen to have a violin in your hand. That's our goal. I'm not claiming we've integrated them all yet, but that's what we're hoping for. >> Ivan Tashev: How much can we [inaudible] this approach with the words and descriptions to other areas which are also subjective, [inaudible] emotion, sound quality, spatiality, stuff like this? We have our experience with [inaudible] -- really painful. I doubt the colloquial -- normal people here are using slightly different meanings than what we have clearly described in the literature, but the data were enormously noisy. What do you think, is it better to let the judges use their own words and then try to map to the scientific meaning, or is it a corpus thing? >> Bryan Pardo: Well, I think we all know this about language. Colloquial -- there's a reason that math was invented. There's a reason that specific technical terms were invented. It's because we don't all necessarily exactly agree. When I say the word love I know just what I mean, but you probably have a slightly different idea of what love is. My hope would be the following. I haven't shown you a thing, but I will show it to you now. Where is it? Ah. This is -- we did multidimensional scaling. Do not worry about what these dimensions mean. We took a higher-dimensional space and shoved it into a lower-dimensional one. What I want you to see is this. Everywhere it says the word underwater in this feature space, somebody labeled that reverberation setting, at least once, with underwater. The bigger the word is, the more often that reverberation setting was labeled with underwater. The reason I say this, the reason I'm showing this to you is this is what, from our data, a tight distribution looks like. This is a word we can count on more or less, more than average. This is the word warm done the same way. Again, these were reverberation settings. And what happens is we play a reverberation setting to someone. For this test, what we did was we played a reverberation setting and we said what word would you use to describe that. Someone says warm. And then we took all of the events where a reverberation setting was labeled warm. Each point is a reverberation setting. Again, size is how many people labeled that setting warm. But what you can see about this distribution compared to that distribution is that I could sure take the average of these and I would get a point here, which is of course a place that nobody, nobody would call warm. What this says to me is that there might be multiple regions, in each of which there is some group of people who think, yeah, that's what I mean when I say warm. Our current interface does not -- for the interface we were building for this, we decided that we would sort of hide that complexity. We didn't want to give them here's a warm and here's a warm and here's a warm. We could have. Instead we decided to kind of go with the most frequent one and, say, ah, we're going to have to pick. But knowing the shape of these distributions gives us some possibilities. We can -- if you have enough data from enough people, we can answer questions for particular words. Is this a technical term that most people seem to agree with the experts on? 
Or is this a technical term where -- you know, the term that generated the most, we'll call it disagreement, but maybe more accurately generality, was this one: for the reverberation problem, every single reverberation effect got labeled by at least one person with the word echo. >> Ivan Tashev: [inaudible]. [laughter] >> Bryan Pardo: Which makes echo a useless word for describing reverberation. Why? Because by definition all reverberations are echoes. If you had seen the echo distribution, it's everywhere. But now what we have is we can do the following. If you gave me enough money to hire enough experts and poll them, we could get their definitions for words. We can crowdsource our definitions, ask how much overlap there is, and then do something. But let me -- since you've asked these questions, I've saved a couple of slides. No, no, no. Now, for the moment, please grant me that the person who wrote, say, Adobe Audition, or one of the people who wrote Adobe Audition -- this is why I was hoping a certain somebody would be here ->> Ivan Tashev: [inaudible] >> Bryan Pardo: Let's grant that they're experts and that the words they use are the words experts would use, and we can talk about why that's probably not true for their drop-down menus. What we did is -- this is kind of an interesting thing. Here's the vocabulary. The size of a circle correlates with the size of the vocabulary. This is Ableton Live: their EQ had this many drop-down menu preset labels. Audition had that many. Audealize, the one we learned from the crowd -- we decided we trusted 365 words. Actually, we trusted more than 365; it's 365 plus 18 plus 5 plus 4. Okay. These are the overlapping regions. And so from this, what do we gather? It turns out that between the preset list in this one, the preset list in that one, and the crowd, there are exactly five words that overlapped amongst all three. Wow. Here is one of them: clear. And this is the preset that got labeled with clear for Audition -- or, sorry, for Ableton. This is the Audition one. And here's the one we learned from the crowd. The blue is the one we learned from the crowd. >> Ivan Tashev: [inaudible] how he came up with that. >> Bryan Pardo: Well, at this point, who knows. >>: Maybe his presets were replaced by somebody else's. >> Bryan Pardo: Yeah. And between these two, I would say the crowd pretty much agrees with the Ableton one. These have qualitatively similar shapes, and they're in pretty strong disagreement with the Adobe Audition one. So here's an example where maybe the experts don't exactly agree on what the word clear, for example, means. And so it turns out to be a linguistic problem all around. And there was actually a study -- and you would expect this between England and the United States. There was a study done on organ sounds where they asked people -- oh, I'm forgetting the descriptive word now; I think it was warm. They took audio engineers in New Jersey -- that's in the United States -- and audio engineers in London -- and by that I mean London, England, not London, Ontario -- and said, make this organ sound warm. And so they did EQ things. And there was broad agreement amongst the London audio engineers and broad agreement amongst the New Jersey audio engineers. But if you looked at the EQ curves between London and New Jersey, they were not the same. So this language stuff gets tough. Right?
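To make the kind of agreement analysis described above concrete, here is a minimal Python sketch -- an illustration, not the lab's actual pipeline -- of how one might score each crowdsourced descriptor. It assumes each effect setting has already been embedded in a low-dimensional space (for example, via multidimensional scaling) and that labels arrive as (setting, word, count) triples; the data layout, function names, and thresholds are all assumptions made for this example. A tight word like "underwater" should show a small weighted spread, while a word like "echo" that gets applied to nearly every setting is uninformative.

```python
# Sketch: score crowdsourced descriptors for coverage and spread.
# Assumes: embedding maps setting_id -> 2-D numpy coordinates (e.g. from MDS),
# and labels is an iterable of (setting_id, word, count) triples.
import numpy as np
from collections import defaultdict

def word_statistics(embedding, labels):
    """Return {word: (coverage, weighted_spread)}."""
    by_word = defaultdict(list)
    for setting_id, word, count in labels:
        by_word[word].append((embedding[setting_id], count))

    n_settings = len(embedding)
    stats = {}
    for word, points in by_word.items():
        coords = np.array([p for p, _ in points])
        weights = np.array([c for _, c in points], dtype=float)
        weights /= weights.sum()
        centroid = weights @ coords
        # Weighted RMS distance from the centroid: small for a tight word
        # like "underwater", large for a diffuse one like "warm".
        spread = np.sqrt(weights @ np.sum((coords - centroid) ** 2, axis=1))
        # Fraction of all settings the word was applied to (~1.0 for "echo").
        coverage = len(points) / n_settings
        stats[word] = (coverage, spread)
    return stats

def is_useful(coverage, spread, max_coverage=0.9, max_spread=1.0):
    # Toy criterion: a word is a usable search term if it is neither
    # applied to everything nor scattered all over the space.
    return coverage < max_coverage and spread < max_spread
```

Note that a single spread number cannot distinguish a diffuse word from a multimodal one like "warm", where two tight clusters average to a point nobody would use the word for; clustering the labeled points per word would be a natural refinement under the same assumptions.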
So there's -- oh, on this particular word the crowd agrees with Ableton; on another word, maybe they agree with the definition in Audition; maybe the person who wrote Ableton, or the person who did this preset, has a whole community that all agrees with him; and maybe this person also has a whole community that agrees with him. You're only going to know if you go out and gather enough data, which a company like Microsoft can do more easily than I can. You know, I can do this and get a thousand people, and then I kind of run out of both research dollars and, you know, time and all that other stuff. And a company can do something at a scale that I never could. So maybe one day you will find this out and tell me which words are actually truly broadly agreed upon by experts and by nonexperts. Or we could work together, and you can spend a few thousand dollars, and we can find out together. So I see it's almost 5:00. So I figure -- >> Ivan Tashev: Any more questions? We have four minutes. We have more time, but we can also -- >>: One more quick thing. Have you considered the dependency of the meanings of words like clear, for example, for equalizers, on the sound you apply them to? >> Bryan Pardo: One of the things that I didn't talk to you about is how certain words end up on the map and other words don't. Let's pick an obvious pair. I have a tuba, which plays really low notes, and I have a piccolo, which plays really high notes. Now, if I ask people to evaluate -- what should I do to the equalization at 100 hertz for a piccolo note? Well, there's no energy at 100 hertz. So obviously their responses will only be meaningful in the frequencies in which the piccolo is actually making noise. And if it's not making noise at a frequency, whatever evaluations or things we learned from people will be meaningless there. To get onto the map, we also considered the frequency bands which were represented in the underlying audio upon which the people did their labeling. So we had multiple audio files -- drums, voices, guitars, et cetera -- and for a word to be a big word on the map, we also took into account the underlying distribution of spectral energy amongst the sounds. So if you're a big word like underwater or clear, that means we had a variety of sounds that covered not the entire, but a large part of the range of frequencies we expect someone to have sounds in, and that the word was consistently used among them. That's what makes it onto the map. There's a lot of detail in the paper that I have to sort of gloss over, but it's a very, very important point. And, you know, we could do it another way, which is -- I said frequency. What happens when someone hands me a sound they'd like to apply reverb to that already has reverb on it? This is one we haven't been able to deal well with. With the frequency one we can obviously look at the frequencies and say there's no energy here, but when we start to allow people, as we do now with Audealize, to upload their own sound and teach it a word, we can play these games of quality control with equalization, but we don't really know a good way of telling that there's reverb already on the sound or -- we're also starting to work in compression -- that the sound is already compressed, and that will obviously change a lot of things about how they perceive it when we add more reverb.
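Here is a similarly hedged sketch of the frequency-coverage quality control just described -- again an illustration rather than the published method: a listener's EQ label is only trusted in bands where the source sound actually has energy, and a word is only promoted onto the map if its supporting sounds jointly cover a large part of the frequency range. The band edges, energy floor, and coverage threshold are made-up parameters.

```python
# Sketch: decide which frequency bands a sound can meaningfully support
# labels in, and how much of the spectrum a word's examples cover overall.
import numpy as np

def band_energy_mask(audio, sr, band_edges_hz, floor_db=-60.0):
    """Return a boolean mask: True where the sound has appreciable energy."""
    spectrum = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), 1.0 / sr)
    total = spectrum.sum() + 1e-12
    mask = []
    for lo, hi in band_edges_hz:
        band = spectrum[(freqs >= lo) & (freqs < hi)].sum()
        mask.append(10 * np.log10(band / total + 1e-12) > floor_db)
    return np.array(mask)

def word_band_coverage(examples, band_edges_hz, sr):
    """examples: list of (audio, word) pairs labeled by listeners.
       Returns {word: fraction of bands covered by at least one sound}."""
    coverage = {}
    for audio, word in examples:
        mask = band_energy_mask(audio, sr, band_edges_hz)
        coverage[word] = coverage.get(word, np.zeros(len(band_edges_hz), bool)) | mask
    return {w: m.mean() for w, m in coverage.items()}

# Example policy (arbitrary threshold): only keep words whose supporting
# sounds cover at least 80% of the bands.
```

As the talk notes, this kind of check works for equalization because missing spectral energy is easy to detect; no comparably simple test exists for whether a user-uploaded sound already carries reverb or compression.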
>>: You are able to tell whether or not people agree on a word, right, and then, if the word is generally agreed upon, you know what kind of EQ, for example, belongs to the word. So it would be interesting to also be able to tell whether the word is independent, or more independent, of the source sound, or if it's specific to certain ->> Bryan Pardo: Yeah, okay, I see what you're saying. We have not looked -- >>: [inaudible] >> Bryan Pardo: We have not looked for words that seem to only be used to describe, say, animal cries versus machine sounds or something like this. All of the data that I've described was collected on isolated instrumental and vocal sounds recorded in a studio that were dry. That means no reverb added. And this is true for both equalization and reverb. We do have some datasets available on the Web. If you wanted to download some of this data, have a look at it, see if you saw any correlations, and get back to me, I would be very happy for you to do so. That would be excellent. So let me know. >> Ivan Tashev: Let's thank our speaker again. [applause] Thank you, Bryan.