>>: Good evening, gentlemen. Thank you all very much for coming to our lecture, What's Hot in the Seattle Section. I would really like to thank you for being here tonight and I hope you have a very fruitful evening. I want to especially thank Li Deng, who organized this lecture, and Dr. [indiscernible], who is the sponsor of all the Seattle Section events. So we are grateful you are all here. I would like to introduce Professor Atlas. He is an IEEE Fellow, has chaired many conferences, and as a professor has done fantastic work that is very respected across the whole field of signal processing. Let's welcome Professor Atlas. [Applause]

>> Les Atlas: All right. Thank you all for coming. I have got a title here which is a little bit ambitious, but if you think of what things will be like 100 years from now, you know we are going to have fresh ideas. My claim is that it is up to us to generate them. This work has been funded by a few places: the Office of Naval Research, the Army Research Office and several others, in particular the Coulter Foundation for some of the later work I am going to be talking about. Those of you who have seen me speak in the last five years will recognize some of what I start with, the background, which is vocoding speech, and the notion of an envelope, which is an old idea. It's an old idea that was initially put together in a way that was kind of [indiscernible] and was defined by the signal processing operations instead of by what was really in the signal. So I am going to talk about what's okay with it, what's great about it and what's wrong with it. Then I'm going to do some demonstrations, and these again are neat demos, but they are not that new. But then I'm going to come up with a potentially new model, and we are going to use it in a way that's going to impact, for example, what people hear when they are deaf and have a cochlear implant. Over the next five years we are going to be rolling out a new version of cochlear implant coding, a new software update, and I will be talking about why that is, with reference to an NPR story from two weeks ago.

So the basic idea is an envelope, a signal envelope. For this one-dimensional time signal, this red line looks like an envelope. It's an upper bounding kind of line; I just drew it in that way. There are various ways of estimating this upper envelope. This is the passage in time, "bird populations," over 1.2 seconds. If you zoom in on that you see the red part, which people call an envelope, just like in AM radio, and the blue part, which people call temporal fine structure. They just call it that; it is defined by signal processing operations. People have not argued much about how it should be decomposed, but they have shown it's useful, and most people have agreed that it's useful. From 1939 I quote something out of the Journal of the Acoustical Society of America from Homer Dudley. In kind of old-time language, in so many words, he says that speech and other acoustic signals are actually low-bandwidth processes that modulate higher-bandwidth carriers. Now why is that? Why does speech do that? Why does, for example, music do that? Well, if you took a look at the information rate of something like speech, you would see it around 10 Hz, 50 Hz, 100 Hz, something like that. So if we used something like a Nyquist criterion based on the bit rate of speech,
you would then need to produce signals at 50 Hz, 100 Hz, 200 Hz, and if we needed to produce signals with that kind of low-frequency content and launch them across the room, we would need heads that were maybe as tall as the ceiling, bigger than this ceiling, about 20 feet high, if you take a look at the wavelength of sound at those frequencies. Our heads and vocal tracts would be so big they would be impossible to carry around. So the idea that evolution gave us, by trial and error I guess, to form speech was to modulate higher-frequency information, up around 8 kHz or 10 kHz, which is the range of our vocal tract and our other associated articulators, our glottis and so on. So we don't generate signals that are down at 40 Hz, or 20 Hz, or 4 Hz when we produce speech. This red envelope is going from about 2 Hz up to about 16 Hz or 20 Hz, and we don't generate those frequencies. If you look with a spectrum analyzer you won't see them, but we modulate stuff with those frequencies. Modulate means product, and I'm going to come back to that notion of product pretty quickly.

Now the question I raise in my own research, and this is what I have been raising for maybe the last 10 or 15 years, and it has finally come to fruition or worked out in some areas, is this: people made certain assumptions because they were inspired by AM radio, since that's what was known when people started thinking about this concept. AM radio has a big fat carrier and it has smaller sidebands that carry the information, so the receiver can be cheap. You waste a lot of carrier power, but the receivers are cheap; you needed that in the '20s and '30s, and it still works now, but it's not necessarily optimal. People's thinking was bounded by that view, and the envelope in AM radio is non-negative and real. Why? Because that's the way it was and that's what worked back then. And when people think of these modulations they still think non-negative and real. That's where things start to fall apart, and that's what I will be honing in on shortly.

What that really affects is what's left. If you assume that modulation means multiplication, then you are multiplying that blue stuff, which we will call temporal fine structure, the higher-frequency information, the stuff up at 2 kHz to 8 kHz, or maybe 80 Hz to 8 kHz depending on how you define it. That multiplies the red stuff, which is a modulating waveform, which is what we are after. And that red stuff is non-negative and real because people assume it that way. That's not a mathematical fact, it's just what people have assumed, and in fact they have assumed it in a couple of fairly classic papers. In 1994 one of them took a look at smearing the envelope. What happens if you take speech, you preserve its temporal fine structure, preserve the blue stuff so it's the same, and you take the envelope and smear it? I am going to do an experiment like that and show you what's wrong with the usual assumption. The second paper says to take the envelope from speech, combine it with the fine structure from music, and slam them together. What do people hear, speech or music? Now let's do the opposite: take the envelope from music and combine it with the fine structure, the high-frequency information, from speech, slam them together, multiply them by each other and see what you hear. That experiment was in Letters to Nature, the top journal, one of the highest-impact journals there is. What I am going to do is debunk these ideas, sorry.
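As a quick back-of-the-envelope check on the wavelength argument above (standard acoustics, using roughly 343 m/s as the speed of sound; the numbers are my own illustration, not from the slides):

```latex
\[
  \lambda \;=\; \frac{c}{f} \;=\; \frac{343\ \text{m/s}}{50\ \text{Hz}}
  \;\approx\; 6.9\ \text{m} \;\approx\; 22\ \text{ft},
\]
% so radiating the message band directly at tens of hertz really would call for
% radiators on the order of the 20-foot figure just quoted, which is why speech
% instead modulates carriers up in the hundreds-of-hertz to kilohertz range.
```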
I'm going to debunk them right now. What I am going to do is not necessarily combine speech and music; I'm just going to start with speech. I'm going to take speech and I'm going to take the envelope, defined to be non-negative and real, as carefully as I can, using state-of-the-art technology and a high enough sample rate that everything engineering-wise is done correctly, but the envelope is assumed to be non-negative and real. These pictures are spectrograms, time versus frequency: a standard sliding short-time Fourier transform, then take its magnitude, then dB into a color map. The top picture has no modifications. I am going to play it for you so you hear what it is, just an unmodified signal, and all we are doing is looking at a picture of it. [Demo]

>> Les Atlas: So that's Z, Z, K right there. Now the next thing I'm going to do is take the envelope of the signal and smear it with a lowpass filter, a quite extreme lowpass filter with a corner at about 1.2 Hz to 2 Hz. Let's hear what happens when I do that. [Demo]

>> Les Atlas: There are actually three tokens in that example that I just played. I will play it for you again, but I will warn you about what you are hearing. When you smear an envelope, what you should be doing is stretching these things out so it sounds like you are taking someone's vocal cords and going zzzzzz, doing that to them. You kind of hear that a little bit. You also hear a "buzzy" sound and you also hear an "echoey" sound. So let's listen again: the smearing, zzzzz like that, the "buzzy" and the "echoey". Let me play it and then I'm going to explain the "buzzy" and the "echoey" parts shortly. [Demo]

>> Les Atlas: Okay, the buzziness is pretty obvious. It turns out that when you look at this mathematically, when you look at the spectrum closely, you are turning this signal into a bunch of comb-like functions. We don't see it at this resolution on the spectrogram and I won't go into that in detail. You can see that things are smeared. They are smeared in time, zzzzzz, like that, and this is non-causal so it's smeared both ways in time. I can do a causal version with the same kind of idea. And the echoing is right there, a little bit there, a little bit there and a little there. What's going on with the echoiness? The result of lowpass filtering the envelope, and this is a rather severe lowpass, is not non-negative anymore. It's lowpass, but nothing is forcing it to be non-negative. So the envelope goes negative; what do you then do? You either switch the phase or you set it to zero. And it's switching the phase right here; it's passing through zero. These actually are negative-going envelopes right in the middle there. It passes through zero and it gets very quiet. So that's the reason for that "echoey" sound. So we know that this kind of analysis has artifacts. In particular the echoiness is due to the assumption that the envelope is non-negative and real. You can't smear an envelope and then force it to stay non-negative. You can force it to stay real, but there are problems with that too, like the buzziness.

So, all forms of analysis that I will be talking about, even the cochlear implant analysis, have some form of frequency analysis, a filterbank. Dividing a signal into its short-time Fourier transform is a filterbank, and dividing it into a bank of filters that are like the filters in a cochlea is another thing to do.
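For reference, here is a rough sketch in Python of the kind of envelope-smearing demonstration just played. It is a minimal reconstruction under my own assumptions, not the speaker's actual code: the file name, filter order and exact corner frequency are placeholders, and this version works on the full-band signal rather than subband by subband.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import hilbert, butter, sosfiltfilt

fs, x = wavfile.read("speech.wav")            # placeholder file, assumed mono 16-bit PCM
x = x.astype(float) / 32768.0

z = hilbert(x)                                # analytic signal
env = np.abs(z)                               # conventional non-negative, real envelope
tfs = np.cos(np.angle(z))                     # "temporal fine structure"

# Severe smearing: lowpass the envelope with a corner around 2 Hz (zero-phase, hence non-causal).
sos = butter(2, 2.0, btype="low", fs=fs, output="sos")
env_smeared = sosfiltfilt(sos, env)

# Nothing forces the smeared envelope to stay non-negative.  The usual fixes are to
# clip it at zero (quiet gaps, the "echoey" artifact) or to keep its sign, which flips
# the phase of the fine structure (contributing to the "buzzy" artifact).
y_clipped = np.maximum(env_smeared, 0.0) * tfs
y_signed = env_smeared * tfs

wavfile.write("speech_smeared.wav", fs, (y_clipped * 32767).astype(np.int16))
```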
So the filterbank could be an audiogram-style bank like people use in audiology laboratories, a fast Fourier transform as in a spectrogram, or it could be a cochlear model. It could be a wavelet transform, except that might not be representable exactly this way, though it would be very close in terms of what it represents. I'm not going to talk about the differences between these kinds of filterbanks. That's not my subject; what matters is what happens after the filterbank. What is the first thing you do after a filterbank? In those pictures I showed you on the last slide we took a dB magnitude into a color map, period. That's what we did, so you saw pictures, which forced the complex numbers of the short-time Fourier transform to be real and non-negative. We did it, okay? That wasn't science, it wasn't our vocal tracts, it was us and our signal processing that did it.

So let's talk some about that decomposition. In particular, what people tend to do when they do the decomposition is the AM radio example. AM radio would take a filterbank, and this is what the cochlea does to a first approximation also. It puts the input signal into a bunch of filters, capital N of them. If you are talking about a cochlear implant, capital N might be 22 or 8. If you are talking about an intact, normal cochlea itself, you are talking about something like 1,500 to 30,000 overlapping filterbank channels. It's the same idea in all cases. What happens after the filterbank? It's the AM radio model in the ear, but you have got 30,000 of them. In a cochlear implant you have 8 or 22 of them, a much smaller number. In a spectrogram you might have 10, 20, 4, 5 or 12. The first thing that happens is some form of rectifier. Okay, I actually have a base level that's a little below zero to represent what the ear does: there is a suppression of the base level when things go negative. That's an instantaneous nonlinearity. In other words, it's linear for inputs that grow positive; if the input goes negative it depresses the rate a little bit, and that's it. So it's a nonlinear function which is instantaneous. You know that if you rectify a signal and then lowpass it you are going to find an envelope, and that's an envelope follower. People say the ear is an envelope follower, and I think that actually is a good approximation of what the ear does, but remember the ear does it with 30,000 or so overlapping filterbank channels. Those are all then pulled together by a stream of hundreds of thousands of interconnected neurons with billions of interconnections before it gets to the brain itself, the auditory cortex. So there's a lot going on with those 30,000 channels coming in. So this is a model at the micro level of what happens in the ear. In terms of a macro model, not nearly as good, there is what comes out of using a second transform on top of this; it has been used in MP3 and through parts of AAC+ and other things, and we have worked on that ourselves. I won't be talking about that, but I will talk about some past views.

So this is what I was talking about on the last slide. The simplest way to look at modulation or an envelope is to take rectification and follow it with a lowpass filter, an AM radio detector. Another model gives you the same results except you don't need the lowpass filter. What it does is split the spectrum into positive and negative halves by finding an analytic signal. It uses a Hilbert transform to do that: set all negative frequencies to zero and keep positive frequencies untouched. That means the signal is complex. It has a magnitude and a phase.
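As a small illustration of that equivalence, here is a sketch (my own toy example, not from the talk) showing that rectify-then-lowpass and the Hilbert magnitude recover essentially the same non-negative, real envelope for a narrowband subband:

```python
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

fs = 16000
t = np.arange(int(0.2 * fs)) / fs
# Toy subband: a 1 kHz carrier with an 8 Hz, always-positive AM envelope.
env_true = 1.0 + 0.8 * np.cos(2 * np.pi * 8 * t)
subband = env_true * np.cos(2 * np.pi * 1000 * t)

# Envelope follower: half-wave rectify, then lowpass well below the carrier.
sos = butter(4, 50.0, btype="low", fs=fs, output="sos")
env_follower = np.pi * sosfiltfilt(sos, np.maximum(subband, 0.0))  # pi undoes the half-wave gain

# Hilbert envelope: magnitude of the analytic signal.
env_hilbert = np.abs(hilbert(subband))

# Both track env_true closely (up to filter detail and edge effects), and both are,
# by construction, non-negative and real.
```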
So when people talk about the magnitude being an envelope, it is the same thing as that lowpass envelope. The details have to do with how you implement the Hilbert transform versus the lowpass filter; other than that it's the same concept. But when you use this kind of processing you also have, remember those blue lines at higher frequency that I called temporal fine structure, this phase term. You take the angle itself, or maybe differentiate it if you are brave, or take its cosine, and that's what gets called temporal fine structure, the Hilbert phase. It's a nasty thing to deal with, but that Letters to Nature paper used it. When they combined the fine structure from music with the envelope from speech, and vice versa, they were using that phase part, the cosine actually. These approaches are often called "vocoding", and what's implicitly assumed? There it is: what is everyone assuming when they are doing this? They are assuming a product. This capital N is the number of different filterbank channels; in the ear capital N is around 30,000, in a cochlear implant it's 8 or 22, and in a standard speech experiment it might be 4, it might be 16, it might be 1024, some number that the designer chooses. Then you have a modulating envelope which is a function of whichever subband you are in. This is not processing that happens out in the world; it's a model. You assume a model which is an envelope times a carrier, a higher-frequency carrier. That's what people are implicitly assuming. Now, it's the constraints that you put on M and the constraints you put on C that matter a whole lot. And why are people assuming that M is non-negative and real? Well, because of AM radio, that's the reason why. And of course we are going to deviate from that.

In fact, if you just start with this equation right here: in 1963/1964 Alan Oppenheim, before he did deconvolution, did demultiplication, homomorphic demultiplication. I worked with that; there was a paper by Tom Stockham and Al Oppenheim from around 1963/1964. They did homomorphic demultiplication before they did homomorphic deconvolution, and it was so hard to do. It was really hard, because everything is passing through zero and you have got to take a log, and when things go negative you have to take the log of a negative number. For all kinds of reasons it was too hard to do. So they instead applied deconvolution and that worked pretty well. That's what his PhD dissertation was on; he pretty much started his career based on that. But in that case he gave up on the demultiplication, and it's a hard problem because it is ill posed. Okay, ill posed doesn't mean it's not solvable; it means you could use some optimization technique, and other people have done that. Malcolm Slaney with Greg Sell, colleagues of mine, and a few people at Cambridge have done this using optimization techniques where they put down constraints. But fundamentally you can't find a unique M and C given S. You know, 6 is equal to 2 x 3, but 6 is equal to 6 x 1 also, or 3 x 2. So there isn't a unique solution for M and C given S; that's a fundamental problem.

>>: Do you mind if I ask a question?
>> Les Atlas: Please.
>>: So what if you had some assumptions on the C's? So for example, if the C's are roughly [indiscernible] or close to narrowband, and if you are willing to do synchronous demodulation, then you don't need to constrain the modulation to be positive, right? The modulation could be negative [indiscernible].
>> Les Atlas: In fact it will be complex.
>>: Right, it will be complex.
>> Les Atlas: Yes, in general it will be.
>>: I think what you are saying is that because we keep thinking about AM and simple rectifying, that's what motivates us, but there's no [indiscernible] that actually restricts it to that.
>> Les Atlas: People have defined the decomposition based on how it was implemented in the first place, which is an AM radio model, and they have stuck with that. That's all I am arguing: get out of that box a little bit. So what you are saying is constrain the C to be narrowband. It's very close to the constraint I'm going to make, okay? I like that constraint, as you are going to see.

So how significant is this problem? We have written about this, because the first thing that we did was work on the homomorphic modulation spectrum. We used what Stockham and Oppenheim tried to do and modernized it a little bit. What we found when we did that, and this is a theoretical proof, is that in general this constrained problem is not one where the modulator has to be non-negative and real: under very general assumptions the modulator is in fact complex. It has to do with spectral symmetries about the band midpoints under a narrowband assumption. Make a narrowband assumption about the carrier, and ask what the symmetry is of what remains around it. There is nothing that forces it to be Hermitian symmetric in frequency. That's really what we showed in general. And for a product model, a coherent and band-limited carrier is needed. So you have to estimate narrowband harmonics. Not a single harmonic, but narrowband harmonics, hopefully one bound to every frequency subband, which is not easy to do. Estimate the harmonics of speech, for example, which requires that you know the pitch of the speech; you assume it's voiced speech and not unvoiced. Tracking all those harmonics in real time is what we are going to have to do in the cochlear implant, and I'm doing that right now. What's left is now your modulator, and that modulator in general is complex. Now that you have a complex modulator you don't have to keep it complex, but you at least have information that's not distorted, and you can then do things to it like take its real part or other operations like that. I'm going to give you a demonstration of what I mean by distortion free.

So what Drullman did was highly distorting for this filtering system. It did not match the modified magnitude to a consistent phase. It wasn't just that you were forcing things to go negative; the phase you end up with and the magnitude do not match each other. You may iterate on that and make it sound a little better, but the more modification you make, the worse it gets. The Smith paper, instead of using a rectifier and lowpass filter, used the Hilbert phase; it used the Hilbert transform to find an analytic signal. That seemed more sophisticated, with fancier-sounding terminology and a little more complicated code, but the underlying concept of a non-negative real envelope was still there and still assumed. Both are similar, along with most others, and they have the same severe problems with their definition of what's left, that temporal fine structure.

So what's wrong with the usual vocoding? Actually, the first person to mention it was not Homer Dudley. Take a look at Helmholtz, the person who discovered the fact that the ear does a Fourier transform and wrote a book about it. If you look at the English translation of Helmholtz's book, he talks about this same problem.
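To put that non-uniqueness point in symbols before turning to Helmholtz's observation (this is my own notation, assumed for illustration rather than copied from the slides):

```latex
% Subband product model: each subband is a modulator times a carrier.
\[
  s_n(t) \;=\; m_n(t)\, c_n(t), \qquad n = 1, \dots, N .
\]
% Non-uniqueness: for any nonvanishing g(t), the pair (m_n g, \; c_n / g) reproduces
% the same s_n, just as 6 = 2 \times 3 = 6 \times 1.  Only a constraint on the carrier,
% for example a coherent, narrowband (harmonic) c_n(t), pins down m_n(t), and the
% modulator that then remains is in general complex rather than non-negative and real.
```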
He says, in effect, "Beware: if you have two tones beating with each other, the envelope goes negative." I read that and said, "Oh my god, people missed this because they assumed AM, where the envelope stays only positive." So Helmholtz predicted there would be a problem, and a few other people over the years did too. Lotfi Zadeh did, Tom Kailath did, and a few others like that.

So I can show you an example right now. These are now subbands that are spaced like an auditory filterbank. This is not a representation of the auditory system, because it's not dense enough in frequency sampling, but it's close to what the ear does in terms of spacing and close in bandwidth. What's significant is that this is a single speech input, and you can see how it's broken up: the unvoiced portion of the speech now corresponds to the higher frequencies. I think this is the word "she", and the sh sound is up at the higher frequencies. At lower frequencies you lose it, and the ee sound shows up at lower frequencies, from 100 Hz up to 1250 Hz, then drops off at higher frequencies. So the voiced part is at lower frequencies and the unvoiced sh part is up at higher frequencies. That's all well known, but the red part is the point here. And I didn't draw that in; it's actually found by a Hilbert envelope, the red part in this one. This is the first hint that there's a problem. Let's zoom in on this part right here. Let's take just this portion of the voiced part of the speech and zoom in on it. There it is, and there's the envelope found by the Hilbert envelope, non-negative and real, and you can see what's happening, and that's not what Helmholtz would have said should happen. That's not what happens when you have two tones beating with each other. What happens when two tones beat with each other is that the envelope goes positive, then negative, positive, then negative; it just comes out of the math. If you work out the math, the envelope is going to do this (the calculation is sketched below). There it is, right out of Helmholtz, 1864. So if you assume the red is the correct solution, it's wrong; it doesn't match what Helmholtz said things should do. It doesn't match a model of two tones that are beating with each other. It matches two tones beating with each other with a big fat carrier in between, the AM radio model, but that's not what's going on in these subbands, and this is natural speech, not anything synthetic. So real speech should go negative in terms of its envelope.

So let's make use of this and come up with a representation that's distortion free in terms of modifying the modulations. What I am going to do is use a more careful way of pulling the fine structure carrier and the envelopes apart. I'm going to do that with speech: I'm going to pull it apart and then put it back together piece by piece, harmonic by harmonic, build it up, and hear what happens. Then I'm going to go ahead and modify the speech by modifying the envelope and the fine structure, that carrier stuff, independently, and we will hear what happens when you do it right. Then I will go back to the original assumption, which is a non-negative real envelope, and do the same operation, and of course it won't sound as good, not even close. Then we will finish with some applications. So let's take a look at just some speech I am going to play with, unprocessed speech to begin with. [Demo] Okay, this is a spectrogram where this dimension is time and this dimension is frequency. You can see clear harmonics in this portion.
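Here is that two-tone calculation, a minimal worked version of Helmholtz's point:

```latex
% Two equal tones beating:
\[
  \cos(\omega_1 t) + \cos(\omega_2 t)
  \;=\; \underbrace{2\cos\!\left(\tfrac{\omega_1 - \omega_2}{2}\, t\right)}_{\text{slow modulator}}
        \cdot
        \underbrace{\cos\!\left(\tfrac{\omega_1 + \omega_2}{2}\, t\right)}_{\text{carrier}} .
\]
% The natural modulator 2cos((w1 - w2)t/2) is real but swings negative every half beat
% cycle.  The Hilbert (non-negative) envelope is its absolute value, which bounces off
% zero instead of passing through it: that is the mismatch the red curve shows.
```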
I've only got a range of 3 kHz shown, so you can see the harmonics clearly. There is frequency content that goes higher than 3 kHz in the signal you heard, but I'm really stressing the lower frequencies. I've got no pre-emphasis to bring out the high frequencies, and I'm only doing this so you can see how clear these harmonics are in this high-SNR example.

>>: Doesn't this create a frequency just because [indiscernible], the way we do the sampling, that really our vocal tract has that kind of discrete behavior?
>> Les Atlas: Yes, the nature of our speech is that it is a harmonic series because it's almost periodic. So this, for example, is "bird populations"; "bird" is right there. That's got a periodic signal, almost exactly periodic. I feel vibrations in my glottis when I say "bird", the "ird" sound. The pitch of that "ir" sound for this female speaker is about 210 Hz, but it's not exactly flat; it's got a tilt to it. Then it moves down here and it moves down there. So it's not exactly periodic, but it's almost periodic, almost a flat line, and I think we'll force it to be exactly periodic later; you will hear what happens. And because it's not a pure sinusoid, that "ir" sound has a first overtone, a second, a third, or partials if this were music. You've got all those harmonics, tilted with increasing slope as a function of which harmonic it is. So voiced speech is like this. Musical instruments that are sustained, sonorous instruments are like this also. Yes?
>>: [indiscernible].
>> Les Atlas: I can give you one quick answer that won't be the whole story. The fundamental frequency, which is this frequency right here, is from a female speaker; for a male speaker it would drop down to maybe two-thirds as high, and for a child it could be a little or a lot higher, quite a bit higher for some children who are really young. So the fundamental frequency, that lowest line, its height is a function of whether the speaker is male, female, or a bass male, which would be really low, but there are so many other attributes; this is only a small part of it.

In fact, let's go ahead and estimate what that pitch is. Estimating the fundamental frequency is a step toward being able to pull out the envelope in a correct way, using a product model more carefully. Before I start into this, I'm going to make a narrowband assumption like Rico Malvar suggested, and that narrowband assumption is going to be severe, because I'm going to say the signal has no harmonics, only a fundamental frequency, just to start. That fundamental frequency I'm going to estimate. I shouldn't have said this about 6, 7, 8 years ago when we first were doing this, because now we've got to do it in real time, accurately, at +5 dB SNR for the cochlear implant, and that is hard. We are getting there, but it's hard. So it turns out this is a useful thing to do. I never thought estimating pitch, which is what this is, would come back to haunt me, but when you do it, and you do it at high SNR, we had a system that did it well, and this is what it sounds like for the same passage. [Demo] That is what you heard on the last slide, the "bird populations" passage, if you only listen to the pitch. [Demo] That's it, fundamental frequency only. So now, because I have the fundamental frequency, I have a constraint on that product form. Given that constraint I can start stripping off the terms m_n(t), starting with the component at the fundamental frequency.
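A minimal sketch of this kind of carrier-coherent stripping and resynthesis, under my own assumptions about the details (the pitch track is taken as given, the modulator bandwidth and filter order are placeholders, and this is not the real-time implant code):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def coherent_decompose(x, f0, fs, n_harmonics=8, mod_bw_hz=60.0):
    """Complex modulators m_k[n] for harmonic carriers c_k[n] = exp(j*k*phi[n])."""
    phi = 2 * np.pi * np.cumsum(f0) / fs                  # carrier phase from per-sample pitch track f0[n]
    sos = butter(4, mod_bw_hz, btype="low", fs=fs, output="sos")
    mods = []
    for k in range(1, n_harmonics + 1):
        baseband = x * np.exp(-1j * k * phi)              # shift the k-th harmonic band down to DC
        m_k = sosfiltfilt(sos, baseband.real) + 1j * sosfiltfilt(sos, baseband.imag)
        mods.append(m_k)                                  # complex modulator; no positivity or realness forced
    return np.array(mods), phi

def resynthesize(mods, phi, fs, flat_f0=None):
    """Rebuild from complex modulators; optionally replace the carrier with a flat pitch."""
    if flat_f0 is not None:                               # e.g. flat_f0=200.0 gives a monotone carrier
        phi = 2 * np.pi * flat_f0 * np.arange(mods.shape[1]) / fs
    y = np.zeros(mods.shape[1])
    for k in range(1, mods.shape[0] + 1):
        # factor 2: each real harmonic splits evenly across +/- k*f0
        y += 2.0 * np.real(mods[k - 1] * np.exp(1j * k * phi))
    return y
```

Under these assumptions, keeping only the first one, two, four, or eight modulators corresponds to the build-up about to be played; setting every modulator to 1 removes the modulation entirely; and resynthesizing on a flat 200 Hz carrier while keeping the complex modulators is the flat-pitch demonstration.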
There it is, the first modulated component: the same pitch track, but now I am modulating it. Because I found what the carrier is, I solve that equation by constraining the carrier to follow the pitch from the last slide and see what modulator is left in the low-frequency range that corresponds to the fundamental. So let me play this; it is now modulated with a complex signal, not a real, non-negative one. [Demo] Okay, and that's a little better than the last one. It's getting closer; let's keep going. Let's do it with two, the fundamental frequency and its first harmonic, both being independently modulated by the residual that resulted from that constraint. [Demo] Okay, let's put in the next one, the fundamental and two harmonics. [Demo] It's getting closer. I am pretty deaf, so that sounds almost perfect to me because I have no high-frequency hearing, but let's keep going for people out there who have normal hearing: four modulated components. [Demo] It's starting to get more intelligible. We are starting to get there. Let's do eight of them. This is not the full bandwidth of the speech; this is not the speech itself, just the first eight harmonics. If your hearing is good it is going to sound kind of like it has been lowpass filtered, but if your hearing is like mine it will sound just like the original. Let me play it for you and then we are going to have some fun. [Demo] That's all synthesized from the fundamental that I estimated in the first step, put together with complex envelopes.

We are going to get to the old way of doing this pretty soon, and you will hear the distortion, but before I do that I'm going to do something funny. I am going to take out all the modulation. I'm going to leave just the carrier and its harmonics, that first thing I estimated along with all its harmonics, and set all the modulators to 1, purely real and non-negative. This is not my main argument; I'm just having some fun here. Okay, let's listen to this. [Demo] Totally unintelligible, okay? You need that modulation. People claim the modulation isn't important, but hey, this is not intelligible, so you really need to have it. Even if you knew what the sentence was, and you did know what it was, it's still not intelligible. So if you ran this test on someone, they are not going to get it.

Now let's have some other fun. Instead of removing the modulators, let's keep the modulators and take the fundamental frequency and the carrier and its harmonics and make them all flat. Let's give it a pitch of exactly 200 Hz. It's now exactly periodic in terms of its carrier; that temporal fine structure is just flat, a horizontal line. And I'm going to keep the modulators, which are complex, from the original speech, keep 8 harmonics, and play that. So the estimated fundamental frequency from the speech was used to find the modulators, but it's not being used for this resynthesis now; it's modulators only, on a flat carrier. [Demo] Totally intelligible, clear as a bell, but the pitch is flat, affectless, because we put in a flat pitch.

Now a little more. Let's try the same experiment I just did for you, flat pitch, modified pitch, but let's assume that our modulator is non-negative and real, the old assumption, throw it all back together and see what we get: the old way of doing it, like in that Letters to Nature paper. This is what we get, with the modulator forced non-negative and real, same color map as the last slide.
[Demo] It's almost totally distorted. There are a lot of other things you can play with. We actually have a toolbox, a MATLAB toolbox, that you can download and goof around with on your own. So being able to pull apart signals like this is not necessarily a brand new form of vocoding, but the old way is what people were doing when they were talking about modulators. They were doing that, it didn't sound good, and my claim is that if you do something like this it sounds distortion free, because the assumption is correct.

So where are we at currently? It has been used by others: Steve Schimmel, who is now working at Apple; Won et al. in Hearing Research; Nie, ICASSP 2008; Sell, working with [indiscernible]; Li, a JASA paper; Imennov; Shaft has a JASA paper; and various other people. This isn't a complete list by any means. There is also a statistical variant of all this which has been applied to reverberation reduction, and that has just been accepted for publication. I don't know if it's available on our web page yet, but it should be shortly, because it has been accepted for the Transactions on Audio, Speech and Language Processing. So there is analysis showing removal of reverberation, and it also began to work with -20 dB SNR speech. Okay, let me say that again: -20 dB SNR speech with 2 sensors, 2 microphones, -20 dB SNR where the noise is very different in its statistics; we take advantage of that. So the statistical variant of what I am talking about, of the complex modulator, is something I'll end with in my last couple of slides. We will come back to that.

>>: Do you mind if I ask a question?
>> Les Atlas: Sure.
>>: Do you mind just putting up the last slide briefly?
>> Les Atlas: Go back by one?
>>: Go back by one, the one where you had the constant pitch but the correct modulators. Not this one, the one before.
>> Les Atlas: This one.
>>: This one, yes. It sounded pretty flat, which means that for applications such as recognition it might actually work pretty well. A recognizer might recognize that as well.
>> Les Atlas: Not for a tonal language.
>>: Right, but if that is true, now I am thinking: if I am building a system where I have to pick up an audio signal, send it to a recognizer somewhere over there and bring it back, could I encode your modulation signals at a lower bit rate than I actually encode the signals that I'm sending to the speech recognizer?
>> Les Atlas: If you have a good way to encode these, they are now complex envelopes; if you have a good way to encode complex numbers, then yes, and my last couple of slides might give you some hints along those lines.
>>: Because they must have some smoothness, otherwise they wouldn't be good envelopes. They can't be erratic; they have to have some smoothness that a [indiscernible] might be able to capture without needing that high a bit rate to send to the cloud. You might lose the pitch, but it doesn't seem like it matters.
>> Les Atlas: For non-tonal languages, yes.
>>: For non-tonal languages, right, or I could send the pitch, because that pitch is also there.
>> Les Atlas: Okay, so they are pulled apart and you can throw them back together and get the same information, and it turns out, as you will see in my last couple of slides where I give just a little math, that the correlation between the real and imaginary parts is the innovative stuff in that signal.
>>: [inaudible]. But you haven't done any work on actually trying to build a compression system out of this thing, right?
>> Les Atlas: No.
>>: Okay.
>> Les Atlas: Doing it the way I will be getting to in my last couple of slides, it might be doable.
That's something to think about; we haven't thought about it for a while because we hadn't used the statistical method you are about to see shortly. So there might be some value there. We have been working on cochlear implants, but maybe the next thing is along those lines. That's a very interesting idea.

So let's get to the cochlear implant, and I want to play for you what was on NPR two weeks ago, because it was kind of exciting. My mailbox has blown up since then. This is the problem I worked on 31 to 35 years ago, which was to get speech into non-functioning ears, ears where the cochlea was there but was not transducing sound into neural impulses. The inner ear is what turns sound into neural impulses. The middle ear is like an acoustic impedance match, and the outer ear gives you the kind of acoustic imprint that helps you localize and capture sound. And what happens when someone's inner ear doesn't work? 350,000 to 400,000 people have cochlear implants now. It has taken off like crazy; as soon as I got out of the area, suddenly it worked. Speech through a cochlear implant is a successful technology. I didn't believe it, because I had worked on it for four or five years, had what I thought were really good ideas, and we couldn't get open-set speech working. If you added cues, yes, it helped lip reading, but why do you need surgery to help lip reading? The goal is someone talking on the phone.

So someone did the ultimate experiment with me. I got a phone call from a guy who asked why I didn't want to work on cochlear implants anymore. I said, "Because they don't work well." I was so disappointed, and I had spent years working on them. We thought our technology was so good, and people couldn't get open-set speech. Then what the guy said floored me. He said, "I have a cochlear implant." Here I am talking to him on the phone. That's all it took, and I said, "Oh, you've got a cochlear implant, and you are talking to me on the phone," and he said, "Yes." He didn't have to say "What?" or any of that. He understood me like a normal listener on the phone. I said, "Okay, where do you have problems?" He said, "Music. Make it work for music, please make it work for music." I said, "Okay," and that was a few years ago. So that's what we are going to get into now, making it work for music, which is going to make use of these ideas.

The publication on what I am going to play for you appeared in July 2013. I'm not going to go into all the details, and the basic idea has been refined since. Let me bring up a web browser separately so I can get the NPR piece going. The NPR piece is about 5 minutes long, but it's very informative because it really gives the human element of the story. I might skip some other material, just because this NPR story is superb. The reporter did a great job. They had me in the studio for a couple of hours and they used a few seconds of me saying a few sentences, but that's how it always works.

>>: Today in Your Health we are going to explore what it's like to experience music through a cochlear implant. To people who depend on these surgically implanted hearing devices, a song like this [demo] may sound more like this [demo]. NPR's Jon Hamilton reports on a man whose love of music has persisted through hearing loss and cochlear implantation.
>>: By the time Sam Swiller turned one, even loud noises had become faint.
Hearing aids helped, but Swiller was living in a different world when it came to perceiving sounds.
>>: The earliest memory I have [indiscernible] is a family picnic around the time I was four or five, really feeling isolated and separated from everyone, even though it was a picnic filled with family friends and young kids running around.
>>: Swiller says even with hearing aids he was understanding maybe one word in three. He relied on lip reading and creativity to get by. And he took refuge in music, loud music that cranked up the sounds he heard best, drums and bass.
>>: Isolation was kind of a common theme in my childhood, and so Nirvana kind of spoke to me in a way. [Music]
>>: But Swiller had no idea what the lyrics were saying until MTV started closed-captioning its music videos. He says it didn't matter.
>>: I just love music, not just the sound of music but the whole theory of music, the energy that's created, the connection between a band and the audience, and just the whole idea of rocking out.
>>: Swiller kept rocking through high school and college. Then in his late twenties the hearing he had left pretty much vanished. So in 2005 he swapped his hearing aids for a cochlear implant. One part of the device is implanted under the skin behind the ear; it receives information from a microphone and processor that are worn like a traditional hearing aid. When a doctor switched it on for the first time, Swiller wasn't prepared for the way people would sound.
>>: I remember sitting in a room and thinking everyone was a digital version of themselves.
>>: The voices seemed artificial, high pitched and harsh. He couldn't figure out what people were saying.
>>: You are basically remapping the audio world, and so your brain is understanding, "Okay, I understand this is a language, but I need to figure out how to interpret this language."
>>: Which he did over the next few months, and he started listening to music again, but through the implant Nirvana was less appealing.
>>: So I was kind of getting pushed away from sounds that I used to love. I was also being more attracted to sounds that I never really appreciated before. [Music]
>>: I started to like a lot more folk music and a lot more vocals. So, like, Bjork is a good example.
>>: A cochlear implant isn't just a fancy hearing aid. Jessica Phillips-Silver, a neuroscience researcher at Georgetown University, says the devices work in very different ways.
>>: The hearing aid is really just an amplifier. The cochlear implant is actually bypassing the damaged part of the ear and delivering electrical impulses directly to the auditory nerve.
>>: The brain has to learn how to interpret these impulses, and every brain does that a bit differently. Phillips-Silver says another factor is that implants simply ignore some of the sound information in music. She showed how this affects listening in an experiment that involved people with cochlear implants responding to some dance music. [Music]
>>: It was called Suavemente, by Elvis Crespo. It's one that's heard commonly in the clubs and it gets people going.
>>: At first, people with implants responded just like other people. They began moving in time with the music, but which sounds were they responding to?
>>: There is a lot going on. There are a lot of different instrument sounds, there is a vocal line, and there is a great range of frequencies. It's fairly intricate music.
>>: So Phillips-Silver had participants listen to several stripped-down versions of Suavemente.
[Music] When the volunteers heard this drum-tone version they had no trouble keeping the beat, but when they heard this piano version they had a lot of trouble. [Music] Phillips-Silver says that's because what a cochlear implant does really well is transmit the information needed to understand speech.
>>: Where it is somewhat lacking is more in relaying information about pitch and timbre. So, for example, being able to tell the difference between notes that are close together on a keyboard, or being able to tell the difference between two similar instruments.
>>: Other scientists are trying to make those things possible. Les Atlas is a professor of electrical engineering at the University of Washington.
>> Les Atlas: There is no easy way to encode pitch as an electrical stimulation pattern; that's problem number one. Problem number two is to be able to do that in real time, that means as the music is coming out, which is a difficult problem.
>>: As a result, even the newest cochlear implants provide very little information about pitch. Take a simple tune on a piano. [Music] Through an implant it may sound more like this simulation. [Music] So Atlas and other researchers are working on software that allows implants to convey more information.
>> Les Atlas: It explicitly looks for the tonality in the music and uses it in how things are encoded.
>>: Instead of this [music], the implant sends a more complicated signal that allows the brain to decode information about pitch. [Music]
>> Les Atlas: And lo and behold, the result we get now, on the few people we have tested, is that they do get better music cues. They can hear, not perfectly but much better, the difference between musical instruments. The richness of their experience when they listen to music has increased.
>>: Atlas says the extra information also should help people with implants who need to understand highly tonal languages like Chinese and Vietnamese. Even with technical improvements, the experience of hearing the world through a cochlear implant will still be different, but Sam Swiller, who has had his implant for a decade now, says he's okay with that.
>>: All of our senses give us the ability to experience very different worlds, and even though we are walking side by side we are experiencing a very different street, or a very different day, or very different colors. So when we truly engage each other we get to experience a little bit of each other's world, and that's where I think real creativity happens.
>> Jon Hamilton, NPR News. [Music]
>> Les Atlas: So I'm going to stop there. I do have more slides, but it's mostly just math, a letdown from this story. I think I have gone on long enough and taken enough of your time, so I will just open it up to questions. Yes?
>>: So all that processing is digital, right? [indiscernible]
>> Les Atlas: It's digital. The only thing that's analog is the signal conditioning coming from the microphone and getting it ready for the A-to-D converter; after that, everything is done digitally. The three main companies, and there are some smaller companies, that make cochlear implants are using embedded processors, rather sophisticated systems doing digital processing. To make things run in real time you have to program in assembly code. It's not fun, but you can do it. It's not a workhorse processor, but it's all done digitally.
>>: [indiscernible].
>> Les Atlas: Well, for that carrier-modulator decomposition we have done no formal study of that question.
The only anecdotal thing is, for example, what you heard when we changed the pitch and flattened it.
>>: [indiscernible].
>> Les Atlas: But the remaining unvoiced parts of "bird population", like the p sound, and in "population" the "-tions" at the end, where there is an unvoiced, zzzz-like sound, all of that came through just fine. So whatever is happening with the complex envelope, it is made no worse.
>>: [indiscernible].
>> Les Atlas: It doesn't change anything.
>>: [indiscernible] the subband things. It actually shows that with the subband signals the frequency [indiscernible] shows up mostly in the high subbands with reasonably well defined envelopes. And once inside it probably doesn't matter. In fact, in some speech coders there is [indiscernible].
>> Les Atlas: Sure.
>>: And once you get the envelope [indiscernible] you can actually distinguish some kinds of musical sounds, even though you are faking the waveform. As long as the envelope is reasonably correct, a cymbal of some kind versus another kind, you can tell the difference between the two in the high part. To really reproduce a cymbal you still need some pitched thing, but [indiscernible] you can easily tell the difference even though the fine structure is totally random.
>> Les Atlas: And there's also a continuum between what you have asked about and the unvoiced sounds. For example, the toughest test we have been successful at is a musical timbre test, and that tests the same melody played by a cello, a piano, a trumpet, a saxophone: can you differentiate? There are 8 instruments, so chance is 12.5 percent correct, and a normal listener scores 80 to 100 percent correct on this task. Now with the cochlear implant, with our new algorithm, it's up to 87 percent correct. So we are in the normal range, but the takeaway related to your question is that the difference, when you look at it, is in the steady-state values of the various harmonics for the instruments. They are all playing the same melody, but it's also the startup transient of those harmonics. You know whether the string is plucked, whether it's bowed, whether it is someone blowing into a trumpet; that's the startup transient, and sometimes the ending transient. That is very important for the difference, and that difference they were able to do quite well on using our new coding. Yes?
>>: Are you familiar with how the Auto-Tune algorithms work?
>> Les Atlas: Yes I am.
>>: It seems to me like maybe what they are doing adds quite a bit of distortion, and what you are doing might be able to correct pitch better.
>> Les Atlas: No, they are using an autocorrelation-based estimator; yeah, I know what they are doing in Auto-Tune. There are some people who really don't like Auto-Tune musically, so there is a danger in going onto that turf right now, even if you make it sound better. But I agree with what you are saying. I do agree and I think that's a very good point. I just wouldn't want to be part of that battle right now, trying to make Auto-Tune sound better, because musically a lot of people are against it. But you are correct in what you asked. Yes?
>>: What does a person wearing an implant hear when he or she is speaking?
>> Les Atlas: Oh, that's a good question. If they are profoundly deaf, that is, their inner ear is not working at all, they hear nothing, because there is nothing to be heard.
But if there is some residual hearing, which a lot of people with implants have, what they will get is a bone-conducted version of their own voice, which we also hear with normal hearing. I am hearing a lot of my own voice in my right ear even though it's deaf, because I have bone growth that's acting like a plug in my ear, so I hear the bone-conducted sound on my deaf side. For people with residual hearing and a cochlear implant on top of it, one of the challenges we have in putting this into a discrete-time digital device is latency. If the sound coming through the device that's processing it produces the onset of a musical note, or the start of the p sound in speech, 40 milliseconds after they hear the direct bone-conducted version, it makes them nuts. It makes people crazy; they don't want to use it. So our latency is at most 6 to 7 milliseconds, and that's hard. So it's not just making it real time; it's real time with 6 to 7 milliseconds of latency, because if you hear the echoed version any later than that you start to get irritated, really irritated. So it's a tough problem because of that. Yes?
>>: I guess I have two questions. I might have missed the first part of your lecture. How did you make it real time, and also, since it wasn't periodic, did you ever experience anything [indiscernible]?
>> Les Atlas: So your first question is how did we make it real time. To do that I have got a couple of really motivated and good graduate students working with me who have worked very hard at assembly code. They realized that higher-level coding wasn't fast enough, it had too long a latency, and I bugged them and they kept at it. So the core part of this is now real time with low enough latency. It was just a lot of work in assembly code. And one manufacturer gave us a processor, we have the tools to use it, and we use their code for some of the functions, which really helped us. Your second question was what?
>>: Since it wasn't periodic, did you ever experience any [indiscernible] or leakage?
>> Les Atlas: So the fundamental frequency is roughly periodic over a small interval, but it's not exactly periodic because the pitch is changing; I would call it quasi-periodic. But the aliasing would come from the bandwidth of the signal. So anything you have heard of our stuff, or anything in the cochlear implant processor, starts with a lowpass filter to ensure it won't alias. Now, the modulation envelope can alias too, by the way. So this low-frequency envelope I talk about, which can be complex, can alias down. That's a whole other discussion. In the stuff you heard it probably was there, but you didn't hear the aliasing.
>>: Did you use any window or anything?
>> Les Atlas: As the signal slides across, depending on what we are doing, it could be rectangular, it could be a Hanning window, and sometimes we have used [indiscernible]; it depends on how much time we have and how careful we want to be. Yes?
>>: How is the device connected with the nerve?
>> Les Atlas: So, connecting with the nerve: there was a picture I showed of that spiral-shaped device. There is an electrode array that's got 22 metal sites on it, 22 electrodes, 22 fine wires that come out, and they don't come outside the skin. They go to a device that sits under the skin, and that device has a radio-frequency coil that picks up information, which is decoded internally.
That decodes the stimulation pattern, and there's a set of biphasic pulses, positive and negative charge, that are put out at a certain rate and in a certain proportion to the different electrodes. And how those pulses are modulated is what we changed. So it's electrical stimulation pulses that are good enough for speech if they are modulated the standard old non-negative way, the conventional way. Modulated our new way, they work much better for music, and for speech we are trying to show better performance for speech in noise; those results aren't ready yet. Yes?
>>: [indiscernible].
>> Les Atlas: It's not unrelated. So the first thing I started using when I was coding speech for a cochlear implant was linear prediction, to try to find the resonances of the vocal tract. We had four electrodes back then and we thought we would stimulate each electrode at the center frequency of a resonance. I used a linear predictive analysis; I used the Levinson-Durbin recursion to come up with the different resonant poles that represented the different resonances of the vocal tract. So I was trying to work like [indiscernible], and how well did it work? It didn't give us speech then. So it turns out that it's more like a channel vocoder. I only had four electrodes, and now they have 22, which gives better and more precise resolution in frequency. With 22 electrodes and a simple channel vocoder, that's what's used to get intelligible speech, but it doesn't get music, which is the problem we worked on. Now, Bell started with a channel vocoder with the phone, and it broke, and then the phone worked. So channel vocoders have been around a long time. Are there other questions?
>>: I want to hear your comments on [indiscernible]. So what's your comment about how this kind of knowledge that you gain about what's intelligible and what's not intelligible, part of the [indiscernible], might be able to help?
>> Les Atlas: Well, Rico brought up one example of going between the device and the server.
>>: [indiscernible].
>>: [indiscernible].
>> Les Atlas: That's it, because we are not going to be training some recognizer over days, or weeks, or months.
>>: [indiscernible].
>>: [indiscernible].
>> Les Atlas: And if you want to do better in that kind of approach, where you have got data and labels, loads of data and loads of labels, and you want to learn a little faster and do a little better, have me give a different talk on active learning or something. It's not related to this. The brain is doing deep learning already, so we are working with a deep learner, but what we have is something where we can't put the wrong input in anymore.
>>: [indiscernible].
>> Les Atlas: We haven't tried that; I can't claim it yet, but it's something to look at. That's a good comment.
>>: [indiscernible].
>> Les Atlas: It's quite close, yes.
>>: [indiscernible].
>> Les Atlas: You have to ask, to get that, at least the way we are talking about it now, to be able to get the complex estimate. We are far up the curve in statistical estimation, so we probably could work at something like +5 dB SNR to be able to do this, but still, in the presence of noise, when estimating this complex envelope the actual phase angle is going to be sensitive to that noise. But the correlation between the real and imaginary parts, which people haven't looked at, is important for what you heard and it is important for what we are doing here.
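As an aside, one way to write down what "correlation between the real and imaginary part" buys you; this is my own formulation of the statistic, assumed for illustration:

```latex
% For a zero-mean complex modulator m, keep both second-order statistics:
\[
  \sigma^2 \;=\; \mathbb{E}\!\left[\,|m|^2\,\right], \qquad
  \tilde{\sigma}^2 \;=\; \mathbb{E}\!\left[\,m^2\,\right]
  \;=\; \bigl(\operatorname{Var}[\Re m] - \operatorname{Var}[\Im m]\bigr)
        \;+\; 2\,\mathrm{j}\,\operatorname{Cov}[\Re m,\, \Im m].
\]
% A "round" (proper, circular) complex Gaussian has \tilde{\sigma}^2 = 0, meaning
% equal-variance, uncorrelated real and imaginary parts.  \tilde{\sigma}^2 \neq 0 is
% the ellipsoidal case, and that complementary term is the extra feature described here.
```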
And it's a little bit of extra information, because you usually assume the real and imaginary parts are just uncorrelated; under a Gaussian assumption, it's a Gaussian which is round, not ellipsoidal. I assume it's ellipsoidal, where there's a little correlation between the real and imaginary parts, and then you get something interesting, a new feature. That's a new feature that could be used. Maybe there's a [indiscernible] version of that feature that could be useful. I haven't tried that; that's an interesting thought.
>>: And finally, just from reading your first few slides, [indiscernible].
>> Les Atlas: Yes, by the rectifier. Why sigmoidal? There's nothing there about a sigmoid.
>>: [indiscernible].
>> Les Atlas: But you also noticed it wasn't a perfect rectifier. What I made sure of is that when the input is negative the output is not zero; the resting rate is slightly depressed when it's negative. So that might have a role, or it might just be there as an artifact. Are there other questions? Yes, you had one.
>>: So, thinking about the math you are talking about, and wanting to apply it to something totally different, namely switch-mode power supplies: depending upon whether they are hard switched or resonant, they could have a fixed frequency or a variable frequency, and you have got envelope modulation, and you are trying to understand how things propagate so you can do feedback gain analysis. I am thinking of a paper that was looking at the real and complex parts of the modulation and how that was something that had been overlooked by people who were just trying to work with the envelope.
>> Les Atlas: Yeah, so it's the very same thing, yes.
>>: And it yielded very good results.
>> Les Atlas: It's the same concept, and if you take a look at one of our recent papers, not on cochlear implants but on sonar, we extend coherence duration, and we are using exactly that. We are taking a look at the real and imaginary parts of a complex process to do that, and we are warping it in such a way that we can get longer windows and make higher-SNR estimates of things. So this might be related to the feedback you talk about; it could be very similar. If you look up Atlas@UW.edu and send me an e-mail, I will send you that paper, and I would like to see the reference you have.
>>: Sure, I would be glad to do that.
>> Les Atlas: I would be very curious. Yes?
>>: I am an IT guy, so forgive the lack of [indiscernible] of the depth of the subject. But when you are working with these implants, you mention that you are going after cues, like musical cues versus the actual music. So for someone who is hearing impaired, now they get this new information, the body interprets it and then they can use that information.
>> Les Atlas: We hope so, yes.
>>: Have you done anything with someone who is not hearing impaired, where you can actually implant it, turn off their inner ear and say, "Hey, does that sound the same?" I mean, is there any work being done like that?
>> Les Atlas: We can't do that, but you heard in the NPR piece the simulation of what someone who is a cochlear implant user hears. Now, how accurate is the piano at the end?
>>: Yeah, that's what I'm wondering. How accurate is that?
>> Les Atlas: The piano thing at the end is our best guess as to what people heard, and it does duplicate some things that are probably quite accurate: for instance, the interval between the piano frequencies is not preserved, but you can hear them change at least. And the piano sounds different than a trumpet.
We didn't play that simulation, but if I did, and if they played it on NPR, you would hear a piano, you would hear a cello, you would hear a violin, and they would all sound like the right instrument. The melody they are playing, the intervals from middle C up to D, up to F and so on, those intervals would be off. There is no reason they are going to be accurate, because the placement in frequency is unknown, but people can remap and kind of learn that, fortunately. And we now know they can get the cues that are useful for knowing which instrument it is. Some of the people we have worked with are former musicians, and when they hear the test they say, "We want this to take home," because even with speech, until someone gets what's called a clinical processor that they can wear home, they really don't learn speech with a cochlear implant; it takes a few weeks, or months, or sometimes just a few days. We expect the same with music. At first, music will sound foreign with this new method. They might even prefer their old processor for a while, but over time they are going to get more richness out of music. At least that's what our results indicate, and we won't know until next year or so, when we will be determining that with a take-home processor.
>>: Okay, thank you very much. [Applause]