>>: Good evening, gentlemen. Thank you all very much for coming to our lecture:
What's Hot in the Seattle Section. I would like to really thank you for being here tonight and
I hope you have a very fruitful evening. I want to especially thank Li Deng, who
organized this lecture, and Dr. [indiscernible], the sponsor of all the Seattle Section
events. So we are grateful you are all here. I would like to introduce Professor Atlas. He
is an IEEE Fellow and has been chair of many conferences, being a professor and doing
fantastic work which is very respected in the whole field of signal processing. Let's
welcome Professor Atlas.
[Applause]
>> Les Atlas: All right. Thank you all for coming. I have got a title here which is a little
bit ambitious, but if you think of what things will be like 100 years from now, you know
we are going to have fresh ideas. My claim is that it is up to us to generate them. This
work has been funded by a few places: the Office of Naval Research, the Army Research Office
and several other places, in particular the Coulter Foundation for some of the later work I
am going to be talking about.
And for those of you who have seen me speak in the last five years, you are going to see
some stuff that I start with, the background, which is vocoding speech, and the notion of
an envelope is an old idea. It's an old idea that was initially put together in a way that
was kind of [indiscernible] and was defined by the signal processing operations instead of
what was really in the signal. So I am going to talk about what’s okay with it, what’s
great about it and what’s wrong with it. Then I’m going to do some demonstrations and
these again are neat demos, but they are not that new.
But then I’m going to come up with potentially a new model and we are going to use it in
a way that’s going to impact for example what people hear when they are deaf and have a
cochlear implant. And over the next five years we are going to be rolling out a new
version of cochlear implant coding, a new software update, and I will be talking about why
that is, with reference to a recent NPR story from two weeks ago.
So the basic idea in an envelope, a signal envelope, is that for this one dimensional time signal,
this red line looks like an envelope. It's an upper bounding kind of line. I just drew it that
way in there. There are various ways of estimating this upper envelope. This is the
passage in time "bird populations" over 1.2 seconds. If you zoom in on that you see the red
part, which people call an envelope, just like in AM radio, and the blue part, which people
call temporal fine structure. They just call it that. It is defined by signal processing
operations. People have not argued about how it should be composed, but they have
shown it's useful and most people have agreed that it's useful.
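[Illustrative sketch, not part of the talk: the conventional envelope / temporal fine structure split being described, done with SciPy's Hilbert transform. The sample rate and the AM-like test tone are made-up placeholders.]
```python
import numpy as np
from scipy.signal import hilbert

def envelope_and_tfs(x):
    """Classic single-subband split: non-negative real envelope plus temporal fine structure."""
    analytic = hilbert(x)              # x + j * Hilbert{x}
    envelope = np.abs(analytic)        # the "red" part: non-negative and real by construction
    tfs = np.cos(np.angle(analytic))   # the "blue" part: unit-amplitude fine structure
    return envelope, tfs

fs = 16000
t = np.arange(0, 1.2, 1.0 / fs)
x = (1.0 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)  # AM-like test tone
env, tfs = envelope_and_tfs(x)   # env tracks 1 + 0.5*sin(2*pi*4*t); env * tfs roughly rebuilds x
```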
In 1939, and I quote something out of the Journal of the Acoustical Society of America from
Homer Dudley, in kind of old-time language, in so many words he says that speech
and other acoustic signals are actually low bandwidth processes that modulate higher
bandwidth carriers. Now why is that? Why does speech do that? Why does, for
example, music do that? Well, if you took a look at the nature of the information rate in
something like speech you would see it around 10 Hz, 50 Hz, and 100 Hz, something like
that. So if we used something like a Nyquist criterion based on the bit rate of speech, you
would then need to produce signals that are 50 Hz, 100 Hz, 200 Hz, and if we needed to
produce signals with that kind of low frequency content and launch them across the room
we would need heads that were maybe as tall as the ceiling, bigger than this ceiling, about
20 feet high, if you take a look at the size of the wavelength of sound. Our heads and vocal
tract would be so big they would just be impossible to carry around.
So the idea that evolution gave us, by trial and error I guess, to form speech was to
modulate higher frequency information, up to around 8 kHz or 10 kHz, which is the range
of our vocal tract and other associated articulators and our glottis and so on. So we don't
generate signals that are down at 40 Hz, or 20 Hz, or 4 Hz when we modulate speech.
This red envelope is going from about 2 Hz up to about 16 Hz or 20 Hz, and we don't generate
those frequencies. If you look with a spectrum analyzer you won't see them, but we
modulate stuff with those frequencies. Modulate means product, and I'm going to come
back to that notion of product pretty quick.
Now the question I raise in my own research, and this is what I have been raising for
maybe the last 10 or 15 years and it has finally come to fruition or worked out in
some areas, is that people have defined things this way because they were inspired by AM radio; that's what was
known when people started thinking about this concept. AM radio has a big fat carrier
and it has smaller sidebands that are information bearing. So the receiver can be cheap.
You waste a lot of carrier power, but the receivers are cheap, and you needed that in the '20s
and '30s and it still works now, but it's not necessarily optimal. People's thinking was
bounded by that view, and the envelope in AM radio is non-negative and real. Why?
Because that's the way it was and that's what worked back then. And when people think
of these modulations they still think non-negative and real. That's where things start to
fall apart and that's what I will be honing in on shortly.
What that really affects is what's left. If you assume that modulation means
multiplication and you are multiplying that blue stuff, which we will call temporal
fine structure, we will call it higher frequency information, that stuff up at 2 kHz to 8 kHz
or maybe 80 Hz to 8 kHz depending on how you define it, we will call that temporal fine
structure. That multiplies the red stuff, which is a modulating waveform, which is what we
are after. And that red stuff is non-negative real because people assume it that way.
That's not a mathematical fact, it's just what people have assumed, and in fact they have
assumed it in a couple of fairly classic papers. In 1994 they took a look at
smearing the envelope.
What happens if you take speech, you take the envelope, you preserve its temporal fine
structure, preserve the blue stuff so it's the same, you take the envelope and you smear it?
I am going to do an experiment like that and show you what's wrong with the usual
assumption. The second paper says to take the envelope from speech, combine it with
the fine structure from music, and slam them together. What do people hear, speech or music?
Now let's do the opposite: let's take the envelope from music and combine it with the
fine structure, the high-frequency information, from speech, slam them together, multiply
them by each other and see what you hear. That was the experiment that was in Letters to
Nature, the top journal, one of the highest impact journals there is.
What I am going to do is debunk these ideas, sorry. I'm going to debunk them right now.
What I am going to do is not necessarily combine speech and music; I'm just going to
start with speech. And I'm going to take speech and I'm going to take the envelope,
defined to be non-negative and real, as carefully as I can using state of the art technology,
a high enough sample rate so that everything engineering-wise is done correctly, but the
envelope is assumed to be non-negative and real. These pictures are spectrograms: time
versus frequency, a standard sliding short-time Fourier transform, then take its
magnitude and then dB into a color map. The top picture has no modifications. I am
going to play it for you so you hear what it is, just an unmodified signal, and all we are
doing is looking at a picture of it.
[Demo]
>> Les Atlas: So that’s Z, Z, K right there. Now the next thing I’m going to do is take
the envelope of the signal and I’m going to smear it with a lowpass filter, a quite extreme
lowpass filter with a corner at about 1.2 Hz/2 Hz. Let’s hear what happens when I do
that.
[Demo]
>> Les Atlas: There are actually three tokens in that example that I just played. I will
play it for you again, but I will warn you about what you are hearing. When you smear an
envelope what you should be doing is stretching these things out so it sounds like you are
taking someone's vocal cords and zzzzzz, like doing that to it. You kind of hear that a
little bit. You also hear a "buzzy" sound and you also hear an "echoey" sound. So let's
listen again: smearing, the zzzzz like that, "buzzy" and "echoey". Let me play it and then
I'm going to explain the "buzzy" and the "echoey" shortly.
[Demo]
>> Les Atlas: Okay, the buzziness is pretty obvious. It turns out when you look at this
mathematically, when you look at the spectrum closely, you are turning this signal into a
bunch of comb-like functions. We don't see it at this resolution on the spectrogram and I
won't go into that in detail. You can see that things are smeared. They are smeared in
time, zzzzzz, like that, and this is non-causal so it's smeared both ways in time. I can
do a causal version with the same kind of idea. And the echoing is right there, a little bit
there, a little bit there and a little there.
What's going on with the "echoiness"? The result of a lowpass filter of the envelope,
which is a rather severe lowpass, is not non-negative anymore. It's lowpass, but nothing
is forcing it to be non-negative. So the envelope goes negative; what do you then do?
You either switch the phase or you set it to 0. And it's switching the phase right here, it's
passing through 0. These actually are negative envelopes right in the middle, negative-going
envelopes right in the middle there. It's going to pass through 0 and it gets very
quiet. So that's the reason for that "echoey" sound.
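[Illustrative sketch, not part of the talk: the kind of envelope smearing just described. A gated tone stands in for one subband of speech, and the roughly 2 Hz Butterworth corner is only meant to mimic a "rather severe" non-causal lowpass, not Drullman's exact filter.]
```python
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

fs = 16000
t = np.arange(0, 1.0, 1.0 / fs)
gate = ((t > 0.1) & (t < 0.3)) | ((t > 0.5) & (t < 0.65))   # crude syllable-like on/off pattern
x = gate.astype(float) * np.sin(2 * np.pi * 500 * t)        # stand-in for one speech subband

analytic = hilbert(x)
env = np.abs(analytic)                      # conventional non-negative real envelope
tfs = np.cos(np.angle(analytic))            # conventional temporal fine structure

sos = butter(2, 2.0 / (fs / 2), output="sos")
smeared = sosfiltfilt(sos, env)             # zero-phase filter, so smeared both ways in time

print("min of smeared envelope:", smeared.min())  # nothing constrains this to stay >= 0; it can dip negative
# Clipping it back before resynthesis, y = np.maximum(smeared, 0) * tfs, makes the quiet "echoey" gaps.
```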
So we know that this kind of analysis has artifacts. In particular, the echoiness is due
to the assumption that the envelope is non-negative real. You can't then smear an envelope and
force it to stay non-negative. You can force it to stay real, but there are problems with that
too, like the buzziness. So, all forms of analysis that I will be talking about, even the
cochlear implant analysis, have some form of frequency analysis, a filterbank. Dividing a signal
into its short-time Fourier transform is a filterbank, and dividing it into a bank of filters that
are like the filters in a cochlea is one thing to do. So it could be an audiogram-like analysis like people
do in audiologists' laboratories, a fast Fourier transform as in a spectrogram, or it could be
a cochlear model. It could be a wavelet transform, except that might not be representable
this way, but it would be very close in terms of how it's represented.
And I'm not going to talk about the difference between these kinds of filterbanks. It's
not what I'm talking about; instead, what happens after the filterbank? What is the
first thing you do after a filterbank? In those pictures I showed you in the last slide we
took a dB magnitude into a color map, period. That's what we did, so you saw pictures,
which forced the underlying complex numbers of the short-time Fourier transform to be
real and non-negative. We did it, okay, that wasn't science, it wasn't our vocal
apparatus, it was us and our signal processing that did it.
So let's talk some about that decomposition. In particular, what people tend to do when
they do a decomposition is the AM radio example. AM radio would take a filterbank, and this
is what the cochlea does to first approximation also. It inputs the signal into a bunch of
filters, capital N of them. If you are talking about a cochlear implant, capital N might be 22 or 8.
If you are talking about something like a cochlea itself that's intact, a normal cochlea,
you are talking about a number like 1,500 to 30,000 overlapping filters. It's the same idea
in all cases. What happens after the filterbank? It's the AM radio model in the ear, but you have
got 30,000 of them. In a cochlear implant you have 8 or 22 of them, a much smaller
number. In a spectrogram you might have 1024 or 512. The first thing that happens is
some form of rectifier.
Okay, I actually have a base level that's a little below zero to represent what the ear does.
There is a suppression of the base level when things go negative. That's an instantaneous
non-linearity here. In other words, it's linear for inputs that go positive. If the input
goes negative it depresses the rate a little bit, and that's it. So it's a non-linear function
which is instantaneous. You know if you rectify a signal and then lowpass it you are
going to find an envelope, and that's an envelope follower. People say the ear is an
envelope follower, and I think that actually is a good approximation of what the ear does,
but remember the ear does it with 30,000 or so overlapping filterbank channels. Those are
all then pulled together by a stream of hundreds of thousands of interconnected neurons
with billions of interconnections before it gets to the brain itself, the auditory cortex.
So there's a lot going on with those 30,000 channels coming in.
So this is a model at the micro level of what happens in the ear. In terms of a macro
model it's not very good, but this has been used in MP3 through parts of AAC+ and other
things; what comes out uses a second transform of that, and we have worked on that
ourselves. I won't be talking about that, but I will talk about some past views. So this is
what I was talking about in the last slide. The simplest way to look at modulation or an
envelope is to take rectification and follow it by a lowpass filter, an AM radio detector.
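[Illustrative sketch, not part of the talk: a bank of N bandpass filters, each followed by a rectifier and a lowpass envelope follower, as just described. The band edges, filter orders, N = 8 and the 50 Hz smoothing cutoff are invented for illustration, not taken from any implant or auditory model.]
```python
import numpy as np
from scipy.signal import butter, sosfilt

def envelope_follower_bank(x, fs, edges, lp_cutoff=50.0):
    """Return an (N, len(x)) array of rectified-and-smoothed subband envelopes."""
    sos_lp = butter(2, lp_cutoff / (fs / 2), output="sos")
    envelopes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos_bp = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="bandpass", output="sos")
        subband = sosfilt(sos_bp, x)
        rectified = np.maximum(subband, 0.0)          # instantaneous nonlinearity (half-wave rectifier)
        envelopes.append(sosfilt(sos_lp, rectified))  # smooth it: the AM-radio-style envelope follower
    return np.vstack(envelopes)

fs = 16000
t = np.arange(0, 0.5, 1.0 / fs)
x = (1.0 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 300 * t)
edges = np.geomspace(100.0, 7000.0, 9)                # 8 roughly log-spaced bands
E = envelope_follower_bank(x, fs, edges)              # one non-negative envelope per channel
```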
Another model gives you the same results except you don't need the lowpass filter.
What it does is it splits the spectrum into positive and negative halves by finding an analytic
signal. It uses a Hilbert transform to do that: set all negative frequencies to 0 and keep
positive frequencies untouched. That means the signal is complex. It has a magnitude
and a phase.
So when people talk about the magnitude being an envelope, it is the same thing as that
lowpass envelope. The details have to do with how you implement the Hilbert transform
versus the lowpass filter. Other than that it's the same concept, but when you use this
kind of processing you also have, remember those blue lines at higher frequency that I
called temporal fine structure, you take this term, this phase angle, and take the angle
itself, or maybe differentiate it if you are brave, or take its cosine, and that's called
temporal fine structure, the Hilbert phase. It's a nasty thing to deal with, but that
Letters to Nature paper used it. When they combined the fine structure from music with the
envelope from speech and vice versa they were using that part, that phase, the cosine
actually.
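[Illustrative sketch, not part of the talk: a single-band version of the recombination he is describing, the envelope of one signal times the Hilbert fine structure of another. The published chimera experiments did this subband by subband; the two test signals here are placeholders.]
```python
import numpy as np
from scipy.signal import hilbert

def chimera(envelope_source, fine_structure_source):
    """Non-negative envelope of one signal multiplied by the Hilbert-phase cosine of another."""
    env = np.abs(hilbert(envelope_source))
    tfs = np.cos(np.angle(hilbert(fine_structure_source)))
    return env * tfs

fs = 16000
t = np.arange(0, 1.0, 1.0 / fs)
x_a = (1.0 + np.sin(2 * np.pi * 3 * t)) * np.sin(2 * np.pi * 220 * t)   # placeholder "speech" band
x_b = np.sin(2 * np.pi * 440 * t)                                       # placeholder "music" band
y = chimera(x_a, x_b)   # envelope from x_a, fine structure from x_b, "slammed together"
```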
These approaches are often called "vocoding", and what's implicitly assumed? There it
is, what is everyone assuming when they are doing this? They are assuming a product.
There are capital N different filterbank channels; in the ear capital N is around 30,000, in a cochlear
implant it's 8 or 22, and in a standard speech experiment it might be 4, it might be 16, it
might be 1024, some number that the designer chooses. Then you have a modulating
envelope which is a function of whichever subband you are in. This is what happens
out there; it's not the processing, it's the model. You assume a model in which an
envelope multiplies a carrier, a higher frequency carrier. That's what people
are implicitly assuming.
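[In symbols, the implicit product model he is pointing at, with notation added here: a sum over N subbands of a modulator times a carrier, where the conventional extra assumption is that each modulator is non-negative and real.]
```latex
s(t) = \sum_{n=1}^{N} m_n(t)\, c_n(t),
\qquad \text{conventional assumption: } m_n(t) \in \mathbb{R},\ m_n(t) \ge 0 .
```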
Now it's the constraints that you put on M and the constraints you put on C that matter a
whole lot. And why are people assuming that M is non-negative and real? Well, because
of AM radio, that's the reason why. And of course we are going to deviate from that. In
fact, if you just start with this equation right here: back in 1963/1964 Alan Oppenheim,
before he did deconvolution, did demultiplication, homomorphic demultiplication. I
worked with that; there was a paper with Tom Stockham in 1963 and with Alan Oppenheim in
1964, Stockham and Al Oppenheim. They did homomorphic demultiplication before they
did homomorphic deconvolution, and it was so hard to do. It was really hard to do,
because everything is passing through 0 and going negative, and you have got to take a log.
For all kinds of reasons it was too hard to do.
So they instead applied deconvolution and that worked pretty well. That's what his PhD
dissertation was on. He pretty much started his career based on that, but in that case he
gave up on the demultiplication, and it's a hard problem because it is ill-posed. Okay,
ill-posed doesn't mean it's not solvable; it means you could use some optimization
technique, and other people have done that. Malcolm Slaney with Greg Sell, colleagues of
mine, and a few other people, like at Cambridge, have done this using an
optimization technique where they put down constraints. But fundamentally you can't
find a unique M and C given S. You know 6 is equal to 2 x 3, but 6 is equal to 6 x 1
also, or 3 x 2. So there isn't a unique solution for M and C given S; that's a fundamental
problem.
>>: Do you mind if I ask a question?
>> Les Atlas: Please.
>>: So what if you had some assumptions on the C's? So for example, if the C's are roughly
[indiscernible] or close to narrowband, and if you are willing to do synchronous
demodulation, then you don't need to constrain the modulation to be positive, right? The
modulation could be negative [indiscernible].
>> Les Atlas: In fact it will be complex.
>>: Right, it will be complex.
>> Les Atlas: Yes, in general it will be.
>>: I think what you are saying is that because we keep thinking about AM and simple
rectifying, that's what motivates us, but there's no [indiscernible] to actually restrict that.
>> Les Atlas: People have defined the decomposition based on how it was implemented
in the first place, which is an AM radio model, and they have stuck with that. That's all I
am arguing: get out of that box a little bit. So what you are saying is to constrain the C to be
narrowband. It's very close to the constraint I'm going to make, okay. I like that
constraint, as you are going to see.
So how significant is this problem? We have written about this, because the first thing
that we did was work on homomorphic modulation spectra. We used what Stockham and
Oppenheim tried to do; we modernized it a little bit. What we found when we did that,
and this is a theoretical proof, is that in general this constrained problem is not
one where the modulator has to be non-negative and real. Under very general
assumptions the modulator in fact is complex. It has to do with spectral
symmetries about band midpoints: make a narrowband assumption about the carrier,
then ask what the symmetry of what remains around it is. There
is nothing that forces it to be Hermitian symmetric in frequency. That's really what we
showed in general.
And for a product model a coherent and band-limited carrier is needed. So you have to
estimate narrowband harmonics. Not a single harmonic, but narrowband harmonics,
hopefully with at most one falling in every frequency subband, which is not easy to do.
Estimate the harmonics, for example, of speech, which requires that you know the pitch of
the speech; you assume it's voiced speech and not unvoiced. Tracking all those
harmonics in real time is what we are going to have to do in the cochlear implant, and I'm
doing that right now. What's left is now your modulator, and that modulator in general is
complex. Now that you have a complex modulator you don't have to keep it complex,
but you at least have information that's not distorted by doing that, and you can do things
to it like take its real part or other things like that.
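[Illustrative sketch, not part of the talk and not the group's actual estimator: coherent demodulation against a carrier constrained to follow a pitch track. The constant 210 Hz pitch track, the 60 Hz modulator bandwidth and the two-harmonic test signal are placeholders; a real system estimates and tracks the pitch from the speech itself.]
```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def complex_modulator(x, fs, f0_track, k, mod_bw=60.0):
    """Complex modulator of harmonic k of x, given a pitch track f0_track in Hz."""
    phase = 2 * np.pi * k * np.cumsum(f0_track) / fs    # integrate k*f0(t) to get the carrier phase
    baseband = x * np.exp(-1j * phase)                  # shift harmonic k down to around DC
    sos = butter(4, mod_bw / (fs / 2), output="sos")
    # Lowpass the real and imaginary parts; the result is complex, not forced to be non-negative.
    m = sosfiltfilt(sos, baseband.real) + 1j * sosfiltfilt(sos, baseband.imag)
    return 2.0 * m                                      # factor of 2: a real unit cosine gives a unit modulator

fs = 16000
t = np.arange(0, 1.0, 1.0 / fs)
f0 = np.full(t.shape, 210.0)                            # placeholder pitch track (roughly a female speaker)
x = np.cos(2 * np.pi * 210 * t) + 0.5 * np.cos(2 * np.pi * 420 * t)
m1 = complex_modulator(x, fs, f0, k=1)                  # modulator of the fundamental, about 1
m2 = complex_modulator(x, fs, f0, k=2)                  # modulator of the first harmonic, about 0.5
```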
I'm going to give you a demonstration of what I mean by distortion free. So what
Drullman did was highly distorting for this filtering system. It did not match the
modified magnitude to a consistent phase. It wasn't just that you were forcing things to go
negative; the phase you end up with and the magnitude do not match each other. You
may iterate on that and make it sound a little better, but the more modification you make the
worse it gets. The Smith paper, instead of using a rectifier and lowpass filter, used the
Hilbert phase, used the Hilbert transform to find an analytic signal. That seemed more
sophisticated, with fancier sounding terminology and a little more complicated code, but the
underlying concept of a non-negative real envelope was still there and still assumed. And
both are similar, along with most others, and they have the same severe problems with their
definition of what's left, that temporal fine structure.
So what's wrong with the usual vocoding? Actually, the first person to mention it
was not Homer Dudley. If you take a look at Helmholtz, the person who discovered the
fact that the ear does a Fourier transform and wrote a book about it, if you look at the
English translation of Helmholtz's book he talks about this same problem. He says,
"Beware: if you have two tones beating with each other the envelope goes negative." I
read that and said, "Oh my god, people missed this because they assumed AM, where the
envelope stays only positive." So Helmholtz predicted there would be a problem, and a
few other people over the years did. Lotfi Zadeh did, Tom Kailath did and a few others
like that.
So I can show you an example right now. These are now subbands that are spaced like
an auditory filterbank. This is not a representation of the auditory system, because it's not
dense enough in frequency sampling, but it's close to what the ear does in terms of
spacing and close in bandwidth. What's significant is that's a single speech input and
you can see how it's broken up: the unvoiced portion of speech corresponds to higher
frequencies. I think this is the word "she", and the "sh" sound is up at higher frequencies.
At lower frequencies you lose it, and the "ee" sound shows up at lower frequencies, from 100
Hz up to 1250 Hz, and then drops off at higher frequencies.
So the voiced part is lower frequency and the unvoiced "sh" part is up at higher frequency.
That's all well known, but look at the red part. I didn't draw that in this time; that's
actually found by a Hilbert envelope, the red part in this one. This is the first hint that
there's a problem. Let's zoom in on this part right here. Let's take just this portion of the
voiced part of speech and let's zoom in on it. There it is, and there's the envelope found
by the Hilbert envelope, non-negative real, and you can see what's happening, and that's
not what Helmholtz would have said should happen. That's not what happens if you
have two tones beating with each other.
What happens if you have two tones beating with each other is that it goes positive, then
negative, positive, then negative; just work out the math. If you work out the
math the envelope is going to do this. There it is, right out of Helmholtz, 1864. So if
you assume the red is the correct solution, it's wrong; it doesn't match what Helmholtz
said things should do. It doesn't match a model of two tones that are beating with each
other. It matches two tones beating with each other with a big fat carrier in between, the
AM radio model, but that's not what's going on in these subbands, and this is natural
speech. This is not anything synthetic. So real speech should go negative in terms of its
envelope.
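[The two-tone math he is referring to, written out here with the standard sum-to-product identity: the natural modulator is a cosine at half the difference frequency, which swings negative; the conventional Hilbert magnitude is its full-wave-rectified version.]
```latex
\cos(\omega_1 t) + \cos(\omega_2 t)
  = \underbrace{2\cos\!\left(\tfrac{\omega_1 - \omega_2}{2}\,t\right)}_{\text{modulator: goes negative}}
    \cdot
    \underbrace{\cos\!\left(\tfrac{\omega_1 + \omega_2}{2}\,t\right)}_{\text{carrier at the band center}}
```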
So let's make use of this and come up with a representation that's distortion free in terms
of modifying the modulations. What I am going to do is use a more careful way of pulling
the fine structure carrier and the envelopes apart. I'm going to do that with speech: I'm
going to pull it apart and then I'm going to put it back together piece by piece, harmonic
by harmonic, build it up and then hear what happens. Then I'm going to go ahead and
modify the speech by modifying the envelope and the fine structure, that carrier stuff,
independently, and we will hear what happens when you do it right. Then I will go back to
the original assumption, which is a non-negative real envelope, and do the same operation;
of course it won't sound as good, not even close. Then we will finish with some applications.
So let’s take a look at just some speech I am going to play with, unprocessed speech to
begin with.
[Demo]
Okay, this is a spectrogram where this dimension is time and this dimension is frequency. You
can see clear harmonics in this portion. I've only got a range of 3 kHz so you can see the
harmonics clearly. There is frequency content that goes up higher than 3 kHz in this
signal you heard, but I'm really stressing the lower frequencies. I've got no pre-emphasis
to bring out the high frequencies, and I'm only doing this so you can see how clear these
harmonics are in this high SNR example.
>>: Doesn't this create a frequency just because [indiscernible], the way we do the
sampling, that really our voice has that kind of discrete performance?
>> Les Atlas: Yes, the nature of our speech is such that it is a harmonic series, because
it's almost periodic. So this for example is "bird populations", "bird" right there. That's got
a periodic signal, almost exactly periodic. I feel vibrations in my glottis when I say
"bird", the "ird" sound. And the pitch of that "ir" sound in that female speaker is about 210 Hz,
but it is not exactly flat, it's got a tilt to it. Then it moves down here and it moves
down there. So it's not exactly periodic, but it's almost periodic. So it's almost a flat
line, and I think we'll force it to be periodic later. You will hear what happens. And
because it's not a pure sinusoid, that "ir" sound has got a first overtone, a second, a
third, or partials if this were music. You've got all those harmonics that are tilted at
increasing slope as a function of which harmonic it is. So voiced speech is like this.
Musical instruments that are sonorous instruments are like this also.
Yes?
>>: [indiscernible].
>> Les Atlas: I can give you one quick answer that won't be the whole story, but the
fundamental frequency, which is this frequency right here, that's a female speaker; if it
was a male speaker that would drop down to about two-thirds as high. If it was a child
speaker it could be a little bit or a lot higher, quite a bit higher for some children who are
really young. So the fundamental frequency, that lowest line, its height is a function of
whether it's male or female, and a bass male would be really low, but there are so many other
attributes. This is only a small part of it.
In fact, let's go ahead and estimate what that pitch is, the fundamental
frequency. If we do that, it's a step toward being able to pull out the envelope in a correct way, using a product
model more carefully. And before I start into this I'm going to make a narrowband
assumption like Rico Malvar suggested, and that narrowband assumption is going to be
severe, because I'm going to say the signal has no harmonics, or only has a fundamental
frequency, just to start. I'm going to estimate this fundamental frequency. I shouldn't have said this about 6, 7, 8 years ago when we first were doing this,
because now we've got to do it in real time and accurately, and at +5 dB SNR for the
cochlear implant that is hard. We are getting there, but it's hard. So it turns out this is a
useful thing to do. I never thought estimating pitch, which is what this is, would come back to
haunt me, but when you do it, and you do it in high SNR, we had a system that did it well, and
this is what it sounds like for the same passage.
[Demo]
What you just heard in the last slide, the "bird populations" passage, if you only listen to the pitch.
[Demo]
That's it, the fundamental frequency only. So now what I'm going to do, because I have the
fundamental frequency only, is I have a constraint on that product form. Given that
constraint I can start stripping off the terms M sub n of t. I am going to start with
the modulator for the fundamental frequency. There it is, the first
modulated component, the same pitch track, but now I am modulating it. Because I
found what the pitch is, I solve that equation by constraining the carrier to follow the pitch
from the last slide and see what modulator is left in this low frequency range that
corresponds to the fundamental frequency. So let me play this; this is now modulated
with a complex modulator, not a real non-negative signal, it's a complex signal.
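[Illustrative sketch, not part of the talk: the harmonic-by-harmonic build-up being demonstrated, in terms of the hypothetical complex_modulator() sketched earlier. Each harmonic's complex modulator rides on a carrier whose phase is the integral of k times the pitch track.]
```python
import numpy as np

def resynthesize(modulators, fs, f0_track):
    """Sum of Re{ m_k(t) * exp(j * 2*pi * k * integral of f0) } over the K harmonics given."""
    phase = 2 * np.pi * np.cumsum(f0_track) / fs
    y = np.zeros(len(f0_track))
    for k, m_k in enumerate(modulators, start=1):
        y += np.real(m_k * np.exp(1j * k * phase))
    return y

# Building it up as in the demos (x, fs, f0 as in the earlier complex_modulator sketch):
# mods = [complex_modulator(x, fs, f0, k) for k in range(1, 9)]
# y1 = resynthesize(mods[:1], fs, f0)                        # fundamental only, modulated
# y8 = resynthesize(mods, fs, f0)                            # "first eight harmonics"
# y_carriers = resynthesize([np.ones(len(f0))] * 8, fs, f0)  # all modulators set to 1
# y_flat = resynthesize(mods, fs, np.full(len(f0), 200.0))   # same modulators, flat 200 Hz pitch
```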
[Demo]
Okay, and that's a little better than the last one. It's getting closer; let's keep going, and
let's do it with 2: the first and the second, the fundamental frequency and its first harmonic,
both being independently modulated by the residual that resulted from that
constraint.
[Demo]
Okay, let’s put in the next one, fundamental and two harmonics.
[Demo]
It’s getting closer. I am pretty deaf so that sounds almost perfect to me because I have no
high frequency hearing, but let’s keep going for people out there who have normal
hearing, four modulated components.
[Demo]
It's starting to get more intelligible. We are starting to get there. Let's do eight of them;
this is not the full bandwidth of the speech. This is not the speech itself. This is just the
first eight harmonics. If your hearing is good it is going to sound kind of like it is lowpass
filtered, but if your hearing is like mine it will sound just like the original. Let me
play it for you and then we are going to have some fun.
[Demo]
That's all synthesized from the fundamental that I estimated in the first slide, put together
with complex envelopes. We are going to get to the old way of doing this pretty soon.
You will hear the distortion, but before I do that I'm going to do something funny. I am
going to take out all the modulation. I'm going to just leave the carrier and its
harmonics, that first thing I estimated, keep all its harmonics and set all the
modulators to 1: purely real, non-negative. This is not my starting argument; I'm just
having some fun here. Okay, let's listen to this.
[Demo]
Totally unintelligible, okay, you need that modulation. People claim the modulation isn't
important, but hey, this is not intelligible, okay, so you really need to have that. Even if you
knew what it was, and you did know what it was, it's still not intelligible. So if you do an
arbitrary test on someone they are not going to get this. Now let's have some other fun.
Instead of removing the modulators, let's keep the modulators and take the fundamental
frequency carrier and its harmonics and make them all flat. Let's give it a pitch
of 200 Hz, exactly 200 Hz. It's now exactly periodic in terms of its carrier. That
temporal fine structure is just flat, a horizontal line. And I'm going to keep the
modulators that are complex from the original speech, and we are going to keep 8
harmonics, and I'm going to play that. So this fundamental frequency is not estimated
from the speech. The estimated pitch was used to find the modulators, but it's not being used
for this resynthesis now; it's modulators only.
[Demo]
Totally intelligible, clear as a bell, but the pitch is flat, an affectless pitch, because we put
in a flat pitch. Now a little more stuff. Let's go ahead and try the same experiment I just
tried for you, flat pitch, modify the pitch, but let's assume that our modulator is non-negative
and real, the old assumption, and let's throw it all back together and see what we
get: the old way of doing it, like in that Letters to Nature paper. This is what we get,
modulation is 1, same color map as the last slide.
[Demo]
It's almost totally distorted. So there are a lot of other things you can play with. We
actually have a toolbox you can download and goof around with on your own, a MATLAB
toolbox that is. So being able to pull apart signals like this the old way is not necessarily that
new a form of vocoding; it's what people were doing when they were talking about
modulators. They were doing this, it didn't sound good, and my claim is you do
something like this new way and it sounds distortion free because the assumption is correct.
So where are we at currently? It has been used by others: Steve Schimmel, who is now
working at Apple; Won et al. in Hearing Research; Nie, ICASSP 2008; Sell, working with
[indiscernible]; Li, a JASA paper; Imennov; Shaft has a JASA paper; and there are various
other people. This isn't a complete list by any means, but there is a statistical variant of
all this which has been applied to reverberation reduction, and this has just been accepted
for publication. I don't know if it's available on our web page yet, but it should be
shortly, because it has been accepted for the Transactions on Audio, Speech and Language
Processing. So there's analysis to show removal of reverberation, and it also began to work
with -20 dB SNR speech. Okay, let me say that again: -20 dB SNR speech with 2 sensors,
2 microphones, -20 dB SNR, where the noise is very different in statistics. We take
advantage of that.
So the statistical variant of what I am talking about, of complex modulators, is something
I'll end with in my last couple of slides. We will come back to that.
>>: Do you mind if I ask a question?
>> Les Atlas: Sure.
>>: Do you mind just putting the last slide just briefly.
>> Les Atlas: Go back by one?
>>: Go back by one, the one where you had the constant pitch, but the correct one. Not
this one, the one before.
>> Les Atlas: This one.
>>: This one yes, it sounded pretty flat, which means that for applications such as
recognition it might actually work pretty well. A recognizer might recognize that as well.
>> Les Atlas: Not a tonal language.
>>: Right, but so if that is true now I am thinking if I am building a system where I have
to pick up an audio signal and send to a recognizer somewhere over there and bring it
back, could I encode your modulation signals at a lower bit rate than I actually encode the
signals that I’m sending to the speech recognizer?
>> Les Atlas: If you have a good way to encode these, they are now complex envelopes,
if you have a good way to encode complex numbers and my last couple of slides might
give you some hints along those lines.
>>: Because they must have some smoothness here otherwise they wouldn’t be good
envelopes. They can’t be erratic, they have to have some smoothness that a
[indiscernible] might be able to capture and not have that high of bit rate to send to the
cloud. You might lose the pitch, but it doesn’t seem like it matters.
>> Les Atlas: For non-tonal languages yes.
>>: For non-tonal languages right or I could send the pitch, because that pitch is also
there.
>> Les Atlas: Okay, so they are pulled apart and you can throw them back together and
get the same information, and it turns out, as you will see in my last couple of slides where I give
just a little math, that the correlation between the real and imaginary parts is the innovative
stuff in that signal.
>>: [inaudible]. But you haven’t done any work on actually trying to build a
compression system out of this thing right?
>> Les Atlas: No.
>>: Okay.
>> Les Atlas: Doing it the way I will be getting to in my last couple of slides it might be
doable. That's something to think about; we haven't thought about that for a while
because we hadn't used the statistical method you are about to see shortly. So there
might be some value there. We have been working on cochlear implants, but maybe the
next thing is along those lines. That's a very interesting idea.
So let's get to the cochlear implant, and the thing is I want to play for you what was on
NPR two weeks ago, because it was kind of exciting. My mailbox has blown up since
then. This is the problem I had worked on 31 to 35 years ago, which was to get speech
into non-functioning ears that still had an intact cochlea, where the cochlea was not
transducing sound to neural impulses. The inner ear is what turns sound into neural
impulses. The middle ear is like an acoustic impedance match, and the outer ear is giving
you the kind of acoustic imprint that helps you directionalize and capture sound.
And what happens when someone's inner ear doesn't work? 350,000 to 400,000 people have
this now. It has taken off like crazy; as soon as I got out of the area, suddenly it worked.
Speech through a cochlear implant is a successful technology. I didn't believe it, because
I had worked on it for four or five years, had what I thought were really good ideas, and
we couldn't get open set speech working. If you added cues, yes, it helped lip
reading, but why do you need to do a surgery to help lip reading? The question is whether
someone can talk on the phone. So someone did the ultimate experiment with me.
I got a phone call from a guy who asked why I didn't want to work on cochlear implants
anymore. I said, "Because they don't work well." I was so disappointed, and I had spent
years working on them. We thought our technology was so good, and people couldn't get
open set speech. Then what the guy said floored me. He said, "I have a cochlear
implant." Here I am talking to him on the phone. That's all it took and I said, "Oh,
you've got a cochlear implant, you are talking to me on the phone," and he said, "Yes." He
didn't have to say, "What," or any of that stuff. He understood me like a normal listener
on the phone. I said, "Okay, where do you have problems?" He said, "Music," and he
said, "Make it work for music, please make it work for music." I said, "Okay," and that
was a few years ago. So that's what we are going to get into now, making it work for
music, which is going to make use of these ideas.
So what I am going to play for you: the publication on this, and I'm not going to
go into all the details, appeared in July 2013. The basic idea
has been refined since. Let me bring up a web browser separately so I can get the NPR
show going. The NPR piece is about 5 minutes long, but it's very informative because it
really gives the human element of the story. I might skip some other stuff, just because
this NPR story is superb. The reporter did a great job. They had me in the studio for a
couple of hours and they used a few seconds of me talking, a few sentences, but that's
how it always works.
>>: Today in Your Health we are going to explore what it's like to experience music
through a cochlear implant. To people who depend on these surgically implanted hearing
devices, a song like this [demo] may sound more like this [demo]. NPR's John
Hamilton reports on a man whose love of music has persisted through hearing loss and
cochlear implantation.
>>: By the time Sam Swiller turned one even loud noise had become faint. Hearing aids
helped, but Swiller was living in a different world when it came to perceiving sounds.
>>: The earliest memory I have [indiscernible] is a family picnic around the time I was
four or five, really feeling isolated and separated from everyone, even though it was a
picnic filled with family friends and young kids running around.
>>: Swiller says even with hearing aids he was understanding maybe one word in three.
He relied on lip reading and creativity to get by. And he took refuge in music, loud
music that cranked up the sounds he heard best, drums and bass.
>>: Isolation was kind of a common theme in my childhood and so Nirvana kind of
spoke to me in a way.
[Music]
>>: But Swiller had no idea what the lyrics were saying until MTV started closed
captioning its music videos. He says it didn't matter.
>>: I just love music, not just the sound of music but the whole theory of music, the
energy that’s created, the connection between a band and the audience and just the whole
idea of rocking out.
>>: Swiller kept rocking through high school and college. Then in his late twenties the
hearing he had left pretty much vanished. So in 2005 he swapped his hearing aids for a
cochlear implant. One part of the device is implanted under the skin behind the ear; it
receives information from a microphone and processor that are worn like a traditional
hearing aid. When a doctor switched it on for the first time Swiller wasn't prepared for
the way people would sound.
>>: I remember sitting in a room and thinking everyone was a digital version of themselves.
>>: The voices seemed artificial, high pitched and harsh. He couldn’t figure out what
people were saying.
>>: You are basically remapping the audio world and so your brain is understanding,
“Okay, I understand this is a language, but I need to figure out how to interpret this
language.”
>>: Which he did over the next few months and he started listening to music again, but
through the implant Nirvana was less appealing.
>>: So I was kind of getting pushed away from sounds that I used to love. I was also
being more attracted to sounds that I never really appreciated before.
[Music]
>>: I started to like a lot more folk music and a lot more vocals. So like Bjork is a good
example.
>>: A cochlear implant isn’t just a fancy hearing aid. Jessica Phillips-Silver, a
neuroscience researcher at Georgetown University says the devices work in very different
ways.
>>: The hearing aid is really just an amplifier. The cochlear implant is actually
bypassing the damaged part of the ear and delivering electrical impulses directly to the
auditory nerve.
>>: The brain has to learn how to interpret these impulses and every brain does that a bit
differently. Phillips-Silver says another factor is that implants simply ignore some of the
sound information in music. She showed how this affects listening in an experiment that
involved people with cochlear implants responding to some dance music.
[Music]
>>: It was called Suavemente by Elvis Crespo. It's one that's heard commonly in the
clubs and it gets people going.
>>: At first people with implants responded just like other people. They began moving in
time with the music, but which sounds were they responding to?
>>: There is a lot going on. There are a lot of different instrument sounds, there is a
vocal line, and there is a great range of frequencies. It’s fairly intricate music.
>>: So Phillips-Silver had participants listen to several stripped down versions of
Suavemente.
[Music]
When the volunteers heard this drum tone version they had no trouble keeping the beat,
but when they heard this piano version they had a lot of trouble.
[Music]
Phillips-Silver says that’s because what a cochlear implant does really well is transmit the
information needed to understand speech.
>>: Where it is somewhat lacking is more in relating information about pitch and timbre.
So for example being able to tell the difference between notes that are close together on a
keyboard or being able to tell the difference between two similar instruments.
>>: Other scientists are trying to make those things possible. Les Atlas is a professor of
electrical engineering at the University of Washington.
>> Les Atlas: There is no easy way to encode pitch as an electrical stimulation pattern;
that's problem number one. Problem number two is to be able to do that in real time, that
means as the music is coming out, which is a difficult problem.
>>: As a result even the newest cochlear implant provides very little information about
pitch. Take a simple tune on a piano.
[Music]
Through an implant it may sound more like this simulation.
[Music]
So Atlas and other researchers are working on software that allows implants to convey
more information.
>> Les Atlas: It explicitly looks for the tonality in the music and uses it in how things are
encoded.
>>: Instead of this [music] the implant sends a more complicated signal that allows the
brain to decode information about pitch.
[Music]
>> Les Atlas: And lo and behold, the results that we get now on the few people we have
tested is that they do get better music cues. They can hear, not perfectly, but much
better, the difference between musical instruments. The richness of their experience
when they listen to music has increased.
>>: Atlas says the extra information also should help people with implants who need to
understand highly tonal languages like Chinese and Vietnamese. Even with technical
improvements the experience of hearing the world through a cochlear implant will still be
different, but Sam Swiller, who's had his implant for a decade now, says he's okay with
that.
>>: All of our senses give us the ability to experience very different worlds and even
though we are walking side by side we are experiencing a very different street, or very
different day, or very different colors. So when we truly engage each other we get to
experience a little bit of each other’s world and that’s where I think real creativity
happens.
>> John Hamilton, NPR news.
[Music]
>> Les Atlas: So I’m going to stop there. I do have more slides, but it’s mostly just math,
a let down from this story. So I think I have gone on long enough, taken enough of your
time. So I will just open it up to questions.
Yes?
>>: So all those processors are not digitized, right? [indiscernible]
>> Les Atlas: It's digital. The only thing that's analog is the signal conditioning
coming through the microphone and getting it ready for the A to D converter; after that
everything is done digitally. The three main companies, and there are some smaller
companies, that make cochlear implants are using embedded processors. They are rather
sophisticated systems that are doing digital processing, and to make things run in real time you
have to program in assembly code. It's not fun, but you can do it. It's not a workhorse
processor, but it's all done digitally.
>>: [indiscernible].
>> Les Atlas: Well, for that carrier-modulator decomposition we have done no formal study
of that question. The only anecdotal thing is, for example, what you heard when we
changed the pitch and flattened it.
>>: [indiscernible].
>> Les Atlas: But the remaining unvoiced part of "bird populations", like the "p" sound and the
"tions" at the end of "population", there is an unvoiced, "zzzz"-like part there. All of that
came through just fine. So whatever is happening with the complex envelope, it is made no
worse.
>>: [indiscernible].
>> Les Atlas: It doesn’t change anything.
>>: [indiscernible] the subband things, it actually shows that with the subband signals the
frequency [indiscernible] they show up mostly in high subbands with reasonably well
defined envelopes. And once inside it probably doesn't matter. In fact in some speech coders
there is [indiscernible].
>> Les Atlas: Sure.
>>: And once you get the envelope [indiscernible] you can actually distinguish some kinds
of musical sounds, even though you are actually changing the waveform. But as long as the
envelope is reasonably correct, a cymbal of one kind or another kind, you can tell the
difference between the two on the high part. To really reproduce a cymbal you still need
some pitched thing, but [indiscernible] you can easily tell the difference even though the
content is totally random.
>> Les Atlas: And there's also a continuum between what you have asked and the
unvoiced sounds. So for example, the toughest test we have been successful at is a musical
timbre test, and that tests the same melody played by a cello, a piano, a trumpet, a
saxophone: can you differentiate? There are 8 instruments, so chance is 12.5
percent correct, and a normal listener does 80 to 100 percent correct on this task. Now
with the cochlear implant, with our new algorithm, it's up to 87 percent correct. So we
are in the normal range, but the takeaway from that, related to your question, is that the
difference, when you look at it, is the steady state value of the various harmonics for the
instruments. They are all playing the same melody, but it's also the startup transient for
those harmonics. You know whether the string is plucked, whether it's bowed, whether
it is someone blowing into a trumpet; that's a startup transient, and sometimes there's an ending
transient. And it's very important for that difference. So that difference they were able
to do quite well on using our new coding.
Yes?
>>: Are you familiar with how the auto tune algorithms work?
>> Les Atlas: Yes I am.
>>: It seems to me like maybe what they are doing adds quite a bit of distortion, and
what you are doing might be able to, like, correct pitch better.
>> Les Atlas: No, they are using an autocorrelation-based estimator. Yeah, I know what
they are doing in auto tune. There are some people who really don't like auto tune
musically. So there is a danger in going into that turf right now, even if you make it
sound better, but I agree with what you are saying. I do agree and I think that's a very
good point. I just wouldn't want to be part of that battle right now, trying to make auto
tune sound better, because musically people are against it, a lot of people are. But you
are correct in what you asked.
Yes?
>>: What does the person wearing an implant, what does he or she hear when she’s
speaking?
>> Les Atlas: Oh, that's a good question. If they are profoundly deaf, that is their
inner ear is not working at all, they hear nothing, because there is nothing to be heard.
But if there is some residual hearing, which a lot of people with implants have, what they
will get is a bone conduction version of their voice, which we all hear with normal
listening. Okay, I am hearing a lot of it in my right ear even though it's deaf, because I
have bone growth that's acting like a plug in my ear. So I hear the bone conducted version on
my deaf side, but for people with residual hearing with a cochlear implant on top, one of the
challenges we have in putting this in a discrete-time digital device is latency.
Because if the sound they hear through the device that's processing it produces the start of
a musical instrument, or the start of the "p" sound in speech, 40 milliseconds after they hear
the direct bone conducted version, it would make them nuts. It makes people crazy; they
don't want to use it. So our latency is up to 6-7 milliseconds and that's it, and that's hard.
So it's not just making it real time. It's real time with 6-7 millisecond latency, because if
you hear the echoed version any later than that you start to get irritated, really irritated.
So it's a tough problem because of that.
Yes?
>>: I guess I have two questions. I might have missed the first part of your lecture. How
did you make it real time and also since it wasn’t periodic did you ever experience
anything [indiscernible]?
>> Les Atlas: So your first question is how did we make it real time? To do that I have
got a couple of really motivated and good graduate students working with me who have
worked very hard on assembly code. They realized that higher level coding wasn't fast
enough, it had too long a latency, and I bugged them and they kept at it. So the core part
of this is now real time with low enough latency. So it was just a lot of work in assembly
code. And one manufacturer gave us a processor and we have the tools to use it, and we
use their code for some of the functions, which really helped us.
Your second question was what?
>>: Since it wasn’t periodic did you ever experience any [indiscernible] or leakage?
>> Les Atlas: So the fundamental frequency is roughly periodic over a small interval, but
it's not exactly periodic because the pitch is changing; I would call it quasi-periodic. But
the aliasing would come from the bandwidth of the signal. So anything you have
heard of our stuff, or anything used in the cochlear implant processor, starts with a lowpass filter
to ensure it won't alias. Now the modulation envelope can alias too, by the way. So I talk
about this low frequency envelope which can be complex; that can alias down. That's a
whole other discussion. It turns out in the stuff you heard it probably was there, but you didn't
hear the aliasing.
>>: Did you use any window or anything?
>> Les Atlas: As the signal slides across, depending on what we are doing it could be
rectangular, it could be a Hann window, and sometimes we have used [indiscernible],
or it depends on how much time we have and how careful we want to be.
Yes?
>>: How is the device connected with the nerve?
>> Les Atlas: So connecting with the nerve: there was a picture I showed of that spiral
shaped device. There is an electrode array that's got actually 22 metal sites on it, 22 electrodes
on it, 22 fine wires that come out, and they don't come outside the skin. They go to a
device that sits under the skin, and that device has a radio frequency coil that picks up
information that is decoded internally. That decodes the stimulation pattern, and there's a
set of biphasic pulses, positive and negative charge, that are put out at a certain rate and in a
certain proportion to the different electrodes. And how those pulses are modulated is
what we changed. So it's electrical stimulation pulses that are good enough for speech if
they are basically modulated in the standard old non-negative way, the conventional way.
Modulated our new way they work much better for music, and for speech we are trying to
show better performance for speech in noise, and those results aren't ready yet.
Yes?
>>: [indiscernible].
>> Les Atlas: It's not unrelated. So the first thing I started using when I was coding
speech for a cochlear implant was linear prediction, to try to find the resonances in the vocal
tract. We had four electrodes back then and we thought we would stimulate each
electrode at the center frequency of a resonance. And I used a linear predictive analysis; I
used the Levinson-Durbin recursion to come up with the different resonant poles that represented
different resonances in the vocal tract. So I was trying to work like [indiscernible], and
how well did it work? It didn't give us speech then. So it turns out that it's more like a
channel vocoder. I only had four electrodes and now they have 22, which is better and
more precise resolution in frequency. With 22 electrodes and a simple channel vocoder,
that's what's used to get intelligible speech, but it doesn't get music, which is the
problem we worked on.
Now Bell started with a channel vocoder with the phone, and it broke, and then the phone
worked. So channel vocoders have been around a long time. Are there other questions?
>>: I want to hear your comments on [indiscernible]. So what’s your comment about
how this kind of knowledge that you gain about what’s intelligible, what’s not
intelligible, part of the [indiscernible] that you have maybe to help?
>> Les Atlas: Well Rico brought up one example of going between the device and the
server.
>>: [indiscernible].
>>: [indiscernible].
>> Les Atlas: That's it, because we are not going to be training some recognizer over
days, or weeks, or months.
>>: [indiscernible].
>>: [indiscernible].
>> Les Atlas: And if you want to do better in that kind of approach, where you have got
data and labels, loads of data and loads of labels, and you want to learn a little faster and
do a little better, have me give a different talk on active learning or something. It's not
related to this. The brain is doing deep learning already, so we are working with a deep
learner, but we have got something where we can't put the wrong input in anymore.
>>: [indiscernible].
>> Les Atlas: We haven’t tried that, I can’t claim it yet, but it’s something to look at.
That’s a good comment.
>>: [indiscernible].
>> Les Atlas: It’s quite close, yes.
>>: [indiscernible].
>> Les Atlas: You have to ask that, to get that, at least the way we are talking about it
now, to be able to get the complex estimate. We are far up the curve in statistical
estimates. So we probably could work at something like +5 dB SNR to be able to do this,
but still, in the presence of noise, being able to estimate this complex envelope, the actual
phase angle is going to be sensitive to that noise. But the correlation between the real and
imaginary parts, which people haven't looked at, is important for what you heard and it is
important for what we are doing here. And it's a little bit of extra information, because you
usually assume the real and imaginary parts are just uncorrelated; it's a Gaussian which is
round, not ellipsoidal, if it's a Gaussian assumption. I assume it's ellipsoidal, where
the real and imaginary parts have a little correlation between them; then you get something
interesting, a new feature. That's a new feature that could be used. Maybe there's a
[indiscernible] version of that feature that could be useful. I haven't tried that; that's an
interesting thought.
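[Illustrative sketch, not part of the talk and not their actual feature: one way to quantify the real/imaginary correlation being discussed is to compare the ordinary variance of a complex modulator with its complementary (pseudo-)covariance. For a circular, "round" Gaussian the latter is near zero; a clearly nonzero value indicates the "ellipsoidal" case.]
```python
import numpy as np

def circularity_stats(z):
    """Return E[|z|^2], E[z^2], and their ratio (a 0-to-1 noncircularity measure)."""
    z = z - np.mean(z)
    variance = np.mean(np.abs(z) ** 2)   # ordinary variance E[|z|^2]
    pseudo = np.mean(z ** 2)             # complementary covariance E[z^2], complex in general
    return variance, pseudo, np.abs(pseudo) / variance

rng = np.random.default_rng(0)
circular = rng.standard_normal(10000) + 1j * rng.standard_normal(10000)      # round Gaussian
elliptical = 2.0 * rng.standard_normal(10000) + 0.5j * rng.standard_normal(10000)
print(circularity_stats(circular)[2])    # near 0: real and imaginary parts uncorrelated, equal power
print(circularity_stats(elliptical)[2])  # clearly nonzero: the "ellipsoidal" case
```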
>>: And finally just by reading your first few slides, [indiscernible].
>> Les Atlas: Yes, by the rectifier. Why sigmoidal? There's nothing about a sigmoid.
>>: [indiscernible].
>> Les Atlas: But you also notice it wasn't a perfect rectifier. What I made sure of is that
when it's negative it's not zero; the resting rate is slightly depressed when it's negative.
So that might have a role or it might just be there for artifacts. Are there other questions?
Yes, you had one.
>>: So I am thinking about the math, what you are talking about and so on, and wanting to
apply it to something totally different, namely switch-mode power supplies, which,
depending upon whether they are hard switched or resonant, could have a fixed
frequency or a variable frequency, and you have got envelope modulation and you are
trying to understand how things propagate so you can look at feedback gain analysis. I
am thinking of a paper that was looking at the real and complex parts of the modulation
and how that was something that had been overlooked by people who were just trying to
do this envelope.
>> Les Atlas: Yea, so it’s the very same thing, yes.
>>: And it yielded very good results.
>> Les Atlas: It's the same concept, and if you take a look at one of our recent papers, not
on cochlear implants but on sonar, we extend coherence duration and we are using that.
We are taking a look at the real and imaginary parts of a complex process to do that. And
we are warping it in such a way that we can get longer windows and make higher SNR
estimates of things. So this might be related to the feedback you talk about. So it
could be very similar. If you look up Atlas@UW.edu and send me an e-mail I will send
you that paper, and I would like to see the reference you have.
>>: Sure I would be glad to do that.
>> Les Atlas: I would be very curious. Yes?
>>: I am an IT guy, so forgive the lack of [indiscernible] of the depth of the subject. But
when you are working with these implants you mention that you are going after cues,
like musical cues versus the actual music. So for someone who is hearing impaired,
now they get this new information, the body interprets it and then they can use that
information.
>> Les Atlas: We hope so yes.
>>: Have you done anything with someone that's not hearing impaired? Where you can
actually implant it, turn off their inner ear, and say, "Hey, does that sound the same?" I mean, is
there any work being done like that?
>> Les Atlas: We can’t do that, but you heard in the NPR show the simulation of what
someone who is a cochlear implant user hears. Now how accurate is the piano at the
end?
>>: Yea, that’s what I’m wondering. How accurate is that?
>> Les Atlas: The piano thing at the end is our best guess as to what people heard, and it
did duplicate some things that are probably quite accurate: the interval between the
piano frequencies is not preserved, but you can hear them change at least. And the piano
sounds different than a trumpet. We didn't play that simulation, but if I did play that, and
if they played that on NPR, you would hear a piano, you would hear a cello, you would hear
a violin, and they would all sound like the right instrument. The melody they are playing,
the intervals from middle C, up to D, up to F and so on, would be off. There
is no reason they are going to be accurate, because the placement of frequency is
unknown, but people can re-map and kind of learn that, fortunately.
And we now know they can get the cues that are useful for knowing which instrument it
is, and some of the people we have worked with are former musicians, and when they hear
the test they say, "We want this to take home," because even with learning speech, until
someone gets what's called a clinical processor that they can wear home, they really
don't learn speech with a cochlear implant until it's been a few weeks, or months, or
sometimes just a few days. We expect the same with music. At first music will sound
foreign with this new method. They might even like their old processor better for a while,
but then over time they are going to get more richness out of music. At least that's what
our results indicate, and we won't know until next year or so, when we are going to be
determining that with a take-home processor.
>>: Okay, thank you very much.
[Applause]