>> Arjmand Samuel: So, it's my pleasure and honor to welcome Professor Les Atlas to MSR. Les
is a distinguished professor at UW; he's been involved in different things. Mainly he is
working in signal processing of acoustics. Sound is his specialty, I believe, and sensing it
and working on it. So, Professor Atlas, please.
>> Les Atlas: Thank you. So I want to mention that this talk that I'm giving is a slightly updated
version of a talk I gave a few weeks ago in Cambridge, and it was given to a group of hearing
researchers, Brian Moore and his group and other people, Roy Patterson's group at Cambridge,
UK, and the reason for that is kind of right here: the Bloedel Hearing Research Scholar is
something that I was awarded about a year ago, which gave me some release time and some
motivation to work on how the ear works, because people have run into a limit in what they feel
the science and mathematics can tell them about the ear, and they wanted some help with that
limit. So, I also want to thank someone who's been very inspirational in this
work, Bishnu Atal, who many of you have heard of and are sponsors.
So, let me go ahead and get into my first point. And it's really just what I said: the
conventional tools and science for audio signal representations that we work with now,
that we're taught, are insufficient, and they limit our creativity. We are taught a certain set of
things that are a standard part of electrical engineering, computer science, physics, mathematics,
and those are limiting. Better tools and science would give us the ability to, for
example, come up with enhanced listening, be it for normal hearing, impaired listeners, or
machines: multiple simultaneous sources, noise, and reverb. These are standard problems that
people face in porting systems from the laboratory to the real world. And let me talk about
existence proofs that there are better solutions.
The first one is, we know our ears work well. And we can look at top-down approaches, people
like Shihab Shamma or Li Deng or others who work in dynamics. That's essential; having
models of how sounds evolve over time is an absolutely essential part, and I'm not saying that's
unimportant. But I'm going to stress something else that's important, which is more the
bottom-up, feature-based approach, and just from the standpoint of what really can be done if
you expand your imagination, let go of some of your previous notions and previous
mathematics, and allow for the fact that there might be a different world out there than what we've
learned. And I've got a picture here from a recent publication that actually made use of some
of our previous concepts, which is the notion of a modulation spectrum and modulation
filtering. This is a model of what the ear might be doing, according to McDermott et al. in Neuron,
and they cited a paper in Transactions on Signal Processing, something I've talked about before called
modulation filtering. Their argument, which is an argument that we've gone a bit beyond,
and I will be doing that the rest of my talk, is that, well, first of all, the way that the ear works,
it's got a bunch of subbands, and each subband is produced by something like a bandpass filter,
there's an envelope, and a compressive nonlinearity. Envelopes come out of every subband, and
whenever you talk about an envelope or modulation, there's another word for modulation,
which is product. Modulation, multiplying, product all mean the same thing, which
means if you really want to find a modulation envelope, you have to de-multiply the original
signal, find its carrier, or do something else to get rid of that original carrier. And I'm going to talk
about that old communications model, why it's old, why it's a bit dated, and how we can
move on from that point. It's the conventional model that people are using. This is a 2011
paper; they've got some pretty new results, nice results, using something called modulation
filtering, which I've argued for previously, but what's really missing-
>>: Is that experimental work or is it modeling work?
>> Les Atlas: Is what?
>>: Is this experimental work or is it modeling work?
>> Les Atlas: Both.
>>: It's both.
>> Les Atlas: So, this Neuron paper is both, and you'll find if you look in the literature, you'll
certainly see sets of both. But what's missing, in part of the scientific community and also the
engineering community, is this: people talk about a modulation envelope. And that's a well accepted
concept, because the way that the ear works, and the way that vocoders work, and the way that
you can get intelligible speech, is about these envelopes, these slow-moving envelopes, one for
every subband; you might have eight or sixteen or even four subbands, you can excite them with
noise, and you get intelligible speech. But when you have two talkers, or you have noise, or you
have reverberation going on, that falls apart. So the problems that we started with, which are
multiple sources, noise, reverb, are things that don't hold up well to having the envelopes only.
So the notion of what's called temporal fine structure is what the community is after. The problem is
they don't have an agreed-upon definition. There is kind of an operational definition; I'm going
to tell you why that definition is wrong, and I'm going to give you a substitute way of looking at
it.
Now, let me just give you another existence proof, something that's actually related to this
problem. When you go into a new room and you listen to a group of people talking, you can
single out a particular talker in a highly reverberant environment like that. You can understand
the speech, even at a negative signal to noise ratio. And there's another case of something
like that: radiofrequency, walking around, be it Wi-Fi that's running at 450 or 600
megabits per second, or 4G Internet, or 3G Internet. If you take a look at something which is
a city kind of layout, that's actually a Wi-Fi record, that's not 4G Internet. With little devices that look
like this thing that's small on the left, you can get a layout like this between buildings that tells
you about how well you'll do. And the key thing about that is when you use Wi-Fi or use 4G
Internet these days, there's no training sequence. It's a blind equalization that's done. And
there are tricks being done in Orthogonal Frequency Division Multiplexing that are remarkably
similar, as I will argue, to what our ear might be doing. That's the analogy I'm going to
make toward the end of my talk. So let's start heading toward that.
And first of all, to get to that point, that final point of OFDM, orthogonal frequency division
multiplexing, let's start with some history. The idea of breaking things into frequencies or
subbands, the first person to do it really was Alexander Graham Bell, who tried to send multiple
Morse code signals down a single line simultaneously, the multi-user problem for Morse code.
He failed in doing that, but by mistake, he was able to send voice over that same line; that's
called the telephone now. That was in about 1875. In 1877, Helmholtz ended up
writing a book, and a key part of that book is quoted right here: the notion of tone. People
used to count out zero crossings if they could, they used to try to make an oscillogram
or time-domain pictures, but they didn't have a notion of frequency; they didn't link what Fourier did
with what happens with sound. It was Helmholtz who made that link. Once we did that,
suddenly the area took off. 1906, the notion of product models, which are still being used, that's
AM radio, was initiated back then. 1933, vocoding models, AM and FM together. FM radio came
then, and the problem is, we're kind of stuck at that 1933 point in terms of the modeling we're
using for understanding how the ear works, and for understanding things like even mel-frequency
cepstra and what they're doing. Now, mel-frequency cepstra go well beyond what AM-FM
models do, but inherent in them, inherent in their formulation, is right there.
Let's take a look at something that's more modern: OFDM. And I'm really stressing the FDM
part; it doesn't have to be orthogonal, it's the notion of frequency division multiplexing. The first
papers are from 1966; I list 1942 because the first person to talk about frequency division multiplexing,
or spread spectrum in general, was Hedy Lamarr, the actress, in 1942. And that's used in all modern
Wi-Fi, 4G, and high-speed data communications. Now, it handles just the problem we are talking
about, multiple talkers; that's called multiple access. It uses reverberation; it doesn't just blindly
equalize, it uses reverb to its advantage in ways that I'm going to show you. Without the reverb,
the multiple paths between buildings, it would work worse. It works best if thousands of
frequency channels are available for one data stream. The more frequency channels, the
better, except for battery life. It's kind of like the ear: we have thousands of frequency
channels, highly overlapping, in the ear. They're not orthogonal, but they offer some novel and
surprisingly simple concepts for auditory encoding, whether we're trying to understand what
the ear does from a scientific standpoint, or come up with new algorithms that are insensitive
to noise and can single out sources. Now, what's different about it? We're talking about some
communication scheme, but what's the single kernel of difference between what auditory
models use and assume, the stuff I've even talked about before, and what's done in OFDM?
Well, it's this fact: it's multiple access, primarily additive, and not a product model. I'm going to
show you what I mean by this soon. When people talk about modulation coming out of every
subband, and then the fine structure that might be coming at the pitch rate, or at a higher
frequency rate, they're presuming a product model. A product model causes problems, which I
will show. If you use an additive model instead, things get better. Let's take a look at the
auditory system, just a simple model of it. And I'm going to keep a simple model from known
material, not making any strange speculation here: we have an input signal that's
broken into a bunch of subbands. It could be thousands if it's our auditory system; if it's some
device, you know, modern audio compression, it might not be thousands, you are limited by
compute, but in any case, you have a bunch of subbands that have certain properties. Now
what's special about our auditory system is the hair cell rectification right here. For the hair cell
rectification I'm showing an instantaneous nonlinearity that compresses the signal in a way
where, if the signal is negative going, it either provides zero or it lowers the resting rate, so it
has a slight amount of negative value, which I show by having this white line below the origin.
On the positive side, it pretty much duplicates the input. A half wave rectifier: not a perfect
model, but pretty much what the transduction between acoustic pressure and neural stimuli
does, followed by, now, I'm going to foreshadow what's coming, a crummy lowpass filter.
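A minimal sketch, in Python, of one subband's worth of that front end: half-wave rectification followed by a crummy, leaky-integrator lowpass. The sample rate, tone frequency, and roughly 50 hertz corner are illustrative assumptions, not values from the talk.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                              # sample rate in Hz (assumed)
t = np.arange(0, 0.1, 1 / fs)
subband = np.sin(2 * np.pi * 600 * t)   # stand-in for one subband's output

# Hair-cell-like instantaneous nonlinearity: half-wave rectification.
rectified = np.maximum(subband, 0.0)

# "Crummy" lowpass: a first-order leaky integrator,
#   y[n] = a * y[n-1] + (1 - a) * rectified[n].
a = np.exp(-2 * np.pi * 50 / fs)        # ~50 Hz corner (assumed)
slow_envelope = lfilter([1 - a], [1, -a], rectified)
```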
If you take a look at every hair cell in the ear, there are several neural fibers coming off it,
anywhere between one and eight for the hair cells that go up to the brain. And what comes off
them has a lowpass filter, but it's a leaky-integrator lowpass filter. Some of those fibers carry
information at higher rates than others. That's never really been looked at. So let me give you some
alternative views. Here's the conventional view of what happens in every subband in our
auditory system; here's the conventional view from the standpoint of auditory research and
psychoacoustics. Take a Hilbert transform to form a complex signal called an analytic signal, put
things in polar form. Okay? That's where the problem is. You put things in polar form and, as you'll
see, there is that product. That product's not going to be our friend, as I'm going to show you.
Because you start with something which will be a beautiful looking envelope: if you looked at
the original signal, a two-sided real signal, formed its one-sided analytic complex signal, and then
looked at its magnitude, which is again real and nonnegative, it's a beautiful looking temporal
envelope of the signal. It tracks it really nicely. You don't need a lowpass when you do this. But then
what people do is, they get an envelope that looks reasonable, nonnegative real, and then
they take what's left over, because that's what you need to put the whole signal back together,
yeah-
>>: What's the motive of the processing people have done? They don't even take the phase.
They just take [inaudible] energy as a function of time.
>> Les Atlas: Well, the energy as a function of time would be very similar to this if you put a
square there, for example, and lowpass filter it if you wish, if you don't use the analytic signal.
But the problem then is all you have are envelopes. Okay? So the standard thing is, if you only
have envelopes, you don't have what we call the fine structure, which is this branch down here;
you're not able to track multiple talkers, and not able to track a single talker in noise. These are
the things that I'm arguing for.
>>: [inaudible] into the phase there.
>> Les Atlas: This is the standard way people go after what they call temporal fine structure.
That's a way I'm going to argue against.
>>: Multiple [inaudible] simply take into spike time?
>> Les Atlas: Take what?
>>: Into spike time, of the low spiking one.
>> Les Atlas: Oh. Well the spiking comes from somewhere, yes. Okay. There's-
>>: Actually, I think I've seen, you know, all the correlation models, that they recall the spiking
time, whereas they're not, they cannot typically [inaudible] represent by this [inaudible] white
phase, the phase the other [inaudible].
>> Les Atlas: Well, I'm not going to argue with that. Okay? There are issues. But in terms of
psychoacoustic experiments these days, and some physiology that's being done with mammals,
which Shihab does, and Ian and other people at Cambridge do, they do use Hilbert phase, and
that is the most common thing that's done. The neurophysiologists and psychoacousticians tend
to use the Hilbert phase, which is that lower signal. You might be a little uncomfortable with it;
I'm sorry, I am too. Okay? So that's one example, okay? But here, let me make a more general
point: if you decompose a signal into a slowly varying, energy-like component, multiplied
by a finer time structure component, that's still going to be problematic. Any of those.
Especially if this is unimodular, or magnitude equal to one. Okay?
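For concreteness, here is a sketch of that conventional product-model decomposition, the Hilbert envelope times a unit-magnitude fine structure; the AM test tone is my own illustrative choice, not a signal from the talk.

```python
import numpy as np
from scipy.signal import hilbert

fs = 16000
t = np.arange(0, 0.1, 1 / fs)
# An AM tone standing in for one subband of a real, two-sided signal.
x = (1 + 0.5 * np.cos(2 * np.pi * 30 * t)) * np.cos(2 * np.pi * 600 * t)

analytic = hilbert(x)                        # one-sided analytic signal
envelope = np.abs(analytic)                  # nonnegative real envelope
fine_structure = np.cos(np.angle(analytic))  # product-model TFS in [-1, 1]

# The product form reconstructs the original subband exactly.
assert np.allclose(envelope * fine_structure, x)
```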
So let's take a look at the alternative. That's the problem: it's a multiplicative view. So what
we're going to propose instead is something that's closer to what's done in frequency division
multiplexing. We're going to start with rectification; the ear does that. There are certain
properties of rectification, which are coming soon, that you're going to see. Really
interesting properties that no one really takes advantage of. That's half wave rectification.
Then there's this notion of integration after that, which is a lowpass filter; that top branch is a
modulation envelope. Now the new part is, instead of it being a product, it's additive. What we're
going to add is what we call a fast envelope. If you're talking about phase locking, you're phase
locking to this fast envelope or carrier; it might be at the pitch rate, or, if it's a lower subband in
speech, it might be at the input signal rate. So this addition, riding on top of the envelope, is at
a much higher frequency. That's a very different model. It's different algebra than we had before.
Let's talk a little bit more about this new additive model. So we're talking about the top being
something which is identical to the standard half wave rectifier, however you'd like to do it,
whether you use the Hilbert envelope or use rectification followed by a lowpass filter on a real
signal. If you look at, for example, Stuart Rosen's definitions, you tend to have frequencies up
to about 50 hertz and below for that envelope, maybe below 16 or 18 hertz in some people's
definitions. But this new part, which is additive, that bottom fast envelope, starts with a low
corner at about 50 hertz, below the pitch rate, below the lowest pitch, and goes on up to
whatever phase locking rate you can have. If you look in the latest edition of Brian's book,
you're up to four or five kilohertz for your phase locking rate. So I'm calling it a bandpass filter
just to show it doesn't go on forever.
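A sketch of the additive decomposition with those corners; the filter orders and the exact 50 hertz and 4 kilohertz corners are assumptions for illustration.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def additive_envelopes(subband, fs):
    """Half-wave rectify one subband, then split the result additively
    into a slow envelope (below ~50 Hz) and a fast envelope (from ~50 Hz
    up to an assumed ~4 kHz phase-locking limit)."""
    rectified = np.maximum(subband, 0.0)
    b_lo, a_lo = butter(2, 50 / (fs / 2), btype="low")
    slow = filtfilt(b_lo, a_lo, rectified)
    b_bp, a_bp = butter(2, [50 / (fs / 2), 4000 / (fs / 2)], btype="band")
    fast = filtfilt(b_bp, a_bp, rectified)
    return slow, fast   # rectified ~ slow + fast, plus what's above 4 kHz

# Example (fs must exceed 8 kHz so the 4 kHz corner is below Nyquist):
# slow, fast = additive_envelopes(subband, fs=16000)
```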
And let's look at some of these models. For the conventional model, I'm using a Hilbert phase
because that's the most common among the people I was talking to in Cambridge, but you can
substitute your other favorite product model for the high-frequency stuff that you're phase
locking to. First problem with it: the complex exponential phase, and this is a theoretical issue, does not
have an inverse function in the standard sense. If you pair it with its original modulation, you're
okay. But if you make any change, you're stuck, mathematically or algebraically. The
underlying complex polar form's magnitude must always be nonnegative. We're going to see
the problems that develop. Because whenever the real part attempts to dip below zero, you've
got a subband of speech coming through a bandpass filter, in the ear or whatever device you're
building, and it wants to cross zero. It's not nonnegative. Whenever it crosses zero, this particular
Hilbert phase, or whatever definition you use for a polar form product model, is going to have
a problem. It's going to abruptly change its phase by plus or minus pi. It's not connected to any
known physiology. And let's contrast this to what I'm proposing instead, which is an additive
model: the fast envelope.
Now there's a deep theory behind it; I didn't pull this out of a hat. The reason it's connected to
frequency division multiplexing and high-speed data communication is something called
complementary statistics. If you take a look at, more recently, the work of Peter
Schreier and Louis Scharf, and before that, Picinbono in France, Transactions on Signal Processing,
1994, you can see new information in signals that hasn't been used in most audio processing.
It's very rarely used in audio processing; the only papers I've seen are ICASSP 2012, maybe one in
2011, that I'm a co-author on. I haven't seen anyone else use it in audio. It's very rarely used
for natural signals, in fact. This notion of a complementary envelope is usually used for complex
communication signals, where you have a cluster in the complex plane of 8 or 64 different points
in your signal constellation. And because you have this cluster in your complex plane, it's not
circular. If you look in the complex plane at the real versus the imaginary part, if you wish to use
quadrature that's fine, there's a pattern which is not exactly circular. That pattern's not
circular; it's got complementary statistics that are meaningful. It's got new information. Any
signal with periodic correlation, it could be speech, it could be sonar from a propeller, has
significant complementary statistics.
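To pin down the terminology, these are the standard second-order definitions in the notation of Schreier and Scharf (stated here for reference, not from the talk's slides). For a complex signal x(t), the ordinary covariance and the complementary covariance are

```latex
R_x(t,\tau) = \mathbb{E}\left[ x(t)\, x^{*}(t+\tau) \right],
\qquad
C_x(t,\tau) = \mathbb{E}\left[ x(t)\, x(t+\tau) \right].
```

Circular (proper) signals have C_x identically zero; a noncircular constellation, or a signal with periodic correlation, has a nonzero C_x, and that is the new information being referred to.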
>>: So is there any argument that this kind of [inaudible] is connected to auditory physiology,
no?
>> Les Atlas: I'm getting there.
>>: Okay. Sorry.
>> Les Atlas: I'm getting there. Okay? I'm about to get there, a couple slides. They still make
use of a simplified lowpass, and what I'm arguing for is the additive fast envelope; I'm trying to
get at that. Okay? When you look at these kinds of papers, they're talking about expected
values, things that are in the limit. Expectations where you have either infinite limits, or you
assume ergodicity or other ways of finding an expected value; if it's not ergodic you also have to assume
that the signals are harmonizable, other things that you need because they're not necessarily
stationary, yet you need to be able to find some statistics on them. It's complicated
mathematics, hard to get through, with these complementary statistics. What we do instead
these days, and it's not going to be in my talk, I don't have time for it today, is a generative model.
And what a generative model can do, it's much like having a DFT instead of a power spectral
density. If you have a power spectral density, that's a pencil and paper thing: you need to
know the underlying true autocorrelation function, and you take its Fourier transform with pencil
and paper. But if you have data you can't do that. You do a DFT, you do an FFT, you might
window it, you might use linear prediction with another characteristic. Those are all spectral
estimators. One estimator of the complementary part is the fast envelope. It is an estimator of it. It's
not an expected value, it's not a pencil and paper thing, but it works with the data.
Let me show you why. This is a slide where I can show you why that's true. This is the only math
I'll be giving you today, and it's actually very interesting. The reason is, if you take a look at this
generative model of what the complementary part is, this new information that has to do with
the non-circularity, if you took the analytic signal, the stuff that's really important in
communication signals, the reason you get 600 megabits per second, if you look at that, you get
a double frequency term where that complementary part is, and a zero frequency term where
the envelope is. The standard envelope is the signal beat by itself down to baseband; the
complementary part is at double the frequency that you started with. If your carrier is at 2 kilohertz,
the complementary part is at 4 kilohertz. That's what you'd normally get in complementary statistics.
Now when you work with a half wave rectifier, it gets a little more believable and plausible for
the auditory system. This is why. Let's assume a sinusoid's coming in. What happens to
the Fourier series for a rectified sinusoid gets really interesting. We have a sinusoid coming in; what
comes out? For a sinusoidal input, it's a sinusoid at the fundamental frequency, and, with a
minus pi phase shift, a minus-quadrature, double frequency term.
So here's the story. If you're talking about complementary statistics as new information, it's
usually a double frequency term. We know 1.5 kilohertz, 2 kilohertz is important for speech.
The double frequency term would be at 3 or 4 kilohertz. It's hard to argue for phase locking up at
those frequencies; really hard to argue, with Brian Moore and his group, for phase locking at those
frequencies, it's very rare. But if you're right at that 1.5 or 2 kilohertz, it's not so hard anymore. So
stuff that leaks through at the fundamental is right at the frequencies where you can get phase locking,
which is that complementary information. And by the way, your first harmonic for a sinusoid
is in minus quadrature. That's not just the nature of the Fourier series, it's for a sinusoidal input
specifically; I can't generalize this to the broadband case because it's nonlinear, but for a narrowband
case, it would be like this. The higher order terms for the sine terms are very small; they go away,
basically. The higher order terms for the cosine terms are also quite small, but they don't totally
go away.
So there's a key principle going on here that's linking a fundamental to its harmonic. If we're
talking about a subband that's centered at 3 kilohertz for speech, at 6 kilohertz, that first
harmonic, there's not going to be much phase locking there, so we're really not getting much. But if
you are talking about something at 800 hertz, 1600 hertz, you will get phase locking, and
maybe this term and this phase shift are important. So let's keep that in mind.
>>: Just from this I couldn't tell that there's a 270° phase shift.
>> Les Atlas: Oh, all I did is, I just did the algebra. I don't have the derivation on this slide, okay,
I can show that derivation separately. Okay? But the derivation is, if you start with this as a
signal model and then determine a_k, a_0, and b_k using the standard definition of a
Fourier series, you get these functions that I plotted. So all that is, is a closed form solution for
a_k and b_k, even though it's a series, an infinite series.
>>: [inaudible] for a series.
>> Les Atlas: Yes. If you use the standard definition of a Fourier series, there's a definition of what a_k is
equal to, another definition of what b_k is; you solve for them and then you plot them, and
you get these pictures, for this particular s(t). That's all that is. Okay. So this is the effect of
half wave rectification, which is interesting, a lot more interesting than I would've ever thought.
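For reference, the closed form being described is the textbook Fourier series of a half-wave rectified unit sinusoid (reconstructed here from the standard result rather than copied from the slide):

```latex
\max\{\sin(\omega_0 t),\,0\}
  \;=\; \frac{1}{\pi}
  \;+\; \frac{1}{2}\sin(\omega_0 t)
  \;-\; \frac{2}{\pi}\sum_{k=1}^{\infty}\frac{\cos(2k\,\omega_0 t)}{4k^{2}-1}.
```

The fundamental survives at half amplitude; the first harmonic at twice the fundamental is the minus-quadrature, double-frequency cosine term, with amplitude 2/(3*pi); and the higher cosine terms fall off like 1/k^2.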
So let's now do some demos that relate to these concepts.
What I'm going to do now is play two sounds that differ. This is a really crude
model of one versus two talkers. And the way I'm going to do this crude model is not really one
versus two sources; it's something which is consonant versus dissonant. The one which is
consonant is when they're harmonic; it sounds like one source, no question about it. When
they're dissonant or inharmonic, which is the 412 and 740 hertz pair, they don't necessarily
sound like one source; they sound like a chord, a crummy chord. And what we're going to do is
play the first one, then the second one, and we'll keep looping through, just to hear the
difference between them. You're going to listen to them now, and then I'm going to show you
the fast and the slow envelopes for these. That's what I'm getting to. I'm making the argument
that the fast envelope tells you the difference between consonance and dissonance, between
one source and multiple sources. But let's just start with the test signals so you understand how
they sound.
[sound]
>> Les Atlas: That's this one.
[sound]
>> Les Atlas: That one.
[sound]
>> Les Atlas: That one again.
[sound]
>> Les Atlas: Okay. One sounds like a single source; the second one sounds maybe like a train
whistle that has two reeds in it, whatever it is. Not a perfect chord, but certainly more dissonant
than the first one. Now let's take a look at these two. If I go with just the first case, which
was this one, not playing, there it is.
[sound]
Just that thing where they're harmonic and it sounds like one source. What I'm going to do, for
the sake of my discussion, is talk about half wave rectification: start with the usual, followed by a
lowpass filter. It's the simplest model, auditorily acceptable to most people, of what our auditory
system does to find an envelope. I'll call it the slow envelope, to distinguish it from the new concept,
the additive thing, not a product: the fast envelope. Okay, here's a way to look at it. You can
think of the single signal that's coming through the half wave rectifier. The lowpass filtered
version is just following the peaks of it, okay? Because of the lowpass filter. What's left over?
The bandpass filtered part. With a high enough corner on the bandpass filtered part, these two
summed together give you what came out of the half wave rectifier; they give you everything except the
negative-going part of the original signal. So what you're doing-
>>: So for the bandpass filter, what is the band frequency?
>> Les Atlas: So for this particular one, I think, what we used here, was it the corner at 450 for
this one?
>>: Four hundred fifty over two.
>> Les Atlas: 450 over 2, okay, so it’s half way.
>>: Is that the band?
>> Les Atlas: That's the low corner.
>>: And the upper corner was 3 [f c] over two, so three times 450 over two. So you would get
that single frequency term.
>> Les Atlas: Right. And likewise for this one, which shows a corner that was, we made sure that it
was flat. And we'll see the results shortly. But I'm going to be comparing the results to this
case, which is the inharmonic case. It takes a while to load.
[sound]
>> Les Atlas: Sounds like two things, maybe, a train whistle at least. We're going to do the same
thing for comparison. This is the more dissonant case, like two sources maybe: the slow
and the fast envelope. And we're going to look at them and compare them. Let's compare the
harmonic fast envelopes first, the 300 and the 600, clearly harmonic. Here's the fast envelope
coming out of the 600 hertz input. Look at its peaks. Line them up. There's the 300 hertz input's
fast envelope. Perfectly lined up. And the valleys also line up with the peaks of the other one.
The fast envelope structure aligns in time. So if I had two subbands, one corresponding to
about 300 hertz, another corresponding to about 600 hertz, look at the fast envelope that
comes out; call the lowpass filter after the half wave rectifier a leaky integrator, look at what
leaks out, line them up with their phase locking, and the phase locking lines up nicely, if you
correlated the two.
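Here is a sketch of that comparison, a reconstruction of the demo in Python with the harmonic 300/600 hertz pair against the inharmonic 412/740 hertz pair. The second rectification and the peak-of-cross-correlation measure below are my assumed stand-in for "line them up with their phase locking," not the talk's exact computation.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 16000
t = np.arange(0, 0.5, 1 / fs)

def fast_env(tone_hz):
    """One subband tone -> half-wave rectify -> bandpass [f/2, 3f/2]."""
    rect = np.maximum(np.sin(2 * np.pi * tone_hz * t), 0.0)
    lo, hi = tone_hz / 2, 3 * tone_hz / 2
    b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, rect)

def alignment(e1, e2):
    """Shared periodicity of two fast envelopes: rectify again, remove
    the mean, take the peak of the normalized cross-correlation."""
    r1 = np.maximum(e1, 0.0); r1 -= r1.mean()
    r2 = np.maximum(e2, 0.0); r2 -= r2.mean()
    xc = np.correlate(r1, r2, mode="full")
    return np.abs(xc).max() / (np.linalg.norm(r1) * np.linalg.norm(r2))

print("harmonic   300/600:", alignment(fast_env(300.0), fast_env(600.0)))
print("inharmonic 412/740:", alignment(fast_env(412.0), fast_env(740.0)))
# The harmonic pair scores much higher: its fast envelopes stay locked,
# while the inharmonic pair shares no common periodicity.
```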
>>: That's funny. Just [inaudible]-
>> Les Atlas: Sure.
>>: Just to make sure I'm following. Aren’t those [inaudible] flipped? Isn't 600 the one in the
top and 300 the one in the bottom?
>> Les Atlas: You're right. Sorry about that. Okay. I could call this a scale and we'd get away with
it. Sorry about that. Yes, they are. That's a mistake. Thank you for pointing that out. Good
eye.
>>: Those envelopes, they line up in the same way as the original harmonics that you used to
create the sound line up. So what would happen if the original harmonics were shifted
relative to each other?
>> Les Atlas: So you mean that the original harmonics are all shifted together?
>>: Yeah, so could you go back to the last one?
>> Les Atlas: Okay, because I will be showing you the inharmonic case soon okay.
>>: No, I’m talking about the harmonic case.
>> Les Atlas: Okay.
>>: So yeah, it's a, hang on please, on the top right, you see that, so the top right is the sum of
the left signals, so you see that they have been aligned such that peaks of the 300 hertz line up
with the peaks of the 600 hertz. And of course as a result, the envelopes line up in the same
way. What would happen if you were to shift the 600 hertz wave by, say-
>> Les Atlas: 50 hertz.
>>: More than one?
>> Les Atlas: Both by 50 hertz, you're saying.
>>: No, if you shift it in time by a quarter period or so.
>> Les Atlas: Oh, if there's a relative phase delay between them.
>>: So pretty much I believe this one makes them belong to the same, us to perceive them as
belonging to the same signal, that they are in phase. My guess is that if you put, let's say, 57° for
the initial phase shift between 300 and 600, we may perceive-
>>: [inaudible] explanation is different. That particular phase relationship has to do with the fact
that they [inaudible] output of the half wave rectifier, right? Because with the half wave
rectifier, the second harmonic, the second beat has to be aligned with the other one to produce
the continuation of the low peak, because of the rectification.
>> Les Atlas: That's a factor. But if we went ahead and did a relative phase shift, and if you think of
what the ear does with the traveling wave, they're not in phase by the time they get to the hair
cell. Okay-
>>: I would expect in the ways they move around it won't change much our-
>> Les Atlas: Same perception, so-
>>: And every model would predict that. [inaudible] model would predict the same thing.
>> Les Atlas: Yeah. And so when I brought this example up to the psychoacoustics and
physiology people in Cambridge, it's now a discussion that went on for a week. And this to me
is a success. Okay? That's what I wanted to do. I want people to think a little differently.
>>: [inaudible] because of all that randomness, when you go to the phase spikes, you don't
wash out all these differences.
>> Les Atlas: You're looking for correlation somewhere else down the line. So if you do a
relative phase shift, and that's never exact, you get random firing at the peaks, okay,
and if there's some correlation somewhere down the line, in this huge structure in the auditory
midbrain, it can say these are likely from the same source. And all I want to do is contrast, so we
could put a phase shift in, and we could get something where they don't line up perfectly in the plot
that I showed you. Let's go back to the plot. We'd get something where we don't get exactly this
alignment. Okay? There'd be a relative phase shift in the bottom one versus the top one, but
over time there'd be correlation between the two, and that correlation is measured
somewhere else up in the auditory midbrain.
>>: [inaudible]
>> Les Atlas: Yeah, the cochlea, and over time, be it 10 milliseconds or 40 milliseconds, the
statistics would accumulate and say, hey, we've got one source, because they're firing in
synchrony in some way, even though there's a relative delay. Yes.
>>: And the magnitude of these envelopes would also be affected by the relative phase of the
original signals?
>> Les Atlas: Nope. Here's the magnitude of them right there. They're both, the slow-
>>: As you change the time alignment of the signals, the magnitude of this spectrum
stays the same?
>> Les Atlas: The magnitude of each subband output stays the same. Okay? The signal
itself, of course, if you look at the magnitude of the signal itself before it goes into the
individual subbands, of course that's changing. Okay? The signal itself is changing its peaks and
its shape. But if you look at the subbands in the auditory system, at what's coming out of each
one, call it a slow envelope that's defined by a half wave rectifier and a crummy lowpass filter
that only keeps low frequencies, that's flat as a board, with your phase shift or without. Okay?
And the reason I'm doing this particular case is because when we go to the inharmonic case,
the dissonant case, next, they're also both going to be flat as a board. So when you look at the
standard auditory model of a bunch of subbands followed by a rectifier followed by a lowpass
filter, consonant and dissonant shouldn't sound different. They may have different pitch, but
the consonance versus dissonance should not be noticeable. It's something else. What is that
something else? Okay? People would argue for the Hilbert phase. I argue against that, and in a
moment I'll tell you why. This is the alternative, which is the additive alternative.
Now, this was only the harmonic case. Let's go to the inharmonic case and see what happens. Now
they're not lined up. No matter what you do with the numbers that we chose, these are never
going to line up. You could shift the phase of the top relative to the bottom, but they're never
going to line up and accumulate their statistics, in terms of, for example, their phase locking, in such
a way that over a period of 20 or 500 milliseconds you can collect things where they're
synchronized. They're always asynchronous, and that's why they sound dissonant. You can
look at the peaks too, and the valleys, and we also have, [inaudible] are reversed again, sorry
about that. So they're not aligned, ever.
But look at their slow envelopes. There's no information in the slow envelope of either subband;
they're both flat. So consonance versus dissonance did not make it through the half wave rectifier
plus lowpass. It's in the highpass part, or the bandpass part, or the fast envelope, as I call it.
Okay, so that's my new concept. These are additive concepts; that's another important thing.
Let's compare this to the conventional view that's used by psychoacousticians and
neurophysiologists: the Hilbert phase, which is very popular now. And I'm going to give you a
different example. Let two tones beat; sum them together. Let's just play this case. It takes a
little while to load.
[sound]
>> Les Atlas: Now this would be considered the case of two tones falling into the same
auditory subband. Because they're falling into the same auditory subband, they're
beating with each other, and you can certainly hear that beating, whatever that envelope rate is
that you're staring at. And that's a pretty well understood phenomenon, going all the way
back to Helmholtz: you have an envelope, you could be tuning a guitar or something like that,
you're listening to the beating, you're waiting for it to go away, when you've got frequencies
that are close to each other. Now, the first kind of hint that there's an issue here is: which of the
two envelopes is correct, the top one or the bottom one? I don't know. They're both correct.
Now let's take a look at that in the frequency domain. They're very close. We presume they're
in the same subband, which is what they would be for our auditory system. Let's look at the
conventional Hilbert phase temporal fine structure. The way it's defined, that bottom
quantity is the cosine of the phase that comes after the analytic signal. You put the complex
number in polar form, and its phase function, through the cosine, is between -1 and 1.
That's what a cosine's going to give you, whatever phase you put in there. Let's take a look at
that signal. That's that same beating signal we just looked at. Let's look at some of its
characteristics. Let's zoom in on this portion. After we zoom in on it, I didn't want to use, I
told Scott not to use Matlab's interpolation between points, because there is no interpolation to
be done between points. It isn't a vertical line there; that's the discontinuity. So there's
something wrong mathematically with this: this definition gives you a discontinuous signal. This
is what people are using. And when I get to speech, you're going to see that this is not an
artificial case; this happens in speech also. So there's a severe discontinuity, plus or minus pi,
whenever that thing wanted to cross zero.
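A sketch reproducing that artifact numerically; the two tone frequencies are my own illustrative choice, not the demo's.

```python
import numpy as np
from scipy.signal import hilbert

fs = 16000
t = np.arange(0, 0.25, 1 / fs)
# Two close tones in one subband; the envelope beats at 20 Hz.
x = np.cos(2 * np.pi * 500 * t) + np.cos(2 * np.pi * 520 * t)

analytic = hilbert(x)
tfs = np.cos(np.angle(analytic))   # conventional Hilbert fine structure

# Per-sample phase steps, wrapped to (-pi, pi]. A smooth ~510 Hz carrier
# advances about 0.2 rad per sample; at each envelope zero crossing the
# Hilbert phase jumps by roughly pi on top of that.
dphi = np.angle(analytic[1:] * np.conj(analytic[:-1]))
print("typical phase step:", np.median(np.abs(dphi)))   # ~0.2 rad
print("largest phase step:", np.abs(dphi).max())        # ~pi: discontinuity
```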
So let's take a look at this same issue. I argue there's a problem with the Hilbert phase and other
ways of representing this fine structure. And the problem is, fundamentally, that whenever your
envelope wants to cross zero you're going to have a discontinuity in any of the standard
definitions. What happens with our new fast envelope in this case? Do we still have a
discontinuity, do we have a mess? Let's take a look. Half wave rectifier; the slow envelope is just
that envelope. There's the fast envelope in blue. Okay. The fast envelope is not magnitude one.
It doesn't go between -1 and 1. It has modulation left on it. That's troubling for some
people. That was troubling for me at first, because I'm used to this product model. And the
product model is where the fine structure has a magnitude that's going between plus and minus
one, because you've taken all the modulation out of it. How could you say that the envelope's a
modulator if there's still modulation left in what you call the fine structure? I was stuck
thinking that way for the last 10 years. Our entire community's been stuck thinking that way
since the 1920s. Okay? There's no way to force this to go between plus and minus one without
forcing a discontinuity in it. And that discontinuity, as we're going to see, would exist in speech
if you analyze it this way, and it also causes troubles.
>>: So why do psychoacousticians use this model? I never liked this model to begin with. I
talked about [inaudible], so this stuff [inaudible]. Spiking information?
>> Les Atlas: Well, this is a predecessor to the spiking. Okay? This is what comes before the
spike; this is what the spiking's based on. I'm not arguing this in place
of spiking, at all. I'm saying the spiking's got to come from something. What does it come
from?
>>: The spiking come from the [inaudible]. [inaudible] going there, they could, you know,
[inaudible] down this there [inaudible], you don’t have to explain, you know, [inaudible] you
can still detect all kinds of spiking.
>> Les Atlas: I will argue that, but the psychoacoustic experiments that are being done these
days are using this, yes. You know, 2012 papers are using this, okay? So,
but let's zoom in on this portion. What I'm trying to do is give you a good alternative. You'll
have people who look at it other ways, but they still look at it in ways where that fine
structure, before they find the spiking, has to go between -1 and 1, so it still has to have
problems. And I'm arguing you have to get away from that.
So here's what happens with the fast envelope when I zoom in on that portion where the
envelope crossed zero. The fast envelope just goes to zero smoothly; there's no discontinuity,
there's no trouble. But what I want to get to is this case of speech, because we're going to do
the same kind of thing for speech in a moment. So I'm comparing a Hilbert phase version, or any
version which is a product form and its fine structure before you determine the spiking, with
something that we'll call a fast envelope, which is just the stuff that leaks through the leaky
integrator that comes after the half wave rectifier. And let's just compare them spectrally. If we
look at them spectrally, the red is the Hilbert case and the blue is the case of the fast
envelope. The fast envelope is giving you your two tones exactly. The Hilbert phase
is giving you all these harmonics, which are distortion products due to that discontinuity.
Now, we could compare these two by listening to them, but I think I want to get to the speech,
looking at the time. Let's take a look at a 1.5 kilohertz subband of speech. We've been talking about
artificial signals; does this really matter for speech? And my argument is that yes, it does. Let
me just play the speech sample.
[speech sample] A bicycle has two wheels.
>> Les Atlas: Let me move ahead. Let's pass that speech sample through a single subband
based on, well, the models that they used in Cambridge. And we can just play it after
it goes through that subband.
[sound]
>> Les Atlas: At 1.5 kilohertz, it sounds squeaky, but you can hear some of the speech coming
through just that subband. Now let's take a look at our various representations. Hilbert phase:
there's the signal in purple coming through there, not the original signal, but after the
subband, in purple right there. Its Hilbert envelope is the black dashed line. A beautiful looking
temporal envelope. These are time signals. There's the Hilbert phase for speech. What did it do?
Pretty ugly. This is any definition that's assuming a product form, where that fine structure has
to go between -1 and 1. If your spiking follows, fine. But any model that does that is
going to have these problems. Let's zoom in on some portions. A discontinuity right there. Let's
zoom in when it gets small. When the magnitude of the signal gets smaller, you've got some problems.
What it does right in this portion is not a function of the signal, it's a function of the word size
of your computer, really, because we're down at a very small signal level
and we're looking for a transcendental function, the inverse tangent.
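That word-size point is easy to check with a toy computation; the tiny perturbation below is an assumed stand-in for round-off at the bottom of the machine word.

```python
import numpy as np

rng = np.random.default_rng(0)
true_phase = 1.0   # radians

for scale in (1e0, 1e-8, 1e-16):
    # An analytic-signal sample of magnitude `scale`, plus double-precision
    # round-off noise of order 1e-16.
    z = scale * np.exp(1j * true_phase) \
        + 1e-16 * (rng.standard_normal() + 1j * rng.standard_normal())
    print(f"magnitude {scale:.0e}: recovered phase = {np.angle(z):+.4f}")
# At magnitude ~1e-16 the inverse tangent returns noise, not the phase.
```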
So, the Hilbert temporal fine structure, which pretty much everyone's using in psychoacoustics,
and which a lot of physiologists are using as a predecessor for the spiking model, has artifacts. It's
because it's of constant amplitude, even at times when the artifacts are pronounced, and that's
fundamentally because of the product model. It's also not terribly physiological. So
the fast envelope fine structure is an alternative that doesn't have these mathematical
problems and is also more physiological. Let's do the same experiment. Here's the Hilbert phase
temporal fine structure that would come out. If we use a half wave rectifier instead, let's look
at the fast envelope. I'll zoom in on the difference. The Hilbert is the red; the
conventional, product model is red, and our fast envelope is the blue one, right there. It has
no discontinuity.
>>: [inaudible] this fast envelope can capture the carrier somehow?
>> Les Atlas: That's exactly what it is; it's capturing both the pitch and the carrier, depending
on where you put that low corner.
>>: But typically a communication signal passing [inaudible] carrier is because of the frequency
and the cosine sign. But here, carrier changes the function pattern into [inaudible].
>> Les Atlas: We have to have it that way, otherwise we're going to get artifacts. Now, those
artifacts you get in man-made communication signals are the message that's being sent. Okay?
Here, we don't have such a clean message that was designed, that was man-made. So that's
why it's a problem to use those communication ideas for natural signals. There's nothing doing,
for example, phase shift keying, coming from our auditory system that we have control over.
Okay? So if we would put, or make, a receiver look for that, it's going to see artifacts, which
were zero-
>>: So is it [inaudible] to argue that the slow envelope is another [inaudible] message that gets
sent?
>> Les Atlas: The slope?
>>: I mean, in a man-made communication system, you get a constant error of fast [inaudible].
And then you get the message that you [inaudible], but now for this natural signal, what is the
analogy here? Do you try to make the analogy that the fast envelope is like, you know, a distorted
version of the carrier, and the slow envelope [inaudible]?
>> Les Atlas: Here's the closest I can come to answering that. The experiments are in progress right
now. Okay? They're working with guinea pigs; they're looking at this model and comparing it to
the Hilbert phase in Cambridge right now. In the model that we are using, we are comparing
against the red for where the spikes are. The probability of a spike is highest at the peaks of the red;
that's the old model. Okay? The new model says the probability is highest where the peaks in the
blue are, but if the peak is very small, it's a lower probability. That's all.
>>: I see.
>> Les Atlas: So what we’re saying is we are still phase locking, but when the signal level is
small or crossing zero the probability of phase locking at that phase drops down. That's all.
>>: So you can simulate using this one as a [inaudible] and then compare the spiking results
experimentally to see how much consistency you might get?
>> Les Atlas: Yes.
>>: Nothing has been [inaudible]?
>> Les Atlas: No. By December, well, the results will be ready by the end of, probably a paper
ready by December. A draft.
>>: And there’ll be experiments to verify that.
>> Les Atlas: Excuse me?
>>: Who [inaudible] experiment measurement to see whether using this model as probability,
you can tell the spikes they are consistently [inaudible] statistic with the [inaudible]?
>> Les Atlas: Like Stone and his colleagues in Cambridge. Okay, and physiology people from
the neurophysiology department; Ian, I can't think of his last name, his first name is Ian, who I
was there with. They're working with guinea pigs and looking at single units in the inferior colliculus
and lower-level cochlear nucleus.
>>: But in the cochlea, in the ear, you've probably [inaudible], you probably have to go to a
high level.
>> Les Atlas: You can't really get to the fibers coming off the hair cells. My argument would be
in the fibers, the eight or so fibers coming off each hair cell, each outer hair cell, you might see
something that spikes at the peaks of the blue. So here's a way to look at it: spike probability’s
high here, it's pretty high here, but it’s higher here than it is here. That's an example.
>>: That's very cool.
>> Les Atlas: Okay? So it's a model that doesn't mathematically break. If you followed
the red, and spiked at every peak of the red, what's going to happen here? A big spike here; what
happens there? It's meaningless. Okay? So when you look at the results, when people try to
predict what the physiology is doing using the Hilbert phase or other measures that
go between plus one and minus one with a product model, they don't work so well. Okay?
That's really not a big change. Now, there's a little more on this. We could listen to these, and if
you listened to the Hilbert phase it sounds horrible, all clicky and stuff; if you listen to the fast
envelope it's much closer to intelligible, it doesn't have the distortion. That should be obvious.
But there are a few last points I want to make. We have all the subbands in the ear. And I want
to get back to, you know, an application of what this could be. And on the right-hand side is
a really interesting one. With OFDM, orthogonal frequency division
multiplexing, we do blind equalization; in fact, if you walk around holding a Wi-Fi or cell
phone or something like that, the reverb helps you. Why does it help you? Well, look at a
couple of different frequencies. If we're at frequency one, some subband, and the strongest
reflected path arrives out of phase, you ignore that subband; it's weak, or it's weighted a
lot less. Whereas if you're at some other frequency and the distances of these paths are just
right, the timing of them is just right, you get constructive interference, not destructive like you
do at frequency one. So if you look at a whole bunch of frequencies, divided up into a bunch of
orthogonal frequencies, which is what they do in OFDM, you're going to get something that
does much better. As you walk around, it's automatically scanning over the frequencies. Now,
there are other things: it's orthogonal, which the ear isn't; there's a carrier channel coming
through, which the ear doesn't necessarily directly have. I'm just arguing for this kind of analogy.
And there are also carrier recovery issues; it's a little more complicated than that.
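A sketch of that frequency diversity effect for a hypothetical two-path channel, a direct path plus one reflection; the 0.5 millisecond excess delay and 0.9 reflection gain are arbitrary assumptions.

```python
import numpy as np

# Direct path plus one reflection: h(t) = delta(t) + 0.9 * delta(t - tau).
tau = 0.5e-3                        # excess path delay in seconds (assumed)
freqs = np.array([1000.0, 2000.0])  # two subcarrier frequencies in Hz

# Channel frequency response H(f) = 1 + 0.9 * exp(-j * 2 * pi * f * tau).
H = 1 + 0.9 * np.exp(-2j * np.pi * freqs * tau)
for f, h in zip(freqs, H):
    print(f"{f:6.0f} Hz: |H(f)| = {abs(h):.2f}")
# 1000 Hz: the reflection arrives out of phase -> destructive, |H| ~ 0.1.
# 2000 Hz: the reflection arrives in phase -> constructive, |H| ~ 1.9.
# An OFDM receiver keeps all the subcarriers but weights the weak ones less.
```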
So, we can compare what we have; this is kind of from Malcolm's work. What's the
difference between what we are arguing for now, which is this fast envelope, versus, say,
Malcolm Slaney's correlogram? In a correlogram, you're looking for this kind of information by
doing correlations on what's coming out of every subband, where this axis is the frequency of the
subband and this axis over here is autocorrelation lag. We might just contrast those two. In a
correlogram you've got perfect symmetry about the center points of the autocorrelation, perfect
symmetry right here and here, and a slight amount of time variation. In this case over here, we don't
have the exact perfect symmetry, but there's a much bigger difference: this correlogram is all
nonnegative real. Okay? This picture on the left can go
negative. Okay? And what's happening here with the picture on the left is, everything that's
blue is a zero crossing. And that zero crossing and that asymmetry is a very different
representation. The thing on the right is squared; it's a quadratic version of the signal. The
thing on the left, there is a rectifier in there, but there is the potential for it to be closer to obeying
superposition.
>>: [inaudible] this is for one segment of a vowel, it's [inaudible], it's not something you have to
change over time?
>> Les Atlas: Single speaker, one segment of a vowel about six, seven pitch periods.
>>: But the time is different in the correlogram; the axis is the time delay.
>> Les Atlas: This is samples here, and this is time in milliseconds.
>>: Oh-
>> Les Atlas: So they're not the same. They're both time, but different units of time.
>>: Can you plot that fast envelope [inaudible] something also in the same way that you do
correlation?
>> Les Atlas: Oh, it kind of is here. The only difference is that this is a sample index, and this is
actually, Scott converted this to actual time in milliseconds. So they really are equivalent. The
durations of the left one and the right one are pretty close, not exact. They really have
the same time axis; it's just that the units are different in this plot.
>>: I see.
>>: There's a straightforward mapping [inaudible] for the general frequencies so-
>> Les Atlas: Yeah, there is a straightforward mapping. There's a straightforward mapping of
this axis, of that axis; these axes are identical. But the axis coming out of the screen, okay, the
dependent variable, is wildly different. That's the big difference.
>>: How do you experimentally verify which one's more, you know, realistic?
>> Les Atlas: Well, I really don't know what to do with these, but the physiological experiments
that are going on in Cambridge right now are trying to confirm whether that fast envelope does
a better job of predicting phase locking.
>>: I see.
>> Les Atlas: For signals that are, for example, beating; and he's also using chirp sounds from
guinea pigs, their natural vocalizations, a set of signals that have the characteristics of the things
I've demonstrated for you. Let me make a few more summary points here.
The main point I'm making is that the envelope and its fine structure, which I call TFS, for
temporal fine structure, are additive and not multiplicative. You don't have these problems if they're
additive; you're not forcing the temporal fine structure to have these discontinuities or strange
behavior. Additivity is an easier decomposition to work with. If you align fast envelopes, it's a
good way to do segregation. We can also allow for, for example, an observation that happens in the
auditory system: neighboring subbands in speech that overlap heavily allow
for sharper tuning curves for loud sounds. How does that happen? Well, allowing the fast
envelopes to be additive can reinforce the tuning curve, or the narrowness of a subband, which
is actually quite narrow in the auditory system. And all of this is something called frequency
diversity. It's potentially a new view of pitch modeling; I was talking to Bob Carlyon about this
when I was in Cambridge.
There's a need for something more general than autocorrelation features, and this is, when Bob
and I talked, and when I was talking about the correlogram, there are a couple of things the
correlogram doesn't predict that happen. One is that the correlogram is not sensitive to forward versus
backward order. You could take a signal and reverse its order; the correlogram doesn't change,
okay, assuming the signal, for example, is a steady vowel. If you reverse it, the correlogram is
identical, whereas what we're showing with the fast envelope changes. And even for a transient
signal, if you reverse its order, the correlogram, over a certain interval, will not change.
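That insensitivity follows directly from the definition of autocorrelation (a standard identity, stated here for completeness). For a real signal x(t), reversing time leaves the autocorrelation unchanged:

```latex
R_{x(-t)}(\tau)
  = \int x(-t)\,x(-t-\tau)\,dt
  \;\overset{u\,=\,-t-\tau}{=}\;
  \int x(u+\tau)\,x(u)\,du
  = R_x(\tau).
```

So any purely autocorrelation-based representation, the correlogram included, assigns a time-reversed steady vowel exactly the same features, while the rectified fast envelope does not.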
Roy Patterson's auditory image: this is another person at Cambridge that I worked with. He
talks about a stabilized image, and what we're talking about with the fast envelope is a
computationally, or mathematically, defined way of doing what he calls an auditory image.
There's better potential for analysis than with Hilbert or other product models; the fact that it's an
additive model means that it can be decomposed. So, that's it. Done with 2 minutes left. Thank you.
>>: Can you say something about how the bandwidth of these filters [inaudible]?
>> Les Atlas: Okay, I don't want to fix on any two frequency endpoints, but let's just
say the lowest frequency of the bandpass filter is just lower than the lowest pitch you'd ever
get. Because you want to capture the pitch period, in speech that is; you want to capture the
periodicity of your pitch. Let's make it 50 hertz or 70 hertz or something like that. Let's let its
highest frequency capture the fastest phase locking of whatever mammalian species you're
working with. Now, cats phase lock at a higher frequency than humans do. The
phase locking of humans gets argued to be between 1.5 kilohertz and up to 8 kilohertz, somewhere in
that range for the high end. So it's basically everything outside of that low-frequency envelope-
>>: So it doesn't matter how big it is as long as-
>> Les Atlas: It can be large, yes. Now, there's going to be a certain subband width you start with,
which is of course going to limit it.
>>: I see. Does it come from this complementary-
>> Les Atlas: All the complementary information is in the fast envelope. Now let me be a little
more clear about what the theory of complementary statistics says. Okay. Everything that's down
in the slow envelope is the circular, or non-complementary, part. There's nothing which is
complementary in the slow envelope; it's all left out. But when you go to the fast envelope,
you have both.
>>: I see.
>> Les Atlas: You have a mixture of both. That's what the theory says, and that's what you get in
practice. That's a fancy way of saying that when you look at that fast envelope, it still has
modulation on top of it, but it's got other information buried in there that you wouldn't have
had in that slow envelope.
>>: Okay. So going back to my original question, which information do you [inaudible] as the
message, and which information do you [inaudible] as the carrier? And do you mix them together,
or do you consider them to be separate as [inaudible]?
>> Les Atlas: You know, I would have to say that, in terms of how the ear works, for
example, or how a speech recognizer should work, you should use both. Okay? Both, the best
you can. Why throw out any of it? But if you have pristine
conditions, that is, one talker, high SNR, you can use only the envelope and use noise as a
carrier. That's Bob Shannon's experiment; that was in Science. Eight subbands, and you get
intelligible speech. You throw some noise in there, you have a lower SNR, it falls apart. You
throw two talkers in, it falls apart. What does that suggest? It suggests that this fast envelope
is needed for noise, for multiple talkers, or for reverb, the things that would make the noise-excited
vocoder break down.
>>: Thank you very much, Les, for the wonderful talk. Let’s thank these people.