>> Mike: Good afternoon, everyone. It's my great pleasure to welcome Professor Rich
Stern from Carnegie Mellon University here. Rick is a professor in the department of
electrical and computer engineering and has been so since 1977. Is that right?
>> Richard Stern: January 10th.
>> Mike: January 10th, 1977. And he has a joint appointment in computer science and
the Language Technologies Institute and the biomedical engineering department and,
very much a renaissance man, a lecturer in the music department.
>> Richard Stern: Yes.
>> Mike: And he is best known for his work in robust speech recognition and in
auditory perception, and I think he's going to give us an interesting talk today on the latest
work on combining those two areas by applying physiologically motivated models to speech
recognition. Here's Rick.
>> Richard Stern: Thank you. Thank you, Mike. And thank you all of you. I mean, I'm
absolutely touched that all of you came here with this terrific talk right next door on
Avatar. [Laughter] I know it's good because I saw it yesterday.
The only thing that I can say is my understanding is that all of these talks are recorded
and, you know, the reward for staying here now is that you can see this firsthand and
then watch the Avatar talk on the video.
As usual, I don't do any work anymore. Everything that I'm talking about today is work
that various students of mine have done. In addition to everybody who is listed,
many of whom you already know, Yu-Hsiang Bosco Chiu and Chanwoo Kim contributed
most of the work that I'll be talking about today, and I'll try to identify their contributions as
we encounter them.
As -- well, as you just heard, but most people have forgotten by now because it's such
ancient history, I was originally trained in auditory perception. My thesis was in binaural
hearing. And, let's see, over the last 20-some-odd years -- it's hard to keep track -- I've
been spending most of my time trying to improve the accuracy of speech recognition
systems in difficult acoustical environments. And today I'd like to, as you heard, talk
about some of the ways in which my group, and for that matter, those across the land
and other lands have been attempting to apply knowledge of auditory perception to
improve speech recognition accuracy.
And I'll -- in the beginning I'll talk about some general material that applies pretty much
everywhere, and then, you know, as time goes by, of necessity, I'll have to be more
specific about things that we've done.
Approaches can be more or less faithful to physiology and psychophysics. Today I'm
wearing my engineering hat rather than my science hat, so the only thing that matters is
lowering the error rate, but hopefully paying attention to how the auditory system works
might be able to help us do that.
So the big questions that we have are how can knowledge of auditory physiology and
perception improve speech recognition accuracy? Can speech recognition results the
other way around tell us anything we don't already know about auditory processing? So,
you know, we'll see.
So what I'd like to do is start by reviewing some of the major physiological psychophysical
results that motivate the models. I'll do this pretty quickly since I'm talking to an expert
audience here and all of you know most of this stuff or have already heard it before.
I'll briefly review and discuss some of the classical auditory models of the
1980s, which was the first renaissance for physiologically motivated processing. Stephanie
Seneff, Dick Lyon, and Oded Ghitza are the ones that come to mind. And then talk
about some major new trends in today's models and talk about some representative
issues that have driven our own work in recent years.
So here's robust speech recognition circa 1990. [Laughter] You should recognize that
guy on the left here. [Laughter].
Some of you may not know, this is Tom Sullivan.
>>: He looks the same.
>> Richard Stern: Hockey player. [Laughter]. And on the right is Ochi Oshima. Tom
and Ochi actually had a band with a couple of other students, Dean Rupi (phonetic),
and in fact, at one time, I had a fantasy of forming a rock group with my grad students.
That never got off the ground.
Another year I had the fantasy of running the Pittsburgh marathon as a relay team with all
my grad students. That never happened either. But nevertheless, they did do some
interesting work in speech recognition.
Alex, of course, was focused on statistical approaches that enabled us to use some
nonlinear processing to improve recognition accuracy in situations where we had both
noise and filtering.
This is -- Tom did early work on array processing that motivated some of the work we
have now. And Ochi did early work on physiologically motivated processing.
So that pretty much gives you kind of a good idea of some of the things we were
looking at circa 1990.
Paul, when were you there? I forget your dates. Paul's another former student of mine.
But he was wise enough to --
>> Paul: (Indiscernible).
>> Richard Stern: Okay. And you did your Ph.D. thesis on correlation.
>> Paul: Yeah.
>> Richard Stern: Using silicon, with ray-car-lee (phonetic). A topic close to my heart.
>>: That was Windows 95.
>> Richard Stern: Yeah. That was Windows 95. So fast forward ten years, and what do
we get here, we get -- [laughter].
>>: Nice. I think I did sit at that desk.
>> Richard Stern: This was Mike in his big hair days. [Laughter].
>>: That's really cute. Aye-yi-yi. Is this where you (indiscernible) making your slides?
[Laughter].
>>: Is that a wig or is that a PhotoShop?
>>: That's what I call nasty. [Laughter].
>> Richard Stern: I was inspired by the guys next door to use digital imagery to tell a
story.
The truth is I didn't have a photo. Mike actually did occupy the same desk. And he did
sit there. And that's not too unreasonable because Tom apparently took the better part of
ten years to finish. And Ochi was running regularly. It was because he was more
invested in hockey I think than speech recognition. And he is still actually here. So it's a
pseudo realistic shot.
Anyway, and this is Mike. All right. This is Mike?
>>: I don't have those jeans, for the record.
>> Richard Stern: Well, I did a little bit of image modification. It was subtle. So if you
look hard you can detect it.
Anyway, and what was going on during that time is Mike, of course, did very interesting
work again using array processing from a different standpoint. Other students at the
time, probably the best known one was Bhiksha Raj, who did seminal work in
missing feature recognition before moving on to MERL and later back to CMU. Bhiksha
and Paul were in the same place at the same time. I don't know how much
you worked together, but he talked about you very -- very lovingly. Yeah. Yeah.
Also of course, Juan Huerta was working on telephone speech
recognition with money that we had from Alex's old group, from Telefónica. Let's see.
Who else was there then? Pedro Moreno extended Alex's work developing the VTS
algorithm with Bhiksha Raj. Still widely used. And let's see. I'm forgetting all of them.
But anyway, we did a bunch of other stuff then. And now ten years later, this is our third
Microsoft employee, except that immigration hasn't figured out that he's a Microsoft employee
yet. But they will. This is Yu-Hsiang Bosco Chiu. And his last day at his desk, the same
desk, I might add -- [laughter].
>>: (Indiscernible).
>> Richard Stern: He cleaned it up because he spent the previous week shipping
everything in like two boxes. This is very much the end game. This of course must be
his offer from Microsoft. I'm not sure.
But anyway, during that time, with Bosco and his cohort, Chanwoo Kim, we've been focusing
more on methods that are related to perception and production of sound, kind of going
back to our roots in that sense. And I have another student, [indiscernible],
who is now at Yahoo, who was working on speech production approaches, and another
student, Blingwin Goo (phonetic), working on another form of signal separation.
So we have many years, same desk. And you know, whoever occupies that desk now will
be here in another ten years, some day. [Laughter] I'm not sure they will be, but we'll
see. [Laughter]
So I don't need to tell you too much about this. This is standard you know auditory
anatomy culled from the web. We don't need to draw pictures anymore because we
have this.
Most important thing: air comes in, the tympanic membrane moves here. These are a bunch of
levers, and then the tympanic membrane -- I'm sorry, the tympanic membrane back here
moves the hammer, the anvil, and the stirrup, which drive the cochlea. And I used to have
these, you know, diagrams by Bill Rhode, in fact, about cochlear mechanics.
We don't do that anymore because now we have glitzy animations. At least I thought we
did. Let's see. There we go. This I also didn't do. I got it from the web. Just tells you
what goes on inside. Pretty cool, huh?
>> [Audio file played as follows]: So here we are with a view of the cochlea. The
cochlea now uncoils. We look at the basilar membrane. And now see what happens
when we play individual tones. (Sounds played.) Now a chord. (Sounds played.) And
finally something really complex. (Music played.)
>> Richard Stern: It goes on for a while. I should really credit where that came from.
And to be truthful, I'll have to go back and look. [Laughter]. But I stole it without
attribution obviously from a website. If you look up, you know, cochlear animation or
something like that, you will find it too. And I apologize for that.
We'll spend a few more minutes talking about a couple of representative physiological
results. This was a curve showing the relative -- so after the cochlea, and again
the important thing there that I hope that you saw was that there's a so-called
pitch-to-place transformation that takes place. The mechanics that give rise to this
have been studied extensively. Basically, the basilar membrane, which is a
membrane inside the cochlea, has stiffness that varies as you move along and also a
density that varies as you move along so that different locations have different resonant
frequencies.
There are many other nonlinear mechanisms that come into play that I won't even
attempt to characterize but the short end of the story is that again, the input is here, the
output is here, the high frequency sounds excite this end of the thing, the low frequency
sound pretty much excite the whole thing but more over here than back here. And that's
sort of what you saw.
Now, stuck to the cochlea -- I'm sorry. Stuck to the basilar membrane are tens of
thousands of fibers of the auditory nerve, each innervating a local region in the
cochlea. And when they move, they excite a neural pulse that gets piped on upstairs
through the brain stem, ultimately to the brain. This is a sufficiently reduced description
that any physiologist would cringe, but it's all that we need to talk about here because we
don't go very far.
This is the response that shows the relative number of firings, statistically. This is from
the group that Nelson Kiang led, and this is from the 1965 treatise from [indiscernible] at
MIT.
The signal is a tone burst that goes on here, and what I wanted you to be
observant of is that there's a burst of activity that settles down more or less to a steady
state. And then when it's off, there's a depressed amount of activity that sort of comes
back up to a spontaneous baseline level.
So you get kind of an enhancement of onset and offset as well as the steady-state response.
As you saw before, the individual -- the cochlea is frequency selective. This mapping is
preserved in the auditory nerves. These are so-called tuning curves. And each one of
these curves represents a different fiber of the auditory nerve and shows the intensity
needed to elicit a criterion response as a function of frequency.
And what's important here, note that we have a log scale, is that the units are frequency
selective. The fact that the triangles are roughly the same shape at high frequencies
indicates that the filters, as it were, are approximately constant Q at high
frequencies. It looks like they're stretching at low frequencies, but that's a consequence
of a log scale. They're approximately constant bandwidth at low frequencies.
>>: Is that the right scale? Seems like [indiscernible].
>> Richard Stern: Well, it does. These are cats. [Laughter]. And cats have smaller
ears. They have -- everything kind of scales proportionally.
I -- there's a lot more I could have said about the experiment, but all of the physiological
data are, you know, from cats. Yeah. These are all cats. But -- and they do indeed
have a higher frequency scale.
This is a so-called rate intensity curve. And what we're looking at is intensity. This is
actually with the [indiscernible]. This is the spontaneous rate on top of this, which has been
subtracted off. And all that I want to say about this is that it's roughly S shaped. There's
an area in the middle where it's approximately linear with respect to the log of intensity.
This linearity actually is one of the motivating things that gives rise to the decibel
scale, which also codes things that are linear with respect to the log of intensity. Not the only thing
though. And there's a cutoff region here and there's a saturation here.
Some units saturate more than others. That's a discussion I'm not going to get into
right now. Yvonne?
>>: The numbers [indiscernible].
>> Richard Stern: Those are different frequencies. Those are best frequency of
response. And those are in kilohertz. There's nothing -- this thing actually -- the fibers
vary in the response with respect to what frequency they're most sensitive to. They also
have different spontaneous rates of firing. And if anything, that has more to do with the
ultimate shape than anything else. And you know, since the time that I've paid a
lot of attention to this, there's been a big discussion about inner versus outer hair cells [indiscernible], two
populations that have somewhat different properties. But that goes beyond a level of
complexity that the average auditory model for speech recognition takes into account and
in the interest of finishing before the sun sets and it's a long day in spring, I would think,
you know, I'll -- we'll omit some of those details.
Let's see. What else can we say? Oh, yeah. That's going backwards because I'm inept.
Over here we're looking at response to pure tones. And at low frequencies, well,
1100 hertz is a low frequency for a cat. And what's important here is that the relative
number of firings is, you know, if you have a sufficiently low frequency sound, the firings don't just occur randomly, but they're actually synchronized to the
phase of the incoming signal. And this is very important because that's the major cue that
enables us to keep track of cycle-by-cycle variability, which obviously is needed to
determine differences in arrival time for binaural hearing, again something dear to my
heart.
These studies show that the ability to respond in a synchronous fashion disappears as you get
above a certain frequency. And we have every reason to believe that that's cued to,
again, the size -- basically animals with big heads lose the ability to follow phase
information at lower frequencies. Cats -- we as humans, we infer through psychophysical
experiments, lose that ability at about one kilohertz. And cats maintain that, with their smaller
heads up to a higher frequency.
We have every reason to believe that that's keyed to the fact that you lose that information when it
would become confusing due to spatial aliasing considerations. If you have a delay that's
longer than half a wavelength, then you'll get, you know, unhelpful encoding of time
delay.
Okay. This just repeats what I already said, doesn't it? So we'll skip that.
This is a phenomenon called lateral suppression or two-tone suppression. First done by
Murray Sachs, who was a student of Nelson Kiang. His student, Eric Young, is still very
active at Johns Hopkins. Murray is near retirement.
Anyway, the idea here, this is tuning curve as we saw before for a particular unit with a
characteristic frequency of about 7 or 8 kilohertz. It's being -- there's a signal at that
frequency, a probe tone that's ten dB above the threshold for that unit. And the shaded
areas show combinations of frequencies and intensities for which the presentation of the
second tone will inhibit the response to the first tone.
This is very interesting because many of these frequencies -- I'm sorry, many of these
combinations are at frequencies and intensities such that if the second tone were
presented by itself, there'd be no response to it. So a second tone, even subthreshold at
an adjacent frequency, will inhibit the response to a primary tone at its frequency.
I -- I'm not going to say much more about this, but I believe that this enables you to get a
sharper frequency response without losing temporal resolution. And so it's another way.
This is the results of a psychoacoustic experiment, I won't go into great detail about it.
There's a long demo that if I had two hours to lecture, I'd play for you, but I won't now.
And basically, it's the results of several studies that have the goal of estimating the
effective bandwidth of the frequency resolution as a function of frequency. The fact that this
is linear at high frequencies implies, as before, that the system, perceptually measured,
is constant Q, just as the physiological results indicated.
And it's a little bit harder to infer what's going on down here. Some estimates have them
becoming constants. Some continue to have them decrease. But roughly speaking, we
get the same kind of functional dependence of resolution bandwidth as a function of
channel center frequency that we observed from the physiological data. And it
increases with center frequency, and the solid curve, this one here, is the so-called
equivalent rectangular bandwidth. That's one of three frequency scales that have been
used.
One of the things that you'll encounter fairly frequently are attempts to characterize the
dependence of bandwidth on center frequency, which basically, you know, again,
suggests that resolution is finer with respect to frequency at low frequencies than at high
frequencies.
We believe that the reason for this is that you want to have good frequency resolution at
low frequencies because this enables us to attend to formant frequencies, which change,
which we need to be able to do for vowel perception.
The existence of broad frequency channels at high frequencies enables us to develop
very sharp temporal resolution, which is important for certain consonant discriminations
or voice onset detections.
So by building a system that is narrow in frequency at the low end, we get good
frequency resolution at the low end, good for vowels, good time resolution at the high end,
good for consonants, and you have a system that's optimized for both. So seems like a
sensible thing to do.
These three representations, the Bark scale, the Mel scale, and the ERB scale, were
developed -- Bark by Eberhard Zwicker in Deutschland, and the Mel
scale by Smitty Stevens. Mel, actually, I found out many years later -- I thought it was a
guy named Mel. But it isn't. It's the shorthand for melody. I finally had to look up the
original paper and it's a footnote there. How many of you knew that? Probably not a lot.
Anyway, that's what the M from MFCC coefficients comes from. Now you know.
ERB, you just saw, was equivalent rectangular bandwidth. It was from Brian Moore,
who seems to have published a paper a month in JASA for the last 30 years. But anyway, that
was one of them.
These are the plots of the Mel scale, the ERB scale, and the Bark scale. And they're
normalized for amplitude. They look kind of the same.
If you manipulated the variance of the green curve, it would do a pretty good job of laying
on top of the blue curve. So the bottom line, as far as I'm concerned, is that all
of these more or less do the same thing. Doesn't matter which one you use, but
everybody seems to have their favorite.
Frankly, I don't think it affects recognition accuracy much at all. But --
>>: So where did the Bark come from?
>> Richard Stern: Barkhausen. It's an individual. It's the name of somebody.
Yeah. It's a contraction. Sorry about that. I don't know who he is or what he did, but
that's what it's from.
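[For reference, a minimal sketch of the three warped frequency scales just mentioned, using standard closed-form textbook approximations (O'Shaughnessy for Mel, Glasberg and Moore for the ERB rate, Traunmüller for Bark); these particular formulas are not from the talk itself.]

```python
import numpy as np

# Standard closed-form approximations to the three warped frequency scales.
# All three compress high frequencies relative to low frequencies, which is
# the point being made above: they all do roughly the same thing.

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def hz_to_erb_rate(f):
    return 21.4 * np.log10(1.0 + 4.37 * f / 1000.0)

def hz_to_bark(f):
    return 26.81 * f / (1960.0 + f) - 0.53

f = np.array([100.0, 500.0, 1000.0, 4000.0, 8000.0])
for name, fn in [("mel", hz_to_mel), ("ERB-rate", hz_to_erb_rate), ("Bark", hz_to_bark)]:
    print(f"{name:9s}", np.round(fn(f), 2))
```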
Okay. And the last one, equal loudness curves -- these are the Fletcher-Munson curves,
early psychophysical measurements from the 1920s, which showed absolute sensitivity as a
function of frequency. At higher intensities, the curves flatten out.
Okay. So basically, frequency analysis and parallel channels, preservation of temporal
fine structure, limited dynamic range in individual channels. I should have made more of
a to-do about this, but when we saw the rate-intensity curves, all of those curves
went from threshold to saturation within about 20 or 25 dB. And that's actually
something of a paradox because of course, you and I have a dynamic range of about 100 dB,
depending on where we look and how we count. And the fact that you can do that with
individual fibers says a lot about the fact that we look at this picture very, very
holistically, which is something that computers aren't so good at doing. But it's of
interest.
Enhancement of temporal contrast, enhancement of spectral contrast, onsets and offsets
and adjacent frequencies. And most of these physiological attributes have
psychophysical correlates, in fact I would say all of them. It took, you know, some were
discovered in the 1920s and some were not discovered until the 1970s or not confirmed,
but basically everything that I've talked about seems to be relevant for perception. The
question is, is it helpful for speech recognition? And I don't have the complete answer for
these, but we'll talk about some partial answers.
And most physiological and psychophysical effects are not preserved in conventional
representations of speech recognition. So that's the point of departure.
I'm not going to insult any of you by going through my usual walk-through of Mel frequency
cepstral coefficients. Just suffice it to say, for those of you who aren't familiar
with speech processing, that we take the input speech, multiply it by a Hamming
window, typically about 20, 25 milliseconds. Typically Hamming, but it doesn't have to be. Do a
Fourier transform, take the magnitude of that. Weight that triangularly with respect to
frequency, and that is supposed to be a crude representation of the frequency
specificity. Take a log of that and take the inverse Fourier transform of that, and you get
these things called Mel frequency cepstral coefficients.
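[A minimal sketch of the MFCC pipeline just described -- Hamming window, magnitude of the Fourier transform, triangular mel-spaced weighting, log, inverse transform (here a DCT). The frame length, FFT size, and filter count below are illustrative choices, not the speaker's exact settings.]

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters whose centers are equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_frame(frame, fs, n_filters=40, n_ceps=13, n_fft=512):
    win = frame * np.hamming(len(frame))              # Hamming window
    spec = np.abs(np.fft.rfft(win, n_fft))            # magnitude spectrum
    fb_energy = mel_filterbank(n_filters, n_fft, fs) @ spec   # triangular weighting
    log_e = np.log(fb_energy + 1e-10)                 # compressive (log) nonlinearity
    return dct(log_e, type=2, norm="ortho")[:n_ceps]  # "inverse transform" -> cepstra

fs = 16000
frame = np.random.randn(400)            # one 25 ms frame of fake "speech"
print(mfcc_frame(frame, fs).shape)      # (13,)
```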
The Mel comes from the fact that these triangular filters are spaced nonlinearly.
Originally, according to the Mel scale. And this was first proposed by a pair of
researchers at Bell Northern, Davis and Mermelstein. What was it? Around
1972?
>>: I think it was --
>> Richard Stern: '82? '82?
>>: 80s.
>> Richard Stern: It's Davis and Murmelstein (phonetic), yeah.
>>: '82.
>> Richard Stern: '82. I knew it had a two in it, not to mention a one and a nine.
Anyway, so let's take a look at what comes up. So this is an original speech
spectrogram. I know you can all read this. I took a course from Victor Zue
once in 1985 teaching me to read this. And what it tells me is that it's me speaking.
Actually, that you can't tell from that. But what you should be able to tell is that the utterance
is Welcome to DSP One. This is an example. So DSP One is over there. And this
is the spectrum recovered from Mel frequency cepstral coefficients. And you know, if I
take off my glasses -- I'm pretty damn nearsighted -- and walk to the back of the room, I
will get the same thing. But the general idea is that it's fairly blurry compared to the
original wideband spectrogram.
Now, in all fairness, part of that blurriness was deliberate because this was designed to
get rid of these striations which correspond to pitch. That was considered to be not part
of what was interesting. But nevertheless, it's blurry.
Some of the aspects of fundamental auditory processing that are preserved are the
frequency selectivity and the spectral bandwidth, so that the analysis is narrower at low
frequencies than at high frequencies, so that's consistent with physiology.
However, because of the fact that we use a window of constant duration, we don't really
take advantage of the opportunity to exploit better temporal resolution at the high
frequencies. We basically throw that away. It's an opportunity lost.
Wavelet schemes exploit time-frequency resolution better. Les Atlas in our own native
land of Washington, Seattle, has looked at this a bit. But I think it's fair to say that
wavelet analysis has not had a big impact on speech recognition so far. Otherwise
we'd all be using it, and we're not.
So it's gotten no better results and it's less simple so people are continuing to do what
they have been.
Also, there is -- the nonlinear amplitude response is encoded in the logarithmic
transformation that was part of the Mel cap representation.
There are a bunch of aspects of auditory processing that are not represented; one of them is the
detailed timing structure. Lateral suppression, enhancement of temporal contrast and
other auditory nonlinearities. And the list can go on and on. I just am running out of
space here, this PowerPoint. And so we'll take a look.
Now, interest in the auditory system began -- well, I mean, people have always been
interested in the auditory system. But potential interest in applying this to speech
processing began in the 1980s. There were a few seminal models, one of which, which
we actually looked at in the '90s or late '80s, was that of Stephanie Seneff. This was her
Ph.D. thesis, before she went on to work in natural language processing.
And basically, it assumed stage one was a filter bank, a critical band filter bank. Stage
two was a hair-cell model, which included the nonlinearities and a couple of temporal
things like short-term AGC.
And then there was a combination of envelope detector and synchrony detector. The
envelope detector is kind of like an energy detector and the synchrony was like a -- well,
that actually looked for the synchronization that I talked about before. And if you blow up
the second stage, you get a saturating half-wave rectifier, short-term AGC, a lowpass filter,
and another AGC.
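[A deliberately crude sketch of the kind of hair-cell stage just described: a saturating half-wave rectifier followed by lowpass smoothing as an envelope detector. The actual Seneff model also includes the two AGC stages and the synchrony detector, which are omitted here, and the constants below are arbitrary.]

```python
import numpy as np
from scipy.signal import lfilter

def saturating_halfwave(x, gain=10.0):
    # Half-wave rectify, then compress with a saturating nonlinearity.
    return np.tanh(gain * np.maximum(x, 0.0))

def envelope(x, fs, cutoff_hz=50.0):
    # One-pole lowpass filter as a stand-in for the envelope detector.
    a = np.exp(-2.0 * np.pi * cutoff_hz / fs)
    return lfilter([1.0 - a], [1.0, -a], x)

fs = 16000
t = np.arange(0, 0.05, 1.0 / fs)
channel = np.sin(2 * np.pi * 500 * t)      # output of one bandpass (critical-band) channel
print(envelope(saturating_halfwave(channel), fs)[:5])
```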
Basically, all of these things are computational approximations to what the
physiologists observed. And among all of the early ones, this is the one
that people used the most. And frankly, the reason was that Stephanie gave her code
away so everybody could use it, and it's a lesson about open source. It was
helpful.
Oded Ghitza had a -- I wouldn't say a competing model, but a complementary model. The
interesting thing about that was, again, there's a filter bank, and what was different
was they had a lot of different thresholds and level crossings associated with that. But by
the time you looked at everything and you were done with it, you had something that
looked fairly similar to what you would have gotten with the Seneff model. And similar -- there we go.
Similarly, Dick Lyon, who at that point originally was at Fairchild but later with Apple, had
a -- again a set of band pass filters that are kind of off the page. A more detailed model
of what was going on after that including a stage that explicitly modeled lateral
suppression. He also included an autocorrelation display. This led to the popular
correlograms. And he also introduced the idea of crosscorrelation as a mechanism
for auditory lateralization. And he was really ahead of the curve in the autocorrelation
and crosscorrelation.
I should mention that crosscorrelation is supported by physiology. The autocorrelation is
really not, to my knowledge.
>>: I was talking to Jordan Cohen, also used to work for IBM at that time.
>> Richard Stern: Jordan Cohen, his Ph.D. thesis was pitch, a model of pitch. And it
used -- it also used an autocorrelation. He finished around 1982. And indeed that was
contemporaneous with all of this. But I didn't include that because I didn't think that the
model had anything that the previous models didn't have. And his work particularly was
focused on pitch perception.
I met Jordy, gosh, for the first time at the '82 ICASSP, which was in Paris, and he
was presenting his work there. That's also when I met Dick Lyon. And Seneff I had
known from before, of course. We were a year apart at MIT. And Victor Zue was my TA in
DSP. That was when he was a grad student and I was -- well, we both were grad
students.
Anyway, one of the reasons why the Seneff model did not catch the world afire was this
one here. This is an analysis, the number of multiplications per millisecond. And on the
left, you have the various stages of the Seneff model, and over here we have LPC
processing. And MFCC processing was comparable to LPC processing at the
time so you could assume that it was about the same. So that was a -- that was a
deterrent. And this was, keep in mind, in the 1980s, computers were not very powerful.
Not very big.
I remember I was describing to somebody, I spent about, oh, God, about $6,000 or $7,000 to get
this big disk. It came, it was about this size. A Winchester drive. And it was eight
megabytes. And it was this wide and took a whole thing and had to -- it was hermetically
sealed. It was -- yeah. You may remember that disk. It had its own [indiscernible].
Anyway, so to summarize what was going on before, the models developed in the 1980s
included you know, kind of realistic auditory filtering, realistic auditory nonlinearities and
sort of in quotes, synchrony extraction, lateral suppression again, higher processing
through autocorrelation, crosscorrelation.
This is if you look across the [indiscernible] sample of models, every system developer
had his or her own private idea of what was important. And this varied quite a bit from
person to person. And that was -- you know.
So clearly there's no consensus, and not much quantitative evaluation actually
performed. Typically the paper would say we have this thing and then they'd show you a
display of like one sentence and say see how much better this looks than a -- anything
else. And I can understand this. I mean it was really hard to do a good job with this
because everything was so slow. And so I -- you know, I had an appreciation of that.
When we actually did get around to evaluating this -- and this was in part Ochi Oshima's
Ph.D. thesis, what we found was the following: Physiological processing didn't help at all.
Or certainly not much if you had clean speech. It gave us some improvement if we had
degraded speech. If we had a noise or recorded things with a distant microphone, it was
better.
However, the benefit that we got with physiological processing did not exceed what we
could get with more prosaic approaches such as CDCN, you know, which was Alex's
Ph.D. thesis. I don't know how much you hear about CDCN these days. But this shows
up in our stuff.
But in any case, we would do better with much lower computational costs just being good
engineers and forgetting about the physiology. And so, you know, it was disappointing
but true, and we couldn't ignore the reality.
There are also other reasons why they didn't work so -- why they weren't so successful.
One of them was that in those days, the conventional state of the art in recognition
systems was either DTW using conventional -- I don't know, just using a conventional, you
know, distance metric, or HMMs using, in those days -- Kai-Fu Lee's thesis was
single density, you know, discrete HMMs. And these all implicitly assumed univariate
Gaussians.
The distributions of the features that came out were very non-Gaussian processes. And
so there wasn't a good statistical match.
Ben Shigea (phonetic) in Interspeech -- wow, I guess in those days it was ICSLP, 1992,
and Biff, remember we shared a room? Watched the last game of the National League
playoffs in 1992? Yes, we did. We were there. The Braves defeated the Pirates.
Francisco Cabrera knocked in Sid Bream, former Pirate, in '92. Immediately thereafter,
Barry Bonds went to San Francisco and Bobby Bonilla went to the Mets, and
Pittsburgh never finished above .500 after that. Including now. So we're well beyond .500.
Anyway, more interestingly, Ben Shigea had a paper in which he compared physiological
approaches to conventional Mel caps, both with conventional HMM and with a neural
network classifier. And the physiological model really shone with the neural net classifier
because the neural net could learn the density, whereas in those days, at least, the
HMMs assumed Gaussian densities.
Nowadays, of course, we all use Gaussian mixtures, which in principle, you know, can
model any shape. So that's of less relevance.
Also, frankly, the more pressing need was to solve other basic speech recognition
problems. How to do large vocabularies, how do you integrate language models. This
was really kind of a boutique kind of thing.
So there wasn't a lot of attention paid to it. It was kind of a niche market and you know, it
consumed a lot of cycles, didn't provide any benefit, and so it was a small coterie of
aficionados.
Okay. So nevertheless, in the late 1990s, renaissance, number of reasons for that. One
of them was that computation no longer was a big deal. Wasn't -- not as much as before.
There were other attributes. Serious attention paid to temporal evolution. A lot of work in
modulation filtering, which became very popular. Attention also paid to reverberation, which
was not as obvious a problem in the old days, but once people started deploying
systems in real rooms, rather than working with a close-talking microphone, that to my mind was,
you know, in some ways at least as challenging if not more challenging than noise.
And also binaural processing became part of the mix, which is, to my mind, a good thing.
And more effective and mature approaches to information fusion as well. I'm not going to
talk a lot about that, but it's also I believe one of the factors that's motivating the
increased popularity.
>>: By binaural processing, you [indiscernible]?
>>: [Indiscernible]?
>> Richard Stern: Well, actually, I don't even mean that. I just mean two microphones.
Just two microphones. But you're right, strictly speaking, binaural recordings are with a
head, but most of us don't want to have a device with a head on it. And so again, I'm
trying to be pragmatic. What can we appropriate from what we learn about the system in
order to improve performance? And without necessarily overly lavish regard to
physiological or anatomical details like heads and ears and so forth. You know.
>>: So in this area, (indiscernible)?
>> Richard Stern: We can exploit things by exploiting interaural time delay.
So, by the way, Mike, I'm looking at the clock. I note that the time was budgeted for
90 minutes, which is 50 percent more than I thought it would be. Is everybody going to
leave at five -- or four, rather? At four?
>>: Typically it's 4:30.
>> Richard Stern: No, no, no, no, I'll finish by 4:30. My question is do I need to worry
about four?
>>: No.
>> Richard Stern: No. Okay. Okay. Great. Okay.
So let's talk a little bit about what we've been doing lately. More or less for the last 6 or
7 years. I'll talk a bit about end -- unfortunately I'm not going to be able to talk about
these in equal levels of detail but I'll be happy to stay as long as you have patience after
we're done to answer questions.
Representation of synchrony, shape of the rate-intensity function, revisiting analysis
duration, frequency resolution, onset enhancement, modulation filtering I'm not going to
say very much about but I can comment on, and binaural and polyaural techniques as
well as techniques derived from auditory scene analysis based on auditory common
frequency modulation.
So this is another physiological result by Eric Young and Murray Sachs. I mentioned
before, Murray Sachs was the guy who was behind -- who first recorded consistently the
lateral suppression effect that we saw before. In fact, that was earlier in his career, before he went
to the Hopkins biomedical engineering department. Eric Young was a former student of his.
He's now a big honcho there as well as in the ARO and the ASA.
And what we're looking at are physiological recordings of cats that are being hit by an
artificial signal that's actually a pseudo vowel generated by a computer. It's basically
sound waves with those intensities. This is a -- done in the late 70s, when, you know,
again, equipment was pretty primitive. And we're looking at the relative number of spikes
as a function of -- in response to this averaged over time. And these things are plotted
according to the characteristic frequency. And if you look at the response as a function of
frequency, you get an estimate of what the response profile is.
And the panels -- you probably can read them, but we're looking at the overall loudness
changes from 28-dB to 78-dB. So it's, you know, 60 or 50, you know, a range of about
50-dB in intensity. And these three arrows which you see occasionally, and I -- they're
kind of in odd positions because I got this by taking a picture and moving it around, but
these three arrows are in these positions. They indicate the original, you know, pseudo
formant frequencies of the original signal.
And basically the story here is that if you look at this and if you look at what happened
over frequencies, it's a big mess, and there's no invariance over intensity. And it
doesn't look like mean rate of firing is a very useful way of coding the spectral shape of
the vowel that's coming in. At least if you believe the results, based on that figure.
So coding using mean rate doesn't work. At least physiologically.
Now, this is the same thing using something called an average localized synchrony rate.
And what that was -- you saw earlier, I talked about the fact that the response was
synchronized to the phase of the signal coming in. The synchrony rate is the measure
of the extent to which it's synchronized. So if the things occur randomly anywhere within
the phase of the signal coming in, that synchrony measure is zero. And if it's completely
lock-stepped to the phase coming in, that's going to be one.
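[A small illustration of this kind of synchrony index: the classic vector strength of spike times relative to a stimulus frequency, which is 0 for firings spread uniformly over the stimulus phase and 1 for firings locked to a single phase. This is only an analogy to the ALSR measure, not the exact computation from the Young and Sachs paper.]

```python
import numpy as np

def vector_strength(spike_times, freq_hz):
    # Project each spike onto the unit circle at its stimulus phase and
    # take the magnitude of the mean: 0 = random phase, 1 = perfect locking.
    phases = 2.0 * np.pi * freq_hz * np.asarray(spike_times)
    return np.abs(np.mean(np.exp(1j * phases)))

rng = np.random.default_rng(0)
f0 = 500.0
locked = np.arange(100) / f0 + rng.normal(0, 1e-4, 100)   # tightly phase-locked spikes
random = rng.uniform(0, 0.2, 100)                         # no phase locking
print(round(vector_strength(locked, f0), 2))   # close to 1
print(round(vector_strength(random, f0), 2))   # close to 0
```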
And these are vertically displaced from each other, but the cute thing about this, again,
we're looking over a range of intensities, is that not only are the contours, including the
formant frequencies, very nicely preserved, but also that the same -- but also that the curves
(indiscernible) are invariant.
So this suggests synchrony is important. This by the way, was taken up by Campbell
[indiscernible], and you may recall. Were they there when you were?
>>: Yeah.
>> Richard Stern: Yeah. So --
>>: (Indiscernible).
>> Richard Stern: Yeah, probably. I forget exactly. I should look at that. The measure
actually -- they weren't necessarily synchronized also, I should say, to the actual
fundamental frequency. They would be synchronized to whatever the nearest harmonic
was of the fundamental frequency. So it's a little somewhat bogus statistic. But in any
case, the important information -- the implication, at least to me, is that cues about the
spectral content, you know, are certainly there in the synchronization. And better
preserved there than they are in the mean rate.
Now, when we take a look at Mel caps, really, what that is is a measure of
short-term energy, measured over a short time window, as a function of frequency.
So that's more like the mean rate which doesn't look very good at all.
So the question is, you know, can we harness that? And the answer, by the way, I'm
not going to keep you in suspense forever, is: Well, sort of.
But here is a fairly complicated model of signal processing. I forget Zhang's first name.
Do you remember? All right.
Carney is Laura Carney who is a physiologist. Used to be at BU, now runs a lab at
Syracuse. Zhang, who I believe is a woman, was a student of Carney.
Anyway, they had this fairly complicated model which actually looked better on my screen
than it does here, that has -- which by the way we're also using because it's available in
open source. The C code is right there on Carney's lab site, which makes it easy to
exploit.
One of the things that we did is we just asked the question that if we just took the output
down here, and this is one of the first things that Bosco did when he got to graduate
school, if you just take the output down here and use that as the basis, do we do any
better than if we -- than we would with Mel caps from that very complicated model?
And this by the way was very slow. Really, really, really slow. Even in 2006, it was
really, really slow. Much more complicated than (indiscernible) model. The boxes look
simple, but they're not. It's slow. It's a careful physiological model.
The general shape, this was Bosco's side, and this was Chanwoo Kim's side. Chanwoo
is a fan of complexity. But we did mean rate estimates, synchrony detection, looked at
synchrony low frequencies, mean rate at high frequencies. And then combined the two
and then used that.
In those days, it was very slow, mainly because, again, we're using the Carney auditory
nerve model. And these are results that we presented back in 2006.
If we look at some of these in the top curve, this is an original spectrogram, this is the
reconstruction using Mel frequency cepstral coefficients in the fashion as before; we just
turned the coefficients back into -- back into something that looked like a spectrogram.
And then down here, this is the auditory model.
You see, you know, with clean speech, you have this -- you still have the formant
trajectories, kind of nicely and cleanly represented. As we go to 20-dB SNR, this and this
are showing the effects of noise. This is particularly so at high frequencies, not because they're
bad at high frequencies, but the filters have a wider bandwidth at high frequencies. So
more noise gets into each channel. So this is just a -- if we had pink noise, this would be
the same across the frequencies.
But again, we still got, you know, pretty good preservation of the contours that we saw
before.
>>: (Indiscernible).
>> Richard Stern: Say that again?
>>: There is synchrony.
>> Richard Stern: Yeah, this is -- but with mean rate and synchrony. That's right?
>>: So why is it that the (indiscernible) so much more strongly than (indiscernible)?
(Indiscernible) you can see energy elsewhere.
>> Richard Stern: Yeah, it's not -- you mean here? This is --
>>: (Indiscernible) red colors everywhere else up there.
>> Richard Stern: I think this is an excerpt of the greasy wash water utterance that's, you
know, the dialect normalization sentence for --
>>: (Indiscernible).
>> Richard Stern: Say that again.
>>: (Indiscernible) down below a thousand? Usually it's the opposite.
>> Richard Stern: You know, I'd have to listen to the utterance. It might have been, you
know, kind of very (indiscernible) greasy. You know, it really would depend on how it's
pronounced.
>>: (Indiscernible).
>> Richard Stern: Yeah. In fact, that's what I said a moment ago, by the way. But you
were so much more eloquent. [Laughter].
And you can see it reflected in the fact that the noise shows up here as well.
Here you see it even more, at ten dB. And again, this is starting to get kind of really fuzzed
out, whereas these are still pretty well preserved. And you know, down at 0 dB, nothing
works. Sooner or later that was going to happen, and that's where it did.
So now let's look at the Wall Street Journal task with white noise and background music, and I
fear that these might be mislabeled. Yes, they are.
Please interchange in your mind the green curve and the blue curve. I keep intending to
do this and I never do. But the basic story is -- and these were results that actually
Bosco did and I think Chanwoo (indiscernible) them -- is this is mean rate. This is the
auditory representation -- this is -- I'm sorry. This curve, this curve and also this curve
here is Mel frequency cepstral coefficients. The green curve and again, I believe, the
blue curve on the right side, despite the fact it's not labeled that way, is the auditory
model with mean rate only. And the red curve is mean rate plus synchrony.
In this situation, the synchrony certainly did not give us, you know, very -- you know, very
impressive incremental advantage, especially considering the amount of effort that it
would take to calculate it, which I'm skipping over some of the details.
I don't regard this as a big success for synchrony, but one of the things that we do see is
that we see, you know, really a substantial difference between performance with the
auditory models and everything else.
Now, note by the way, that it is diminished quite a bit when you have background music.
In general, white noise we now understand is really easy relatively speaking, and if you
really want to impress somebody, you've got to work on music and you also have to work on
big tasks too.
>>: So in your work, did you find any difference between (indiscernible)?
>> Richard Stern: Well, you know, I agree. And the truth is that we did this with
artificially added noise because we had to.
>>: (Indiscernible) situation.
>> Richard Stern: Say that again.
>>: (Indiscernible).
>> Richard Stern: It has -- no. We've done other -- well, there are two issues here. One
is what's the noise source and the other is how's the noise combined. So as I understand
it, I have done only limited work with Aurora, and, you know, we have very different noises:
white noise, background music from the old Hub 4 task, various speech. And more
recently, we did some work for Samsung on [indiscernible], and they actually went
around with a microphone in a supermarket and, you know, a concert, an airport,
train stations, the street and so forth.
And also we've had natural noise samples from Telefónica from the work that they did
also recording things.
In terms of the noise type, things that are kind of quiescent are easy. So if you have
white noise and colored noise, it's fairly trivial.
If you have things that jump around a lot, and I don't talk a lot about this in this talk, but
you know, when I came here a few years ago, we talked more about that. For example,
background music is much more difficult to compensate for than white noise. And that's
partly because, you know, you got problems with the, you know, particularly vocal music
with the music -- with the background being confused with the foreground.
But also, the classical compensation algorithms like CDCN and VTS all began by sniffing a
piece of the environment for about a second or so and then using that for the
environmental parameters. And if the environment changes during that time, which it
typically will, you know, it won't be helpful.
In addition, impulsive things like timpani crashes or, for that matter, gunshots or
impulsive noises on factory floors are particularly susceptible to not working well.
Missing feature techniques such as the things that Martin Cooke at Sheffield and,
somewhat later but we think better, Bhiksha Raj did at Carnegie Mellon, are
more effective for impulsive noise. So there's a big issue with noise type.
Then beyond that, there's a question of how is the noise added. Probably the biggest
issue is that when you digitally add noise, you don't include any reverberant
effects. And in the real world, unless you're making -- unless you're measuring things in
an anechoic chamber, which especially in these days is unusual, or perhaps outdoors,
which even there is not really perfectly anechoic at all, that you're going to get echoes
and the echoes are really going to muck things up quite a bit. And that when you digitally
combine things as we do and as Aurora does, you lose cognizance of that.
So we've started to become more attentive to that, although again we calibrated in artificial
situations using, you know, the image model, but we found that things are always worse in the real
world, but that the things that we develop for a particular task -- and evaluate for -- will still provide
benefit for comparable tasks.
So, you know, I understand the concern. I share the concern. It's hard to do much
about it.
>>: (Indiscernible).
>> Richard Stern: Yeah, yeah, yeah, yeah.
>>: (Indiscernible). Natural noise (indiscernible). Is that our speakers (indiscernible) and
they receive noise and change their formants, but the noise is real noise. It's not
(indiscernible) generated.
>> Richard Stern: Well, these were --
>>: (Indiscernible).
>> Richard Stern: Yeah, we don't have any trivial algorithms. I mean, we don't exploit
anything -- none of these are --
>>: (Indiscernible) all the extremes that I found show that it was very, very good
approximation. Not for the rest. Not true.
>> Richard Stern: Well, I think for microphone arrays -- well, some of these things that
you can get singular solutions for, you know, you don't get the solutions when somebody
is in the room and then disturbing the model. I'm generally very skeptical of any
algorithm that tries to invert anything. It's because all of these inversion techniques are
very sensitive to numerical issues, they're very sensitive to having an exact model, which
you're never going to have in a real environment. We never do that.
>>: (Indiscernible) natural environment sometimes the channel effect (indiscernible). So
somehow, you know, there's a (indiscernible).
>> Richard Stern: Yeah. Let's continue for now. Just because I have 250 slides and
we're only -- no. I'm only kidding. [Laughter]. I don't have that many. But I'm not going
to tell you what the number is. [Laughter]. I know what it is. And we have to go on to the
next slide.
Anyway, a reasonable question: do the auditory models have to be so damn complex?
And here's the Carney and Zhang model, or Zhang and Carney model. And here's the
Chiu model, or this is just an easy model where you have Gammatone filters followed by
a nonlinear rectifier followed by a lowpass filter. This is what we tried at the end,
which is kind of a very crude abstraction, in other words, taking this piece, maybe this
piece and this piece, and leaving everything else out of it, especially this stuff over here.
And if we just did that, how would we do? And the answer is, well, pretty good. This is
Mel frequency cepstral coefficients. This is the simple auditory model. And this is the
more complicated auditory model. And so the -- this is one of these half empty/half full
situations. I mean, on the one hand you're still doing really well on the basis of almost 0
computation compared to just using Mel caps. On the other hand, you know, if you really
want to spend a lot more cycles, you can do better.
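[A minimal sketch of a "simple" front end of the kind being contrasted here: a gammatone filterbank, a rectifying compressive nonlinearity, and crude temporal smoothing. The channel count, the exponent, and the smoothing are illustrative guesses, not the exact parameters of the Chiu or Kim systems.]

```python
import numpy as np

def erb(fc):
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, dur=0.064, order=4, b=1.019):
    # 4th-order gammatone impulse response (Patterson/Holdsworth style).
    t = np.arange(int(dur * fs)) / fs
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * erb(fc) * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))

def simple_auditory_features(x, fs, centers_hz):
    feats = []
    for fc in centers_hz:
        y = np.convolve(x, gammatone_ir(fc, fs), mode="same")   # bandpass channel
        y = np.maximum(y, 0.0) ** (1.0 / 15.0)                  # rectify + compress
        feats.append(np.mean(y))                                # crude smoothing over the excerpt
    return np.array(feats)

fs = 16000
x = np.random.randn(fs // 4)              # 250 ms of noise as a stand-in for speech
centers = np.geomspace(200, 6000, 8)      # 8 channels, log-spaced center frequencies
print(np.round(simple_auditory_features(x, fs, centers), 3))
```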
And you know, the question that drives, you know, our work these days is can we do
better without spending that many cycles. And in order to do that, we have to have a
deeper understanding of what's going on than simply plugging in somebody else's code
at great expense computationally and then running it into our analyzers.
>>: Do you still see improvements with the adaptations?
>> Richard Stern: Do we still see improvements when there is speaker adaptation --
>>: Yeah.
>> Richard Stern: Excellent question. I don't know. We should try it.
>>: It's equivalent to asking the question that (indiscernible). That's a simpler version of
the question. The answer is it's hard to do that.
>> Richard Stern: Yeah.
>>: Right? (Indiscernible).
>> Richard Stern: We would. These were all trained clean, which was our religion back
from your days. We're, you know, moving away from that. And I will show you some
results from Aurora using the Aurora paradigm in a moment.
But that's a good question. The truth is I don't know. You know. We'll find out.
The one thing I can tell you is, once again, that we did one specific piece of work for
Samsung in which we, you know, basically -- it wasn't exactly the paradigm we're talking
about here. It was actually a much bigger task and more realistic noises, but the things
that were -- that I'm describing worked in that practical environment. They may not be
the best thing to do, but we're not completely doing what you would consider to be the
right experiments.
One obvious question to ask is, you know, in those models, you know, what really is
important. And this is a -- this is actually a part of Bosco's thesis. And what we're looking
at is we're looking into various different stages of the Seneff auditory model, looking at
performance. And this is now recognition accuracy rather than error rate as a function of
SNR, going in the other direction from what you would expect. That's why the
curve is going down.
But just the quick interpretation of this is that if you include an appropriate saturating half
wave rectifier, then you get good results. And if you don't, you get bad results or --
>>: So what is that --
>> Richard Stern: These are the good --
>>: -- (indiscernible).
>> Richard Stern: Well, we'll talk about that. So the previous slide doesn't belong there.
What this rectifier does is as follows. This is plotting log intensity and -- I put in a linear
scale, a logarithmic transformation. The straight line is what's implied by, you know, the
conventional log transformation in Mel frequency cepstral coefficients.
The -- this curve is kind of an abstraction of actually the curve that emerged from the
Zhang and Carney model. Is that correct?
>>: No. It's from the Seneff model.
>> Richard Stern: From the Seneff model.
>>: Yeah. Yeah, just (indiscernible).
>> Richard Stern: Okay. Sorry about that. Thank you for the correction.
Anyway, the idea is that at, for example, 20-dB SNR, which of course is relatively benign,
the speech kind of sits here in the graded portion of the curve, and the noise sits down
here. And so when you have noise by itself, you're there and the contribution of the
noise is relatively small. So you reduce the variability produced by the noise.
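[A toy numeric illustration of this argument: with a log nonlinearity, frame-to-frame fluctuations in low-level noise still move the feature noticeably, whereas an S-shaped nonlinearity puts the noise floor on its flat lower portion, so it contributes little variability. The sigmoid parameters below are arbitrary stand-ins, not the learned curve from the thesis.]

```python
import numpy as np

def log_compress(energy):
    return np.log(energy + 1e-10)

def s_shaped(energy, threshold_db=35.0, slope=0.3):
    # Logistic function of level in dB: flat near the floor, graded above threshold.
    level_db = 10.0 * np.log10(energy + 1e-10)
    return 1.0 / (1.0 + np.exp(-slope * (level_db - threshold_db)))

rng = np.random.default_rng(1)
noise_energy = 10.0 ** (rng.normal(20.0, 2.0, 50) / 10.0)   # fluctuating noise-only frames
speech_energy = 10.0 ** (50.0 / 10.0)                       # one strong speech frame

for name, fn in [("log", log_compress), ("S-shaped", s_shaped)]:
    spread = np.ptp(fn(noise_energy))                        # noise-induced variability
    dyn_range = fn(speech_energy) - np.min(fn(noise_energy)) # usable feature range
    print(f"{name:9s} noise spread / feature range = {spread / dyn_range:.2f}")
```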
And this is again from Bosco's thesis, a comparison of frequency response as a
function of -- this is channel index, but read that as a marker for frequency. This is
clean speech and noisy speech. The effect of the noise, of course, is to fill in the valleys
in the representation. And by using the nonlinearity, you still get some effect, of
course, but the correspondence is much closer. So that seems -- that seems to be
helpful.
This is a comparison of recognition accuracy obtained. And this is actually on the Aurora
test set. This is, thank you, Mike. And this is test set A. And it was trained and tested,
my understanding is, according to Aurora protocols.
And what we're looking at, the dark red triangles are results using Mel frequency cepstral
coefficients. The red triangles are using a baseline nonlinearity kind of taken I think in
this case from the other models, sort of out of the box, just fitting a curve to the results of
the physiological model. Carney and Zhang this time? Also Seneff?
>>: Yeah. I think it's (indiscernible).
>> Richard Stern: Okay. Doesn't matter. And the blue curve was the results that Bosco
was able to obtain using a routine that automatically learned the characteristics of the
nonlinearity or found the characteristic of the nonlinearity that produced best
performance.
>>: Was that something like the (indiscernible)?
>>: It's using the (indiscernible).
>> Richard Stern: It wasn't based on error rate. In other words, it's operating in open
loop fashion --
>>: (Indiscernible).
>> Richard Stern: No. It was done ahead of time. It was done in advance.
>>: (Indiscernible).
>>: So it's maximizing the (indiscernible).
>>: It's maximizing the (indiscernible).
>>: (Indiscernible).
>> Richard Stern: You're talking about clean speech?
>>: Clean speech.
>>: (Indiscernible).
>>: (Indiscernible).
>> Richard Stern: We have the numbers somewhere. I have your thesis, you know,
here, if you want to refer to it. But let's hold off on that for now.
Anyway, so nonlinearity helps.
All right. I want to talk for a few moments about the analysis window duration. Typical
analysis window, as you know, as we mentioned before, as you all know, is about 25 to
35 milliseconds for most speech recognition systems. In fact sometimes I've seen it go
down a little bit lower.
If you're trying to sniff the environment, you're better off looking over a longer duration.
Typically, 75 to 125 milliseconds, depending on the particular application. And this seems trivial in retrospect, but, you know, up until now, we've sort of been going frame by frame in everything we've been doing.
There's a pretty substantial win to be had just by looking over a longer window for estimating compensation parameters, and then drilling down to a shorter window. So basically you go frame by frame in a short-duration frame, look over a longer window, and then kind of move things forward.
We're not the first people who have done this, of course. But it is -- it's not as commonly done as it should be. So I thought I'd make mention of that.
>>: When you say compensation parameters, you mean like (indiscernible) CMN or
(indiscernible)?
>> Richard Stern: Well, we're doing -- nowadays we're doing everything online. So we
don't use CMN because CMN requires that you look at the whole utterance. So we only
have a look ahead of about a frame or two. So that, you know, and this is a real problem
because a lot of things like voice activity detection are dependent on having a model for
silence and we're constantly worrying about how to update those models dynamically
based on, you know, what things come in. And it's a tough problem.
Anyway, in all of those, you know, kind of adaptive parameter updates, typically the
update will be based on what we observe over about 75 milliseconds. But it will be
updated every frame, which is every ten milliseconds, of course. And the analysis will
still be applied to -- you know, the actual speech recognition will still be done in a 25-millisecond frame, 26.3 or whatever -- something that's an odd number, you know, something that derives from the sampling rate and a power of two. You know the way it is. But normally somewhere between 20 and 25 milliseconds.
So again, that's something that's worth noting. Chanwoo Kim calls this medium-duration or medium-time windowing. I don't think it's that profound, but it's worth doing. And I see that we're also missing a closed paren; maybe over here is where it goes. Sorry about that.
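As a sketch of the medium-time idea -- estimate over roughly 75 milliseconds, update every 10-millisecond frame, and then apply whatever compensation you compute to the ordinary short analysis frame -- here is a minimal Python version; the window half-width M and the simple averaging are illustrative choices, not the exact scheme used in the talk:

import numpy as np

def medium_time_power(frame_powers, M=3):
    # Average per-frame channel power over 2*M+1 frames (about 75 ms when
    # the frame advance is 10 ms and M = 3).  The estimate uses the longer
    # window but is still produced once per 10-ms frame.
    powers = np.asarray(frame_powers, dtype=float)   # shape: (frames, channels)
    padded = np.pad(powers, ((M, M), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * M + 1].mean(axis=0)
                     for t in range(powers.shape[0])])

def normalize_with_medium_time(frame_powers, M=3, eps=1e-12):
    # Illustrative use: normalize each short-time frame by the medium-time
    # estimate, so the compensation tracks slowly varying conditions while
    # recognition still runs on ordinary 20-25 ms frames.
    powers = np.asarray(frame_powers, dtype=float)
    return powers / (medium_time_power(powers, M=M) + eps)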
Frequency resolution. We looked at several different types of frequency resolution.
There's the MFCC triangular filters. Gammatone filters are wider, and originally -- in much of the original work -- I indicated that if you just take the Mel frequency cepstral coefficients and increase the bandwidth of the window, in other words replace the triangular window used in Mel frequency cepstral coefficients by Gammatone windows, you actually do better. In fact, fairly substantially better.
However, if you're also willing to go ahead and use a different nonlinearity, then the effect of the nonlinearity actually swamps the effect of the window bandwidth. So by not doing the right experiment, we can come to the wrong conclusion.
And we've looked at Mel frequency triangular filters, Gammatone filters, and truncated Gammatone filter shapes, which are just Gammatones, but when the weighting function gets down to a certain point, we just simply set it equal to 0. And this is useful because Gammatone windows actually go on for a long time. And if you are using them as weighting functions, by having them go on over a long range of frequencies, you're going to end up multiplying lots and lots of numbers that don't really contribute much to anything. So by using kind of a truncated Gammatone frequency weighting function, you can get the effects of the weighting functions without that.
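A small sketch of the truncation, assuming we already have each channel's Gammatone magnitude response sampled on FFT bins; the relative threshold of 0.5 percent of each filter's peak is an arbitrary illustrative value:

import numpy as np

def truncate_weighting(gammatone_weights, rel_threshold=0.005):
    # Zero out the long, low-amplitude tails of each Gammatone frequency
    # weighting so the spectral integration only multiplies bins that
    # actually contribute.
    w = np.asarray(gammatone_weights, dtype=float)   # shape: (channels, fft_bins)
    peaks = w.max(axis=1, keepdims=True)
    return np.where(w >= rel_threshold * peaks, w, 0.0)

def integrate_power(power_spectrum, weights):
    # Channel power = weighted sum over the FFT power spectrum, exactly as
    # with MFCC triangular filters, just with a different weighting shape.
    return np.asarray(weights) @ np.asarray(power_spectrum)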
Now, there's one exception to this. In certain situations when you're using frequency selection -- so if you're doing a missing-feature analysis, for example, in which you are selecting only a subset of time-frequency bands for representation -- then there's a question of how you fill in the missing features.
Now, really the best way to do this is using something like cluster-based analysis like Bhiksha Raj did. But if you don't want to spend all that time and energy and computation, a much cheaper thing to do is to simply use frequency weighting or frequency smoothing. And the effect of these frequency windows is basically to smear what you have. You're in effect convolving the frequency response that you have, which includes the missing components, with the frequency response of the windows, which actually varies with frequency. But if you can imagine this kind of frequency-varying convolution, you gain a lot. And there, having the wider resolution helps.
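A crude sketch of that frequency-varying smearing, standing in for full cluster-based reconstruction: unreliable time-frequency components are replaced by a weighted average of the surviving channels, with weights that can differ from channel to channel (wider at higher frequencies). The channel-to-channel weight matrix here is a hypothetical input, for example built from Gammatone-style weightings:

import numpy as np

def smear_missing(channel_power, reliable_mask, channel_weights):
    # channel_power:   (channels,) observed power, including unreliable bands
    # reliable_mask:   (channels,) 1 where the band is judged reliable, else 0
    # channel_weights: (channels, channels) frequency-dependent smoothing weights
    p = np.asarray(channel_power, dtype=float)
    m = np.asarray(reliable_mask, dtype=float)
    W = np.asarray(channel_weights, dtype=float)
    num = W @ (p * m)
    den = W @ m
    smoothed = np.where(den > 0, num / np.maximum(den, 1e-12), p)
    # Keep reliable channels as observed; only the missing ones get smeared values.
    return np.where(m > 0, p, smoothed)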
I think in general -- I'm sorry. The broader frequency (indiscernible) of the Gammatone helps. So in the broad range, in terms of everything, we do everything with Gammatone filters now. It never hurts. It helps in some situations. In a lot of situations, it doesn't make much difference. But that's what we have gleaned from that.
Effects of onset enhancement processing. This is a paper, again, that Chanwoo Kim is doing, and I apologize for not having more details here, but if anybody is curious about it, I'll send you the paper. It was just accepted for Interspeech 2010.
What's going on is that we have an auditory-based model, with the usual frequency analysis and nonlinearity. And then after the nonlinearity -- maybe before the nonlinearity, I need to check -- there's a mechanism, after the band-pass filtering and the nonlinearity, I think, that does a couple of things. One is that it takes a look at basically the envelope -- something that looks like a power envelope -- and subtracts off the -- basically causes the falling edge to fall away very quickly.
So what that means is that you pay a lot of attention to the rising edge of things and a lot less attention to the falling edge. And just interpreting the curves here, the baseline MFCC curve is blue -- and all of these have cepstral mean normalization. RASTA PLP is the red curve, which actually is worse than this. This, by the way, is music noise -- oh, jeez. I wanted to show -- I'm sorry.
I wanted to show not Resource Management but Wall Street Journal; we had numbers for that as well, but this is what we have.
Anyway, baseline MFCC with cepstral mean normalization is here. RASTA PLP with CMN is here. And just by doing this stuff, we get a bit of an improvement. Again, I look at the horizontal displacement in these curves, and it's only a couple dB. But more interestingly, in reverberation, we get a very big improvement in recognition accuracy. This is simulated reverb time going from 0 to 1.2 seconds. Again, doing nothing is down here. But adding this SSF processing gives us -- it still isn't great. I mean, we go down from 95 percent to 60 percent correct, but that's at 1.2 seconds of reverb time. There is much better preservation of word accuracy as a function of reverb time by paying attention, by getting rid of the stuff immediately after the first-arriving things.
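As a very rough sketch of the onset emphasis just described -- not Chanwoo Kim's actual SSF code -- here each channel's power envelope is compared against a slowly varying running estimate; frames on a rising edge are kept, and frames on a falling edge are pushed down toward a small floor, which is what suppresses the reverberant tails. The forgetting factor and floor are illustrative:

import numpy as np

def onset_emphasis(envelope, forget=0.4, floor=0.01):
    # envelope: (frames, channels) per-channel power envelope
    env = np.asarray(envelope, dtype=float)
    out = np.empty_like(env)
    lowpassed = env[0].copy()                 # slowly varying running estimate
    for t in range(env.shape[0]):
        lowpassed = forget * lowpassed + (1.0 - forget) * env[t]
        rising = env[t] >= lowpassed          # keep the rising edges
        out[t] = np.where(rising, env[t], floor * lowpassed)
    return out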
>>: (Indiscernible).
>> Richard Stern: Yes. Yeah.
>>: Nice.
>> Richard Stern: It is cute.
I -- because we're short on time, I'm not spending a lot of time talking about the
precedence effect, but that's, you know, it's very real. (Indiscernible) this processing is
monaural. So it's a nice result.
The other thing that is kind of interesting is that doing the processing improves the recognition accuracy a little bit in clean speech as well. It's not easy to see because of the,
you know, data point overload here. But that was surprising to me. It's consistent with
the fact that you're kind of differentiating the spectral envelope -- or the power envelope
coming in.
I don't know how well that's going to hold up in noise, though, because typically
differentiating things in noise is bad.
And these reverberation results were done in the absence of noise. However, when you
have both noise and reverb, you know, the noise will make things worse, but you will still
get the same hierarchy of results. It really does help.
>>: (Indiscernible) parameters talking about adaptation or temporal adaptation.
>> Richard Stern: It's a form of temporal adaptation, that's right.
>>: (Indiscernible).
>> Richard Stern: Again, I don't do anything. I just edit the papers and try to get the students to stay on message, but they do all the work.
This was -- there were a few parameters, and, you know, what you see are the results with the best parameters.
We typically do these over many different kinds of noises, and, you know, it is indeed the case that some parameter values are better for some of them. We try to find a set of parameters that looks the best for the noises that we consider. Typically, the suite of things that we tend to look at are white noise, speech babble, interfering speech, street noise, background music, and simulated reverberation.
Typically, of those, the most difficult kinds of interference are the individual speaker and background music; the most benign is white noise; and reverb is kind of (indiscernible) above everything else but also quite difficult.
Okay. I want to talk for a moment about an integrated front end. I really apologize for this. You know, just turn your heads on your sides if you don't mind for a moment. I tried to [inaudible] -- I need to redraw this clearly, but there are too many boxes. What we're looking at is Chanwoo Kim's PNCC algorithm. It stands for power-normalized cepstral coefficients. And this is a block-diagram comparison of MFCC processing, PLP processing, and PNCC processing. The most important things are: there's different frequency integration -- that's not a big deal; PLP uses its own function, MFCC uses the triangular filters.
There is a medium-duration power calculation -- I talked about that before; that's used for normalization. This ANS stands for asymmetrical nonlinear spectral filtering, and that's a version of the kind of onset enhancement that I talked about before. There's temporal masking, which, again, gives you the effect of the onset enhancement. And there's processing that crosses channels, and then a certain amount of normalization. So you get something that looks like cepstral coefficients.
There's code for this, by the way, available online, and a paper that we're writing, and papers that have already been published that cover most of the details in fairly cryptic form.
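Since the block diagram is hard to read, here is a toy, runnable Python outline of a PNCC-style front end that follows the stages just listed; it is not the released code, and the simple band averaging, noise-floor estimate, and constants are stand-ins chosen only to show the flow:

import numpy as np

def pncc_like_frontend(signal, sample_rate, n_fft=1024, hop=0.010, win=0.0256,
                       n_channels=40, n_ceps=13):
    # Toy sketch: short-time power spectra, channel integration (standing in
    # for Gammatone weighting), medium-time smoothing, a crude asymmetric
    # floor (standing in for the ANS / temporal-masking blocks), power
    # normalization, a power-law nonlinearity, and a DCT.
    signal = np.asarray(signal, dtype=float)
    hop_n, win_n = int(hop * sample_rate), int(win * sample_rate)
    frames = [signal[i:i + win_n] * np.hamming(win_n)
              for i in range(0, len(signal) - win_n, hop_n)]
    spectra = np.abs(np.fft.rfft(frames, n_fft)) ** 2          # (frames, bins)

    # Stand-in for Gammatone integration: average power in equal bands.
    bands = np.array_split(np.arange(spectra.shape[1]), n_channels)
    channels = np.stack([spectra[:, b].mean(axis=1) for b in bands], axis=1)

    # Medium-time smoothing (about 75 ms with 10-ms frames).
    M = 3
    padded = np.pad(channels, ((M, M), (0, 0)), mode="edge")
    medium = np.stack([padded[t:t + 2 * M + 1].mean(axis=0)
                       for t in range(channels.shape[0])])

    # Crude asymmetric floor: subtract a slowly evolving noise-floor estimate.
    floor = np.minimum.accumulate(medium, axis=0) * 0.9
    cleaned = np.maximum(channels - floor, 1e-3 * medium)

    normalized = cleaned / (medium.mean() + 1e-12)             # power normalization
    compressed = normalized ** (1.0 / 15.0)                    # power-law nonlinearity

    # DCT-II over channels to get cepstral-like coefficients.
    n = np.arange(n_channels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_channels)
    return compressed @ dct.T                                  # (frames, n_ceps)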
I just want to talk about performance, show a few performance comparisons. And I can talk later about what blocks give you what. But this, again -- MFCCs are down here. And this, by the way, is Wall Street Journal 5k in white noise. RASTA PLP is a little bit better. Not a lot better. This is Mel cepstra with VTS applied. Substantially better. And PNCC is up here.
Background music, as I indicated before -- the magnitude of the improvement is less, but this is still several dB. Again, in this case, the baseline is here. RASTA PLP is worse. VTS doesn't help much. We've known since '97 that VTS isn't very effective in music.
And we get some improvement, although, again, it's not as much as we'd like. There are
other things that we could do that I'm not going to show that show better results in
background music.
>>: So for 5k, what's the general (indiscernible)?
>> Richard Stern: Well, the most important thing here -- this is about 89 percent, 87, 88 percent. We always de-tune the language models because -- we do this more for Resource Management than for Wall Street Journal. This is one pass. We're not interested in all the things that are done to clean that up. We're only concerned with relative improvement. Especially with Resource Management, if you have any sort of language model at all, these just kind of snap into position, and it's, you know, just a very bad indicator of the quality of the acoustic models, just because the task is so trivial.
Now, Wall Street Journal, of course, is less susceptible to that, but suffice it to say, this is a very simple one-pass system that is not at all the kind of thing you do in an evaluation. And we really didn't work to optimize it at all.
The last thing is reverberation. Again, Mel cepstra are here; RASTA PLP is worse. As I mentioned to some of you, I've had discussions with Hynek Hermansky to, you know, confirm that we're not abusing RASTA PLP here. This is the implementation out of the box from the Dan Ellis website. And as far as he and I can tell, the numbers are legitimate.
Again, VTS does not provide any great improvement in reverberation once the reverb
time exceeds the frame duration. And again, because of some of the nonlinear processing we were talking about before, there is an improvement here.
>>: And that is no noise.
>> Richard Stern: In this case, I'd have to look. There may be ten-dB (indiscernible)
ratio.
>>: (Indiscernible).
>> Richard Stern: Say that again.
>>: In the variables (indiscernible) reverberation (indiscernible) no noise conditions.
>> Richard Stern: That's right. And I would have to check to see if there's noise here or not. There may not be, and I'm thinking that there isn't, because the 0 reverb time is about the same as it was before, so I suspect no noise.
Basically -- again, we have a bunch of things -- the SSF algorithm showed it's better in reverb than PNCC. PNCC is intended to be something that is best all around; it provides improvement in noise and reverb.
There actually, in the special case of music, is a different kind of noise compensation that gives a better result than we saw before. But the price that you pay is that the performance in clean speech goes down. In this situation, in all of these, there's no loss -- PNCC is just as good as MFCCs in clean speech. And that was more important for Chanwoo than it was for me. But it was there. That's all.
One thing -- we looked at computational complexity: MFCC -- and this is MFCC without VTS; if you add VTS, it's a lot more; VTS is relatively slow -- PLP, PNCC, and truncated PNCC. Truncated PNCC is, as I described before, truncating the frequency weighting and nothing else.
>>: So I thought for reverberation, a standard baseline is MFCC with low window and
then (indiscernible) and then (indiscernible).
>> Richard Stern: Yeah. You can do that. We haven't found a huge benefit from that.
No. You know. It helps a little bit. And you need a long window.
I hate to say this, but we are out of time. If anybody would like to stay, they can. But I
think that what I should do is skip everything in binaural hearing, some of which is
interesting, and skip to the end.
I will, if anybody wants to stay around, flip through this very quickly, but in all fairness to everyone else, let me start here.
>>: (Indiscernible). [Laughter].
>> Richard Stern: Yeah, but --
>>: (Indiscernible).
>> Richard Stern: A lot of them are skipped. A lot of them are skipped.
>>: It wasn't too (indiscernible) [laughter].
>> Richard Stern: So, now -- knowledge of the auditory system can certainly improve speech recognition accuracy. Some of the things we've talked about include the use of synchrony, although it doesn't help much; consideration of the rate-intensity function helps a lot; onset enhancement helps a lot for reverb. Selective reconstruction I didn't get to talk about, but it's useful, again, in reverberant situations. I'll elaborate on that very quickly if anybody wants to hang around -- give you the five-minute version.
And correlation-based emphasis, I also didn't talk about, is a binaural-based algorithm.
Consideration of processes mediating scene analysis, again, we didn't talk about here. We have some results based on comparison of frequencies. And the other question, of course, was: do our experiences in speech recognition inform students of the auditory system? And my answer for that is kind of a queasy maybe. And I'll tell you the reason for that: I had, coming in and in all my years as a hearing student, assumed that all of these nonlinearities and details in the auditory system were just -- were functions of fundamental limitations of physiological tissue, and were just an annoyance that we should be dispensing with, and that we should just model the whole thing as linear as possible, because we can as engineers.
And I'm certainly coming to appreciate the fact that -- or appreciate the proposition that
these details and these nonlinearities actually have some functional advantage for
processing signals such as speech in difficult acoustical environments. I don't feel that I
understand everything about them yet. But certainly my picture about auditory
processing and what it means has evolved over the years and hopefully will continue to
evolve.
Thank you for sharing these moments. And again, I'll quickly skim through some of the
things we skipped over for anyone who wants to see them, but this is the formal end.
>>: Thank you. [Applause.]