>> Mike: Good afternoon, everyone. It's my great pleasure to welcome Professor Rich Stern from Carnegie Mellon University here. Rick is a professor in the department of electrical and computer engineering and has been so since 1977. Is that right? >> Richard Stern: January 10th. >> Mike: January 10th, 1977. And he has a joint appointment in computer science and the Language Technologies Institute and the biomedical engineering department and, very much a renaissance man, a lecturer in the music department. >> Richard Stern: Yes. >> Mike: And is most well known for his work in robust speech recognition and in auditory perception, and I think he's going to give us today an interesting talk on the latest work on matching those two areas in applying physiologically motivated models to speech recognition. Here's Rick. >> Richard Stern: Thank you. Thank you, Mike. And thank you all of you. I mean, I'm absolutely touched that all of you came here with this terrific talk right next door on Avatar. [Laughter] I know it's good because I saw it yesterday. The only thing that I can say is my understanding is that all of these talks are recorded and, you know, your reward for staying here now is that you can see this firsthand and then watch the Avatar talk on the video. As usual, I don't do any work anymore. Everything that I'm talking about today is work that various students of mine have done. In addition to everybody who is listed, many of whom you already know, Yu-Hsiang Bosco Chiu and Chanwoo Kim contributed most of the work that I'll be talking about today, and I'll try to identify their contributions as we encounter them. As -- well, as you just heard, but most people have forgotten by now because it's such ancient history, I was originally trained in auditory perception. My thesis was in binaural hearing. And, let's see, over the last 20-some-odd years -- it's hard to keep track -- I've been spending most of my time trying to improve the accuracy of speech recognition systems in difficult acoustical environments. And today I'd like to, as you heard, talk about some of the ways in which my group, and for that matter, groups across the land and other lands, have been attempting to apply knowledge of auditory perception to improve speech recognition accuracy. And I'll -- in the beginning I'll talk about some general stuff that applies everywhere, and then, you know, as time goes by, of necessity, I'll have to be more specific about things that we've done. Approaches can be more or less faithful to physiology and psychophysics. Today I'm wearing my engineering hat rather than my science hat, so the only thing that matters is lowering the error rate, but hopefully paying attention to how the auditory system works might be able to help us do that. So the big questions that we have are: how can knowledge of auditory physiology and perception improve speech recognition accuracy? Can speech recognition results, the other way around, tell us anything we don't already know about auditory processing? So, you know, we'll see. So what I'd like to do is start by reviewing some of the major physiological and psychophysical results that motivate the models. I'll do this pretty quickly since I'm talking to an expert audience here and all of you know most of this stuff or have already heard it before. I'll briefly review and discuss some of the classical auditory models of the 1980s, which was the first renaissance for physiologically motivated processing.
Stephanie Seneff, Dick Lyon, Oded Ghitza are the ones that come to mind. And then talk about some major new trends in today's models and talk about some representative issues that have driven our own work in recent years. So here's robust speech recognition circa 1990. [Laughter] You should recognize that guy on the left here. [Laughter]. Some of you may not know, this is Tom Sullivan. >>: He looks the same. >> Richard Stern: Hockey player. [Laughter]. And on the right is Ochi Oshima. Tom and Ochi actually had a band with a couple of other students, Dean Rupi (phonetic), and in fact, at one time, I had a fantasy of forming a rock group with my grad students. That never got off the ground. Another year I had the fantasy of running the Pittsburgh marathon as a relay team with all my grad students. That never happened either. But nevertheless, they did do some interesting work in speech recognition. Alex, of course, was focused on statistical approaches that enabled us to use some nonlinear processing to improve recognition accuracy in situations where we had both noise and filtering. Tom did early work on array processing that motivated some of the work we have now. And Ochi did early work on physiologically motivated processing. So that pretty much gives you kind of a good idea of some of the things we were looking at circa 1990. Paul, when were you there? I forget your dates. Paul's another former student of mine. But he was wise enough to -- >> Paul: (Indiscernible). >> Richard Stern: Okay. And you did your Ph.D. thesis on correlation. >> Paul: Yeah. >> Richard Stern: Using silicon, with ray-car-lee (phonetic). A topic close to my heart. >>: That was Windows 95. >> Richard Stern: Yeah. That was Windows 95. So fast forward ten years, and what do we get here, we get -- [laughter]. >>: Nice. I think I did sit at that desk. >> Richard Stern: This was Mike in his big hair days. [Laughter]. >>: That's really cute. Aye-yi-yi. Is this where you (indiscernible) making your slides? [Laughter]. >>: Is that a wig or is that a PhotoShop? >>: That's what I call nasty. [Laughter]. >> Richard Stern: I was inspired by the guys next door to use digital imagery to tell a story. The truth is I didn't have a photo. Mike actually did occupy the same desk. And he did sit there. And that's not too unreasonable because Tom apparently took the better part of ten years to finish. And Ochi was running regularly. It was because he was more invested in hockey I think than speech recognition. And he is still actually here. So it's a pseudo realistic shot. Anyway, and this is Mike. All right. This is Mike? >>: I don't have those jeans, for the record. >> Richard Stern: Well, I did a little bit of image modification. It was subtle. So if you look hard you can detect it. Anyway, what was going on during that time is Mike, of course, did very interesting work again using array processing from a different standpoint. Other students at the time, probably the best known one was Bhiksha Raj, who did seminal work in missing feature recognition before moving on to MERL and later back to CMU. Bhiksha and Paul were in the same place at the same time. I don't know how much you worked together, but he talked about you very -- very lovingly. Yeah. Yeah. Also of course, Juan Huerta was working on telephone speech recognition with money that we had from Alex's old group, from Telefónica. Let's see. Who else was there then?
Pedro Moreno extended Alex's work, developing the VTS algorithm with Bhiksha Raj. Still widely used. And let's see. I'm forgetting all of them. But anyway, we did a bunch of other stuff then. And now ten years later, this is our third Microsoft employee, except that the IMS hasn't figured out that he's a Microsoft employee yet. But they will. This is Yu-Hsiang Bosco Chiu. And his last day at his desk, the same desk, I might add -- [laughter]? >>: (Indiscernible). >> Richard Stern: He cleaned it up because he spent the previous week shipping everything in like two boxes. This is very much the end game. This of course must be his offer from Microsoft. I'm not sure. But anyway, during that time, Bosco and his cohort, Chanwoo Kim, we've been focusing more on methods that were related to perception and production of sound, kind of going back to our roots in that sense. And I have another student, she -- [indiscernible] -- who is now at Yahoo, who was working on speech production approaches, and another student, Blingwin Goo (phonetic), working on another form of signal separation. So we have many years, same desk, of -- and you know, whoever occupies that desk now will, in another ten years, be here some day. [Laughter] I'm not sure they will be, but we'll see [laughter]. So I don't need to tell you too much about this. This is standard, you know, auditory anatomy culled from the web. We don't need to draw pictures anymore because we have this. Most important thing: Air comes in. The tympanic membrane moves here. These are a bunch of levers, and then the tympanic membrane -- I'm sorry, the tympanic membrane back here moves the hammer, the anvil, and the stirrup to load the cochlea. And I used to have these, you know, diagrams by Bill Rhode, in fact, about cochlear mechanics. We don't do that anymore because now we have glitzy animations. At least I thought we did. Let's see. There we go. This I also didn't do. I got it from the web. Just tells you what goes on inside. Pretty cool, huh? >> Audio file played as follows: So here we are with a view of the cochlea. The cochlea now uncoils. We look at the basilar membrane. And now see what happens when we play individual tones. (Sounds played.) Now a chord. (Sounds played.) And finally something really complex. (Music played.) >> Richard Stern: It goes on for a while. I should really credit where that came from. And to be truthful, I'll have to go back and look. [Laughter]. But I stole it without attribution obviously from a website. If you look up, you know, cochlea animation or something like that, you will find it too. And I apologize for that. We'll spend a few more minutes talking about a couple of representative physiological results. This was a curve that was showing the relative -- so after the cochlea, and again the important thing there that I hope that you saw was that there's a so-called pitch-to-place transformation that takes place. The mechanics that give rise to this have been studied extensively. Basically, the basilar membrane, which is a membrane inside the cochlea, has stiffness that varies as you move along and also a density that varies as you move along, so that different locations have different resonant frequencies.
There are many other nonlinear mechanisms that come into play that I won't even attempt to characterize, but the short end of the story is that, again, the input is here, the output is here, the high frequency sounds excite this end of the thing, the low frequency sounds pretty much excite the whole thing but more over here than back here. And that's sort of what you saw. Now, stuck to the cochlea -- I'm sorry, stuck to the basilar membrane are tens of thousands of fibers of the auditory nerve, each innervating a local region in the cochlea. And when they move, they excite a neural pulse that gets piped on upstairs through the brain stem, ultimately to the brain. This is a sufficiently reduced description that any physiologist would cringe, but it's all that we need to talk about here because we don't go very far. This is the response that shows the relative number of firings. Statistically, this is from the group that Nelson Kiang led, and this is from the '60s treatise from [indiscernible] at MIT. The signal is a tone burst that goes on here, and what I wanted you to be observant of is that there's a burst of activity that settles down more or less to a steady state. And then when it's off, there's a depressed amount of activity that sort of comes back up to a spontaneous baseline level. So you get kind of enhancement of onset and offset as well as response. As you saw before, the individual -- the cochlea is frequency selective. This mapping is preserved in the auditory nerve. These are so-called tuning curves. And each one of these curves represents a different fiber of the auditory nerve and shows the intensity needed to elicit a criterion response as a function of frequency. And what's important here, note that we have a log scale, is that the units are frequency selective. The fact that the triangles are roughly the same shape at high frequencies indicates that the filters, as it were, are approximately constant Q at high frequencies. It looks like they're stretching at low frequencies, but that's a consequence of the log scale. They're approximately constant bandwidth at low frequencies. >>: Is that the right scale? Seems like [indiscernible]. >> Richard Stern: Well, it does. These are cats. [Laughter]. And cats have smaller ears. They have -- everything kind of scales proportionally. I -- there's a lot more I could have said about the experiment, but all of the physiological data are, you know, from cats, yeah. These are all cats. But -- and they do indeed have a higher frequency scale. This is a so-called rate-intensity curve. And what we're looking at is intensity. This is actually with the [indiscernible]. This is the spontaneous rate on top of this, which has been subtracted off. And all that I want to say about this is that it's roughly S shaped. There's an area in the middle where it's approximately linear with respect to the log of intensity. This linearity actually is one of the motivating things that gives rise to the decibel scale, which also codes things linearly with respect to the log of intensity. Not the only thing though. And there's a cutoff region here and there's a saturation here. Some units saturate more than others. That's a discussion I'm not going to get into right now. Yvonne? >>: The numbers [indiscernible]. >> Richard Stern: Those are different frequencies. Those are the best frequencies of response. And those are in kilohertz.
There's nothing -- this thing actually -- the fibers vary in their response with respect to what frequency they're most sensitive to. They also have different spontaneous rates of firing. And if anything, that has more to do with the ultimate shape than anything else. And you know, kind of since the time that I've paid a lot of attention to this, there's been a big discussion about inner versus outer hair [indiscernible] two populations that have somewhat different properties. But that goes beyond the level of complexity that the average auditory model for speech recognition takes into account, and in the interest of finishing before the sun sets, and it's a long day in spring, I would think, you know, I'll -- we'll omit some of those details. Let's see. What else can we say? Oh, yeah. That's going backwards because I'm inept. Over here we're looking at response to pure tones. And at low frequencies, well, 1100 hertz is a low frequency for a cat. And what's important here is that the relative number of firings is, you know, if you have a sufficiently low frequency sound, they don't just occur randomly, but they're actually synchronized to the phase of the incoming signal. And this is very important because that's the major cue that enables us to keep track of cycle-by-cycle variability, which obviously is needed to determine differences in arrival time for binaural hearing, again something dear to my heart. These studies show that the ability to respond in a synchronous fashion disappears as you get above a certain frequency. And we have every reason to believe that that's keyed to, again, the size -- basically animals with big heads lose the ability to follow phase information at lower frequencies. We as humans, we infer through psychophysical experiments, lose that ability at about one kilohertz. And cats, with smaller heads, maintain that up to a higher frequency. We have every reason to believe that that's keyed to -- you lose that information when it would become confusing due to spatial aliasing considerations. If you have a delay that's longer than half a wavelength, then you'll get, you know, unhelpful encoding of time delay. Okay. This just repeats what I already said, doesn't it? So we'll skip that. This is a phenomenon called lateral suppression or two-tone suppression. First done by Murray Sachs, who was a student of Nelson Kiang. His student, Eric Young, is still very active at Johns Hopkins. Murray is near retirement. Anyway, the idea here, this is a tuning curve as we saw before for a particular unit with a characteristic frequency of about 7 or 8 kilohertz. There's a signal at that frequency, a probe tone that's ten dB above the threshold for that unit. And the shaded areas show combinations of frequencies and intensities for which the presentation of a second tone will inhibit the response to the first tone. This is very interesting because many of these combinations are at frequencies and intensities such that if the second tone were presented by itself, there'd be no response to it. So a second tone, even subthreshold, at an adjacent frequency, will inhibit the response to a primary tone at that frequency. I'm not going to say much more about this, but I believe that this enables you to get a sharper frequency response without losing temporal resolution. And so it's another way. This is the results of a psychoacoustic experiment; I won't go into great detail about it.
There's a long demo that if I had two hours to lecture, I'd play for you, but I won't now. And basically, it's the results of several studies that have the goal of estimating the effective bandwidth of the frequency resolution as a function of frequency. The fact that this is linear at high frequencies implies, as before, that the system, perceptually measured, is constant Q, just as the physiological results indicated. And it's a little bit harder to infer what's going on down here. Some estimates have them becoming constant. Some continue to have them decrease. But roughly speaking, we get the same kind of functional dependence of resolution bandwidth as a function of channel center frequency that we observed from the physiological data. And it increases with center frequency, and the solid curve, this one here, is the so-called equivalent rectangular bandwidth. That's one of three frequency scales that have been used. One of the things that you'll encounter fairly frequently are attempts to characterize the dependence of bandwidth on center frequency, which basically, you know, again, suggests that resolution is finer with respect to frequency at low frequencies than at high frequencies. We believe that the reason for this is that you want to have good frequency resolution at low frequencies because this enables us to attend to formant frequencies, which change, which we need to be able to do for vowel perception. The existence of broad frequency channels at high frequencies enables us to develop very sharp temporal resolution, which is important for certain consonant discriminations or voice onset detections. So by building a system that's narrow in frequency at the low end, we get good frequency resolution at the low end, good for vowels, good time resolution at the high end, good for consonants, and you have a system that's optimized for both. So it seems like a sensible thing to do. These three representations, the Bark scale, the Mel scale, and the ERB scale, were developed -- the Bark scale by Eberhard Zwicker in Deutschland, and the Mel scale by Smitty Stevens. Mel, actually, I found out many years later -- I thought it was a guy named Mel. But it isn't. It's the shorthand for melody. I finally had to look up the original paper and it's a footnote there. How many of you knew that? Probably not a lot. Anyway, that's where the M from MFCC coefficients comes from. Now you know. ERB, you just saw, was equivalent rectangular bandwidth. It was from Brian Moore, who seems to publish a paper a month in JASA for the last 30 years. That was one of them. These are the plots of the Mel scale, the ERB scale, and the Bark scale. And they're normalized for amplitude. They look kind of the same. If you manipulated the variance of the green curve, it would do a pretty good job of laying on top of the blue curve. So the bottom line, as far as I'm concerned, is that all of these more or less do the same thing. Doesn't matter which one you use, but everybody seems to have their favorite. Frankly, I don't think it affects recognition accuracy much at all. But -- >>: So where did the Bark come from? >> Richard Stern: Barkhausen. It's an individual. It's the name of somebody. Yeah. It's a contraction. Sorry about that. I don't know who he is or what he did, but that's what it's from. Okay.
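For reference, the three scales just mentioned are usually written down with standard published approximations -- O'Shaughnessy's fit for the Mel scale, Glasberg and Moore's fit for the ERB, and Traunmüller's fit for Zwicker's Bark scale. The sketch below just prints a few sample values; the exact curves on the slide may differ slightly from these fits.

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale, O'Shaughnessy's common approximation (used in most MFCC code)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def erb_bandwidth(f):
    """Equivalent rectangular bandwidth in Hz (Glasberg & Moore, 1990)."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def hz_to_bark(f):
    """Bark scale, Traunmueller's approximation of Zwicker's critical-band scale."""
    return 26.81 * f / (1960.0 + f) - 0.53

if __name__ == "__main__":
    for f in (100.0, 500.0, 1000.0, 4000.0):
        print(f"{f:6.0f} Hz -> mel {hz_to_mel(f):7.1f}, "
              f"ERB {erb_bandwidth(f):6.1f} Hz, bark {hz_to_bark(f):5.2f}")
```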
And the last one, equal loudness curves -- these are the Fletcher-Munson curves, early psychophysical measurements from the 1920s, that showed absolute sensitivity as a function of frequency; at higher intensities the curves saturate. Okay. So basically, frequency analysis in parallel channels, preservation of temporal fine structure, limited dynamic range in individual channels. I should have made more of a to-do about this, but when we saw the rate-intensity curves, all of those curves went from threshold to saturation within about 20 or 25 dB. And that's actually somewhat of a paradox because of course, you and I have a dynamic range of about 100 dB, depending on where we look and how we count. And the fact that you can do that with individual fibers says a lot about the fact that we look at this picture very, very holistically, which is something that computers aren't so good at doing. But it's of interest. Enhancement of temporal contrast, enhancement of spectral contrast, onsets and offsets and adjacent frequencies. And most of these physiological attributes have psychophysical correlates, in fact I would say all of them. Some were discovered in the 1920s and some were not discovered or confirmed until the 1970s, but basically everything that I've talked about seems to be relevant for perception. The question is, is it helpful for speech recognition. And I don't have the complete answer for these, but we'll talk about some partial answers. And most physiological and psychophysical effects are not preserved in conventional representations for speech recognition. So that's the point of departure. I'm not going to insult any of you by going through my usual walk-through of Mel frequency cepstral coefficients. Just suffice it to say, for those of you who aren't familiar with speech processing, that we take the input speech, multiply it by a Hamming window, typically about 20, 25 milliseconds. Typically Hamming, doesn't have to be. Do a Fourier transform, take the magnitude of that. Weight that triangularly with respect to frequency, and that's supposed to be a crude representation of the frequency specificity. Take a log of that and take the inverse Fourier transform of that, and you get these things called Mel frequency cepstral coefficients. The Mel comes from the fact that these triangular filters are spaced nonlinearly, originally according to the Mel scale. And this was first proposed by a pair of researchers at Bell Northern, Davis and Mermelstein. What was it? Around 1972? >>: I think it was -- >> Richard Stern: '82? '82? >>: 80s. >> Richard Stern: It's Davis and Mermelstein, yeah. >>: '82. >> Richard Stern: '82. I knew it had a two in it, not to mention a one and a nine. Anyway, so let's take a look at what comes up. So this is an original speech spectrogram. I know you can all read this. I took a course from Victor Zue once in 1985 teaching me to read this. And what it tells me is that it's me speaking. Actually that you can't tell from that. But what you should be able to tell is the utterance is welcome to DSP one. This is an example. So usually DSP one is over there. And this is the spectrum recovered from Mel frequency cepstral coefficients. And you know, if I take off my glasses -- I'm pretty damn nearsighted -- and walk to the back of the room, I will get the same thing. But the general idea is that it's fairly blurry compared to the original wideband spectrogram.
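A minimal sketch of the pipeline just described -- Hamming window, FFT magnitude, mel-spaced triangular weighting, log, and an inverse transform (in practice a DCT of the log filterbank energies). The filter count, frame length, FFT size, and sample rate here are arbitrary illustrative choices, not the settings used in any of the experiments discussed.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular weighting functions with center frequencies evenly spaced in mel."""
    mel_edges = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):
            fb[i, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fb[i, k] = (hi - k) / max(hi - mid, 1)
    return fb

def mfcc_frame(frame, fb, n_fft, n_ceps=13):
    """Hamming window -> FFT magnitude -> triangular mel weighting -> log -> DCT."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=n_fft))
    log_energies = np.log(fb @ spec + 1e-10)
    n = len(log_energies)
    # DCT-II of the log filterbank energies, standing in for the "inverse transform"
    basis = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :] * np.arange(n_ceps)[:, None])
    return basis @ log_energies

if __name__ == "__main__":
    sr, n_fft = 16000, 512
    frame = np.random.randn(400)             # one 25 ms frame; stand-in for real speech
    fb = mel_filterbank(40, n_fft, sr)
    print(np.round(mfcc_frame(frame, fb, n_fft)[:5], 2))
```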
Now, in all fairness, part of that blurriness was deliberate because this was designed to get rid of these striations which correspond to pitch. That was considered to be not part of what was interesting. But nevertheless, it's blurry. Some of the aspects of fundamental auditory processing that are preserved are the frequency selectivity and the spectral bandwidth, so that the analysis is narrower at low frequencies than at high frequencies, so that's consistent with physiology. However, because of the fact that we use a window of constant duration, we don't really take advantage of the opportunity to exploit better temporal resolution at the high frequencies. We basically throw that away. It's an opportunity lost. Wavelet schemes exploit time-frequency resolution better. Les Atlas in our own native land of Washington, Seattle, has looked at this a bit. But I think it's fair to say that wavelet analysis has not had a big impact on speech recognition so far. Otherwise we'd all be using it, and we're not. So it's gotten no better results and it's less simple, so people are continuing to do what they have been. Also, the nonlinear amplitude response is encoded in the logarithmic transformation that was part of the Mel cap representation. A bunch of aspects of auditory processing that are not represented: one of them is the detailed timing structure. Lateral suppression, enhancement of temporal contrast and other auditory nonlinearities. And the list can go on and on. I just am running out of space here on this PowerPoint. And so we'll take a look. Now, interest in the auditory system began -- well, I mean, people have always been interested in the auditory system. But potential interest in applying this to speech processing began in the 1980s. There were a few seminal models, one of which, which we actually looked at in the '90s or late '80s, was that of Stephanie Seneff. This was her Ph.D. thesis before she went on to work in natural language processing. And basically, it assumed stage one was a filter bank, a critical band filter bank. Stage two was a hair-cell model, which included the nonlinearities and a couple of temporal things like short-term AGC. And then there was a combination of envelope detector and synchrony detector. The envelope detector is kind of like an energy detector and the synchrony detector was like a -- well, that actually looked for the synchronization that I talked about before. And if you blow up the second stage, you get a saturating half-wave rectifier, short-term AGC, a lowpass filter, and another AGC. Basically, all of these things are computational approximations to what the physiologists observed. And among all of the early ones, this is the one that people started with the most. And frankly, the reason was that Stephanie gave her code away so everybody could use it, and it's a lesson about open source. It was helpful. Oded Ghitza had -- I wouldn't say a competing model, but a complementary model; the interesting thing about that was, again, there's a filter bank, and what was different was they had a lot of different thresholds and level crossings associated with that, but by the time you looked at everything and you were done with it, you had something that looked fairly similar to what you would have gotten with the Seneff model. And similarly -- there we go. Similarly, Dick Lyon, who at that point originally was at Fairchild but later with Apple, had -- again, a set of bandpass filters that are kind of off the page.
A more detailed model of what was going on after that, including a stage that explicitly modeled lateral suppression. He also included an autocorrelation display. This led to the popular correlograms. And he also introduced the idea of crosscorrelation as a mechanism for auditory lateralization. And he was really ahead of the curve on the autocorrelation and crosscorrelation. I should mention that crosscorrelation is supported by physiology. The autocorrelation is really not, to my knowledge. >>: I was talking to Jordan Cohen, who also used to work for IBM at that time. >> Richard Stern: Jordan Cohen, his Ph.D. thesis was pitch, a model of pitch. And it also used an autocorrelation. He finished around 1982. And indeed that was contemporaneous with all of this. But I didn't include that because I didn't think that the model had anything that the previous models didn't have. And his work particularly was focused on pitch perception. I met Jordy, gosh, for the first time at the '82 ICASSP, which was in Paris, and he was presenting his work there. That's also when I met Dick Lyon. And Seneff I had known from before, of course. We were a year apart at MIT. And Victor Zue was my TA in DSP. That was when he was a grad student and I was -- well, we both were grad students. Anyway, one of the reasons why the Seneff model did not catch the world afire was this one here. This is an analysis of the number of multiplications per millisecond. And on the left, you have the various stages of the Seneff model, and over here we have LPC processing. And MFCC processing was comparable to LPC processing at the time, so you could assume that it was about the same. So that was a deterrent. And this was, keep in mind, in the 1980s; computers were not very powerful. Not very big. I remember somebody spent about, oh, God, about $6,000 or $7,000 to get this big disk. It came, it was about this size. A Winchester drive. And it was eight megabytes. And it was this wide and took a whole thing and had to -- it was hermetically sealed. It was -- yeah. You may remember that disk. It had its own [indiscernible]. Anyway, so to summarize what was going on before, the models developed in the 1980s included, you know, kind of realistic auditory filtering, realistic auditory nonlinearities, and, sort of in quotes, synchrony extraction, lateral suppression again, higher processing through autocorrelation and crosscorrelation. If you look across the [indiscernible] sample of models, every system developer had his or her own private idea of what was important. And this varied quite a bit from person to person. And that was -- you know. So clearly there's no consensus, and not much quantitative evaluation actually performed. Typically the paper would say we have this thing and then they'd show you a display of like one sentence and say see how much better this looks than -- anything else. And I can understand this. I mean, it was really hard to do a good job with this because everything was so slow. And so I -- you know, I had an appreciation of that. When we actually did get around to evaluating this -- and this was in part Ochi Oshima's Ph.D. thesis -- what we found was the following: Physiological processing didn't help at all. Or certainly not much if you had clean speech. It gave us some improvement if we had degraded speech. If we had noise or recorded things with a distant microphone, it was better.
However, the benefit that we got with physiological processing did not exceed what we could get with more prosaic approaches such as CDCN, you know, which was Alex's Ph.D. thesis. I don't know how much you hear about CDCN these days. But this shows up in our stuff. But in any case, we would do better with much lower computational costs just being good engineers and forgetting about the physiology. And so, you know, it was disappointing but true, and we couldn't ignore the reality. There are also other reasons why they didn't work so -- why they weren't so successful. One of them was that in those days, the conventional speech recognition systems were either DTW using conventional -- I don't know, just using a conventional, you know, distance metric, or HMMs using, in those days -- Kai-Fu Lee's thesis was single density, you know, discrete HMMs. And these all implicitly assumed univariate Gaussians. The distributions of the features that came out were very non-Gaussian processes. And so there wasn't a good statistical match. Ben Shigea (phonetic), at Interspeech -- wow, I guess in those days it was ICSLP -- 1992, and Biff, remember we shared a room? Watched the last game of the National League playoffs in 1992? Yes, we did. We were there. The Braves defeated the Pirates. Francisco Cabrera knocked in Sid Bream, former Pirate, in '92. Immediately thereafter, Barry Bonds went to San Francisco and Bobby Bonilla went to the Mets, and Pittsburgh never finished above .500 after that. Including now. So we're well below .500. Anyway, more interestingly, Ben Shigea (phonetic) had a paper in which he compared physiological approaches to conventional Mel caps, both with a conventional HMM and with a neural network classifier. And the physiological model really shone with the neural net classifier because the neural net could learn the density, whereas in those days, at least, the HMMs assumed Gaussian densities. Nowadays, of course, we all use Gaussian mixtures, which in principle, you know, can model any shape. So that's of less relevance. Also, frankly, the more pressing need was to solve other basic speech recognition problems. How to do large vocabularies, how do you integrate language models. This was really kind of a boutique kind of thing. So there wasn't a lot of attention paid to it. It was kind of a niche market and, you know, it consumed a lot of cycles, didn't provide any benefit, so it was a small coterie of aficionados. Okay. So nevertheless, in the late 1990s, a renaissance, for a number of reasons. One of them was that computation no longer was a big deal. Wasn't -- not as much as before. There were other attributes. Serious attention paid to temporal evolution. A lot of work in modulation filtering, which became very popular. Attention also paid to reverberation, which was not as obvious a problem in the old days, but once people started deploying systems in real rooms, if you weren't working with a close-talking microphone, that to my mind was, you know, in some ways at least as challenging if not more challenging than noise. And also binaural processing became part of the mix, which is, to my mind, a good thing. And more effective and mature approaches to information fusion as well. I'm not going to talk a lot about that, but it's also, I believe, one of the factors that's motivating the increased popularity. >>: By binaural processing, you [indiscernible]? >>: [Indiscernible]? >> Richard Stern: Well, actually, I don't even mean that. I just mean two microphones.
Just two microphones. But you're right, strictly speaking, binaural recordings are with a head, but most of us don't want to have a device with a head on it. And so again, I'm trying to be pragmatic. What can we appropriate from what we learn about the system in order to improve performance? And without necessarily overly lavish regard to physiological or anatomical details like heads and ears and so forth. You know. >>: So in this area, (indiscernible)? >> Richard Stern: We can exploit things by exploiting interaural time delay. So, by the way, Mike, I'm looking at the clock. I note that the time was budgeted for 90 minutes, which is 50 percent more than I thought it would be. Is everybody going to leave at five -- or four, rather? At four? >>: Typically it's 4:30. >> Richard Stern: No, no, no, no, I'll finish by 4:30. My question is do I need to worry about four? >>: No. >> Richard Stern: No. Okay. Okay. Great. Okay. So let's talk a little bit about what we've been doing lately. More or less for the last 6 or 7 years. Unfortunately I'm not going to be able to talk about these in equal levels of detail, but I'll be happy to stay as long as you have patience after we're done to answer questions. Representation of synchrony, shape of the rate-intensity function, revisiting analysis duration, frequency resolution, onset enhancement, modulation filtering I'm not going to say very much about but I can comment on, and binaural and polyaural techniques as well as techniques derived from auditory scene analysis based on common frequency modulation. So this is another physiological result, by Eric Young and Murray Sachs. I mentioned before Murray Sachs was the guy who first recorded consistently the lateral suppression effect that we saw before. In fact that was his career before he went to the Hopkins biomedical engineering department. Eric Young was a former student of his. He's now a big honcho there as well as in the ARO and the ASA. And what we're looking at are physiological recordings of cats that are being hit by an artificial signal that's actually a pseudo vowel generated by a computer. It's basically sound waves with those intensities. This was done in the late 70s, when, you know, again, equipment was pretty primitive. And we're looking at the relative number of spikes as a function of -- in response to this, averaged over time. And these things are plotted according to the characteristic frequency. And if you look at the response as a function of frequency, you get an estimate of what the response profile is. And the panels -- you probably can't read them, but we're looking at the overall loudness, which changes from 28 dB to 78 dB. So it's, you know, 50 or 60 -- a range of about 50 dB in intensity. And these three arrows which you see occasionally, and they're kind of in odd positions because I got this by taking a picture and moving it around, but these three arrows are in these positions. They indicate the original, you know, pseudo formant frequencies of the original signal. And basically the story here is that if you look at this and if you look at what happened over frequencies, it's a big mess, and there's no invariance over intensity. And it doesn't look like mean rate of firing is a very useful way of coding the spectral shape of the vowel that's coming in. At least if you believe the results based on that figure. So the story is that using mean rate doesn't work. At least physiologically.
Now, this is the same thing using something called an average localized synchrony rate. And what that was -- you saw earlier, I talked about the fact that the response was synchronized to the phase of the signal coming in. The synchrony rate is a measure of the extent to which it's synchronized. So if the things occur randomly anywhere within the phase of the signal coming in, that synchrony measure is zero. And if it's completely lock-stepped to the phase coming in, that's going to be one. And these are vertically displaced from each other, but the cute thing about this, again, we're looking over a range of intensities, is that not only are the contours, including the formant frequencies, very nicely preserved, but also that the same -- but also that the curve (indiscernible) for an invariant. So this suggests synchrony is important. This, by the way, was taken up by Campbell [indiscernible], and you may recall. Were they there when you were? >>: Yeah. >> Richard Stern: Yeah. So -- >>: (Indiscernible). >> Richard Stern: Yeah, probably. I forget exactly. I should look at that. The measure actually -- they weren't necessarily synchronized, also, I should say, to the actual fundamental frequency. They would be synchronized to whatever the nearest harmonic was of the fundamental frequency. So it's a somewhat bogus statistic. But in any case, the important information -- the implication, at least to me, is that cues about the spectral content, you know, are certainly there in the synchronization. And better preserved there than they are in the mean rate. Now, when we take a look at Mel caps, really, what that is is a measure of short-term energy as a function of time and frequency. So that's more like the mean rate, which doesn't look very good at all. So the question is, you know, can we harness that? And the answer, by the way, I'm not going to keep you in suspense forever, is: Well, sort of. But here is a fairly complicated model of signal processing. I forget Zhang's first name. Do you remember? All right. Carney is Laura Carney, who is a physiologist. Used to be at BU, now runs a lab at Syracuse. Zhang, who I believe is a woman, was a student of Carney. Anyway, they had this fairly complicated model, which actually looked better on my screen than it does here, which by the way we're also using because it's available in open source. The C code is right there from Carney's lab, which makes it easy to exploit. One of the things that we did is we just asked the question: if we just took the output down here -- and this is one of the first things that Bosco did when he got to graduate school -- if you just take the output down here and use that as the basis, do we do any better than we would with Mel caps from that very complicated model? And this by the way was very slow. Really, really, really slow. Even in 2006, it was really, really slow. Much more complicated than the (indiscernible) model. The boxes look simple, but they're not. It's slow. It's a careful physiological model. The general shape -- this was Bosco's side, and this was Chanwoo Kim's side. Chanwoo is a fan of complexity. But we did mean rate estimates, synchrony detection, looked at synchrony at low frequencies, mean rate at high frequencies. And then combined the two and then used that. In those days, it was very slow, mainly because, again, we're using the Carney auditory nerve model. And these are results that we presented back in 2006.
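The synchrony measure being described is essentially what physiologists call vector strength: zero when spikes fall at random phases of the stimulus, one when they are perfectly phase-locked. Below is a minimal sketch of that generic statistic; the spike times and the 500 Hz stimulus frequency are made-up illustration values, and this is not the exact ALSR computation from the Young and Sachs work.

```python
import numpy as np

def vector_strength(spike_times, freq_hz):
    """Synchrony index: 1.0 for perfect phase locking, near 0.0 for random firing."""
    phases = 2.0 * np.pi * freq_hz * np.asarray(spike_times)  # phase of each spike
    return np.abs(np.mean(np.exp(1j * phases)))               # length of mean phase vector

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f0 = 500.0                                  # stimulus frequency, Hz (illustrative)
    locked = np.arange(200) / f0 + 0.0002       # one spike near the same phase each cycle
    random = rng.uniform(0.0, 0.4, size=200)    # spikes at random times over 200 cycles
    print("locked:", round(vector_strength(locked, f0), 3))   # close to 1
    print("random:", round(vector_strength(random, f0), 3))   # close to 0
```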
If we look at some of these: in the top curve, this is an original spectrogram; this is the reconstruction using Mel frequency cepstral coefficients in the fashion as before, we just turned the coefficients back into something that looked like a spectrogram. And then down here, this is the auditory model. You see, you know, with clean speech, you have this -- you still have the formant trajectories, kind of nicely and cleanly represented. As we go to 20 dB SNR, this and this are showing the effects of noise. This is particularly so at high frequencies, not because they're bad at high frequencies, but because the filters have a wider bandwidth at high frequencies. So more noise gets into each channel. So this is just a -- if we had pink noise, this would be the same across the frequencies. But again, we still got, you know, pretty good preservation of the contours that we saw before. >>: (Indiscernible). >> Richard Stern: Say that again? >>: There is synchrony. >> Richard Stern: Yeah, this is -- but with mean rate and synchrony. That's right. >>: So why is it that the (indiscernible) so much more strongly than (indiscernible)? (Indiscernible) you can see energy elsewhere. >> Richard Stern: Yeah, it's not -- you mean here? This is -- >>: (Indiscernible) red colors everywhere else up there. >> Richard Stern: I think this is an excerpt of the greasy wash water utterance that's, you know, the dialect calibration sentence for -- >>: (Indiscernible). >> Richard Stern: Say that again. >>: (Indiscernible) down below a thousand? Usually it's the opposite. >> Richard Stern: You know, I'd have to listen to the utterance. It might have been, you know, kind of very (indiscernible) greasy. You know, it really would depend on how it's pronounced. >>: (Indiscernible). >> Richard Stern: Yeah. In fact, that's what I said a moment ago, by the way. But you were so much more eloquent. [Laughter]. And you can see it reflected in the fact the noise shows up here as well. Here you see it even more, at ten dB. And again, this is starting to get kind of really fuzzed out whereas these are still pretty well preserved. And you know, down at 0 dB, nothing works. Sooner or later that was going to happen, and that's where it did. So now let's look at the Wall Street Journal task with white noise and background music, and I fear that these might be mislabeled. Yes, they are. Please interchange in your mind the green curve and the blue curve. I keep intending to do this and I never do. But the basic story is -- and these were results that actually Bosco did and I think Chanwoo (indiscernible) them -- is this is mean rate. This is the auditory representation -- this is -- I'm sorry. This curve, this curve and also this curve here is Mel frequency cepstral coefficients. The green curve and again, I believe, the blue curve on the right side, despite the fact it's not labeled that way, is the auditory model with mean rate only. And the red curve is mean rate plus synchrony. In this situation, the synchrony certainly did not give us, you know, a very impressive incremental advantage, especially considering the amount of effort that it would take to calculate it, and I'm skipping over some of the details. I don't regard this as a big success for synchrony, but one of the things that we do see is that we see, you know, really a substantial difference between performance with the auditory models and everything else. Now, note, by the way, that it is diminished quite a bit when you have background music.
In general, white noise we now understand is really easy, relatively speaking, and if you really want to impress somebody, you've got to work on music and also you have to work on big tasks too. >>: So in your work, did you find any difference between (indiscernible)? >> Richard Stern: Well, you know, I agree. And the truth is that we did this with artificially added noise because we had to. >>: (Indiscernible) situation. >> Richard Stern: Say that again. >>: (Indiscernible). >> Richard Stern: It has -- no. We've done other -- well, there are two issues here. One is what's the noise source and the other is how's the noise combined. So as I understand it, I have done only limited work with Aurora; as, you know, we have very different noises: white noise, background music from the old Hub 4 task, various speech. And more recently, we did some work for Samsung on [indiscernible], and they actually went around with a microphone in a supermarket and, you know, a concert, in an airport, and in train stations, the street and so forth. And also we've had natural noise samples from Telefónica from the work that they did also recording things. In terms of the noise type, things that are kind of quiescent are easy. So if you have white noise and colored noise, it's fairly trivial. If you have things that jump around a lot, and I don't talk a lot about this in this talk, but you know, when I came here a few years ago, we talked more about that. For example, background music is much more difficult to compensate for than white noise. And that's partly because, you know, you've got problems with the, you know, particularly vocal music, with the background being confused with the foreground. But also, the classical compensation algorithms like CDCN and VTS all began by sniffing a piece of the environment for about a second or so and then using that for the environmental parameters. And if the environment changes during that time, which it typically will for that, you know, it won't be helpful. In addition, impulsive things like timpani crashes or, for that matter, gunshots or impulsive noises on factory floors are particularly susceptible to not working well. Missing feature techniques such as the things that Martin Cooke at Sheffield and somewhat later, but we think better, Bhiksha Raj did at Carnegie Mellon, are more effective for impulsive noise. So there's a big issue with noise type. Then beyond that, there's a question of how is the noise added. Probably the biggest issue is that when you additively add noise, you don't include any reverberant effects. And in the real world, unless you're measuring things in an anechoic chamber, which especially in these days is unusual, or perhaps outdoors, which even there is not really perfectly anechoic at all, you're going to get echoes, and the echoes are really going to muck things up quite a bit. And when you digitally combine things as we do and as Aurora does, you lose cognizance of that. So we've started to become more attentive to that, although again calibrated in artificial situations using, you know, the image model, but we found that it's always worse in the real world, but that the thing that we do for a particular -- and evaluate for will still provide benefit for comparable tasks. So, you know, I understand the concern. I share the concern. It's hard to do much about it. >>: (Indiscernible). >> Richard Stern: Yeah, yeah, yeah, yeah. >>: (Indiscernible). Natural noise (indiscernible).
Is that our speakers (indiscernible) and they receive noise and change their formants, but the noise is real noise. It's not (indiscernible) generated. >> Richard Stern: Well, these were -- >>: (Indiscernible). >> Richard Stern: Yeah, we don't have any trivial algorithms. I mean, we don't exploit anything -- none of these are -- >>: (Indiscernible) all the extremes that I found show that it was a very, very good approximation. Not for the rest. Not true. >> Richard Stern: Well, I think for microphone arrays -- well, some of these things that you can get singular solutions for, you know, you don't get the solutions when somebody is in the room and then disturbing the model. I'm generally very skeptical of any algorithm that tries to invert anything. It's because all of these inversion techniques are very sensitive to numerical issues; they're very sensitive to having an exact model, which you're never going to have in a real environment. We never do that. >>: (Indiscernible) natural environment sometimes the channel effect (indiscernible). So somehow, you know, there's a (indiscernible). >> Richard Stern: Yeah. Let's continue for now. Just because I have 250 slides and we're only -- no. I'm only kidding. [Laughter]. I don't have that many. But I'm not going to tell you what the number is. [Laughter]. I know what it is. And we have to go on to the next slide. Anyway, a reasonable question: do the auditory models have to be so damn complex? And here's the Carney and Zhang model, or Zhang and Carney model. And here's the Chiu model. Or this is just an easy model where you have Gammatone filters followed by a nonlinear rectifier, followed by a lowpass filter. This is what we tried at the end, which is kind of a very crude abstraction, in other words, taking this piece, maybe this piece and this piece, and leaving everything else out of it, especially this stuff over here. And if we just did that, how would we do? And the answer is, well, pretty good. This is Mel frequency cepstral coefficients. This is the simple auditory model. And this is the more complicated auditory model. And so this is one of these half empty/half full situations. I mean, on the one hand you're still doing really well on the basis of almost zero computation compared to just using Mel caps. On the other hand, you know, if you really want to spend a lot more cycles, you can do better. And you know, the question that drives, you know, our work these days is can we do better without spending that many cycles. And in order to do that, we have to have a deeper understanding of what's going on than simply plugging in somebody else's code at great expense computationally and then running it into our analyzers. >>: Do you still see improvements with the adaptations? >> Richard Stern: Do we still see improvements when there is speaker adaptation -- >>: Yeah. >> Richard Stern: Excellent question. I don't know. We should try it. >>: It's equivalent to asking the question that (indiscernible). That's a simpler version of the question. The answer is it's hard to do that. >> Richard Stern: Yeah. >>: Right? (Indiscernible). >> Richard Stern: We would. These were all trained clean, which was our religion back from your days. We're, you know, moving away from that. And I will show you some results from Aurora using the Aurora paradigm in a moment. But that's a good question. The truth is I don't know. You know. We'll find out.
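To make the "easy model" concrete -- Gammatone filters, a rectifying nonlinearity, and lowpass smoothing -- here is a minimal sketch under assumed settings. The fourth-order FIR gammatone, the 20 log-spaced channels, and the 25 ms / 10 ms framing are illustrative guesses rather than the actual Chiu or Kim configurations, and the plain half-wave rectifier here stands in for the saturating nonlinearity discussed below.

```python
import numpy as np

def erb(f):
    """Equivalent rectangular bandwidth in Hz (Glasberg & Moore)."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_fir(fc, sr, dur=0.050, order=4):
    """Impulse response of a 4th-order gammatone filter centered at fc."""
    t = np.arange(int(dur * sr)) / sr
    b = 1.019 * erb(fc)
    h = t ** (order - 1) * np.exp(-2.0 * np.pi * b * t) * np.cos(2.0 * np.pi * fc * t)
    return h / np.sqrt(np.sum(h ** 2))            # crude energy normalization

def simple_auditory_features(x, sr, center_freqs, frame_len, hop):
    """Gammatone filterbank -> half-wave rectifier -> lowpass (per-frame averaging)."""
    feats = []
    for fc in center_freqs:
        y = np.convolve(x, gammatone_fir(fc, sr), mode="same")
        y = np.maximum(y, 0.0)                    # half-wave rectification
        n_frames = 1 + (len(y) - frame_len) // hop
        env = [np.mean(y[i * hop : i * hop + frame_len]) for i in range(n_frames)]
        feats.append(env)
    return np.array(feats)                        # (n_channels, n_frames)

if __name__ == "__main__":
    sr = 16000
    x = np.random.randn(sr)                       # one second of noise as a stand-in for speech
    cfs = np.geomspace(100.0, 6000.0, 20)         # roughly log-spaced center frequencies
    F = simple_auditory_features(x, sr, cfs, frame_len=400, hop=160)
    print(F.shape)                                # (20, n_frames)
```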
The one thing I can tell you is, once again, that we did one specific piece of work for Samsung in which we, you know, basically -- it wasn't exactly the paradigm we're talking about here. It was actually a much bigger task and more realistic noises, but the things that I'm describing worked in that practical environment. They may not be the best thing to do, but we're not completely doing what you would consider to be the right experiments. One obvious question to ask is, you know, in those models, what really is important. And this is actually a part of Bosco's thesis. And what we're looking at is we're looking into various different stages of the Seneff auditory model, looking at performance. And this is now recognition accuracy rather than error rate, as a function of SNR going in the other direction from which you would expect it to. That's why the curve is going down. But just the quick interpretation of this is that if you include an appropriate saturating half-wave rectifier, then you get good results. And if you don't, you get bad results or -- >>: So what is that -- >> Richard Stern: These are the good -- >>: -- (indiscernible). >> Richard Stern: Well, we'll talk about that. So the previous slide doesn't belong there. What this rectifier does is as follows. This is plotted against log intensity, and I put in a logarithmic transformation. What's implied by, you know, the conventional log transformation in Mel frequency cepstral coefficients is the straight line. This curve is kind of an abstraction of actually the curve that emerged from the Zhang and Carney model. Is that correct? >>: No. It's from the Seneff model. >> Richard Stern: From the Seneff model. >>: Yeah. Yeah, just (indiscernible). >> Richard Stern: Okay. Sorry about that. Thank you for the correction. Anyway, the idea is that, for example, at 20 dB SNR, which of course is relatively benign, the speech kind of sits here in the graded portion of the curve, and the noise sits down here. And so when you have noise by itself, you're there and the contribution of the noise is relatively small. So you reduce the variability produced by the noise. And this is again from Bosco's thesis, a comparison of frequency response as a function of -- this is channel index, but read that as a marker for frequency. This is clean speech and noisy speech. The effect of the noise, of course, is to fill in the valleys in the representation. And by using the nonlinearity, you still get some effect, of course, but the correspondence is much closer. So that seems to be helpful. This is a comparison of recognition accuracy obtained. And this is actually on the Aurora test set. This is, thank you, Mike. And this is test set A. And it was trained and tested, my understanding is, according to Aurora protocols. And what we're looking at: the dark red triangles are results using Mel frequency cepstral coefficients. The red triangles are using a baseline nonlinearity kind of taken, I think in this case, from the other models, sort of out of the box, just fitting a curve to the results of the physiological model. Carney and Zhang this time? Also Seneff? >>: Yeah. I think it's (indiscernible). >> Richard Stern: Okay. Doesn't matter. And the blue curve was the results that Bosco was able to obtain using a routine that automatically learned the characteristics of the nonlinearity, or found the characteristics of the nonlinearity that produced best performance.
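The curve being described -- an S-shaped rate-intensity function in place of the straight-line log -- can be sketched with a logistic function of the channel level in dB. The particular sigmoid and its constants below are illustrative stand-ins, not the nonlinearity learned in Bosco's thesis; the point is only that low-level, noise-dominated energies land on the flat foot of the curve while speech-level energies land on the sloped part.

```python
import numpy as np

def log_compression(energy):
    """Conventional MFCC-style compression: a straight line versus log intensity."""
    return np.log(energy + 1e-12)

def saturating_rate_intensity(energy, threshold_db=30.0, slope=0.15):
    """Illustrative S-shaped rate-intensity function applied to channel energy.

    Flat below roughly 'threshold_db', sloped above it, saturating toward 1.
    The constants are made up for illustration, not learned values."""
    level_db = 10.0 * np.log10(energy + 1e-12)
    return 1.0 / (1.0 + np.exp(-slope * (level_db - threshold_db)))

if __name__ == "__main__":
    for db in (0, 20, 40, 60, 80):              # channel energies at several levels
        e = 10.0 ** (db / 10.0)
        print(f"{db:2d} dB  log: {log_compression(e):6.2f}   "
              f"saturating: {saturating_rate_intensity(e):.3f}")
```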
>>: Was that something like the (indiscernible)? >>: It's using the (indiscernible). >> Richard Stern: It wasn't based on error rate. In other words, it's operating in open loop fashion -- >>: (Indiscernible). >> Richard Stern: No. It was done ahead of time. It was done in advance. >>: (Indiscernible). >>: So it's maximizing the (indiscernible). >>: It's maximizing the (indiscernible). >>: (Indiscernible). >> Richard Stern: You're talking about clean speech? >>: Clean speech. >>: (Indiscernible). >>: (Indiscernible). >> Richard Stern: We have the numbers somewhere. I have your thesis, you know, here, if you want to refer to it. But let's hold off on that for now. Anyway, so nonlinearity helps. All right. I want to talk for a few moments about the analysis window duration. The typical analysis window, as you know, as we mentioned before, as you all know, is about 25 to 35 milliseconds for most speech recognition systems. In fact sometimes I've seen it go down a little bit lower. If you're trying to sniff the environment, you're better off looking over a longer duration. Typically, 75 to 125 milliseconds, depending on the particular application. And this seems trivial in retrospect, but, you know, up until now, we've sort of been going frame by frame, everything we've been doing. There's a pretty substantial win to be had just by looking over a longer window for estimating compensation parameters. And then drilling down to a shorter window. So basically go frame by frame in a short duration frame, look over a longer window, and then kind of move things forward. We're not the first people who have done this, of course. But it's not as commonly done as it should be. So I thought I'd make mention of that. >>: When you say compensation parameters, you mean like (indiscernible) CMN or (indiscernible)? >> Richard Stern: Well, we're doing -- nowadays we're doing everything online. So we don't use CMN because CMN requires that you look at the whole utterance. So we only have a look ahead of about a frame or two. So that, you know, and this is a real problem because a lot of things like voice activity detection are dependent on having a model for silence, and we're constantly worrying about how to update those models dynamically based on, you know, what things come in. And it's a tough problem. Anyway, in all of those, you know, kind of adaptive parameter updates, typically the update will be based on what we observe over about 75 milliseconds. But it will be updated every frame, which is every ten milliseconds, of course. And the analysis will still be applied to, you know, the actual speech recognition will still be done on a 25-millisecond frame, 26.3 or whatever. It's something that's odd, you know, something that derives from the sampling rate and powers of two. You know the way it is. But normally somewhere between 20, 25 milliseconds. So again, that's something that's worth noting. Chanwoo Kim calls this medium duration or medium time windowing. I don't think it's that profound, but it's worth doing. And I see that we're also missing a closed paren, maybe over here is where it goes. Sorry about that. Frequency resolution. We looked at several different types of frequency resolution. There's the MFCC triangular filters.
Frequency resolution. We looked at several different types of frequency resolution. There are the MFCC triangular filters. Gammatone filters are wider, and in much of the original work I indicated that if you just take the Mel frequency cepstral coefficients and increase the bandwidth of the window -- in other words, replace the triangular weighting used in Mel frequency cepstral coefficients with Gammatone weighting -- you actually do better, in fact fairly substantially better. However, if you are also willing to use a different nonlinearity, then the effect of the nonlinearity actually swamps the effect of the bandwidth. So by not doing the right experiment, we can come to the wrong conclusion. We've looked at Mel frequency triangular filters, Gammatone filters, and truncated Gammatone filter shapes, which are just Gammatones, but when the weighting function gets down to a certain point, we simply set it equal to zero. This is useful because Gammatone windows actually go on for a long time, and if you are using them as weighting functions, by having them extend over a long range of frequencies, you end up multiplying lots and lots of numbers that don't really contribute much to anything. By using a truncated Gammatone frequency weighting function, you can get the effect of the weighting functions without that. Now, there's one exception to this. In certain situations, when you're doing frequency selection -- if you're doing missing-feature analysis, for example, in which you are selecting only a subset of time-frequency bands for the representation -- then there's a question of how you fill in the missing features. Really the best way to do this is using something like the cluster-based analysis that Bhiksha Raj did. But if you don't want to spend all that time and energy and computation, a much cheaper thing to do is simply to use frequency weighting, or frequency smoothing. The effect of these frequency windows is basically to smear what you have. You're in effect convolving the frequency response that you have, which includes the missing components, with the frequency response of the windows, which actually varies with frequency. But if you can imagine this kind of frequency-varying convolution, you gain a lot. And there, having the wider resolution helps -- I'm sorry, the broader frequency (indiscernible) of the Gammatone helps. So in the broad range, in terms of everything, we do everything with Gammatone filters now. It never hurts. It helps in some situations; in a lot of situations it doesn't make much difference. But that's what we have gleaned from that.
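A minimal sketch of the truncation idea follows, using a generic bell-shaped curve as a stand-in for one channel's Gammatone frequency response; the threshold, the stand-in shape, and the function name are assumptions. The point is just that the weighting is zeroed once it falls below a small fraction of its peak, so each channel multiplies only a narrow band of FFT bins.

```python
import numpy as np

def truncate_weighting(weights, floor_ratio=0.005):
    """Zero out a channel's frequency weighting once it falls below a small
    fraction of its peak (the threshold is an assumption), so the per-channel
    spectral integration multiplies only a narrow band of FFT bins instead of
    the long tails of a full Gammatone response."""
    w = np.asarray(weights, dtype=float)
    return np.where(w >= floor_ratio * w.max(), w, 0.0)

# A generic bell-shaped curve standing in for one channel's Gammatone magnitude
# response over 257 FFT bins (purely illustrative).
bins = np.arange(257)
weight = np.exp(-0.5 * ((bins - 80) / 12.0) ** 2)
w_trunc = truncate_weighting(weight)
active = np.nonzero(w_trunc)[0]
print(f"bins actually multiplied: {active.min()}..{active.max()} of {len(bins)}")
```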
Effects of onset enhancement processing. This is a paper, again, that Chanwoo Kim is doing, and I apologize for not having more details here, but if anybody is curious about it, I'll send you the paper. It was just accepted for Interspeech 2010. What's going on is that we have an auditory-based model, with the usual frequency analysis and nonlinearity. And then after the nonlinearity -- maybe before the nonlinearity, I need to check -- after the bandpass filtering and the nonlinearity, I think, there are a couple of things. One is that it takes a look at basically the envelope -- something that looks like a power envelope -- and causes the falling edge to fall away very quickly. So what that means is that you pay a lot of attention to the rising edge of things and a lot less attention to the falling edge. Just interpreting the curves here, the baseline MFCC curve is blue. RASTA PLP -- and all of these have cepstral mean normalization -- is the red curve, which is actually worse than this. This, by the way, is music noise -- oh, jeez. I'm sorry, I wanted to show not Resource Management but Wall Street Journal; we had numbers for that as well, but this is what we have. Anyway, baseline MFCC with cepstral mean normalization is here. RASTA PLP with CMN is here. And just by doing this stuff, we get a bit of an improvement. Again, looking at the horizontal displacement in these curves, it's only a couple of dB. But more interestingly, in reverberation we get a very big improvement in recognition accuracy. This is simulated reverb time going from 0 to 1.2 seconds. Again, doing nothing is down here. Adding this SSF processing gives us -- it still isn't great, I mean, we go down from 95 percent to 60 percent correct, but that's at 1.2 seconds of reverb time. There is much better preservation of word accuracy as a function of reverb time by getting rid of the stuff immediately after the first-arriving components. >>: (Indiscernible). >> Richard Stern: Yes. Yeah. >>: Nice. >> Richard Stern: It is cute. Because we're short on time, I'm not spending a lot of time talking about the precedence effect, but it's very real. (Indiscernible) this processing is monaural. So it's a nice result. The other thing that is kind of interesting is that doing the processing improves the recognition accuracy a little bit in clean speech as well. It's not easy to see because of the data point overlap here, but that was surprising to me. It's consistent with the fact that you're kind of differentiating the spectral envelope -- or the power envelope -- coming in. I don't know how well that's going to hold up in noise, though, because typically differentiating things in noise is bad, and these reverberation results were done in the absence of noise. However, when you have both noise and reverb, the noise will make things worse, but you will still get the same hierarchy of results. It really does help. >>: (Indiscernible) parameters talking about adaptation or temporal adaptation. >> Richard Stern: It's a form of temporal adaptation, that's right. >>: (Indiscernible). >> Richard Stern: As with all questions -- again, I don't do anything; I just edit the papers and try to get the students to stay on message, but they do all the work. There were a few parameters, and what you see are the results with the best parameters. We typically tune these over many different kinds of noises, and indeed it is the case that some parameter values are better for some noises. We try to find a set of parameters that looks best over the noises that we consider. Typically, the suite of things that we tend to look at is white noise, speech babble, interfering speech, street noise, background music, and simulated reverberation. Of those, the most difficult kinds of interference are the individual speaker and background music; the most benign is white noise, and reverb is kind of (indiscernible) above everything else but also quite difficult.
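To make the rising-edge/falling-edge idea concrete, here is a crude sketch of onset emphasis. It is an illustrative stand-in rather than the published SSF algorithm; the forgetting factor, the floor, and the function name are assumptions.

```python
import numpy as np

def onset_emphasis(power_env, forget=0.9, floor_ratio=0.01):
    """Crude onset-emphasis sketch (an illustrative stand-in for SSF-style
    processing, with assumed constants): compare each channel's power envelope
    against a first-order low-passed version of itself and floor the frames
    that fall below it, so rising edges pass through while the energy trailing
    after them (e.g. reverberant tails) falls away quickly."""
    n_frames, n_channels = power_env.shape
    lowpass = np.zeros(n_channels)
    out = np.empty_like(power_env)
    for t in range(n_frames):
        lowpass = forget * lowpass + (1.0 - forget) * power_env[t]
        out[t] = np.where(power_env[t] > lowpass, power_env[t], floor_ratio * lowpass)
    return out

# Example: a 40-channel power envelope with an abrupt onset followed by a long
# decaying tail; the onset survives, the tail is suppressed.
env = 0.1 * np.ones((200, 40))
env[50:60] += 5.0
env[60:150] += np.linspace(3.0, 0.0, 90)[:, None]
emphasized = onset_emphasis(env)
```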
Okay. I want to talk for a moment about an integrated front end. I really apologize for this -- just turn your heads on your sides, if you don't mind, for a moment. I tried to [inaudible]. I need to redraw this clearly, but there are too many boxes. What we're looking at is Chanwoo Kim's PNCC algorithm. It stands for power-normalized cepstral coefficients. This is a block diagram comparison of MFCC processing, PLP processing, and PNCC processing. The most important things: there's different frequency integration, which is not a big deal -- PLP uses its own function, MFCC uses the triangular filters. There is a medium-duration power calculation; I talked about that before, and that's used for normalization. This ANS stands for asymmetrical nonlinear spectral filtering, and that's a version of the kind of onset enhancement that I talked about before. There's temporal masking, which again gives you the effect of the onset enhancement, some smoothing across channels, and then a certain amount of normalization. So you get something that looks like cepstral coefficients. There's code for this, by the way, available online, and a paper that we're writing and papers that have already been published that cover most of the details in fairly cryptic form. I just want to show a few performance comparisons, and I can talk later about what blocks give you what. Here, again, MFCCs are down here. This is, by the way, Wall Street Journal 5k in white noise. RASTA PLP is a little bit better -- not a lot better. These are Mel cepstra with VTS applied -- substantially better. And PNCC is up here. In background music, as I indicated before, the magnitude of the improvement is less, but this is still several dB. Again, in this case the baseline is here, RASTA PLP is worse, and VTS doesn't help much. We've known since '97 that VTS isn't very effective in music. And we get some improvement, although, again, it's not as much as we'd like. There are other things that we could do, that I'm not going to show, that give better results in background music. >>: So for 5k, what's the general (indiscernible)? >> Richard Stern: Well, the most important thing here: this is about 89 percent -- 87, 88 percent. We always de-tune the language models -- we do this more for Resource Management than for Wall Street Journal. This is one pass. We're not interested in all the things that are done to clean that up; we're only concerned with relative improvement. Especially with Resource Management, if you have any sort of language model at all, these just kind of snap into position, and it's just a very bad indicator of the quality of the acoustic models, because the task is so trivial. Wall Street Journal, of course, is less susceptible to that, but suffice it to say this is a very simple one-pass system, not at all the kind of thing you do in an evaluation, and we really didn't work to optimize it at all. The last thing is reverberation. Again, Mel cepstra are here, and RASTA PLP is worse. As I mentioned to some of you, I've had discussions with Hynek Hermansky to confirm that we're not abusing RASTA PLP here. This is the implementation out of the box from the Dan Ellis website, and as far as he and I can tell, the numbers are legitimate. Again, VTS does not provide any great improvement in reverberation once the reverb time exceeds the frame duration. And again, because of some of the nonlinear processing we were talking about before, there is an improvement here.
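For readers who want a sense of how these blocks chain together, here is a skeletal outline of a PNCC-style front end. It is only an illustrative sketch, not the code released by Chanwoo Kim: the asymmetric filtering and temporal-masking stages are omitted, and the function name, window length, and power-law exponent are assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def pncc_like_features(power_spec, filterbank, context=4, power_exponent=1.0 / 15.0, n_ceps=13):
    """Skeletal outline of a PNCC-style front end (illustrative only):
    Gammatone-shaped spectral integration, a medium-time power estimate used
    for normalization, a power-law nonlinearity in place of the log, and a
    DCT that yields cepstral-like coefficients.

    power_spec : (frames, fft_bins) short-time power spectrum
    filterbank : (channels, fft_bins) Gammatone-shaped weighting functions
    """
    channel_power = power_spec @ filterbank.T                  # spectral integration
    # Medium-time estimate: average over 2*context + 1 frames around each frame.
    pad = np.pad(channel_power, ((context, context), (0, 0)), mode='edge')
    kernel = np.ones(2 * context + 1) / (2 * context + 1)
    medium = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode='valid'), 0, pad)
    normalized = channel_power / np.maximum(medium, 1e-10)     # medium-time normalization
    compressed = np.maximum(normalized, 1e-10) ** power_exponent   # power-law nonlinearity
    return dct(compressed, type=2, norm='ortho', axis=1)[:, :n_ceps]

# Example with random stand-ins for the spectrum and the filterbank shapes.
power_spec = np.abs(np.random.randn(100, 257)) ** 2
filterbank = np.abs(np.random.randn(40, 257))
features = pncc_like_features(power_spec, filterbank)   # shape (100, 13)
```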
>>: And that is no noise. >> Richard Stern: In this case, I'd have to look. There may be a ten-dB (indiscernible) ratio. >>: (Indiscernible). >> Richard Stern: Say that again. >>: In the variables (indiscernible) reverberation (indiscernible) no noise conditions. >> Richard Stern: That's right. And I would have to check to see if there's noise here or not. There may not be, and I'm thinking that there isn't, because the 0 reverb time point is about the same as it was before, so I suspect no noise. Basically -- again, we have a bunch of things -- the SSF algorithm turned out to be better in reverb than PNCC. PNCC is intended to be something that is best all around; it provides improvement in noise and in reverb. Actually, in the special case of music, there's a different kind of noise compensation that gives a better result than we saw before, but the price that you pay is that the performance in clean speech goes down. In all of these, there's no such loss; PNCC is just as good as MFCCs in clean speech. That was more important for Chanwoo than it was for me, but it was there. That's all. One more thing: we looked at computational complexity -- MFCC, and this is MFCC without VTS (if you add VTS it's a lot more; VTS is relatively slow), PLP, PNCC, and truncated PNCC. Truncated PNCC is, as I described before, cutting down the frequencies and nothing else. >>: So I thought for reverberation, a standard baseline is MFCC with a long window and then (indiscernible) and then (indiscernible). >> Richard Stern: Yeah. You can do that. We haven't found a huge benefit from that. It helps a little bit, and you need a long window. I hate to say this, but we are out of time. If anybody would like to stay, they can, but I think what I should do is skip everything in binaural hearing, some of which is interesting, and skip to the end. If anybody wants to stay around, I'll flip through this very quickly, but in all fairness to everyone else, let me start here. >>: (Indiscernible). [Laughter]. >> Richard Stern: Yeah, but -- >>: (Indiscernible). >> Richard Stern: A lot of them are skipped. A lot of them are skipped? >>: It wasn't too (indiscernible) [laughter]. >> Richard Stern: So: knowledge of the auditory system can certainly improve speech recognition accuracy. Some of the things we've talked about include the use of synchrony, although not much; consideration of the rate-intensity function, which helps a lot; and onset enhancement, which helps a lot for reverb. Selective reconstruction, which I didn't get to talk about, is useful, again, in reverberant situations; I'll elaborate on that very quickly if anybody wants to hang around -- give you the five-minute version. And correlation-based emphasis, which I also didn't talk about, is a binaural-based algorithm. Consideration of the processes mediating scene analysis, again, we didn't talk about here; we have some results based on comparison of frequencies. And the other question, of course, was: do our experiences in speech recognition inform students of the auditory system? My answer for that is kind of a queasy maybe. And I'll tell you the reason for that: coming in, and in all my years as a hearing student, I had assumed that all of these nonlinearities and details in the auditory system were just functions of fundamental limitations of physiological tissue, were just an annoyance that we should be dispensing with, and that we should model the whole thing as linear as possible, because as engineers we can.
And I'm certainly coming to appreciate the fact -- or appreciate the proposition -- that these details and these nonlinearities actually have some functional advantage for processing signals such as speech in difficult acoustical environments. I don't feel that I understand everything about them yet, but certainly my picture of auditory processing and what it means has evolved over the years, and hopefully it will continue to evolve. Thank you for sharing these moments. And again, I'll quickly skim through some of the things we skipped over for anyone who wants to see them, but this is the formal end. >>: Thank you. [Applause.]