>> Ivan Tashev: Today it is my pleasure to present Xing Li. She has been an intern in our research group for the last few months, and her project was dynamic loudness control for in-car audio, which is what she is going to present now. A little bit about her: Xing is a PhD student at the University of Washington, across the pond. Her advising professor is Les Atlas. Without further ado, you have the floor. >> Xing Li: Thank you, Ivan. Hello everyone. Thanks for coming. This is my end-of-internship talk. Let me take this opportunity to thank the speech research group for a great summer and the Windows Embedded for Automotive team for funding this project, and I would like to thank my fellow interns for their help and friendship. Here is my outline. First I will describe the project. One key point about this project is that loudness control is not just about volume adjustment; it involves multiple aspects of hearing perception, and we will talk about this in a moment. One key question is: what is the preferred listening level for music, for speech, and in different scenarios? I will present the user study results and describe the loudness control systems we have developed. In total we have developed three systems, and for each one I have prepared sound demos so you can judge with your own ears. After all the demos I will conclude and talk about future directions. First, why do we need a dynamic loudness control system for in-car audio? From everyday experience, we know that during driving the background noise is constantly changing as the road and traffic conditions change from time to time. The car itself may generate different kinds of noise. 
On the other hand, during driving we usually like to listen to some music or video or take a phone call, so if we have to change the volume frequently it can be quite annoying. So we asked the question: can we enjoy effortless listening by developing a dynamic loudness control system? Here is the big picture. To design such a dynamic control system we need reliable noise estimation. In this project we assume that the noise estimation comes from the audio processing pipeline, which consists of multiple components including, for example, the acoustic [inaudible]. Provided that we can get accurate noise estimation, the question is how we are going to adjust the amplification gain as noise goes up and down over time. Before we go through the technical details, let's first talk about the evaluation methodology. Whatever system we develop, how are we going to evaluate it? It would be ideal if there were some objective measurement to guide the design. Unfortunately, hearing perception is quite complicated and involves many factors, so for this project there is no objective evaluation tool. The main evaluation will be based on our own perception and on the user study, so get ready for lots of listening. There is no objective result at the end. Let's start. Let's talk about volume and loudness. Basically, by adjusting the volume we are controlling the sound pressure level, which then affects our loudness perception of the sound. First, how do we measure sound level? Normally, sound pressure is constantly changing over time. The effective measure is usually taken as the root mean square of the instantaneous sound pressure over a certain time interval, for example several seconds. Because the sound pressure range is very large and human perception is roughly logarithmic, the dB scale is used. The commonly used zero reference, 20 micropascals, represents the hearing threshold at 1 kHz. At this point I am enormously nervous [laughter]. >>: Do you mind if I ask you a question? 
>> Xing Li: Yes? >>: That level is kind of the bottom line of those Fletcher Munson curves, right? >> Xing Li: Yes. >>: Cool. >>: [inaudible] otherwise in underwater acoustics they have a different zero reference. In some cases, for vibrations and other types of noise, there are other 0 decibels, but typically what we measure in the field is referenced to 20 µPa. >> Xing Li: Okay, thanks. Now let's talk about loudness perception. This figure shows the equal loudness curves at different levels. The x-axis is the frequency and the y-axis is the sound pressure level. Each curve represents the sound pressure level required at each frequency to evoke a constant perceived loudness. For example, the bottom curve is the hearing threshold. At 50 Hz, in order to hear, the sound level must be at least 40 dB, whereas at 1 kHz the threshold is zero, so our ear is less sensitive to the very low and very high frequencies. The ear’s sensitivity is not fixed; rather, it is level dependent. Note that the 100 phon curve is relatively flat, which means at a high sound pressure level our ear becomes almost equally sensitive to all frequencies. So what does this imply in terms of timbre perception, which is an important factor for music enjoyment? For example, the mix level for music is normally about 80 dB SPL. For home listening, if we turn down the volume by 20 dB the perceived timbre will be very different from the original recording. How do we keep the timbre balanced when changing the playback volume? The technique is called loudness compensation. Basically we need to find out what the original mix level was and what the playback level is. The compensation gain then depends on the difference between the equal loudness curves at the original level and at the playback level. 
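The compensation principle can be sketched in a few lines of Python. This is a minimal sketch, not code from the talk, and the contour values below are rough illustrative numbers, not ISO 226 data: at each frequency, the boost is how much a given overall level drop "overshoots" the spacing between the two equal-loudness contours, normalized so that 1 kHz gets no boost.

```python
import numpy as np

# Illustrative sample points on two equal-loudness contours (NOT ISO 226 data).
freqs_hz  = np.array([50, 100, 1000, 4000, 12000])
contour80 = np.array([99.0, 91.0, 80.0, 75.0, 91.0])  # ~80 phon curve (illustrative)
contour60 = np.array([85.0, 73.0, 60.0, 56.0, 74.0])  # ~60 phon curve (illustrative)

def compensation_gain_db():
    """Boost per frequency so a 60 dB playback keeps the 80 dB mix's relative timbre.
    At 1 kHz the contour spacing equals the level change, so the boost is 0 there."""
    level_drop = 80.0 - 60.0            # overall playback level drop in dB
    spacing = contour80 - contour60     # contour spacing at each frequency
    return level_drop - spacing         # the overshoot is the required boost

for f, g in zip(freqs_hz, compensation_gain_db()):
    print(f"{f:5d} Hz: boost {g:+.0f} dB")
```

With these numbers the low frequencies get the largest boost (+6 dB at 50 Hz) and 1 kHz gets none, matching the compensation behavior described in the talk.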
In this case we take 80 dB as the reference level, so if the music is played back at this level, no compensation is required, but if we decrease the volume, we need to boost the low frequencies and high frequencies to keep them at the same relative loudness as in the original recording. To demonstrate this I have a music example for you. Let's listen to a song first at 80 dB, which is roughly the music reference level. [audio begins]. [audio ends]. >> Xing Li: Wow! >>: Sorry. This is louder than when we tried it yesterday. [laughter]. >> Xing Li: It's 80 dB? >>: Yes. [audio begins]. [audio ends]. >>: That was much better, sorry. >> Xing Li: Okay. That was roughly 80 dB. Next let's listen to the same song but at 60 dB. You will definitely hear that the sound becomes quieter, but there is also a difference in the low-frequency region, so while you are listening please concentrate on the low frequencies. Let me play this. [audio begins]. [audio ends]. >> Xing Li: The low frequencies sound much quieter compared to the other frequencies. Now let's listen to the same song with compensation at 60 dB. [audio begins]. [audio ends]. >> Xing Li: With compensation, the relative loudness between the low frequencies and the other frequencies stays roughly constant. To notice this, let me play the 80 dB one again. The perception is complicated, so it's very hard to find the correct terms for what I want you to hear, but let me play this. [audio begins]. [audio ends]. >> Xing Li: Fun today, anyway. [laughter]. >>: [inaudible] did you? [laughter]. >> Xing Li: Somehow there is a level jump at the very beginning. I don't know why. Let me talk about the user study. The purpose of the user study is to find out what the preferred level increase is as a function of the noise level. Is it different for different kinds of music and for speech? So first off, how do we measure noise level? 
There are different kinds of measurement, for example dBA and dBC, which are basically weighted sound pressure levels. To reflect the ear’s sensitivity at different frequencies, the weighting curve is roughly the inverse of the equal loudness contour. For example, the A-weighting, this blue curve, is roughly the inverse of the 40 phon curve, and the C-weighting is the inverse of the 100 phon curve. In our study we found that measurement in dBA provides a good indication of the noise level, so in the results I'm going to present later we will use the dBA measurement. Before I describe the user study setup, let me give you the key factors involved. First is the stimulus type, for example rock music versus pop ballads. For the noise condition we have car noise recordings from different scenarios. The third is an interesting one, personal preference, because we all understand that individual judgment can be really personal. Statistically, we don't know if there is a pattern, so let's find out. This is a screenshot of the user study interface. A user controls when to start the stimulus by clicking on the play button. The stimulus is then played continuously. The stimulus can be any of four types of music, or it can be telephone speech or natural speech. While the stimulus is being played continuously, car noise recordings are played in the background at the same time. As the noise conditions change from time to time, the user can use the slider bar to adjust the volume until he or she is most comfortable. This picture shows our experiment setup. The study was conducted in the driving simulator. We used two sets of speakers, one for music and the other for car noise, to reflect the real situation as far as possible, because in practice speech and music come from the sound system, but the car noise can come from any direction; they don't coincide in space. In total we tested 19 users. 
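The A-weighting mentioned above has a standard analytic form (the IEC 61672 approximation); here is a minimal Python sketch of it, not code from the talk:

```python
import math

def a_weight_db(f: float) -> float:
    """A-weighting in dB at frequency f, using the IEC 61672 analytic form.
    The curve is ~0 dB at 1 kHz and rolls off steeply at low frequencies,
    mirroring the ear's reduced low-frequency sensitivity."""
    f2 = f * f
    ra = (12194.0**2 * f2 * f2) / (
        (f2 + 20.6**2)
        * math.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12194.0**2)
    )
    return 20.0 * math.log10(ra) + 2.00  # offset makes A(1000 Hz) = 0 dB

print(round(a_weight_db(1000.0), 2))  # ~0.0 dB
print(round(a_weight_db(100.0), 1))   # ~ -19.1 dB
```

A dBA level is then the sound pressure level with this weighting applied per frequency before summing the energy.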
Each one did 80 judgments: 10 stimuli by eight noise conditions. Now let's take a look at some results. The histogram on the left shows the distribution of the preferred music level in quiet. The most common choice is 75 dB. On the right side, which is speech, we see a big peak around 65 dBA. On both sides the wide distribution implies that there is big preference variability in terms of the absolute listening level, so we normalized each user's data by their preferred level in quiet. Now let's take a look at the normalized data. This is for music: how much level increase users prefer as noise goes up from 30 to 78 dBA. Each curve represents one type of music. We did not find any significant difference among the four types of music. This figure is for speech. We have both narrowband and wideband speech. The preference pattern for telephone speech is significantly different from the pattern for natural speech. Basically, a much larger level increase is required for telephone speech. >>: What is the difference between wideband speech and natural speech? >> Xing Li: Wideband speech is sampled at 16 kHz and band-pass filtered between 50 Hz and 7.2 kHz. Natural speech is basically CD-quality speech, the full band. We did some further analysis and we found that music and natural speech belong to one group, and wideband and narrowband speech belong to another. We proposed to analyze users' preferences in two parts. The first part is the static personal preference and the second is the dynamic level-increase trend. What matters to our dynamic control system is the level-increase trend, which can be modeled by a generalized sigmoid function. In this figure the red fitting curve represents the telephone speech and the blue curve represents music and natural speech. 
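The generalized sigmoid model can be sketched as follows. This is a minimal Python sketch; the parameter values are illustrative placeholders, not the fitted values from the study, though the overall shape (a larger saturating increase for telephone speech than for music) matches the result just described:

```python
import numpy as np

def level_increase_db(noise_dba, upper, mid, slope):
    """Generalized (logistic) sigmoid for preferred level increase vs. noise level.
    upper = saturation increase in dB, mid = noise level of steepest growth (dBA),
    slope = growth rate per dB of noise."""
    return upper / (1.0 + np.exp(-slope * (noise_dba - mid)))

noise = np.arange(30, 79, 8)  # 30..78 dBA, the range covered in the study
# Illustrative parameters only, NOT the study's fitted values:
music = level_increase_db(noise, upper=12.0, mid=60.0, slope=0.15)
phone = level_increase_db(noise, upper=20.0, mid=55.0, slope=0.15)

for n, m, p in zip(noise, music, phone):
    print(f"{n:2d} dBA noise -> music +{m:4.1f} dB, telephone speech +{p:4.1f} dB")
```

The key property is that the preferred increase is small in quiet, grows fastest in the mid noise range, and saturates at high noise levels.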
The fitting correlation for telephone speech is slightly higher than for music, possibly because for speech the choice depends mostly on intelligibility, whereas for music it gets more personal. The judgment for speech is relatively more objective, so we get a higher correlation, but the fitting correlation for music is 0.76, so it's still a pretty decent fit. Now let's talk about dynamic loudness control. Yes? >>: Going back to that curve, just to add a little bit of anecdotal validation, when we were back at the hotel trying to decide if the reference volume level was [inaudible], we did find one thing similar to yours, which is that people like to hear narrowband or wideband speech at a slightly higher level than natural human voice. A potential explanation from our psychologist in the group who did some analysis was that it is harder to understand something coming from a machine, and the only thing the user can do about it is increase the volume. [laughter]. >> Xing Li: Thank you. >> Ivan Tashev: I was wondering another thing, actually, about the histograms of the preferred levels: people subconsciously tried to put the speech signal as loud as a human voice. They don't want the machine to dominate with louder speech, but with lower speech they cannot understand it as well, whereas the music all tended to be 10 to 15 dB off. >>: Very interesting. >> Xing Li: Now let's talk about dynamic loudness control. Here are the function blocks of our system. Let's start from the bottom. We get the noise level spectrum data from the audio processing pipeline and compute the overall noise level. Given a certain noise level, we find the gain adjustment based on the user preference data, which is represented as a LUT here. At the same time we compute the loudness compensation for the playback level. The gain and the compensation are applied to the signal before we send it to the loudspeaker. 
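The function blocks just described can be sketched as follows. This is a minimal Python sketch; the LUT values, function names, and frame handling are illustrative assumptions, not the study's data or the actual implementation:

```python
import numpy as np

# User-preference lookup table (LUT): noise level in -> gain boost out.
# Values here are illustrative placeholders, NOT the fitted study data.
NOISE_DBA = np.array([30.0, 40.0, 50.0, 60.0, 70.0, 78.0])
GAIN_DB   = np.array([0.0, 0.5, 2.0, 5.0, 9.0, 12.0])

def gain_from_lut(noise_dba: float) -> float:
    """Interpolate the preferred level increase from the preference LUT."""
    return float(np.interp(noise_dba, NOISE_DBA, GAIN_DB))

def process_frame(spectrum: np.ndarray, noise_dba: float,
                  compensation_db: np.ndarray) -> np.ndarray:
    """Apply the overall LUT gain plus per-bin loudness compensation,
    all in the frequency domain, before sending to the loudspeaker."""
    total_db = gain_from_lut(noise_dba) + compensation_db
    return spectrum * 10.0 ** (total_db / 20.0)

print(gain_from_lut(55.0))  # midway between the 50 and 60 dBA entries -> 3.5
```

In a real system the noise level would come from the audio processing pipeline every frame, and the gain would additionally be smoothed over time, as discussed next.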
All of the processing is done in the frequency domain. One thing I want to point out is the dual time constant control. Consider that we get a new noise measurement every 20 milliseconds and we know the noise level will be fluctuating constantly. Are we going to change the gain following the exact fluctuations of the noise level? The answer is no, because if we updated the gain too frequently, this would introduce artificial temporal structure into the signal, which should be avoided for both music and speech. So the attack time basically determines how long it takes the system to reach full gain boost, and the release time determines how long it takes to back off the gain. The choices for those two parameters are a little bit different for music and for speech, as we will see later. Next let's talk about loudness compensation. We take 80 dB as the reference music level. The compensation gain is then dependent on the difference between the 80 phon curve and the curve at the playback level. Because the level at 1 kHz is the common reference for loudness measurement, we normalize the gain at 1 kHz. The gain and the compensation are then applied to the signal and we reconstruct the signal by overlapping [inaudible] synthesis. Here are some demos. The first example is music. First let's listen to it without any loudness control; that way it won't be too loud. Let me start from the beginning. The preset level for music is 75 dB. Let's see how it sounds as the background noise goes up and down constantly. [audio begins]. [audio ends]. >> Xing Li: Increase the volume just slightly. >>: No, don't touch. It's good. >> Xing Li: Okay. Now let's… >>: It sounds okay. Let's play some loud sounds. [laughter]. >> Xing Li: Now let's listen to the same song with the same background noise but with dynamic control. [audio begins]. [audio ends]. 
>> Xing Li: I hope you like the way the music level increases smoothly as the noise goes up and decreases smoothly as it goes down. >>: [inaudible]. [laughter]. [multiple speakers]. [laughter]. >>: I want it for my Zune when I am cycling to work. [laughter]. >>: So the noise sample that you used is pretty steady. How does the system respond to impulses, like when you go over a bump, or all the various non-steady-state noises that you get? >> Xing Li: That relates to the time constant choice. If there is a noise burst that lasts, say, 200 milliseconds, it won't cause a gain change because our time constant is on a much longer scale than a noise of that duration. >>: But doesn't it contribute to an overall RMS value? >> Ivan Tashev: Yes, but if you take the average, you don't want to change the gain every time… >>: I understand that you don't want to; what I'm curious about is how you differentiate. >> Ivan Tashev: It's just averaging. Look, the time constant is 1.6 seconds to change the gain. This means that if something in the noise drastically changes, it will take three times that, 4.8 seconds, before you reach 99% of the compensation, so the small bumps are not going to affect it much. For music that's completely okay, but this is not the case for a speaker on a telephone call, where you don't want to miss a word. >> Xing Li: Now, yeah, thank you for the introduction [laughter]. Now let's look at a speech in the same time-varying noise example. Let me start from the middle. [audio begins]. >>: [inaudible] by the somewhat obvious nickname of Cuff. On the September morning consecrated to the enormous event, he arose nervously at six o'clock, dressed himself, adjusted an impeccable stock and hurried forth through the streets of Baltimore to the hospital to determine whether the darkness of night had borne in new life upon its bosom. 
When he was approximately 100 yards from the Maryland private hospital for ladies and gentlemen, he saw Doctor Keyes the family physician descending the front steps, rubbing his hands together with a… [audio ends]. >> Xing Li: Now, in this case the speech level is at 65 dBA. When the noise level is low it's fairly intelligible, but when the noise goes up, it becomes quite hard to get. Now let's hear the same speech with dynamic control. [audio begins]. >>: [inaudible] they were related to this family and that family, which was, as every Southerner knew, entitled them to membership in that enormous [inaudible] which largely populated the Confederacy. This was their first experience with the charming old custom of having babies. Mr. Button was naturally nervous. He hoped it would be a boy so that he could be sent to Yale College in Connecticut, at which institution Mr. Button himself had been known for four years by the somewhat obvious nickname of Cuff. On the September morning consecrated to the enormous event, he arose nervously at six o'clock, dressed himself, adjusted an impeccable stock and hurried forth through the streets of Baltimore to the hospital to determine whether the darkness of night had borne in new life upon its bosom. When he was approximately 100 yards from the Maryland private hospital for ladies and gentlemen, he saw Doctor Keyes the family physician descending the front steps, rubbing his hands together with a… [audio ends.] >> Xing Li: So our initial system demonstrated its convenience for listening in time-varying noise. Regarding music, we prefer a slow attack because missing several notes is acceptable, but a drastic level change sounds unpleasant, and the slower compensation helps reduce timbre coloration. Regarding speech, a faster attack is preferred because transients in the noise might mask a segment of speech. Now, let's talk about the drawbacks of this system. 
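The dual time-constant smoothing behind this behavior can be sketched as a one-pole smoother with separate attack and release rates. This is a minimal Python sketch; the 1.6 s attack echoes the value mentioned in the discussion, while the release and frame times are illustrative assumptions:

```python
import math

class DualTimeConstantSmoother:
    """One-pole gain smoother with separate attack (gain rising) and
    release (gain falling) time constants, updated once per 20 ms frame."""
    def __init__(self, attack_s=1.6, release_s=0.5, frame_s=0.02):
        self.a_attack = math.exp(-frame_s / attack_s)
        self.a_release = math.exp(-frame_s / release_s)
        self.gain_db = 0.0

    def update(self, target_db: float) -> float:
        a = self.a_attack if target_db > self.gain_db else self.a_release
        self.gain_db = a * self.gain_db + (1.0 - a) * target_db
        return self.gain_db

s = DualTimeConstantSmoother()
# A 200 ms noise burst (10 frames asking for +10 dB) barely moves the gain:
for _ in range(10):
    g = s.update(10.0)
print(round(g, 2))  # roughly 1.2 dB, far short of the +10 dB target
```

This is why short bumps and bursts do not pump the gain, while a sustained change in the noise level eventually does.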
One concern is that in order to maintain speech intelligibility in noise, the system will drive the speech level high as noise goes up. Sometimes the peak levels can sound really unpleasant, so we ask the question: can we do something more to maintain speech intelligibility without driving the speech level too high, especially to avoid unpleasant peaks, which are not just unpleasant; they are not good for your health. Another concern is that the user study data might only hold for car-noise-like interference, and for other kinds of noise the overall-noise-level-based design might not be optimal. These considerations led us to the development of our second system, the noise spectrum-based loudness control. The processing is still done in the frequency domain, but this time it is over the critical bands. Now let's talk about the human auditory system. Our ear functions like a frequency analyzer. Sound comes in through the ear canal and eventually reaches the cochlea inside the inner ear. This picture shows an unrolled cochlea. Different frequency components in the sound excite different parts of the cochlea. This place-to-frequency mapping is called tonotopic organization. It is as if there were a series of band-pass filters arranged along the cochlea. To describe those auditory channels, the concept of the critical band was introduced in the '40s. Third-octave band filters are usually considered a good approximation to critical bands. Given the noise spectrum data from the audio processing pipeline, in this case we compute the noise level in each band. We are here now. Once we know the noise level in a particular channel, we can figure out how much boost gain we need for the signal occupying the same channel. So, comparing this system with our initial design: here we compute the boost gain channel by channel, whereas before we computed one gain and applied it to all channels. Now let's talk about that. 
How do we compute the boost gain? It's basically illustrated in this figure. The red curve represents the noise level in each critical band. The black curve is the reference level, which is the 65 phon curve. Why do I take the 65 phon curve as the reference level? Here is my thinking: because the speech preset level is 65 dBA, from a statistical point of view the average speech level as a function of frequency might be approximated by this curve. So the boost gain within each band will be the difference between the noise level and this reference level. Let's hear some demos. Here we are comparing our initial system with the noise spectrum-based approach. In both figures the red curve represents how the noise level goes up and down over time. It is a telephone speech in traffic noise example. On the left side, the green curve represents the speech level output by our initial system. On this side, the blue curve represents the speech level generated by our new approach. Now let me play the output of our initial system first. It is the telephone speech in traffic noise. [audio begins]. >>: You want to tap the shoulder of the nearest stranger and share it. The stranger might laugh and seem to enjoy the [inaudible] but [inaudible] that they didn't quite understand the [inaudible] quality as you did, just as your friends, thank heaven, don't often fall in love with the person you were going on and on about. You are on the verge of entering the wide provoking, benevolent, hilarious and addictive… [audio ends]. >> Xing Li: It is fairly intelligible, but at the same time I don't think we like the peaks, because they are very unpleasant. Now let's listen to our new approach, the same telephone speech example with the traffic noise. [audio begins]. >>: The eye and penetrates the brain. You want to tap the shoulder of the nearest stranger and share it. 
The stranger might laugh and seem to enjoy the writing, but you hug to yourself the thought that they didn't quite understand the [inaudible] quality the way you do, just as your friends, thank heavens, don't also fall in love with the person you are going on and on about to them. You are on the verge of entering the wide, provoking, benevolent, hilarious and addictive… [audio ends]. >> Xing Li: So from left to right the speech intelligibility is the same, but the overall level dropped by 5 dBA. The level drop might be even bigger for those peaks. Now let's listen to the speech only, which is the signal we send to the loudspeaker. [audio begins]. The eye and penetrates the brain. You want to tap the shoulder of the nearest stranger and share it. The stranger might laugh and seem to enjoy the writings but you hug to yourself the thought that they didn't quite understand its force and quality the way you do, just as your friends, thank heavens, don't also fall in love with the person you are going on and on about to them. You are on the verge of entering the wise, provoking, benevolent hilarious and addictive… [audio ends]. >> Xing Li: Okay. And now let's listen to the second system. [audio begins]. The eye and penetrates the brain. You want to tap the shoulder of the nearest stranger and share it. The stranger might laugh and seem to enjoy the writings but you hug to yourself the thought that they didn't quite understand its force and quality the way you do, just as your friends, thank heavens, don't also fall in love with the person you are going on and on about to them. You are on the verge of entering the wise, provoking, benevolent hilarious and addictive… [audio ends]. >> Xing Li: Now let's do some analysis of those peaks and see what the differences are between the initial system and the new approach. Here we are looking at the long-term spectrum; on both sides the red curve represents the noise spectrum. 
It's a [inaudible] noise. The energy is concentrated around 2 kHz. The black dashed line on both sides represents the original speech at 65 dB. The green curve here represents the output of our initial system, while the blue curve represents the output of the noise spectrum-based approach. The green bar indicates the boost gain in each critical band, which is basically the difference between the noise level and the reference level. One main difference between the two is in the low frequencies: because the noise doesn't have much energy in the low frequencies, with the new approach we don't boost the low frequencies much. It's unnecessary. So where we see a big peak here, there is about a 10 dB drop here, and that's why with the new approach the peak levels are noticeably lower. Let's listen to this segment again, taken from the previous example. [audio begins]. Don't also fall in love with the person you are going on and on about. [audio ends]. >> Xing Li: Oh, so the audio system will generate big trouble for me. [audio begins]. Don't also fall in love with the person you are going on and on about. [audio ends]. >> Xing Li: So the main message here is that we can maintain speech intelligibility in noise without having to drive the speech level too high, and a lower speech level also means lower audio power, which is probably a minor effect for car audio. 
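The per-band boost rule just described (the amount by which the band noise level exceeds the reference curve, floored at zero) can be sketched in a few lines. The band levels and reference values below are illustrative placeholders, not measured data; the talk's reference is the 65 phon contour:

```python
import numpy as np

# Per-band boost: how far the noise in each critical (third-octave) band
# exceeds a reference speech-level curve. Values below are illustrative.
bands_hz = np.array([125, 250, 500, 1000, 2000, 4000])
ref_db   = np.array([72.0, 68.0, 66.0, 65.0, 64.0, 66.0])  # reference curve (illustrative)
noise_db = np.array([55.0, 58.0, 62.0, 68.0, 74.0, 63.0])  # traffic-like: energy near 2 kHz

boost_db = np.maximum(noise_db - ref_db, 0.0)
for f, b in zip(bands_hz, boost_db):
    print(f"{f:4d} Hz band: +{b:.0f} dB")
```

Low-frequency bands where the noise sits below the reference get no boost at all, which is exactly why this approach avoids the big low-frequency peaks of the initial system.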
>>: When you are doing it per band, when you are essentially filtering the signal, it seems that there would be some risk of perceptually changing what the person is actually saying versus… >> Xing Li: The timbre will change, but the intelligibility won't, because, and this is what I did not go through, we have a boost gain per band and this boost gain will change over time as the noise changes, but again, we apply a dual time-constant integrator to smooth the gain and avoid introducing any artificial temporal structure into the signal. And I guess your concern is whether changing the spectrum of the speech will modify or affect the intelligibility. Is that your concern? >>: Well, if you're not doing it very short term then you won't change intelligibility, but just the perceptual quality of it. I mean, you can add some spectral tilt to it. >> Ivan Tashev: When you are on a phone call, do you want intelligibility or high quality? [laughter]. >>: Yes. [laughter]. >>: For what it's worth, when you listen to it with the noise the perceptual quality is pretty good, but when you listen to just the speech by itself, that's when the perceptual quality changes so much. >>: [inaudible] critical band isn't it? >> Xing Li: Yes. Since we are trying to, you know, solve the problem for listening in noise, so… >>: So we are all right. [laughter]. >> Xing Li: The noise spectrum-based approach has the advantage of maintaining speech intelligibility without driving the speech level too high, but because of the dynamically changing filters, this approach, as Michael pointed out, will modify the timbre, so it won't be suitable for music. Now, so far the loudness control is only based on the noise and we haven't made any use of the signal yet, so we asked: can we do something even better? 
For example, because of auditory masking there are some frequency components in the playback signal that we won't hear anyway, so why bother to make them louder? On the other hand, there are some frequencies in the noise that won't mask our signal, so why try to overpower them? Let's first talk about auditory masking in human perception. There are basically two types of auditory masking: spectral masking and temporal masking. Spectral masking is normally known as simultaneous masking. This figure shows intra-band masking, where the masker and the signal occupy the same auditory channel. This figure shows inter-band masking, where a sound in the low frequencies generates masking toward the high frequencies. Regarding temporal masking, pre-masking refers to the phenomenon that what happens now affects what was heard just before; it is not really understood why this happens. What happens more frequently is post-masking: basically, a louder sound at this moment will mask what's coming next. Let's go through this. As before, we have the processing flow from the noise spectrum. Here we compute the masking pattern generated by the noise. On the other hand, we have a processing flow from the signal. Here we compute the masking pattern generated by the signal and identify which components in the signal are audible if there is no noise. So in the center block, the boost gain will depend on how much we need to keep the audible peaks still audible when noise is present. Let's talk about how we compute a masking pattern. This figure shows--I see Rico was… [laughter]. >>: [inaudible] embarrassing. [laughter]. >> Xing Li: Yes. This is based on my learning from… >> Ivan Tashev: [inaudible] just copied Rico's book [laughter]. >> Xing Li: No, no. I generated this by myself. [laughter]. Let's walk through this. The blue curve represents the speech spectrum. We can see the harmonic peaks. 
The red curve represents the masking pattern in each band. We can see that only the strong peaks will be audible, while the weak components will be masked. Now consider temporal masking. We can compute the masking pattern in the frequency domain; then, taking this as a threshold, we apply a dual time-constant integrator: the choice of the release time constant reflects post-masking, while the attack time basically determines how quickly we can turn up the gain when the noise goes up. Now let's talk about the boost gain. Assume we know the noise masking pattern and we know which signal components are audible; how much boost gain do we need to keep the signal audible? I think that is an open question. In the experiments we boosted the signal peaks to be 3 dB higher than the noise masking threshold. Now let's listen to some sound demos. In the previous slides we have already listened to those two columns, so I will just play the same narrowband speech in traffic noise example. [audio begins]. The eye and penetrates the brain. You want to tap the shoulder of the nearest stranger and share it. The stranger might laugh and seem to enjoy the writings but you hug to yourself the thought that they didn't quite understand its force and quality the way you do, just as your friends, thank heavens, don't also fall in love with the person you are going on and on about to them. You are on the verge of entering the wise, provoking, benevolent hilarious and addictive… [audio ends]. >> Xing Li: Now let's listen to the speech-only signal. It's the one we send to the loudspeaker. [audio begins]. The eye and penetrates the brain. You want to tap the shoulder of the nearest stranger and share it. 
The stranger might laugh and seem to enjoy the writings but you hug to yourself the thought that they didn't quite understand its force and quality the way you do, just as your friends, thank heavens, don't also fall in love with the person you are going on and on about to them. You are on the verge of entering the wise, provoking, benevolent hilarious and addictive… [audio ends]. >> Xing Li: So the message here is that we can maintain speech intelligibility in noise but at a moderate speech level. A more important point is that we can avoid the very unpleasant high peak levels, because when the background noise is already overwhelming, we don't want to overload our ears on top of it by listening to speech at a high level, so avoiding those unpleasant peaks and keeping speech at a moderate level is very important from a perception perspective. Let me conclude what we have done so far. From the user study data we found that there is a strong relationship between the preferred listening level and the background noise level, which justifies that it is possible to design a dynamic loudness control system that sounds pleasant. The relationship is different for music and for speech. Based on the user data we developed our initial system. This system demonstrated its convenience for listening in time-varying noise; that is its advantage. At the same time it has drawbacks. The main concern is the high speech level at the output when noise goes up, so to overcome the limitations of this initial system, we developed the noise spectrum-based loudness control. The processing is based on critical bands, and we demonstrated that we can maintain the same speech intelligibility as the previous approach but keep the speech at a moderate level. From here to here the processing gets a little more complicated. The auditory masking-based approach depends on both the noise and the signal. 
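The masking-based boost rule (raise only the audible signal peaks to 3 dB above the noise masking threshold) can be sketched as below. The per-band levels are simplified placeholders of my own; a real system would also include inter-band spreading and the temporal integration discussed earlier:

```python
import numpy as np

# Per-band sketch of the masking-based boost (all values illustrative):
signal_db = np.array([60.0, 48.0, 62.0, 40.0, 58.0])  # speech levels per band
quiet_thr = np.array([50.0, 50.0, 50.0, 50.0, 50.0])  # signal's own masking floor (simplified)
noise_thr = np.array([55.0, 57.0, 66.0, 52.0, 54.0])  # noise masking threshold per band

# Components we would hear without noise; masked components get no boost at all.
audible = signal_db > quiet_thr
# Raise audible peaks to 3 dB above the noise masking threshold, never cutting.
boost = np.where(audible,
                 np.maximum(noise_thr + 3.0 - signal_db, 0.0),
                 0.0)
print(boost)  # -> [0. 0. 7. 0. 0.]
```

Only the band where the noise threshold actually exceeds the (audible) speech gets boosted; bands already above the threshold, and bands we would not hear anyway, are left alone, which is how this approach keeps the overall level down.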
To me, and to some other listeners, this approach sounds fairly similar to the previous one and is equally intelligible, but because it considers auditory masking it can drive the speech level even lower. Let me stress the message again: a lower speech level is not just about lower audio power; from a perception perspective, a lower level is better for your hearing health and it sounds more pleasant. Okay. Let's now talk about future directions. There are some very important parameters, for example the time constants; the choices in these examples are based on informal listening by members of our group, so as the next step we will need to run formal user studies to verify the choice of time constants, thresholds and the other parameters, and to find the optimal choices for speech and for music in different scenarios. The second point: so far, in the examples we have seen, we can maintain speech intelligibility at a lower level; we need to verify this by running more formal tests. The third point is a little bit tricky. In the auditory masking-based approach, we computed a masking pattern and based on it we computed the boost gain, but as we change the boost gain the masking pattern changes at the same time, so it is an open question whether there is a better way to calculate the boost gain. The fourth point: to adapt this for practical use, if we want to build it into a product, setting power limits to meet health requirements is very critical, and in some scenarios when the noise level goes enormously high we would probably have to do something more humane, for example shut down the whole system instead of generating really loud speech or music, which is probably the less optimal choice. That's all for my talk. Thank you. [applause]. >> Ivan Tashev: Thank you Xing. Questions please? >>: Xing? >> Xing Li: Yes?
>>: To estimate the noise level you must use a microphone, right? >> Xing Li: Yes. >>: And that is the type of microphone people use for communicating? >> Xing Li: Yes. >>: So when I am having that conversation with you and I'm in the car… >> Xing Li: Uh-huh. >>: It should be measuring my signal. >> Xing Li: Uh-huh. >>: But how does it know it's my signal and not the noise? >> Xing Li: That's an advanced problem… [laughter]… in the audio processing pipeline. The system should be smart enough to figure out if it is the near-end talker, if there is… >> Ivan Tashev: [inaudible] noise models only when no one in the car is talking and there is no sound through the loudspeakers. >>: But let's say we are talking on the phone and you are quiet. I am talking. I am in the car. >> Ivan Tashev: You're talking. >>: I am in the car and you are quiet. You're just listening to me. >> Ivan Tashev: Yep. >>: The signal that is picked up is presumably just my signal. How does the system know that that is not the noise? >> Ivan Tashev: There is a voice activity detector which is combined with the [inaudible] detector and we… >>: The [inaudible] detector should be doing nothing, right, because there is no other sound. >> Ivan Tashev: There is a voice activity detector. This is part of the noise suppression. I know when the local person is talking, and I know when I have a loud signal in the loudspeakers. >>: So the noise you are measuring is the noise in the cabin? >> Ivan Tashev: In the cabin without the… >>: How can you tell noise, like a child in the background making noise, from a speaker or from a truck? >>: Well, that is noise, right? >>: How is that different from my voice, which is not noise? >> Ivan Tashev: There is a voice activity detector. If there is a signal which statistically goes above the threshold, it is not considered when building the noise model, because we need the noise model not only for this, which is kind of a side effect; we need the noise model for
Acoustic separation; well, there is a quite sophisticated system for building the noise model in [inaudible]. >>: If there is a person in the passenger seat talking, is that considered… >> Ivan Tashev: It will not be considered as noise, to some degree. If it goes statistically too low, or too close to the noise, at some point the system will drop it. Otherwise there will be feedback: the system starts the music [inaudible] the speaker [inaudible] [laughter]. >>: Another point that you had on future directions: I wondered how much you thought about the power limits. What approach could be borrowed from some of the techniques that radio stations use, because they do have the [inaudible], so I think a little bit of [inaudible] also has the benefit of driving the [inaudible] amplifier and the speakers, because as you try to compensate you may get significantly close to the limits where the distortion would go very high, but if, in addition to all that, you apply some damping, you might drive the actual peak levels lower and then you could improve the [inaudible] overall as well. >> Ivan Tashev: So we try to please [inaudible] any kind that works. The telephone speech is already here [inaudible]. >>: Oh, because it came companded already… >> Ivan Tashev: It was already companded, so technically it may have a good effect on text-to-speech, where you want to keep it [inaudible] and understandable, but for telephone it's heavily [inaudible]. You can just see the peaks are equal [inaudible] response from the [inaudible]. >>: Yeah, because it may have been through [inaudible]. >> Ivan Tashev: It's all in the calibration. >>: Yeah [inaudible]. >>: So how computationally complex is this? >> Xing Li: I think the initial system is computationally fairly light, because all we want is to compute, well, this is the second approach.
Well, this one we would recommend for music, because all we want is to take the noise spectrum, calculate the overall level in dBA, look at the user preference data and adjust the gain. The user can always reset the preset levels, because some people prefer louder music and some prefer softer music, so they can always change the presets. All we are doing is changing the dynamic level increase [inaudible]. >> Ivan Tashev: Xing, because I know from my coding days [laughter], this is the better frame [inaudible], that's it. The simplest, stupidest voice activity detector is pretty much computationally equivalent to the first system. >>: So how computationally complicated is the curve and gain control based on… >> Xing Li: This one, this one… >>: Or the way of the last one. How complicated is that one? How many [inaudible] do we have? >> Xing Li: I think for telephone speech, between 250 and… there are, these are [inaudible] bands, something below 20 I believe, less than 20. >> Ivan Tashev: Is that more complex than [inaudible]? >>: Bless her heart, but it's not. >> Ivan Tashev: Nothing is of computational concern here besides converting the loudspeaker speech to the frequency domain and back, which you will do anyway for acoustic diagnostics. >>: Cool. I want it tomorrow. [laughter]. The music thing is really, that's really good. >>: How would that work with, you know, I have a soft top Mustang. How would this work with a soft top? >> Ivan Tashev: Don't want to start [inaudible]. >>: Yeah, yeah. >> Ivan Tashev: And maybe they should on some of them. >>: He needs to borrow your car. [laughter]. >> Ivan Tashev: Yeah, I was going to say. [laughter]. [inaudible] automobile some car [inaudible] on top. >>: Next year. >> Ivan Tashev: Perfect, around springtime we will agree to borrow it for the summer. [laughter]. >>: Travel with me. [laughter].
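The noise-level-based initial system discussed in this exchange amounts to a lookup: measure the overall A-weighted noise level, interpolate the listener's preferred playback level from the user-study curve, and set the gain accordingly. A rough sketch follows; the preference points below are invented for illustration and are not the study data, and `MUSIC_PREFS` is a hypothetical name.

```python
def preferred_level_db(noise_dba, pref_points):
    """Linearly interpolate the preferred listening level (dB SPL)
    for a given A-weighted noise level from (noise, preference)
    pairs, clamping at the ends so the output gain is capped."""
    pts = sorted(pref_points)
    if noise_dba <= pts[0][0]:
        return pts[0][1]
    if noise_dba >= pts[-1][0]:
        return pts[-1][1]   # cap: never exceed the loudest preference
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= noise_dba <= x1:
            t = (noise_dba - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)

# Hypothetical preference curve for music: (noise dBA, preferred dB SPL)
MUSIC_PREFS = [(40, 60), (55, 66), (70, 74), (85, 80)]
```

A user preset would simply shift this whole curve up or down, matching the remark that listeners who prefer louder or softer music can change the preset levels while the dynamic adjustment stays the same.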
>>: With regard to your different attack times for speech and music: assuming I am not having a phone conversation, I am just listening and there is music and speech, how does your system identify… >> Xing Li: Music or speech? >>: Yeah. >> Xing Li: Everything comes from the audio [inaudible]. >> Ivan Tashev: For example, if you have a radio talk show, it is considered the same as music, and you may need some certain points… >>: So better to err on the music side, the slow attack side… >> Ivan Tashev: But that may not be that critical when you have a phone call. >>: Yeah. >> Ivan Tashev: So pretty much radio, CD, and media player all go through the same type of settings for music, and [inaudible] speech is down. The settings for speech are for the phone call channel and for the [inaudible] speech. >>: Did you try different music genres to see if some genres, say those with typically longer phrasing and things like that, would respond differently to different attack times? >> Ivan Tashev: [inaudible]. >>: I would think that maybe rock or pop as opposed to, say, chamber music might be… >>: [inaudible]. >>: So is it? >> Xing Li: You are asking: is there a difference in the time constants of the dynamic control for different types of music, or are you talking about the relative level increase between… >>: Yeah, just assuming that different types of music are not a constant thing. Some music is very busy, with lots of attack transients like percussive impulses, and other music is very much the opposite. >> Xing Li: Uh-huh. I did not present that; among the four types of music, the difference between pop ballads and rock is much bigger than, you know, this is [inaudible] very different music.
We almost want to say there is a significant difference between them, but since we have only 19 users we set the significance level at .01. If we set the threshold at .05 instead, we would conclude that pop ballads, which have a wide dynamic range, are significantly different from rock music. So I guess to answer your question: this set of data concludes that there is no significant difference, but, you know, it's a human preference; in order to give a confident answer I think you need to run more tests to find out if there is a statistical difference. >> Ivan Tashev: Just to conclude, the user studies to see if there is a difference in the settings: I'm not saying that they were just [inaudible], but it kind of seemed [laughter]. >>: [inaudible]. >>: I'd also like to add that if the noise is so heavy that the automatic loudness change is significant, then it's not a suggestion that you forget about the attack [inaudible]. >>: Yeah, I know. It's just… >>: Yeah, so there's no way to treat classical music with that because you would completely destroy it. >>: And that was the other question. Does the system cap out at some point? We don't continue to compensate, right? >>: For example, when Mike puts the top down on his car you're going to quit trying to, like, destroy my speakers, right? [laughter]. Personally, I think this system would be very helpful, but I have concluded that it's just not possible to listen to classical music in my car. It's just too noisy, so… >> Ivan Tashev: [inaudible] the dynamic range. >>: So [inaudible] we apply heavy processing; it's not hi-fi. You just maintain audibility of the music, but it's not studio quality. >>: Have you determined how far to go? >>: You mean, yeah, how much of the, what do you call it, frequency parameter thing, the optimal gain, basically.
>> Ivan Tashev: The design of the car already decides this, because you have some maximum power, unless you want some additional loudspeakers and… >>: Because you are just trying to achieve my optimal preferred listening volume, right? >> Ivan Tashev: And at some point you hit the maximum of the car's system, and that's it. If it were necessary to have more power, presumably the carmakers would put it there. Of course, there are people, we have seen them, who take the car system out and put in a subwoofer under the driver's seat so that when there is a [inaudible] the driver kind of jumps, but… >>: Maybe that's what I want. >> Ivan Tashev: More questions? >>: Guess what, I think this is related to what you said: if the car happens to be one of those with active noise cancellation using the hi-fi speakers, then you've got to be careful; as you make more noise [inaudible] you would need to adjust the… >> Ivan Tashev: We may have problems, definitely. >>: You need to kind of jointly design the two so that one doesn't mess up… >>: But then, a real car… >> Ivan Tashev: Yes, but I think those cars which have [inaudible] isolation can afford to pay for additional [inaudible] [laughter]. >>: High-end stuff, but [laughter]. [inaudible] some Mercedes already have them. >>: That's beyond my [laughter] my horizon. [laughter]. >>: Just [inaudible] it's not a typical Microsoft market, but… >> Ivan Tashev: It may cause problems. We should look into that. >>: Yeah. All right, cool. >> Ivan Tashev: Any more questions? If not, let's thank Xing for coming. >> Xing Li: Thank you. [applause].