>> Ivan Tashev: Today it is my pleasure to present Xing Li. She has been an intern in our
research group for the last few months and her project was dynamic loudness control for in car
audio which is what she is going to present now. A little bit about her, Xing is a PhD student at
the University of Washington, across the pond. Her advisor is Professor Les Atlas, and without
further ado, you have the floor.
>> Xing Li: Thank you, Ivan. Hello everyone. Thanks for coming. This is my end of the
internship talk. Let me take this opportunity to thank the speech research group for a great
summer and the Windows Embedded for Automotive team for funding this project, and I would
like to thank my fellow interns for their help and friendship. Here is my outline. First I will
describe the project. One key point about this project is that loudness control is not just about
volume adjustment. It involves multiple aspects of the hearing perception and we will talk
about this in a moment. One key question about this project is what is the preferred listening
level for music, for speech and in different scenarios? I will present the user study results and
describe the loudness control systems we have developed; in total we have developed three
systems. For each system I have prepared sound demos so you can judge with your own ears.
After all the demos I will conclude and talk about future directions. First, why do
we need a dynamic loudness control system for in car audio? From everyday experience, we
know that during driving the background noise will be constantly changing as the road and
traffic conditions change from time to time. The car itself may also generate different kinds
of noise. On the other hand, during driving we usually like to listen to some music or a video, or
get a phone call, so if we have to frequently change the volume it can be quite annoying. So we
asked the question: can we enjoy effortless listening by developing a dynamic loudness control
system? Here is the big picture. To design such a dynamic control system we need a reliable
noise estimation. In this project we assume that the noise estimation comes from the audio
processing pipeline which consists of multiple components including, for example, the acoustic
[inaudible]. Provided that we can get accurate noise estimation, the question is how are we
going to adjust the amplification gain as the noise goes up and down from time to time. Before
we go through the technical details, let's first talk about the evaluation methodology.
Whatever system we develop, how are we going to evaluate it? It would be ideal if there was
some objective measurement to guide the design. Unfortunately, the hearing perception is
quite complicated and involves many factors, so for this project there is no objective evaluation
tool. The main evaluation will be based on our own perception and user study, so get ready for
lots of listening. There is no objective result in the end. Let's start. Let's talk about volume and
loudness. Basically by adjusting the volume we are controlling the sound pressure level which
then affects our loudness perception of the sound. First, how do we measure sound level?
Normally, sound pressure is constantly changing over time. The effective measure is usually
taken as the root mean square of the instantaneous sound pressure over a certain time
interval, for example, several seconds. Because the sound pressure range is very large and human
perception is roughly logarithmic, the dB scale is used. The commonly used zero reference,
20 micropascals, represents the hearing threshold at 1 kHz. At this point I am enormously
nervous [laughter].
>>: Do you mind if I ask you a question?
>> Xing Li: Yes?
>>: That level is kind of the bottom line of those Fletcher Munson curves, right?
>> Xing Li: Yes.
>>: Cool.
>>: [inaudible] otherwise in the underwater acoustics they have a different zero for reference.
In some cases, for vibrations and other types of noise, there are other zero-decibel references, but
typically for what we are measuring in the sound field it is 20 µPa.
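To make that measurement concrete, here is a minimal Python sketch (the function name and the 16 kHz example tone are only illustrative, not part of the talk) of taking the RMS of the instantaneous pressure over a window and expressing it in dB SPL against the 20 µPa reference:

    import numpy as np

    P_REF = 20e-6  # 20 micropascals, the 0 dB SPL reference in air

    def spl_db(pressure_pa):
        """Effective sound level: RMS of the instantaneous pressure (in pascals)
        over the analysis window, expressed in dB SPL re 20 uPa."""
        rms = np.sqrt(np.mean(np.asarray(pressure_pa) ** 2))
        return 20.0 * np.log10(rms / P_REF)

    # Example: a 1 kHz tone with 1 Pa amplitude comes out at about 91 dB SPL.
    t = np.arange(0, 2.0, 1.0 / 16000.0)           # two seconds sampled at 16 kHz
    tone = 1.0 * np.sin(2.0 * np.pi * 1000.0 * t)  # pressure waveform in pascals
    print(round(spl_db(tone), 1))                  # ~91.0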
>> Xing Li: Okay, thanks. Now let's talk about the loudness perception. This figure shows the
equal loudness curves at different levels. The x-axis is the frequency and the y-axis is the sound
pressure level. Each curve represents the required sound pressure level at a different
frequency to evoke a constant perceived loudness. For example, the bottom curve is the
hearing threshold. At 50 Hz, in order to hear a sound, its level must be at least about 40 dB,
whereas at 1 kHz the threshold is zero, so our ear is less sensitive to the very low and to the
very high frequencies. The ear’s sensitivity is not fixed, but rather it is level dependent. Note
that the 100 phon curve is relatively flat which means at a high sound pressure level our ear
becomes almost equally sensitive to all frequencies, so what does this imply in terms of sound
timbre perception, which is an important factor for music enjoyment? For example, the mix
level for music is normally about 80 dB SPL. For home listening, if we turn down the volume by
20 dB, the perceived timbre will be very different from the original recording. How do we keep
the timbre balanced when changing the playback volume? The technique is called loudness
compensation. Basically we need to find out what was the original mix level and what is the
playback level; the compensation gain will be dependent on the difference between the
equal-loudness curves at the original level and at the playback level. In
this case we take the 80 dB as the reference level, so if the music is played back at this level, no
compensation is required, but if we decrease the volume, we would need to boost the low and
high frequencies to keep them at the same relative loudness level as in the
original recording. To demonstrate this I have a music example for you. Let's listen to a song
first at 80 dB, which is roughly the music reference level. [audio begins].
[audio ends].
>> Xing Li: Wow!
>>: Sorry. This is louder than when we tried it yesterday. [laughter].
>> Xing Li: It's 80 dB?
>>: Yes. [audio begins].
[audio ends].
>>: That was much better, sorry.
>> Xing Li: Okay. That was roughly 80 dB. Next let's listen to the same song but at 60 dB. The
distance from here to here, well definitely you will hear that the sound becomes quieter, but
there is a difference in the low-frequency region, so when you are listening please concentrate
on the low frequencies. Let me play this. [audio begins].
[audio ends].
>> Xing Li: The low frequencies, compared to the other frequencies, sound much quieter. Now let's
listen to the same song with compensation at 60 dB. [audio begins].
[audio ends].
>> Xing Li: With compensation, the relative loudness between the low frequencies and the
other frequencies stays roughly constant from here to here. To help you notice
this, let me play the 80 dB one again. Perception is complicated, so it's very
hard to find the right terms for what I want you to hear, but let me play this. [audio
begins].
[audio ends].
>> Xing Li: Fun today, anyway. [laughter].
>>: [inaudible] did you? [laughter].
>> Xing Li: Somehow there is a level jump in the very beginning. I don't know why.
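As a rough illustration of the loudness compensation just demonstrated, here is a minimal Python sketch; it assumes the equal-loudness contours at the reference and playback levels are available as arrays (for example from published ISO 226 data), and the function name and signature are illustrative rather than the talk's actual implementation:

    import numpy as np

    def compensation_gain_db(ref_contour_db, play_contour_db, freqs_hz):
        """Per-frequency loudness-compensation gain in dB.

        ref_contour_db  : equal-loudness contour (SPL per frequency) at the original
                          mix level, e.g. the 80 phon curve.
        play_contour_db : equal-loudness contour at the playback level.
        The contours themselves would come from published equal-loudness data
        (e.g. ISO 226); they are inputs here, not computed by this sketch.
        """
        gain = np.asarray(play_contour_db, float) - np.asarray(ref_contour_db, float)
        # Normalize so the gain at 1 kHz, the common loudness reference, is 0 dB;
        # quieter playback then gets a relative boost at the low and high ends.
        idx_1k = np.argmin(np.abs(np.asarray(freqs_hz, float) - 1000.0))
        return gain - gain[idx_1k]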
Let me talk about a user study. The purpose of the user study is to find out what is the
preferred level increase as a function of the noise level. Is it different for different kinds of
music and speech? So first off, how do we measure noise level? There are different kinds of
measurements, for example dBA and dBC, which are basically frequency-weighted sound pressure
levels. To reflect the ear's sensitivity at different frequencies, the weighting curve is roughly the
inverse of an equal loudness contour. For example, the A-weighting, this blue curve, is roughly the
inverse of the 40 phon curve, and the C-weighting is the inverse of the 100 phon curve. In our
study we found that the measurement in dBA provides a good indication of the noise level, so in
the results I'm going to present later we will use the dBA measurement.
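For illustration, here is a minimal Python sketch of computing an overall dBA figure from per-band noise levels using the standard A-weighting formula; the band-level input format is an assumption of this sketch, not the talk's actual implementation:

    import numpy as np

    def a_weighting_db(freq_hz):
        """Standard A-weighting (dB) at the given frequency or frequencies in Hz."""
        f2 = np.asarray(freq_hz, dtype=float) ** 2
        ra = (12194.0 ** 2 * f2 ** 2) / (
            (f2 + 20.6 ** 2)
            * np.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
            * (f2 + 12194.0 ** 2)
        )
        return 20.0 * np.log10(ra) + 2.00  # +2 dB offset so the weight is ~0 dB at 1 kHz

    def overall_level_dba(band_freqs_hz, band_levels_db_spl):
        """Overall A-weighted level: weight each band's level, then sum the powers."""
        weighted = np.asarray(band_levels_db_spl, dtype=float) + a_weighting_db(band_freqs_hz)
        return 10.0 * np.log10(np.sum(10.0 ** (weighted / 10.0)))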
Before I describe the user study setup, let me go over the key factors involved. First, the stimulus type, for
example, rock music versus pop ballads. For the noise condition we have car noise recordings from
different scenarios, and the third is an interesting one, personal preference, because we all
understand that individual judgment can be very personal. Statistically we don't
know if there is a pattern, so let's find out. This is a screenshot of the user study interface. A
user controls when to start the stimulus by clicking on the play button. The stimulus is then
played continuously. The stimulus can be any one of four types of music, or it can be telephone
speech or natural speech. While the stimulus is being played continuously, car noise recordings
are played in the background at the same time. As the noise conditions
change from time to time, the user can use the slider bar to adjust the volume until he or she is
most comfortable. This picture shows our experiment set up. The study was conducted in the
driving simulator. We used two sets of speakers, one for music and the other for car noise to
reflect the real situation as far as possible, because in practice speech and music come from the
sound system, but the car noise comes from all directions; they don't coincide in
space. In total we tested 19 users. Each one made 80 judgments: 10 stimuli by
eight noise conditions. Now let's take a look at some results. The histogram on the left shows a
distribution of the preferred music level in quiet. The most common choice is 75 dB. On the
right side, which is speech, we see a big peak around 65 dBA. On both sides the wide
distribution implies that there is a big preference variability in terms of the absolute listening
level, so we normalized each user's data by their preferred level in quiet. Now let's take a look at
the normalized data. This is about the music, how much level increase users prefer as noise
goes up from 30 to 78 dB A. Each curve represents one type of music. We did not find any
significant difference among those four types of music. This figure is about speech, where we
have narrowband and wideband speech. The preference pattern for telephone speech is
significantly different from the pattern for natural speech. Basically, a much
larger level increase is required for telephone speech.
>>: What is the difference between wide band speech and natural speech?
>> Xing Li: Wideband speech is sampled at 16 kHz and band-pass filtered between 50 Hz and 7.2
kHz. Natural speech is basically CD-quality speech, the full band. We did some further
analysis and we found that music and natural speech belong to one group and wideband
and narrowband speech belong to another. We proposed to analyze users' preferences
in two parts. The first part is the static personal preference and the second is the dynamic
level increase trend. What matters to our dynamic control system is the level increase trend,
which can be modeled by a generalized sigmoid function. In this figure the red fitting curve
represents the telephone speech and the blue curve represents music and natural speech.
The fitting correlation for telephone speech is slightly higher than for music, possibly because
for speech the choice is pretty much dependent on speech intelligibility, whereas in the
case of music it gets more personal, so the judgment for speech is relatively less subjective
and we get a high correlation. The fitting correlation for music is .76, so it's still a pretty decent fit.
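As an illustration of fitting such a generalized sigmoid trend, here is a small Python sketch using scipy; the example data points are made up for demonstration and are not the study's measurements:

    import numpy as np
    from scipy.optimize import curve_fit

    def level_increase_sigmoid(noise_dba, upper, slope, midpoint):
        """Generalized sigmoid: preferred level increase (dB) vs. noise level (dBA).
        'upper' is the saturation increase, 'midpoint' the noise level at half boost."""
        return upper / (1.0 + np.exp(-slope * (noise_dba - midpoint)))

    # Made-up example points, NOT the study's data -- the normalized user
    # preference measurements would go here.
    noise_dba = np.array([30.0, 40.0, 50.0, 60.0, 70.0, 78.0])
    increase_db = np.array([0.5, 1.5, 4.0, 9.0, 14.0, 16.0])

    params, _ = curve_fit(level_increase_sigmoid, noise_dba, increase_db,
                          p0=[15.0, 0.2, 55.0])  # rough initial guess
    print(params)  # fitted [upper, slope, midpoint]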
Now let's talk about dynamic loudness control. Yes?
>>: Going back to that curve, just to, a little bit of anecdotal validation, when we were back at
the hotel trying to decide if reference volume level was [inaudible] we did find one thing similar
to yours which is that people like to hear narrowband or wideband speeches at a slightly higher
level than human voice. A potential explanation from our psychologist in the group who did
some analysis was that it is harder to understand something coming from a machine and the
only thing the user can do about it is increase the volume. [laughter].
>> Xing Li: Thank you.
>> Ivan Tashev: I was wondering about another thing, actually, about the results of the histograms
with the preferred levels: people subconsciously tried to put the speech signal as loud as
that of a human voice. They don't want the machine to dominate with louder speech, but
with quieter speech they cannot understand it as well, whereas the music all tended to be 10 to 15
dB off.
>>: Very interesting.
>> Xing Li: Now let's talk about dynamic loudness control. Here are the function blocks of our
system. Let's start from the bottom. We get the noise spectrum data from the audio
processing pipeline and compute the overall noise level. Given a certain noise
level, we will find the gain adjustment based on the user preference data, which is represented
as a LUT here. At the same time we will compute the compensation for the playback level. The
gain and the compensation will be applied to the signal before we send it to the loudspeaker. All of
the processing is done in the frequency domain. One thing I want to point out is the dual time
constant control. Consider that we get a new noise measurement every 20 milliseconds and we
know the noise level will be fluctuating constantly, are we going to change the gain following
the exact fluctuation of the noise level? The answer is no because if we updated the gain too
frequently, this would introduce artificial temporal structure to the signal which should be
avoided for both music and for speech. So the attack time basically determines how long it
takes for the system to reach the full gain boost, and the release time determines how long it takes to
reduce the gain. The choices for those two parameters are a little bit different for
music and for speech, as we will see later. Next let's talk about loudness compensation. We take
80 dB as the reference music level. The compensation gain is then dependent on the difference
between the 80 phon curve and the curve at the playback level. Because the level at 1 kHz is the
common reference for loudness measurement, we normalize the gain at 1 kHz. The gain
and the compensation are then applied to the signal and we reconstruct the signal by
overlapping [inaudible] synthesis. Here are some demos. The first example is music. First let's
listen to it without any loudness control. That way it won't be too loud. Let me start from the
new. The preset level for music is 75 dB. Let's see how it sounds as noise--well, the background
noise is going up and down constantly. Let's see how… [audio begins].
[audio ends].
>> Xing Li: Increase the volume just slightly.
>>: No, don't touch. It's good.
>> Xing Li: Okay. Now let's…
>>: It sounds okay. Let's play some loud sounds. [laughter].
>> Xing Li: Now, let's listen to the same song with the same background noise but with dynamic
control. [audio begins].
[audio ends].
>> Xing Li: I hope you like the way the music level increases gently as the noise goes up and
decreases smoothly.
>>: [inaudible]. [laughter]. [multiple speakers]. [laughter].
>>: I want it for my Zune when I am cycling to work. [laughter].
>>: So the noise sample that you used is pretty steady. How does the system respond to
impulses, like you go over a bump or, all the various non-steady state noises that you get?
>> Xing Li: That's related to the time constant choice. If there is a noise burst that
lasts, say, 200 milliseconds, it won't cause a gain change because our time constant is on a much
longer scale than a noise of that duration.
>>: But doesn't it contribute to an overall RMS value?
>> Ivan Tashev: Yes, but if you took the average, you don't want to change the gain every
time…
>>: I understand that you don't want to, what I'm curious about is how you differentiate.
>> Ivan Tashev: It's just averaging. Look, the time constant is 1.6 seconds to change the gain.
This means that if something in the noise drastically changes, it will take three times that, 4.8
seconds before you reach 99% of the compensation, so the small bumps are not going to
affect it much, which for music is completely okay, but this is not the case for a speaker on a
telephone call, where you don't want to miss a word.
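For illustration, here is a minimal Python sketch of such dual-time-constant gain smoothing; the 20 ms frame size follows the talk, the 1.6 s attack echoes the music setting Ivan mentioned, and the release value is only an illustrative placeholder since the actual music and speech settings were tuned separately:

    import numpy as np

    FRAME_S = 0.020  # a new noise estimate arrives every 20 ms

    def smooth_gain_db(target_gains_db, attack_s=1.6, release_s=0.5):
        """Dual-time-constant smoothing of the per-frame target gain (dB).
        attack_s (slow, used while the gain is rising) echoes the 1.6 s music
        setting mentioned in the talk; release_s is just an illustrative value."""
        a_up = np.exp(-FRAME_S / attack_s)      # coefficient while boosting
        a_down = np.exp(-FRAME_S / release_s)   # coefficient while reducing gain
        g = 0.0
        out = []
        for target in target_gains_db:
            coeff = a_up if target > g else a_down
            g = coeff * g + (1.0 - coeff) * target  # one-pole exponential smoother
            out.append(g)
        return np.array(out)

With this kind of smoother a step change in the target gain is followed only gradually, so short noise bursts barely move the applied gain.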
>> Xing Li: Now, yeah, thank you for the introduction, [laughter]. Now let's talk about speech,
with the same time-varying noise example. Let me start from the middle. [audio begins].
>>: [inaudible] by the somewhat obvious nickname of Cuff. On the September morning
consecrated to the enormous event, he arose nervously at six o'clock, dressed himself, adjusted
an impeccable stock and hurried forth through the streets of Baltimore to the hospital to
determine whether the darkness of night had borne in new life upon its bosom. When he was
approximately 100 yards from the Maryland private hospital for ladies and gentlemen, he saw
Doctor Keyes the family physician descending the front steps rubbing his hands together with
a…
[audio ends].
>> Xing Li: Now in this case, the speech level is at 65 dBA. When the noise level is low it's
fairly intelligible, but when the noise goes up, it becomes quite hard to get it. Now let's hear
the same speech with dynamic control. [audio begins].
>>: [inaudible] they were related to this family and that family which was as every Southerner
knew entitled them to membership in that enormous [inaudible] which largely populated the
Confederacy. This was their first experience with the charming old custom of having babies.
Mr. Button was naturally nervous. He hoped it would be a boy so that he could be sent to Yale
College in Connecticut at which institution Mr. Button himself had been known for four years
by the somewhat obvious nickname of Cuff. On the September morning consecrated to the
enormous event, he arose nervously at six o'clock, dressed himself, adjusted an impeccable
stock and hurried forth through the streets of Baltimore to the hospital to determine whether
the darkness of night had borne in new life upon its bosom. When he was approximately 100
yards from the Maryland private hospital for ladies and gentlemen, he saw Doctor Keyes the
family physician descending the front steps rubbing his hands together with a… [audio ends.]
>> Xing Li: So our initial system demonstrated its convenience for listening in time-varying
noise. Regarding music, we prefer a slow attack because missing several notes is
acceptable, but a drastic level change sounds unpleasant, and the longer time constants help
reduce timbre coloration. Regarding speech, a faster attack is preferred because a rising trend in the
noise might mask a segment of speech. Now, let's talk about the drawbacks of this system. One
concern is that in order to maintain speech intelligibility in noise, the system will drive the
speech level high as the noise goes up. Sometimes the peak levels can sound really unpleasant, so
we ask the question: can we do something more to maintain speech intelligibility without
driving the speech level too high, especially to avoid unpleasant peaks, which are not just
unpleasant; they are not good for your health. Another concern is that the user study data
might only hold for car-noise-like interference, and for other kinds of noise the overall-noise-level-based
design might not be optimal. So these considerations led us to the development of
our second system, the noise spectrum-based loudness control. The processing is still done in
the frequency domain, but this time it is over the critical bands. Now let's talk about the human
auditory system. Our ear functions like a frequency analyzer. Sound comes in through the ear
canal and eventually it reaches the cochlea inside the inner ear. This picture shows an
unrolled cochlea. Different frequency components in the sound will excite different
parts of the cochlea. This place-frequency mapping is called tonotopic organization. It is as if there
is a series of band-pass filters arranged along the cochlea. To describe those auditory
channels, the concept of a critical band was introduced in the '40s. Third-octave band filters are
usually considered a good approximation to critical bands. Given the noise spectrum data
from the audio processing pipeline, in this case we are going to compute the noise level in each band.
We are here now. Once we know the noise level in a particular channel, we can figure out how
much gain boost we need for the signal occupying the same channel. Comparing this
system with our initial design, here we compute the boosted gain channel by channel,
whereas before we computed one gain and applied it to all channels. Now, how do we compute
the boosted gain? It's basically illustrated in this figure. The red
curve represents the noise level over the critical bands. The black curve is the reference level,
which is the 65 phon reference curve. Why do I take this 65 phon curve as the reference level?
Here is my thinking: because the speech preset level is 65 dBA, from a statistical point of
view the average speech level as a function of frequency might be approximated by this
curve, so the boosted gain within each band will be the difference between the noise level
and this reference level.
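As a sketch of the per-band computation just described, assuming the per-band noise levels and the 65 phon reference curve levels are already given (the clipping at zero is an assumption of this sketch, not something stated in the talk):

    import numpy as np

    def band_boost_gains_db(noise_band_db, reference_band_db):
        """Per-critical-band boost: the difference between the noise level in the band
        and the 65 phon reference curve level in that band. Clipping at 0 dB (no cut
        when the noise is below the reference) is an assumption of this sketch."""
        diff = np.asarray(noise_band_db, float) - np.asarray(reference_band_db, float)
        return np.maximum(diff, 0.0)

In the system these per-band gains are then smoothed over time with the same dual-time-constant integrator before being applied to the signal bins in each band.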
Let's hear some demos. Here we are comparing our initial system with the noise spectrum-based
approach. On both figures the red curve represents how the noise level goes up and down over
time. It is a telephone speech in
traffic noise example. On the left side, the green curve represents the output level of our initial
system. On this side, the blue curve represents the speech level generated by our new
approach. Now let me play the output of our initial system first. It is the telephone speech in
traffic noise.
[audio begins].
>>: You want to tap the shoulder of the nearest stranger and share it. The stranger might
laugh and seemed to enjoy the [inaudible] but [inaudible] that they didn't quite understand the
[inaudible] quality as you did such as your friends, thank heaven, don't often fall in love with
the person you were going on and on about. You are on the verge of entering the wide
provoking, benevolent, hilarious and addictive… [audio ends].
>> Xing Li: He is fairly intelligible, but at the same time I don't think we will like the peaks
because they are very unpleasant. Now let's listen to our new approach, the same telephone
speech example with the traffic noise.
[audio begins].
>>: The eye and penetrates the brain. You want to tap the shoulder of the nearest stranger
and share it. The stranger might laugh and seem to enjoy the writing, but you hug to yourself
the thought that they didn't quite understand the [inaudible] quality the way you do, just as
your friends, thank heavens, don't also fall in love with the person you are going on and on
about to them. You are on the verge of entering the wide, provoking, benevolent, hilarious and
addictive… [audio ends].
>> Xing Li: So from left to right the speech intelligibility is the same, but the overall level
dropped by five dBA. The level drop might have been even bigger for those peaks. Now let's listen to
the speech only, which is the signal we send to the loudspeaker.
[audio begins]. The eye and penetrates the brain. You want to tap the shoulder of the nearest
stranger and share it. The stranger might laugh and seem to enjoy the writings but you hug to
yourself the thought that they didn't quite understand its force and quality the way you do, just
as your friends, thank heavens, don't also fall in love with the person you are going on and on
about to them. You are on the verge of entering the wise, provoking, benevolent hilarious and
addictive… [audio ends].
>> Xing Li: Okay. And now let's listen to the second system.
[audio begins]. The eye and penetrates the brain. You want to tap the shoulder of the nearest
stranger and share it. The stranger might laugh and seem to enjoy the writings but you hug to
yourself the thought that they didn't quite understand its force and quality the way you do, just
as your friends, thank heavens, don't also fall in love with the person you are going on and on
about to them. You are on the verge of entering the wise, provoking, benevolent hilarious and
addictive… [audio ends].
>> Xing Li: Now let's do some analysis of those peaks and see what the differences are
between the initial system and the new approach. Here on both sides, so we are
looking at the long-term spectrum now, on both sides the red curve represents the noise
spectrum. It's a [inaudible] noise. The energy is concentrated around 2 kHz. The black dashed
line on both sides represents the original speech at 65 dB. The green curve here represents
the output by our initial system, while the blue curve represents the output by the noise
spectrum based approach. The green bar indicates the boosted gain in each critical band which
is basically the difference between the noise level and the reference level. One main difference
between those two is in the low frequency because the noise doesn't have much energy in the
low frequency, so with the new approach we won't boost the low frequencies much; it's
unnecessary. So where we see a big peak here, there is about a 10 dB drop here, and that's why
with the new approach the peak levels are noticeably lower. Let's listen to this segment
again, taken from the previous example.
[audio begins]. Don't also fall in love with the person you are going on and on about. [audio
ends].
>> Xing Li: Oh, so the audio system will generate big trouble for me.
[audio begins]. Don't also fall in love with the person you are going on and on about. [audio
ends].
>> Xing Li: So the main message here is we can maintain speech intelligibility in noise without
having to drive the speech level too high, and a lower speech level means lower audio power,
which is probably a minor effect for, yes, for car audio.
>>: When you are doing it per band, when you are essentially filtering the signal, it seems that
there would be some risk between perceptually changing what the person is actually saying
versus…
>> Xing Li: The sound and timbre will change, but the intelligibility won't, because, what I did not
go through is, we have a boosted gain per band and this boosted gain will change over time as
the noise changes, but again we apply a dual-time-constant integrator to smooth the gain to avoid
introducing any artificial temporal structure to the signal. And I guess your concern is whether,
if we change the spectrum of the speech, that will modify or affect the
intelligibility. Is that your concern?
>>: Well, if you're not doing it very short term then you won't change the perceptual quality.
You won't change intelligibility, but just the perceptual quality of it. I mean you can add some
spectral tilt to it.
>> Ivan Tashev: When you are on a phone call do you want intelligibility or high-quality?
[laughter].
>>: Yes. [laughter].
>>: For what it's worth, when you listen to it with the noise the perceptual quality is pretty
good, but when you listen to just the speech by itself, then you can really, that's when the
perceptual quality changes so much.
>>: [inaudible] critical band isn't it?
>> Xing Li: Yes. Since we are trying to, you know, solve the problem for listening in noise, so…
>>: So we are all right. [laughter].
>> Xing Li: The noise spectrum-based approach has the advantage of maintaining speech
intelligibility without driving the speech level too high, but because of the
dynamically changing filters it will, as Michael pointed out, modify the sound timbre, so it
won't be suitable for music. Now, so far the loudness control is only based on the noise and we
haven't made any use of the signal yet, so we asked, can we do something even better? For
example, because of auditory masking there are some frequency components in the playback
signal that we won't hear anyway, so why bother to make them louder? On the other hand,
there are some frequencies in the noise that won't mask our signal, so why try to overpower
them? Let's first talk about auditory masking in human perception. There are basically two
types of auditory masking, spectral masking and temporal masking. Spectral masking is
normally known as simultaneous masking. This figure shows the intra-band masking where the
masker and the signal occupy the same auditory channel. This figure shows the inter-band
masking, where a sound at low frequency generates masking towards higher frequencies.
Regarding temporal masking, backward masking, or pre-masking, refers to the phenomenon that
what happens now affects what happened in the past; it is not really understood why this
happens. What happens more frequently is post-masking: basically, a louder sound at this
moment will mask what's coming next. Let's go through this. We have, as before, the
processing flow from a noise spectrum. Here we are computing the masking pattern generated
by the noise. On the other hand, we have a processing flow from the signal. Here we are going
to compute the masking pattern generated by the signal and identify which components in the
signal are audible if there is no noise. So in the center block the boosted gain will depend
on how much gain we need to keep the audible peaks still audible when noise is present. Let's talk
about how we compute a masking pattern. This figure shows--I see Rico was… [laughter].
>>: [inaudible] embarrassing.
[laughter].
>> Xing Li: Yes. This is based on my learning from…
>> Ivan Tashev: [inaudible] just copied Rico's book [laughter].
>> Xing Li: No, no. I found this; I generated this by myself. [laughter]. Let's walk through this.
The blue curve represents the speech spectrum; we can see the harmonic peaks. The red
curve represents the masking pattern in each band. We can see that only the strong peaks will
be audible, while the weak components will be masked. Now consider temporal masking.
We can compute the masking pattern in the frequency domain, and taking this as a
threshold, we apply a dual-time-constant integrator: the choice of the release time constant
should reflect post-masking, while the attack time basically affects how quickly we can turn up
the gain when the noise goes up. Now let's talk about the boosted gain. Assuming we know the
noise masking pattern and we know which signal components are audible, how much boosted
gain do we need to make the signal stay audible? I think that is an open question. In our
experiments we boosted the signal peaks to be 3 dB higher than the noise masking threshold.
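Here is a minimal sketch of that masking-based boost, assuming the per-band signal peak levels, the signal's own masking pattern, and the noise masking threshold have already been computed; the exact audibility test and the names are illustrative, not the talk's actual implementation:

    import numpy as np

    def masking_based_boost_db(signal_peak_db, signal_mask_db, noise_mask_db,
                               headroom_db=3.0):
        """Per band, boost only the signal components that would be audible in quiet
        (above the signal's own masking pattern), raising them to 'headroom_db' above
        the noise masking threshold. Components already above that target, or
        inaudible in quiet, get no boost. The audibility test here is an assumption."""
        peaks = np.asarray(signal_peak_db, dtype=float)
        audible = peaks > np.asarray(signal_mask_db, dtype=float)   # audible without noise
        target = np.asarray(noise_mask_db, dtype=float) + headroom_db
        return np.where(audible, np.maximum(target - peaks, 0.0), 0.0)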
Now let's listen to some sound demos. In the previous slides we have already listened to those
two columns, so I will just play the new one, with the same narrowband speech in traffic noise
example.
[audio begins]. The eye and penetrates the brain. You want to tap the shoulder of the nearest
stranger and share it. The stranger might laugh and seem to enjoy the writings but you hug to
yourself the thought that they didn't quite understand its force and quality the way you do, just
as your friends, thank heavens, don't also fall in love with the person you are going on and on
about to them. You are on the verge of entering the wise, provoking, benevolent hilarious and
addictive… [audio ends].
>> Xing Li: Now listen to the speech-only signal. It's the one we send to the loudspeaker.
[audio begins]. The eye and penetrates the brain. You want to tap the shoulder of the nearest
stranger and share it. The stranger might laugh and seem to enjoy the writings but you hug to
yourself the thought that they didn't quite understand its force and quality the way you do, just
as your friends, thank heavens, don't also fall in love with the person you are going on and on
about to them. You are on the verge of entering the wise, provoking, benevolent hilarious and
addictive… [audio ends].
>> Xing Li: So the message here is we can maintain speech intelligibility in noise but at a
moderate speech level. The most important point is that we can avoid the very unpleasant high
peak levels, because when the background noise is already overwhelming, we don't want to
overload our ears with speech at a high level on top of it. So avoiding those unpleasant peaks
and keeping speech at a moderate level is very important from a perception perspective. Let me
conclude what we have done so far. From the user study data we found that there is a strong
relationship between the preferred listening level and the background noise level, so that
justifies that it is possible to design a dynamic loudness control system that sounds pleasant.
The relationship is different for music and for speech. Based on the user data we have
developed our initial system. This system has demonstrated its convenience for listening in
time-varying noise, and that is its advantage. At the same time it has drawbacks;
the main concern is the high speech level in the output when the noise goes up. So to
overcome the limitations of this initial system, we have developed the noise spectrum based
loudness control. The processing is based on critical bands, and we have demonstrated that
we can maintain the same speech intelligibility as the previous
approach but keep the speech level moderate. From here to here the processing gets
a little more complicated. The auditory masking-based approach depends on both the noise
and the signal. So far, to me, or to some people, this approach sounds fairly
similar and equally intelligible, although because this approach considers the auditory
masking, it can drive the speech level even lower. I want to say the message again: a lower
speech level is not just about lower audio power; from a perception perspective,
a lower level is better for your health and it sounds pleasant. Okay. Let's now talk
about future directions. There are some very important parameters, for example the time
constants; the choices in those examples are based on informal listening by members of our
group, so as a next step we will need to run formal user studies to verify the time constants,
the thresholds, and all the other parameters, and to find the optimal choices for speech and for
music in different scenarios. The second point is that so far, in
the examples that we saw, we can maintain speech intelligibility at a lower level; we need
to verify this by running more formal tests. A third point, and this one is a little bit tricky.
In the auditory masking-based approach, we compute a masking pattern and based on
it we compute the boosted gain, but as we change the boosted
gain, the masking pattern changes at the same time, so it's unknown whether there is a better
approach to calculate the boosted gain. The fourth point: to adapt this for practical use, if we
want to put this in a product, setting power limits to meet the health requirements
is very critical, and in some scenarios, when the noise level goes enormously high, we would
probably have to do something more humane, for example shut down the whole system instead
of generating really loud speech or music, which would probably be less optimal. That's all of my talk.
Thank you. [applause].
>> Ivan Tashev: Thank you Xing. Questions please?
>>: Xing?
>> Xing Li: Yes?
>>: To estimate the noise level you must use a microphone, right?
>> Xing Li: Yes.
>>: And that type of microphone is what people use for communicating?
>> Xing Li: Yes.
>>: So when I am having that conversation with you and I'm in the car…
>> Xing Li: Uh-huh.
>>: Is that measured, that should be measuring my signal.
>> Xing Li: Uh-huh.
>>: But how does it know it's my signal and not the noise?
>> Xing Li: That's an advanced problem… [laughter]… in the audio processing pipeline. The
system should be smart enough to figure out if it is the near talker, if there is…
>> Ivan Tashev: [inaudible] noise models only when no one in the car is talking and there is no
sound through the loudspeakers.
>>: But let's say we are talking on the phone and you are quiet. I am talking. I am in the car.
>> Ivan Tashev: You're talking.
>>: I am in the car and you are quiet. You're just listening to me
>> Ivan Tashev: Yep.
>>: The signal that is picked up is presumably just my signal. How does the system know that
that is not the noise?
>> Ivan Tashev: There is a voice activity detector which is combined with the [inaudible]
detector and we…
>>: The [inaudible] detector should be doing nothing, right because there is no other sound.
>> Ivan Tashev: There is a voice activity detector. This is part of the noise suppression idea. I
know when the local person is talking. I know when I have a loud signal in the loudspeakers.
>>: So the noise you are measuring is the noise in the cabin?
>> Ivan Tashev: In the cabin without the…
>>: How can you tell the noise like the child in the background making noise from the speaker
or from the truck?
>>: Well, that is noise, right?
>>: How is that different from my voice which is not noise?
>> Ivan Tashev: There is a voice activity detector. If there is a signal which statistically goes
above the threshold, it is not considered when we build the noise model, because we need
the noise model not only for this, this is kind of a side effect; we need the noise model for acoustic
separation. Well, there is a quite sophisticated system for building the noise model in
[inaudible].
>>: If there is a person in the passenger seat talking is that considered…
>> Ivan Tashev: It will not be considered as a noise to some degree. If it goes statistically too
low or too close to the noise at some point it will drop it. Otherwise it will be feedback. The
system starts the music [inaudible] the speaker [inaudible] [laughter].
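As a rough sketch of the gating Ivan describes, assuming a per-bin noise spectrum estimate updated by exponential averaging (the smoothing factor and function name are illustrative, not the actual pipeline's implementation):

    import numpy as np

    def update_noise_model(noise_psd, frame_psd, speech_active, playback_active,
                           alpha=0.95):
        """Update the per-bin noise spectrum estimate only when the voice activity
        detector reports no near-end speech and nothing is playing through the
        loudspeakers; otherwise keep the previous estimate. 'alpha' is illustrative."""
        if speech_active or playback_active:
            return noise_psd  # freeze the model while someone talks or audio plays
        return alpha * np.asarray(noise_psd) + (1.0 - alpha) * np.asarray(frame_psd)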
>>: Another point that you had on future directions: I wondered how much you guys
thought about the power limits. What approach could be borrowed from some of the
techniques that the radio stations use, because they do have the [inaudible], so I think was a
little bit of [inaudible] also has the benefit of driving [inaudible] amplifier and the speakers
because as you try to compensate, you may get significantly close to the limits where the
distortion would go very high, but if you apply, in addition to all that you put some damping on
that you might drive the actual peak levels to lower levels and then you could improve that
[inaudible] overall as well.
>> Ivan Tashev: So we try to please [inaudible] any kind that works. The telephone speech it's
already here [inaudible].
>>: Oh, because it came companded already…
>> Ivan Tashev: It already came companded, so technically it may have a good effect on
natural speech where you are trying to maintain the [inaudible] and understandable, but for
telephone it's heavily [inaudible]. You can just see the peaks are equal [inaudible] response
from the [inaudible].
>>: Yeah, because it may have been through [inaudible].
>> Ivan Tashev: It's all in the calibrates.
>>: Yeah [inaudible].
>>: So how computationally complex is this?
>> Xing Li: I think the initial system is computationally fairly light. This is the second approach;
well, this one, we would recommend this one for music, because all we need is to take the noise
spectrum, calculate the overall level in dBA, take a look at the user preference data, and adjust
the gain. The user can always reset the preset levels, because some people prefer louder music and
some people prefer softer music, so they can always change the preset levels. All we are
doing is changing the dynamic level increase [inaudible].
>> Ivan Tashev: Xing, because I know from my code days [laughter] this is the better frame
[inaudible], that's it. The simplest, stupidest voice activity detector is pretty much
computationally equivalent to the first system.
>>: So how come [inaudible] complicated is the curve and gaining control based on…
>> Xing Li: This one, this one…
>>: Go the way of the last one. How complicated is that one? How many [inaudible] do we
have?
>> Xing Li: I think for telephone speech between 250 and there are, this is [inaudible] band,
something around below 20 I believe, less than 20.
>> Ivan Tashev: That is more complex than [inaudible]?
>>: Bless her heart but it's not.
>> Ivan Tashev: Nothing is of computational concern here besides converting the speech for the
loudspeaker [inaudible] frequency domain and back, which you will do anyway for acoustic
diagnostics.
>>: Cool. I want it tomorrow. [laughter]. The music thing is really, that's really good.
>>: How would that work with, you know, I have a soft top Mustang. How would this work
with soft top? Ciao.
>> Ivan Tashev: Don't want to start [inaudible].
>>: Yeah, yeah.
>> Ivan Tashev: And maybe they should on some of them.
>>: He needs to borrow your car. [laughter].
>> Ivan Tashev: Yeah, I was going to say. [laughter]. [inaudible] automobile some car
[inaudible] on top.
>>: Next year.
>> Ivan Tashev: Perfect, around springtime we will agree to borrow it for the summer.
[laughter].
>>: Travel with me. [laughter].
>>: With regard to your different attack times for speech and music, so assuming I am not
having a phone conversation. I am just listening and there is music and speech, how does your
system identify…
>> Xing Li: Music or speech?
>>: Yeah.
>> Xing Li: Everything comes from the audio [inaudible].
>> Ivan Tashev: For example, if you have a radio talk show, it is considered the same as music
and you may need some certain points…
>>: So better to err on the music, the slow attack side…
>> Ivan Tashev: But that may not be that critical when you have a phone call.
>>: Yeah.
>> Ivan Tashev: So pretty much radio, CD, media player, these go through the same type of
settings for music, and [inaudible] speech is done with the settings for speech, for the phone call
channel and for the [inaudible] speech.
>>: Did you try different music genres to see if some genres, like ones with typically longer
phrasing and things like that, would respond differently to different attack times?
>> Ivan Tashev: [inaudible].
>>: I would think that maybe rock or pop as opposed to say chamber music might be…
>>: [inaudible].
>>: So is it?
>> Xing Li: You are, is there a difference in the time constant in the dynamic control for
different types of music or are you talking about the relative level increase between…
>>: Yeah, just assuming that like different types of music is not like a constant thing. Some
music is very busy and there are lots of attack transients in like the percussive impulses and
other music is very much the opposite.
>> Xing Li: Uh-huh. I did not present this, but among the four types of music the difference between
pop ballads and rock is much bigger than, you know, this is [inaudible] very different music.
We almost want to say there is a significant difference between them, but since we have only
19 users we set the significance test level at .01; if we set the threshold at .05 we would conclude
that pop ballads, which are music with a wide dynamic range, are significantly different from the
rock music. So I guess to answer your question, I
wouldn't--this set of data concludes that there is no significant difference, but, you know, it's a
human preference. In order to give a confident answer, I think you need to run more tests to
find out if there is a statistical difference.
>> Ivan Tashev: Just to conclude the user studies to see if there is a difference, the settings, I'm
not saying that they were just [inaudible], but kind of seemed [laughter].
>>: [inaudible].
>>: I'd also like to add, if the noise is so heavy that the automatic loudness changes are significant,
then it's not a suggestion that you forget about the attack [inaudible].
>>: Yeah, I know. It's just…
>>: Yeah, so there's no way to treat classical music with that because you would completely
destroy it.
>>: And that was the other question. Does the system cap out at some point? We don't
continue to compensate, right?
>>: For example, when Mike takes a top-down on his car you're going to quit trying to like
destroy my speakers, right? [laughter]. Personally, I think this system would be very helpful,
but I concluded that it's just not possible to listen to classical music in my car. It's just too noisy,
so…
>> Ivan Tashev: [inaudible] the dynamic range.
>>: So [inaudible] we apply heavy processing, there is no, it's not hi-fi. You just maintain
audibility of the music, but it's not studio quality.
>>: Have you determined how far to go?
>>: You mean, yeah. How much of what is you call it, frequency parameter thing, the optimal
gain, basically.
>> Ivan Tashev: The design of the car already decided this because you have some maximum
power unless you want some additional loudspeakers and…
>>: Because you are just trying to achieve my optimal preferred listening volume, right?
>> Ivan Tashev: And at some point you hit the maximum of the system of the car and that's it.
If it was necessary to have more power, presumably the carmakers would put it there. Of
course, there are people that we have seen that take the car system out and put in a subwoofer
that is under the driver system so that when there is a [inaudible] the driver kind of jumps,
but…
>>: Maybe that's what I want.
>> Ivan Tashev: More questions?
>>: Guess what? I think this is related to what you said: if the car happens to be one of
those that has active noise cancellation using the hi-fi speakers, then you've got to be careful
how you make noise, more noise [inaudible] you would need to adjust the…
>> Ivan Tashev: We may have problems, definitely.
>>: You need to kind of jointly design the two so that one doesn't mess up…
>>: But then, a real car…
>> Ivan Tashev: Yes, but I think those cars which have [inaudible] isolation, they can afford to
pay for additional [inaudible] [laughter].
>>: High-end stuff, but [laughter]. [inaudible] some Mercedes already have them.
>>: That's beyond my [laughter] my horizon. [laughter].
>>: Just [inaudible] it's not a typical Microsoft market, but…
>> Ivan Tashev: It may cause problems. We should look into that.
>>: Yeah. All right, cool.
>> Ivan Tashev: Anymore questions? If not let's thank Xing for coming.
>> Xing Li: Thank you. [applause].