>> Ivan Tashev: Good morning, everyone. We are kind of past the legal five minutes, so
I think we should start. It's a great pleasure for me to introduce our visiting speaker
today, Professor Peter Vary. Looking at the audience and knowing who is going
eventually to watch us online, it is needless to tell who Professor Vary is. I will just finish
the introduction with his current job title. He is the director of the Institute -- Okay let me find it -- of Communication Systems and Data Processing at RWTH Aachen University. Without further ado, Peter, you have the floor.
>> Peter Vary: Yeah thank you very much, Ivan, for inviting me and hello to all of you.
It's now, in 2013, 20 years since we've had the first true digital mobiles in commercial
GSM networks. One of the first mobiles was the Motorola 3200. I don't know how the
name was chosen but that was approximately the price in Deutsche Mark, and the
weight was 580 grams. And during the first years of GSM I would say the main focus
was on reducing the size and increasing the talk time of the mobiles. And in the second
ten years we saw that the computation capacity of the mobile was increasing so much
that we could do other things like having a phone call, we had the cameras, we had the
calendar and we had the apps. Nowadays, it's not a mobile phone any more; let me say it's mainly a mobile computer, a mobile apps computer, and the price is not really what you pay in Germany; that's a matter of the companies' advertising. But the weight is
about 110 grams, so reduction in size and in weight and a dramatic increase in terms of
complexity and capabilities.
And now a question arises: what can we do to improve in the future? And there are different approaches. One is, I would say, the distinguished British approach by the company Vertu, which is connected to Nokia: very precious devices. An alternative solution is from an Austrian designer-jeweler, the Kings Button iPhone. But that's not what I would
like to talk about. As I said, we have now 20 years of mobile phones. We have seen that
it's still a phone. You can still use it as a phone, but during the last ten years we have not
seen too much progress in terms of speech quality. And that's the topic of my talk: the
path towards high-definition telephony. So let's see what is high-definition telephony.
I will go through some issues like speech coding as such and HD-voice and model-based bandwidth extension; that might be an important keyword. I will address error
correction and have some ideas about a future step at the end.
Now we have the Plain Old Telephone System, which is rooted in history. In the old analog days, the limited capacity of long-haul transmission links required a narrow frequency band per channel, and this is why we have this old limitation to the telephone bandwidth, which is 300 hertz to 3.4 kilohertz. And we know how to use the telephone but
actually the intelligibility of syllables is only 91 percent; that is not too much. And quite
often, we need a spelling alphabet on the telephone because we don't get the right
syllables.
HD-voice is at least 7 kilohertz but we include also the lower frequencies down to 50
hertz. The technical term is wideband but the marketing people invented the term HD so
everyone understands what it is. And here we see the bandwidth which is increased
significantly but there is even more than wideband, there is also super-wideband up to
14 kilohertz.
So the low frequencies increase naturalness and comfort because the fundamental
frequency of speech is mostly not transmitted on the telephone. And if you include that, it
sounds more natural and you can distinguish between the daughter and the mother on
the telephone; that might sometimes be important. The higher frequencies contribute to
clarity and intelligibility, and if we increase even further we can improve brilliance and
quality of music.
Quite often only a minor part of the speech spectrum is transmitted. This is the
telephone band; this is unvoiced sound. And we see even more energy beyond the
telephone band and if we include that in terms of wideband or HD coding, we will
significantly improve the quality and the intelligibility of syllables. So we have these
categories: narrowband, wideband, super-wideband -- sometimes it's called HD Plus -- and fullband if that is needed as a reference.
Now I would like to address speech coding. More and more we are talking about
speech-audio coding, speech-audio signal processing. But for the time being in the
mobile phones we have pure speech coding. And the simplified explanation of what the
mobile phone is about is given here. There is a model of the speech production
mechanism of the vocal tract. We have the vocal folds and we have the vocal tract, and
that can be described by a model where we have a generator and a filter, a time-[inaudible] filter. And this is what we see at the receiving end in the mobile: there is a model of
speech production that is implemented. And over the air we transmit the parameters
which are needed to drive the model, and at the transmit side we extract from the
speech the model parameters. So actually it's some kind of naturally-sounding vocoder that we are using there, but it is more and more a mixture of parametric coding and waveform coding.
Among the most important codecs we have in the GSM mobiles is the very old Full Rate codec from 1988. It was very simple, and then we had improvements in terms of technology
when microelectronics became more powerful. And ten years later we have what is
called the Adaptive Multi-Rate codec. And later on we had to add the term NB, narrowband. So it's telephone bandwidth; the sampling rate is 8 kilohertz. It's much more complex and it has different bit rates. It goes down to 4.75 and the maximum bit rate is 12.2 kilobits per second; that
is the codec which is preferably used in the mobile phones. That's also called the
Enhanced Full Rate codec, 12.2 kilobits per second. And it's based on the model of
speech production in terms of code-excited linear prediction. We have here two stages
of the filter, the vocal tract, the vocal folds, the gain and some excitation generator which
is a vector codebook. And what is transmitted over the channel is the vector address.
Typically it is 40 samples, that is, 5 milliseconds: a vector and a gain factor, which are filtered to produce 40 samples. So that seems to be very simple. The secret is how to extract at the transmit side the best vectors, the gains, the coefficients such that we can code it with a very low
bit rate and still get excellent quality at the receiving end.
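To make that concrete, here is a minimal decoder-side sketch in Python/NumPy of the principle just described: a codebook vector scaled by a gain, combined with a pitch (long-term) contribution and passed through an all-pole vocal-tract filter to produce one 40-sample, 5-millisecond subframe. The codebook, gains and filter coefficients below are invented placeholders, not the tables of any standardized codec.

```python
import numpy as np
from scipy.signal import lfilter

def celp_synthesize_subframe(codebook, index, gain, pitch_lag, pitch_gain,
                             lpc_coeffs, adaptive_buffer):
    """Reconstruct one 5 ms (40-sample) subframe from the transmitted parameters."""
    # Fixed-codebook contribution: the transmitted address selects one vector.
    excitation = gain * codebook[index]
    # Toy long-term predictor (vocal folds / pitch); assumes pitch_lag >= 40 samples.
    start = len(adaptive_buffer) - pitch_lag
    past = adaptive_buffer[start:start + len(excitation)]
    excitation = excitation + pitch_gain * past
    # Short-term synthesis filter 1/A(z) (vocal tract), all-pole.
    speech = lfilter([1.0], np.concatenate(([1.0], lpc_coeffs)), excitation)
    return speech, excitation

# Toy example with a random sparse codebook of 2**10 vectors of length 40.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((1024, 40))
buffer = rng.standard_normal(200)          # past excitation for the pitch predictor
lpc = np.array([-0.9, 0.2])                # placeholder A(z) coefficients (stable)
frame, _ = celp_synthesize_subframe(codebook, index=417, gain=0.7,
                                    pitch_lag=55, pitch_gain=0.6,
                                    lpc_coeffs=lpc, adaptive_buffer=buffer)
```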
The motivation for designing the Adaptive Multi-Rate codec was not to improve the
quality. The motivation behind that was, usually we have a full rate channel in GSM that
is 22.8 kilobits per second including source coding and error protection. And if we can
keep that rate and reduce the bit rate for speech coding, we can increase the number of
bits for error protection. And that's what has happened with the AMR codec. If you have the full rate channel, the gross bit rate is always 22.8. If the mobile is outside, we use the Enhanced Full Rate codec, highest bit rate, highest quality. If a channel gets worse,
maybe a mobile is inside a building, then the channel gets bad and we increase
dramatically the error protection and still have reasonable quality but the quality will
never be better than if we use the highest bit rate.
So the motivation and the objective of the AMR Narrowband codec is to save money from the point of view of the operators, because if you would like to improve the in-house coverage, you would otherwise have to install more base stations outside the buildings or even inside the buildings.
So it's a reasonable and a good solution at that time to improve the service quality in the
mobile networks by introducing the Adaptive Multi-Rate codec. And it can switch very
quickly between the different bit rates, every 40 milliseconds.
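As a rough illustration of that trade-off, using the published AMR narrowband source rates and ignoring the extra CRC and signaling bits that the real GSM channel coding also carries, the fixed 22.8 kilobits per second of the full rate channel are simply split between speech bits and protection bits:

```python
# GSM full-rate traffic channel: the gross rate is fixed.
GROSS_RATE_KBPS = 22.8

# The eight AMR narrowband source rates (kbps); 12.2 is the Enhanced Full Rate mode.
AMR_NB_MODES = [12.2, 10.2, 7.95, 7.4, 6.7, 5.9, 5.15, 4.75]

for source in AMR_NB_MODES:
    protection = GROSS_RATE_KBPS - source   # everything left over goes to error protection
    print(f"source {source:5.2f} kbps  ->  ~{protection:5.2f} kbps for channel coding")
```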
There is a half-rate channel where the bit rate is just half of that. I will not go more into
the details. Now we have seen high-definition is at least 7 kilohertz. And there are some
examples like Skype, and I guess this is the first time users became aware that a
telephone could be more than just telephone quality. There is Voice-Over IP, the OPUS
codec. I should add here Windows Live messaging system. And we see there is already
standardization; very, very long ago in 1985 we designed the first wideband coding. The
ITU standard for ISDN -- It was designed for ISDN networks to increase quality. And now
the patents are not valid anymore, and we see the first products. And one of the
products is called DECT Plus; that's the digital cordless telephone and it is used in
combination with a SIP protocol over the Internet. But it's difficult to use; the normal
customers, I'm sure, will not use that because you have to combine the telephone with a
router and you have to install software and you have to know someone who has it. So
it's far from being the standard in the fixed networks so far.
Then we have in the 3GPP domain what is called the AMR-Wideband. That's the next
standard. Adaptive Multi-Rate Wideband standard. Interestingly in 2001, more than 12
years ago, it was designed. At that time we had no third generation and no fourth
generation; it was designed for GSM. It could be used in GSM. It was never used in
GSM. And there are some activities to introduce it in 3G and 4G mobile networks, but for quite a while there have been some mobiles on the market which have the AMR-Wideband. And it is used, but only for ringtones, not for communication; but that might change.
Now let's try to get some idea about the quality. This is narrowband quality, the standard
quality in GSM.
[Audio playback begins]
>>: Slow-cooked hippopotamus in red wine: a recipe for special occasions, provided
that the hippopotamus feels comfortable in red wine. Slow-cooked hippopotamus in red
wine: a recipe for special occasions provided that the hippopotamus feels comfortable in
red wine.
[Audio playback ends]
>> Peter Vary: Now the question is: can we do more, and would it be a significant improvement? Does it make any sense to improve in terms of bandwidth for speech? So let's
listen to 7 kilohertz wideband.
[Audio playback begins]
>>: Slow-cooked hippopotamus in red wine: a recipe for special occasions. Slow-cooked hippopotamus in red wine: a recipe for special occasions, provided that the
hippopotamus feels comfortable in red wine.
[Audio playback ends]
>> Peter Vary: And clearly the naturalness and the transparency increases even more.
But we have not only speech, we also have music. Let's listen to GSM music.
[Music playback]
>> Peter Vary: So we could, even in the mobile networks, transmit audio quality. There
are some codecs to do that. And the bit rates are such that we can transport them in the
3G networks. In the 4G networks, LTE, there is only packet transmission, so it would be no problem to introduce that in principle. The objective of the AMR-Wideband codec is to increase quality and intelligibility for the first time. And interestingly the bit rates go down, according to the AMR principle, to 6.6 kilobits per second; then it can be filled up by error protection. The quality is not as good at 6.6 as at
24 kilobits per second. If you try to find out where in the world is HD-voice you'll find that
at the Global mobile Suppliers Association, and they keep track of the announcements
of the operators. And the latest count is that there is HD-voice launched or announced in
61 mobile networks, mostly 3G, in 35 countries in the world. If we look at Germany, it's not that clear. We have pretty good coverage for 3G and 4G. For the main operators, it looks
very similar. And all of them are saying, "We have HD-voice," but if you go to the shop
no one can tell you how to use that because it's partly installed in the network. And it's
not everywhere because we have different suppliers for the infrastructure, and it's
obviously a strong effort to implement HD-voice in the network.
In some countries we have it live, like in France. In the UK there are several 3G
networks which offer HD-voice. And meanwhile, we have plenty of mobile devices which
can do that. And this of course is the next interesting problem for the speech processing
community: if we install HD-voice in the network, we have to modify quite a lot. To give
you an example, a German-wide coverage of GSM requires about 19,000 base stations.
And if you install such a new codec, you need a new speech codec and you need a new
channel codec. In the mobiles, that's not an issue because you buy a new mobile. The
base station is there and you just need a software update at the base station. But it's like
with the PCs, if you've ever tried to update your PC which is ten years old, install the
latest software, there might be some problems.
So it's far from trivial to introduce HD-voice in the network because you need a lot of modifications and protocols. The coding part is installed in the mobile switching centers; you don't have so many of those. That's not [inaudible], but the base stations are the point. And if you have done that, you have two separate worlds.
The customer buys a new mobile and he knows, hopefully, a few friends who also have
the HD service, but there is no cross-connection in terms of quality between the
narrowband and the wideband terminal. And it will take a very, very long time until or if
ever the whole telephone network has been converted to HD-voice. And then, we can
study all kinds of interconnections between HD and narrowband mobiles over HD and
narrowband networks. For example, such a guy who has the new HD mobile phone is happy if we have the complete chain in HD. But if the user at the other end is narrowband, he gets only narrowband quality. Even if the user here is connected by a 3G network -- that could be 3G, LTE, that could be GSM -- then he cannot profit from the HD capabilities because it transmits only narrowband noise and narrowband speech. It might be that the HD terminal is connected to a narrowband network; then they also have only narrowband quality. And so we have different solutions.
Now I would like to study these cases. Case A is the happy case where we use the AMR-Wideband codec or the AMR-Wideband Plus codec. In the other situations we try to improve. We could try to improve by -- the keyword is now bandwidth extension. We transmit just the narrowband signal, and in the mobile we try to exploit the model of speech production and some characteristics of the human ear to artificially increase the quality and eventually the intelligibility.
From an information theoretical point of view, that's very questionable. We have here the
specific source and we have a model, and we can do something. I will demonstrate that.
If the narrowband mobile is connected via the 3G network, we could also implement this
Artificial BandWidth Extension in the network; therefore, I'm saying it's two cases, B1
and B2. It doesn't make a big difference. The question is, where is the processing
power? If we have only the narrowband network but two HD terminals, it seems that we can do nothing, but we can do something if we succeed in transmitting, in a compatible way, side information for Artificial BandWidth Extension.
We could use, as in the first case B1, Artificial BandWidth Extension at the receiving end, but the quality improvement is limited. As soon as we can transmit side information which drives the BandWidth Extension, the quality becomes much better. Some of the codecs have the element of Artificial Wideband Extension as part of
the standard already. So the idea is the narrowband network, let's say, transports only
PCM or an AMR narrowband codec and we embed some information. We will see how
to do that.
And we try to do that in a compatible way such that a narrowband terminal gets
narrowband quality and will not be disturbed by the embedded information which can be
used by the HD terminal to expand the bandwidth. The first case, just receiver-based -- that was Scenario B1, or B2 where the BandWidth Extension is in the network -- tries to estimate some characteristics of the speech and to expand the speech artificially. It is a mixture of speech recognition, estimation and speech synthesis. And in listening tests I
can tell you the listeners clearly prefer Artificial BandWidth Extension, and in certain
occasions if we have background noise, Artificial BandWidth Extension will even
increase intelligibility.
One approach my former PhD student Peter Jax had was purely based on linear
predictive coding. What we are doing here, we receive narrowband speech, let's say,
that's already decoded -- It's PCM -- and the first step is to interpolate from 8 to 16
kilohertz. Then, we apply a pattern recognition approach where we extract some
features. We have a statistical model. The goal is to estimate the spectral envelope of
the wideband speech. So at the end the wideband speech will be reconstructed by
applying a wideband synthesis filter. And the goal is to estimate from parameters which
are extracted from the narrowband signal, the envelope of the wideband signal.
And it uses a codebook. The codebook could be, for example, just the LPC or the LSF
quantization codebook. For the LPC information it uses Bayes estimation, and what we get is the LPC or reflection coefficients or cepstral coefficients of the wideband filter. And we
apply the analysis filter to the over-sampled narrowband speech, and what we get is the over-sampled residual signal. And if the signal was limited to 3.4 kilohertz, the residual will also be limited to 3.4 kilohertz. If the prediction goes quite well and, let's say, if the sound was unvoiced, we have a flat noise spectrum up to 3.4 kilohertz, but we need it up to 7 kilohertz. So it's not too difficult to imagine how to expand the narrowband flat noise to a wider flat noise just by spectral replication or spectral translation.
If the speech is voiced then we have here a harmonic narrowband excitation with more or less constant spectral components. It has a harmonic structure and it's not too difficult to imagine how to expand that to a wideband excitation by spectral repetition, folding or modulation. I can tell you, the extension of the excitation signal is very easy and it's not critical at all. The art of BandWidth Extension is how to estimate the envelope. And then we can apply this wideband excitation to the wideband synthesis filter and get hopefully decent quality. So this is a little bit about the theory. You get some idea that it is pattern recognition and conditional estimation. I will not go through the details. And I would like
to play one example.
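Here is a minimal sketch of the uncritical part of that chain, the excitation extension: the narrowband residual is interpolated to 16 kilohertz and spectrally folded around 4 kilohertz by a ±1 modulation, then shaped with the estimated wideband envelope. Filter orders and cutoffs are illustrative assumptions, not the details of Peter Jax's implementation.

```python
import numpy as np
from scipy.signal import resample_poly, butter, lfilter

def extend_excitation(residual_nb_8k, fs_wb=16000):
    """Fold the narrowband LPC residual to obtain a wideband excitation (sketch)."""
    # Interpolate the 8 kHz residual to 16 kHz; its content now occupies 0-4 kHz.
    r16 = resample_poly(residual_nb_8k, up=2, down=1)
    # Spectral folding: multiplying by (-1)^n mirrors the spectrum around 4 kHz,
    # so the 0-4 kHz content reappears between 4 and 8 kHz.
    folded = r16 * ((-1.0) ** np.arange(len(r16)))
    # Keep only the new high band of the folded signal and add it to the low band.
    b, a = butter(6, 4000, btype="highpass", fs=fs_wb)
    return r16 + lfilter(b, a, folded)

def shape_with_wideband_envelope(excitation_16k, lpc_wb):
    """Apply the estimated wideband synthesis filter 1/A_wb(z) to the excitation."""
    return lfilter([1.0], np.concatenate(([1.0], lpc_wb)), excitation_16k)
```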
What we see here is a spectrogram, and it switches between narrowband and the
artificially extended signal.
[Audio playback begins]
>>: Well, three or four months run along and it was well into the winter now. I had been
school most all the time and could spell and read and write just a little. I could say the
multiplication table up to six times seven is thirty-five, and I don't reckon I could ever get
any further than that if I was to live forever. I don't take no stock in mathematics anyway.
[Audio playback ends]
>> Peter Vary: So the quality is significantly better. And if there is some background
noise, tests indicated that even the intelligibility is increased. The second approach for
Artificial Wideband Extension is to embed information, to hide information in a
compatible way in the bitstream. That works with Scenario C, and we will see here two
different versions how to implement that. We add some side information, so we transmit
a narrowband bitstream or narrowband speech samples over a narrowband telephone
network. The normal narrowband terminal should not be confused by this hidden
information. And the bit format, the stream, the frames and all that should be the same
as before or, let me say, at least according to the specified format. And the HD terminal
could extract the hidden information to do much better BandWidth Extension with the
transmitted side information.
No increase of the bit rate. No modification of the bit stream format. And I would like to
discuss this with an alternative version, how to expand the bandwidth. The first one was
to use purely LPC-based techniques, LPC analysis filter expansion and synthesis. The
second one is to split the signal, the wideband signal, into a lower and an upper band. So, with sub-band processing we need not do anything with the low-pass band, but we try to re-synthesize the upper band, the high band from 4 to 7 or 8 kilohertz.
But the bitstream is according to a narrowband signal. What we have here is the baseband decoder, let's say, the AMR narrowband codec at 12.2 kilobits per second. We get the narrowband signal; it's sampled at 8 kilohertz. We have a two-channel filter bank which does the interpolation job to obtain the enhanced signal sampled at 16 kilohertz. And here's the extension band synthesis, and the first approach is to hide data by watermarking techniques or steganography in the bitstream. For example, 2 kilobits per second, that's quite a lot, but we don't need so much for wideband extension, to extract some parameters to do here the BandWidth Extension. And I would like to explain once more, using the AMR codec, the simplified block diagram here, the vector codebook for the
gain vectors, and this filter which has two stages which we have seen before. But the
point is to exploit the properties of the vector codebook or the vector search. Without
going too far into the details, I can say that at the transmit side, finding the best vector is
done by a procedure which is Code Excited Linear Prediction or analysis by synthesis or
you could also say trial and error because you try out which is the best excitation. And
there are many, many different approaches how to design a codebook and there are few
approaches how to design a search procedure in a codebook which is, let me say, specified by the [inaudible] kind of search. It is not a codebook which is stored, but it's constructed iteratively by some search procedure. And this is the optimization criterion: H is a matrix where we have the impulse response of the synthesis filter. The vectors have length 40, that is 5 milliseconds, and this has to be optimized. And in the
standards there is for complexity reasons a non-exhaustive search because in any case
the codebooks are very, very sparse. The dimension is 40, and typically the number of
codebook entries is 2 to the power of 10: 1,000 roughly. Here the dimension is only 2
and in the codebook we have here obviously only 4 entries to explain the principle. So
sparse codebook.
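Written out in the usual CELP notation (a textbook formulation, not copied from the AMR specification), with x the target subframe, c_k the k-th codebook vector, g the gain and H the convolution matrix of the weighted synthesis filter's impulse response, the criterion is:

```latex
(k_{\mathrm{opt}}, g_{\mathrm{opt}})
  = \arg\min_{k,\,g}\; \bigl\lVert \mathbf{x} - g\,\mathbf{H}\,\mathbf{c}_k \bigr\rVert^2 ,
\qquad
g_{\mathrm{opt}}(k) = \frac{\mathbf{x}^{T}\mathbf{H}\,\mathbf{c}_k}
                           {\lVert \mathbf{H}\,\mathbf{c}_k \rVert^{2}} ,
\qquad
k_{\mathrm{opt}} = \arg\max_{k}\;
  \frac{\bigl(\mathbf{x}^{T}\mathbf{H}\,\mathbf{c}_k\bigr)^{2}}
       {\mathbf{c}_k^{T}\mathbf{H}^{T}\mathbf{H}\,\mathbf{c}_k}
```

The standardized, non-exhaustive search evaluates the last ratio only for a structured subset of the sparse candidate vectors.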
And the first idea to use watermarking techniques could be to split the codebook into two
different codebooks. No, that's too fast. The codebook -- To use -- Yeah. To split the
codebook such that we, let's say, use the green codebook or the red codebook. We just
separate it into two codebooks, and the hidden information is 1 bit. If you use the green
one, the bit is zero. And if we use the red one, the bit is 1. And the receiver can find out
which entries have been used and he knows the information which has been
transmitted. The disadvantage is that the quantization error increases because we have
reduced the size of the codebook. But as the codebook is that sparse, we can invent a
second codebook of the same size. And then, we have no loss in terms of quantization
errors. And the receiver knows, if the red or the green book is used, which bit is
transmitted, which polarity or sign of the bit is transmitted.
And this can be implemented in a modified, non-exhaustive search because if we study the AMR codec, we see that there are at least one thousand possibilities to specify a codebook of size 1,000, or let me say 2 to the power of 10, which means we can hide 10
bits in the codebook address of a vector of length 40. And as 40 samples is 5
milliseconds, that is 2 kilobits per second.
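A toy sketch of that partitioning idea, with a stored random codebook rather than the algebraic structure of the real AMR search: the hidden payload selects the sub-codebook the encoder is allowed to search, and the decoder reads the payload straight from the received address. Ten hidden bits per 40-sample subframe give the 2 kilobits per second mentioned above.

```python
import numpy as np

def embed_search(codebook, target, hidden_bits):
    """Pick the best codebook vector whose index carries `hidden_bits` in its low bits.

    Toy version of the codebook-partitioning idea: the search is restricted to the
    sub-codebook selected by the payload, so the chosen address itself is the watermark.
    (The real AMR scheme partitions the algebraic pulse positions, not a stored table.)
    """
    k = len(hidden_bits)
    payload = int("".join(map(str, hidden_bits)), 2)
    candidates = np.arange(len(codebook))
    candidates = candidates[(candidates & ((1 << k) - 1)) == payload]
    errors = [np.sum((target - codebook[i]) ** 2) for i in candidates]
    return int(candidates[int(np.argmin(errors))])

def extract_bits(received_index, k):
    """The decoder recovers the hidden payload from the low k bits of the address."""
    return [int(b) for b in format(received_index & ((1 << k) - 1), f"0{k}b")]

rng = np.random.default_rng(1)
cb = rng.standard_normal((1 << 12, 40))        # oversized codebook, 40-sample vectors
idx = embed_search(cb, rng.standard_normal(40), hidden_bits=[1, 0, 1])
assert extract_bits(idx, 3) == [1, 0, 1]
# 10 hidden bits per 40-sample (5 ms) subframe correspond to 10 / 0.005 = 2000 bit/s.
```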
So this information is used to drive the BandWidth Extension at the receiver, and here is
the sub-band approach. The high-band signal is here [inaudible] by controlling the time
envelope, by controlling the frequency envelope, by controlling the excitation generator
which produces noise or some periodic signal. And the bitstream is extracted in the decoder, in the part where the codebook address is decoded, to find out whether it is red or green. We have 1,024 different colors, and the color says which 10 bits have just been transmitted.
I explained that already. And let's study some examples. This is narrowband speech and
this is extended speech with the side information of 1.65 kilobits per second. We have
taken this because this, as indicated somewhere, became part of the standard of the
extension of the G.729.1 codec that Geiser designed. This solution included channel
coding 2 kilobits per second; the bandwidth is extended from narrowband to wideband
and he used that module to expand the AMR codec, but hiding the bitstream in a compatible way. Now let's listen to one example. The first test is to what extent a conventional narrowband mobile gets confused or disturbed: do the listeners recognize that there is something which has been modified in the background?
[Audio playback begins]
>>: He suffered terribly from a speech defect. He suffered terribly from a speech defect.
>>: Faulty installation can be blamed for this. Faulty installation can be blamed for this.
[Audio playback ends]
>> Peter Vary: And actually you don't recognize the modifications. Only the very
experienced listener sometimes can distinguish that there might be something different.
So it would be allowed to do that. And then, we can embed the 2 kilobits in the bitstream
of the AMR codec and we can compare the narrowband receiving terminal and the
wideband receiving terminal in terms of quality.
[Audio playback begins]
>>: To administer medicine to animals is frequently a very difficult matter and yet,
sometimes it's necessary to do so. To administer medicine to animals is frequently a
very difficult matter and yet, sometimes it's necessary to do so.
[Audio playback ends]
>> Peter Vary: And that sounds much better or significantly better than the artificially
expanded signal where we don't have side information at all. But the drawback of that
solution is we have to transmit the bitstream as it is. As soon as we have transcoding in
between, let's say a call from a mobile to a different network and if there is PCM, for
example, in between, the hidden information gets lost.
But if the transmission is tandem-free, which is one of the options in GSM, then it would work. And the nice idea would be, the operator just has to sell two mobile phones to two people, without modifying the network at all. If they are in the same network, they would, from the very first usage of the telephone, have wideband quality. But there is a second solution, and the second solution is to avoid this tandeming problem and to hide the signal in the samples, not in the bitstream. And the idea is just the following:
here we have the wideband spectrum. And we take the spectrum from 4 kilohertz to 6.4
for a certain reason, and we compress this and place this in the gap between 3.4 and 4
kilohertz. If we have a narrowband signal and if we have a clean telephone bandpass, the stop-band frequency of the bandpass is 3.4 kilohertz, and between 3.4 and 4 we can transmit something else. So it is in-band signaling, but an in-band compressed transmission of the high frequencies. If it is periodic, it will be compressed, and you could also say it's pitch scaling or spectral compression, and we do that in the frequency domain.
The detailed solution will be explained next week by Bernd Geiser at the ICASSP. The
processing is done in the DFT domain. We have a first analysis with a large window and a
second analysis of the high-band signal -- Oh, sorry. First of all we split the signal into a
high-band and into a low-band. We decimate the sample rate from 16 to 8 kilohertz and
then, we have here a short analysis and here a longer analysis. And we inject this
spectrum. In the high-band we do re-synthesis and there is more windowing and overlap
to obtain wideband speech.
This is the wideband signal. We clearly see that there is a lot of energy here especially in
unvoiced sounds. And there is less energy in the voiced regions. So if we do compression
here and it is more or less noise-like, we can expect it's not so critical because we don't
have too many harmonics here. This is for high-band. We take this and we compress it
and place it in the range between 3.4 and 4. As I said we take the high-band from 4 to
6.4 and we compress it by a factor of 4 so we can place it here in this gap which has a
width of 600 hertz.
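A rough DFT-domain sketch of that spectral compression and its inverse, ignoring the windowing, overlap-add and phase handling of Geiser's actual scheme: the 4 to 6.4 kilohertz band is squeezed by a factor of 4 into the free 3.4 to 4 kilohertz slot, and the receiver expands it back.

```python
import numpy as np

FS_WB, N = 16000, 640                      # one 40 ms analysis frame at 16 kHz
BIN = FS_WB / N                            # 25 Hz per DFT bin

def hz(f):                                 # frequency -> DFT bin index
    return int(f / BIN)

def hide_highband(frame_wb):
    """Transmit side: compress 4-6.4 kHz by a factor of 4 into the 3.4-4 kHz gap."""
    X = np.fft.rfft(frame_wb)
    hidden = X[hz(4000):hz(6400):4]        # every 4th bin of the 2.4 kHz high band
    Y = X.copy()
    Y[hz(4000):] = 0                       # the channel only carries the narrow band
    Y[hz(3400):hz(4000)] = hidden          # fill the 600 Hz telephone-band gap
    return np.fft.irfft(Y, n=N)

def recover_highband(frame_rx):
    """Receive side: expand the 3.4-4 kHz slot back into a 4-6.4 kHz approximation."""
    X = np.fft.rfft(frame_rx)
    slot = X[hz(3400):hz(4000)]
    Y = X.copy()
    Y[hz(4000):hz(6400)] = np.repeat(slot, 4)   # undo the factor-4 compression
    Y[hz(3400):hz(4000)] = 0                    # restore the gap
    return np.fft.irfft(Y, n=N)
```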
This is the narrowband signal where we did not apply the telephone bandpass. As you see, in the true speech it goes up to here, but here would be the 3.4 kilohertz limit. And we will replace this, as you see. And it looks different. I'll go back and forth. It looks different, but even if you listen to that, it doesn't sound too disturbing.
[Audio playback begins]
>>: Oak is strong and also gives shade. Cats and dogs each hate the other.
[Audio playback ends]
>> Peter Vary: But the normal terminal would have some telephone bandpass and would not offer these components to the user. But what we see is the insertion of the fricative sounds. This is a comparison of the reconstructed wideband
approximation and the original wideband signal. The original wideband signal does not
have here a gap between 3.4 and 4 kilohertz, so that's the true wideband. The
narrowband signal has a gap because there's a cutoff frequency at 3.4 kilohertz. And we
replace the high-band here, and the gap is not annoying at all. We know that from
different codecs that you will not hear the gap and, therefore, we can expect a good
quality. So let's listen to the extended.
[Audio playback begins]
>>: Oak is strong and also gives shade. Cats and dogs each hate the other.
[Audio playback ends]
>> Peter Vary: And the uncoded.
[Audio playback begins]
>>: And dogs each hate the other.
[Audio playback ends]
>> Peter Vary: Oh, sorry. I'll repeat once more because...
[Audio playback begins]
>>: Oak is strong and also gives shade. Cats and dogs each hate the other. Oak is
strong and also gives shade. Cats and dogs each hate the other.
[Audio playback ends]
>> Peter Vary: It might be that you can recognize slight differences, but usually you
don't have the AB comparison and the quality is pretty good. Now we have different
options to improve the quality in the networks by coding, better source coding,
compression, hiding information, whatever. But then, we have a transmission channel
with errors. The question is, can we keep the quality even if we have bit errors. So we
have to study the bit error sensitivities and maybe we have to use all the tricks which are
known from channel coding. And one interesting solution is turbo coding because you
can just apply it at the receive side. And a very special, interesting solution is what we call
Turbo Error Concealment. That is the idea of turbo coding applied to a channel decoder
and a source decoder. Usually a turbo decoder has two component decoders; one helps
the other in an iterative procedure. And this concept can be applied to a channel
decoder and a source decoder, or let me say a parameter decoder.
I will explain that. This is the transmission model we have. At the transmit side we extract some parameters, let's say with any codec. We have channel coding. We have modulation and transmission. And at the receive end, we have a soft-input, soft-output channel decoder. No real turbo decoder so far, but just soft-in/soft-out; it's a component of a turbo decoder because it can accept extrinsic information. Extrinsic information means, "What do the other bits know about me as a bit?"
The processing is as follows: the soft-output channel decoder produces bits and it produces reliabilities. Reliabilities are the probability that the bit here is right or wrong; it's usually described by the log-likelihood values. And then, we calculate here our so-called a-posteriori probabilities, not on bit level but on parameter level, because what we extracted here is parameters: LPC coefficients, vector addresses, gain vectors. And these still have some redundancy. If you study the first LPC coefficient in a sequence, you will see there is clearly correlation. If you study the prediction signal of a predictive coder, even if you have perfect prediction, you would say, "Well, the output is white noise," but even in white noise we still have redundancy because the white noise, even if it has no correlation, has a certain distribution -- typically, a Gaussian distribution.
And that is still redundancy, redundancy on the level of the samples or on the level of the
parameters. And this is extracted here in this block. This simple equation describes at
the end what happens. We calculate a-posteriori probabilities. We give something back
on bit level. I will explain that on the next slide. So there is some iteration between the
soft-channel decoder and the first block of the soft-decision source decoder. And after
stopping the iterations, we know the best a-posteriori probabilities. So that means we
have, let's say, a parameter V. And the parameter V might be transported by 3 bits. So if it is 3 bits, we have 8 possibilities. At the transmitter there is a quantization table with 8 entries. So these are the entries in the codebook and these are the 8 a-posteriori probabilities which you have found after iterating here.
So we have received 3 bits and we are interested in the probability that, let's say, a group of 3 bits is fixed [inaudible] the probability for the first, the second, up to entry number 8. And then we do conditional estimation in terms of minimizing the mean squared estimation error, and it turns out we just have to multiply the codebook entries with their a-posteriori probabilities. If the transmission is perfect, if there is no bit error any more
because the channel decoder has repaired all the bit errors then only one of these
probabilities will be one and all the others will be zero. So the summation does just the
table lookup and is exactly comparable with the standard. If the transmission is
completely disturbed, let's say only random bits, then it will turn out that a-posteriori
probabilities have all the same value, 1 over Q or 1 over 8. And what we get is the
average of the codebook entries. And if that was a parameter with a sign and a
symmetrical distribution, the average would be zero. So we have graceful degradation.
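In a small numerical sketch with an invented 3-bit quantization table, the soft-decision estimate is simply the a-posteriori-weighted sum of the codebook entries; it reduces to a plain table lookup for a perfect channel and to the codebook mean, hence graceful degradation, for a useless one.

```python
import numpy as np

codebook = np.array([-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5])  # invented 3-bit table

def mmse_estimate(a_posteriori):
    """MMSE parameter estimate: weight each codebook entry by its a-posteriori probability."""
    a_posteriori = np.asarray(a_posteriori, dtype=float)
    return float(np.dot(a_posteriori, codebook))

# Perfect channel: one probability is 1 -> plain table lookup (here entry number 6).
print(mmse_estimate([0, 0, 0, 0, 0, 1, 0, 0]))         # 1.5
# Totally disturbed channel: all probabilities equal 1/8 -> degradation to the mean.
print(mmse_estimate(np.full(8, 1 / 8)))                # 0.0 for this symmetric table
```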
So that improves the quality significantly by exploiting residual redundancy on the parameter level, that is, the distribution and the correlation of parameters. And from that we can feed back extrinsic information on bit level to the channel decoder. Let's assume
we have a situation with 3 bits and the channel decoder finds out that bits number two and three are zero and one, or should be zero and one, in one iteration, and the channel decoder asks the source decoder, "What do you think about bit number one?" The source decoder says, "Well, these three bits belong to a parameter which has a certain distribution which is known. And we know the bit assignment and the quantization table, so if bits two and three are zero and one, we have two possibilities: obviously, your first bit is one or zero. But this possibility is more probable than that one." And then, we can take the quotient and the logarithm. And this is a language the channel decoder understands; that's a log-likelihood value which is fed back to the channel decoder. And then, this way we do iterations across the channel and the source decoder.
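A sketch of that feedback step with an invented prior for a 3-bit parameter: given the channel decoder's current beliefs about bits two and three, marginalizing over the quantizer indices yields an extrinsic log-likelihood ratio for bit one.

```python
import numpy as np

# Invented prior of a 3-bit parameter index (from the known parameter distribution).
prior = np.array([0.30, 0.22, 0.16, 0.10, 0.08, 0.06, 0.05, 0.03])

def extrinsic_llr_bit(bit_pos, other_bit_probs_one):
    """Extrinsic LLR for one bit: log P(bit=0)/P(bit=1), marginalizing the other bits
    with the channel decoder's current beliefs (probability that each other bit is 1)."""
    num = den = 0.0
    for index in range(8):
        bits = [(index >> (2 - i)) & 1 for i in range(3)]   # MSB-first bit pattern
        p = prior[index]
        for i, b in enumerate(bits):
            if i == bit_pos:
                continue                                     # exclude the bit itself
            p_one = other_bit_probs_one[i]
            p *= p_one if b == 1 else (1.0 - p_one)
        if bits[bit_pos] == 0:
            num += p
        else:
            den += p
    return float(np.log(num / den))

# The channel decoder is fairly sure bit two is 0 and bit three is 1; what about bit one?
print(extrinsic_llr_bit(0, other_bit_probs_one=[0.5, 0.1, 0.9]))
```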
And the improvements are quite impressive. This is table look-up decoding; this is source decoding, if we plot over the channel quality, as a quality measure, the parameter SNR, not the speech or audio SNR. We take the parameter SNR of a gain vector or the parameter SNR of an LPC coefficient. And if we do the iterations, we come very close to the theoretical limits, using the turbo-like process applied to channel and source decoding.
One example, let's take speech. One iteration means actually no iteration; we just have one pass through the decoder.
[Audio playback begins]
>>: To further his prestige he occasionally reads the Wall Street Journal.
[Audio playback ends]
>> Peter Vary: Two iterations.
[Audio playback begins]
>>: To further his prestige he occasionally reads the Wall Street Journal. To further his
prestige he occasionally reads the Wall Street Journal. To further his prestige he
occasionally reads the Wall Street Journal.
[Audio playback ends]
>> Peter Vary: Let's listen to music.
[Music playback]
>> Peter Vary: Yeah so we, let me say, have solved the problem. We have good
codecs. We have good ideas to hide information, and we have very powerful channel
decoders. The question is what might be the next step?
Well, this is just one statement. It could be binaural telephony and then, we would need
stereo headsets, stereo coding and transmission. And you might ask what is it good for?
One situation you could imagine is you would not like or you cannot travel to a meeting.
And you would like to participate in the meeting such that instead of you there is a
dummy-head or at least two microphones. And we have two-channel transmission, and
you are in Environment B with a mobile and you can listen and get spatial information.
And for example, Environment A would be the meeting room at the airport and
Environment B is at the beach, for example. And we have produced a nice demo. And
you see it makes it really fun to participate in such a binaural audio conference. If you
listen to that here via headphones you get some idea which information you transmit. I
would say it's just the opposite philosophy of what we are doing today.
Today we try to suppress the background, the background noise and the acoustic atmosphere; here, we do the opposite and we transmit it. But the audio bandwidth is more
than just telephone bandwidth. And it makes it really fun if you apply that to groups. In
that case we have a Group A and a Group B and in each group only one has a master
headset with two microphones and all the others have the headphones. The connection
here might be wireless. And then, everyone can talk to everyone. Maybe you attended
our demo at the last [inaudible]. It's very amazing. You look to the left and see, "Oh, he
is not here in my room. He's somewhere else." And this might get great acceptance not
only for the professional application of audio conference but even for groups of young
people.
Now my conclusions are that mobile phones will, I would say, hopefully surpass the
speech quality of fixed-line telephones because all the ingredients are there to improve
the quality. We have the standards and the networks provide sufficient capacity to
transmit that. And it will become easier to introduce new codecs because in the 3G and
4G networks there is no dedicated speech transmission mode because it's just packet
transmission. Then, it will be much easier. But nevertheless, even in the new world of
the 4G networks we still have the old fashioned telephone network and we have
interaction of new and old mobile phones. And then, we have the compatibility problem.
Even if it seems to be that in quite a lot of networks there is HD-voice on the move, I
would say, you have to distinguish between announcement and real public use. But
then, we will see the discrepancy between narrowband and wideband, and in that
situation BandWidth Extension might help to improve the user acceptance. Someone
who is buying a new mobile phone can get some improvement from the very beginning
independent of the situation which is in the connection presently. And it could bridge the
gap between narrowband and HD.
And that could be improved significantly if we hide information, if we do the transmission in a compatible way, let me say, over GSM or [inaudible] network eventually. The latest
proposal is to do frequency compression, and my impression is that could be a shortcut
to HD mobile telephony. But the real step could be the future binaural HD telephony as
close as being there. Yeah, that's the end of my talk, but I would like to cite some ideas, two statements about what HD is. There is one analyst who says the expectation for
quality in the public on cell phones is so low; people say it's good enough. But there is
on the other side the mobile manufacturer, Peter Isberg, he said, it will be "the most
significant quality upgrade in telephone history." And I expect it will be something in
between but hopefully much closer to the most important quality upgrade. Thank you
very much.
[applause]
>> Ivan Tashev: Questions [inaudible]?
>>: I'm curious if you have tried some of those demonstrations with the Artificial
BandWidth Extension in different signal [inaudible] and different noise settings? And I
was curious to hear if you have the demo how it actually sounds [inaudible]?
>> Peter Vary: Yeah, I can tell you it's not as good. We are working on that. And the first question is, what should we do? Should we first suppress the noise, as pre-processing? Or should we include the noise suppression in the parameter estimation? So it's like in speech recognition. And it looks like we should first of all try to get rid of the background noise, and then it works quite well. Then your next question might be, what about
music? Yeah, it's not that good. If you have music there and music in the background
then the pure Artificial BandWidth Extension does not perform that well. But if we can
transmit side information it's like a codec which spends not too many bits for the higher
frequency band, and we know that the ear is not that sensitive there.
>>: Another question is, okay, from the moment you switched from the traditional transmission to the pure digital, besides having errors in the bits of your transmitted data, which is less common in the digital networks, we face packet drops or packets arriving too late. How eventually is this going to be resolved, to keep the quality high even at a certain packet drop rate?
>> Peter Vary: Yeah. The steganographic transmission is not limited to BandWidth
Extension. You could use the capacity of 2 kilobits per second, for example, to transmit
any information. And one interesting solution is to support packet substitution, to have
redundant transmission. One of my students made some investigations in that direction
and that really helps to improve the packet error concealment. But then, we have spent
this information and we will not use it for wideband. But it goes further: from wideband
you can extend to super wideband and we have some capacity which could be used for
packet error concealment.
>> Ivan Tashev: More questions?
>>: I have a related question. With the AMR codec there are significant artifacts
introduced by the codec itself. Presumably the presence of these artifacts has the effect
of concealing other problems like channel errors or a poorly placed microphone or a
poorly designed microphone. And then, as we go to HD-voice those artifacts are going to
be reduced. And the user may then become more aware of other problems like a poorly
placed microphone or channel errors [inaudible]. Do you have some insight as to how
the user's awareness is going to change and how the irritation of other artifacts that used
to be insignificant becomes more detectible?
>> Peter Vary: Yeah, that's a very interesting question. Let me say, if you are in a video-audio conference and one subscriber comes via the old telephone, you clearly see and hear the difference between the audio quality of Subscriber A and the telephone quality of Subscriber B. But you need not have such complicated constellations. In a normal telephone conference, which we have almost every day or quite often, we experience what has happened in the telephone network over the last ten years; the telephone quality is not getting better, it's getting worse. And you know the people who are on the phone personally, and you clearly find out, "Oh, that's probably a mobile, and that's a terrible-sounding mobile phone, and that's a terrible desk telephone." In the office you hear the reverberation. I expect that the users are
really aware of the degradations, and we should do something. And that should be
applied to the whole chain; it starts at the acoustic front end. You address the
microphones and the frequency responses. So we should improve that. We should not
make them cheaper. But I'm realistic; I know we cannot replace all of the devices we
have there. So we should try to improve by intelligent signal processing, let me say,
intelligent equalization at the receiving end.
Or better, pre-processing of a digital signal. I expect there is a lot of room for
improvement even if we have an AMR codec in between. Yeah? But I think this is a very
interesting question for the near-future.
>> Ivan Tashev: More questions?
>>: I have one question. For super wideband you mentioned that that would be an additional improvement. But I think what we found is that if you own a cell phone like that, then the microphone is so far away from the mouth that actually [inaudible] frequencies are almost [inaudible].
>> Peter Vary: Yeah.
>>: Did you do any studies on that?
>> Peter Vary: Yeah, we have to distinguish between at least three different modes of
usage: that is, the normal handset mode, the hands-free mode and then we have the
headset mode. And presumably you will not get all the improvements in each mode. If
you have the headset you clearly will get this improvement. Or if you have an audio
conference situation with big loudspeakers. So that will be at least third or fourth one.
>>: A few years back there were studies about Artificial Wideband Extension and user fatigue. So sometimes when you hear it for the first time it's impressive; something is there that wasn't. But if you keep using it over time, you may change your opinion. You may actually prefer a clean narrowband versus, you know, artificially [inaudible]. I'm not referring to the case where you embed information or the extension is actually intelligently designed. But do you have any comments on that?
>> Peter Vary: Yeah, there is a big test that emerged from the [inaudible] project and [inaudible] workshop last year. And there was a common effort to make a very deep comparison of different approaches to Artificial BandWidth Extension. And the results are very encouraging. If you would like to know more details about that, I could send you the paper. I don't remember where it was published; it has come out in the meantime.
>> Ivan Tashev: Let's thank again our speaker.
[applause]