>> Ivan Tashev: Good morning, everyone. We are kind of past the legal five minutes, so I think we should start. It's a great pleasure for me to introduce our visiting speaker today, Professor Peter Vary. Looking at the audience and knowing who is going eventually to watch us online, it is needless to tell who Professor Vary is. I will just finish the introduction with his current job title. He is the director of the Institute -- Okay let me find it -- of Communication Systems and Data Processing at RWTH Aachen University. Without further ado, Peter, you have the floor. >> Peter Vary: Yeah, thank you very much, Ivan, for inviting me, and hello to all of you. It's now, in 2013, 20 years since we've had the first true digital mobiles in commercial GSM networks. One of the first mobiles was the Motorola 3200. I don't know how the name was chosen, but that was approximately the price in Deutsche Mark, and the weight was 580 grams. And during the first years of GSM I would say the main focus was on reducing the size and increasing the talk time of the mobiles. And in the second ten years we saw that the computation capacity of the mobile was increasing so much that we could do other things besides having a phone call: we had the cameras, we had the calendar and we had the apps. Nowadays, it's not a mobile phone any more; let me say mainly it's a mobile computer, a mobile apps computer, and the price is not really what you pay in Germany. That's a matter of advertisement of the companies. But the weight is about 110 grams, so a reduction in size and in weight and a dramatic increase in terms of complexity and capabilities. And now a question arises: what can we do to improve in the future? And there are different approaches. One is, I would say, the distinguished British approach by the company Vertu, which is connected to Nokia: very precious devices. An alternative solution comes from an Austrian designer-jeweler, the Kings Button iPhone. But that's not what I would like to talk about. As I said, we have now 20 years of mobile phones. We have seen that it's still a phone. You can still use it as a phone, but during the last ten years we have not seen too much progress in terms of speech quality. And that's the topic of my talk: the path towards high-definition telephony. So let's see what high-definition telephony is. I will go through some issues like speech coding as such and HD-voice and model-based bandwidth extension; that might be an important keyword. I will address error correction and have some ideas about a future step at the end. Now we have the Plain Old Telephone System, which is rooted in history. In the old analog days, the capacity of long-haul transmission links required a narrow frequency band per channel, and this is why we have this old limitation to the telephone bandwidth which is 300 hertz to 3.4 kilohertz. And we know how to use the telephone, but actually the intelligibility of syllables is only 91 percent; that is not too much. And quite often, we need a spelling alphabet on the telephone because we don't get the right syllables. HD-voice is at least 7 kilohertz, but we include also the lower frequencies down to 50 hertz. The technical term is wideband, but the marketing people invented the term HD so everyone understands what it is. And here we see the bandwidth which is increased significantly, but there is even more than wideband; there is also super-wideband up to 14 kilohertz. 
So the low frequencies increase naturalness and comfort because the fundamental frequency of speech is mostly not transmitted on the telephone. And if you include that, it sounds more natural and you can distinguish between the daughter and the mother on the telephone; that might sometimes be important. The higher frequencies contribute to clarity and intelligibility, and if we increase even further we can improve brilliance and quality of music. Quite often only a minor part of the speech spectrum is transmitted. This is the telephone band; this is an unvoiced sound. And we see even more energy beyond the telephone band, and if we include that in terms of wideband or HD coding, we will significantly improve the quality and the intelligibility of syllables. So we have these categories: narrowband, wideband, super-wideband -- sometimes it's called HD Plus -- and fullband if that is needed as a reference. Now I would like to address speech coding. More and more we are talking about speech-audio coding, speech-audio signal processing. But for the time being in the mobile phones we have pure speech coding. And the simplified explanation of what the mobile phone is about is given here. There is a model of the speech production mechanism of the vocal tract. We have the vocal folds and we have the vocal tract, and that can be described by a model where we have a generator and a filter, a time-[inaudible] filter. And this is what we see at the receiving end in the mobile: there is a model of speech production that is implemented. And over the air we transmit the parameters which are needed to drive the model, and at the transmit side we extract from the speech the model parameters. So actually it's some kind of naturally-sounding vocoder that we are using there, but it is more and more a mixture of parametric coding and waveform coding. The most important codec we have in the GSM mobiles is the very old Full Rate codec from 1988. It was very simple, and then we had improvement in terms of technology when microelectronics became more powerful. And ten years later we have what is called the Adaptive Multi-Rate codec. And later on we had to add the term NB, narrowband. So it's telephone bandwidth; the sampling rate is 8 kilohertz; it's much more complex and it has different bit rates. It goes down to 4.75 and the maximum bit rate is 12.2; that is the codec which is preferably used in the mobile phones. That's also called the Enhanced Full Rate codec, 12.2 kilobits per second. And it's based on the model of speech production in terms of code-excited linear prediction. We have here two stages of the filter, the vocal tract, the vocal folds, the gain and some excitation generator which is a vector codebook. And what is transmitted over the channel is the vector address. Typically it is 40 samples, that is 5 milliseconds: a codebook vector and a gain factor, filtered to produce 40 samples. So that seems to be very simple. The secret is how to extract at the transmit side the best vectors, the gains, the coefficients such that we can code it with a very low bit rate and still get excellent quality at the receiving end. The motivation for designing the Adaptive Multi-Rate codec was not to improve the quality. The motivation behind that was, usually we have a full rate channel in GSM, that is 22.8 kilobits per second including source coding and error protection. And if we can keep that rate and reduce the bit rate for speech coding, we can increase the number of bits for error protection. And that's what has happened with the AMR codec. 
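(A minimal sketch of the CELP-style decoder model described above: one excitation vector per 5 millisecond subframe, a gain, a long-term pitch contribution and a short-term vocal-tract synthesis filter. The function name, parameter values and buffer handling are illustrative assumptions, not the actual AMR-NB procedures.)

```python
import numpy as np
from scipy.signal import lfilter

SUBFRAME = 40  # 40 samples = 5 ms at 8 kHz

def celp_decode_subframe(code_vector, gain_fixed, past_excitation,
                         pitch_lag, gain_pitch, lpc_coeffs):
    """Reconstruct one 5 ms subframe from the transmitted parameters:
    a fixed-codebook vector (in reality only its address is sent), gains,
    a pitch lag and the short-term LPC (vocal tract) coefficients."""
    # Long-term prediction / adaptive codebook: repeat past excitation at the
    # pitch lag (filter-state carry-over between subframes is omitted here).
    segment = past_excitation[-pitch_lag:]
    adaptive = np.resize(segment, SUBFRAME)
    excitation = gain_pitch * adaptive + gain_fixed * code_vector
    # Short-term synthesis filter 1/A(z) shapes the spectral envelope.
    a = np.concatenate(([1.0], lpc_coeffs))
    speech = lfilter([1.0], a, excitation)
    new_past = np.concatenate((past_excitation, excitation))[-320:]
    return speech, new_past
```

A real decoder also interpolates the LPC coefficients between frames and carries the synthesis-filter state across subframes; the sketch only shows how little information has to cross the air interface per 5 milliseconds.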
If you have the Full Rate codec, the bit rate is always 22.8. If the mobile is outside, we use the Enhanced Full Rate codec, highest bit rate, highest quality. If the channel gets worse, maybe the mobile is inside a building, then the channel gets bad and we increase dramatically the error protection and still have reasonable quality, but the quality will never be better than if we use the highest bit rate. So the motivation and the objective of the AMR Narrowband codec is to save money from the point of view of the operators, because if you would like to improve the in-house coverage, you could install more base stations outside the buildings or even inside the buildings. So it was a reasonable and a good solution at that time to improve the service quality in the mobile networks by introducing the Adaptive Multi-Rate codec. And it can switch very quickly between the different bit rates, every 40 milliseconds. There is a half-rate channel where the bit rate is just half of that. I will not go more into the details. Now we have seen high-definition is at least 7 kilohertz. And there are some examples like Skype, and I guess this is the first time users became aware that a telephone could be more than just telephone quality. There is Voice over IP, the Opus codec. I should add here the Windows Live messaging system. And we see there is already standardization; very, very long ago, in 1985, we designed the first wideband coding, the ITU standard for ISDN -- it was designed for ISDN networks to increase quality. And now the patents are not valid anymore, and we see the first products. And one of the products is called DECT Plus; that's the digital cordless telephone and it is used in combination with a SIP protocol over the Internet. But it's difficult to use; the normal customers, I'm sure, will not use that because you have to combine the telephone with a router and you have to install software and you have to know someone who has it. So it's far from being the standard in the fixed networks so far. Then we have in the 3GPP domain what is called the AMR-Wideband, the Adaptive Multi-Rate Wideband standard. That's the next standard. Interestingly, it was designed in 2001, more than 12 years ago. At that time we had no third generation and no fourth generation; it was designed for GSM. It could be used in GSM. It was never used in GSM. And there are some activities to introduce it in 3G and 4G mobile networks, but for quite a while there have been some mobiles on the market which have the AMR-Wideband. And it is used, but only for ringtones, not for communication; but that might change. Now let's try to get some idea about the quality. This is narrowband quality, the standard quality in GSM. [Audio playback begins] >>: Slow-cooked hippopotamus in red wine: a recipe for special occasions, provided that the hippopotamus feels comfortable in red wine. Slow-cooked hippopotamus in red wine: a recipe for special occasions, provided that the hippopotamus feels comfortable in red wine. [Audio playback ends] >> Peter Vary: Now the question is, while this is a significant improvement, can we do more? Does it make any sense to improve in terms of bandwidth for speech? So let's listen to 7 kilohertz wideband. [Audio playback begins] >>: Slow-cooked hippopotamus in red wine: a recipe for special occasions. Slow-cooked hippopotamus in red wine: a recipe for special occasions, provided that the hippopotamus feels comfortable in red wine. [Audio playback ends] >> Peter Vary: And clearly the naturalness and the transparency increase even more. 
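(A small illustration of the rate trade-off described above, assuming the fixed 22.8 kbit/s gross rate of a GSM full-rate traffic channel and the standardized AMR-NB source rates; signalling overhead is ignored.)

```python
# GSM full-rate traffic channel: the gross rate is fixed, so every bit saved
# by choosing a lower AMR speech mode can be spent on stronger error protection.
GROSS_RATE_KBPS = 22.8
AMR_NB_MODES_KBPS = [12.2, 10.2, 7.95, 7.40, 6.70, 5.90, 5.15, 4.75]

for speech_rate in AMR_NB_MODES_KBPS:
    protection = GROSS_RATE_KBPS - speech_rate
    print(f"speech {speech_rate:5.2f} kbit/s -> error protection {protection:5.2f} kbit/s")
```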
But we have not only speech, we also have music. Let's listen to GSM music. [Music playback] >> Peter Vary: So we could, even in the mobile networks, transmit audio quality. There are some codecs to do that. And the bit rates are such that we can transport them in the 3G networks. In the 4G networks, LTE, there is only packet transmission, so it would be no problem in principle to introduce that. The objective of the AMR-Wideband codec is, for the first time, to increase quality and intelligibility. And interestingly the bit rates go down, according to the AMR principle, to 6.6 kilobits per second; then the channel can be filled up by error protection. The quality is not as good at 6.6 as at 24 kilobits per second. If you try to find out where in the world HD-voice is, you'll find that at the Global mobile Suppliers Association, and they keep track of the announcements of the operators. And the latest count is that there is HD-voice launched or announced in 61 mobile networks, mostly 3G, in 35 countries in the world. If we look at Germany, it's not that clear. We have pretty good coverage for 3G and 4G. For the main operators, it looks very similar. And all of them are saying, "We have HD-voice," but if you go to the shop no one can tell you how to use that because it's only partly installed in the network. And it's not everywhere because we have different suppliers for the infrastructure, and it's obviously a strong effort to implement HD-voice in the network. In some countries we have it live, like in France. In the UK there are several 3G networks which offer HD-voice. And meanwhile, we have plenty of mobile devices which can do that. And this of course is the next interesting problem for the speech processing community: if we install HD-voice in the network, we have to modify quite a lot. To give you an example, a Germany-wide coverage of GSM requires about 19,000 base stations. And if you install such a new codec, you need a new speech codec and you need a new channel codec. In the mobiles, that's not an issue because you buy a new mobile. The base station is there and you just need a software update at the base station. But it's like with the PCs: if you've ever tried to update your PC which is ten years old and install the latest software, there might be some problems. So it's not trivial to introduce HD-voice in the network because you need a lot of modifications and protocols. The coding part is installed in the mobile switching centers; you don't have so many of those. That's not [inaudible], but the base stations are the point. And if you have done that, you have two separate worlds. The customer buys a new mobile and he knows, hopefully, a few friends who also have the HD service, but there is no cross-connection in terms of quality between the narrowband and the wideband terminal. And it will take a very, very long time until, or if ever, the whole telephone network has been converted to HD-voice. And then, we can study all kinds of interconnections between HD and narrowband mobiles over HD and narrowband networks. For example, take such a guy who has the new HD mobile phone: if he is lucky, we have the complete chain in HD. But if the user at the other end is narrowband, he gets only narrowband quality. Even if the user here is connected by a 3G network -- that could be 3G or LTE, that could be GSM -- then he can't take advantage of the HD capabilities because the other side transmits only narrowband noise and narrowband speech. 
It might be that the HD terminal is connected to a narrowband network; then they also have only narrowband quality. And so we have different solutions. Now I would like to study these cases. Case A is a happy case where we use the AMR-Wideband codec or the AMR-Wideband Plus codec. In this situation we try to improve. We could try to improve by -- the keyword is now bandwidth extension. We transmit just the narrowband signal, and in the mobile we try to exploit the model of speech production and some characteristics of the human ear to artificially increase the quality and eventually the intelligibility. From an information theoretical point of view, that's very questionable, but we have here a specific source and we have a model, and we can do something. I will demonstrate that. If the narrowband mobile is connected via the 3G network, we could also implement this Artificial BandWidth Extension in the network; therefore, I'm saying there are two cases, B1 and B2. It doesn't make a big difference. The question is, where is the processing power? If we have only the narrowband network but two HD terminals, it seems that we can do nothing, but we can do something if we succeed in transmitting, in a compatible way, side information for Artificial BandWidth Extension. While we could use, as in the first case B1, Artificial BandWidth Extension at the receiving end, the quality improvement is limited; as soon as we can transmit side information which drives the BandWidth Extension, the quality becomes much better. Some of the codecs have the element of Artificial Wideband Extension as part of the standard already. So the idea is that the narrowband network, let's say, transports only PCM or an AMR narrowband bitstream and we embed some information. We will see how to do that. And we try to do that in a compatible way such that a narrowband terminal gets narrowband quality and will not be disturbed by the embedded information, which can be used by the HD terminal to expand the bandwidth. The first case, just receiver-based -- that was Scenario B1, or B2 where the BandWidth Extension is in the network -- tries to estimate some characteristics of the speech and to expand the speech artificially. It is a mixture of speech recognition, estimation and speech synthesis. And in listening tests, I can tell you, the listeners clearly prefer Artificial BandWidth Extension, and on certain occasions, if we have background noise, Artificial BandWidth Extension will even increase intelligibility. One approach my former PhD student Peter Jax had was purely based on linear predictive coding. What we are doing here: we receive narrowband speech, let's say, that's already decoded -- it's PCM -- and the first step is to interpolate from 8 to 16 kilohertz. Then, we apply a pattern recognition approach where we extract some features. We have a statistical model. The goal is to estimate the spectral envelope of the wideband speech. So at the end the wideband speech will be reconstructed by applying a wideband synthesis filter. And the goal is to estimate, from parameters which are extracted from the narrowband signal, the envelope of the wideband signal. And it uses a codebook. The codebook could be, for example, just the LPC or the LSF quantization codebook. For the LPC information it uses Bayes estimation, and what we get is the LPC or reflection coefficients or cepstral coefficients of the wideband filter. And we apply the analysis filter to the over-sampled narrowband speech, and what we get is the over-sampled residual signal. 
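(A rough sketch of the receiver-only bandwidth extension chain described above: interpolation to 16 kHz, estimation of the wideband envelope from a codebook, analysis filtering to obtain the residual, and the excitation extension by spectral folding that is explained in more detail in what follows. The envelope estimator is reduced to a stub that assumes a hypothetical pre-trained codebook and already-computed posterior probabilities; this is not the published Jax algorithm in detail.)

```python
import numpy as np
from scipy.signal import resample_poly, lfilter

def estimate_wideband_lpc(posterior, envelope_codebook):
    """MMSE-style envelope estimate: weight the codebook entries (wideband LPC
    vectors) by posterior probabilities. In a real system the posteriors are
    derived from narrowband features via a trained statistical model; here they
    are assumed to be given."""
    return posterior @ envelope_codebook            # (K,) @ (K, p) -> (p,)

def extend_excitation(residual_16k):
    """Extend the band-limited residual by spectral folding: modulation with
    (-1)^n mirrors content at f into the upper band (it appears at 8 kHz - f
    for a 16 kHz sampling rate)."""
    folded = residual_16k * (-1.0) ** np.arange(len(residual_16k))
    return residual_16k + folded

def bandwidth_extend_frame(nb_frame_8k, posterior, envelope_codebook):
    x16 = resample_poly(nb_frame_8k, 2, 1)          # interpolate 8 -> 16 kHz
    a_wb = estimate_wideband_lpc(posterior, envelope_codebook)
    A = np.concatenate(([1.0], a_wb))
    residual = lfilter(A, [1.0], x16)               # wideband analysis filter
    excitation = extend_excitation(residual)        # flat(ter) wideband excitation
    return lfilter([1.0], A, excitation)            # wideband synthesis filter
```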
And if the signal was limited to 3.4 kilohertz, the residual will also be limited to 3.4 kilohertz. If the prediction goes quite well and, let's say, the sound was unvoiced, we have a flat noise spectrum up to 3.4 kilohertz, but we need it up to 7 kilohertz. So it's not too difficult to imagine how to expand the narrowband flat noise to a wider flat noise just by spectral replication or spectral translation. If the speech is voiced then we have here a harmonic narrowband excitation with more or less constant spectral components. It has a harmonic structure and it's not too difficult to imagine how to expand that to a wideband excitation by spectral repetition, folding or modulation. I can tell you, the extension of the excitation signal is very easy and it's not critical at all. The art of BandWidth Extension is to estimate the envelope; that is the hard part. And then we can apply this wideband excitation to the wideband synthesis filter and get, hopefully, decent quality. So this is a little bit about the theory. You get some idea that it is pattern recognition and conditional estimation. I will not go through the details. And I would like to play one example. What we see here is a spectrogram, and it switches between narrowband and the artificially extended signal. [Audio playback begins] >>: Well, three or four months run along and it was well into the winter now. I had been to school most all the time and could spell and read and write just a little. I could say the multiplication table up to six times seven is thirty-five, and I don't reckon I could ever get any further than that if I was to live forever. I don't take no stock in mathematics anyway. [Audio playback ends] >> Peter Vary: So the quality is significantly better. And if there is some background noise, tests indicated that even the intelligibility is increased. The second approach for Artificial Wideband Extension is to embed information, to hide information in a compatible way in the bitstream. That works with Scenario C, and we will see here two different versions of how to implement that. We add some side information, so we transmit a narrowband bitstream or narrowband speech samples over a narrowband telephone network. The normal narrowband terminal should not be confused by this hidden information. And the bit format, the stream, the frames and all that should be the same as before or, let me say, at least according to the specified format. And the HD terminal could extract the hidden information to do much better BandWidth Extension with the transmitted side information. No increase of the bit rate. No modification of the bit stream format. And I would like to discuss this with an alternative version of how to expand the bandwidth. The first one was to use purely LPC-based techniques, LPC analysis filter expansion and synthesis. The second one is to split the signal, the wideband signal, into a lower and an upper band. So, with sub-band processing we need not do anything with the low-pass band, but we try to re-synthesize the upper band, the high band from 4 to 7 or 8 kilohertz. But the bitstream is according to a narrowband signal. What we have here is the baseband decoder, let's say the AMR narrowband codec at 12.2 kilobits per second. We get the narrowband signal. It's sampled at 8 kilohertz. We have a two-channel filter bank which does the interpolation to obtain the enhanced signal sampled at 16 kilohertz. And here is the extension band synthesis, and the first approach is to hide data by watermarking techniques or steganography in the bitstream. 
For example, 2 kilobits per second, that's quite a lot, and we don't need so much for wideband extension; we just need to extract some parameters to do the BandWidth Extension here. And I would like to explain once more, using the AMR codec, the simplified block diagram here: the vector codebook, the gain factors, and this filter which has two stages, which we have seen before. But the point is to exploit the properties of the vector codebook or the vector search. Without going too far into the details, I can say that at the transmit side, finding the best vector is done by a procedure which is Code Excited Linear Prediction or analysis-by-synthesis, or you could also say trial and error, because you try out which is the best excitation. And there are many, many different approaches for how to design a codebook, and there are a few approaches for how to design a search procedure in a codebook which is, let me say, specified by the [inaudible] kind of search. It is not a codebook which is stored, but it is constructed iteratively by some search procedure. And this is the optimization criterion: H is a matrix where we have the impulse response of the synthesis filter. The vectors have length 40, that is 5 milliseconds, and this has to be optimized. And in the standards there is, for complexity reasons, a non-exhaustive search, because in any case the codebooks are very, very sparse. The dimension is 40, and typically the number of codebook entries is 2 to the power of 10: roughly 1,000. Here the dimension is only 2, and in the codebook we have obviously only 4 entries, to explain the principle. So it is a sparse codebook. And the first idea to use watermarking techniques could be to split the codebook into two different codebooks, such that we, let's say, use the green codebook or the red codebook. We just separate it into two codebooks, and the hidden information is 1 bit. If we use the green one, the bit is zero. And if we use the red one, the bit is 1. And the receiver can find out which entries have been used, and so it knows the information which has been transmitted. The disadvantage is that the quantization error increases because we have reduced the size of the codebook. But as the codebook is that sparse, we can invent a second codebook of the same size. And then, we have no loss in terms of quantization error. And the receiver knows, depending on whether the red or the green book is used, which bit is transmitted, which polarity or sign of the bit is transmitted. And this can be implemented in a modified, non-exhaustive search, because if we study the AMR codec, we see that there are at least one thousand possibilities to specify a codebook of size 1,000, or let me say 2 to the power of 10, which means we can hide 10 bits in the codebook address of a vector of length 40. And as 40 samples is 5 milliseconds, that is 2 kilobits per second. So this information is used to drive the BandWidth Extension at the receiver, and here is the sub-band approach. The high-band signal is here [inaudible] by controlling the time envelope, by controlling the frequency envelope, by controlling the excitation generator which produces noise or some periodic signal. And the hidden bitstream is extracted in the decoder in the part where the codebook address is decoded, to find out whether it is red or green. We have 1,024 different colors, and the color says which are the 10 bits which have just been transmitted. I explained that already. And let's study some examples. 
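(A toy illustration of the codebook-splitting idea just described: the fixed-codebook search is restricted to the partition selected by the hidden bits, and the receiver reads the bits back from the partition of the received index. The random codebook, the partition rule and the exhaustive search are assumptions for illustration; the real AMR algebraic codebook and its non-exhaustive search are far more structured.)

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_ENTRIES, HIDDEN_BITS = 40, 1024, 2        # toy sizes; the talk hides ~10 bits
CODEBOOK = rng.standard_normal((N_ENTRIES, DIM))
N_PART = 2 ** HIDDEN_BITS                        # number of sub-codebooks ("colors")

def encode_with_hidden_bits(target, bits):
    """Pick the best codebook entry, but only among entries whose index falls
    in the partition selected by the hidden bits (here: index modulo N_PART)."""
    msg = int("".join(map(str, bits)), 2)
    candidates = [i for i in range(N_ENTRIES) if i % N_PART == msg]
    errors = [np.sum((target - CODEBOOK[i]) ** 2) for i in candidates]
    return candidates[int(np.argmin(errors))]    # transmitted codebook address

def decode_hidden_bits(index):
    """The receiver recovers the hidden bits from the partition of the index."""
    msg = index % N_PART
    return [int(b) for b in format(msg, f"0{HIDDEN_BITS}b")]

idx = encode_with_hidden_bits(rng.standard_normal(DIM), [1, 0])
assert decode_hidden_bits(idx) == [1, 0]
```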
This is narrowband speech and this is extended speech with side information of 1.65 kilobits per second. We have taken this because, as indicated somewhere, it became part of the standardized extension of the G.729.1 codec that Bernd Geiser designed. That solution, including channel coding, takes 2 kilobits per second; the bandwidth is extended from narrowband to wideband, and he used that module to expand the AMR codec by hiding the side information in the bitstream in a compatible way. Now let's listen to one example. The first test is to what extent a conventional narrowband mobile is confused or disturbed: do the listeners recognize that there is something which has been modified in the background? [Audio playback begins] >>: He suffered terribly from a speech defect. He suffered terribly from a speech defect. >>: Faulty installation can be blamed for this. Faulty installation can be blamed for this. [Audio playback ends] >> Peter Vary: And actually you don't recognize the modifications. Only the very experienced listener sometimes can distinguish that there might be something different. So it would be allowed to do that. And then, we can embed the 2 kilobits in the bitstream of the AMR codec and we can compare the narrowband receiving terminal and the wideband receiving terminal in terms of quality. [Audio playback begins] >>: To administer medicine to animals is frequently a very difficult matter and yet, sometimes it's necessary to do so. To administer medicine to animals is frequently a very difficult matter and yet, sometimes it's necessary to do so. [Audio playback ends] >> Peter Vary: And that sounds much better, or significantly better, than the artificially expanded signal where we don't have side information at all. But the drawback of that solution is that we have to transmit the bitstream as it is. As soon as we have transcoding in between, let's say a call from a mobile to a different network and there is PCM, for example, in between, the hidden information gets lost. But if the transmission is tandem-free, which is one of the options in GSM, then it would work. And the nice idea would be that the operator just has to sell two mobile phones to two people without modifying the network at all. If they are in the same network, they would have wideband quality from the very first usage of the telephone. But there is a second solution, and the second solution is to avoid this transcoding problem and to hide the signal in the samples, not in the bitstream. And the idea is just the following: here we have the wideband spectrum. And we take the spectrum from 4 kilohertz to 6.4 for a certain reason, and we compress this and place it in the gap between 3.4 and 4 kilohertz. If we have a narrowband signal and if we have clean telephone bandpasses, the stop-band frequency of the bandpass is 3.4 kilohertz, and between 3.4 and 4 we can transmit something else. So it is in-band signaling, an in-band compressed transmission of these frequencies. If the signal is periodic, it will be compressed, and you could also say it's pitch scaling or spectral compression, and we do that in the frequency domain. The detailed solution will be explained next week by Bernd Geiser at the ICASSP. The processing is based in the DFT domain. We have a first analysis with a large window and a second analysis of the high-band signal -- oh, sorry. First of all we split the signal into a high-band and into a low-band. We decimate the sample rate from 16 to 8 kilohertz and then, we have here a short analysis and here a longer analysis. 
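(A toy DFT-domain sketch of the spectral-compression idea: squeeze the 4 to 6.4 kHz band by roughly a factor of 4 and park it in the 3.4 to 4 kHz gap of the narrowband signal. Frame-wise analysis/synthesis windowing, overlap-add and phase handling, all essential in the actual scheme, are ignored here, so this only illustrates the bookkeeping of the bands; the receiver would invert the mapping.)

```python
import numpy as np

FS = 16000  # wideband sampling rate in Hz

def hide_highband_in_gap(frame_16k):
    n = len(frame_16k)
    spec = np.fft.rfft(frame_16k)
    freqs = np.fft.rfftfreq(n, 1.0 / FS)
    hb_bins = np.where((freqs >= 4000) & (freqs < 6400))[0]   # 2.4 kHz high band
    gap_bins = np.where((freqs >= 3400) & (freqs < 4000))[0]  # 600 Hz gap
    # Compress by ~4: keep every 4th high-band bin and drop it into the gap.
    compressed = spec[hb_bins[::4]][:len(gap_bins)]
    out = spec.copy()
    out[gap_bins[:len(compressed)]] = compressed
    out[freqs >= 4000] = 0.0            # output stays narrowband-compatible
    return np.fft.irfft(out, n)
```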
And we inject this spectrum. In the high-band we do re-synthesis and there is more windowing and overlap to obtain wideband speech. This is the wideband signal. We clearly see that there is a lot of energy here, especially in unvoiced sounds. And there is less energy in the voiced regions. So if we do compression here, and it is more or less noise-like, we can expect it's not so critical because we don't have too many harmonics here. This is the high-band. We take this and we compress it and place it in the range between 3.4 and 4. As I said, we take the high-band from 4 to 6.4 and we compress it by a factor of 4 so we can place it here in this gap which has a width of 600 hertz. This is the narrowband signal where we did not apply the telephone bandpass. As you see, in the true speech it goes up to here, but here would be the 3.4 kilohertz limit. And we will replace this, as you see. And it looks different. I'll go back and forth. It looks different, but even if you listen to that, it doesn't sound too disturbing. [Audio playback begins] >>: Oak is strong and also gives shade. Cats and dogs each hate the other. [Audio playback ends] >> Peter Vary: But the normal terminal would have some telephone bandpass and would not offer these components to the user. But what we see is the insertion of the fricative sounds. This is a comparison of the reconstructed wideband approximation and the original wideband signal. The original wideband signal does not have a gap here between 3.4 and 4 kilohertz, so that's the true wideband. The narrowband signal has a gap because there's a cutoff frequency at 3.4 kilohertz. And we replace the high-band here, and the gap is not annoying at all. We know from different codecs that you will not hear the gap and, therefore, we can expect a good quality. So let's listen to the extended. [Audio playback begins] >>: Oak is strong and also gives shade. Cats and dogs each hate the other. [Audio playback ends] >> Peter Vary: And the uncoded. [Audio playback begins] >>: And dogs each hate the other. [Audio playback ends] >> Peter Vary: Oh, sorry. I'll repeat once more because... [Audio playback begins] >>: Oak is strong and also gives shade. Cats and dogs each hate the other. Oak is strong and also gives shade. Cats and dogs each hate the other. [Audio playback ends] >> Peter Vary: It might be that you can recognize slight differences, but usually you don't have the A/B comparison and the quality is pretty good. Now we have different options to improve the quality in the networks by coding: better source coding, compression, hiding information, whatever. But then, we have a transmission channel with errors. The question is, can we keep the quality even if we have bit errors? So we have to study the bit error sensitivities, and maybe we have to use all the tricks which are known from channel coding. And one interesting solution is turbo coding because you can just apply it at the receive side. And a very special, interesting solution is what we call Turbo Error Concealment. That is the idea of turbo coding applied to a channel decoder and a source decoder. Usually a turbo decoder has two component decoders; one helps the other in an iterative procedure. And this concept can be applied to a channel decoder and a source decoder, or let me say a parameter decoder. I will explain that. This is the transmission model we have. At the transmit side we extract some parameters, from, let's say, any codec. We have channel coding. We have modulation and transmission. 
And at the receive end, we have a soft-input soft-output channel decoder. No real turbo decoder so far, but just soft-in/soft-out; it's a component of a turbo decoder because it can accept extrinsic information. Extrinsic information means, "What do the other bits know about me as a bit?" The processing is as follows: the soft-output channel decoder produces bits and it produces reliabilities. Reliabilities are the probabilities that the bits here are right or wrong; they are usually described by log-likelihood values. And then, we calculate here our so-called a-posteriori probabilities, not on the bit level but on the parameter level, because what we extracted here is parameters: LPC coefficients, vector addresses, gain vectors. And these still have some redundancy. If you study the first LPC coefficient in a sequence, you will see there is clearly correlation. If you study the prediction signal of a predictive coder, even if you have perfect prediction, you would say, "Well, the output is white noise." But even white noise still has redundancy because the white noise, even if it has no correlation, has a certain distribution -- let's say a Gaussian distribution. And that is still redundancy, redundancy on the level of the samples or on the level of the parameters. And this is extracted here in this block. This simple equation describes at the end what happens. We calculate a-posteriori probabilities. We give something back on the bit level. I will explain that on the next slide. So there is some iteration between the soft channel decoder and the first block of the soft-decision source decoder. And after stopping the iterations, we know the best a-posteriori probabilities. So that means we have, let's say, a parameter V. And the parameter V might be transported by 3 bits. So if it is 3 bits, we have 8 possibilities. At the transmitter there is a quantization table, let's say a quantizer with 8 entries. So these are the entries in the codebook and these are the 8 a-posteriori probabilities which we have found after iterating here. So we have received 3 bits and we are interested in the probability that X, let's say a group of 3 bits, is fixed [inaudible] the probability for the first, the second, up to entry number 8. And then we do conditional estimation in terms of minimizing the mean squared estimation error, and it turns out we just have to multiply the codebook entries with their a-posteriori probabilities and sum up. If the transmission is perfect, if there is no bit error any more because the channel decoder has repaired all the bit errors, then only one of these probabilities will be one and all the others will be zero. So the summation does just the table lookup and is exactly comparable with the standard. If the transmission is completely disturbed, let's say only random bits, then it will turn out that the a-posteriori probabilities all have the same value, 1 over Q or 1 over 8. And what we get is the average of the codebook entries. And if that was a parameter with a sign and a symmetrical distribution, the average would be zero. So we have graceful degradation. So that improves significantly the quality by exploiting residual redundancy on the parameter level, that is, the distribution and the correlation of parameters. And from that we can feed back extrinsic information on the bit level to the channel decoder. 
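(A small numerical sketch of the soft-decision source decoding just described, for one 3-bit parameter: the MMSE estimate is the sum of the quantizer levels weighted by their a-posteriori probabilities, and a bit-level log-likelihood ratio can be returned to the channel decoder. The quantizer levels and probabilities are made-up values for illustration, and the feedback term is simplified relative to a true extrinsic value.)

```python
import numpy as np

codebook = np.array([-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5])  # Q = 8 levels

def mmse_parameter_estimate(a_posteriori):
    """v_hat = sum_i v_i * P(v_i | received bits)."""
    return float(np.dot(a_posteriori, codebook))

# Error-free channel: one probability is 1 -> plain table look-up.
print(mmse_parameter_estimate(np.eye(8)[5]))        # -> 1.5
# Totally unreliable channel: uniform probabilities -> mean of the codebook
# (zero for a symmetric quantizer), i.e. graceful degradation.
print(mmse_parameter_estimate(np.full(8, 1 / 8)))   # -> 0.0

def llr_bit0(a_posteriori):
    """Simplified log-likelihood ratio for the first bit of the 3-bit index
    (natural binary mapping: indices 0-3 have bit0 = 0, indices 4-7 have
    bit0 = 1); a true extrinsic value would exclude that bit's own channel
    information before feeding it back to the channel decoder."""
    p0 = a_posteriori[:4].sum()
    p1 = a_posteriori[4:].sum()
    return float(np.log(p0 / p1))
```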
Let's assume we have a situation with 3 bits and the channel decoder finds out that bits number two and three are zero and one, or should be zero and one, in one iteration, and the channel decoder asks the source decoder, "What do you think about bit number one?" The source decoder says, "Well, these three bits belong to a parameter which has a certain distribution which is known. And we know the bit assignment and the quantization table, so if bits two and three are zero and one, we obviously have two possibilities: that your first bit is one or zero. But this probability is higher than that one." And then, we can take the quotient and the logarithm. And this is a language the channel decoder understands; that's a log-likelihood value which is fed back to the channel decoder. And then, in this way we do iterations between the channel decoder and the source decoder. And the improvements are quite impressive. This is table look-up decoding and this is soft-decision source decoding, if we plot over the channel quality, as a quality measure, the parameter SNR, not the speech or audio SNR. We take the parameter SNR of a gain factor or the parameter SNR of an LPC coefficient. And if we do the iterations, we come very close to the theoretical limits, using the turbo-like process applied to channel and source decoding. One example, let's take speech. One iteration means actually no iteration; we just have one pass through the decoder. [Audio playback begins] >>: To further his prestige he occasionally reads the Wall Street Journal. [Audio playback ends] >> Peter Vary: Two iterations. [Audio playback begins] >>: To further his prestige he occasionally reads the Wall Street Journal. To further his prestige he occasionally reads the Wall Street Journal. To further his prestige he occasionally reads the Wall Street Journal. [Audio playback ends] >> Peter Vary: Let's listen to music. [Music playback] >> Peter Vary: Yeah, so we have, let me say, solved the problem. We have good codecs. We have good ideas to hide information, and we have very powerful channel decoders. The question is, what might be the next step? Well, this is just one statement. It could be binaural telephony, and then we would need stereo headsets, stereo coding and transmission. And you might ask, what is it good for? One situation you could imagine is that you would not like to or cannot travel to a meeting. And you would like to participate in the meeting such that instead of you there is a dummy head or at least two microphones. And we have two-channel transmission, and you are in Environment B with a mobile and you can listen and get spatial information. For example, Environment A would be the meeting room at the airport and Environment B is at the beach. And we have produced a nice demo. And you see it makes it really fun to participate in such a binaural audio conference. If you listen to that here via headphones, you get some idea which information you transmit. I would say it's just the opposite philosophy of what we are doing today. Today we try to suppress the background noise and the acoustic atmosphere; here, we do the opposite and we transmit it. But the audio bandwidth is more than just telephone bandwidth. And it makes it really fun if you apply that to groups. In that case we have a Group A and a Group B, and in each group only one has a master headset with two microphones and all the others have headphones. The connection here might be wireless. And then, everyone can talk to everyone. 
Maybe you attended our demo at the last [inaudible]. It's very amazing. You look to the left and see, "Oh, he is not here in my room. He's somewhere else." And this might get great acceptance, not only for the professional application of audio conferencing but even for groups of young people. Now, my conclusions are that mobile phones will, I would say, hopefully surpass the speech quality of fixed-line telephones because all the ingredients are there to improve the quality. We have the standards, and the networks provide sufficient capacity to transmit that. And it will become easier to introduce new codecs because in the 3G and 4G networks there is no dedicated speech transmission mode; it's just packet transmission. Then, it will be much easier. But nevertheless, even in the new world of the 4G networks we still have the old-fashioned telephone network and we have interaction of new and old mobile phones. And then, we have the compatibility problem. Even if it seems that in quite a lot of networks there is HD-voice on the move, I would say you have to distinguish between announcement and real public use. But then, we will see the discrepancy between narrowband and wideband, and in that situation BandWidth Extension might help to improve the user acceptance. Someone who is buying a new mobile phone can get some improvement from the very beginning, independent of the situation in the present connection. And it could bridge the gap between narrowband and HD. And that could be improved significantly if we hide information, if we do the transmission in a compatible way, let me say, over a GSM or [inaudible] network eventually. The latest proposal is to do frequency compression, and my impression is that could be a shortcut to HD mobile telephony. But the real step could be the future binaural HD telephony: as close as being there. Yeah, that's the end of my talk, but I would like to cite two statements about what HD is. There is one analyst who says the expectation for quality in the public on cell phones is so low; people say it's good enough. But there is on the other side the mobile manufacturer, Peter Isberg, who said it will be "the most significant quality upgrade in telephone history." And I expect it will be something in between, but hopefully much closer to the most significant quality upgrade. Thank you very much. [applause] >> Ivan Tashev: Questions [inaudible]? >>: I'm curious if you have tried some of those demonstrations with the Artificial BandWidth Extension in different signal [inaudible] and different noise settings? And I was curious to hear, if you have the demo, how it actually sounds [inaudible]? >> Peter Vary: Yeah, I can tell you it's not as good. We are working on that. And the first idea is, what should we do? Should we first suppress the noise, pre-processing? Or should we include the noise suppression in the parameter estimation? So it's like in speech recognition. And it looks like we should first of all try to get rid of the background noise, and then it works quite well. Then your next question might be, what about music? Yeah, it's not that good. If you have music there, music in the background, then the pure Artificial BandWidth Extension does not perform that well. But if we can transmit side information, it's like a codec which spends not too many bits for the higher frequency band, and we know that the ear is not that sensitive there. 
>>: Another question is, okay, from the moment you switch from the traditional transport to the pure digital: besides having errors in the bits of your transmitted data, which is less common in the digital networks, we then face packet drops or packets arriving too late. How eventually is this going to be resolved, to keep the quality high even at a certain packet drop rate? >> Peter Vary: Yeah. The steganographic transmission is not limited to BandWidth Extension. You could use the capacity of 2 kilobits per second, for example, to transmit any information. And one interesting solution is to support packet substitution, to have redundant transmission. One of my students made some investigations in that direction, and that really helps to improve the packet error concealment. But then, we have spent this capacity and we will not use it for wideband. But it goes further: from wideband you can extend to super-wideband, and we have some capacity which could be used for packet error concealment. >> Ivan Tashev: More questions? >>: I have a related question. With the AMR codec there are significant artifacts introduced by the codec itself. Presumably the presence of these artifacts has the effect of concealing other problems like channel errors or a poorly placed microphone or a poorly designed microphone. And then, as we go to HD-voice, those artifacts are going to be reduced. And the user may then become more aware of other problems like a poorly placed microphone or channel errors [inaudible]. Do you have some insight as to how the user's awareness is going to change and how the irritation of other artifacts that used to be insignificant becomes more detectable? >> Peter Vary: Yeah, that's a very interesting question. Let me say, if you are in a video-audio conference and one subscriber comes via the old telephone, you clearly see and hear the significant difference between the audio quality of Subscriber A and the telephone quality of Subscriber B. But you need not have such complicated constellations. In a normal telephone conference, which we have almost every day or quite often, we experience what has happened in the telephone network over the last ten years; the telephone quality is not getting better. It's getting worse. And you know the people who are on the phone personally, and you clearly find out, "Oh, that's probably a mobile, and that's a terrible-sounding mobile phone, and that's a terrible desk telephone." In the office you hear the reverberation. I expect that the users are really aware of the degradations, and we should do something. And that should be applied to the whole chain; it starts at the acoustic front end. You address the microphones and the frequency responses. So we should improve that. We should not make them cheaper. But I'm realistic; I know we cannot replace all of the devices we have there. So we should try to improve by intelligent signal processing, let me say, intelligent equalization at the receiving end. Or better pre-processing of the digital signal. I expect there is a lot of room for improvement even if we have an AMR codec in between. Yeah? But I think this is a very interesting question for the near future. >> Ivan Tashev: More questions? >>: I have one question. For super-wideband you mentioned that that would be an additional improvement. 
But I think what we found is that if you own a cell phone like that, then the microphone is so far away from the mouth that actually [inaudible] frequencies are almost [inaudible]. >> Peter Vary: Yeah. >>: Did you do any studies on that? >> Peter Vary: Yeah, we have to distinguish between at least three different modes of usage: that is, the normal handset mode, the hands-free mode, and then we have the headset mode. And presumably you will not get all the improvements in each mode. If you have the headset, you clearly will get this improvement. Or if you have an audio conference situation with big loudspeakers. So that would be at least a third or fourth mode. >>: A few years back there were studies about Artificial Wideband Extension and user fatigue. So sometimes when you hear it for the first time it's impressive; something is there that wasn't there before. But if you keep using it over time, you may change your opinion. You may actually prefer a clean narrowband versus, you know, artificially [inaudible]. I'm not referring to the case where you embed information or the extension is actually intelligent by design. But do you have any comments on that? >> Peter Vary: Yeah, there is a big test that emerged from the [inaudible] project and [inaudible] workshop last year. And there was a common effort to make a very deep comparison of different approaches of Artificial BandWidth Extension. And the results are very encouraging. If you would like to know more details about that, I could send you the paper. I don't have in mind where it was published. It is out meanwhile. >> Ivan Tashev: Let's thank again our speaker. [applause]