
Week-3 Representation of Speech Waveforms - EEE 2415

Lecture Notes: EEE 2415 Speech Processing, Kirinyaga University
Representation of Speech Waveforms
Speech signals are sound signals, defined as pressure variations travelling through the air. These
variations in pressure can be described as waves and correspondingly they are often called sound waves.
In the current context, we are primarily interested in analysis and processing of such waveforms in digital
systems. We will therefore always assume that the acoustic speech signals have been captured by a
microphone and converted to a digital form.
A speech signal is then represented by a sequence of numbers $x_n$, which represent the relative air pressure at time-instant $n \in \mathbb{N}$. This representation is known as pulse code modulation, often abbreviated as PCM. The accuracy of this representation is specified by two factors:
1) the sampling frequency (the step in time between $n$ and $n+1$), and
2) the accuracy and distribution of the amplitude values $x_n$.
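As a minimal illustrative sketch (not part of the original notes), the following Python fragment shows how these two factors appear in practice: a stand-in for the analog pressure waveform is sampled at a chosen sampling frequency and its amplitudes are stored as signed 16-bit integers.

```python
import numpy as np

fs = 8000          # sampling frequency in Hz (samples per second)
duration = 0.01    # 10 ms of signal
t = np.arange(0, duration, 1 / fs)        # sampling instants n / fs

# Stand-in for the analog pressure waveform: a 200 Hz tone.
analog = 0.5 * np.sin(2 * np.pi * 200 * t)

# PCM representation: amplitudes mapped to signed 16-bit integers.
x_n = np.round(analog * 32767).astype(np.int16)

print(x_n[:10])    # the first few samples of the sequence x_n
```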
Sampling Speech Signals
The original speech signal is a continuous analog signal, which needs to be sampled and converted into discrete data on the timeline. After sampling at equal intervals, the signal is no longer continuous in time, but it is still continuous in amplitude; the analog signal has become a discrete-time signal. The sampling frequency refers to the number of times the sound signal is sampled in one second. The higher the sampling frequency, the more faithfully and naturally the sound is reproduced. In today's mainstream acquisition cards, the sampling frequency is generally divided into three levels: 22.05 kHz, 44.1 kHz and 48 kHz. 22.05 kHz only achieves the sound quality of FM broadcasting, while 44.1 kHz is the practical limit for CD sound quality: the human ear generally perceives sounds from about 20 Hz to 20 kHz, and according to Shannon's sampling theorem the sampling frequency should be at least twice the highest signal frequency, so about 40 kHz suffices to reproduce everything the human ear can hear, which is why the CD standard chose 44.1 kHz. 48 kHz is slightly more accurate still. Sampling frequencies higher than 48 kHz provide no audible benefit, so they are of little use for ordinary computer audio.
Sampling rate
Sampling is a classic topic of signal processing. Here the most important concept is the Nyquist frequency, which is half the sampling rate $F_s$ and defines the upper limit of the bandwidth that can be uniquely represented. In other words, if the sampling frequency is 8000 Hz, then signals in the frequency range 0 to 4000 Hz can be uniquely described at this sampling rate. The A/D converter therefore has to contain a low-pass (anti-aliasing) filter which removes any content above the Nyquist frequency.
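To make the Nyquist limit concrete, the following sketch (an illustrative example, not from the original notes) samples a 5 kHz tone at 8 kHz. Since 5 kHz lies above the 4 kHz Nyquist frequency, the resulting sample sequence is indistinguishable from that of a 3 kHz tone, which is exactly why the anti-aliasing low-pass filter is needed before the converter.

```python
import numpy as np

fs = 8000                    # sampling rate; Nyquist frequency is 4000 Hz
n = np.arange(64)            # sample indices
t = n / fs

tone_5k = np.cos(2 * np.pi * 5000 * t)   # tone above the Nyquist frequency
tone_3k = np.cos(2 * np.pi * 3000 * t)   # its alias below the Nyquist frequency

# The two sampled sequences are numerically identical: 5 kHz aliases onto 3 kHz.
print(np.max(np.abs(tone_5k - tone_3k)))   # effectively zero (within floating-point precision)
```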
The most important information in speech signals are the formants, which reside in the range 300 Hz to 3500 Hz, so a lower limit for the sampling rate is around 7 or 8 kHz. In fact, early digital speech codecs such as AMR-NB use a sampling rate of 8 kHz, known as narrow-band. Some consonants, especially fricatives like /s/, however contain substantial energy above 4 kHz, so narrow-band is not sufficient for high-quality speech. Most energy nevertheless remains below 8 kHz, such that wide-band, that is, a sampling rate of 16 kHz, is sufficient for most purposes. Super-wide band and full band correspond, respectively, to sampling rates of 32 kHz and 44.1 kHz (or 48 kHz). The latter is also the sampling rate used in compact discs (CDs). Such higher rates are useful when also considering non-speech signals like music and generic audio.
1.2 Quantization
Quantization divides the amplitude range of the sampled signal into a number of intervals, assigns every sampled value that falls within a given interval to the same class, and represents that class by a corresponding quantized value. Depending on whether the quantization intervals are evenly spaced, we distinguish uniform quantization and non-uniform quantization. Uniform quantization gives a large signal-to-noise ratio for large signals but a small signal-to-noise ratio for small signals. Its disadvantage is that, to meet a given signal-to-noise-ratio requirement, the number of coding bits must be large enough, which leads to low channel utilization; if the number of coding bits is reduced, the signal-to-noise-ratio requirement can no longer be met. According to the signal-to-noise-ratio formula, the larger the number of coding bits, the higher the signal-to-noise ratio and the better the communication quality.
Non-uniform quantization is usually used for speech signals. The basic idea is to use a large quantization interval for large signals and a small quantization interval for small signals. Because the quantization interval for small signals becomes smaller, the corresponding quantization noise power also decreases (according to the quantization noise power formula), so the signal-to-noise ratio for small signals is improved. After quantization, the signal is no longer continuous in amplitude either: the discrete-time signal has been transformed into a digital signal.
Accuracy and distribution of steps on the amplitude axis
In a digital representation of a signal we are forced to use a finite number of steps to describe the amplitude. In practice, we must quantize the signal to a set of discrete levels.
Linear quantization
Linear quantization with a step size $\Delta q$ corresponds to defining the quantized signal as
$$\hat{x} = \Delta q \cdot \mathrm{round}\left(\frac{x}{\Delta q}\right).$$
The intermediate representation,
$$y = \mathrm{round}\left(\frac{x}{\Delta q}\right),$$
can then be taken to represent, for example, signed 16-bit integers. Consequently, the quantization step size $\Delta q$ then has to be chosen such that $y$ remains in the range $y \in [-2^{15},\ 2^{15}-1]$ to avoid numerical overflow.
The beauty of this approach is that it is very simple to implement. The drawback is that it is sensitive to the choice of the quantization step size. To make use of the whole range, and thus obtain the best accuracy for $x$, we should choose the smallest $\Delta q$ for which $y$ still remains within the integer bounds. This is difficult because the amplitudes of speech signals vary over a large range.
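A minimal sketch of such a linear (uniform) quantizer, assuming the input is roughly normalized to the range $[-1, 1]$ (the step-size choice below is an illustrative assumption, not prescribed in the notes):

```python
import numpy as np

def quantize_linear(x, delta_q):
    """Uniform quantization: y = round(x / delta_q), x_hat = delta_q * y."""
    y = np.round(x / delta_q).astype(np.int32)   # intermediate integer representation
    x_hat = delta_q * y                          # reconstructed (quantized) signal
    return y, x_hat

# Choose delta_q so that y fits into signed 16-bit integers for |x| <= 1.
delta_q = 1.0 / 32767
x = np.array([0.5, -0.25, 0.001, 0.9999])
y, x_hat = quantize_linear(x, delta_q)
print(y)                  # integers within the signed 16-bit range
print(np.abs(x - x_hat))  # quantization error, at most delta_q / 2
```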
Logarithmic quantization and mu-law
To retain equal accuracy for loud and weak signals, we could quantize on a logarithmic scale as
$$\hat{x} = \mathrm{sign}(x) \cdot \exp\left[\Delta q \cdot \mathrm{round}\left(\frac{\log|x|}{\Delta q}\right)\right].$$
Such operations which limit the detrimental effects of limited range are known
as companding algorithms.
Here the intermediate representation is
$$y = \mathrm{round}\left(\frac{\log|x|}{\Delta q}\right),$$
from which the signal can be reconstructed as $\hat{x} = \mathrm{sign}(x) \cdot \exp(\Delta q \cdot y)$.
A benefit of this approach is that we can encode signals over a much larger range, and the quantization accuracy is relative to the signal magnitude. Unfortunately, very small values cause catastrophic problems. In particular, for $x = 0$ the intermediate value goes to negative infinity, $y = -\infty$, which is not representable in a finite digital system.
A practical solution to this problem is quantization with the mu-law algorithm, which defines a modified logarithm as
$$F(x) := \mathrm{sign}(x) \cdot \frac{\log(1 + \mu|x|)}{\log(1 + \mu)}.$$
By replacing the logarithm with $F(x)$, we retain the properties of the logarithm for large $x$, but avoid the problems when $x$ is small.
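A sketch of mu-law companding based on the formula above, with $\mu = 255$ as used in North American telephony; the 8-bit uniform quantization of the companded value is an illustrative choice, not something specified in the notes.

```python
import numpy as np

MU = 255.0

def mu_compress(x):
    """Companding: F(x) = sign(x) * log(1 + mu*|x|) / log(1 + mu), for |x| <= 1."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_expand(y):
    """Inverse companding: x = sign(y) * ((1 + mu)**|y| - 1) / mu."""
    return np.sign(y) * (np.power(1.0 + MU, np.abs(y)) - 1.0) / MU

# Illustrative use: compand, quantize uniformly to 8 bits, then expand.
x = np.array([0.9, 0.1, 0.01, 0.001])
y = np.round(mu_compress(x) * 127) / 127   # uniform 8-bit quantization of F(x)
x_hat = mu_expand(y)
print(np.abs(x - x_hat) / x)               # relative error stays of similar order across levels
```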
Wav-files
The most typical format for storing sound signals is the wav-file format. It is essentially just a way to store a time sequence, typically with either 16- or 32-bit accuracy, as integer, mu-law or float values. Sampling rates can vary over a large range, between 8 and 384 kHz. The files typically use no compression (neither lossless nor lossy coding), so recording hours of sound can require a lot of disk space. For example, an hour of mono (single-channel) sound with a sampling rate of 44.1 kHz and 16-bit samples requires roughly 300 MB of disk space.
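As an illustration of reading and writing such files (assuming SciPy is available; the file name and tone are arbitrary), scipy.io.wavfile stores plain uncompressed PCM, matching the description above:

```python
import numpy as np
from scipy.io import wavfile

fs = 16000                                  # wide-band sampling rate
t = np.arange(0, 1.0, 1 / fs)
x = (0.3 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)   # 16-bit PCM samples

wavfile.write("tone.wav", fs, x)            # write an uncompressed PCM wav file
rate, y = wavfile.read("tone.wav")          # read it back
print(rate, y.dtype, len(y))                # 16000 int16 16000
```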
Adaptive quantization, APCM
• To obtain a uniform quantization error during single phones or sentences, the quantization step size has to change slowly over time.
• In adaptive quantization (adaptive PCM, or APCM) the quantization step size is adapted slowly such that
  o the available quantization levels cover a sufficient range such that numerical overflow can be avoided,
  o the quantization error is stable over time, and
  o as long as the above constraints are fulfilled, the quantization error is minimized.
• An alternative, equivalent implementation to changing the quantization step size is to apply an adaptive gain to the input signal before quantization.
Adaptive quantization with the feed-forward algorithm using an adaptive quantization step
• The feed-forward algorithm requires that, in addition to the quantized signal, the gain coefficients or the quantization step size are also transmitted to the recipient.
  o Transmitting such extra information increases the bit rate, so the feed-forward algorithm is not optimal for applications which try to minimize the transmission rate.
Adaptive quantization with the feed-forward algorithm using an adaptive gain (compressor)
• In feed-backward algorithms the quantization step size or gain coefficient is determined from previous samples which have already been quantized.
  o Since the previous samples are also available at the decoder, the quantization step size or gain coefficient can be determined at the decoder as well, without extra transmitted information.
  o If the signal grows very rapidly, this approach however cannot guarantee that there are no numerical overflows, since adaptation is performed only after quantization.
Adaptive quantization with the feed-backward algorithm using an adaptive quantization step
• Note that the feed-forward algorithms all require transmission of the scaling or gain coefficient, which can increase the demand on bandwidth and adds to the complexity of the system.
• The parallel transmission line can be avoided by predicting those coefficients from previously transmitted elements, with a feed-backward algorithm.
Adaptive quantization with the feed-backward algorithm using an adaptive gain coefficient
• The feed-backward algorithm can naturally be applied to gain adaptation as well.
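A rough sketch of the feed-forward variant described above: the signal is split into frames, a step size is computed for each frame, and both the quantized samples and the step size are sent to the decoder. The frame length, bit depth and step-size rule are illustrative assumptions, not values given in the notes.

```python
import numpy as np

def apcm_encode(x, frame_len=128, bits=4):
    """Feed-forward APCM: the per-frame step size is computed from the input and transmitted."""
    levels = 2 ** (bits - 1)
    codes, steps = [], []
    for start in range(0, len(x), frame_len):
        frame = x[start:start + frame_len]
        step = max(np.max(np.abs(frame)), 1e-9) / levels          # step adapted to the frame
        codes.append(np.clip(np.round(frame / step), -levels, levels - 1).astype(int))
        steps.append(step)                                        # side information for the decoder
    return codes, steps

def apcm_decode(codes, steps):
    return np.concatenate([step * c for c, step in zip(codes, steps)])

x = np.random.randn(1024) * np.linspace(0.01, 1.0, 1024)   # signal whose level grows slowly
codes, steps = apcm_encode(x)
x_hat = apcm_decode(codes, steps)
print(np.max(np.abs(x - x_hat)))                            # error stays on the order of the local step size
```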
Differential quantization DPCM
• In differential quantization we predict the next sample, so that we only need to quantize the difference between the prediction and the actual sample.
  o If the predictor is simply $\tilde{x}_k := x_{k-1}$, then the prediction error is $e_k = x_k - x_{k-1}$.
  o This first difference (delta modulation) is the simplest predictor; it relies on the assumption that subsequent samples are highly correlated.
  o The reconstruction is obtained by reorganizing the terms as $x_k = \tilde{x}_k + e_k$.
  o Observe that the reconstruction is needed at both the encoder and the decoder, to feed the predictor.
  o NB: At this point the flow graphs start to get a bit complicated, as there are several feedback loops.
• More generally, we can use a predictor $P$, which predicts a sample based on a weighted sum of previous samples,
$$\tilde{x}_k = -\sum_{h=1}^{M} a_h x_{k-h},$$
where the scalars $a_h$ are the predictor parameters.
• A feed-backward implementation would here use the past quantized samples $\hat{x}_{k-h}$ instead.
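A sketch of first-order DPCM with the delta-style predictor $\tilde{x}_k = \hat{x}_{k-1}$ (i.e. feed-backward prediction from the previously reconstructed sample); the step size is an illustrative choice.

```python
import numpy as np

def dpcm_encode(x, delta_q=0.05):
    """DPCM with predictor x_tilde_k = x_hat_{k-1}; only the quantized error is transmitted."""
    codes = np.zeros(len(x), dtype=int)
    x_hat_prev = 0.0
    for k, sample in enumerate(x):
        e = sample - x_hat_prev                        # prediction error e_k = x_k - x_tilde_k
        codes[k] = int(np.round(e / delta_q))          # quantized error (the transmitted value)
        x_hat_prev = x_hat_prev + codes[k] * delta_q   # reconstruction, also done at the decoder
    return codes

def dpcm_decode(codes, delta_q=0.05):
    return np.cumsum(codes * delta_q)                  # x_hat_k = x_hat_{k-1} + quantized e_k

t = np.arange(200) / 8000.0
x = 0.5 * np.sin(2 * np.pi * 300 * t)                  # slowly varying, highly correlated samples
x_hat = dpcm_decode(dpcm_encode(x))
print(np.max(np.abs(x - x_hat)))                       # roughly delta_q / 2
```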
Adaptive and differential quantization with feed-forward
• The differential, source-model-based quantization can naturally be combined with adaptive, perception-based quantization.
  o Adaptive differential PCM (ADPCM) adaptively predicts the signal and adaptively chooses the quantization step size.
Adaptive differential quantization with feed-backward
• The ADPCM can again, naturally, be implemented as a feed-backward algorithm as well.
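A compact sketch of feed-backward ADPCM: both the predictor and the step size are derived from previously quantized data, so the decoder can track them without side information. The particular adaptation rule (grow the step on large codes, shrink it on small ones) and all numeric constants are illustrative assumptions.

```python
import numpy as np

def adpcm_fb_encode(x, bits=3, step0=0.05):
    """Feed-backward ADPCM: predictor and step size both derived from past quantized data."""
    levels = 2 ** (bits - 1)
    step, x_hat_prev = step0, 0.0
    codes = []
    for sample in x:
        e = sample - x_hat_prev                                     # prediction error
        c = int(np.clip(np.round(e / step), -levels, levels - 1))   # transmitted code
        codes.append(c)
        x_hat_prev = x_hat_prev + c * step                          # reconstruction (decoder does the same)
        # Step-size adaptation from the code just sent: no extra side information is needed.
        step *= 1.5 if abs(c) >= levels - 1 else 0.9
        step = min(max(step, 1e-4), 1.0)
    return codes

x = np.concatenate([0.01 * np.random.randn(200), 0.5 * np.random.randn(200)])
codes = adpcm_fb_encode(x)
print(codes[:10], codes[200:210])   # codes stay usable because the step has adapted to the louder segment
```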
1.3 Coding
After quantization the signal has become a digital signal, and this digital signal then needs to be encoded into binary. "CD quality" audio uses a sampling rate of 44100 samples per second, with 16 bits per encoded sample. The whole process of sampling, quantizing and coding is called A/D (analog-to-digital) conversion; the inverse process is D/A conversion. Pre-filtering is carried out before the A/D conversion, and a smoothing (reconstruction) filter needs to be added after the D/A conversion. These functions can be accomplished with a single chip, and many such chips are available on the market.
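As a quick sanity check of these numbers, the raw bit rate of CD-quality PCM follows directly from the sampling rate, the bit depth and the number of channels (stereo for a CD):

```python
fs = 44100        # samples per second
bits = 16         # bits per encoded sample
channels = 2      # CD audio is stereo

bit_rate = fs * bits * channels          # 1,411,200 bits per second (about 1.4 Mbit/s)
bytes_per_hour = bit_rate / 8 * 3600     # roughly 635 MB per hour of stereo audio
print(bit_rate, bytes_per_hour / 1e6)
```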
Delta Modulation
PCM is powerful, but quite complex coders and decoders are required, and an increase in resolution also requires a higher number of bits per sample. Standard PCM systems have no memory: each sample value is separately encoded into a series of binary digits. An alternative, which overcomes some limitations of PCM, is to use past information in the encoding process. One way of doing this is to perform source coding using delta modulation.
The signal is first quantized into discrete levels, but the size of the step between adjacent samples is kept constant. The signal may therefore only make a transition from one level to an adjacent one. Once the
quantization operation is performed, transmission of the signal can be achieved by sending a zero for a
negative transition, and a one for a positive transition. Note that this means that the quantized signal must
change at each sampling point. For the example waveform in the original figure (not reproduced here), the transmitted bit train would be 111100010111110.
The demodulator for a delta-modulated signal is simply a staircase generator: if a one is received, the staircase increments positively, and if a zero is received, negatively. This is usually followed by a low-pass filter. The key to using delta modulation is to make the right choice of step size and sampling period; an incorrect selection will mean that the signal changes too fast for the steps to follow, a situation called slope overload. The important parameters are therefore the step size and the sampling period.
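A sketch of a basic delta modulator and its staircase demodulator; the step size and the test tone are illustrative choices.

```python
import numpy as np

def dm_encode(x, step):
    """Delta modulation: transmit 1 for a positive step, 0 for a negative step."""
    bits, staircase = [], 0.0
    for sample in x:
        bit = 1 if sample >= staircase else 0
        bits.append(bit)
        staircase += step if bit else -step    # the approximation always moves by one step
    return bits

def dm_decode(bits, step):
    """Staircase generator: increment on 1, decrement on 0 (a low-pass filter would follow)."""
    return np.cumsum([step if b else -step for b in bits])

t = np.arange(400) / 8000.0
x = 0.4 * np.sin(2 * np.pi * 100 * t)          # slow enough to avoid slope overload
bits = dm_encode(x, step=0.05)
x_hat = dm_decode(bits, step=0.05)
print(bits[:16])                               # the transmitted bit train
print(np.max(np.abs(x - x_hat)))               # small, since slope overload is avoided
```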
If the signal has a known upper frequency cut-off $\omega_m$, then we can estimate the fastest rate at which it can change. Assuming that the signal is $f(t) = b\cos(\omega_m t)$, the maximum slope is given by
$$\left|\frac{df}{dt}\right|_{\max} = b\,\omega_m = 2\pi b f_m .$$
For a DM system with step size $a$, the maximum rate of rise that can be handled is $a/T_s = a f_s$, so we require
$$f_s \ge \frac{2\pi b f_m}{a} = \frac{2\pi f_m}{a/b}.$$
Making the assumption that the quantization noise in DM is uniformly distributed over $(-a, a)$, the mean-square quantization error power is $a^2/3$. We assume that this power is spread evenly over all frequencies up to the sampling frequency $f_s$. However, there is still the low-pass filter in the DM receiver; if its cut-off frequency is set to the maximum signal frequency $f_m$, then the total noise power in the reconstructed signal is
$$\overline{n_q^2(t)} = \frac{a^2}{3}\,\frac{f_m}{f_s}.$$
Still making the assumption of a sinusoidal signal, the SNR for DM is
$$\frac{S}{N} = \frac{3 f_s}{a^2 f_m}\,\overline{f^2(t)} = \frac{3 f_s}{a^2 f_m}\cdot\frac{b^2}{2} = \frac{3}{8\pi^2}\left(\frac{f_s}{f_m}\right)^3$$
when the slope overload condition is just met. The SNR therefore increases by 9 dB for every doubling of the sampling frequency. Delta modulation is extremely simple, and gives acceptable performance in many
applications, but is clearly limited. One way of attempting to improve performance is to use adaptive DM,
where the step size is not required to be constant. (The voice communication systems on the US space
shuttles make use of this technique.) Another is to use delta PCM, where each desired step size is encoded
as a (multiple bit) PCM signal, and transmitted to the receiver as a code word. Differential PCM is
similar, but encodes the difference between a sample and its predicted value — this can further reduce the
number of bits required for transmission.
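As a worked example of the SNR formula under the stated assumptions (sinusoidal input, slope overload just met), with an assumed $f_s = 64$ kHz and $f_m = 4$ kHz the predicted SNR is about 22 dB, and doubling $f_s$ adds roughly 9 dB:

```python
import numpy as np

def dm_snr_db(fs, fm):
    """SNR of delta modulation for a sinusoid when slope overload is just met."""
    snr = 3.0 / (8.0 * np.pi ** 2) * (fs / fm) ** 3
    return 10.0 * np.log10(snr)

print(dm_snr_db(64e3, 4e3))    # about 21.9 dB
print(dm_snr_db(128e3, 4e3))   # about 30.9 dB, roughly 9 dB higher
```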
Comparison: Delta Modulation (DM) and Differential Pulse Code Modulation (DPCM)
Delta Modulation (DM)
Delta modulation is an analog-to-digital and digital-to-analog signal conversion technique that is used to achieve a high signal-to-noise ratio. It uses a one-bit PCM code to achieve digital transmission of the analog signal. With delta modulation, rather than transmitting a coded representation of each sample, only one bit is transmitted, which simply indicates whether the current sample is larger or smaller than the previous sample. It is the simplest form of Differential Pulse Code Modulation, and the delta-modulated bit stream is smaller than that of a Pulse Code Modulation system.
Differential Pulse Code Modulation (DPCM)
DPCM stands for Differential Pulse Code Modulation. Like Pulse Code Modulation, it is a technique for converting an analog signal into a digital signal, and it has a moderate signal-to-noise ratio. Differential Pulse Code Modulation differs from Pulse Code Modulation in that it quantizes the difference between the actual sample and its predicted value; this is why it is called Differential Pulse Code Modulation (DPCM).
The DPCM transmitter and receiver operations are illustrated by block diagrams in the original notes (figure not reproduced here). In the transmitter, if the difference signal is positive, the next bit in the digital data is 1; otherwise it is 0.
The differences between DM and DPCM are summarized below:
1. Feedback: In DM, feedback exists only in the transmitter. In DPCM, feedback exists in both the transmitter and the receiver.
2. Signal-to-noise ratio: DM has a poor signal-to-noise ratio. DPCM has a fair signal-to-noise ratio.
3. Transmission bandwidth: DM requires the lowest bandwidth. DPCM requires less bandwidth than PCM.
4. Levels and step size: In DM, the step size is fixed. In DPCM, the number of quantization levels is fixed.
5. Efficiency: DM is less efficient than DPCM. DPCM is more efficient.
6. Number of bits: In DM, only one bit is used per sample. In DPCM, more than one bit is used per sample, but fewer than in PCM (Pulse Code Modulation).
7. Quantization error and distortion: In DM, slope overload distortion and granular noise are present. In DPCM, slope overload distortion and quantization noise are present.
8. Applications: DM is generally used for speech and images. DPCM is mostly used for video and speech.