Section 8: Digitising Speech, audio & video

UNIVERSITY of MANCHESTER
School of Computer Science
Comp30291
Digital Media Processing 2009-10
Section 8 (amended):
Processing Speech, Music & Video
14 Dec'09
Comp30291 Section 8
8.1 Digitising speech
• Traditional telephone channels restrict speech to 300-3400 Hz.
• Considered not to incur serious loss of intelligibility.
• Significant effect on naturalness of sound.
• Once band-limited, speech may be sampled at 8 kHz.
•ITU-T G711 standard for speech in POTS allocates 64000 b/s
for 8 kHz sampling rate with 8 bits / sample.
•Exercise: Why are components below 300 Hz removed?
8.1.1 International standards for speech coding:
• The ITU is a United Nations agency; until 1993 its committee for telephony standards was the CCITT.
• Since 1993, the work of the CCITT has continued as ITU-T.
• Within ITU-T is a study group responsible for speech digitisation
& coding standards.
•Among other organisations defining standards for telecoms &
telephony are:
“TCH-HS”: part of ETSI (GSM).
“TIA” USA equivalent of ETSI.
“RCR” Japanese equivalent of ETSI.
“Inmarsat” & various committees within NATO.
• Standards exist for digitising “wide-band” speech
(50 Hz to 7 kHz), e.g. ITU G722.
8.1.2. Uniform quantisation.
•Quantisation: each sample, x[n], of x(t) approximated by closest
available quantisation level.
•Uniform quantisation: constant voltage difference Δ between levels.
•With 8 bits, & input range ±V, have 256 levels with Δ = V/128.
•If x(t) between ±V, & samples are rounded, uniform quantisation
produces
$\hat{x}[n] = x[n] + e[n]$ where $-\Delta/2 \le e[n] \le \Delta/2$
• Otherwise, overflow will occur & magnitude of error may be >> Δ/2.
• Overflow is best avoided.
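As a rough illustration of the rounding rule above, the following Python sketch quantises a sample to the nearest of 2^m uniformly spaced levels. The function name `quantise_uniform` and its defaults (V = 1, 8 bits) are assumptions made for this example, not part of the course material.

```python
def quantise_uniform(x, v=1.0, bits=8):
    """Round x to the nearest of 2**bits uniformly spaced levels spanning -v..+v."""
    levels = 2 ** bits
    delta = 2.0 * v / levels          # step size, e.g. V/128 for 8 bits
    index = round(x / delta)          # nearest level (rounding, not truncation)
    # Clip to the available codes; anything clipped here is an overflow.
    index = max(-levels // 2, min(levels // 2 - 1, index))
    return index * delta


delta = 2.0 / 256
# Errors for inputs that stay inside the quantiser range.
errors = [quantise_uniform(k / 1000.0) - k / 1000.0 for k in range(-990, 991)]
# Error for an input that overflows the range.
overflow_error = quantise_uniform(1.5) - 1.5
```

Inside the range the error magnitude stays within Δ/2; the overflowing input is clipped, so its error is far larger than Δ/2.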
Noise due to uniform quantisation error
•Samples e[n] ‘random’ within ±Δ/2.
•When quantised signal converted back to analogue,
adds random error or “noise” signal to x(t).
•Noise heard as ‘white noise’ sound added to x(t).
•Samples e[n] have uniform probability between ±Δ/2, i.e. p(e) = 1/Δ.
It follows that the mean square value of e[n] is:
$$\sigma_e^2 = \int_{-\Delta/2}^{\Delta/2} e^2\,p(e)\,de = \frac{1}{\Delta}\int_{-\Delta/2}^{\Delta/2} e^2\,de = \frac{1}{\Delta}\left[\frac{e^3}{3}\right]_{-\Delta/2}^{\Delta/2} = \frac{\Delta^2}{12}$$
This is the power of the analogue quantisation noise in 0 Hz to fS/2.
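The Δ²/12 result can be checked numerically. The sketch below sweeps evenly spaced inputs across ten quantisation steps, rounds each to the nearest level and averages the squared error; the function name and the choice Δ = 1/128 are assumptions for illustration only.

```python
def mean_square_rounding_error(delta, n=100000):
    """Average e**2, where e is the rounding error for n evenly spread inputs."""
    total = 0.0
    for k in range(n):
        x = (k + 0.5) / n * 10 * delta       # inputs sweep ten quantisation steps
        e = round(x / delta) * delta - x     # error after rounding to nearest level
        total += e * e
    return total / n


delta = 1.0 / 128
measured = mean_square_rounding_error(delta)
predicted = delta ** 2 / 12
```

The two values should agree closely, because evenly spread inputs give errors that are uniformly distributed between ±Δ/2.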
8.1.3. Signal-to-quantisation noise ratio (SQNR)
• Measure how seriously signal is degraded by quantisation noise.
$$\text{SQNR} = 10\log_{10}\left(\frac{\text{signal power}}{\text{quantisation noise power}}\right)$$
in decibels (dB).
• With uniform quantisation, quantisation-noise power is Δ²/12
• Independent of signal power.
• Therefore, SQNR will depend on signal power.
• If we amplify signal as much as possible without overflow, for
sinusoidal waveforms with m-bit uniform quantiser:
SQNR ≈ 6m + 1.8 dB.
•Approximately true for speech also.
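The 6m + 1.8 dB rule can be derived in a few lines. The working below assumes a full-scale sinusoid of amplitude V (so its power is V²/2) and an m-bit quantiser spanning ±V, so that Δ = 2V/2^m; these assumptions match the earlier slides but the derivation itself is added here as a sketch.

```latex
\text{signal power} = \frac{V^2}{2}, \qquad
\text{noise power} = \frac{\Delta^2}{12} = \frac{(2V/2^m)^2}{12} = \frac{V^2}{3 \cdot 2^{2m}}

\text{SQNR} = 10\log_{10}\left(\frac{V^2/2}{V^2/(3 \cdot 2^{2m})}\right)
            = 10\log_{10}\left(\tfrac{3}{2} \cdot 2^{2m}\right)
            = 20m\log_{10}2 + 10\log_{10}1.5
            \approx 6.02m + 1.76\ \text{dB}
```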
Variation of input levels
•For telephone users with loud voices & quiet voices,
quantisation-noise will have same power, Δ²/12.
• Δ may be too large for quiet voices, OK for slightly louder ones,
& too small (risking overflow) for much louder voices.
•Useful to know over what range of loudness will speech quality
be acceptable to users.
[Figure: quantisation levels 000 to 111 on a volts axis, with three waveforms labelled "Δ too big for quiet voice", "OK" and "Δ too small for loud voice".]
8.1.4. Dynamic Range (Dy)
$$D_y = 10\log_{10}\left(\frac{\text{Max possible signal power (no overflow)}}{\text{Min. power which gives acceptable SQNR}}\right)\ \text{dB.}$$
For uniform quantisation:
$$D_y = 10\log_{10}\left(\frac{\text{Max signal power}}{\Delta^2/12}\right) - 10\log_{10}\left(\frac{\text{Min power}}{\Delta^2/12}\right)$$
⇒ Dy = max possible SQNR (dB) − min acceptable SQNR (dB)
This final expression for Dy is well worth remembering,
but it only works for uniform quantisation!
Exercise
If SQNR must be at least 30 dB to be acceptable, what is Dy
assuming sinusoids & an 8-bit uniform quantiser?
Solution:
Dy = Max possible SQNR (dB) - Min acceptable SQNR (dB)
= (6m + 1.8) - 30 = 49.8 - 30 = 19.8 dB.
This is too small for telephony.
Exercise: Repeat this calculation for 12-bit uniform quantisation.
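The calculation generalises to any word-length. This small sketch applies the Dy formula with the 6m + 1.8 dB rule for sinusoids; the function name and the 30 dB default are assumptions taken from the exercise wording.

```python
def dynamic_range_db(bits, min_sqnr_db=30.0):
    """Dy for an m-bit uniform quantiser with sinusoids: (6m + 1.8) - min SQNR."""
    max_sqnr_db = 6.0 * bits + 1.8
    return max_sqnr_db - min_sqnr_db


for m in (8, 12, 16):
    print(m, "bits:", round(dynamic_range_db(m), 1), "dB")
```

Each extra bit adds about 6 dB of dynamic range, so four extra bits add about 24 dB.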
8.1.5. Instantaneous companding
• 8-bits per sample not sufficient for good speech encoding with
uniform quantisation.
•Problem lies with setting a suitable quantisation step-size Δ.
•One solution is to use instantaneous companding.
•Step-size adjusted according to amplitude of sample.
•For larger amplitudes, larger step-sizes used as illustrated next.
•‘Instantaneous’ because step-size changes from sample to sample.
Non-uniform quantisation used for companding
[Figure: x(t) against t with non-uniformly spaced quantisation levels labelled 001 to 111.]
Implementation of companding (in principle)
• Pass x(t) thro’ compressor to produce y(t).
• y(t) is quantised uniformly to give y’(t) which is transmitted or
stored digitally.
• At receiver, y’(t) passed thro’ expander which reverses effect of
compressor.
• Analog implementation uncommon but shows concept well.
[Block diagram: x(t) → Compressor → y(t) → Uniform quantiser → y’(t) → Transmit or store → Expander → x’(t)]
Effect of compressor
• Increase smaller amplitudes of x(t) and reduce larger ones.
• When uniform quantiser is applied, fixed Δ appears:
  - smaller in proportion to smaller amplitudes of x(t),
  - larger in proportion to larger amplitudes.
• Effect is non-uniform quantisation as illustrated previously.
Effect of ‘compressor’ on sinusoid
[Figure: a sinusoid x(t) and the compressed waveform y(t), each plotted against t over the range -1 to 1.]
• ‘Expander’ reverses this shape change.
‘A-Law’ instantaneous companding
Common compressor is linear for |x(t)| close to zero & logarithmic
for larger values. A suitable formula for x(t) in range ±1 volt is:
$$y(t) = \begin{cases}
\dfrac{A\,x(t)}{K} & : |x(t)| \le 1/A \\[6pt]
\operatorname{sign}(x(t))\left(1 + \dfrac{1}{K}\log_e|x(t)|\right) & : 1/A \le |x(t)| \le 1
\end{cases}$$
where K = 1 + log_e(A).
A is constant which determines cross-over between linear & log.
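A minimal Python sketch of this compressor is given below, assuming x is already scaled to the range ±1. The function name `alaw_compress` and the default A = 87.6 are choices made for this example.

```python
import math


def alaw_compress(x, a=87.6):
    """A-law compressor for x in -1..+1: linear near zero, logarithmic elsewhere."""
    k = 1.0 + math.log(a)
    mag = abs(x)
    if mag <= 1.0 / a:
        return a * x / k                       # linear segment
    sign = 1.0 if x > 0 else -1.0
    return sign * (1.0 + math.log(mag) / k)    # logarithmic segment


# Small amplitudes are boosted, full-scale amplitudes are left at +/-1.
quiet = alaw_compress(0.005)
loud = alaw_compress(1.0)
```

The two branches meet at |x| = 1/A, where both give |y| = 1/K.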
Mapping from x(t) to y(t) by A-law companding
[Figure: y(t) against x(t) over the range -1 to +1, with the linear segment between x = ±1/A mapping onto y = ±1/K; drawn for A ≈ 13 (too difficult to draw if A is any larger).]
G711 standard ‘A-law’ companding with A=87.6
•A-law companding as used in UK with A = 87.6 & K = 5.47.
• General formula becomes:
$$y(t) = \begin{cases}
16\,x(t) & : |x(t)| \le 1/87.6 \\[4pt]
\operatorname{sign}(x(t))\left(1 + 0.183\log_e|x(t)|\right) & : 1/87.6 \le |x(t)| \le 1
\end{cases}$$
• Maps x(t) in range ±1 to y(t) in range ±1.
•1 % of domain of x(t) linearly mapped onto 20 % of range of y(t).
• Remaining 99% of domain of x(t) logarithmically mapped
onto 80% of range for y(t).
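A-law expander formula
$$\hat{x}(t) = \begin{cases}
K\,\hat{y}(t)/A & : |\hat{y}(t)| \le 1/K \\[4pt]
\operatorname{sign}(\hat{y}(t))\,e^{K(|\hat{y}(t)| - 1)} & : 1/K \le |\hat{y}(t)| \le 1
\end{cases}$$
When A = 87.6,
$$\hat{x}(t) = \begin{cases}
\hat{y}(t)/16 & : |\hat{y}(t)| \le 0.183 \\[4pt]
\operatorname{sign}(\hat{y}(t))\,e^{5.47(|\hat{y}(t)| - 1)} & : 0.183 \le |\hat{y}(t)| \le 1
\end{cases}$$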
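The expander can be sketched in the same way as the compressor. The function name `alaw_expand` and the default A = 87.6 are assumptions for this example; the input is taken to lie in the range ±1.

```python
import math


def alaw_expand(y, a=87.6):
    """A-law expander for y in -1..+1; inverse of the A-law compressor."""
    k = 1.0 + math.log(a)
    mag = abs(y)
    if mag <= 1.0 / k:
        return k * y / a                       # linear segment
    sign = 1.0 if y > 0 else -1.0
    return sign * math.exp(k * (mag - 1.0))    # exponential segment


# The cross-over point y = 1/K maps back to x = 1/A.
k = 1.0 + math.log(87.6)
crossover = alaw_expand(1.0 / k)
```

At |y| = 1/K the exponential branch gives e^(1-K) = 1/A, so the two branches join without a jump.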
Graph of A-law expander formula
[Figure: x(t) against y(t) over the range -1 to +1, with the linear segment between y = ±1/K mapping onto x = ±1/A; drawn for A ≈ 3 (difficult to draw when A = 87.6).]
Effect of expander on ‘small samples’
• Samples of x(t) ‘small’ when in range ±1/A (quiet speech).
• Compressor has multiplied these by 16 to produce y(t). (Assume A = 87.6)
• Without quantisation, passing y(t) thro’ expander would
produce original samples of x(t) exactly, by dividing by 16.
• But any small changes to y(t) are also divided by 16.
• Effectively reduces quantisation step Δ by factor 16.
• Reduces quantisation noise power (Δ²/12) by factor 16².
• Multiplies SQNR for ‘small’ samples by 16².
• Increases SQNR for quiet speech by 24 dB as:
10log10(16²) = 10log10(2⁸)
= 80log10(2) ≈ 80 × 0.3 = 24 dB
Effect of expander on ‘large’ samples
• Samples of x(t) are ‘large’ when amplitude ≥ 1/A.
• When |x(t)| = 1/A, SQNR at output of expander is:
$$10\log_{10}\left(\frac{(16/A)^2/2}{\Delta^2/12}\right) \approx 35\ \text{dB}, \quad \text{with } A = 87.6 \text{ and } \Delta = 1/2^7$$
• Quantisation-step now increases in proportion to |x(t)| when
|x(t)| increases further, above 1/A towards 1.
• Therefore, SQNR will remain at about 35 dB.
• It will not increase much further as |x(t)| increases above 1/A.
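The 35 dB figure can be checked numerically. This sketch assumes an 8-bit uniform quantiser acting on y(t) in the range ±1, and a sinusoid whose amplitude at the compressor output is 1/K (about 16/A); the variable names are illustrative.

```python
import math

a = 87.6
k = 1.0 + math.log(a)
delta = 2.0 / 256                      # 8-bit uniform quantiser on y(t) in -1..+1
signal_power = (1.0 / k) ** 2 / 2.0    # sinusoid of amplitude 1/K in y(t)
noise_power = delta ** 2 / 12.0        # uniform quantisation noise power
sqnr_db = 10.0 * math.log10(signal_power / noise_power)
print(round(sqnr_db, 1), "dB")
```

The result is a little over 35 dB, in line with the value quoted above.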
Observations
• For |x(t)| > 1/A, expander causes quantisation step to increase in
proportion to |x(t)|.
• Quantisation noise gets louder as the signal gets louder.
• If all samples of x(t) ‘large’, SQNR would remain approximately the
same, i.e. about 35 or 36 dB for G711 (see next slide)
• If 30 dB SQNR is acceptable, large signals are quantised satisfactorily.
• Small signals quantised satisfactorily for lower amplitudes.
• Largest amplitude unchanged & smallest reduced by factor 16.
• Dy increased by factor 16 i.e. 24 dB to 19.8+24 = 43.8 dB
• Same as 12-bit linear (6x12+1.8 – 30) but with only 8 bits.
• Quantisation error for A-law worse than for uniform when |x(t)| > 1/A
• Price to be paid for the increased dynamic range.
Variation of SQNR with amplitude of sample
[Figure: SQNR in dB (axis marked 12 to 48) against amplitude of sample (0, V/16, V/4, V/2, 3V/4, V), with one curve for Uniform and one for A-law quantisation.]
Mu (μ)-law companding
•Similar companding technique adopted in USA.
$$y(t) = \operatorname{sign}(x(t))\,\frac{\log_e\left(1 + \mu|x(t)|/V\right)}{\log_e(1 + \mu)}$$
μ (mu) = 255 generally used.
•When |x(t)| < V/μ, y(t) ≈ (μ / log_e(1+μ)) x(t)/V
since log_e(1+x) = x − x²/2 + x³/3 − … when |x| < 1.
•μ-law with μ = 255 is like A-law with A = 255,
though transition from small quantisation-steps for small x(t) to
larger ones for larger values of x(t) is more gradual with μ-law.
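A short Python sketch of the μ-law compressor follows. The function name `mulaw_compress` and the defaults V = 1 and μ = 255 are assumptions for illustration.

```python
import math


def mulaw_compress(x, v=1.0, mu=255.0):
    """Mu-law compressor: maps x in -v..+v to y in -1..+1."""
    sign = 1.0 if x >= 0 else -1.0
    return sign * math.log(1.0 + mu * abs(x) / v) / math.log(1.0 + mu)


# For very small inputs the mapping is close to a straight line of this slope.
small_signal_gain = mulaw_compress(1e-6) / 1e-6
predicted_gain = 255.0 / math.log(256.0)
```

The small-signal gain is close to μ/log_e(1+μ), which is the linear approximation quoted above.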
Implementing compressor & expander
• Compression/expansion may be done digitally via ‘look-up’ table.
• x[n] assumed to be 12-bit integer in range -2048 to 2047.
• Digitise x(t) by 12-bit ADC, or use 16-bit ADC & truncate to 12 bits.
• For each 12-bit integer, table has corresponding 8-bit value of x(t).
• Expander table has a 12-bit word for each 8 bit input.
• Each 8-bit G711 ‘A-law’ sample is a sort of ‘floating point’ number:
  S | X2 X1 X0 | M3 M2 M1 M0
  (S: sign bit; X2 X1 X0: exponent x; M3 M2 M1 M0: mantissa bits for m)
Value is (-1)^(S+1) · m · 2^(x-1) where the ‘mantissa’ m is:
(1 M3 M2 M1 M0 1 0 0 0)₂ if x>0
or (0 M3 M2 M1 M0 1 0 0 0)₂ if x=0
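The bit layout can be turned into a small decoder. The sketch below follows the formula above for x > 0 and ASSUMES a scale factor of 1 (rather than 2^(-1)) when x = 0, which keeps the step size continuous between the first two segments. It also ignores the inversion of alternate bits that real G.711 codecs apply before transmission. The function name is hypothetical, and the output is on a scale 16 times larger than the 12-bit range mentioned earlier.

```python
def decode_alaw_fields(code):
    """Decode an 8-bit word laid out as S X2 X1 X0 M3 M2 M1 M0 (no bit inversion)."""
    s = (code >> 7) & 0x1          # sign bit: 1 means positive in this layout
    x = (code >> 4) & 0x7          # 3-bit exponent (segment number)
    m_bits = code & 0xF            # 4-bit mantissa field M3..M0
    if x > 0:
        mantissa = 0x100 | (m_bits << 4) | 0x8    # (1 M3 M2 M1 M0 1 0 0 0)
        magnitude = mantissa << (x - 1)           # multiply by 2**(x-1)
    else:
        mantissa = (m_bits << 4) | 0x8            # (0 M3 M2 M1 M0 1 0 0 0)
        magnitude = mantissa                      # ASSUMPTION: scale of 1 when x = 0
    return magnitude if s == 1 else -magnitude


# Full expander table: one decoded value for each of the 256 possible codes.
table = [decode_alaw_fields(code) for code in range(256)]
```

Building the table once and indexing it by the received byte is the look-up approach described above.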
8.2.Further reducing bit-rate for digitised speech
•PCM encodes each speech sample independently & is capable of
encoding any wave-shape correctly sampled.
•This is ‘waveform encoding’: simple but needs high bit-rate.
•Speech waveforms have special properties that can be exploited
by ‘parametric coding techniques’ to achieve lower bit-rates.
Properties of speech waveforms
• General trends may be identified allowing one to estimate which
sample value is likely to follow a given set of samples.
• Makes part of information transmitted by PCM redundant
• Speech has 'voiced' & 'unvoiced' parts i.e. 'vowels' & 'consonants'.
• Predictability lies mostly in voiced speech as it has periodicity.
• In voiced speech, a ‘characteristic waveform', like a decaying
sinusoid, is repeated periodically (or approximately so).
[Figure: voiced speech waveform, Volts against t.]
A characteristic waveform for voiced speech
[Figure: a single characteristic waveform, Volts against t.]
Voiced speech (vowels)
•Shapes of characteristic waveforms are, to some extent,
predictable from first few samples.
•Also, once one characteristic waveform has been received,
the next one can be predicted.
•Prediction not 100% accurate.
•Sending a ‘prediction error’ is more efficient than sending the
whole signal.
•‘Decaying sinusoid’ shape of each characteristic waveform is
due to the way sound is 'coloured' by shape of mouth.
•Similarity of repeated characteristic waveforms due to
periodicity of sound produced by vocal cords.
Unvoiced speech (consonants)
• Random or noise-like with little periodicity.
• Lower in amplitude than voiced telephone speech.
• Exact shape of its noise-like waveform not critical for perception
•Almost any noise-like waveform will do as long as energy level
is correct, i.e. it is not too loud or too quiet.
• Unvoiced speech is easier to encode than voiced.
• Separate them at transmitter & encode them in separate ways.
Characteristics of speech & perception exploited to
reduce bit-rate
1. Higher amplitudes may be digitised less accurately.
2. Adjacent samples usually close in value.
3. Voiced characteristic waveforms repeat quasi-periodically
4. Predictability within characteristic waveforms.
5. Unvoiced telephone speech quieter than voiced & exact waveshape not critical for perception.
6. Pauses of about 60 % duration per speaker.
7. Ear insensitive to phase spectrum of telephone speech
8. Ear more sensitive in some frequency ranges than others.
9. Audibility of low level frequency components 'masked' by
adjacent higher level components.
Section of voiced & unvoiced telephone speech.
Large amplitudes are voiced (speech bandlimited to 3.4kHz).
[Figure: amplitude (×10^4, -1 to 1) against time (0 to 2 s).]
Small section of female voiced speech
This was extracted from previous graph (sampled at 8 kHz).
[Figure: amplitude (-8000 to 10000) against sample number (0 to 3500).]
Smaller section of male voiced speech.
[Figure: amplitude (×10^4, -2 to 2) against sample number (0 to 1500).]
Bandlimited to 300-3400 Hz & sampled at 8 kHz.
Magnitude spectrum of male speech
Obtained by FFT analysing about 4 cycles as shown on next slide
[Figure: magnitude spectrum in dBW (6 to 14) against frequency (0 to 4 kHz).]
Section of male voiced speech
[Figure: amplitude (×10^4, -1.5 to 2) against sample number (0 to 700).]
8.2.2. Differential coding
• Encode differences between samples.
• Where differences transmitted by PCM this is ‘differential PCM’.
•Omitted from syllabus in 2009-10
8.2.3. Simplified DPCM coder & decoder
• Omitted from syllabus in 2009-10
8.2.4. Linear prediction coding (LPC)
• Omitted from syllabus in 2009-10
Formants in speech
• Excitation passes thro' vocal tract.
• Flexible tube : begins at glottis & ends at lips.
• Shape determined by tongue, jaw & lips.
• "Nasal tract" also tube connected at "velum".
• Tubes act like a filter with resonances ("formants").
• Resonances change as person speaks.
• Resonance at f1 Hz emphasises power close to f1.
• Amplitudes of some harmonics increased with respect to others.
• Refer back to speech graphs. We see this effect.
8.2.5. Waveform coding and parametric coding.
•Waveform coding techniques such as PCM & DPCM try to
preserve exact shape of speech waveform as far as possible.
•Simple to understand & implement, but cannot achieve very low
bit-rates.
•Parametric techniques (e.g. LPC) do not aim to preserve exact
wave-shape, & represent features expected to be perceptually
significant by sets of parameters,
i.e. b_i coefficients & parameters of stylised error signal.
• Parametric coding more complicated to understand &
implement than waveform coding, but achieves lower bit-rates.
8.2.6. Regular pulse excited LPC with LTP : (RPE-LTP)
• Original speech coding technique for GSM: “GSM 06.10”.
• One of many LPC techniques for mobile telephony.
• Omitted from syllabus 2009-10
8.2.7. CELP
• Based on similar principles to RPE-LTP.
• Omitted 2009-10
8.2.8. Algebraic CELP (G729)
• Omitted 2009-10
8.3. Digitising music:
• Standards exist for the digitisation of ‘audio’ quality sound.
• Compact disks: 2.83 megabits per second.
(fs= 44.1kHz, 16 bits/sample, stereo, with FEC.)
• Too high for many applications.
• DSP compression takes advantage of human hearing.
• Compression is “lossy”, i.e. not like “ZIP” .
• MPEG concerned with hi-fi audio, video & their synchronisation.
• DAB, MUSICAM use “MPEG-1 audio layer 2” (MP2).
• 3 MPEG layers offer range of sampling rates (32, 44.1, 48 kHz)
& bit-rates from 224 kb/s to 32 kb/s per channel.
• Difference lies in complexity of DSP processing required .
MPEG1-4 with MP3 etc
•Layer 1 of MPEG-1 (MP1) simplest & best suited to higher bit-rates:
e.g. Philips digital compact cassette at 192 kb/s per channel.
• Layer 2 has intermediate complexity & is suited to bit-rates around
128 kb/s: DAB uses MP2.
• Layer 3 (MP3) is most complex but offers best audio quality.
Can be used at bit-rates as low as 64 kb/s:
Now used for distribution of musical recordings via Internet.
• All 3 MPEG-1 audio layers simple enough to be implemented on a
single DSP chip as real time encoder/decoder.
•MPEG-2 supports enhanced & original (MPEG-1) versions of MP1-3.
•MPEG-3 was started but abandoned.
Don’t confuse MPEG-3 with MP3 (MPEG-1/2 audio layer 3).
•MPEG-4 ongoing & introduces ‘AAC’ & ‘MP4’.
Introduction to MP3
CD recordings take little account of the nature of the music and music
perception.
Frequency masking & temporal masking can be exploited to reduce the
bit-rate required for recording music. This is ‘lossy’ rather than ‘loss-less’
compression.
Frequency masking: ‘A strong tonal audio signal at a given frequency
will mask, i.e. render inaudible, quieter tones at nearby frequencies, above
and below that of the strong tone, the closer the frequency the more
effective the masking’.
Temporal masking: ‘A loud sound will mask i.e. render inaudible a
quieter sound occurring shortly before or shortly after it. The time
difference depends on the amplitude difference’.
Perceived loudness
• Human ear is not equally sensitive to sound at different
frequencies.
• Most sensitive to sound between about 1 kHz and 4 kHz.
• ‘Equal loudness contour graphs’ reproduced below.
• For any contour, human listeners find sound equally loud
even though actual sound level varies with frequency.
• The sound level is expressed in dB_SPL (see notes)
Equal loudness contours
MP3 sub-bands
• MP3 is a frequency-domain encoding technique.
• Takes 50% overlapping frames of either 1152 or 384 samples.
• With fs = 44.1 kHz : 26.12 ms or 8.7 ms of music
• Sound split into frequency sub-bands.
• Splitting in 2 stages, firstly by bank of 32 FIR band-pass filters.
• Output from each filter down-sampled by a factor 32.
• Then further split by means of DCT
• Obtain 1152 or 384 frequency domain samples.
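The frame durations quoted above follow directly from the frame length and the sampling rate. The sketch below reproduces them and also estimates the width of each sub-band, assuming the 32 filters divide 0 to fs/2 equally; that equal-width assumption and the names used are illustrative.

```python
FS_HZ = 44100
NUM_FILTERS = 32


def frame_duration_ms(samples, fs_hz=FS_HZ):
    """Duration in milliseconds of a frame of the given number of samples."""
    return 1000.0 * samples / fs_hz


long_frame_ms = frame_duration_ms(1152)
short_frame_ms = frame_duration_ms(384)

# ASSUMPTION: the 32 band-pass filters split 0..fs/2 into equal-width bands.
sub_band_width_hz = (FS_HZ / 2) / NUM_FILTERS
# After down-sampling by 32, each sub-band keeps this many samples per long frame.
samples_per_band = 1152 // NUM_FILTERS
```

Down-sampling each filter output by 32 keeps the total number of samples per frame unchanged.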
Frequency-domain coding
• Frequency-domain samples can be coded.
• Spectral energy may be found to be concentrated in certain
frequency bands and may be very low in others.
• Assign more bits to some frequency-domain samples than others.
• Equal loudness contours useful here.
• Use ‘signal to masking ratio’ (SMR) at each frequency to
determine number of bits allocated to the spectral sample.
• More significant advantages from ‘psycho-acoustics’.
Frequency masking
• ‘A strong tonal audio signal at a given frequency will
mask, i.e. render inaudible, quieter tones at nearby
frequencies, above and below that of the strong tone, the
closer the frequency the more effective the masking’.
• Characterised by a ‘spreading’ function.
• Triangular approximation available.
• Example given for a strong tone (60 dB_SPL) at 1kHz.
Example spreading function (triangular approx)
[Figure: two panels showing the triangular spreading function for the tone at 1000 Hz, dB_SPL (axis marked -40 to 60) against f in Hz (axis marked 500, 1000, 2k and 4k).]
Represents a threshold of hearing for frequencies adjacent to the tone.
A tone below the spreading function will be masked & not heard.
Tones make masking threshold contour different from ‘in quiet’.
Triangular approximation to spreading function
• Derived from +25 and -10 dB per Bark triangular approximation
(see WIKI)
• Assuming approximately 4 Barks per octave.
• Bark scale is alternative & perceptually based frequency scale
• Logarithmic for frequencies above 500 Hz.
• More accurate spreading functions are highly non-linear, not
quite triangular, & vary with amplitude & frequency of
masker.
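The triangular approximation can be sketched in a few lines of Python. The code below assumes the peak of the triangle sits at the masker level (a simplification), slopes of 25 dB per Bark below the masker and 10 dB per Bark above it, and 4 Barks per octave as stated above. The function names are hypothetical.

```python
import math


def masking_threshold_db(f_hz, masker_hz=1000.0, masker_db=60.0,
                         barks_per_octave=4.0):
    """Triangular spreading function: threshold in dB_SPL at frequency f_hz."""
    octaves = math.log2(f_hz / masker_hz)      # negative below the masker
    barks = barks_per_octave * octaves
    if barks < 0:
        return masker_db + 25.0 * barks        # rises at 25 dB/Bark towards the masker
    return masker_db - 10.0 * barks            # falls at 10 dB/Bark above the masker


def is_masked(tone_hz, tone_db):
    """True if a tone lies below the spreading function of the 1 kHz masker."""
    return tone_db < masking_threshold_db(tone_hz)


threshold_at_2k = masking_threshold_db(2000.0)
```

With these assumptions a 60 dB_SPL tone at 1 kHz masks a 10 dB_SPL tone at 2 kHz but not one at 500 Hz, since the function falls away much faster on the low-frequency side.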
Effect of 2 strong tones on masking contour
[Figure: masking contour in dB_SPL (0 to 60) against f in Hz (20, 100, 1k, 5k, 10k, 20k), showing the masking contour in quiet and the effect of tones at 800 Hz & 4 kHz.]
•Add spreading function for each tone as identified by FFT.
•Allocate bits according to SMR relative to this masking contour,
•More efficient & economic coding scheme is possible.
Temporal masking
• ‘A loud sound will mask a quieter sound occurring shortly
before or shortly after it. The time difference depends on the
amplitude difference’.
• Effect of frequency masking continues for up to 200 ms after
the strong tone has finished, and even before it starts.
• Frequency masking contour for a given frame should be
calculated taking account of previous & next frames.
Temporal characteristic of masking by a 20 ms
tone at 60 dB_SPL.
[Figure: masking threshold in dB_SPL (20 to 60) against t in ms (-50 to 400), with regions labelled pre-masking, simultaneous masking and post-masking.]
Diagram of an MP3 coder
[Block diagram: Music → Transform to frequency domain → Devise quantisation scheme for sub-bands according to masking → Apply Huffman coding → MP3; a further block, "Derive psychoacoustic masking function", supplies the masking information.]
Huffman coding & more detail
• Quantisation scheme tries to make the Δ²/12 noise < masking threshold.
• Non-uniform quantisation is used.
• Further efficiency with Huffman coding (lossless)
• ‘Self terminating’ variable length codes for the quantisation levels.
• Quantised samples which occur more often given shorter wordlengths.
• MP3 decoder simpler than encoder
• Reverses quantisation to get frequency domain samples &
• Transforms back to time domain taking into account frames are 50%
overlapped.
• Some more detail about MP3 compression and Huffman coding is given
in references quoted in the Comp30291 web-site.
• Frequency masking demo: www.ece.uvic.ca/~aupward/p/demos.htm.
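To illustrate how Huffman coding gives shorter codes to values that occur more often, here is a minimal sketch that computes code lengths from symbol counts using the standard library `heapq` module. The counts are invented for the example and the function name is hypothetical; this is a generic Huffman construction, not the specific code tables used by MP3.

```python
import heapq


def huffman_code_lengths(counts):
    """Return {symbol: code length in bits} for a dict of symbol counts."""
    if len(counts) == 1:
        return {symbol: 1 for symbol in counts}
    # Heap entries: (total count, unique tie-breaker, {symbol: depth so far}).
    heap = [(count, i, {symbol: 0})
            for i, (symbol, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)      # two least frequent groups
        c2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


# Invented counts of quantised sample values: small values dominate.
counts = {0: 60, 1: 20, -1: 12, 2: 5, -2: 3}
lengths = huffman_code_lengths(counts)
average_bits = sum(counts[s] * lengths[s] for s in counts) / sum(counts.values())
```

For these counts the average code length is well below the 3 bits per value that a fixed-length code for five symbols would need.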
8.4. Digitisation of Video
• Digital TV with 486 lines would require 720 pixels per line,
each pixel requiring 5 bits per colour, i.e. 2 bytes per pixel.
• At 30 frames/s, bit-rate ≈ 168 Mb/s or 21 Mbytes/s.
• Normal CD-ROM would hold 30 s of TV video at this bit-rate.
• For HDTV, requirement is about 933 Mb/s
• For film quality, required bit-rate ≈ 2300 Mb/s.
• SVGA screen with 800x600 pixels requires 3 x 8 = 24 bits per
pixel, & 288 Mb/s if refreshed at 25Hz with interlacing.
• Need for video compression is clear.
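The uncompressed bit-rates above can be reproduced with simple arithmetic. In this sketch the function name is illustrative and the CD-ROM capacity of 650 Mbytes is an assumption, as the slides do not state it.

```python
def video_bit_rate(lines, pixels_per_line, bits_per_pixel, frames_per_s):
    """Uncompressed video bit-rate in bits per second."""
    return lines * pixels_per_line * bits_per_pixel * frames_per_s


tv_bits_per_s = video_bit_rate(486, 720, 16, 30)      # 2 bytes per pixel
svga_bits_per_s = video_bit_rate(600, 800, 24, 25)    # 3 x 8 bits per pixel

tv_bytes_per_s = tv_bits_per_s / 8
CD_ROM_BYTES = 650e6                                   # ASSUMPTION: 650 Mbyte disc
seconds_on_cd = CD_ROM_BYTES / tv_bytes_per_s
```

With the assumed capacity, a disc holds roughly half a minute of uncompressed TV video.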
• MPEG-1 & 2 & FCC standard for HDTV use “2-D discrete
cosine transform (DCT)” applied to 8 x 8 (or 10x10) pixel “tiles”.
• Red, green & blue colour measurements for each pixel
transformed to a “luminance” & 2 “chrominance” measurements.
• The eye is more sensitive to differences in luminance than to
variations in chrominance.
• Three separate images dealt with separately.
•For chrominance, average sets of 4 pixels to produce fewer pixels.
• Apply DCT to each tile to obtain 8x8 (or 10x10) 2-D frequency
domain samples starting with a sort of “dc value” which represents
overall brightness of the “tile”.
• Finer detail added by higher 2-D frequency samples.
• Just as for 1-D, higher frequencies add finer detail to signal shape.
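The role of the “dc value” can be seen with a direct 2-D DCT of a single tile. This is a slow, pure-Python sketch of the orthonormal DCT-II for an 8 x 8 tile; the function name is hypothetical and the scaling convention is an assumption, since real codecs use fast algorithms and their own normalisation.

```python
import math

N = 8


def dct_2d(tile):
    """Orthonormal 2-D DCT-II of an N x N tile given as a list of rows."""
    def scale(u):
        return math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for r in range(N):
                for c in range(N):
                    total += (tile[r][c]
                              * math.cos((2 * r + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * c + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


# A tile of uniform brightness has no detail, so only the dc value is non-zero.
flat = [[100.0] * N for _ in range(N)]
coeffs = dct_2d(flat)
```

For the flat tile, all of the energy lands in `coeffs[0][0]` and every higher-frequency sample is zero to within rounding error.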
• The 2-D frequency-domain samples for each tile now quantised
according to perceptual importance.
• Accurately quantise differences between dc values of adjacent tiles.
• Differences often quite small & need few bits.
• Remaining DCT coeffs for each tile diminish in importance
with increasing frequency.
• Many are so small that they may be set to zero.
• Runs of zeros easily digitised by recording length of run.
• Further bit-rate savings achieved by Huffman coding.
• Assigns longer codes to numbers which occur less often.
• Shorter codes for commonly occurring numbers.
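Recording the length of each run of zeros can be sketched as follows. The (run, value) pair format and the "EOB" end-of-block marker are assumptions made for this example, loosely in the style of JPEG; the function name is hypothetical.

```python
def run_length_encode_zeros(coeffs):
    """Encode a list as (zero_run, value) pairs, followed by an 'EOB' marker."""
    pairs = []
    run = 0
    for value in coeffs:
        if value == 0:
            run += 1                    # extend the current run of zeros
        else:
            pairs.append((run, value))  # zeros skipped, then the non-zero value
            run = 0
    pairs.append("EOB")                 # trailing zeros are implied by the marker
    return pairs


# 64 quantised coefficients for one tile: mostly zero at higher frequencies.
quantised = [35, -3, 0, 0, 2, 0, 0, 0, 0, 1] + [0] * 54
encoded = run_length_encode_zeros(quantised)
```

Here 64 coefficients reduce to four pairs and a marker, and the pairs can then be Huffman coded.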
•Technique above may be applied to a single video frame
•Used to digitise still pictures according to “JPEG” standard.
•MPEG-1 & 2 send JPEG encoded frames ( I-frames) about once or
twice per second.
•Between the I-frames MPEG-1 &-2 send
“P-frames” which encode differences between current frame
& previous frame
“B-frames” which encode differences between the current
frame & both the previous & the next frames.
• MPEG-1 originally for encoding reasonable quality video at about
1.2 Mb/s.
• MPEG-2 originally for encoding broadcast quality video at bit-rates between 4 & 6 Mb/s.
End of section 8.