Image and Video Compression
Wenwu Wang
Centre for Vision Speech and Signal Processing, Department of Electronic Engineering, University of Surrey
Email: w.wang@surrey.ac.uk

Introduction
• Course components
• A brief history
• Basic concepts
• Coding performance limits
• Coding of still images

Course Components
• Component Coding Algorithms I (By myself)
  • Fundamentals of Compression
  • Coding of Still Images
  • JPEG standard
  • Vector Quantisation
  • Subband and Wavelet Coding
• Component Coding Algorithms II (By Dr Fernando)
  • Coding of video sequences; H.261 and H.263 coding algorithms
  • MPEG-1, -2, -4 coding algorithms
• Component Error Resilience in Video Communications (By Prof. Kondoz)

Further Reading
• Component Coding Algorithms
  • Ghanbari, M. Standard Codecs: Image Compression to Advanced Video Coding, IEE Telecommunication Series 49, 2003. ISBN 0-85296-710-1 (A)
  • Clarke, R. J. Digital Compression of Still Images and Video, Academic Press, 1995. ISBN 0-12-175720-X (B)
  • Haskell, B. G., Puri, A. and Netravali, A. N. Digital Video: An Introduction to MPEG-2, Chapman and Hall, 1997. ISBN 0-412-08411-2 (B)
• Error Resilience
  • Sadka, A. H. Compressed Video Communications, J. Wiley and Co, 2001. ISBN 0-470843128 (A)

A Brief History of Image Communication
• 1840: Louis J.M. Daguerre (France) and William Henry Fox Talbot (England), photographic film
• 1895: First public motion picture presentation
• 1920s: First television experiment
Figure: British TV pioneer J.L. Baird with Nipkow disc (around 1926)
• 1930s: Colour movies
• 1930-32: First experimental television broadcasting in US
• 1935: First German television broadcasting in Berlin
• 1936: TV transmission during the Berlin Olympics
• 1939: Regular monochrome TV service in US
• 1952: Regular TV service in Germany
• 1954: NTSC colour television in US
• 1967: PAL colour television in Germany
• 1970s: Consumer video cassette recorder (VCR)
• 1970s: Fax machines
• 1980s: Digital TV studios (ITU-R Rec. 601)
• 1990s:
  • JPEG and MPEG standards
  • Digital still cameras
  • Digital TV broadcasting
  • Digital video/versatile disk (DVD)
  • Integration of computers and video
  • World wide web
  • Internet video streaming
Figure: Evolution of the video coding standards by the ITU-T and ISO/IEC committees

Fundamentals

What?
• The minimisation of the amount of information required to represent an image/video signal
• The reduction of the overall signal bandwidth

Why?
• Applications for which bandwidth is a precious commodity
• Storage applications: archiving, television production, home entertainment, multimedia
• Transmission applications: radio and television broadcasting, internet video streaming, multimedia for mobile phones

How?
• Image and video signals contain superfluous (redundant) information
• Statistical redundancy is associated with signal predictability/correlation/smoothness. The original signal can be recovered perfectly, so this is called “lossless” or “information preserving” coding
• Subjective redundancy is associated with the error tolerance of human vision. The original signal cannot be recovered perfectly and only an approximate reconstruction is possible, so this is called “lossy” or “error tolerant” coding

Performance Assessment
• Efficiency in image and video coding (an indication of how much the information has been reduced in the coded signal)
  • lossless systems: the ratio of uncoded-to-coded information, i.e. the compression ratio
  • lossy systems: the amount of coded information expressed as a function of the distortion introduced by the coding operation, i.e. the rate/distortion function
• Distortion in image and video coding (an indication of how close the coded signal is to the original)
  • lossless systems: trivially zero distortion (infinite fidelity)
  • lossy systems: distortion can be measured objectively (computation of the error between the original and the coded representation) or subjectively (tests designed to measure the response of human vision to coding artefacts)

Coding Operation in the Image Chain
• Signal processing operations anywhere in the image chain can be regarded as coding operations. Such operations may be due to:
  • Acquisition environment (such as lighting conditions and light propagation, special effects in the studio, and atmospheric conditions in outside broadcasts)
  • Acquisition systems (such as camera optics, scanning aperture and field integration in electronic imaging, and chemical processes in film)
  • Post-production environment (such as special effects)
  • Image/video display systems (such as display aperture in electronic imaging, half-toning in printed media, and chemical processes in film)
  • Viewing environment (such as propagation of light, and optical paths)
  • Human visual system (such as the lens, and the response of neurons to light stimuli)
• We are not concerned with the above but need to be aware of their coding effects.
• We are concerned with the processing of image/video signals after acquisition/post-production and prior to display.

Classification of Video Coding Systems
• Analogue (signals predominantly in analogue form)
  • PAL (Phase Alternating Line, transmission of terrestrial television)
  • VHS (Video Home System, home video recording)
  • MAC (Multiplexed Analogue Component, satellite television transmission)
  • Betacam SP (Superior Performance, video recording in the studio)
• Digital (signals predominantly in digital form)
  • ITU-R Rec. 601 (BT.601, or CCIR-601) (professional video recording)
  • MPEG-1 (home video recording, CD-ROM)
  • MPEG-2 (television transmission)
  • MPEG-4 (multimedia)
  • H.261/H.263 (video conferencing)
  • JPEG (still images)
• We will be mainly concerned with digital signals in this module. For more about analogue signals, please refer to a textbook. Here, we only introduce a few fundamentals of analogue signals that are closely related to digital signals.

Coding of Colour Signals
• One of the challenges facing the first colour television systems was the inclusion of colour information without increasing the video bandwidth.
• Colour cameras operate in the space of R, G, B primaries. Each of these component signals is full-bandwidth (i.e. 6.75 MHz).
• Colour coding systems (e.g. PAL) typically involve the conversion of component signals to composite by means of the following processing operations:
  • R, G, B to Y, U, V co-ordinate transformation
  • Low-pass filtering of the U and V components
  • DSSC-AM modulation of U and V by two sub-carriers in phase quadrature
  • Sign alternation of the modulated V at every other line
• The co-ordinate transformation is
\[ Y = W_R R + W_G G + W_B B \]
\[ U = 0.436\,\frac{B - Y}{1 - W_B} \]
\[ V = 0.615\,\frac{R - Y}{1 - W_R} \]
  where $W_R = 0.299$, $W_B = 0.114$, $W_G = 0.587$.
• Y: luma component, representing the brightness of an image (i.e. the “black and white” or achromatic portion of the image).
• U: blue-difference chroma (B-Y)
• V: red-difference chroma (R-Y)
• “Luma” and “chroma” are usually used in video engineering, while “luminance” and “chrominance” are used in colour science.
• In the digital domain, YCbCr is used to represent the coded colour, where DSSC-AM modulation is replaced by subsampling.

Consequence of Colour Coding
• Compression ratio: 3:1
• Artefacts
  • Visible line structure and interline flicker
  • Combing (distortion of vertical detail moving horizontally due to interlace)
  • Spatial aliasing (e.g. diagonal straight lines cause spatial “beat” frequencies and jagged/staircase edges)
  • Temporal aliasing (fast motion suffers from “judder”)
  • Picture “softness” (aperture effects)
• Artefact frequency: low
• Artefact severity: high
• Remedies
  • At the transmitter end, intelligent PAL encoding allowing better segregation of colour and monochrome components with less crosstalk between them
  • At the receiver end, intelligent PAL decoding, possibly involving motion adaptive filtering (may attenuate some frequency components)

Digital Video Formats: A Case Study of Digital Television
• This format is standardised and is described in the document “Recommendation ITU-R BT.601”.
• Source signals: Y, U, V (one luminance and two colour-difference components, gamma pre-corrected and filtered)
• Sampling structure (625 line/50 Hz analogue system)
  • Orthogonal, line, field and frame repetitive
  • U, V samples co-sited with odd Y samples in each line
  • 864 total (720 active) luminance samples per line
  • 432 total (360 active) chrominance samples per line
  • 625 total (576 active) lines
• Sampling frequency (Y: 13.5 MHz; U, V: 6.75 MHz)
• Quantisation
  • Uniformly quantised PCM, 8 (optionally 10) bits per sample
  • Scale 0-255
  • Luminance black level defined as level 16
  • Luminance peak white level defined as level 235
  • Luminance total number of active levels: 220
  • Chrominance total number of active levels: 225, with zero corresponding to level 128
• Total active bit-rate
  720 samples/line × 576 lines/frame × 25 frames/sec × 8 bits/sample/component × (1 + 0.5 + 0.5) components = 166 Mbit/s
• Total raw bit-rate (Y: 13.5 MHz; U, V: 6.75 MHz)
  864 samples/line × 625 lines/frame × 25 frames/sec × 8 bits/sample/component × (1 + 0.5 + 0.5) components = 216 Mbit/s
• For television transmission purposes this amount of information may require (depending on the modulation scheme) a bandwidth of 40 MHz upwards.
• Today this corresponds to the occupancy of 6-7 analogue terrestrial television channels.
• Therefore, to make digital television transmission a practical proposition, compression in the digital domain is imperative.
• Note 1: Unused samples and levels are actually used to convey auxiliary and control information, e.g. vertical and horizontal synchronisation (blanking), colour reference (burst) etc. There are applications which require this information in digital form.
• Note 2: The 601 standard is a specification of the output format only and is not concerned with the practical implementation of the A/D conversion. This is left to the system designer, but should typically involve anti-aliasing pre-filtering and attention to the effects of the non-ideal sampling aperture and pixel aspect ratio.
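The active and raw bit-rate figures quoted above can be reproduced with a few lines of arithmetic. This is only a sketch: the function name `bit_rate` and its parameter names are made up for illustration, and the factor 1 + 0.5 + 0.5 assumes one full-rate luminance component and two half-rate colour-difference components, as in the text.

```python
def bit_rate(samples_per_line, lines_per_frame, frames_per_sec=25,
             bits_per_sample=8, components=1 + 0.5 + 0.5):
    """Bit-rate in bits/sec: Y at full rate plus U and V at half rate each."""
    return (samples_per_line * lines_per_frame * frames_per_sec
            * bits_per_sample * components)

# Active picture area only (720 x 576) and the full raster (864 x 625).
active = bit_rate(720, 576)
raw = bit_rate(864, 625)

print(f"active: {active / 1e6:.1f} Mbit/s")
print(f"raw:    {raw / 1e6:.1f} Mbit/s")
```

The raw figure also equals the sum of the sampling frequencies times the word length: (13.5 + 6.75 + 6.75) MHz × 8 bits.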
Digital Video Formats: Other Formats
• High-definition television (HDTV)
  • 1920 × 1152 × 50 Hz interlaced (16:9 aspect ratio)
  • 1440 × 1152 × 50 Hz interlaced (4:3 aspect ratio)
• Video-conferencing/Video-telephony
  • 352 × 288 × 30 Hz progressive: CIF (Common Intermediate Format)
  • 352 × 288 (240) × 25 (30) Hz progressive: SIF (Source Input Format, PAL (NTSC))
  • 176 × 144 × 30 Hz progressive: QCIF (Quarter CIF)
• Composite (PAL) digital video (recording)
  • 922 × 576 × 50 Hz interlaced
  • This results from sampling a composite (PAL) signal at 4 times the colour subcarrier frequency, and is used for the recording of digital composite signals in studio applications
• Desktop
  • 800 × 600 Super VGA (Video Graphics Array)
  • 640 × 480 VGA

The Hierarchy of Video Sampling Format

Sampling Formats for Chrominance

Coding Performance Limits and Assessment

Self-Information
• A discrete source $X$ with a finite alphabet $A$ can be modelled as a discrete random process, i.e. a sequence of random variables $x_i$, $i = 1, 2, \ldots$
• Each random variable $x_i$ takes a value from the alphabet $A = \{a_k \mid k = 1, 2, \ldots\}$
• The information content of a symbol $a_k$ is related to the degree to which the symbol is unpredictable and unexpected. Quantitatively this can be expressed by means of the self-information $I(a_k)$ of symbol $a_k$:
\[ I(a_k) = -\log_2 p(a_k) \quad \text{(bits)} \]

Source Models
• Two source models are useful for studying the coding performance limit:
  • The Discrete Memoryless Source (DMS): successive symbols are statistically independent, i.e. in a symbol sequence the current symbol does not depend on any previous one
  • The Markov K-th order Source (MKS): successive symbols are statistically dependent, i.e. in a symbol sequence the current symbol depends on the K previous ones
• The entropy of a DMS source $X$ is defined as the average self-information:
\[ H(X) = \sum_k p(a_k)\, I(a_k) = -\sum_k p(a_k) \log_2 p(a_k) \]
• The entropy is maximised for a uniform symbol distribution.
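As a small numerical illustration of the definitions above, the following sketch computes the self-information and the DMS entropy for two four-symbol distributions. The function names and the example probabilities are assumptions chosen for illustration; they do not come from the slides.

```python
import math

def self_information(p):
    """Self-information I(a_k) = -log2 p(a_k), in bits."""
    return -math.log2(p)

def entropy(probs):
    """Entropy of a discrete memoryless source: the average self-information."""
    return sum(p * self_information(p) for p in probs if p > 0)

# Two hypothetical four-symbol sources.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.5, 0.25, 0.125, 0.125]

print(entropy(uniform))  # 2 bits/symbol: the maximum for four symbols
print(entropy(skewed))   # 1.75 bits/symbol: lower, since the source is more predictable
```

The uniform source reaches log2(4) = 2 bits per symbol, which is the maximum for a four-symbol alphabet, while the skewed source needs fewer bits on average.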
Markov-K Source
• The MKS model is a more realistic model for images and video
  • Images (of natural scenes) are correlated in the spatial domain, e.g. plain areas (with little or no spatial detail)
  • Video is correlated in the spatial domain as above and also in the temporal domain, e.g. static areas (with little or no motion)
• An MKS can be specified by the following conditional probabilities:
\[ p(X_i = a_k \mid X_{i-1}, \ldots, X_{i-K}) \quad \forall\, i, k \]
• The entropy of an MKS source is defined as
\[ H(X) = \sum_{S_K} p(X_{i-1}, \ldots, X_{i-K})\, H(X \mid X_{i-1}, \ldots, X_{i-K}) \]
  where
\[ H(X \mid X_{i-1}, \ldots, X_{i-K}) = -\sum_k p(X_i = a_k \mid X_{i-1}, \ldots, X_{i-K}) \log_2 p(X_i = a_k \mid X_{i-1}, \ldots, X_{i-K}) \]
  is the conditional entropy and $S_K$ denotes all possible realisations $\{X_{i-1}, \ldots, X_{i-K}\}$.

Coding Theorem
Figure: A typical rate distortion curve

Practical Considerations
• Information rate for coded still images:
  • Bits per pixel (bpp), i.e. the ratio of coded information in bits to the total number of pixels
  • Compression ratio (dimensionless), i.e. the ratio of uncoded-to-coded information
• Information rate for coded moving sequences:
  • Bits per second (b/s) and its multiples (kb/s, Mb/s), i.e. the rate of flow of the coded information
• Distortion of coded-and-decoded image/video:
  • Objectively, using the Peak Signal-to-Noise Ratio (PSNR):
\[ \mathrm{PSNR} = 10 \log_{10} \frac{(\text{number of active levels})^2}{\text{coding error variance}} \]
  • Subjectively, using quality and impairment scales designed to measure the response of human vision. For television, subjective assessment procedures are standardised and are described in Rec. ITU-R BT.500

Subjective Picture Assessment for Television

Human Visual System
Figure: Plot of the contrast sensitivity (just perceptible modulation) function

Coding of Still Images

Classification of Compression Techniques
• Spatial (data) Domain: elements are used “raw” in suitable combinations. The frequency of occurrence of such combinations is used to influence the design of the coder, so that shorter codewords are used for more frequent combinations and vice versa (entropy coding).
• Transform Domain: elements are mapped onto a different domain (e.g. the frequency domain). The resulting coefficients are quantised and entropy-coded.
• Hybrid: combinations of the above.

Lossless Coding in the Spatial Domain
• Memoryless Coding
• Conditional Coding
  • Construct “current” symbol histograms according to “previous” symbols and use separate codebooks accordingly
• Block (joint) Coding
  • Define blocks of more than one symbol and record their occurrences using a multi-dimensional histogram
  • The codebook grows exponentially with block size
  • Useful when the symbols in a block are correlated
  • An example uses a block size of 2, i.e. two consecutive symbols
• Predictive Coding (previous symbol)
  • The “previous” symbol is used as a prediction of the “current” symbol
  • The prediction error is coded in a memoryless fashion
  • The prediction error alphabet and codebook are about twice the size of the symbol alphabet, e.g. symbol alphabet {1, 2, 3, 4} gives prediction error alphabet {-3, -2, -1, 0, 1, 2, 3}
  • A good predictor will minimise the error (most occurrences will be zero)
• Predictive Coding (generalised)
  • Prediction is based on a combination of previous symbols
  • The prediction template needs to be “causal”, i.e. it should contain only “previous” elements with respect to the direction of scanning. This is important for coding applications, as the decoder will need to have decoded the template elements first in order to perform the prediction of the current element.
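Previous-symbol predictive coding, as described above, can be sketched as follows. The function names and the convention that both encoder and decoder start from an agreed initial prediction are assumptions made for this illustration; the memoryless entropy coding of the prediction errors is left out.

```python
def predict_encode(symbols, initial):
    """Replace each symbol by its difference from the previous symbol."""
    prev = initial
    errors = []
    for s in symbols:
        errors.append(s - prev)  # prediction error; zero when the symbol repeats
        prev = s
    return errors

def predict_decode(errors, initial):
    """Invert predict_encode by accumulating the prediction errors."""
    prev = initial
    symbols = []
    for e in errors:
        prev = prev + e
        symbols.append(prev)
    return symbols

# Symbols drawn from the alphabet {1, 2, 3, 4}; errors fall in {-3, ..., 3}.
symbols = [1, 1, 2, 2, 2, 3, 4, 4]
errors = predict_encode(symbols, initial=1)
decoded = predict_decode(errors, initial=1)
print(errors)
print(decoded == symbols)
```

For a slowly varying input most prediction errors are zero, so the error histogram is more peaked than the symbol histogram and a memoryless code for the errors can use shorter codewords on average. The decoder only ever uses already decoded symbols, which is the causality requirement stated above.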
• Run-length Coding
  • Useful when consecutive symbols in a string are identical
  • A symbol is followed by the number of its repetitions
  • Figures: a typical example; a general example
• Zero Run-length Coding
  • Useful for strings that contain long runs of consecutive zeros and are sparsely populated by non-zero symbols, e.g. quantised frame differences
  • A non-zero symbol is followed by the number of consecutive zeros
  • Figures: a typical example; a general example

Entropy Coding (Variable Length Coding)
• Assignment of codewords to individual symbols or collections of symbols according to likelihood
• More probable symbols or collections of symbols are assigned shorter codewords and vice versa, hence the name variable length coding (VLC)
• Two types of VLC are employed in the standard video codecs: Huffman coding and arithmetic coding.
• Huffman coding is a simple VLC. It is generally suboptimal: because each symbol must be assigned a codeword with an integral number of bits, its average code length cannot reach the entropy unless all symbol probabilities are negative powers of two. It is employed in all standard codecs.
• Arithmetic coding is an optimal coding method which can approach the entropy, since the symbols are coded collectively using a code string which represents a fractional value on the number line between 0 and 1. It is employed in JPEG, JPEG2000, H.263 and MPEG-4, where extra compression is demanded.

Huffman Coding
Figure: An example of a Huffman code for seven symbols (average bits per symbol and entropy)

Arithmetic Coding
• A scale is used on which the coding intervals of real numbers between 0 and 1 are represented. This is in fact the cumulative probability function of all the symbols, which add up to 1.
• The interval is partitioned according to symbol likelihood.
• The interval is iteratively reduced by retaining, at each iteration, the sub-interval corresponding to the currently encoded input symbol.
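The interval partitioning and iterative narrowing just described can be sketched in a few lines. This is an illustration under stated assumptions rather than a practical coder: it uses the six-symbol alphabet {a, e, i, o, u, !} with illustrative fixed probabilities, exact rational arithmetic from Python's `fractions` module instead of the finite-precision integer arithmetic real codecs use, and the symbol `!` as an assumed end-of-message marker. The names `MODEL`, `encode` and `decode` are hypothetical.

```python
from fractions import Fraction as F

# Assumed fixed model: each symbol owns a sub-range of [0, 1)
# whose width equals its probability.
MODEL = {
    "a": (F("0.0"), F("0.2")),
    "e": (F("0.2"), F("0.5")),
    "i": (F("0.5"), F("0.6")),
    "o": (F("0.6"), F("0.8")),
    "u": (F("0.8"), F("0.9")),
    "!": (F("0.9"), F("1.0")),
}

def encode(message, model=MODEL):
    """Narrow [0, 1) once per symbol and return the final interval."""
    low, high = F(0), F(1)
    for sym in message:
        width = high - low
        s_low, s_high = model[sym]
        low, high = low + width * s_low, low + width * s_high
    return low, high

def decode(code, model=MODEL, terminator="!"):
    """Recover symbols from a number in [0, 1); stops at the terminator symbol."""
    out = []
    while True:
        for sym, (s_low, s_high) in model.items():
            if s_low <= code < s_high:
                out.append(sym)
                # Rescale the code to the full [0, 1) range for the next symbol.
                code = (code - s_low) / (s_high - s_low)
                break
        if out[-1] == terminator:
            return "".join(out)

low, high = encode("eaii!")
print(float(low), float(high))
print(decode(F("0.23355")))
```

Any number inside the final interval identifies the whole message, so the decoder needs only that number, the same model, and a way to know when to stop (here the terminator symbol).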
• An example: suppose the alphabet is {a, e, i, o, u, !} and a fixed model is used with the probabilities shown in the following table.

  Symbol   Probability   Range
  a        0.2           [0.0, 0.2)
  e        0.3           [0.2, 0.5)
  i        0.1           [0.5, 0.6)
  o        0.2           [0.6, 0.8)
  u        0.1           [0.8, 0.9)
  !        0.1           [0.9, 1.0)

• Each individual symbol needs to be assigned a portion of the [0, 1) range that corresponds to its probability of appearance in the cumulative probability function. For example, the symbol u, with probability 0.1 and defined in the range [0.8, 0.9), can take any value from 0.8 to 0.89999…
• Suppose a message eaii! needs to be coded. The first symbol to be encoded is e. Hence, the final coded message has to be a number in the range [0.2, 0.5).
• The second symbol is a, which has the range [0.0, 0.2). As it is not the first symbol to be encoded, this range is taken within the current subrange [0.2, 0.5). Consequently, after the second symbol, the number is restricted to the range [0.2 + 0.0 × (0.5 - 0.2), 0.2 + 0.2 × (0.5 - 0.2)) = [0.2, 0.26).
• The next symbol to be encoded is i, with the range [0.5, 0.6), which is taken within the new subrange [0.2, 0.26). Hence, after this symbol, the coded number is restricted to the range [0.2 + 0.5 × (0.26 - 0.2), 0.2 + 0.6 × (0.26 - 0.2)) = [0.23, 0.236).
• Applying the same rule to the successive symbols, we obtain the following table:

  Encoded symbol   Range after encoding
  e                [0.2, 0.5)
  a                [0.2, 0.26)
  i                [0.23, 0.236)
  i                [0.233, 0.2336)
  !                [0.23354, 0.2336)

• The final range [0.23354, 0.2336) represents the message eaii!. This means that if we transmit any number in the range [0.23354, 0.2336), that number represents the whole message eaii!.
Figure: Representation of the arithmetic coding process with the interval scaled up at each stage for the message eaii!
• Decoding process
  • For the previous example, suppose the number 0.23355, which lies in the range [0.23354, 0.2336), is transmitted. The decoder, using the same probability intervals as the encoder, performs a similar procedure.
  • Only the interval [0.2, 0.5) of e envelops the transmitted code 0.23355, so the first symbol can only be e.
  • The new code for the second symbol is (0.23355 - 0.2)/(0.5 - 0.2) ≈ 0.11183, which is enveloped by the interval [0.0, 0.2) of symbol a.
  • The new code for the third symbol is (0.11183 - 0.0)/(0.2 - 0.0) ≈ 0.55917, which is enveloped by the range [0.5, 0.6) of symbol i.
  • This is followed by (0.55917 - 0.5)/(0.6 - 0.5) ≈ 0.5917, in the range [0.5, 0.6) of symbol i.
  • Finally, (0.5917 - 0.5)/(0.6 - 0.5) ≈ 0.917, which is in the range [0.9, 1) of symbol !.
  • Therefore, the decoded message is eaii!. The decoding process is shown in the following table:

  Code value   Enveloping range   Decoded symbol
  0.23355      [0.2, 0.5)         e
  0.11183      [0.0, 0.2)         a
  0.55917      [0.5, 0.6)         i
  0.5917       [0.5, 0.6)         i
  0.917        [0.9, 1.0)         !

Lossless Coding in Transform Domain
• Transforms commonly refer to expansions of signals into series of coefficients using sets of appropriate (e.g. orthonormal) basis functions, so that the following are achieved:
  • Decorrelation of the input data
  • Optimal distribution of energy (variance) into the smallest number of coefficients
• The optimal transform according to the above is the Karhunen-Loeve (KL) transform. This is not used in practice:
  • Its basis functions are the eigenvectors of the covariance matrix of the input signal. They are therefore data-dependent and need to be computed and transmitted for each data set.
  • There are no fast implementations of the KL transform
• In practice, sub-optimal transforms are used whose basis functions are data-independent and whose performance is close to that of the KL transform.
• The DCT is the most widely used transform in image/video coding and is a fundamental component of many standardised algorithms.
  • KLT and DCT basis functions closely resemble each other for images modelled as first-order Markov processes.
  • An n-point DCT is equivalent to a 2n-point DFT obtained by reflection. This avoids spurious harmonics due to discontinuities at the boundaries of the repetition period.
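The energy compaction property of the DCT can be checked numerically on a short, smooth signal. The sketch below assumes the orthonormal DCT-II, which is a common choice but is not specified in the slides, and uses a made-up eight-sample ramp as input; the function name `dct` is hypothetical.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a list of samples (direct O(n^2) evaluation)."""
    n = len(x)
    coeffs = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        total = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i in range(n))
        coeffs.append(scale * total)
    return coeffs

# A smooth (highly correlated) block of eight hypothetical pixel values.
block = [10, 11, 12, 13, 14, 15, 16, 17]
coeffs = dct(block)

signal_energy = sum(v * v for v in block)
coeff_energy = sum(c * c for c in coeffs)
compaction = sum(c * c for c in coeffs[:2]) / coeff_energy

print(f"energy in samples:      {signal_energy:.3f}")
print(f"energy in coefficients: {coeff_energy:.3f}")
print(f"fraction in first two coefficients: {compaction:.4f}")
```

Because the transform is orthonormal, the total energy is unchanged, but nearly all of it ends up in the first two coefficients. The remaining coefficients are small, which is what makes subsequent entropy coding (and, in lossy schemes, quantisation) effective.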
• The following example visualises the decorrelation and energy compaction properties of transforms:

Comparison of Various Transforms
(1) Energy concentration is measured for typical natural images with a block size of 1-by-32.
(2) The KLT is optimum, and the DCT performs slightly worse than the KLT.

Block Transform Coding

Lossy Coding
• For natural images the compression performance of lossless coding schemes is fairly modest
  • Compression ratios of 3:1 or 4:1 can be achieved using the best of the above mentioned schemes. This is comparable to the performance achieved by general purpose data compression algorithms, e.g. Ziv-Lempel, which are not designed specifically to exploit image structure.
• To improve performance, some coding distortion will have to be tolerated. The main aims of lossy coding are:
  • To optimise rate/distortion performance, i.e. achieve the best image quality for a given target bit-rate
  • To minimise the perceptual impact of distortion, i.e. produce coding errors that are likely to be imperceptible to the human viewer
• The main tool for lossy coding is quantisation. This is applicable to most domains:
  • Spatial (data) domain: applicable to raw pixels, pixel differences (predictive coding), conditional pixel occurrences (conditional coding), and ensembles of pixels (joint coding). The last is a special case of so-called vector quantisation, which will be studied separately.
  • Transform domain: applicable to transform coefficients and ensembles of coefficients (vector quantisation).
• Another important tool is sampling. This is usually applicable to the data domain.
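A minimal sketch of quantisation as a lossy tool, applied to raw pixels in the spatial domain, is given below. It assumes a uniform scalar quantiser with a fixed step size and measures distortion with a PSNR that uses a peak value of 255 for 8-bit data; the pixel values, step sizes and function names are all illustrative assumptions.

```python
import math

def quantise(value, step):
    """Map a sample to the index of the nearest multiple of the step size."""
    return math.floor(value / step + 0.5)

def dequantise(index, step):
    """Reconstruct the sample value represented by a quantiser index."""
    return index * step

def psnr(original, decoded, peak=255):
    """PSNR in dB from the mean squared coding error (infinite if lossless)."""
    mse = sum((o - d) ** 2 for o, d in zip(original, decoded)) / len(original)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)

# Hypothetical row of 8-bit pixel values.
pixels = [52, 55, 61, 66, 70, 61, 64, 73]

for step in (1, 8, 16):
    decoded = [dequantise(quantise(p, step), step) for p in pixels]
    print(step, decoded, psnr(pixels, decoded))
```

A larger step size leaves fewer distinct indices to entropy-code, so the rate falls, but the reconstruction error grows and the PSNR drops. This is the rate/distortion trade-off that lossy coding aims to optimise.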
Quantisation (scalar)

Lossy Predictive Coding
• Open-loop encoder (prediction based on past inputs)
• Closed-loop encoder (prediction based on past outputs)
• Decoder (prediction always based on past outputs)

Lossy Transform Coding
• Coder
• Decoder

Sampling: One-dimensional sampling

Sampling: Two-dimensional sampling

Non-ideal Sampling

Interpolation

Non-ideal Interpolation (sample-and-hold)

Non-ideal Interpolation (bi-linear)

Example of Non-ideal Interpolation

Summary
• A brief history of image communication and coding standards
• Coding performance theorem
• Some fundamental concepts of compression
• Coding methods for still images (this is the most important part of this lecture session)

Acknowledgement
Thanks to T. Vlachos and B. Girod for providing their lecture notes, which have been partly used in this presentation. Thanks also to M. Ghanbari; part of the material used here is from his textbook.