Building the BIG Picture: Enhanced Resolution from Coding

by Roger George Kermode

B.E. (Hons) Electrical Engineering, University of Melbourne, 1989
B.Sc. Computer Science, University of Melbourne, 1990

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, in Partial Fulfillment of the Requirements of the Degree of Master of Science in Media Arts and Sciences at the Massachusetts Institute of Technology, June 1994.

Copyright Massachusetts Institute of Technology, 1994. All Rights Reserved.

Signature of Author: Program in Media Arts and Sciences, May 6, 1994

Certified by: Andrew B. Lippman, Associate Director, MIT Media Laboratory, Thesis Supervisor

Accepted by: Stephen A. Benton, Chairperson, Departmental Committee on Graduate Students, Program in Media Arts and Sciences

Building the BIG Picture: Enhanced Resolution from Coding
by Roger George Kermode

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, on May 6, 1994, in Partial Fulfillment of the Requirements of the Degree of Master of Science in Media Arts and Sciences.

Abstract

Video coding algorithms based on hybrid transform techniques are rapidly reaching a limit in compression efficiency. A significant improvement requires some sort of new representation -- a better model of the image based more closely on how we perceive it. This thesis proposes a coder that merges psycho-visual phenomena such as occlusion and long-term memory with the architecture and arithmetic processing used in a high quality hybrid coder (MPEG). The syntactic and processing overhead at the decoder is small when compared to a standard MPEG decoder. The final encoded representation consists of a small number of regions, their motion, and an error signal that corrects errors introduced when these regions are composited together. These regions can be further processed in the receiver to construct a synthetic background image that has the foreground regions removed and replaced, where possible, with revealed information from other frames. Finally, the receiver can also enhance the apparent resolution of the background via resolution-out-of-time techniques. It is anticipated that this shift of intelligence from the encoder to the receiver will provide a means to browse, filter, and view footage in a much more efficient and intuitive manner than is possible with current technology.

Thesis Supervisor: Andrew B. Lippman
Title: Associate Director, MIT Media Laboratory

This work was supported by contracts from the Movies of the Future and Television of Tomorrow consortia, and ARPA/ISTO # DAAD-05-90-C-0333.

The following people served as readers for this thesis:

Reader: Edward H. Adelson, Associate Professor of Vision Science, Program in Media Arts and Sciences

Reader: Michael Biggar, Principal Engineer, Customer Services and Systems, Telecom Australia Research Laboratories, Telstra Corporation Ltd.

To my family; George, Fairlie, Ivy May, and Meredith, whose love, support, and understanding made this thesis possible.

Contents

1 Introduction
  1.1 Motivation
  1.2 The problem
  1.3 Approach
2 Image Coding Fundamentals
  2.1 Representation
  2.2 JPEG Standard [1]
    2.2.1 Transform Coding
    2.2.2 Quantization of Transform Coefficients
    2.2.3 Entropy Coding
  2.3 Frame Rate
  2.4 MPEG Standards [2], [3]
    2.4.1 Motion Estimation and Compensation
    2.4.2 Residual Error Signal
3 Recent Advances in Image Representation
  3.1 Salient Stills
  3.2 Structured Video Coding
  3.3 Layered Coder
  3.4 General Observations
4 Enhancing the Background Resolution
  4.1 One Dimensional Sub Nyquist Sampling
  4.2 Enhancing Resolution Using Many Cameras
  4.3 Enhancing Resolution Using One Camera
    4.3.1 Stepped Sensor Mount
    4.3.2 Generalized Camera Motion
  4.4 Enhanced Resolution Image Construction
5 Foreground Extraction
  5.1 Choosing a Basis for Segmentation
  5.2 Motion Estimation
  5.3 Background Model Construction
  5.4 Foreground Model Construction
    5.4.1 Initial Segmentation
    5.4.2 Variance Based Segmentation Adjustment
    5.4.3 Region Allocation
    5.4.4 Region Pruning
  5.5 Temporarily Stationary Objects
6 Updating the Models
  6.1 Frame Types and Ordering
    6.1.1 Model (M) Frames
    6.1.2 Model Delta (D) Frames
    6.1.3 Frame Ordering Example
  6.2 Updating the Affine Model
  6.3 Model Update Decision
  6.4 Encoder Block Diagram
  6.5 Decoder Block Diagram
7 Simulation Results
  7.1 Motion/Shape Estimation Algorithm Performance
    7.1.1 Foreground Object Extraction
    7.1.2 Background Construction Accuracy
    7.1.3 Super Resolution Composited Image
  7.2 Efficiency of Representation Compared to MPEG2
    7.2.1 Motion/Shape Information
    7.2.2 Error Signal
8 Conclusions
  8.1 Performance of Enhanced Resolution Codec
  8.2 Possible Extensions for Further Study
Appendix A: Spatially Adaptive Hierarchical Block Matching
  A.1 Decreasing Computational Complexity
  A.2 Increasing Accuracy
Appendix B: Affine Model Estimation
Appendix C: Enhanced Resolution Codec Bit Stream Definition
  C.1 Video Bitstream Syntax
    C.1.1 Video Sequence
    C.1.2 Sequence Header
    C.1.3 Group of Pictures Header
    C.1.4 Picture Header
    C.1.5 Picture Data
    C.1.6 Background Affine Model
    C.1.7 Foreground Object
    C.1.8 Relative Block Position
    C.1.9 Motion Vector
    C.1.10 Error Residual Data
    C.1.11 Error Slice
    C.1.12 Residual Macroblock
    C.1.13 Block
  C.2 Variable Length Code Tables
References

List of Figures

Fig. 1.1 Genus of the Enhanced Resolution Codec
Fig. 2.1 Basis Functions for an 8 by 8 Two Dimensional DCT
Fig. 2.2 Huffman Coding Example
Fig. 2.3 Progressive and Interlace Scanning
Fig. 2.4 Block Based Motion Compensation Example
Fig. 2.5 MPEG2 Encoder Block Diagram
Fig. 2.6 MPEG2 Decoder Block Diagram
Fig. 3.1 Salient Still Block Diagram [28]
Fig. 4.1 An arbitrary one dimensional sequence
Fig. 4.2 Sub Nyquist sampled sequences from odd and even samples of Figure 4.1
Fig. 4.3 Spectral Analysis of Down Sampling
Fig. 4.4 Four Camera Rig for Doubling Spatial Resolution
Fig. 4.5 Desired Camera Response for Doubling Resolution
Fig. 4.6 Camera CCD Array (modified from Fig 2.22 in [5])
Fig. 4.7 Low Pass Frequency Response due to Finite Sensor Element Size
Fig. 4.8 Two Dimensional Pulse Train
Fig. 4.9 Modulation Transfer Function for a Typical CCD Array
Fig. 4.10 Modulation Transfer Function for a Typical Camera
Fig. 4.11 Two Shots of the Same Scene: a) Wide Angle, b) Close Up (2X)
Fig. 4.12 Spectral Content of Two Pictures when Combined
Fig. 4.13 Nearest Model Pel Determination
Fig. 5.1 Flower Garden Scene (from an MPEG test sequence)
Fig. 5.2 Variance Based Adjustment
Fig. 5.3 Example of Region Allocation by Scan Line Conversion and Merge
Fig. 5.4 Example Region Pruning
Fig. 6.1 Example of MPEG Frame Ordering
Fig. 6.2 Enhanced Resolution Codec Frame Ordering Example
Fig. 6.3 Enhanced Resolution Encoder Block Diagram
Fig. 6.4 Enhanced Resolution Decoder Block Diagram
Fig. 7.1 Original Image and Segmented Lecturer with Hole in Head (and also in Arm)
Fig. 7.2 Poor Segmentation due to Flat Regions Causing Inaccurate Motion Estimation
Fig. 7.3 Carousel Foreground Segmentation
Fig. 7.4 Super Resolution Backgrounds for Frames 1, 51, and 89
Fig. 7.5 Super Resolution Frame with Foreground Composited on Top
Fig. 7.6 Motion Information Bit Rates for the Lecturer Sequence
Fig. 7.7 Motion Information Bit Rates for the Carousel Sequence
Fig. 7.8 SNR for the Lecturer Sequence
Fig. 7.9 SNR for the Carousel Sequence
Fig. A.1 Pyramid Decomposition
Fig. A.2 Spiral Search Method
Fig. B.1 Block Diagram of 6 Parameter Affine Estimation Algorithm

Chapter 1 Introduction

1.1 Motivation

Hybrid transform video compression coders have been the subject of a great deal of research in the past ten years, and within the last three they have become practical and cost effective for applications ranging from videotelephony to entertainment television. They are now the object of optimization rather than of basic research. In fact, the sheer size of their expected deployment will effectively freeze their design until general purpose processors can perform fully programmable decoding. The current state of the art provides compression ratios on the order of 50 to 100:1 with acceptable loss of picture quality for home entertainment.
Drawing a parallel with word processing, one could say that the level of sophistication of these coders is currently on a par with the word processors of 15 years ago. Those early word processors stored text very efficiently, using codes of 7 to 8 bits per character, but the final representations included little information about the text itself. No information concerning typefaces, page layout, or other higher levels of abstraction was included in the representation. Today the majority of word processors allow one to format text with a veritable plethora of options, ranging from the size and style of font to automatic cross referencing between text in different documents. It is easy to see that embellishing the presentation with information about the text has made today's word processors far richer and more useful than those that use words alone.

When one examines encoded video representations today, one finds that they are at the 'text only' stage. Current algorithms do a very good job of reducing the size of the representation, but in doing so lose almost all ability to manipulate the content in a meaningful fashion.

The much heralded information super highway with multi-gigabit data paths is expected to become a reality soon. When it does, it will be crucial that new representations be found for video that enable one to make quick decisions about what to watch without having to search through every single bit. These representations should contain information about the content that enables machine and viewer alike to make quick decisions without having to examine every second of video. Without such representations, the benefits of having access to an information super highway, as opposed to traditional media sources, will be minimal for all but the expert user. This being the case, it will be difficult to attract people to use the highways, and it may become difficult to justify their cost.

Admittedly these are very lofty goals, and the amount of effort required to achieve them is enormous. Given that effort, and the newness of the super highway as a medium, it is likely that the realization of these goals will take quite some time. Consequently, if for no other reason than cost, it is also likely that the initial steps towards making these representations a reality will be derivatives of current representations.

1.2 The problem

Before a new representation can be derived it is useful to examine current state of the art video coding algorithms in order to determine what their good and bad points are. The salient features of current hybrid transform algorithms can be summarized as follows:

- Heavy reliance on statistical models of pictures, in particular spatio-temporal proximity between frames.
- The model used for predicting subsequent frames consists of only one or two single frames that have previously been encoded.
- Frames are subdivided into blocks which are then sequentially encoded using a single motion vector for each block and an error residual.
- Image formation by 3D to 2D projection is ignored; little consideration is given to phenomena such as occlusion, reflection, or perspective.
- There is no easy means for determining the contents of a frame without decoding it in its entirety.

Current research into new approaches to coding generally attempts to address different sets of applications, or to use radically different image models.
For example, the MPEG1 group, which has issued one standard for multimedia and broadcast television (MPEG-1) and is putting the final touches on another (MPEG-2 [3]), is beginning a three-year effort deliberately designed to speed the evolution of radically different coders that break the hybrid/DCT mold. These are expected to be used primarily in extremely low-bandwidth applications, but the development spurred by the MPEG effort may extend itself past that.

1. Moving Picture Experts Group, ISO/IEC JTC1 SC29/WG11

As may be surmised from the list of hybrid transform coder features on the previous page, these new modelling techniques derive either from new work on understanding human vision (and mapping that into a compression system), or from work on new computational efficiencies that permit the exploitation of complex models that previously had been academic tours de force.

1.3 Approach

This thesis proposes an object-oriented coding model that is an architectural extension to existing MPEG-style coders. The object model assumes no knowledge other than the data contained in the frames themselves (no hand-seeded model of the scene is used), and is basically two or two-and-a-half dimensional (two dimensional plus depth of each two dimensional layer). The model classifies objects within the frame on the basis of their appearance, connectivity, and motion relative to their surrounds. While the raster-oriented decomposition of the frame (as used in MPEG) remains in this coder, the image is no longer coded as blocks classified on an individual basis. Instead, the image is coded as a montage of semi-homogeneous regions in conjunction with an error signal for where the model fails.

The structure of the coder borrows many elements from previous coders and image processing techniques, in particular MPEG2 [3], Teodosio's Salient Still [28], and the work by Wang, Adelson and Desai [4], [30]. The resulting coding model incorporates the efficiencies of motion compensation using 6 parameter affine models as well as the increased accuracy of standard block based motion compensation for foreground regions that do not match the affine model. It is hoped that the coding scheme developed here may eventually be extended into the framework of a fully three dimensional representation as developed by Holtzman [13], which promises extremely high compression ratios. The genealogy of the Enhanced Resolution Coder is shown in Figure 1.1.

[Figure 1.1: Genus of the Enhanced Resolution Codec -- image encoding and representation schemes, showing the Enhanced Resolution Coder descending from H.261/MPEG1/H.262 (MPEG2), the Salient Still (Teodosio), and the Layered Coder (Adelson & Wang), and pointing towards an automatic full 3D coder (Holtzman).]

Another way to classify the algorithm is to consider its position in the table below.

Encoder Type      | Information Units           | Example
Waveform          | Color                       | D1
Transform         | Blocks                      | JPEG [1]
Hybrid Transform  | Motion + Blocks             | MPEG [3], [12]
2D Region Based   | Regions + Motion + Blocks   | Mussman's Work [21], Layered Coder [4], [30], Enhanced Resolution Coder
Scenic            | 3D Objects                  | "Holtzman Encoder" [13]
Semantic          | Dialogue, Expressions       | Screenplay

TABLE 1.1. Encoder Classes (after Mussman)

Another motivating advantage of such a structural analysis of the image is that the structure can be exploited at the receiver, without any additional recognition, to facilitate browsing, sorting sequences, and selecting items of interest to the viewer.
In essence, the pictures are transmitted as objects that are composited to form a frame on decoding, and each picture carries information about the scene background, the remaining foreground objects, and the relation between them. As compressed video becomes pervasive for entertainment, it is anticipated that the "browsability" of a sequence will become at least as important to the viewer as its compression efficiency, for there is little value in having huge libraries of content without a means to quickly search through them to find entries of interest.

With these ideas in mind, the target application of the televised lecture was chosen to demonstrate the enhanced resolution codec. The reasons behind this decision are threefold. First, the scene is best photographed with a long lens that minimizes camera distortion (this is a requirement for optimal performance). Secondly, enhancing the resolution of written or projected materials in the image will allow the viewer to see the material at a finer resolution, while still seeing a broader view of the lecture theatre. In fact the algorithm will allow the viewer to determine whether the entire scene or just the written/projected material is displayed. This feature is particularly useful as it allows the user to choose the picture content according to interest and decoder ability (small vs. large screen). Finally, apart from technical considerations, the aforementioned application has the potential to provide a useful service to students who, for whatever reason, have insufficient teaching facilities available, thus allowing them to participate in lectures from other schools and institutions.1

1. One such program of televised lectures has already been undertaken by Stanford University, which regularly broadcasts lectures via cable to surrounding institutions and companies who have employees enrolled in Stanford's classes. Livenet in London, England and Uninet in Sydney, Australia are other examples of this type of program.

The remainder of this thesis is organized as follows. Chapter Two briefly reviews the fundamentals of encoding still and moving images. Chapter Three examines three recent advances in image representation. An analysis of the theory behind resolution enhancement is presented in Chapter Four. Chapters Five and Six map the results of the previous chapters into a framework that defines the new encoder. Simulation results of the new algorithm, including a comparison with the MPEG1 algorithm, are presented in Chapter Seven. Finally, Chapter Eight presents an analysis of the performance of the new algorithm and makes suggestions for future work. Necessary and self contained technical derivations of various mathematical formulae and modifications to existing algorithms have been moved to the Appendices.

Chapter 2 Image Coding Fundamentals

Currently all displayed pictures, whether static or moving, are comprised of still (stationary) images. For example, photographs, television images and movies are sequences of one or more still images. This representation has not come about by accident but instead by design. As one Media Lab professor is fond of saying, "God did not invent scanlines and pixels... people did". Therefore, it is not surprising that the current coding techniques for moving images tend to be based on philosophies very similar to those used for encoding still images, mainly those of spatio-temporal redundancy and probabilistic statistical modeling.
This chapter examines these philosophies as they have been applied in the development of the JPEG1 and MPEG standards for still and moving image coding respectively.

1. Joint Photographic Experts Group (JPEG), of ISO/IEC JTC1/SC18/WG8

2.1 Representation

The choice of representation affects two fundamental characteristics of the coded image:

- Picture quality
- Redundancy present in the representation

These two characteristics are independent up to a point: generally one can remove redundant information from a representation and not suffer any loss in picture quality; this is lossless compression. However, once one starts to remove information beyond that which is redundant, the picture quality degrades. This type of compression, where non-redundant information is deliberately removed from the representation, is known as lossy compression. The amount and nature of the degradation caused by the use of lossy compression techniques is highly dependent on the representation chosen. Hence, great care should be exercised in choosing an appropriate representation when using lossy compression techniques.

2.2 JPEG Standard [1]

The Joint Photographic Experts Group (JPEG) standard for still image compression is fairly simple in its design. It is comprised of three main functional units: a transformer, a quantizer, and an entropy coder. The first and last of these functional units represent lossless processes,1 with the second block, quantization, being responsible for removing perceptually irrelevant information from the final representation. Obviously, if one desires high compression ratios then more information has to be discarded and the final representation becomes noticeably degraded when compared with the original.

1. With the exception of errors due to finite arithmetic precision.

2.2.1 Transform Coding

A salient feature of most images is that they exhibit a high spatial correlation between pixel values. This means that representations in the spatial domain are usually highly redundant and, therefore, require more bits than necessary when encoded. One method which attempts to decorrelate the data to facilitate better compression during quantization is Transform Coding.

Transform Coding works by transforming the image from the spatial domain into an equivalent representation in another domain (e.g. the frequency domain). The transform used must provide a unique and equivalent representation from which one can reconstruct all possible original images. Consider the general case where a single one dimensional vector x = [x_1, ..., x_k] is mapped to another u = [u_1, ..., u_k] by the transform T,

$$ u = Tx \tag{1.1} $$

The rows t_i of the transform T are orthonormal, that is they satisfy the constraint

$$ t_i t_j^T = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases} \qquad \text{i.e.} \quad TT^T = I, \;\; T^{-1} = T^T \tag{1.2} $$

The vectors t_1, ..., t_k are commonly called the basis functions of the transform T. The fact that they form an orthonormal set is important as it ensures that:

- they span the vector space of x in a unique manner, and
- the total energy in the transform u is equal to that of the original data x.

In other words, this property ensures that there is a unique mapping between a given transform u and the original data x.

Transforming from the spatial domain into another domain gains nothing in terms of the number of bits required to represent the information. In fact, the number of bits required to represent the transformed information can increase. The gain comes in that the transformed representation can be quantized more efficiently, because the transform coefficients have been decorrelated and have also undergone energy compaction; many of them can therefore be quantized very coarsely to zero.
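The consequences of Eq. (1.1) and Eq. (1.2) are easy to verify numerically. The short sketch below (an illustrative Python/NumPy fragment, not part of the coder described in this thesis) builds the orthonormal DCT-II basis used later in this chapter, checks that T T^T = I, and confirms that the transform is perfectly invertible and preserves energy.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Rows are the orthonormal DCT-II basis functions t_1 ... t_n."""
    t = np.zeros((n, n))
    for k in range(n):
        for i in range(n):
            t[k, i] = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
        t[k] *= np.sqrt(1.0 / n) if k == 0 else np.sqrt(2.0 / n)
    return t

n = 8
T = dct_matrix(n)
x = np.random.rand(n)              # arbitrary spatial-domain vector
u = T @ x                          # forward transform, u = Tx   (Eq. 1.1)

# Orthonormality (Eq. 1.2): T T^T = I, so the inverse is simply T^T.
assert np.allclose(T @ T.T, np.eye(n))
assert np.allclose(T.T @ u, x)                  # perfect reconstruction
assert np.isclose(np.sum(u ** 2), np.sum(x ** 2))  # energy is preserved
```

The same matrix, applied once along the rows and once along the columns, yields the separable two dimensional transform discussed next.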
1. The mathematics above deal with data represented by one dimensional vectors; pictures, however, are two dimensional entities. This is not a problem, as one can simply extend the concept of transformation into two dimensions by applying a pair of one dimensional transforms performed horizontally and vertically (or vice versa).

JPEG uses the Discrete Cosine Transform (DCT) to reduce the amount of redundancy present in the spatial domain. The image to be coded is subdivided into small blocks (8 by 8 pixels) which are then individually transformed into the frequency domain. Unlike the Discrete Fourier Transform (DFT), which generates both real and imaginary terms for each transformed coefficient, the DCT generates only real terms when supplied with real input. Thus, the transform of an 8 by 8 pel block yields an 8 by 8 coefficient block; there is no expansion in the number of terms. However, these terms require more bits (typically 12 bits to prevent round off error, as opposed to 8 in the spatial domain), so the representation is in fact slightly larger. The formulae for the N by N two dimensional DCT and Inverse DCT (IDCT) are

$$ \text{DCT:}\quad F(u,v) = \frac{2}{N} C(u) C(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y) \cos\!\left(\frac{(2x+1)u\pi}{2N}\right) \cos\!\left(\frac{(2y+1)v\pi}{2N}\right) $$

$$ \text{IDCT:}\quad f(x,y) = \frac{2}{N} \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} C(u) C(v) F(u,v) \cos\!\left(\frac{(2x+1)u\pi}{2N}\right) \cos\!\left(\frac{(2y+1)v\pi}{2N}\right) \tag{1.3} $$

with u, v, x, y = 0, 1, 2, ..., N-1, where x, y are coordinates in the pixel domain, u, v are coordinates in the transform domain, and

$$ C(u), C(v) = \begin{cases} \frac{1}{\sqrt{2}} & \text{for } u, v = 0 \\ 1 & \text{otherwise} \end{cases} $$

The means by which a gain in compression efficiency can be achieved (by subsequent quantization) lies in the distribution of the transformed coefficients. A corollary of the fact that real world images exhibit high spatial correlation is that their spectral content is skewed such that most of the energy is contained in the lower frequencies. Thus, the DCT representation will tend to be sparsely populated and have most of its energy in the coefficients corresponding to lower frequencies. This observation becomes intuitively obvious when one examines the basis functions for the DCT as shown in Figure 2.1.

[Figure 2.1: Basis functions for an 8 by 8 two dimensional DCT. Horizontal frequency increases from left to right, vertical frequency from top to bottom.]

The basis functions depicted towards the upper left corner correspond to lower frequencies, while those towards the lower right correspond to higher frequencies. The magnitude of the transform coefficient corresponding to each of the basis functions describes the 'weight' or presence of each basis function in the original image block.

2.2.2 Quantization of Transform Coefficients

Having transformed the image into the frequency domain, JPEG next performs quantization on the coefficients of each transformed block. The coefficients corresponding to the lower frequency samples play a larger role in reconstructing the image than the coefficients corresponding to higher frequencies. This is true not only because they are much more likely to be significant, as discussed in Section 2.2.1, but also because perceptually the human visual system is much more attuned to noticing the average overall change in image intensity between blocks than the localized high frequency content contained in the higher frequency coefficients.1

1. Having said this, it must be noted that if an image composed of inverse transformed blocks is displayed where the high frequency coefficients have been removed from all but a few of these blocks, then those few blocks will be highly noticeable. For this reason it is desirable to attempt to discard the same number of higher coefficients in each block.
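The following sketch (again an illustrative Python/NumPy fragment; the block contents and quantizer step sizes are hypothetical, not the tables used by JPEG) applies the separable 8 by 8 DCT of Eq. (1.3) to a smoothly varying block, showing the energy compaction described above, and then quantizes the coefficients with a step size that grows with frequency, finest at DC.

```python
import numpy as np

def dct2_matrix(n: int = 8) -> np.ndarray:
    # Orthonormal 1-D DCT-II basis; the 2-D DCT of Eq. (1.3) is separable,
    # so an n x n block f transforms as F = T f T^T.
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    T = np.cos(np.pi * k * (2 * i + 1) / (2 * n)) * np.sqrt(2.0 / n)
    T[0, :] /= np.sqrt(2.0)
    return T

T = dct2_matrix(8)

# A smoothly varying 8x8 block, typical of real-world imagery.
x, y = np.meshgrid(np.arange(8), np.arange(8))
f = 128 + 40 * np.cos(np.pi * x / 16) + 20 * np.cos(np.pi * y / 16)

F = T @ f @ T.T                      # forward 2-D DCT
low = np.abs(F[:2, :2]).sum()        # coefficient mass at the lowest frequencies
print(low / np.abs(F).sum())         # close to 1.0: energy compaction

# Frequency-dependent quantization: hypothetical step sizes, finest at (0, 0).
step = 1 + 2 * (x + y)
Fq = np.round(F / step) * step
f_rec = T.T @ Fq @ T                 # inverse DCT of the quantized block
print(np.abs(f - f_rec).max())       # reconstruction error stays small
```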
Thus, a given distortion introduced into a component that represents low spatial frequencies will be much more noticeable than the same distortion introduced into a component that represents high spatial frequencies. The DC (zero frequency) coefficient is particularly susceptible to distortion errors, as an error in this coefficient affects the overall brightness of the block. For this reason the DC coefficient is quantized with a much finer quantizer than that used for the remaining AC (non zero frequency) coefficients. Likewise, lower frequency AC components are quantized less aggressively than higher frequency ones.

2.2.3 Entropy Coding

The final stage of the JPEG compression algorithm consists of an entropy coder. Entropy is generally defined to be a measure of (dis)order, or of the amount of redundancy present in a collection of data. A set of data is said to have high entropy if its members are independent or uncorrelated. Conversely, a set that displays high dependency or correlation between its members is said to have low entropy. Given a set of symbols and their relative probabilities of occurrence within a message, one can make use of this concept in order to remove redundancy from their encoded representation.

First, consider a set of data which is completely random, in other words one where every symbol is as likely as another to be present in a coded bitstream. Suppose that a restriction is imposed which states that each symbol's encoded representation is the same length. As each symbol is the same length in bits and just as likely to occur as another, they all cost the same to transmit, and the representation is efficient. Now consider the case where the symbol probabilities are not equal. It is easy to see that even though certain symbols hardly ever occur they will still require just as many bits as the more frequent ones; the representation is therefore inefficient.

Now suppose that the restriction that all symbols be represented by codes of the same length is relaxed, thereby allowing codes of different lengths to be assigned to different symbols. If the shorter codes are assigned to the more frequently used symbols and the longer codes to the less frequent, then it should be possible to reduce the number of bits required to encode a message consisting of these symbols. However, several problems arise out of the fact that the codes are of variable length:

- the position of a code in the encoded bit stream may no longer be known in advance,
- a long code may be missed if it is mistaken for a short code.

As a result, great care must be taken to ensure that no ambiguities arise when assigning the codes. If this requirement is met then the parsing of the bitstream becomes as simple as knowing where the previous code finished, as this automatically provides the starting position for its successor. JPEG uses the results from a Huffman coding scheme based on this concept.
It is useful to introduce two equations ([15], [19], [23]). First, the average code word length L for the discrete random variable x describing a finite set of symbols is defined as

$$ L = \sum_{x} L(x)\, p_x(x) \quad \text{bits/symbol} \tag{1.4} $$

and secondly, the entropy of x, H(x), defined by Eq. (1.5), is the lower bound on L:

$$ H(x) = -\sum_{x} p_x(x) \log_2 p_x(x) \quad \text{bits/symbol} \tag{1.5} $$

In short, the aim of entropy coding is to make L as close as possible to H(x). If L = H(x) then all the redundancy present in x will have been removed, and hence any attempt to achieve even greater compression will result in loss of information. This is a theoretical limit and is seldom reached in practice. It should be noted, however, that higher compression ratios may be achieved without loss by choosing a different representation for the same information. The theoretical limit has not been violated in these cases; what happens is that the pdf of x changes, resulting in a new H(x) which falls below the H(x) of the original representation. Thus it makes sense to choose the most efficient representation before quantizing and subsequent entropy coding.

Huffman coding makes two big assumptions: first, that the probability distribution of the set of symbols to be encoded is known, and secondly, that the probabilities do not change. Given that these assumptions hold, the codebook containing the code for each symbol can be precomputed and either stored permanently in both the encoder and decoder or transmitted once by the encoder to the decoder at the beginning of a transmission. The algorithm used to design these codebooks is very simple and constructs a binary tree that has the symbols for leaves and whose forks describe the code. The probability of each symbol is generated by performing preliminary encoding runs and recording the frequency of each symbol in the final representation. Provided sufficiently representative data is used, the actual probability of occurrence of each symbol may be approximated by its relative frequency in the preliminary representation. For a set of n symbols the algorithm constructs the tree in n - 1 steps, where each step creates a new fork that combines the two nodes of lowest probability. The new parent node created by the fork is then assigned a probability equal to the sum of the probabilities of its two children. A simple example is shown in Figure 2.2.

[Figure 2.2: Huffman coding example.]

    Symbol   Pr(Symbol)   Code
    a        0.40         0
    b        0.25         10
    c        0.20         110
    d        0.15         111

    (Tree construction: 0.15 + 0.20 = 0.35;  0.35 + 0.25 = 0.60;  0.60 + 0.40 = 1.00)

    L = 1 x 0.40 + 2 x 0.25 + 3 x 0.20 + 3 x 0.15 = 1.95 bits/symbol
    H = -(0.40 log2 0.40 + 0.25 log2 0.25 + 0.20 log2 0.20 + 0.15 log2 0.15) = 1.904 bits/symbol

Several codebooks generated by a method similar to that described above are used within JPEG for various types of data in a variety of different situations.
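The codebook construction just described is small enough to sketch directly. The fragment below (illustrative Python; the exact 0/1 labelling of the branches, and hence which of c and d receives 110 or 111, is arbitrary) builds a Huffman codebook for the symbol probabilities of Figure 2.2 and reproduces the code lengths 1, 2, 3, 3 and the average length L = 1.95 bits/symbol computed above.

```python
import heapq

def huffman_codebook(probs: dict[str, float]) -> dict[str, str]:
    """Prefix-free codebook built by repeatedly merging the two least
    probable nodes, i.e. the n - 1 merge steps described above."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)          # least probable node
        p1, _, c1 = heapq.heappop(heap)          # second least probable node
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a": 0.40, "b": 0.25, "c": 0.20, "d": 0.15}   # Figure 2.2
codes = huffman_codebook(probs)
L = sum(len(codes[s]) * p for s, p in probs.items())
print(codes, L)       # code lengths 1, 2, 3, 3  ->  L = 1.95 bits/symbol
```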
2.3 Frame Rate

When one moves from still images to moving images comprised of a series of still images, one of the main things that must be taken into account is the rate at which pictures are presented to the viewer. If too few pictures are presented, the viewer will notice that the moving image is not continuous but is in fact comprised of many separate still images, and any motion within the picture will appear jerky. However, if the frame rate is increased to a sufficiently high rate then the still images will "fuse", causing motion within the picture to appear smooth and continuous. The frame rate at which fusion occurs is known as the critical flicker frequency (cff) and is dependent upon the picture size, ambient lighting conditions, and the amount of motion within the frame [17], [23]. For example, when a T.V. screen is viewed in a poorly lit room the cff decreases; conversely, the cff increases when the same T.V. screen is viewed in well lit conditions. The cff also increases with picture size, which explains why motion pictures are shot at 24 frames per second and then projected at 72 frames per second, with each frame being shown three times.

Standard television uses a special method to increase the temporal resolution without increasing the bandwidth of the transmitted signal or decreasing the spatial resolution by an inordinate amount. This trick is known as interlace and involves transmitting the picture as two separate fields: the even lines of the picture are sent in the first field, followed by the odd lines in the second field. Transmitting a picture in this manner means that the perceivable spatial resolution of the displayed image is larger than that of one field alone, but smaller than if both fields had been transmitted simultaneously (progressive scanning). The gain comes in that the temporal resolution is increased beyond that of a system where the fields are combined into a single picture. Thus a small sacrifice in spatial resolution is made to allow for better temporal resolution without increasing the number of scanlines per unit time.

[Figure 2.3: Progressive and interlace scanning. With interlaced scanning each frame is transmitted as field 0 (even scan lines) followed by field 1 (odd scan lines); with progressive scanning every frame carries all of its scan lines.]

While interlace is appropriate and useful for analog televisions that only display pictorial information, it is particularly bad for computer generated text or graphics. When a normal television displays these sorts of artificial images the human eye fails to merge the spatial information in successive fields smoothly, and the 30 (NTSC) / 25 (PAL) Hz flicker becomes painfully obvious. For this reason, there has been a concerted push towards higher frame rates both for large screen computer monitors and for emerging HDTV system proposals. These HDTV systems will need higher frame rates to support larger pictures. In addition, higher frame rates will also be needed to support the anticipated computer driven applications made possible by the digital technology used to deliver and decompress the images they will display. Some common frame rates used within the movie, television and computer industries are 24, 25, 29.97, 30, 59.94, 60 and 72 frames per second.
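The field decomposition used by interlace is trivial to express. The sketch below (illustrative Python/NumPy, with an arbitrary 480 by 704 frame size) simply separates the even and odd scan lines into the two fields described above.

```python
import numpy as np

def split_fields(frame: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Field 0 carries the even scan lines, field 1 the odd scan lines."""
    return frame[0::2, :], frame[1::2, :]

frame = np.arange(480 * 704).reshape(480, 704)   # hypothetical 480-line frame
f0, f1 = split_fields(frame)
assert f0.shape[0] == f1.shape[0] == 240         # each field has half the lines
# Transmitting f0 and f1 in successive field periods doubles the temporal
# rate at the cost of halving the vertical resolution of any single field.
```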
2.4 MPEG Standards [2], [3]

The Moving Picture Experts Group (MPEG) standard MPEG1 [2] and draft standard MPEG2 [3] are based on the same ideas that were used in the creation of the JPEG algorithm for still images. In fact, MPEG coders contain the same transform, quantization and entropy coding functional blocks as the JPEG coder. The difference between the MPEG and JPEG coders lies in the fact that the MPEG coders exploit interframe redundancy in addition to spatial redundancy, through the use of motion compensation. The three standards can be differentiated as follows:

- JPEG compresses single images independently of one another.
- MPEG1 is optimized to encode pictures at CD-ROM rates of around 1.1 Mbit/s, at a resolution roughly a quarter of that of broadcast TV, by predicting the current picture from previously received ones.
- MPEG2 inherits the features present in MPEG1 but also includes additional features to support interlace, and was optimized for bit rates of 4 and 9 Mbit/s.

2.4.1 Motion Estimation and Compensation

Many methods for estimating the motion between the contents of two pictures have been developed over the years. As motion estimation is only performed in the encoder, and hence the method chosen does not impact on interoperability, MPEG does not specify any particular algorithm for generating the motion estimates. However, in an effort to limit decoder complexity and increase representation efficiency, the decision was made that motion vectors should be taken to represent constant (i.e. purely translational) motion for small tiled regions of the screen. The boundaries of these tiles coincide with those of a grouping of a small number of blocks used by the DCT stage, thus allowing the decoder to decode each tile in turn with a minimal amount of memory. Consequently, the most common method used to generate motion information for MPEG implementations is the block based motion estimator.

Block based motion estimators are characterized by the fact that they generate a single motion vector to represent the motion of a small (usually square) group of pels. The motion vector for each block is calculated by minimizing a simple error term describing the goodness of fit of the prediction, by varying the translation between the block's position in the current frame and its position in a predictor. The translation corresponding to the minimum error is chosen as the representative motion vector for the block. An example of a prediction made by a typical block based motion estimator is shown in Figure 2.4.

[Figure 2.4: Block based motion compensation example -- a block in the current frame is predicted from a displaced block in the predictor frame.]

A common example of an error term used by block based motion estimators is the Sum of Absolute Differences (SAD) between the estimate and the block being encoded. Generally, only the luminance component is used, as it contains higher spatial frequencies than the chrominance components and can, therefore, provide a more accurate prediction of the motion. The SAD minimization formula for a square block of side length blocksize is given by

$$ SAD(\delta_x, \delta_y) = \sum_{y=0}^{blocksize-1} \sum_{x=0}^{blocksize-1} \left| lum_{cur}(x, y) - lum_{pred}(x + \delta_x,\, y + \delta_y) \right| \tag{2.1} $$

where $(\hat{\delta}_x, \hat{\delta}_y)$, the motion vector chosen for the block, is the displacement which minimizes $SAD(\delta_x, \delta_y)$.

Several advantages and disadvantages result from using the SAD error measure for motion estimation. The main advantage is that the resulting vectors do in fact minimize the entropy of the transform encoded error signal, given the constraint that the motion must be constant within each block. However, the hill climbing algorithm implicit in the minimization process means that the motion vectors generated will not necessarily represent the true translational motion for the block, despite the fact that the entropy of the error signal is minimized. This is particularly true for blocks that contain a flat luminance surface or more than one moving object.
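As a concrete illustration of Eq. (2.1), the following sketch implements a full-search block based motion estimator (illustrative Python/NumPy; the 16 pel block size and the +/-7 pel search range are typical choices, not values mandated by MPEG, which, as noted above, does not specify the estimation method).

```python
import numpy as np

def sad(block: np.ndarray, cand: np.ndarray) -> int:
    # Sum of Absolute Differences of Eq. (2.1), luminance only.
    return int(np.abs(block.astype(int) - cand.astype(int)).sum())

def block_motion(cur: np.ndarray, ref: np.ndarray, bx: int, by: int,
                 bsize: int = 16, search: int = 7) -> tuple[int, int]:
    """Full search: return the (dx, dy) minimising the SAD between the
    current block and a displaced block in the reference frame."""
    block = cur[by:by + bsize, bx:bx + bsize]
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if 0 <= y and 0 <= x and y + bsize <= ref.shape[0] and x + bsize <= ref.shape[1]:
                err = sad(block, ref[y:y + bsize, x:x + bsize])
                if best is None or err < best:
                    best, best_mv = err, (dx, dy)
    return best_mv
```

Note that the minimum-SAD vector is simply the displacement with the cheapest residual; nothing in the search constrains it to follow the physical motion, which is the root of the failings listed next.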
Given that one is attempting to find true motion, the following list summarizes the main failings of block based motion estimation:

- The motion estimate will most likely be quite different from the true motion where the block's contents are fairly constant and undergo slight variations in illumination from frame to frame. The cause of these errors lies in the fact that small amounts of noise will predominate in the SAD equation, causing it to correlate on the noise instead of the actual data.
- Blockwise motion estimation assumes purely translational motion in the plane of the image. It does not model zooms or rotations accurately, as these involve a gradually changing motion field with differing motion for each pel within a block.
- The motion estimate is deemed constant for the entire macroblock, causing an inaccurate representation when there is more than one motion present in the block or when complex illumination changes occur.

Therefore block based motion estimation, while easy to implement, does not necessarily result in either a particularly accurate representation of the true motion, nor a low MSE where non translational motion is present.

2.4.2 Residual Error Signal

Generally, there is some error associated with the prediction resulting from motion estimation and compensation techniques. Therefore, there must be a way to transmit additional information to correct the predicted picture back to the original. This error can be coded using the same transform and entropy encoding methods described previously in Sections 2.2.1 to 2.2.3. It should also be noted that there will be places in the picture for which the prediction will be particularly bad, in which case it may be less expensive to transmit that portion without making the prediction, i.e. intra coding.

Figure 2.5 shows the main functional blocks within a standard MPEG2 encoder: motion estimator, several frame stores, 8x8 DCT, quantizer (Q), entropy or VLC encoder, inverse quantizer (Q^-1), 8x8 IDCT, rate control, and output buffer.

[Figure 2.5: MPEG2 encoder block diagram. Source frames are re-ordered and predicted from the frame stores by the motion estimator; the prediction residual passes through the DCT, the quantizer (whose step size is set by the rate control and output buffer), and the VLC encoder into the multiplexed MPEG2 stream, while the inverse quantizer and IDCT rebuild the decoded frames used for further prediction. DCT - 8x8 Discrete Cosine Transform; IDCT - 8x8 Inverse Discrete Cosine Transform; Q - Quantization; Q^-1 - Inverse Quantization; VLC - Variable Length Coding.]

The corresponding MPEG2 decoder is quite similar. It consists of a subset of the encoder, with the addition of a VLC decoder. It should be noted that the motion estimator, one of the most expensive functional blocks in the encoder, is not required by the decoder. Therefore, the computational power required to implement an MPEG decoder is substantially less than that required for an MPEG encoder. For this reason MPEG is characterized as an asymmetric algorithm, where the requirements of the encoder are substantially different from those of the decoder. A block diagram of an MPEG decoder is shown in Figure 2.6.

[Figure 2.6: MPEG2 decoder block diagram.]
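The prediction loop of Figures 2.5 and 2.6 can be summarized by the following sketch (illustrative Python/NumPy). For brevity the residual is quantized directly in the spatial domain; in a real MPEG coder it would first pass through the 8x8 DCT and the VLC encoder. The essential point remains: the encoder decodes its own output, so that its frame store holds exactly what the decoder's frame store will hold.

```python
import numpy as np

def encode_frame(cur, ref, mvs, quant_step=8, bsize=16):
    """One pass of the prediction loop: predict each block from the
    reference using its motion vector, quantize the residual, and rebuild
    the decoded frame kept in the frame store for later predictions."""
    recon = np.zeros_like(cur, dtype=float)
    residuals = {}
    for (bx, by), (dx, dy) in mvs.items():
        pred = ref[by + dy:by + dy + bsize, bx + dx:bx + dx + bsize]
        resid = cur[by:by + bsize, bx:bx + bsize].astype(float) - pred
        resid_q = np.round(resid / quant_step) * quant_step
        residuals[(bx, by)] = resid_q           # would be DCT/VLC coded
        recon[by:by + bsize, bx:bx + bsize] = pred + resid_q   # decoder's view
    return residuals, recon
```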
Chapter 3 Recent Advances in Image Representation

As computing power has increased, the notion of what a digital image representation should encompass has grown to be something more than just a compressed version of the original. This chapter describes three recent approaches to image representation undertaken at the Media Laboratory that have helped to inspire the work in this thesis.

3.1 Salient Stills

The Salient Still developed by Teodosio [28] is a good example of what can be done given sufficient memory and processing time. The basic premise behind the Salient Still is that one can construct a single image comprised of information from a sequence of images, and that this single image contains the salient features of the entire sequence. The Salient Still then has the ability to present not only contextual information in a broad sense (e.g. the audience at a concert) but also extra resolution at specific regions of interest or salient features (e.g. close up information of the performer at the concert). A block diagram showing how a Salient Still is created is shown in Figure 3.1.

[Figure 3.1: Salient Still block diagram [28].]

The main algorithmic ideas to be gleaned from the Salient Still algorithm are:

- A single image is created by warping many frames into a common space.
- The value of a pel at a particular point is the result of a temporally weighted median of all the pels warped to that position.
- The motion model used to generate the warp is a 6 parameter affine model derived by a least squares fit to the optical flow [6]:

$$ d_x = a_x + b_x x + c_x y, \qquad d_y = a_y + b_y x + c_y y \tag{3.1} $$

Some of the problems associated with Salient Still generation are:

- Optical flow is particularly noisy near motion boundaries.
- Inaccurate optical flow fields result in an inaccurate affine model.
- The affine model is incapable of accurately representing motion other than that which is parallel to the image plane.

3.2 Structured Video Coding

The "Lucy Encoder" described in Patrick McLean's Master's thesis "Structured Video Coding" [18] explores what is possible if one already has an accurate representation of the background. Footage from the 1950's sitcom "I Love Lucy" provided an easy means to explore this idea for two reasons: first, the sets remained the same from episode to episode, and secondly, the camera motions used at the time were limited for the most part to pans. Thus, construction of the background was a fairly easy process. All one had to do was composite several frames together using pieces from each frame where the actors were not present. In addition to removing the actors, it was also possible to generate backgrounds that were wider than the transmitted image, thus enabling the viewer to see the entire set as opposed to the portion currently within the camera's field of view.

Having obtained an accurate background representation, the next step was to extract the actors. This proved to be no easy task. However, with some careful pruning and region growing the actors were successfully extracted (along with a small amount of background that surrounded them). Replaying the encoded footage then became a simple matter of transmitting an offset into the background predictor and pasting the actors into the window defined by this offset. McLean showed that if one transmitted the backgrounds prior to the remaining foreground then it was possible to encode "I Love Lucy" with reasonable quality at rates as low as 64 kbit/s using Motion JPEG (M-JPEG1).

Several problems existed in the representation. Zooms could not be modeled, as only the motion due to panning was estimated. In addition, the segmentation sometimes did not work as well as it should have: either not all of an actor was extracted, or it was possible to see where the actors were pasted back in by virtue of a mismatch in the background. However, these shortcomings are somewhat irrelevant, as McLean demonstrated the more important idea that the algorithm should generate a representation that was not based solely on statistical models but one based on content.
The representation now contained identifiable regions that could be attributed to a distinct physical meaning, such as a "Lucy" region or a "Ricky" region.

1. Motion JPEG is a simple extension of JPEG for motion video in which each frame consists of a single JPEG compressed image.

3.3 Layered Coder

More recently, work in the area of layered coding has been done by Wang, Adelson and Desai [4], [30]. Along the same lines as the Salient Still and the Structured Video Coder, the Layered Coder addresses the idea of modeling not just the background, by creating a small number of regions whose motion is defined by a 6 parameter affine model. These regions are transmitted once at the beginning of the sequence, with subsequent frames generated by compositing the regions together after they have been warped using the 6 parameter models. The regions are constructed over the entire sequence by fitting planes to the optical flow field calculated between successive pairs of frames. Similar regions from each segmentation are then merged to form a single region which is representative of its content for the entire sequence.

As might be surmised, this technique works best for scenes in which rigid body, non 3D motion takes place. Another problem lies in the fact that the final regions have an error associated with their boundaries due to the inaccuracies of the segmentation. Thus, when the sequence is composited back together there tends to be an inaccurate reconstruction near the boundaries. Finally, as only a single image is transmitted to represent each region, it turns out that closure (i.e. a value for every pel) cannot be guaranteed during the compositing stage after each region has been warped. For all these reasons the layered coder does not generate pixel accurate replicas of the frames in the original sequence for all but the simplest of scenes. More importantly, however, even though a high Mean Square Error (MSE) may exist, the reconstructed images look similar to the original. The error is now not so much an error due to noise or quantization but a geometric error associated with an inaccurate motion model.

3.4 General Observations

The three representations described in this chapter share several things in common:

- Many frames are used to construct a single image representative of a region in the scene.
- This image is constructed using optical flow [6] to estimate interframe motion.
- One or more 6 parameter affine models are used to represent the interframe motion.

The techniques described above work well if one is attempting to model a simple scene, and can generate reasonable approximations to actual regions in a frame. However, if pixel accurate replication is required then, in general, the resulting images will not be accurate, due to geometric distortions introduced by the affine model's inability to represent 3D motion. Therefore, given that an efficient representation is paramount, if one requires a geometrically accurate reconstruction or is modelling a sequence containing significant amounts of 3D motion, then one should use a more accurate motion model in addition to including an error signal with the affine model to correct these distortions.
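For reference, the sketch below evaluates the 6 parameter affine model of Eq. (3.1) over a frame-sized grid (illustrative Python/NumPy; the parameter values are hypothetical and merely indicate how pans, zooms and small rotations appear in the six coefficients).

```python
import numpy as np

def affine_displacement(params, x, y):
    """Per-pel displacement under the affine model of Eq. (3.1):
    d_x = a_x + b_x*x + c_x*y,  d_y = a_y + b_y*x + c_y*y."""
    ax, bx, cx, ay, by, cy = params
    return ax + bx * x + cx * y, ay + by * x + cy * y

# A pure pan is (ax, 0, 0, ay, 0, 0); a zoom about the origin by a factor s
# is approximately (0, s - 1, 0, 0, 0, s - 1); a small rotation appears in
# the cross terms cx and by.  Hypothetical parameters for a slow pan-and-zoom:
params = (1.5, 0.01, 0.0, -0.5, 0.0, 0.01)
x, y = np.meshgrid(np.arange(720), np.arange(480))
dx, dy = affine_displacement(params, x, y)      # dense displacement field
```

Because every pel's displacement is a linear function of its position, the model captures global camera motion cheaply, but it cannot describe the depth-dependent (3D) motion discussed above.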
One final problem these methods face is that they rely on the accuracy of the motion estimates generated by the optical flow algorithm. Optical flow attempts to determine the velocity of individual pels in the image based on the assumption that the brightness constraint equation below holds:

$$ \frac{d}{dt}\, lum(x, y) = 0 \tag{3.2} $$

This equation can be solved using a number of methods; the two most popular are correspondence methods (which are essentially the same as block matching) and gradient methods, which attempt to solve the first order Taylor Series expansion of (3.2),

$$ v_x \frac{\partial}{\partial x} lum(x, y) + v_y \frac{\partial}{\partial y} lum(x, y) + \frac{\partial}{\partial t} lum(x, y) = 0 \tag{3.3} $$

over a localized region R [6]. In order for gradient based optical flow to work, the following assumptions are made:

- The overall illumination of the scene is constant,
- The luminance surface is smooth,
- The amount of motion present is small.

When these assumptions are invalid, optical flow fails to give the correct velocity estimates. This is particularly true for:

- Large motions,
- Areas near motion boundaries,
- Changes in lighting conditions.

Thus, for particularly complicated scenes involving lots of motion or lighting changes, the output from the optical flow algorithm will be inaccurate. Hence any segmentation based upon the output of optical flow under these circumstances will also be inaccurate.

Chapter 4 Enhancing the Background Resolution

A theme common to the three representations described in the previous chapter was that they all used more than one frame to construct a single representative image. The ability to combine frames of video in this way can result in two benefits. First, one can remove foreground objects from the final image. Secondly, under certain conditions it may be possible to enhance the apparent resolution of the final image beyond that of the original images. This chapter concentrates on the second benefit, specifically methods that increase the sample density and resolution above that of the transmitted picture through the combination of several pictures sampled at different resolutions and times. The first benefit, that of removing foreground objects from the final image, is the subject of the next chapter.

4.1 One Dimensional Sub Nyquist Sampling

Before examining resolution for images, it is useful to examine a simpler one dimensional problem that can later be extended into two dimensions for images. Consider the one dimensional sequence x[n] shown in Figure 4.1.

[Figure 4.1: An arbitrary one dimensional sequence, x[n].]

This sequence x[n] can be broken up into two sub Nyquist sampled sequences: the first, x_e[n], consisting of the even samples, and the second, x_o[n], of the remaining odd samples.

[Figure 4.2: Sub Nyquist sampled sequences formed from the even and odd samples of Figure 4.1.]

The operations required to remove the even and odd samples from x[n] to generate these sequences are described by the following formulae:

$$ x_e[n] = \tfrac{1}{2}\left( x[n] + (-1)^n x[n] \right), \qquad x_o[n] = \tfrac{1}{2}\left( x[n] - (-1)^n x[n] \right) \tag{4.1} $$

In the frequency domain these formulae correspond to the introduction of aliased versions of the original sequence's spectrum:

$$ X_e(\omega) = \tfrac{1}{2}\left( X(\omega) + X(\omega - \pi) \right), \qquad X_o(\omega) = \tfrac{1}{2}\left( X(\omega) - X(\omega - \pi) \right) \tag{4.2} $$

Furthermore, it is easy to see that when these two sequences are added together the aliased components cancel each other and the original sequence is obtained.

[Figure 4.3: Spectral analysis of down sampling. The spectra of x_e[n] and x_o[n] each contain an aliased copy of X(omega) shifted by pi; the aliased components have opposite signs and cancel when the two sequences are summed.]
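The decomposition of Eq. (4.1) and the cancellation of the aliased components can be checked directly, as in the short sketch below (illustrative Python/NumPy; the test sequence is arbitrary).

```python
import numpy as np

n = np.arange(16)
x = np.sin(0.7 * n) + 0.3 * np.cos(2.1 * n)     # arbitrary sequence x[n]

sign = (-1.0) ** n
x_e = 0.5 * (x + sign * x)      # Eq. (4.1): even samples kept, odd zeroed
x_o = 0.5 * (x - sign * x)      # odd samples kept, even zeroed

assert np.allclose(x_e[1::2], 0) and np.allclose(x_o[0::2], 0)
assert np.allclose(x_e + x_o, x)        # aliased halves cancel on recombination
assert np.allclose(np.fft.fft(x_e) + np.fft.fft(x_o), np.fft.fft(x))  # Eq. (4.2)
```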
Therefore, if one starts with two sequences, the second offset by half a sample period relative to the first, a sequence of twice the sampling density can be created by zero padding each sequence and adding the two together. If this concept is extended into two dimensions then it follows that four aliased input sequences are required: odd rows/odd columns, odd rows/even columns, even rows/odd columns, and even rows/even columns. Thus one could construct a four camera rig with each camera offset by half a pel from its neighbors, as shown in Figure 4.4.

Figure 4.4  Four camera rig for doubling spatial resolution (cameras 1 to 4, each offset by half a pel from its neighbors)

Generally, such a system will not result in an exact doubling of the spatial bandwidth, as it relies on the assumption that the frequency response of the camera is flat and contains aliasing of only the first harmonic, as shown below.

Figure 4.5  Desired camera response f(x, y) for doubling resolution: a flat low pass response with no aliasing beyond the first harmonic

Real cameras are incapable of generating signals with the response shown in Figure 4.5 for three reasons:
- First, camera optics are not perfect and therefore introduce a certain amount of distortion into the image.
- Secondly, the sensors that form the camera chip are not infinitely small; they must have a finite area in order to reduce noise and also to minimize the magnitude of higher order aliased sidelobes.
- Third and finally, a real sensor is incapable of generating a negative response, so it is impossible to implement f(x, y) exactly; the response must be that of an all-positive filter.

Having deduced that it is impossible to obtain the desired response from practical sensors, one should determine how close the response of a practical sensor is to this ideal. Consider the nine element CCD array shown in Figure 4.6.

Figure 4.6  Camera CCD array (modified from Fig 2.22 in [5]): photosites of size a by b over p+ islands on an n-type substrate, read out by clocks A and B, with lenselets above

The sensors in this array have dimension a by b and are distributed on a rectangular grid with cells of size A by B. A common measure used to describe the geometry of the sensor array is the fill factor:

fill factor = ab / AB    (4.3)

If the effect of the lenselets and other camera optical elements is temporarily ignored, the frequency response due to the finite sensor area can be described by a separable function consisting of two sinc functions with nodal frequencies as shown in Figure 4.7. Notice that the response is low pass and that its width depends on the sensor size: smaller sensors result in a wider response than larger ones, which filter out more of the higher frequencies.

Figure 4.7  Low pass frequency response due to finite sensor element size

Reading the data from the CCD has the effect of multiplying the double-sinc frequency response shown in Figure 4.7 with the two dimensional pulse train shown in Figure 4.8, whose impulses are spaced 2π/A and 2π/B apart.

Figure 4.8  Two dimensional pulse train

When one combines the effects of all these operations on the projected image, one obtains the Modulation Transfer Function (MTF), which describes the behavior of the recording system.
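As a small numerical illustration of the sensor geometry just described, the sketch below computes the fill factor of Eqn (4.3) and evaluates the separable sinc response due to the finite sensor element. The photosite and cell dimensions are hypothetical values chosen only for the example.

import numpy as np

def fill_factor(a, b, A, B):
    """Fraction of each A-by-B cell covered by the a-by-b photosite (Eqn 4.3)."""
    return (a * b) / (A * B)

def aperture_response(fx, fy, a, b):
    """Separable |sinc| response of a finite a-by-b sensor element; fx and fy are
    spatial frequencies in cycles per unit length, with first nulls at 1/a and 1/b."""
    return np.abs(np.sinc(a * fx) * np.sinc(b * fy))

a = b = 8e-6      # hypothetical photosite size (metres)
A = B = 10e-6     # hypothetical cell pitch (metres)
print(fill_factor(a, b, A, B))                        # 0.64 for this geometry
print(aperture_response(1.0 / (2 * A), 0.0, a, b))    # attenuation at half the sampling rate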
Following on from the previous examples, the MTF for the CCD array of Figure 4.6 is shown below (only the x axis projection is shown for clarity).

Figure 4.9  Modulation Transfer Function for a typical CCD array: the finite-aperture low pass response replicated at multiples of the sampling frequency 2π/A

At this stage one should note that there is significant aliasing in the sensor array's MTF. In fact, one might wonder whether it would be possible to obtain a recognizable image from the camera at all. However, one should remember that the MTF shown in Figure 4.9 was derived for the camera sensor array only; no consideration was given to the lenselets and camera lenses in its derivation. When one includes these optical elements in the system description, the end result is that the image projected onto the CCD array is filtered by a much narrower filter than that implied by the CCD array's geometry. Generally, most cameras are deliberately designed to be slightly unfocused in order to perform this type of pre-filtering. For more details on the effect of focus on MTF responses see [7]. Combining the optics, sensor array and sampling into a single entity results in the system shown in Figure 4.10 (again, only the x axis projection is shown in the interests of clarity).

Figure 4.10  Modulation Transfer Function for a typical camera: real world -> optics H1(ω) -> sensor array H2(ω) -> sampling S(ω) -> video signal, with the overall MTF the product of the individual responses

When one compares the desired camera response shown in Figure 4.5 with the actual response shown in Figure 4.10, it becomes apparent that the degree to which the resolution can be enhanced through the use of this technique is somewhat limited. The critical part of the system is the optics. If the camera is perfectly focused then the optical response is flat and the resulting image will be too highly aliased to extract the lower harmonic spatial frequencies. Conversely, an out of focus camera will effectively low pass filter the image so that the higher spatial frequencies are virtually eliminated.

4.3 Enhancing Resolution Using One Camera

Good quality cameras are expensive, so it makes sense to find a way to achieve the same results using only a single camera. Two methods that use only one camera for resolution enhancement are described in this section.

4.3.1 Stepped Sensor Mount

The simplest method capable of achieving the same results as the four camera rig involves a movable camera sensor that steps to another sensor position after capturing each frame. This process is repeated so that a frame is captured at each of the four positions in turn, generating a cyclic sequence with a periodicity of four frames. This method sacrifices temporal resolution for spatial resolution, which is not unreasonable considering the aim is to increase the resolution of the background: a near stationary region for which temporal resolution is not critical. However, the method is not perfect, as it suffers from two major flaws. First, the input signal is assumed to be aliased as per the description in the previous section. If this condition is not met then no resolution enhancement will be possible, as the spectra of the four images will be identical apart from the phase shifts caused by the different sample positions. The only gain possible will be in the form of noise reduction from the multiple samples present.
Secondly, the mechanics or electronics involved in moving the camera sensor in the precise manner required would increase the cost significantly, to the point where it may be cheaper just to use a camera already capable of the desired resolution. In fact, it is highly likely that it would be impossible to achieve the exact repetitive motion desired. Thus, the only way to ensure the half pel offsets is to use the higher resolution camera to begin with, which once again defeats the purpose of this thesis.

4.3.2 Generalized Camera Motion

The obvious solution to this problem is to allow the camera to pan and zoom freely and then to motion compensate successive frames to fit one of the four camera positions. Apart from realizing the one camera goal, this method affords another benefit in that the resolution can be increased in a less subtle but more robust manner through the use of additional memory. Consider the case in which the camera motion is purely translational and parallel to the background plane. In this case each image only requires a translation to align it with the appropriate sampling points, and hence one has to rely on the odd/even sub-Nyquist sampling method described in the previous section in order to increase resolution. However, if zooms are allowed then the frequency response of the camera will be scaled during the motion compensation process. This is best explained by considering the scene below.

Figure 4.11  Two shots of the same scene: a) wide angle, b) close up (2x)

For a given region there are four times as many pels (and hence twice the spectral resolution both horizontally and vertically) in the close up shot when compared to the wide angle shot. Therefore, when one warps the wide angle image into the same scale as the close up and then examines their relative spectral content, one finds that the portion of the image captured in the close up can be reconstructed with a higher resolution than the remainder of the image, which must be reconstructed from the wide angle shot alone.

Figure 4.12  Spectral content of the two pictures when combined: the close up contributes twice the spectral extent of the warped wide angle shot

The process of warping one image into the same scale as another, and thereby shrinking or expanding the range of its spectrum, forms the basis for the background resolution enhancement portion of the coder. Of course, the degree by which the resolution is enhanced is highly dependent on the accuracy of the warp and on the means used to incorporate the information provided by each new frame into the representation. If the warp is not particularly accurate then the resolution may actually be decreased below that of one of the original frames! Therefore some consideration should be given to how to generate the most accurate warp possible and also to how to update the prediction buffer to allow for the fact that the warp will not necessarily be entirely accurate.

4.4 Enhanced Resolution Image Construction

The enhanced resolution coder constructs enhanced resolution images of the background from several previously transmitted images. Each of these images has a single six parameter affine warp model associated with it which describes the interframe motion between the model image and the current one. The affine model,

d_x = a_x + b_x x + c_x y
d_y = a_y + b_y x + c_y y    (4.4)

has six parameters, and accordingly the types of motion it can represent accurately are limited. (A short sketch of applying such a displacement model appears below.)
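As a concrete illustration of Eqn (4.4), the sketch below evaluates the displacement model over an image and backward-warps it. For simplicity it samples the source bilinearly rather than with the bicubic warp the coder actually uses (Eqn (4.5) in the next section), and the half-pel-shift model at the end is invented for the example.

import numpy as np

def displacement(params, x, y):
    """Evaluate the six-parameter displacement model of Eqn (4.4) at (x, y)."""
    ax, ay, bx, by, cx, cy = params
    return ax + bx * x + cx * y, ay + by * x + cy * y

def warp_image(src, params):
    """Backward-warp `src` with the affine displacement field, sampling the source
    bilinearly; pels that map outside the source are left at zero."""
    h, w = src.shape
    out = np.zeros_like(src, dtype=float)
    for yi in range(h):
        for xi in range(w):
            dx, dy = displacement(params, xi, yi)
            xs, ys = xi + dx, yi + dy                 # position to fetch from the source
            x0, y0 = int(np.floor(xs)), int(np.floor(ys))
            if 0 <= x0 < w - 1 and 0 <= y0 < h - 1:
                fx, fy = xs - x0, ys - y0
                out[yi, xi] = ((1 - fy) * ((1 - fx) * src[y0, x0] + fx * src[y0, x0 + 1])
                               + fy * ((1 - fx) * src[y0 + 1, x0] + fx * src[y0 + 1, x0 + 1]))
    return out

img = np.arange(64.0).reshape(8, 8)
# Hypothetical model: a pure half-pel shift to the right (ax = 0.5, all other terms zero).
print(warp_image(img, (0.5, 0.0, 0.0, 0.0, 0.0, 0.0))[2, :4])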
In general the affine model will be accurate in situations where there is little 3D motion present. Rotations about axes other than the optical axis, or scenes containing significant perspective or depth, will not be modelled accurately. However, if one constrains the scene so that the background is roughly planar and perpendicular to the optical axis of the camera, then the six parameter affine model is usually accurate enough to model the interframe motion. In addition to the affine model, a coarse segmentation mask is also stored with each model to indicate which parts of the model are accurately described by the warp. Without this mask, foreground objects may be incorrectly rendered into the background. Details of how the segmentation mask is derived are described in the next chapter.

Having determined the warp model, consideration must be given to how to generate values that do not fall on original sample points. It is well known that the best interpolation function for a band limited signal is one based on sinc functions. However, as the sinc function requires infinite support and hence infinite computation, it is rarely used. Instead the computationally less expensive bicubic warp below is used:

xint = int(x)        xfrac = x - xint
yint = int(y)        yfrac = y - yint

x1 = -xfrac^3/6 + xfrac^2/2 - xfrac/3        y1 = -yfrac^3/6 + yfrac^2/2 - yfrac/3
x2 =  xfrac^3/2 - xfrac^2   - xfrac/2 + 1    y2 =  yfrac^3/2 - yfrac^2   - yfrac/2 + 1
x3 = -xfrac^3/2 + xfrac^2/2 + xfrac          y3 = -yfrac^3/2 + yfrac^2/2 + yfrac
x4 =  xfrac^3/6 - xfrac/6                    y4 =  yfrac^3/6 - yfrac/6

t1 = y1 (x1 pred(xint-1, yint-1) + x2 pred(xint, yint-1) + x3 pred(xint+1, yint-1) + x4 pred(xint+2, yint-1))
t2 = y2 (x1 pred(xint-1, yint)   + x2 pred(xint, yint)   + x3 pred(xint+1, yint)   + x4 pred(xint+2, yint))
t3 = y3 (x1 pred(xint-1, yint+1) + x2 pred(xint, yint+1) + x3 pred(xint+1, yint+1) + x4 pred(xint+2, yint+1))
t4 = y4 (x1 pred(xint-1, yint+2) + x2 pred(xint, yint+2) + x3 pred(xint+1, yint+2) + x4 pred(xint+2, yint+2))

newval = t1 + t2 + t3 + t4    (4.5)

The support area of the bicubic warp above is the 4 by 4 region of pels centered on the position of the predicted pel (x, y). The basic mechanism for constructing an image is very similar to that used by Irani and Peleg in their work on multiple motion analysis [16] and also that of Teodosio in her work on Salient Stills [28]. Each pel to be displayed is reconstructed from a number of previous frames by collecting a candidate value from each model frame via the bicubic warp of Eqn (4.5) and then evaluating a cost function to determine the best value. Unlike the Salient Still, however, the models within the enhanced resolution coder also contain masking information, which increases the accuracy of the reconstruction considerably, as pels corresponding to foreground objects are no longer considered.

The Salient Still algorithm generally gives the best results when the pel's color is determined by evaluating a weighted median function, where the weight associated with each candidate value depends on the relative scale of the model that it comes from. In other words, models with small b_x and c_y values, corresponding to close up images, are favored over those with large b_x and c_y values, which correspond to wide angle images:

weight ∝ 1 / 2^(2 + b_x + c_y)    (4.6)

The enhanced resolution codec extends this concept one step further by including a measure of best fit to the sampling grid, the rationale being that the model providing the closest fit gives the most accurate prediction.
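Before turning to that refinement, the following sketch illustrates the mechanism just described: candidate values are gathered with the separable cubic weights of Eqn (4.5) and combined with a weighted median whose weights follow the scale rule of Eqn (4.6) as reconstructed above. The model frame, the two extra candidate values and the (b_x, c_y) pairs are invented for the example, so treat this as a sketch of the mechanism rather than the coder's exact arithmetic.

import numpy as np

def cubic_weights(t):
    """Separable four-point cubic interpolation weights of Eqn (4.5) for fractional offset t."""
    return np.array([
        -t**3 / 6 + t**2 / 2 - t / 3,
         t**3 / 2 - t**2     - t / 2 + 1,
        -t**3 / 2 + t**2 / 2 + t,
         t**3 / 6            - t / 6,
    ])

def sample_candidate(pred, x, y):
    """Candidate value at non-integer (x, y) from one model frame (pred[row, col])."""
    xi, yi = int(np.floor(x)), int(np.floor(y))
    wx, wy = cubic_weights(x - xi), cubic_weights(y - yi)
    patch = pred[yi - 1:yi + 3, xi - 1:xi + 3]        # 4 by 4 support region
    return float(wy @ patch @ wx)

def candidate_weight(bx, cy):
    """Scale weight of one model's candidate, per the reconstructed Eqn (4.6)."""
    return 1.0 / 2.0 ** (2 + bx + cy)

def weighted_median(values, weights):
    """Value at which the cumulative weight first reaches half of the total weight."""
    order = np.argsort(values)
    v, w = np.asarray(values, float)[order], np.asarray(weights, float)[order]
    return v[np.searchsorted(np.cumsum(w), 0.5 * np.sum(w))]

model = np.outer(np.arange(16.0), np.ones(16))        # stand-in model frame (a vertical ramp)
candidates = [sample_candidate(model, 5.25, 6.5), 7.1, 5.9]     # one real sample, two invented
weights = [candidate_weight(0.05, 0.05), candidate_weight(0.6, 0.7), candidate_weight(0.2, 0.1)]
print(weighted_median(candidates, weights))           # the close-up candidate wins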
The addition of this term generally only affects the outcome when the images are of relatively similar scale, which makes sense: at other times a model of considerably smaller scale should have a considerably higher weight than one of large scale.

weight ∝ 1 / (2^(2 + b_x + c_y) (1 + d))    (4.7)

where d is, by definition, the squared distance to the nearest pel in the model, measured in terms of the model's scale. This is most easily shown in the diagram below.

Figure 4.13  Nearest model pel determination

Here, three models contribute candidate values to the median table. The scales of the models corresponding to the clear and cross hatched samples are the same. However, the shaded sample not only comes from a model of smaller scale but is also closer to the reconstructed pel's position. Therefore the candidate value corresponding to the shaded model will have the largest weighting.

Chapter 5 Foreground Extraction

The techniques of six parameter affine interframe motion representation and background resolution enhancement rest on the assumption that no foreground objects are present in the input frames. Therefore, as any real or interesting footage will most likely contain foreground objects, a method must be found that removes, or nearly removes, any foreground objects from the input. Once this is achieved the modified input can be used to generate a more accurate background motion model and hence a more accurate enhanced resolution background. This chapter describes the development of an algorithm that extracts foreground regions from a frame. It should be noted that the method described here is not designed to perform perfect segmentation; as section 5.1 will show, this is a near impossible task. The design criterion for the extraction algorithm is somewhat more relaxed: so long as the modified input contains no foreground information that will corrupt the background image under construction, the algorithm is deemed successful.

5.1 Choosing a Basis for Segmentation

The human visual system is extraordinarily good at performing the task of segmentation. Through the use of brightness, color, texture, lighting, motion, relative proximity, and stereo it can segment even the most complex scenes into a collection of related, meaningful parts. For example, upon examination of Figure 5.1 below, one might segment the scene into the following list of objects: a flower bed, trees, sky, houses and lamp posts.

Figure 5.1  Flower Garden scene (from an MPEG test sequence)

Even though this image is flat and presented in black and white, the segmentation described above still used four of the aforementioned cues: brightness, texture, relative proximity, and shape. However, one other cue not previously mentioned was probably also used -- a priori knowledge. This is extremely important as it means that people will have a bias or pre-conceived notion of how the segmentation should occur. In the context of the example above, most people will make use of prior experiences of what flowers, trees, houses and lamp posts look like, and therefore be predisposed to arriving at the conclusion that the tree is a foreground object. Another valid solution to the segmentation of Figure 5.1 would be to say that everything but the tree is painted on a card and that the tree is actually a hole in the card through which one can see a wooden pattern.
Although this second solution is somewhat contrived, it shows that a priori knowledge plays a significant role in the way that humans attempt to understand images. Unfortunately, we have yet to learn how to program computers to make use of this sort of knowledge. The concept of 'tree-ness' or 'flower-ness', while extremely easy for us to comprehend, is extremely difficult to describe to a computer. For this reason, the majority of current segmentation techniques rely heavily on artificially contrived situations such as blue screens and already known or uncluttered backgrounds. Those that don't may work well some of the time, but they invariably fail under certain conditions that all too often appear in real footage. Therefore, any algorithm that relies solely on an accurate segmentation in order to work is doomed to failure.

Consequently, the algorithm described in the remainder of this chapter is designed to be allowed to fail: so long as it does not allow foreground objects to corrupt the background, it does not matter how much background is leaked into the foreground. Conversely, foreground objects that happen to be identical to the background can be leaked into the background, as the end result will be the same. Of course this is rather glib, but the reasoning is sound: an important goal of this research is to construct an enhanced resolution representation of the background of a scene, and if perfect segmentation is not a prerequisite for this to be achievable then one should not aim for perfect segmentation!

One of the simplest methods for segmenting a foreground object from the background is to use the interframe motion to segment on the basis of relative velocities between regions in the image. Although somewhat hidden, the majority of motion estimators rely heavily on the brightness, texture, and shape of the objects present in the image. Consequently, performing segmentation on the basis of interframe motion is not an unreasonable proposition, as all these characteristics are automatically incorporated into the decision process. The following sections describe the fault tolerant, motion based segmentation algorithm used to extract the foreground from the input images before they are used to update the background.

5.2 Motion Estimation

Current motion estimation methods all give false motion estimates under certain conditions: block matching fails where more than one motion is present in a block or where the luminance surface is smooth, and optical flow fails badly at motion boundaries. With this understanding, however, it is possible to take one of these algorithms and modify its operation to note which regions are likely to generate false motion estimates and to subsequently minimize the presence of false motion estimates in those areas. This section briefly describes several modifications made to a standard block matching algorithm in order to improve its performance in two areas. First, the computational complexity is reduced through the use of coarse to fine hierarchical pyramid techniques that generate an initial coarse estimate of the motion using low resolution images and then refine the motion estimates using images of successively higher resolution. Secondly, the presence of false motion estimates resulting from blocks containing more than one motion or comprised of flat luminance data is minimized by:
- Making the block size a maximum of 8 by 8 pixels in areas of non uniform motion.
- Using a spiral search originating at the vector corresponding to zero motion.
- Weighting the error criterion so that vectors closer to the zero motion vector are preferred over those at the extremes of the search range. The degree to which low magnitude motion vectors are preferred over high magnitude vectors depends on how flat the block being examined is.
- Median filtering the motion estimates after each stage of refinement in the hierarchy, to remove outliers resulting from noise on flat blocks.

More explicit details of the motion estimation algorithm are given in Appendix A.

5.3 Background Model Construction

As was shown in Chapter 4, finding an appropriate model for the background of an image scene is not an easy task. However, if one is willing to make a few, admittedly non-trivial, assumptions then it is possible to simplify this task considerably. The assumptions made in order for the enhanced resolution coder to work are:
- The projected area of the background in the images presented to the encoder is large relative to that taken up by foreground objects.
- The distortion introduced by the camera optics is minimized through the use of long lenses and a restricted field of view.

Having made these assumptions, in particular the second, one can represent interframe background motion using the six parameter model

d_x = a_x + b_x x + c_x y
d_y = a_y + b_y x + c_y y    (5.1)

Using the sparse set of motion estimates generated by the motion estimator in Appendix A, this model is derived using the least squares algorithm in Appendix B.

5.4 Foreground Model Construction

Having obtained a reasonably accurate model for the background motion, it becomes necessary to determine where it fails and hence which regions of the image should be transmitted as foreground information.

5.4.1 Initial Segmentation

The least squares plane fitting algorithm described in Appendix B does more than just generate the six parameter affine model for the background motion. It also generates an initial segmentation based on the magnitude and direction of the motion vectors. This occurs because the outliers must be removed from the input data in order to generate an accurate model by the least squares method. The set of blocks corresponding to the outlier motion vectors identified by the plane fitting algorithm forms the initial segmentation.

5.4.2 Variance Based Segmentation Adjustment

As mentioned earlier in this chapter, the output from the motion estimator cannot be relied upon for accuracy, even when the modifications described in section 5.2 are incorporated into its operation. Therefore, it is highly likely that some blocks will be tagged as foreground regions even though they are clearly part of the background. The segmentation adjustment procedure described in the following paragraphs attempts to identify these blocks and reassign them to the background. The procedure does more than that, however, as it also allows currently tagged foreground blocks to become background if the resulting error from using the background motion is no worse than if the foreground motion were used. As might be expected, the rationale for making the adjustment lies in the knowledge that the motion estimator is likely to fail for blocks that are flat. Such blocks are characterized quantitatively by having low variance. These are the only blocks which are candidates for adjustment, which is performed in two phases: reassignment and filtering. The reassignment phase is rather simple.
The variance of each block (already calculated during motion estimation) is compared to a threshold value that depends on the block's size. If a block's variance falls below the threshold then the block is a candidate for readjustment. Residuals from using the modelled background motion and also the foreground motion are then computed for candidate blocks. If the background residual is smaller than, or only slightly larger than, the residual obtained when the foreground motion is used, then the block is reassigned to the background. The second, filtering phase attempts to remove salt and pepper noise from the segmentation by filtering with a 3 by 3 median filter.

Figure 5.2  Variance based adjustment on two frames of the Bus sequence: initial segmentation, after variance filtering (phase 1), and after median filtering (phase 2)

5.4.3 Region Allocation

Having appropriately filtered the segmentation on a blockwise basis, a more sophisticated filtering is performed which takes the blocks' connectivity into account. The first part of the process is a scan line conversion algorithm which scans the segmentation mask in raster fashion and tags each block as either background or as part of a foreground object. The scanline conversion process notes, but does not merge, new objects that are connected to previously encountered ones; a second pass is then performed which merges equivalent regions. The output from this process is a segmentation map with each block tagged either as background model compliant or as belonging to one of a small number of regions. (A sketch of this two pass allocation, together with the pruning step described next, follows at the end of this section.)

Figure 5.3  Example of region allocation by scan line conversion and merge

5.4.4 Region Pruning

The final stage of the segmentation process prunes regions from the segmentation according to their size and the total error resulting from using the coarse block based foreground prediction rather than the warped affine background model. Extremely small regions are assumed to be the result of noise, either in the motion estimator output or in the input image. In these cases the added expense of transmitting small regions over the larger error signal resulting from background model prediction is deemed unjustifiable, and they are automatically reassigned to the background.

Figure 5.4  Example of region pruning (regions 2, 5 and 7 pruned)

Thus, after segmentation the following information exists about the current frame:
- a six parameter affine model which describes the background motion relative to the previous frame;
- a segmentation mask which indicates whether a block belongs to the background or to a numbered region (if the block does belong to a region then the number of that region is stored in the segmentation mask);
- a motion vector field which describes the motion of each region on a block by block basis.

This information is necessary and sufficient to describe the foreground objects; details of how it is encoded into the bitstream may be found in Appendix C.
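The sketch below gives one way to realise the allocation and pruning just described: a raster scan assigns provisional labels, noting when two provisional regions touch, a second pass merges the equivalent labels, and regions smaller than a threshold are returned to the background. The 4-connectivity rule and the minimum region size are assumptions made for the example; the thesis does not fix them here.

import numpy as np

def allocate_regions(fg_mask):
    """Two-pass labelling of foreground blocks in the spirit of section 5.4.3."""
    h, w = fg_mask.shape
    labels = np.zeros((h, w), dtype=int)
    parent = {}                                    # equivalences between provisional labels

    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a

    next_label = 1
    for r in range(h):                             # first pass: raster scan
        for c in range(w):
            if not fg_mask[r, c]:
                continue
            up = labels[r - 1, c] if r > 0 else 0
            left = labels[r, c - 1] if c > 0 else 0
            neighbours = [l for l in (up, left) if l > 0]
            if not neighbours:
                labels[r, c] = next_label
                parent[next_label] = next_label
                next_label += 1
            else:
                keep = min(find(l) for l in neighbours)
                labels[r, c] = keep
                for l in neighbours:               # note, but do not yet apply, the merge
                    parent[find(l)] = keep
    for r in range(h):                             # second pass: merge equivalent regions
        for c in range(w):
            if labels[r, c]:
                labels[r, c] = find(labels[r, c])
    return labels

def prune_small_regions(labels, min_blocks=2):
    """Reassign regions smaller than `min_blocks` blocks to the background (label 0)."""
    for lbl in np.unique(labels):
        if lbl and (labels == lbl).sum() < min_blocks:
            labels[labels == lbl] = 0
    return labels

mask = np.array([[1, 1, 0, 0],
                 [0, 1, 0, 1],
                 [0, 0, 0, 0]], dtype=bool)
print(prune_small_regions(allocate_regions(mask)))   # the isolated block returns to background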
5.5 Temporarily Stationary Objects

One of the major shortcomings of the foreground model construction technique just described is that it depends on interframe motion being present between every frame pair presented to the encoder. Thus, any moving object that becomes stationary for one or more frames is immediately, and erroneously, merged into the background. While this has little effect on the reconstruction of the current frame, it does significantly compromise the segmentation mask for that frame, which may result in an inaccurate background construction for later frames.

A partial solution to this problem, which makes use of first order proximity, was implemented. The basis of the solution was to predict the position of foreground regions in the next frame from the position and motion data generated for the current frame. A "holding delay" was then introduced for the regions that were predicted to contain foreground information. Foreground regions whose motion matched that of the background for the duration of this delay were then reassigned to the background; if a region's motion did not match that of the background then its holding delay was reset. As stated, this solution was only partially successful. Details of how and why it failed in particular situations are presented in Chapter 7, Simulation Results.

Chapter 6 Updating the Models

The decision of how to update the various models within the coder/decoder is an important one. If the models associated with each frame are always dependent on those of previously received frames then several problems arise. First, it is impossible to "tune in" to the encoded stream at any point other than the beginning. Second, if an error occurs in the bitstream then recovery is impossible. The other extreme is to re-transmit each model in its entirety with every frame, which of course requires a higher bit rate than necessary. Clearly the best solution lies somewhere in between these two extremes.

Single frames of video form the models used within MPEG [3], which solves the problem of updating them through the use of three frame types: Intra (I), Predicted (P) and Bidirectionally predicted (B). Intra frames are independent of any other frame, Predicted frames are predicted from the previously transmitted I or P frame, and Bidirectionally predicted frames are predicted from the two most recently transmitted I or P frames. As I frames can be decoded without prediction from any other frame, they are transmitted on a regular basis to allow for random access and also to ensure that any transmission errors are flushed from the decoder buffers. Figure 6.1 below shows a typical MPEG frame allocation; the arrows indicate predictive coding.

Figure 6.1  Example of MPEG frame ordering (an I frame followed by B and P frames)

6.1 Frame Types and Ordering

Like H.261 and MPEG, the Enhanced Resolution Coder encodes frames of video in several different ways in order to allow for random access and error recovery. Two distinct frame types are used:
- Model (M) frames
- Model Delta (D) frames

6.1.1 Model (M) Frames

The first frame to be sent is always a model or M frame. The M frame is essentially the same as an MPEG I frame in both nature and implementation; that is, it contains Huffman encoded quantized DCT coefficients but no motion information. It can, however, contain segmentation information if it is available. As this information is not available for the first frame of a sequence, the first frame never contains segmentation information. M frames transmitted with segmentation information can be used to predict occluded information in subsequent frames; an M frame that contains no segmentation information is only used to predict the following frame before it is discarded. This feature is particularly useful for "pre-loading" the background in video conferencing situations. The M frame also provides an avenue for the encoder to take advantage of other, more computationally expensive algorithms such as the Salient Still [28] or the Layered Coder [4].
As these algorithms work on significantly more frames than can be stored in the enhanced resolution encoder, they may be able to generate a more accurate background representation. The ability to make use of the output from these algorithms may afford improved performance by circumventing the memory and computation bounds present in the standard enhanced resolution encoder.

6.1.2 Model Delta (D) Frames

The second frame type is the D frame. D frames are not unlike MPEG B frames in that they contain interframe motion information referenced to one or more predictors, and an error signal consisting of Huffman encoded quantized DCT coefficients. However, unlike the MPEG B frame, they also contain segmentation information about the position of a finite number of foreground regions. The background and foreground motion models transmitted within a D frame are complete and fully describe the motion and shape information for the frame relative to the previous frame. The error residual, however, is based on the composition process, which uses many frames to construct the background prediction. Consequently, discrepancies will arise if a decoder "tunes in" halfway through without the previous background models to aid in the prediction. Therefore the decoder should not attempt to decode images until it detects an M frame in the bitstream. In short, the information contained in each frame type can be summarized as follows:

TABLE 6.1  Frame types and their information content

Frame Type | Error Residual | Background Motion | Foreground Motion | Foreground Shape
M          | y              | n                 | n                 | y or n
D          | y              | y                 | y                 | y

6.1.3 Frame Ordering Example

An example of frame ordering in a typical sequence is shown below in Figure 6.2. Several things should be noted from this example. First, D frames always construct predictions using the previous frame for foreground objects, and the previous frame plus any valid models for the remaining background. Secondly, frame 9, an M frame, contains a segmentation mask and can therefore be used as a model, whereas the first M frame (frame 0) did not and had to be discarded after being used to predict frame 1.

Figure 6.2  Enhanced Resolution Codec frame ordering example: frames 0 to 13, an M frame followed by D frames, with models M1, M2 and M3 retained along the way. Each model consists of a previously transmitted frame, a 6 parameter affine model, and a segmentation mask.

6.2 Updating the Affine Model

Adjusting the affine parameters of existing models to reflect motion between the previous frame and the frame being encoded is crucial to the operation of the algorithm. Fortunately, affine models can be combined according to the equations below [28] to give a single affine model that combines the effects of warping using model 1 followed by a second warp using model 2:

ax_new = ax_1 + ax_2 + bx_1 ax_2 + cx_1 ay_2
ay_new = ay_1 + ay_2 + by_1 ax_2 + cy_1 ay_2
bx_new = bx_1 + bx_2 + bx_1 bx_2 + cx_1 by_2
by_new = by_1 + by_2 + by_1 bx_2 + cy_1 by_2
cx_new = cx_1 + cx_2 + bx_1 cx_2 + cx_1 cy_2
cy_new = cy_1 + cy_2 + by_1 cx_2 + cy_1 cy_2    (6.1)

This property of the affine model means that each affine model needs only to be transmitted once. It negates the need to perform motion estimation on frames other than the previous one; subsequent warps are taken into account by composing the current affine model with those associated with previous models.
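A short sketch of the model composition in Eqn (6.1) follows. The helper assumes the composite describes applying model 2 to a point first and then model 1, which is the order implied by the coefficient pattern above; the order convention should be checked against an implementation if it matters, and the model values in the example are arbitrary.

def compose_affine(m1, m2):
    """Combine two 6-parameter displacement models m = (ax, ay, bx, by, cx, cy), where
    dx = ax + bx*x + cx*y and dy = ay + by*x + cy*y, following the pattern of Eqn (6.1)."""
    ax1, ay1, bx1, by1, cx1, cy1 = m1
    ax2, ay2, bx2, by2, cx2, cy2 = m2
    return (
        ax1 + ax2 + bx1 * ax2 + cx1 * ay2,   # ax_new
        ay1 + ay2 + by1 * ax2 + cy1 * ay2,   # ay_new
        bx1 + bx2 + bx1 * bx2 + cx1 * by2,   # bx_new
        by1 + by2 + by1 * bx2 + cy1 * by2,   # by_new
        cx1 + cx2 + bx1 * cx2 + cx1 * cy2,   # cx_new
        cy1 + cy2 + by1 * cx2 + cy1 * cy2,   # cy_new
    )

def apply_displacement(m, x, y):
    """Map a point (x, y) through the displacement model, returning the warped position."""
    ax, ay, bx, by, cx, cy = m
    return x + ax + bx * x + cx * y, y + ay + by * x + cy * y

# Sanity check: composing a model with the identity (all zeros) leaves it unchanged.
m = (1.0, -0.5, 0.01, 0.0, 0.0, 0.02)
assert compose_affine(m, (0, 0, 0, 0, 0, 0)) == m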
Transmission of the affine model implies that some quantization of its parameters must take place. If too few bits are assigned to the parameters then the effects of quantization noise will quickly accumulate, making the results of the composition progressively more inaccurate. Therefore, care must be taken to ensure that a sufficient number of bits is assigned to each of the parameters. For details on how the parameters are encoded see Appendix C.

6.3 Model Update Decision

Deciding which frames should be used as models for prediction of subsequent frames is critical to the performance of the encoder. The simplest strategy is to cycle through the buffers, throwing out the oldest model and replacing it with the current one. Unfortunately, the amount of memory and computation required to implement a reasonable cycle length would make both the encoder and decoder prohibitively expensive. A more realistic approach is to limit the number of models to somewhere between four and eight. Given a limited number of models, two update strategies come to mind.

Consider the situation where, instead of using every frame as a model, only every m-th frame is used, where m is a small number, say between 2 and 5. This modification extends the temporal span of the encoder (in terms of the time between the various frames used to construct the background) by a factor of m. However, it still does not guarantee optimal model assignment in terms of maximal revealed background area. Another problem with this method is that it does not automatically take the scale of the frames into account; close ups are just as likely to be discarded as wide angle frames.

The second strategy addresses the shortcomings of the first. By performing a weighted correlation of the coverage provided by all but one of the currently valid models, including the current frame, it is possible to determine which model should be discarded so that maximal coverage is achieved for the given number of models allowed. The correlation process is done in two phases. First, the bounds of the area that could be covered by the set of models being evaluated are calculated. Next, a correlation is performed by summing the weight corresponding to the "best" model for each pel within the bounds of the reconstruction space. The weight factor shown in (6.2) attempts to find the most accurate reconstruction for a given pel. In addition, it takes scale into account so that the final correlation value corresponds to the total number of valid samples that can be used to reconstruct the background.

weight = 1 / 2^((b_x + c_y)/2)    (6.2)

This correlation is performed for each model in turn by omitting it from the set of candidate models; the remaining subset is then used to generate the correlation value. The model whose omission results in the highest correlation is replaced by the current frame. Note that this can result in the current frame being "replaced" by itself, which corresponds to the case where the current models already provide the best coverage. A sketch of this update decision is given below.
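The following sketch illustrates the second update strategy: for each candidate set with one model omitted, every pel of the reconstruction space contributes the weight of its best remaining model, and the model whose omission leaves the highest total coverage is discarded. The masks and scale parameters are invented, and the weight follows the reconstruction of Eqn (6.2) above, so treat the numbers as illustrative only.

import numpy as np

def coverage_score(models, width, height):
    """Sum, over every pel of the reconstruction space, the weight of the best candidate
    model; each model is a (mask, bx, cy) triple where mask flags the pels it can predict."""
    best = np.zeros((height, width))
    for mask, bx, cy in models:
        w = 2.0 ** (-(bx + cy) / 2.0)            # weight from the reconstructed Eqn (6.2)
        best = np.maximum(best, w * mask)
    return best.sum()

def choose_model_to_drop(models, width, height):
    """Drop the model whose omission leaves the highest coverage score."""
    scores = [coverage_score(models[:i] + models[i + 1:], width, height)
              for i in range(len(models))]
    return int(np.argmax(scores))

# Hypothetical 8x8 reconstruction space covered by three candidate models.
full = np.ones((8, 8))
left = np.pad(np.ones((8, 4)), ((0, 0), (0, 4)))      # covers only the left half
models = [(full, 0.0, 0.0), (left, -1.0, -1.0), (left, 0.0, 0.0)]
print(choose_model_to_drop(models, 8, 8))             # the redundant wide-scale 'left' model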
6.4 Encoder Block Diagram

The encoder block diagram is shown below. Notice that a significant portion of the encoder, the transform coder core, is borrowed from the MPEG coder. Thus, the enhanced resolution coder does not require a complete redesign of an existing encoder, only modifications to support the new motion analysis and compositing sections.

Figure 6.3  Enhanced Resolution Encoder block diagram. The MPEG transform coder core comprises the 8x8 Discrete Cosine Transform (DCT), quantization (Q) with rate controlled quantizer step size, variable length coding (VLC) and output buffer, together with the inverse quantization (Q^-1) and 8x8 inverse DCT (DCT^-1) used for local reconstruction.

6.5 Decoder Block Diagram

The corresponding enhanced resolution decoder can be somewhat more complicated than its MPEG counterpart. The viewer is given a choice between viewing the standard decoded display or an enhanced display that takes advantage of the extra models stored in the decoder to construct a super resolution, screen widened version of the same image. If the super resolution image is not required then that part of the decoder can be disabled, reducing the computational complexity and hence the cost of the decoder.

Figure 6.4  Enhanced Resolution Decoder block diagram: the incoming bitstream is buffered and VLC decoded, the MPEG transform decoder core reconstructs the error signal, and the masks, affine models and model stores drive the compositing and display stages.

Chapter 7 Simulation Results

Two sequences were used to test the enhanced resolution codec. The first sequence consisted of a short lecture presentation in front of a white board. Specifically designed to match the background model assumed by the enhanced resolution coder, it was hoped that the encoder would achieve high performance when encoding this scene. This sequence consisted of 300 frames at a resolution of 352 pixels wide by 240 pixels high in 4:2:0 chrominance format (chrominance components sub-sampled by a factor of 2 horizontally and vertically relative to the luminance component) and was coded at a target rate of 600 kbit/s in order to achieve an average SNR between 30 and 35 dB. As the first sequence was a "best case", the second sequence was chosen to represent the "worst case": a sequence containing significant three dimensional rotational motion in which the foreground dominates the viewed area. To that end, a sequence containing a close up of a revolving carousel was chosen. This sequence consisted of 150 frames at a resolution of 352 pixels wide by 240 pixels high in 4:2:0 chrominance format and was coded at a target rate of 1.2 Mbit/s in order to achieve an average SNR between 30 and 35 dB.

The number of model frames in the enhanced resolution encoder was limited to eight. Motion estimation was performed as described in Appendix A to half pel accuracy, and the segmentation was performed using a block size of 8 by 8 pels. The MPEG encoder used for comparison in section 7.2 was developed in the Media Laboratory's Information & Entertainment Systems group. It uses MPEG2 Test Model 5 rate control and an exhaustive search followed by a half pel refinement for motion estimation.

7.1 Motion/Shape Estimation Algorithm Performance

7.1.1 Foreground Object Extraction

Even though the motion disparity between the background and the "guest lecturer" was small, the encoder managed to segment the majority of the lecturer from the background. As expected, the segmentation was not perfect. In particular, regions with insufficient texture, e.g. the lecturer's hair and arm, were initially incorrectly attributed to the background as a result of the motion estimator failing to detect the true motion for these regions.

Figure 7.1  Original image and segmented lecturer, with holes in the head and arm

Another problem arose during pans, when new parts of the background were revealed.
Initially, these new parts were correctly tagged as foreground, but incorrect motion estimates due to lack of texture caused them to remain foreground instead of becoming background. While this was not a problem when the camera was only performing pans, problems did arise when zooms were included. As closure is only guaranteed for the transmitted picture area, holes sometimes appeared in the supersized picture outside the transmitted picture area, because previous segmentation errors had left holes in the background. This situation was exacerbated by the fact that the foreground data was then subject to the "holding delay" put in place to prevent temporarily stationary foreground objects (see section 5.5) from being sent to the background.

Figure 7.2  Poor segmentation due to flat regions causing inaccurate motion estimation

Apart from these two flaws, for which solutions will be suggested in the next chapter, the segmentation algorithm worked reasonably well for the "lecturer" sequence. The same cannot be said for the carousel sequence: due to the large amount of 3D motion in this sequence, practically the entire frame was tagged as foreground.

Figure 7.3  Carousel foreground segmentation

7.1.2 Background Construction Accuracy

As might be expected, the enhanced resolution encoder was unable to construct a background of any worth for the carousel sequence due to the excessive 3D motion present. The lecturer sequence was another story, however. Despite the coarse sampling of the motion vector field, the algorithm was usually able to generate reasonably accurate affine models for the sequence. In fact, when compared to the background at the original resolution, the background reconstruction was generally better due to the affine warp's ability to represent zooms. Of course, the lecturer sequence was chosen with these ideas in mind, so it is hardly surprising that the algorithm should generate accurate models for its representation. Figure 7.4 shows the super resolution background constructed by the decoder at the beginning, middle and end of a 90 frame Group of Pictures (frames 1, 51, and 89). Initially the best estimate for the background left significant holes in the representation, as shown by the black areas. By frame 51 the lecturer's arm had been removed from the representation along with most of his head. However, by frame 89 the lecturer's head and torso had reappeared in the background. The reason for this is that the encoder was set to assume a maximum of only eight model stores in the decoder; therefore, only the "best" eight frames were kept. These eight frames provided sufficient information to reveal the white board panned off to the right and the region under the lecturer's arm, but in doing so they were unable to reveal the regions behind the lecturer's head and torso.

Figure 7.4  Super resolution backgrounds for frames 1, 51, and 89

7.1.3 Super Resolution Composited Image

In addition to generating a super resolution background, the decoder also has the capability to generate a second super resolution image that includes the transient foreground information from the current frame. Below is an example of frame 89 with the foreground composited over the super resolution background buffer.

Figure 7.5  Super resolution frame with foreground composited on top

Notice that holes in the background caused by segmentation errors are concealed by the foreground for the region covered by the transmitted frame.
7.2 Efficiency of Representation Compared to MPEG2

In order to make the comparison as fair as possible, the MPEG and enhanced resolution algorithm parameters were matched: the bit rates were set equal and the I frame rate of MPEG was set to the same rate as that for M frames in the enhanced resolution coder. The lecturer sequence was coded at 600 kbit/s while the carousel sequence was coded at 1.2 Mbit/s.

7.2.1 Motion/Shape Information

As can be seen from the plot below for the lecturer sequence, the number of bits spent by the enhanced resolution coder on motion and shape information was significantly less than that spent by the MPEG1 algorithm on motion alone. MPEG spent an average of 3614 bits/frame on motion while the enhanced resolution coder spent an average of 2256 bits/frame, a saving of 1358 bits/frame, or 38%.

Figure 7.6  Motion information bit rates for the lecturer sequence (bits per frame over 300 frames)

The tables were turned, however, for the carousel sequence. Due to the overwhelming amount of foreground information, the enhanced resolution encoder used nearly twice as many bits as the MPEG encoder to encode the motion. However, when one considers that the enhanced resolution coder used blocks of size 8 by 8 pixels instead of 16 by 16 pixels, and hence had four times as many motion vectors to send, the factor of two increase in bits used does not look so bad after all. If one ignores the differing block sizes, one could say that the enhanced resolution coder had fallen back into an MPEG-like block based coder and as such should have performed like one.

Figure 7.7  Motion information bit rates for the carousel sequence (bits per frame over 150 frames)

7.2.2 Error Signal

As can be seen from the plot below, the enhanced resolution algorithm performed considerably better in terms of SNR than MPEG for the lecturer sequence. This can be attributed to two main factors. First, the motion representation was not only more accurate but also more compact than that of MPEG; this resulted in more accurate predictions and also allowed more bits to be used for the error signal. Second, the presence of multiple reference stores reduced the amount of information that had to be retransmitted due to cyclic occlusion. Over the three hundred frame sequence the average SNR for the MPEG1 coder was 33.15 dB while the enhanced resolution encoder achieved an SNR of 34.38 dB, an improvement of 1.23 dB over MPEG1.

Figure 7.8  SNR for the lecturer sequence (dB over 300 frames)

As might have been expected, the enhanced resolution algorithm did not perform as well for the carousel sequence. However, it still performed better than MPEG, primarily due to the more accurate motion representation. Over the carousel sequence the average SNR for the MPEG1 coder was 29.37 dB while the enhanced resolution encoder achieved an SNR of 30.15 dB, an improvement of 0.78 dB over MPEG1.

Figure 7.9  SNR for the carousel sequence (dB over 150 frames)

Chapter 8 Conclusions

8.1 Performance of Enhanced Resolution Codec

The primary aim of developing a new coding algorithm capable of including some notion of content in the encoded representation of video was achieved. Furthermore, this aim was achieved with little or no loss in picture quality when compared to MPEG.
The new representation arose from splitting the image data into background and foreground data, which were then transmitted independently of one another. This made several things possible:
- A synthetic super resolution image of the background of the scene could now be constructed, enabling the viewer to concentrate on items of interest in the background, e.g. writing on a whiteboard during a lecture.
- Foreground elements could be removed from this background image, enabling the viewer to see background areas of interest that became temporarily occluded by foreground objects, e.g. writing on a whiteboard temporarily obscured by the lecturer's arm.
- The resolution of the decoded image could now be determined by the decoder, to some extent independently of the encoder, without the need for enhancement layers transmitted in parallel with the primary bitstream.

Interestingly, all of these benefits were achieved at little or no expense in terms of compression efficiency for the test sequences (lecturer and carousel) when compared with the more traditional MPEG1 hybrid transform coder. In fact, even when the input to the enhanced resolution algorithm was chosen to break the assumed background model, the enhanced resolution algorithm still outperformed MPEG by achieving a higher average SNR for the carousel sequence encoded at 1.2 Mbit/s. This feature is quite important, as it means that the enhanced resolution algorithm degrades gracefully back to a more traditional hybrid transform coder when the assumptions made during its design are broken. Thus, the enhanced resolution encoder can be used for any sequence without the risk of excessively high bit rates when compared to the MPEG algorithm.

8.2 Possible Extensions for Further Study

Although the enhanced resolution algorithm represents a significant improvement over MPEG, a lot of work still remains to be done in order for it to achieve the more ambitious goals that motivated its inception. The primary aim of the thesis was to develop a new and more useful representation for encoded moving images: a representation that included content, one that allowed the viewer to manipulate images with greater freedom and understanding. Currently video is still very much a linear medium; searching for specific items of interest requires the use of wasteful techniques which result in at least one person investing a significant amount of time viewing an entire sequence. This should not be required. Some areas for improvement in the enhanced resolution algorithm that come to mind are:
- Better segmentation techniques that segment more cleanly than the motion based technique used. Perhaps the explicit inclusion of color and texture will allow more accurate segmentation to be achieved.
- More efficient and meaningful representations for foreground objects. In terms of efficiency, triangular patches appear to be the obvious choice. They may even lead to methods that enable 3D models to be constructed for foreground objects based on the relative motion of their vertices.
- Better random access and model update techniques. Currently the encoder makes an assumption about how many models should be used in the construction of the background. At the decoder, the models from the previous Group of Pictures (GOP) are discarded when a new one starts, thus forfeiting the gains made by their presence. A smarter decoder with more memory than that assumed by the encoder could keep the best models from previous GOPs for inclusion with the current models.
Appendix A: Spatially Adaptive Hierarchical Block Matching

Motion estimation is somewhat of a black art; no motion estimation technique today generates a perfect output that represents the true motion of the image for all possible cases. The algorithm described here represents one attempt to improve a standard block matching motion estimator, both in terms of increased speed and increased accuracy. The modifications made to increase accuracy have been kept simple, in keeping with the idea that the motion estimator should be as efficient as possible.

A.1 Decreasing Computational Complexity

The first and most straightforward modification made to the motion estimator was to make it hierarchical. This modification decreases the computational complexity of motion estimation considerably. A pyramid decomposition was performed on both the current image and the predictor, where each level is scaled by a factor of two, both horizontally and vertically, from those above and below it. Motion estimation was then performed in a top down fashion, with successive refinements being made to the motion estimates at each level until the bottoms of both pyramids were reached.

Figure A.1  Pyramid decomposition: levels 2, 1 and 0 at scales 1/16, 1/4 and 1 respectively

The refinement process between two levels consists of two steps:
- The motion vectors from the level above are doubled to reflect the change in scale.
- A localized search is then performed, centered on the doubled motion vectors from the first step.

The computational savings over a standard non-hierarchical algorithm arise from the fact that large motions at lower levels become much smaller at higher levels. Having computed an estimate at the highest level negates the need to perform anything but a +/- 1 pel search at lower levels, as the small motion estimates computed at the highest level are eventually scaled up to their true value. Having reduced the computational complexity considerably, it became necessary to examine methods to improve the motion estimator's accuracy.

A.2 Increasing Accuracy

Block matching algorithms based upon the SAD criterion are known to give unreliable estimates when the block being predicted consists of highly uniform samples. In these cases a small perturbation makes little difference to the error, so any vector is just as good as any other. The SAD criterion will therefore attempt to correlate not the block data but the small amounts of noise present in both the block and the predictor, thereby generating an inaccurate estimate of the true motion. One solution to this problem is to use the normalized correlation coefficient [5], shown in Eqn (A.1), as the error measure instead of the SAD criterion:

N(δx, δy) = (E(q1 q2) - E(q1) E(q2)) / (σ(q1) σ(q2))    (A.1)

where q1 is the original block and q2 is the predictor block offset by (δx, δy). However, the normalized correlation requires the computation of the variance of each possible matching block q2 for each original block q1, a requirement that increases the amount of computation considerably. Consequently, the normalized correlation coefficient was not used.

Three methods were chosen to improve the accuracy of the motion estimates. First, motion estimation was performed between original images, not from original to reconstructed images as is common with MPEG coders.
Using the reconstructed image instead of the original makes sense for MPEG coders, as accumulated errors in the reconstructed image may mean that the true motion vector gives a larger residual than some other vector. However, as the aim of the motion estimator in the enhanced resolution codec is to find the true motion for the block, and not to minimize the error residual, original to original motion estimation was used.

Secondly, instead of searching the candidate vector space using the usual raster scan method, a spiral search method was used, which tends to favor smaller motion vectors rather than those oriented towards the top left as with the raster search. In an effort to reduce the number of outliers in the motion field, this bias towards smaller motion vectors is further increased for highly uniform blocks by multiplying the final SAD coefficient by a factor dependent on the block's variance. The spiral search order is shown in Figure A.2.

Figure A.2  Spiral search method: candidate vectors visited in order of increasing distance from the zero motion vector

The third and final method employed to increase the accuracy of the motion estimates was to median filter the individual components of each vector with a 3 by 3 separable median filter after each refinement stage in the hierarchy. This had the effect of eliminating outliers that were missed by the spiral search modifications. Unfortunately, median filtering also removed the smallest valid regions. However, as the overhead associated with transmitting such regions could be quite expensive, this was deemed to be only a minor flaw.

Appendix B: Affine Model Estimation

The process of affine model estimation consists of fitting a model to a set of data points. Not all data points will fit the model; therefore, the process of estimating the model involves several iterations of model estimation and subsequent pruning of the data points that do not fit the model. The model being used here is the standard six parameter model:

v_x = a_x + b_x x + c_x y
v_y = a_y + b_y x + c_y y    (B.1)

The parameters a_x, b_x, c_x, a_y, b_y and c_y are computed from the set of motion vector estimates (v_x, v_y) generated by the motion estimator. As the motion vectors consist of two components, this process actually becomes one of fitting two models to two sets of data points, with the only dependency between the two being that the data points used to construct each model be the same. The least squares method for fitting a single plane to either v_x or v_y is given below:

v = a + b x + c y

LSE = Σ_R (v - (a + b x + c y))^2
    = Σ_R [ v^2 - 2 v (a + b x + c y) + (a + b x + c y)^2 ]    (B.2)

Differentiating the error term with respect to a, b and c and setting the derivatives to zero yields three linear equations in the three unknown parameters:

d LSE / da :  a Σ_R 1 + b Σ_R x  + c Σ_R y  - Σ_R v(x,y)   = 0
d LSE / db :  a Σ_R x + b Σ_R x^2 + c Σ_R xy - Σ_R x v(x,y) = 0
d LSE / dc :  a Σ_R y + b Σ_R xy + c Σ_R y^2 - Σ_R y v(x,y) = 0    (B.3)

These equations can be written in matrix form:

[ Σ_R 1    Σ_R x     Σ_R y   ] [ a ]   [ Σ_R v(x,y)   ]
[ Σ_R x    Σ_R x^2   Σ_R xy  ] [ b ] = [ Σ_R x v(x,y) ]
[ Σ_R y    Σ_R xy    Σ_R y^2 ] [ c ]   [ Σ_R y v(x,y) ]    (B.4)

Solving for a, b and c is then simply achieved via diagonalization and back substitution. It should be noted that the choice of which data points to include in the region R is critical in the determination of the affine parameters. If outliers are included in the solution then they will tend to skew the parameters away from their correct values. In order to prevent this from happening, the region R is recursively computed by estimating the parameters for the current set of data points and then redefining R to include only those points which are "close" to the model. The amount of deviation allowed is reduced with each step until the affine model closely reflects the data points in the region R, which by then contains few, if any, outliers. The block diagram for the algorithm used in this thesis is presented below in Figure B.1.

Figure B.1  Block diagram of the six parameter affine estimation algorithm
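A compact sketch of this estimation-and-pruning loop is given below for one component of the motion field. It uses a library least squares routine in place of explicit diagonalization and back substitution, and the pruning schedule (multiples of the residual spread) is an assumption made for the example; Figure B.1 does not fix it.

import numpy as np

def fit_affine_component(xs, ys, vs, tol_scale=(3.0, 2.0, 1.0)):
    """Least squares fit of v = a + b*x + c*y to one component of the block motion field
    (Eqn B.4), re-estimated after pruning the vectors that deviate most from the current
    fit, in the spirit of Figure B.1."""
    xs, ys, vs = map(np.asarray, (xs, ys, vs))
    keep = np.ones(len(vs), dtype=bool)
    a = b = c = 0.0
    for tol in tol_scale:
        X = np.column_stack([np.ones(keep.sum()), xs[keep], ys[keep]])
        p, *_ = np.linalg.lstsq(X, vs[keep], rcond=None)   # solves the normal equations
        a, b, c = p
        resid = np.abs(vs - (a + b * xs + c * ys))
        keep = resid <= tol * (resid[keep].std() + 1e-6)   # tighten the inlier set each pass
    return (a, b, c), keep

# Synthetic motion field: a plane plus one outlier block (e.g. a foreground block).
xs = np.array([0, 8, 16, 24, 0, 8, 16, 24], float)
ys = np.array([0, 0, 0, 0, 8, 8, 8, 8], float)
vs = 0.5 + 0.1 * xs - 0.05 * ys
vs[3] += 10.0
params, inliers = fit_affine_component(xs, ys, vs)
print(np.round(params, 3), inliers)   # recovers the plane parameters and flags the outlier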
To prevent outliers from skewing the fit, the region R is computed recursively: the parameters are estimated for the current set of data points, and R is then redefined to include only those points which lie "close" to the model. The amount of deviation allowed is reduced with each step, until the affine model closely reflects the data points in a region R that contains few, if any, outliers. The block diagram for the algorithm used in this thesis is presented in Figure B.1.

Figure B.1. Block Diagram of the 6 Parameter Affine Estimation Algorithm.

Appendix C: Enhanced Resolution Codec Bit Stream Definition

The presentation style of the enhanced resolution codec bit stream syntax has been deliberately modeled on the style used by the MPEG2 Committee Draft, because the MPEG2 syntax was used as the starting point for the Enhanced Resolution syntax. As many syntactic elements are identical between the two encoders, those elements which are unique to the Enhanced Resolution encoder are printed in italics; all syntactic elements printed in normal script should be considered to have the same meaning as their equivalents in the MPEG2 standard.

C.1 Video Bitstream Syntax

C.1.1 Video Sequence

video_sequence() {                                        No. of bits    Mnemonic
    next_start_code()
    sequence_header()
    do {
        do {
            if (next_bits() == group_start_code) {
                group_of_pictures_header()
            }
            picture_header()
            picture_data()
        } while ((next_bits() == picture_start_code) ||
                 (next_bits() == group_start_code))
        if (next_bits() != sequence_end_code) {
            sequence_header()
        }
    } while (next_bits() != sequence_end_code)
    sequence_end_code                                     32             bslbf
}

C.1.2 Sequence Header

sequence_header() {                                       No. of bits    Mnemonic
    sequence_header_code                                  32             bslbf
    horizontal_size_value                                 12             uimsbf
    vertical_size_value                                   12             uimsbf
    aspect_ratio_information                              4              uimsbf
    frame_rate_code                                       4              uimsbf
    bit_rate_value                                        18             uimsbf
    marker_bit                                            1              bslbf
    vbv_buffer_size                                       10             uimsbf
    load_intra_quantizer_matrix                           1              uimsbf
    if (load_intra_quantizer_matrix)
        intra_quantizer_matrix[64]                        64*8           uimsbf
    load_non_intra_quantizer_matrix                       1              uimsbf
    if (load_non_intra_quantizer_matrix)
        non_intra_quantizer_matrix[64]                    64*8           uimsbf
}

C.1.3 Group of Pictures Header

group_of_pictures_header() {                              No. of bits    Mnemonic
    group_start_code                                      32             bslbf
    time_code                                             25             bslbf
}

C.1.4 Picture Header

picture_header() {                                        No. of bits    Mnemonic
    picture_start_code                                    32             bslbf
    temporal_reference                                    10             uimsbf
    picture_coding_type                                   3              uimsbf
    vbv_delay                                             16             uimsbf
    if ((picture_coding_type == BG_MODEL) ||
        (picture_coding_type == MODEL_DELTA)) {
        bg_lock                                           1              uimsbf
        if (bg_lock)
            bg_lock_model_num                             4              uimsbf
    }
}

C.1.5 Picture Data

picture_data() {                                          No. of bits    Mnemonic
    if (picture_coding_type == MODEL_DELTA) {
        background_motion()
        num_foreground_objects                            7              uimsbf
        fg_seg_block_size                                 5              uimsbf
        while (fg_object_to_be_transmitted) {
            foreground_object()
        }
    }
    error_residual_data()
}

C.1.6 Background Affine Model

background_motion() {                                     No. of bits    Mnemonic
    bg_motion_start_code                                  32             bslbf
    a_scale                                               4              uimsbf
    a_size                                                4              uimsbf
    if (a_size != 0) {
        ax_sign                                           1              uimsbf
        ax                                                1 - 16         vlclbf
        ay_sign                                           1              uimsbf
        ay                                                1 - 16         vlclbf
    }
    bc_scale                                              5              uimsbf
    bx_cy_size                                            5              uimsbf
    if (bx_cy_size != 0) {
        bx_sign                                           1              uimsbf
        bx                                                1 - 32         vlclbf
        cy_sign                                           1              uimsbf
        cy                                                1 - 32         vlclbf
    }
    by_cx_size                                            5              uimsbf
    if (by_cx_size != 0) {
        by_sign                                           1              uimsbf
        by                                                1 - 32         vlclbf
        cx_sign                                           1              uimsbf
        cx                                                1 - 32         vlclbf
    }
}

The 6 parameters ax, bx, cx, ay, by, and cy represent real numbers that have been encoded as integers. To reconstruct the true floating point value of a parameter the following scaling must be performed:

    true_value = sign(sign_bit) * (value & ((1 << size) - 1)) / (1 << scale)        (B.5)

where value is the decoded integer of length size bits. In addition:
- The scales for the a parameters and the b & c parameters are encoded separately.
- If size is 0, the parameters to which the size value refers shall be set to 0.
- The sign bit for each parameter is set to "1" to indicate a negative parameter; conversely, "0" indicates that the parameter is positive.
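For illustration, Eqn (B.5) maps directly onto integer operations at the decoder. The fragment below is a minimal sketch assuming the sign bit, magnitude, size and scale fields have already been read from the bitstream; decode_affine_param is a hypothetical name, not part of a defined decoder interface.

/* Reconstruct one affine parameter from its coded fields, per Eqn (B.5).
 * sign_bit : 1 indicates a negative parameter, 0 a positive one
 * value    : decoded integer holding the magnitude, 'size' bits long
 * size     : number of magnitude bits; a size of 0 forces the parameter to 0
 * scale    : the a_scale or bc_scale value shared by the parameter group     */
static double decode_affine_param(unsigned sign_bit, unsigned long value,
                                  unsigned size, unsigned scale)
{
    double magnitude;
    if (size == 0)
        return 0.0;
    if (size < sizeof(unsigned long) * 8)       /* mask off any stray high bits */
        value &= (1UL << size) - 1UL;
    magnitude = (double)value / (double)(1UL << scale);
    return sign_bit ? -magnitude : magnitude;
}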
C.1.7 Foreground Object

foreground_object() {                                     No. of bits    Mnemonic
    foreground_object_start_code                          32             bslbf
    object_num                                            7              uimsbf
    horizontal_f_code                                     4              uimsbf
    vertical_f_code                                       4              uimsbf
    first_block_px                                        7              uimsbf
    first_block_py                                        7              uimsbf
    do {
        if (not_first_block)
            relative_block_position()
        if (picture_coding_type == MODEL_DELTA) {
            mv_ref                                        1              uimsbf
            motion_vector()
        }
    } while (next_bits() != start_code)
}

The mv_ref parameter indicates the reference for the following motion vector as follows:
- "1" means that the motion vector is referenced to the previous valid motion vector, EXCEPT in the case where pos_delta_type == 0x01, in which case the motion vector corresponding to the first block transmitted in the previous row shall be used as the reference vector.
- "0" means that the motion vector is referenced to the motion vector of the block immediately above, in the previous row.

C.1.8 Relative Block Position

relative_block_position() {                               No. of bits    Mnemonic
    pos_delta_type                                        1 - 3          vlclbf
    if ((pos_delta_type == fg_delta) ||
        (pos_delta_type == bg_delta)) {
        while (next_bits() == macroblock_escape)
            macroblock_escape                             11             vlclbf
        macroblock_address_increment                      1 - 11         vlclbf
    } else {
        while (next_bits() == negative_delta_escape)
            negative_delta_escape                         9              vlclbf
        while (next_bits() == positive_delta_escape)
            positive_delta_escape                         9              vlclbf
        pos_delta_code                                    1 - 11         vlclbf
    }
}

C.1.9 Motion Vector

motion_vector() {                                         No. of bits    Mnemonic
    motion_horizontal_code                                1 - 11         vlclbf
    if ((horizontal_f != 1) &&
        (motion_horizontal_code != 0))
        motion_horizontal_r                               1 - 8          uimsbf
    motion_vertical_code                                  1 - 11         vlclbf
    if ((vertical_f != 1) &&
        (motion_vertical_code != 0))
        motion_vertical_r                                 1 - 8          uimsbf
}

C.1.10 Error Residual Data

error_residual_data() {                                   No. of bits    Mnemonic
    error_residual_start_code                             32             bslbf
    do {
        error_slice()
    } while (next_bits() == error_slice_start_code)
}

C.1.11 Error Slice

error_slice() {                                           No. of bits    Mnemonic
    error_slice_start_code                                32             bslbf
    quantizer_scale_code                                  5              uimsbf
    do {
        residual_macroblock()
    } while (next_bits() != start_code)
    next_start_code()
}

C.1.12 Residual Macroblock

residual_macroblock() {                                   No. of bits    Mnemonic
    while (next_bits() == macroblock_escape)
        macroblock_escape                                 11             vlclbf
    macroblock_address_increment                          1 - 11         vlclbf
    residual_macroblock_type                              1 - 3          vlclbf
    if (macroblock_quant)
        quantizer_scale_code                              5              uimsbf
    if (macroblock_pattern)
        coded_block_pattern_420                           3 - 9          vlclbf
    for (i = 0; i < block_count; i++) {
        block(i)
    }
}

C.1.13 Block

block(i) {                                                No. of bits    Mnemonic
    if (pattern_code[i]) {
        if (macroblock_intra) {
            if (i < 4) {
                dct_dc_size_luminance                     2 - 9          vlclbf
                if (dct_dc_size_luminance != 0)
                    dct_dc_differential                   1 - 11         uimsbf
            } else {
                dct_dc_size_chrominance                   2 - 10         vlclbf
                if (dct_dc_size_chrominance != 0)
                    dct_dc_differential                   1 - 11         uimsbf
            }
        } else {
            First DCT coefficient                         ...            vlclbf
        }
        while (next_bits() != End_of_block)
            Subsequent DCT coefficients                   ...            vlclbf
        End_of_block                                      ...            vlclbf
    }
}
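The syntax functions above are written in terms of the same primitives as the MPEG drafts: next_bits() examines upcoming bits without consuming them, while fixed length uimsbf and bslbf fields are read and consumed. A minimal bit reader of the kind these tables presuppose is sketched below; it assumes the whole bitstream is held in memory and is not taken from the thesis software.

#include <stddef.h>

typedef struct {
    const unsigned char *data;   /* whole bitstream, most significant bit first */
    size_t               len;    /* length in bytes                             */
    size_t               pos;    /* current position, in bits from the start    */
} BitReader;

/* Return the next n bits (n <= 32) without advancing, zero padded past the end. */
static unsigned long next_bits(const BitReader *br, unsigned n)
{
    unsigned long v = 0;
    size_t bit = br->pos;
    unsigned i;
    for (i = 0; i < n; i++, bit++) {
        unsigned b = 0;
        if (bit / 8 < br->len)
            b = (br->data[bit / 8] >> (7 - (bit % 8))) & 1;
        v = (v << 1) | b;
    }
    return v;
}

/* Read n bits and advance; used for the fixed length uimsbf and bslbf fields. */
static unsigned long get_bits(BitReader *br, unsigned n)
{
    unsigned long v = next_bits(br, n);
    br->pos += n;
    return v;
}

/* Byte alignment, as required before scanning for the next start code. */
static void byte_align(BitReader *br)
{
    br->pos = (br->pos + 7) & ~(size_t)7;
}

With these helpers, a loop such as the one in error_residual_data() reads its 32-bit start code with get_bits(br, 32) and keeps calling error_slice() while next_bits(br, 32) still matches error_slice_start_code.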
C.2 Variable Length Code Tables

The variable length codes described here have been derived by either reducing, or making additions to, similar tables present in the MPEG standard.
- Table C.1, pos_delta_type, is the exception to this rule and was derived by inspection.
- Table C.2, residual_macroblock_type for MODEL frames, was derived from the macroblock_type table for I frames by extracting the applicable modes and truncating unnecessarily long VLC codes through the removal of leading zeros.
- Table C.3, residual_macroblock_type for MODEL_DELTA frames, was derived from the macroblock_type table for P frames by extracting the applicable modes and truncating unnecessarily long VLC codes through the removal of leading zeros.
- Table C.4, pos_delta_code, was derived from the motion_code VLC table by adding two escape codes to allow deltas in excess of +/- 16. The use of the escape codes is identical to that used in MPEG2 for the case where the macroblock_address_increment is greater than 33.

TABLE C.1. Variable length codes for pos_delta_type

    VLC code    Description
    1           fg_block_skip
    01          bg_block_skip, relative to start of previous row
    001         bg_block_skip

TABLE C.2. Variable length codes for residual_macroblock_type in model frames

    VLC code    Description
    1           Intra
    01          Intra, Quant

TABLE C.3. Variable length codes for residual_macroblock_type in model delta frames

    VLC code    Description
    1           Inter
    011         Intra
    001         Inter, Quant
    0001        Intra, Quant

TABLE C.4. Variable length codes for pos_delta_code

    VLC code         pos_delta_code
    0000 0001 1      negative_delta_escape
    0000 0011 001    -16
    0000 0011 011    -15
    0000 0011 101    -14
    0000 0011 111    -13
    0000 0100 001    -12
    0000 0100 011    -11
    0000 0100 11     -10
    0000 0101 01     -9
    0000 0101 11     -8
    0000 0111        -7
    0000 1001        -6
    0000 1011        -5
    0000 111         -4
    0001 1           -3
    0011             -2
    011              -1
    1                0
    010              1
    0010             2
    0001 0           3
    0000 110         4
    0000 1010        5
    0000 1000        6
    0000 0110        7
    0000 0101 10     8
    0000 0101 00     9
    0000 0100 10     10
    0000 0100 010    11
    0000 0100 000    12
    0000 0011 110    13
    0000 0011 100    14
    0000 0011 010    15
    0000 0011 000    16
    0000 0001 0      positive_delta_escape
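All of the codes above are prefix-free, so they can be decoded by consuming one bit at a time until a complete codeword has been seen. The sketch below is illustrative only: it assumes a caller-supplied single-bit reader (any reader, such as the one sketched at the end of section C.1, can be wrapped to provide it) and decodes the three-entry pos_delta_type code of Table C.1; the enum and function names are not taken from the thesis software.

/* A caller-supplied function that consumes and returns the next bit (0 or 1). */
typedef int (*NextBitFn)(void *stream);

typedef enum {
    FG_BLOCK_SKIP,              /* VLC "1"                                 */
    BG_BLOCK_SKIP_PREV_ROW,     /* VLC "01"                                */
    BG_BLOCK_SKIP,              /* VLC "001"                               */
    POS_DELTA_TYPE_ERROR        /* "000...": not a valid codeword          */
} PosDeltaType;

/* Decode pos_delta_type (Table C.1) by consuming bits until a codeword matches. */
static PosDeltaType decode_pos_delta_type(NextBitFn next_bit, void *stream)
{
    if (next_bit(stream))
        return FG_BLOCK_SKIP;
    if (next_bit(stream))
        return BG_BLOCK_SKIP_PREV_ROW;
    if (next_bit(stream))
        return BG_BLOCK_SKIP;
    return POS_DELTA_TYPE_ERROR;
}

Table C.4 is decoded in the same bit-at-a-time fashion, with the two escape codes extending the range beyond +/- 16 in the same manner that macroblock_escape extends macroblock_address_increment in MPEG.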
References

[1] ISO/IEC DIS 10918-1, ITU-T Rec. T.81 (JPEG), "Information Technology - Digital compression and coding of continuous-tone still images", 1992.
[2] ISO/IEC 11172 (1993), "Information technology - Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s".
[3] ISO/IEC CD 13818-2, Recommendation H.262, "Generic Coding of Moving Pictures and Associated Audio", Committee Draft, Seoul, November 1993.
[4] E.H. Adelson, "Layered Representations for Image Coding", MIT Media Lab, Vision & Modelling Group Technical Report No. 181, December 1992.
[5] D.H. Ballard and C.M. Brown, Computer Vision, Prentice Hall, 1982.
[6] J.R. Bergen and R. Hingorani, "Hierarchical Motion-Based Frame Rate Conversion", April 1990.
[7] V.M. Bove Jr., "Entropy-based depth from focus", J. Opt. Soc. Am. A, Vol. 10, No. 4, April 1993.
[8] H. Brusewitz, "Motion compensation with triangles", in Proc. 3rd International Conf. on 64 kbit Coding of Moving Video, free session, September 1990.
[9] R. Clarke, Transform Coding of Images, Academic Press, 1985.
[10] J. Foley, A. van Dam, S. Feiner and J. Hughes, Computer Graphics, Addison Wesley, 1990.
[11] E.W. Forgy, "Cluster analysis of multivariate data: efficiency vs. interpretability of classifications", abstract, Biometrics, Vol. 21, 1965.
[12] D. LeGall, "MPEG: A Video Compression Standard for Multimedia Applications", Communications of the ACM, Vol. 34, No. 4, April 1991, pp. 46-58.
[13] H. Holtzman, "Three-Dimensional Representations of Video using Knowledge Based Estimation", Master's Thesis, MIT Media Lab, 1991.
[14] B.K.P. Horn, Robot Vision, The MIT Press/McGraw-Hill, 1986.
[15] D.A. Huffman, "A Method for the Construction of Minimum Redundancy Codes", Proceedings of the IRE, Vol. 40, 1952, pp. 1098-1101.
[16] M. Irani and S. Peleg, "Image Sequence Enhancement Using Multiple Motions Analysis", Dept. of Computer Science, The Hebrew University of Jerusalem, Israel, Technical Report 91-15, December 1991.
[17] D.H. Kelly, "Visual responses to time-dependent stimuli", J. Opt. Soc. Am., Vol. 51, pp. 422-429, 1961.
[18] P. McLean, "Structured Video Coding", Master's Thesis, MIT Media Lab, 1992.
[19] J.S. Lim, Two-Dimensional Signal and Image Processing, Prentice Hall, 1993.
[20] A.B. Lippman and R.G. Kermode, "Generalized Predictive Coding of Movies", Picture Coding Symposium (PCS '93), Lausanne, Switzerland, March 1993.
[21] H.G. Musmann, M. Hotter and J. Ostermann, "Object-Oriented Analysis-Synthesis Coding of Moving Images", Signal Processing: Image Communication, Vol. 1, No. 2, October 1989.
[22] Y. Nakaya and H. Harashima, "Motion Compensation at Very Low Rates", Picture Coding Symposium (PCS '93), Lausanne, Switzerland, March 1993.
[23] A.N. Netravali and B.G. Haskell, Digital Pictures, Plenum Press, New York, 1988.
[24] S. Okubo, "Requirements for high quality video coding standards", Signal Processing, Vol. 4, No. 2, April 1992.
[25] W.F. Schreiber, Fundamentals of Electronic Imaging Systems, Springer-Verlag, Heidelberg, 1986.
[26] J.B. Stampleman, "Scalable Video Compression", Master's Thesis, MIT Media Lab, 1992.
[27] G.J. Sullivan, "Multiple-Hypothesis Motion Compensation for Low Bit-Rate Video Coding", IEEE ICASSP '93, 1993.
[28] L. Teodosio, "Salient Stills", Master's Thesis, MIT Media Lab, 1992.
[29] G.K. Wallace, "The JPEG Compression Standard", Communications of the ACM, Vol. 34, No. 4, April 1991, pp. 30-44.
[30] J.Y.A. Wang, E.H. Adelson, and U. Desai, "Applying Mid-level Vision Techniques for Video Data Compression and Manipulation", Proceedings of the SPIE: Digital Video Compression on Personal Computers: Algorithms and Technologies, Vol. 2187, San Jose, February 1994.

Acknowledgments

So many people, so much help, for so long a time...

Andy Lippman, for periodically raking me over the coals and challenging me to think.

Edward Adelson and Michael Biggar, for being my readers and putting up with email messages and Fed Ex-ed copies to Florida.

Walter Bender and Constantine Sapuntzakis, for help with the Salient Stills and affine warps.

Henry Holtzman, for answering my many, many dumb questions, keeping the alpha alive, and finally for being a really great guy.

Linda Peterson, for helping to de-pig garbled verbiage and purge creative punctuation.

Gillian Galloway and Jill Toffa, my spies on the second floor who kept me informed of what was really going on.

Mike Bove, Shawn Becker and John "Wad" Watlington, for countless clarifications and helpful pointers.

Eng Khor and Joe Kang, for their help in writing the MPEG encoder and decoder.

Mark, for putting up with me not only as an office mate but also as a room mate.

Cris, for being a true friend and a source of inspiration and understanding in times of stress.