High Resolution Images from a Sequence of Low Resolution and Compressed Observations: A Review

C. A. Segall, R. Molina, and A. K. Katsaggelos

This work has been partially supported by the "Comisión Nacional de Ciencia y Tecnología" under contract TIC2000-1275. A. K. Katsaggelos and C. A. Segall are with the Department of Electrical and Computer Engineering, Northwestern University, Evanston, Illinois 60208-3118. Rafael Molina is with the Departamento de Ciencias de la Computación e I.A., Universidad de Granada, 18071 Granada, Spain.

I. Introduction

Super-resolution is the task of estimating high resolution images from a set of low resolution observations. These observations are acquired by either multiple sensors imaging a single scene or by a single sensor imaging the scene over a period of time. No matter the source of the observations, the critical requirement for super-resolution is that they contain different but related views of the scene. For static scenes, this requires sub-pixel displacements between the multiple sensors or within the motion of the single camera. For dynamic scenes, the necessary shifts are introduced by the motion of objects. Notice that super-resolution does not consider the case of a static scene acquired by a stationary camera; that application is addressed by the field of image interpolation.

A wealth of research considers modeling the acquisition and degradation of the low resolution frames and then solving the resulting high resolution problem. (In this paper, we use the terms super-resolution, high resolution, and resolution enhancement interchangeably.) Literature reviews are presented in [1], [2] as well as in this special issue. Work traditionally addresses the resolution enhancement of frames that are filtered and down-sampled during acquisition and corrupted by additive noise during transmission and storage. In this paper, however, we review approaches for the super-resolution of compressed video. Hybrid motion-compensation and transform coding methods are the focus, which encompasses the family of ITU and MPEG coding standards [3], [4]. The JPEG still image coding systems are also a special case of the approach.

The use of video compression differentiates the resulting super-resolution problem from the traditional one. As a first difference, video compression methods represent images with a sequence of motion vectors and transform coefficients. The motion vectors provide a noisy observation of the temporal relationships within the high resolution scene, a type of observation not traditionally available to the high resolution problem. The transform coefficients represent a noisy observation of the high resolution intensities. This noise results from more sophisticated processing than in the traditional scenario, as compression techniques may discard data according to its perceptual significance.

Additional problems arise with the introduction of compression. As a core requirement, resolution enhancement algorithms operate on a sequence of related but different observations. Unfortunately, maintaining this difference is not the goal of a compression system. Compression algorithms predict frames with the motion vectors and then quantize the prediction error. This discards some of the differences between frames and decreases the potential for resolution enhancement.

In the rest of this paper, we survey the field of super-resolution processing for compressed video.
The introduction of motion vectors, compression noise, and additional redundancies within the image sequence makes this problem fertile ground for novel processing methods. In conducting this survey, though, we develop and present all techniques within the Bayesian framework. This adds consistency to the presentation and facilitates comparison between the different methods.

The paper is organized as follows. In Section II, we define the acquisition system utilized by the surveyed procedures. In Section III, we formulate the high resolution problem within the Bayesian framework. In Section IV, we survey models for the acquisition and compression systems. This requires consideration of both the motion vectors and transform coefficients within the compressed bit-stream. In Section V, we survey models for the original high resolution image intensities and displacement values. In Section VI, we discuss solutions for the super-resolution problem and provide examples of several approaches. Finally, we consider future research directions in Section VII.

II. Acquisition Model and Notation

Before we can recover a high resolution image from a sequence of low resolution observations, we must be precise in describing how the two are related. We begin with the pictorial depiction of the system in Figure 1. As can be seen from the figure, a continuous (high resolution) scene is first imaged by a low resolution sensor, which filters and samples the original high resolution data. The acquired low resolution image is then compressed with a hybrid motion-compensation and transform coding scheme. The resulting bit-stream contains both motion vectors and quantized transform coefficients. Finally, the compressed bit-stream serves as the input to the resolution enhancement procedure. This super-resolution algorithm provides an estimate of the high resolution scene.

Fig. 1. An overview of the super-resolution problem. A high resolution image sequence is captured at low resolution by a camera or other acquisition system. The low resolution frames are then compressed for storage and transmission. The goal of the super-resolution algorithm is to estimate the original high resolution sequence from the compressed information.

In the figure, the high resolution data represent a time-varying scene in the image plane coordinate system. This is denoted as $f(x, y, t)$, where $x$, $y$, and $t$ are real numbers that indicate horizontal, vertical, and temporal locations. The scene is filtered and sampled during acquisition to obtain the discrete sequence $g_l(m, n)$, where $l$ is an integer time index, $1 \le m \le M$, and $1 \le n \le N$. The sequence $g_l(m, n)$ is not observable to any of the super-resolution procedures. Instead, the frames are compressed with a video compression system that results in the observable sequence $y_l(m, n)$. The compression system also provides the motion vectors $v_{l,i}(m, n)$ that predict pixel $y_l(m, n)$ from the previously transmitted $y_i(m, n)$.

Images in the figure are concisely expressed with matrix-vector notation. In this format, one-dimensional vectors represent two-dimensional images. These vectors are formed by lexicographically ordering the image by rows, which is analogous to storing the frame in raster scan format. Hence, the acquired and observed low resolution images $g_l(m, n)$ and $y_l(m, n)$ are respectively expressed by the $MN \times 1$ vectors $g_l$ and $y_l$. The motion vectors that predict $y_l$ from $y_i$ are represented by the $2MN \times 1$ vector $v_{l,i}$, which is formed by stacking the transmitted horizontal and vertical offsets.
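As a concrete illustration of this notation, consider the following minimal sketch, which vectorizes a toy frame by rows and stacks a pair of offset fields into a single motion vector. The array names and sizes are ours, chosen only for illustration.

```python
import numpy as np

M, N = 4, 6                                       # low resolution frame size
g = np.arange(M * N, dtype=float).reshape(M, N)   # a toy frame g_l(m, n)

# Lexicographic (row-by-row, raster scan) ordering gives the MN x 1 vector g_l
g_vec = g.reshape(M * N, 1)

# The ordering is invertible: reshaping recovers the frame
assert np.array_equal(g_vec.reshape(M, N), g)

# Stacking the horizontal and vertical offset fields gives the
# 2MN x 1 motion vector v_{l,i} described in the text
v_horiz = np.zeros((M, N))
v_vert = np.ones((M, N))
v = np.vstack([v_horiz.reshape(M * N, 1), v_vert.reshape(M * N, 1)])
assert v.shape == (2 * M * N, 1)
```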
Furthermore, since we utilize digital techniques to recover the high resolution scene, the high resolution frame is denoted as $f_l$. The dimension of this vector is $PM\,PN \times 1$, where $P$ represents the resolution enhancement factor.

Fig. 2. Relationships between the high resolution and low resolution images. High resolution frames are denoted as $f_l$ and are mapped to other time instances by the operator $C(d_{i,l})$. The acquisition system transforms the high resolution frames to the low resolution sequence $g_l$, which is then compressed to produce $y_l$. Notice that the compressed frames are also related through the motion vectors $v_{i,l}$.

Relationships between the original high resolution data and the detected low resolution frames are further illustrated in Figure 2. Here, we show that the low resolution image $g_l$ and high resolution image $f_l$ are related by

$$g_l = AHf_l, \quad l = 1, 2, 3, \ldots \tag{1}$$

where $H$ is a $PM\,PN \times PM\,PN$ matrix that describes the filtering of the high resolution image and $A$ is an $MN \times PM\,PN$ down-sampling matrix. The matrices $A$ and $H$ model the acquisition system and are assumed to be known. For the moment, we assume that the detector does not introduce noise.

Frames within the high resolution sequence are also related through time, as is evident in Figure 2. Here, we assume a translational relationship between the frames that is written as

$$f_l(m, n) = f_k\big(m + d^m_{l,k}(m, n),\ n + d^n_{l,k}(m, n)\big) + r_{l,k}(m, n) \tag{2}$$

where $d^m_{l,k}(m, n)$ and $d^n_{l,k}(m, n)$ denote the horizontal and vertical components of the displacement $d_{l,k}(m, n)$ that relates the pixel at time $k$ to the pixel at time $l$, and $r_{l,k}(m, n)$ accounts for errors within the model. (Noise introduced by the sensor can also be incorporated into this error term.) In matrix-vector notation, Eq. (2) becomes

$$f_l = C(d_{l,k})f_k + r_{l,k} \tag{3}$$

where $C(d_{l,k})$ is the $PM\,PN \times PM\,PN$ matrix that maps frame $f_k$ to frame $f_l$, $d_{l,k}$ is the column vector defined by lexicographically ordering the values of the displacements between the two frames, and $r_{l,k}$ is the registration noise. Note that while this displacement model is prevalent in the literature, a more general motion model could be employed.

Having considered the relationship between low resolution and high resolution images prior to compression, let us turn our attention to the compression process. During compression, frames are divided into blocks that are encoded with one of two available methods. For the first method, a linear transform such as the Discrete Cosine Transform (DCT) is applied to the block. The operator decorrelates the intensity data, and the resulting transform coefficients are independently quantized and transmitted to the decoder. For the second method, predictions for the blocks are first generated by motion compensating previously transmitted image frames. The compensation is controlled by motion vectors that define the spatial and temporal offset between the current block and its prediction. The prediction is then refined by computing the prediction error, transforming it with a linear transform, quantizing the transform coefficients, and transmitting the quantized information.
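Before examining the elements of Figure 3, the following sketch illustrates the acquisition relationships of Eqs. (1) and (3) on a toy frame. The Gaussian blur standing in for $H$, the single global shift standing in for $C(d_{l,k})$, and the factor $P = 2$ are all assumptions of ours, not choices mandated by the surveyed methods.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, shift

P = 2                               # resolution enhancement factor (assumed)
PM, PN = 16, 16                     # high resolution frame size

rng = np.random.default_rng(0)
f_k = rng.random((PM, PN))          # high resolution frame f_k

# C(d_{l,k}): the translational warp of Eqs. (2)-(3); a single global
# displacement of (0.5, 0.25) pixels stands in for the dense field d_{l,k}
f_l = shift(f_k, (0.5, 0.25), order=1, mode='nearest')

# H: the sensor filtering, modeled here as a Gaussian point spread function
Hf = gaussian_filter(f_l, sigma=1.0)

# A: down-sampling by P in each direction yields the acquired frame of Eq. (1)
g_l = Hf[::P, ::P]
assert g_l.shape == (PM // P, PN // P)
```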
Elements of the video compression procedure are shown in Figure 3. In the figure, we show an original image frame and its transformed and quantized representation in 3(a) and 3(b), respectively. This represents the first type of compression method, called intra-coding, which also encompasses still image coding methods such as the JPEG procedures. The second form of compression is illustrated in Figure 4, where the original and previously compressed image frames appear in 4(a) and 4(b), respectively. This is often called inter-coding, where the original frame is first predicted from the previously compressed data. Motion vectors for the prediction are shown in 4(c), and the motion compensated estimate appears in 4(d). The difference between the estimate and the original frame results in the displaced frame difference (or error residual), which is transformed with the DCT and quantized by the encoder. The quantized displaced frame difference for 4(a) and 4(d) appears in 4(e). At the encoder, the motion compensated estimate and quantized displaced frame difference are combined to create the decoded frame appearing in 4(b).

Fig. 3. Intra-coding example. The image in (a) is transformed and quantized to produce the compressed frame in (b).

Fig. 4. Inter-coding example. The image in (a) is inter-coded to generate the compressed frame in (b). The process begins by finding the motion vectors in (c), which generate the motion compensated prediction in (d). The difference between the prediction and input image is then computed, transformed, and quantized. The resulting residual appears in (e) and is added to the prediction in (d) to generate the compressed frame in (b).

From the above discussion, we define the relationship between the acquired low resolution frame and its compressed observation as

$$y_l = T^{-1}\Big\{ Q\Big[ T\big( g_l - MC_l(y_l^P, v_l) \big) \Big] \Big\} + MC_l(y_l^P, v_l), \quad l = 1, 2, 3, \ldots \tag{4}$$

where $Q[\cdot]$ represents the quantization procedure, $T$ and $T^{-1}$ are the forward and inverse transform operations, respectively, $MC_l(y_l^P, v_l)$ is the motion compensated prediction of $g_l$ formed by motion compensating previously decoded frame(s) as defined by the encoding method, and $y_l^P$ and $v_l$ denote the set of decoded frames and motion vectors that predict $y_l$, respectively. We want to make clear here that $MC_l$ depends on $v_l$ and only a subset of $y$. For example, when a bit-stream contains a sequence of P-frames, then $y_l^P = y_{l-1}$ and $v_l = v_{l,l-1}$. However, as there is a trend towards increased complexity and non-causal predictions within the motion compensation procedure, we keep the above notation for generality.

With a definition for the compression system, we can now be precise in describing the relationship between the high resolution frames and the low resolution observations. Combining Eqs. (1), (3) and (4), the acquisition system depicted in Figure 1 is denoted as

$$y_l = AHC(d_{l,k})f_k + e_{l,k} \tag{5}$$

where $e_{l,k}$ includes the errors introduced during compression, registration, and acquisition.
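To make Eq. (4) concrete, here is a minimal sketch of one block of the hybrid coder, using an orthonormal DCT for $T$ and uniform rounding for $Q$ (the rounding rule is made precise in Eq. (9) below). The block size, quantization step, and the toy motion compensated prediction are our own illustrative choices.

```python
import numpy as np
from scipy.fft import dctn, idctn

def decode_block(g_block, mc_block, q_step):
    """Eq. (4) for one block: transform the prediction error,
    quantize it, invert the transform, and add back the prediction."""
    residual = g_block - mc_block                    # g_l - MC_l(y_l^P, v_l)
    coeffs = dctn(residual, norm='ortho')            # T(.)
    q_coeffs = q_step * np.round(coeffs / q_step)    # Q[.]
    return idctn(q_coeffs, norm='ortho') + mc_block  # T^{-1}{.} + MC_l(.)

rng = np.random.default_rng(1)
g = rng.random((8, 8))                        # acquired block
mc = g + 0.05 * rng.standard_normal((8, 8))   # hypothetical prediction
y = decode_block(g, mc, q_step=0.1)
print(np.abs(y - g).max())   # y differs from g only by quantization noise
```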
III. Problem Formulation

With the acquisition system defined, we now formulate the super-resolution reconstruction problem. The reviewed methods utilize a window of low resolution compressed observations to estimate a single high resolution frame. Thus, the goal is to estimate the high resolution frame $f_k$ and displacements $d$ given the decoded intensities $y$ and motion vectors $v$. Here, the displacements between $f_k$ and all of the frames within the processing window are encapsulated in $d$, as $d = \{d_{l,k} \mid l = k - T_B, \ldots, k + T_F\}$, where $T_F + T_B + 1$ establishes the number of frames in the window. Similarly, all of the decoded observations and motion vectors within the processing window are contained in $y$ and $v$, as $y = \{y_l \mid l = k - T_B, \ldots, k + T_F\}$ and $v = \{v_l \mid l = k - T_B, \ldots, k + T_F\}$.

We pursue the estimate within the Bayesian paradigm. Therefore, the goal is to find $\hat{f}_k$, an estimate of $f_k$, and $\hat{d}$, an estimate of $d$, such that

$$\hat{f}_k, \hat{d} = \arg\max_{f_k, d}\ P(f_k, d)\, P(y, v \mid f_k, d) \tag{6}$$

In the expression, $P(y, v \mid f_k, d)$ provides a mechanism to incorporate the compressed bit-stream into the enhancement procedure, as it describes the probabilistic modeling of the process that produces $y$ and $v$ from $f_k$ and $d$. Similarly, $P(f_k, d)$ allows for the integration of prior knowledge about the original high resolution scene and displacements. This is somewhat simplified in applications where the displacement values are assumed known, as the super-resolution estimate becomes

$$\hat{f}_k = \arg\max_{f_k}\ P(f_k)\, P(y, v \mid f_k, d) \tag{7}$$

where $d$ contains the previously found displacements.

IV. Modeling the Observation

Having presented the super-resolution problem within the Bayesian framework, we now consider the probability distributions in Eq. (6). We begin with the distribution $P(y, v \mid f_k, d)$ that models the relationship between the original high resolution intensities and displacements and the decoded intensities and motion vectors. For the purposes of this review, it is rewritten as

$$P(y, v \mid f_k, d) = \prod_l P(y_l \mid f_k, d)\, P(v_l \mid f_k, d, y) \tag{8}$$

where $P(y_l \mid f_k, d)$ is the distribution of the noise introduced by quantizing the transform coefficients and $P(v_l \mid f_k, d, y)$ expresses any information derived from the motion vectors. Note that Eq. (8) assumes independence between the decoded intensities and motion vectors throughout the image sequence. This is well motivated when the encoder selects the motion vectors and quantization intervals without regard to the future bit-stream. Any dependence between these two quantities should be considered in future work.

A. Quantization Noise

To understand the structure of compression errors, we need to model the degradation process $Q$ in Eq. (4). This is a non-linear operation that discards data in the transform domain, and it is typically realized by dividing each transform coefficient by a quantization scale factor and then rounding the result. The procedure is expressed as

$$[Ty_k](i) = q(i)\,\mathrm{Round}\!\left( \frac{[Tg_k](i)}{q(i)} \right) \tag{9}$$

where $[Ty_k](i)$ denotes the $i$th transform coefficient of the compressed frame $y_k$, $q(i)$ is the quantization factor for coefficient $i$, and $\mathrm{Round}(\cdot)$ is an operator that maps each value to the nearest integer.

Two prominent models for the quantization noise appear in the super-resolution literature. The first follows from the fact that quantization errors are bounded by the quantization scale factor, that is,

$$-\frac{q(i)}{2} \le [Ty_k](i) - [Tg_k](i) \le \frac{q(i)}{2} \tag{10}$$

according to Eq. (9). Thus, it seems reasonable that the recovered high resolution image (when mapped to low resolution) has transform coefficients within the same interval. This is often called the quantization constraint, and it is enforced with the distribution

$$P_1(y_l \mid f_k, d) = \begin{cases} \text{const} & \text{if } -\frac{q(i)}{2} \le \big[T\big(AHC(d_{l,k})f_k - MC_l(y_l^P, v_l)\big)\big](i) \le \frac{q(i)}{2}\ \ \forall i \\ 0 & \text{elsewhere} \end{cases} \tag{11}$$

As we are working within the Bayesian framework, we note that this states that the quantization errors are uniformly distributed within the quantization interval. Thus, $[T(y_k - g_k)](i) \sim U[-q(i)/2,\ q(i)/2]$, and so $E\{[T(y_k - g_k)](i)\} = 0$ and $\mathrm{var}\{[T(y_k - g_k)](i)\} = q(i)^2/12$.
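The sketch below illustrates the constraint of Eqs. (10)-(11) on one block: a candidate residual, standing in for $AHC(d_{l,k})f_k - MC_l(y_l^P, v_l)$, satisfies the constraint only if its transform coefficients fall within $\pm q/2$ of the transmitted ones, and is otherwise clipped back onto the constraint set. The clipping step anticipates the POCS projection discussed in Section VI; all names and values are ours.

```python
import numpy as np
from scipy.fft import dctn, idctn

def project_to_constraint(candidate, y_coeffs, q):
    """Clip the DCT coefficients of a candidate residual into the
    interval of Eq. (10), enforcing the constraint of Eq. (11)."""
    c = dctn(candidate, norm='ortho')
    c = np.clip(c, y_coeffs - q / 2.0, y_coeffs + q / 2.0)
    return idctn(c, norm='ortho')

rng = np.random.default_rng(2)
q = 0.1
true_residual = rng.random((8, 8))
y_coeffs = q * np.round(dctn(true_residual, norm='ortho') / q)  # Eq. (9)

candidate = true_residual + 0.2 * rng.standard_normal((8, 8))
projected = project_to_constraint(candidate, y_coeffs, q)

# After projection, every coefficient satisfies Eq. (10)
err = dctn(projected, norm='ortho') - y_coeffs
assert np.all(np.abs(err) <= q / 2.0 + 1e-9)
```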
Several authors employ the quantization constraint for super-resolution processing. For example, it is utilized by Altunbasak et al. [5], [6], Gunturk et al. [7], Patti and Altunbasak [8], and Segall et al. [9], [10]. With the exception of [7], quantization is considered to be the sole source of noise within the acquisition system, which simplifies the construction of Eq. (11). However, since the distribution $P_1(y_l \mid f_k, d)$ in Eq. (11) is not differentiable, care must still be taken when finding the high resolution estimate. This is addressed later in this paper.

The second model for the quantization noise is constructed in the spatial domain. This is appealing, as it motivates a Gaussian distribution that is differentiable. To understand this motivation, consider the following. First, the quantization operator in Eq. (9) quantizes each transform coefficient independently. Thus, noise in the transform domain is not correlated between transform indices. Second, the transform operator is linear. With these two conditions, quantization noise in the spatial domain becomes a linear sum of independent noise processes. The resulting distribution tends to be Gaussian, and it is expressed as [11]

$$P_2(y_l \mid f_k, d) \propto \exp\left[ -\frac{1}{2} \big(y_l - AHC(d_{l,k})f_k\big)^T K_Q^{-1} \big(y_l - AHC(d_{l,k})f_k\big) \right] \tag{12}$$

where $K_Q$ is a covariance matrix that describes the noise.

The normal approximation for the quantization noise appears in work by Chen and Schultz [12], Gunturk et al. [13], Mateos et al. [14], [15], Park et al. [16], [17], and Segall et al. [9], [18], [19]. A primary difference between these efforts lies in the definition and estimation of the covariance matrix. For example, a white noise model is assumed by Chen and Schultz and by Mateos et al., while Gunturk et al. develop the distribution experimentally. Segall et al. consider a high bit-rate approximation for the quantization noise. Lower rate compression scenarios are addressed by Park et al., where the covariance matrix and high resolution frame are estimated simultaneously.

In concluding this sub-section, we mention that the spatial domain noise model also incorporates errors introduced by the sensor and motion models. This is accomplished by modifying the covariance matrix $K_Q$ [20]. Interestingly, since these errors are often independent of the quantization noise, incorporating the additional noise components further motivates the Gaussian model.
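As an illustration of the second model, the sketch below evaluates the log of the Gaussian likelihood in Eq. (12) for a candidate frame under the white noise covariance assumed by Chen and Schultz and by Mateos et al. The block-averaging forward operator and the variance value (that of a uniform variable, $q^2/12$) are placeholders of ours for $AHC(d_{l,k})$ and $K_Q$.

```python
import numpy as np

def gaussian_log_likelihood(y, f_candidate, forward, kq_diag):
    """log P_2(y_l | f_k, d) of Eq. (12), up to an additive constant,
    for a diagonal (white noise) covariance K_Q."""
    r = y - forward(f_candidate)          # y_l - AHC(d_{l,k}) f_k
    return -0.5 * np.sum(r * r / kq_diag)

def forward_op(f):
    """Toy stand-in for AHC(d_{l,k}): 2x2 block averaging plus 2x decimation."""
    return 0.25 * (f[0::2, 0::2] + f[0::2, 1::2] + f[1::2, 0::2] + f[1::2, 1::2])

rng = np.random.default_rng(3)
f_true = rng.random((16, 16))
sigma2 = 0.1 ** 2 / 12.0              # variance of U[-q/2, q/2] with q = 0.1
y = forward_op(f_true) + np.sqrt(sigma2) * rng.standard_normal((8, 8))

print(gaussian_log_likelihood(y, f_true, forward_op, sigma2))
```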
B. Motion Vectors

Incorporating the quantization noise is a major focus of much of the super-resolution for compressed video literature. However, it is also reasonable to use the motion vectors, $v$, within the estimates for $\hat{f}_k$ and $d$. These motion vectors introduce a departure from traditional super-resolution techniques. In traditional approaches, the observed low resolution images provide the only source of information about the relationship between high resolution frames. When compression is introduced, though, the motion vectors provide an additional observation of the displacement values. This information differs from what is conveyed by the decoded intensities.

There are several methods that exploit the motion vectors during resolution enhancement. At a high level, though, each tries to model some similarity between the transmitted motion vectors and the actual high resolution displacements. For example, Chen and Schultz [12] constrain the motion vectors to be within a region surrounding the actual sub-pixel displacements. This is accomplished with the distribution

$$P_1(v_l \mid f_k, d, y) = \begin{cases} \text{const} & \text{if } |v_{l,i}(j) - [A_D d_{l,i}](j)| \le \Delta,\ i \in PS,\ \forall j \\ 0 & \text{elsewhere} \end{cases} \tag{13}$$

where $A_D$ is a matrix that maps the displacements to the low resolution grid, $\Delta$ denotes the maximum difference between the transmitted motion vectors and estimated displacements, $PS$ represents the set of previously compressed frames employed to predict $f_k$, and $[A_D d_{l,i}](j)$ is the $j$th element of the vector $A_D d_{l,i}$. Similarly, Mateos et al. [15] utilize the distribution

$$P_2(v_l \mid f_k, d, y) \propto \exp\left[ -\frac{\gamma_l}{2} \sum_{i \in PS} \| v_{l,i} - A_D d_{l,i} \|^2 \right] \tag{14}$$

where $\gamma_l$ specifies the similarity between the transmitted and estimated information.

There are two disadvantages to modeling the motion vectors and high resolution displacements as similar throughout the frame. First, the significance of the motion vectors depends on the underlying compression ratio, which typically varies within the frame. Second, the quality of the motion vectors is dictated by the underlying intensity values. Segall et al. [18], [19] account for these errors by modeling the displaced frame difference within the encoder. This incorporates the motion vectors and is written as

$$P_3(v_l \mid f_k, d, y) \propto \exp\left[ -\frac{1}{2} \big(MC_l(y_l^P, v_l) - AHC(d_{l,k})f_k\big)^T K_{MV}^{-1} \big(MC_l(y_l^P, v_l) - AHC(d_{l,k})f_k\big) \right] \tag{15}$$

where $K_{MV}$ is the covariance matrix for the prediction error between the original frame and its motion compensated estimate $MC_l(y_l^P, v_l)$. Estimates for $K_{MV}$ are derived from the compressed bit-stream and therefore reflect the amount of compression.
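To contrast the two styles of motion vector model, the sketch below evaluates the hard constraint of Eq. (13) and the quadratic penalty of Eq. (14) for one frame. For simplicity, the mapping $A_D$ is taken as the identity, and the values of $\Delta$ and $\gamma_l$ are our own.

```python
import numpy as np

def satisfies_hard_constraint(v, d_low, delta):
    """Eq. (13): accept only if every motion vector component lies
    within +/- delta of the mapped displacement estimate."""
    return bool(np.all(np.abs(v - d_low) <= delta))

def quadratic_penalty(v, d_low, gamma):
    """Negative log of Eq. (14), up to a constant: a soft preference
    for displacements that agree with the transmitted vectors."""
    return 0.5 * gamma * np.sum((v - d_low) ** 2)

rng = np.random.default_rng(4)
v = rng.integers(-4, 5, size=(2, 20)).astype(float) / 2.0  # half-pel vectors
d_low = v + 0.2 * rng.standard_normal(v.shape)             # candidate A_D d

print(satisfies_hard_constraint(v, d_low, delta=0.5))  # binary accept/reject
print(quadratic_penalty(v, d_low, gamma=4.0))          # graded cost
```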
V. Modeling the Original Sequence

We now consider the second distribution in Eq. (6), namely $P(f_k, d)$. This distribution contains a priori knowledge about the high resolution intensities and displacements. In the literature, it is assumed that this information is independent. Thus, we write

$$P(f_k, d) = P(f_k)\, P(d) \tag{16}$$

for the remainder of this survey.

Several researchers ignore the construction of the a priori densities, focusing instead on other facets of the super-resolution problem. For example, portions of the literature solely address the modeling of compression noise, e.g., [8], [7], [5]. This is equivalent to using the non-informative prior for both the original high resolution image and displacement data, so that

$$P(f_k) \propto \text{const} \quad \text{and} \quad P(d) \propto \text{const} \tag{17}$$

In these approaches, the noise model determines the high resolution estimate, and resolution enhancement becomes a maximum likelihood problem. Since the problem is ill-posed, though, care must be exercised so that the approach does not become unstable or noisy.

A. Intensities

Prior distributions for the intensity information are motivated by the following two statements. First, it is assumed that pixels in the original high resolution images are correlated. This is justified for the majority of acquired images, as scenes usually contain a number of smooth regions with (relatively) few edge locations. Second, it is assumed that the original images are devoid of compression errors. This is also a reasonable statement. Video coding often results in structured errors, such as blocking artifacts, which rarely occur in uncompressed image frames.

To encapsulate the statements that images are correlated and absent of blocking artifacts, the prior distribution

$$P(f_k) \propto \exp\left[ -\frac{\lambda_1}{2} \| Q_1 f_k \|^2 - \frac{\lambda_2}{2} \| Q_2 AH f_k \|^2 \right] \tag{18}$$

is utilized in [9] (and references therein). Here, $Q_1$ represents a linear high-pass operation that penalizes super-resolution estimates that are not smooth, $Q_2$ represents a linear high-pass operator that penalizes estimates with block boundaries, and $\lambda_1$ and $\lambda_2$ control the influence of the norms. A common choice for $Q_1$ is the discrete 2D Laplacian; a common choice for $Q_2$ is the simple difference operation applied at the boundary locations. Other distributions could also be incorporated into the estimation procedure. For example, Huber's function could replace the quadratic norm. This is discussed in [12].

B. Displacements

The non-informative prior in Eq. (17) is the most common distribution for $P(d)$ in the literature. However, explicit models for the displacement values have recently been presented in [18], [19]. There, the displacement information is assumed to be independent between frames, so that

$$P(d) = \prod_l P(d_l) \tag{19}$$

Then, the displacements within each frame are assumed to be smooth and absent of coding artifacts. To penalize these errors, the displacement prior is given by

$$P(d_l) \propto \exp\left[ -\frac{\lambda_3}{2} \| Q_3 d_l \|^2 \right] \tag{20}$$

where $Q_3$ is a linear high-pass operator, and $\lambda_3$ is the inverse of the noise variance of the normal distribution. The discrete 2D Laplacian is typically selected for $Q_3$.

VI. Realization of the Super-Resolution Methods

With the super-resolution for compressed video techniques summarized by the previous distributions, we turn our attention to computing the enhanced frame. Formally, this requires the solution of Eq. (6), where we estimate the high resolution intensities and displacements given some combination of the proposed distributions. The joint estimate is found by taking logarithms of Eq. (6) and solving

$$\hat{f}_k, \hat{d} = \arg\max_{f_k, d}\ \log P(f_k, d)\, P(y, v \mid f_k, d) \tag{21}$$

with a combination of gradient descent, non-linear projection, and full-search methods. Scenarios where $d$ is already known or separately estimated are a special case of the resulting procedure.

One way to evaluate Eq. (21) is with the cyclic coordinate descent procedure [21]. With this approach, an estimate for the displacements is first found by assuming that the high resolution image is known, so that

$$\hat{d}^{q+1} = \arg\max_{d}\ \log P(d)\, P(y, v \mid \hat{f}_k^q, d) \tag{22}$$

where $q$ is the iteration index for the joint estimate. (For the case where $d$ is known, Eq. (22) becomes $\hat{d}^{q+1} = d\ \forall q$.) The intensity information is then estimated by assuming that the displacement estimates are exact, that is,

$$\hat{f}_k^{q+1} = \arg\max_{f_k}\ \log P(f_k)\, P(y, v \mid f_k, \hat{d}^{q+1}) \tag{23}$$

The displacement information is re-estimated with the result from Eq. (23), and the process iterates until convergence. The remaining question is how to solve Eqs. (22) and (23) for the distributions presented in the previous sections. This is considered in the following subsections.
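The alternation of Eqs. (22) and (23) has the following overall shape; estimate_displacements and estimate_intensities are placeholders for the solvers developed in the subsections below, and the stopping rule is a simple fixed-point test of our own choosing.

```python
import numpy as np

def cyclic_coordinate_descent(f_init, d_init, estimate_displacements,
                              estimate_intensities, max_iter=20, tol=1e-4):
    """Alternate the displacement update of Eq. (22) and the intensity
    update of Eq. (23) until the frame estimate stops changing."""
    f_hat, d_hat = f_init, d_init
    for q in range(max_iter):
        d_hat = estimate_displacements(f_hat)   # Eq. (22)
        f_new = estimate_intensities(d_hat)     # Eq. (23)
        if np.linalg.norm(f_new - f_hat) <= tol * np.linalg.norm(f_hat):
            return f_new, d_hat
        f_hat = f_new
    return f_hat, d_hat
```

When the displacements are known in advance, estimate_displacements simply returns them at every iteration, recovering the special case noted after Eq. (22).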
A. Finding the Displacements

As we have mentioned before, the non-informative prior in Eq. (17) is a common choice for $P(d)$. We consider its use first in developing algorithms, as the displacement estimate in Eq. (22) is simplified and becomes

$$\hat{d}^{q+1} = \arg\max_{d}\ \log P(y, v \mid \hat{f}_k^q, d) \tag{24}$$

This estimation problem is quite attractive, as displacement values for different regions of the frame are now independent. Block-matching algorithms are well suited for solving Eq. (24), and the construction of $P(y, v \mid f_k, d)$ controls the performance of the block-matching procedure. For example, Mateos et al. [15] combine the spatial domain model for the quantization noise with the distribution for $P(v \mid f_k, d, y)$ in Eq. (14). The resulting block-matching cost function is

$$\hat{d}_{l,k}^{q+1} = \arg\min_{d_{l,k}} \left[ \big(y_l - AHC(d_{l,k})\hat{f}_k^q\big)^T K_Q^{-1} \big(y_l - AHC(d_{l,k})\hat{f}_k^q\big) + \frac{\gamma_l}{2} \| v_{l,k} - A_D d_{l,k} \|^2 \right] \tag{25}$$

Similarly, Segall et al. [18], [19] utilize the spatial domain noise model and Eq. (15) for $P(v \mid f_k, d, y)$. The cost function then becomes

$$\hat{d}_{l,k}^{q+1} = \arg\min_{d_{l,k}} \Big[ \big(y_l - AHC(d_{l,k})\hat{f}_k^q\big)^T K_Q^{-1} \big(y_l - AHC(d_{l,k})\hat{f}_k^q\big) + \big(MC_l(y_l^P, v_l) - AHC(d_{l,k})\hat{f}_k^q\big)^T K_{MV}^{-1} \big(MC_l(y_l^P, v_l) - AHC(d_{l,k})\hat{f}_k^q\big) \Big] \tag{26}$$

Finally, Chen and Schultz [12] substitute the distribution for $P(v \mid f_k, d, y)$ in Eq. (13), which results in the block-matching cost function

$$\hat{d}_{l,k}^{q+1} = \arg\min_{d_{l,k} \in C_{MV}} \left[ \big(y_l - AHC(d_{l,k})\hat{f}_k^q\big)^T K_Q^{-1} \big(y_l - AHC(d_{l,k})\hat{f}_k^q\big) \right] \tag{27}$$

where $C_{MV}$ follows from Eq. (13) and denotes the set of displacements that satisfy the condition $|v_{l,k}(i) - [A_D d_{l,k}](i)| < \Delta\ \forall i$.

When the quantization constraint in Eq. (11) is combined with the non-informative $P(d)$, estimation of the displacement information is

$$\hat{d}^{q+1} = \arg\max_{d_{l,k} \in C_Q}\ \log P(v \mid \hat{f}_k^q, d, y) \tag{28}$$

where $C_Q$ denotes the set of displacements that satisfy the constraint $-\frac{q(i)}{2} \le \big[T\big(AHC(d_{l,k})\hat{f}_k - MC_l(y_l^P, v_l)\big)\big](i) \le \frac{q(i)}{2}\ \forall i$. This is a tenuous statement, as $P(v \mid f_k, d, y)$ is traditionally defined at locations with motion vectors; however, motion vectors are not transmitted for every block in the compressed video sequence. To overcome this problem, authors typically estimate $d$ separately. This is analogous to altering $P(y_k \mid f_k, d)$ when estimating the displacements.

When $P(d)$ is not described by the non-informative distribution, differential methods become common estimators for the displacements. These methods are based on the optical flow equation and are explored in Segall et al. [18], [19]. In these works, the spatial domain quantization noise model in Eq. (12) is combined with the distributions for the motion vectors in Eq. (15) and displacements in Eq. (20). The estimation problem is then expressed as

$$\hat{d}_{l,k}^{q+1} = \arg\min_{d_{l,k}} \Big[ \big(y_l - AHC(d_{l,k})\hat{f}_k^q\big)^T K_Q^{-1} \big(y_l - AHC(d_{l,k})\hat{f}_k^q\big) + \big(MC_l(y_l^P, v_l) - AHC(d_{l,k})\hat{f}_k^q\big)^T K_{MV}^{-1} \big(MC_l(y_l^P, v_l) - AHC(d_{l,k})\hat{f}_k^q\big) + \lambda_3\, d_{l,k}^T Q_3^T Q_3 d_{l,k} \Big] \tag{29}$$

Finding the displacements is accomplished by differentiating Eq. (29) with respect to $d_{l,k}$ and setting the result equal to zero. This leads to a successive approximations algorithm [18]. An alternative differential approach is utilized by Park et al. [20], [16]. In these works, the motion between low resolution frames is estimated with the block based optical flow method suggested by Lucas and Kanade [22]. Displacements are estimated for the low resolution frames in this case.
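The block-matching searches of Eqs. (25)-(27) all share the structure sketched below, shown here for an Eq. (25)-style cost with white quantization noise ($K_Q = \sigma^2 I$), so that the data term reduces to a sum of squared differences. The search range, the weights, and the omission of the $AH$ operator are simplifications of ours.

```python
import numpy as np

def block_match(y_block, f_ref, block_pos, v_mv, search=4, gamma=0.1, sigma2=1.0):
    """Exhaustive search for the integer displacement minimizing an
    Eq. (25)-style cost: data misfit plus motion vector penalty."""
    m0, n0 = block_pos
    B = y_block.shape[0]
    best_cost, best_d = np.inf, (0, 0)
    for dm in range(-search, search + 1):
        for dn in range(-search, search + 1):
            ref = f_ref[m0 + dm:m0 + dm + B, n0 + dn:n0 + dn + B]
            if ref.shape != y_block.shape:
                continue                        # candidate falls off the frame
            data = np.sum((y_block - ref) ** 2) / sigma2
            prior = 0.5 * gamma * ((dm - v_mv[0]) ** 2 + (dn - v_mv[1]) ** 2)
            if data + prior < best_cost:
                best_cost, best_d = data + prior, (dm, dn)
    return best_d

rng = np.random.default_rng(5)
f = rng.random((32, 32))
y = f[10:18, 12:20]                                  # block displaced by (2, 2)
print(block_match(y, f, (8, 10), v_mv=(2.0, 2.0)))   # expect (2, 2)
```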
B. Finding the Intensities

Methods for estimating the high resolution intensities from Eq. (23) are largely determined by the quantization noise model. For example, consider the least complicated combination of the quantization constraint in Eq. (11) with the non-informative distributions for $P(f_k)$ and $P(v \mid f_k, d, y)$. The intensity estimate is then stated as

$$\hat{f}_k^{q+1} \in F_Q \tag{30}$$

where $F_Q$ denotes the set of intensities that satisfy the constraint $-\frac{q(i)}{2} \le \big[T\big(AHC(d_{l,k}^{q+1})f_k - MC_l(y_l^P, v_l)\big)\big](i) \le \frac{q(i)}{2}\ \forall i$. Note that the solution to this problem is not unique, as the set-theoretic method only limits the magnitude of the quantization error in the system model. A frame that satisfies the constraint is therefore found with the projection onto convex sets (POCS) algorithm [23], where sources for the projection equations include [8], [9].

A different approach must be followed when incorporating the spatial domain noise model in Eq. (12). If we still assume a non-informative distribution for $P(f_k)$ and $P(v \mid f_k, d, y)$, the estimate for the high resolution intensities is

$$\hat{f}_k^{q+1} = \arg\min_{f_k} \sum_l \Big[ \big(y_l - AHC(\hat{d}_{l,k}^{q+1})f_k\big)^T K_Q^{-1} \big(y_l - AHC(\hat{d}_{l,k}^{q+1})f_k\big) \Big] \tag{31}$$

This can be found with a gradient descent algorithm [9].

Alternative combinations of $P(f_k)$ and $P(v \mid f_k, d, y)$ lead to more involved algorithms. Nonetheless, the fundamental difference lies in the choice of quantization noise model. For example, combining the distribution for $P(f_k)$ in Eq. (18) with the non-informative prior for $P(v \mid f_k, d, y)$ results in the estimate

$$\hat{f}_k^{q+1} = \arg\min_{f_k} \left\{ \frac{\lambda_1}{2} \| Q_1 f_k \|^2 + \frac{\lambda_2}{2} \| Q_2 AH f_k \|^2 - \log P(y \mid f_k, d) \right\} \tag{32}$$

Interestingly, the estimate for $f_k$ is not changed by substituting Eq. (13) or (14) for the distribution $P(v \mid f_k, d, y)$, as these distributions are non-informative to the intensity estimate. An exception is the distribution in Eq. (15) that models the displaced frame difference. When this $P(v \mid f_k, d, y)$ is combined with the model for $P(f_k)$ in Eq. (18), the high resolution estimate becomes

$$\hat{f}_k^{q+1} = \arg\min_{f_k} \Big\{ \frac{\lambda_1}{2} \| Q_1 f_k \|^2 + \frac{\lambda_2}{2} \| Q_2 AH f_k \|^2 + \sum_l \frac{1}{2} \big(MC_l(y_l^P, v_l) - AHC(\hat{d}_{l,k}^{q+1})f_k\big)^T K_{MV}^{-1} \big(MC_l(y_l^P, v_l) - AHC(\hat{d}_{l,k}^{q+1})f_k\big) - \log P(y \mid f_k, d) \Big\} \tag{33}$$

In concluding the sub-section, we utilize the estimate in Eq. (32) to compare the performance of the quantization noise models. Two iterative algorithms are employed. For the case that the quantization constraint in Eq. (11) is utilized, the estimate is found with the iteration

$$\hat{f}_k^{q+1,s+1} = P_Q\Big[ \hat{f}_k^{q+1,s} - \alpha_f \Big( \lambda_1 Q_1^T Q_1 \hat{f}_k^{q+1,s} + \lambda_2 H^T A^T Q_2^T Q_2 AH \hat{f}_k^{q+1,s} \Big) \Big] \tag{34}$$

where $\hat{f}_k^{q+1,s+1}$ and $\hat{f}_k^{q+1,s}$ are estimates for $\hat{f}_k^{q+1}$ at the $(s+1)$th and $s$th iterations of the algorithm, respectively, $\alpha_f$ controls the convergence and rate of convergence of the algorithm, and $P_Q$ denotes the projection operator that finds the solution to Eq. (30). When the spatial domain noise model in Eq. (12) is utilized, we estimate the intensities with the iteration

$$\hat{f}_k^{q+1,s+1} = \hat{f}_k^{q+1,s} - \alpha_f \Big\{ -\sum_l C(\hat{d}_{l,k}^{q+1})^T H^T A^T K_Q^{-1} \big(y_l - AHC(\hat{d}_{l,k}^{q+1})\hat{f}_k^{q+1,s}\big) + \lambda_1 Q_1^T Q_1 \hat{f}_k^{q+1,s} + \lambda_2 H^T A^T Q_2^T Q_2 AH \hat{f}_k^{q+1,s} \Big\} \tag{35}$$

where $K_Q$ is the covariance matrix found in [10].
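The two iterations can be sketched as follows. The first function is one step of Eq. (35), a gradient update on the Gaussian data term plus the prior of Eq. (18); the second is one step of Eq. (34), where the data term is replaced by the projection $P_Q$ (the per-coefficient clipping sketched in Section IV-A). The operator arguments are placeholders to be supplied by the caller.

```python
import numpy as np

def intensity_step_spatial(f, y_list, fwd_ops, adj_ops, kq_inv_diag,
                           reg_grad, alpha=0.125):
    """One iteration of Eq. (35). fwd_ops[l](f) applies AHC(d_{l,k});
    adj_ops[l] applies its transpose; reg_grad(f) returns
    lambda1*Q1'Q1 f + lambda2*(AH)'Q2'Q2(AH) f."""
    grad = reg_grad(f)
    for y_l, fwd, adj in zip(y_list, fwd_ops, adj_ops):
        grad -= adj(kq_inv_diag * (y_l - fwd(f)))   # data term gradient
    return f - alpha * grad

def intensity_step_constrained(f, project_q, reg_grad, alpha=0.125):
    """One iteration of Eq. (34): a regularization step followed by the
    POCS projection onto the quantization constraint set F_Q."""
    return project_q(f - alpha * reg_grad(f))
```

With $\lambda_1 = \lambda_2 = 0$, so that reg_grad returns zero, the first update reduces to the maximum likelihood iteration implied by Eq. (31).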
The iterations in Eqs. (34) and (35) are applied to a single compressed bit-stream. The bit-stream is generated by sub-sampling an original 352×288 image sequence by a factor of two in both the horizontal and vertical directions. The decimated frames are then compressed with an MPEG-4 encoder operating at 1 Mbps. This is a high bit-rate simulation that maintains differences between temporally adjacent frames. (Lower bit-rates generally result in less resolution enhancement for a given processing window.) The original high resolution frame and decoded result appear in Fig. 5. For the simulations, we first incorporate the non-informative $P(f_k)$, so that $\lambda_1 = \lambda_2 = 0$. Displacement information is then estimated prior to enhancement. This is motivated by the quantization constraint, which complicates the displacement estimate, and the method in [10] is utilized. Seven frames are incorporated into the estimate with $T_B = T_F = 3$, and we choose $\alpha_f = 0.125$.

Estimates in Figs. 6(a) and 6(b) show the super-resolved results from Eqs. (34) and (35), respectively. As a first observation, notice that both estimates lead to a higher resolution image. This is best illustrated by comparing the estimated frames with the bilinearly interpolated result in Fig. 5(b). (Observe the text and texture regions in the right hand part of the frame; both are sharper than in the interpolated image.) However, close inspection of the two high resolution estimates shows that the image in Fig. 6(a) is corrupted by artifacts near sharp boundaries. This is attributable to the quantization constraint noise model, and it is introduced by registration errors as well as the non-unique solution of the approach. In comparison, the normal approximation for the quantization noise in Eq. (12) is less sensitive to registration errors. This is a function of the covariance matrix $K_Q$ and the unique fixed point of the algorithm. PSNR results for the estimated frames support the visual assessment. The quantization constraint and normal approximation models result in PSNRs of 28.8 dB and 33.4 dB, respectively.

A second set of experiments appears in Fig. 7. In these simulations, the prior model in Eq. (18) is utilized for $P(f_k)$. Parameters are unchanged, except that $\lambda_1 = \lambda_2 = 0.25$ and $\alpha_f = 0.05$. This facilitates comparison of the incorporated and non-informative priors for $P(f_k)$. Estimates from Eqs. (34) and (35) appear in Figs. 7(a) and 7(b), respectively. Again, we see evidence of resolution enhancement in both frames. Moreover, incorporating the quantization constraint no longer results in artifacts. This is a direct benefit of the prior model $P(f_k)$ that regularizes the solution. For the normal approximation method, we see that the estimated frame is now smoother. This is a weakness of the method, as it is sensitive to parameter selection. PSNR values for the two estimates are 32.4 dB and 30.5 dB, respectively.

As a final simulation, we employ Eq. (15) for $P(v \mid f_k, d, y)$. This incorporates a model for the motion vectors, and it requires an expansion of the algorithm in Eq. (32) as well as a definition for $K_{MV}$. We utilize the methods in [18]. Parameters are kept the same as in the previous experiments, except that $\lambda_1 = 0.01$, $\lambda_2 = 0.02$, and $\alpha_f = 0.125$. The estimated frame appears in Fig. 8. Resolution improvement is evident throughout the image, and it leads to the largest PSNR value for all simulated algorithms: 33.7 dB.

VII. Directions of Future Research

In concluding this paper, we want to identify several research areas that will benefit the field of super-resolution from compressed video. As a first area, we believe that the simultaneous estimation of multiple high resolution frames should lead to improved solutions. These enlarged estimates incorporate additional spatio-temporal descriptions of the sequence and provide increased flexibility in modeling the scene. For example, the temporal evolution of the displacements can be modeled. Note that there is some related work in the field of compressed video processing; see, for example, Choi et al. [24].

Accurate estimates of the high resolution displacements are critical for the super-resolution problem, and methods that improve these estimates are a second area of research. Optical flow techniques seem suitable for the general problem of resolution enhancement.
However, there is work to be done in designing methods for the blurred, sub-sampled, aliased, and blocky observations provided by a decoder. Towards this goal, alternative probability distributions within the estimation procedures are of interest. This is related to the work by Simoncelli et al. [25] and Nestares and Navarro [26]. Also, coarse-to-fine estimation methods have the potential for further improvement; see, for instance, Luettgen et al. [27]. Finally, we mention the use of banks of multi-directional/multi-scale representations for estimating the necessary displacements. A review of these methods appears in Chamorro-Martínez [28].

Fig. 5. Acquired image frames: (a) original image and (b) decoded result after bilinear interpolation. The original image is down-sampled by a factor of two in both the horizontal and vertical directions and then compressed.

Fig. 6. Resolution enhancement with two estimation techniques: (a) super-resolved image employing the quantization constraint for $P(y \mid f_k, d)$, (b) super-resolved image employing the normal approximation for $P(y \mid f_k, d)$. The method in (a) is susceptible to artifacts when registration errors occur, which is evident around the calendar numbers and within the upper-right part of the frame. PSNR values for (a) and (b) are 28.8 dB and 33.4 dB, respectively.

Fig. 7. Resolution enhancement with two estimation techniques: (a) super-resolved image employing the quantization constraint for $P(y \mid f_k, d)$, (b) super-resolved image employing the normal approximation for $P(y \mid f_k, d)$. The distribution in Eq. (18) is utilized for resolution enhancement. This regularizes the method in (a). However, the technique in (b) is sensitive to parameter selection and becomes overly smooth. PSNR values for (a) and (b) are 32.4 dB and 30.5 dB, respectively.

Fig. 8. Example high resolution estimate when information from the motion vectors is incorporated. This provides further gains in resolution improvement and the largest quantitative measure in the simulations. The PSNR for the frame is 33.7 dB.

Prior models for the high resolution intensities and displacements will also benefit from future work. For example, the use of piecewise-smooth models for the estimates will improve the high resolution problem. This can be realized with line processes [29] or an object based approach. For example, Irani and Peleg [30], Irani et al. [31], and Weiss and Adelson [32] present the idea of segmenting frames into objects and then reconstructing each object individually. This could benefit the resolution enhancement of compressed video. It also leads to an interesting estimation scenario, as compression standards such as MPEG-4 provide information about object boundaries within the bit-stream.

Finally, resolution enhancement is often a precursor to some form of image analysis or feature recognition task. Recently, there is a trend to address these problems directly. The idea is to learn priors from the images and then apply them to the high resolution problem. The recognition of faces is a current focus, and relevant work is found in Baker and Kanade [33], Freeman et al. [34], [35], and Capel and Zisserman [36]. With the increasing use of digital video technology, it seems only natural that these investigations consider the processing of compressed video.

References

[1] S. Chaudhuri, Ed., Super-Resolution Imaging, Kluwer Academic Publishers, 2001.

[2] S. Borman and R. Stevenson, "Spatial resolution enhancement of low-resolution image sequences: a comprehensive review with directions for future research," Tech. Rep., Laboratory for Image and Signal Analysis, University of Notre Dame, 1998.
[3] A.N. Netravali and B.G. Haskell, Digital Pictures: Representation and Compression, Plenum Press, 1995.

[4] V. Bhaskaran and K. Konstantinides, Image and Video Compression Standards, Kluwer Academic Publishers, 1995.

[5] Y. Altunbasak, A.J. Patti, and R.M. Mersereau, "Super-resolution still and video reconstruction from MPEG-coded video," IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, pp. 217–226, 2002.

[6] Y. Altunbasak and A.J. Patti, "A maximum a posteriori estimator for high resolution video reconstruction from MPEG video," in IEEE International Conference on Image Processing, 2000, vol. 2, pp. 649–652.

[7] B.K. Gunturk, Y. Altunbasak, and R. Mersereau, "Bayesian resolution-enhancement framework for transform-coded video," in IEEE International Conference on Image Processing, 2001, vol. 2, pp. 41–44.

[8] A.J. Patti and Y. Altunbasak, "Super-resolution image estimation for transform coded video with application to MPEG," in IEEE International Conference on Image Processing, 1999, vol. 3, pp. 179–183.

[9] C.A. Segall, A.K. Katsaggelos, R. Molina, and J. Mateos, "Super-resolution from compressed video," chapter 9 in Super-Resolution Imaging, S. Chaudhuri, Ed., pp. 211–242, Kluwer Academic Publishers, 2001.

[10] C.A. Segall, R. Molina, A.K. Katsaggelos, and J. Mateos, "Bayesian high-resolution reconstruction of low-resolution compressed video," in IEEE International Conference on Image Processing, 2001, vol. 2, pp. 25–28.

[11] M.A. Robertson and R.L. Stevenson, "DCT quantization noise in compressed images," in IEEE International Conference on Image Processing, 2001, vol. 1, pp. 185–188.

[12] D. Chen and R.R. Schultz, "Extraction of high-resolution video stills from MPEG image sequences," in IEEE International Conference on Image Processing, 1998, vol. 2, pp. 465–469.

[13] B.K. Gunturk, Y. Altunbasak, and R.M. Mersereau, "Multiframe resolution-enhancement methods for compressed video," IEEE Signal Processing Letters, vol. 9, pp. 170–174, 2002.

[14] J. Mateos, A.K. Katsaggelos, and R. Molina, "Resolution enhancement of compressed low resolution video," in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000, vol. 4, pp. 1919–1922.

[15] J. Mateos, A.K. Katsaggelos, and R. Molina, "Simultaneous motion estimation and resolution enhancement of compressed low resolution video," in IEEE International Conference on Image Processing, 2000, vol. 2, pp. 653–656.

[16] S.C. Park, M.G. Kang, C.A. Segall, and A.K. Katsaggelos, "Spatially adaptive high-resolution image reconstruction of low-resolution DCT-based compressed images," in IEEE International Conference on Image Processing, 2002.

[17] S.C. Park, M.G. Kang, C.A. Segall, and A.K. Katsaggelos, "Spatially adaptive high-resolution image reconstruction of low-resolution DCT-based compressed images," in submission, 2002.

[18] C.A. Segall, R. Molina, A.K. Katsaggelos, and J. Mateos, "Reconstruction of high-resolution image frames from a sequence of low-resolution and compressed observations," in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002, vol. 2, pp. 1701–1704.

[19] C.A. Segall, A.K. Katsaggelos, R. Molina, and J. Mateos, "Bayesian resolution enhancement of compressed video," in submission, 2002.
[20] S.C. Park, M.G. Kang, C.A. Segall, and A.K. Katsaggelos, "High-resolution image reconstruction of low-resolution DCT-based compressed images," in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002, vol. 2, pp. 1665–1668.

[21] D.G. Luenberger, Linear and Nonlinear Programming, Addison-Wesley, Reading, MA, 1984.

[22] B.D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proceedings of the Image Understanding Workshop, 1981, pp. 121–130.

[23] D.C. Youla and H. Webb, "Image restoration by the method of convex projections: Part 1 – Theory," IEEE Transactions on Medical Imaging, vol. MI-1, no. 2, pp. 81–94, 1982.

[24] M.C. Choi, Y. Yang, and N.P. Galatsanos, "Multichannel regularized recovery of compressed video sequences," IEEE Transactions on Circuits and Systems II, vol. 48, pp. 376–387, 2001.

[25] E.P. Simoncelli, E.H. Adelson, and D.J. Heeger, "Probability distributions of optical flow," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1991, pp. 310–315.

[26] O. Nestares and R. Navarro, "Probabilistic estimation of optical flow in multiple band-pass directional channels," Image and Vision Computing, vol. 19, pp. 339–351, 2001.

[27] M.R. Luettgen, W. Clem Karl, and A.S. Willsky, "Efficient multiscale regularization with applications to the computation of optical flow," IEEE Transactions on Image Processing, vol. 3, pp. 41–64, 1994.

[28] J. Chamorro-Martínez, Desarrollo de modelos computacionales de representación de secuencias de imágenes y su aplicación a la estimación de movimiento (in Spanish), Ph.D. thesis, University of Granada, 2001.

[29] J. Konrad and E. Dubois, "Bayesian estimation of motion vector fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 9, pp. 910–927, 1992.

[30] M. Irani and S. Peleg, "Motion analysis for image enhancement: Resolution, occlusion, and transparency," Journal of Visual Communication and Image Representation, vol. 4, pp. 324–335, 1993.

[31] M. Irani, B. Rousso, and S. Peleg, "Computing occluding and transparent motions," International Journal of Computer Vision, vol. 12, no. 1, pp. 5–16, January 1994.

[32] Y. Weiss and E.H. Adelson, "A unified mixture framework for motion segmentation: Incorporating spatial coherence and estimating the number of models," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1996, pp. 321–326.

[33] S. Baker and T. Kanade, "Limits on super-resolution and how to break them," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 1167–1183, 2002.

[34] W.T. Freeman, E.C. Pasztor, and O.T. Carmichael, "Learning low-level vision," International Journal of Computer Vision, vol. 40, pp. 24–57, 2000.

[35] W.T. Freeman, T.R. Jones, and E.C. Pasztor, "Example-based super-resolution," IEEE Computer Graphics and Applications, pp. 56–65, 2002.

[36] D. Capel and A. Zisserman, "Super-resolution from multiple views using learnt image models," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2001, vol. 2, pp. 627–634.

C. Andrew Segall received the B.S. and M.S. degrees in Electrical Engineering from Oklahoma State University in 1995 and 1997, respectively, and the Ph.D. degree in Electrical Engineering from Northwestern University in 2002. He is currently a post-doctoral researcher at Northwestern University, Evanston, IL. His research interests are in image processing and include recovery problems for compressed video, scale space theory, and nonlinear filtering.
Dr. Segall is a member of Phi Kappa Phi, Eta Kappa Nu, SPIE, and the IEEE.

Rafael Molina was born in 1957. He received the degree in mathematics (statistics) in 1979 and the Ph.D. degree in optimal design in linear models in 1983. He became Professor of computer science and artificial intelligence at the University of Granada, Granada, Spain, in 2000. His areas of research interest are image restoration (applications to astronomy and medicine), parameter estimation, image and video compression, high resolution image reconstruction, and blind deconvolution. Dr. Molina is a member of SPIE, the Royal Statistical Society, and the Asociación Española de Reconocimiento de Formas y Análisis de Imágenes (AERFAI).

Aggelos K. Katsaggelos received the Diploma degree in electrical and mechanical engineering from the Aristotelian University of Thessaloniki, Greece, in 1979 and the M.S. and Ph.D. degrees, both in electrical engineering, from the Georgia Institute of Technology in 1981 and 1985, respectively. In 1985, he joined the Department of Electrical and Computer Engineering at Northwestern University, where he is currently professor, holding the Ameritech Chair of Information Technology. He is also the Director of the Motorola Center for Communications. During the 1986-1987 academic year, he was an assistant professor in the Department of Electrical Engineering and Computer Science, Polytechnic University, Brooklyn, NY. Dr. Katsaggelos is a Fellow of the IEEE, an Ameritech Fellow, a member of the Associate Staff, Department of Medicine, at Evanston Hospital, and a member of SPIE. He is a member of the Publication Board of the Proceedings of the IEEE, the IEEE Technical Committees on Visual Signal Processing and Communications and on Multimedia Signal Processing, and an Editorial Board Member for Academic Press, the Marcel Dekker Signal Processing Series, Applied Signal Processing, and the Computer Journal. He has served as editor-in-chief of the IEEE Signal Processing Magazine (1997-2002), a member of the Publication Boards of the IEEE Signal Processing Society, the IEEE TAB Magazine Committee, an associate editor for the IEEE Transactions on Signal Processing (1990-1992), an area editor for the journal Graphical Models and Image Processing (1992-1995), a member of the Steering Committees of the IEEE Transactions on Image Processing (1992-1997) and the IEEE Transactions on Medical Imaging (1990-1999), a member of the IEEE Technical Committee on Image and Multi-Dimensional Signal Processing (1992-1998), and a member of the Board of Governors of the IEEE Signal Processing Society (1999-2001). He is the editor of Digital Image Restoration (Springer-Verlag, Heidelberg, 1991), co-author of Rate-Distortion Based Video Compression (Kluwer Academic Publishers, 1997), and co-editor of Recovery Techniques for Image and Video Compression and Transmission (Kluwer Academic Publishers, 1998). He is the co-inventor of eight international patents and the recipient of the IEEE Third Millennium Medal (2000), the IEEE Signal Processing Society Meritorious Service Award (2001), and an IEEE Signal Processing Society Best Paper Award (2001).