High Resolution Images from a Sequence of
Low Resolution and Compressed
Observations: A Review
C. A. Segall, R. Molina, and A. K. Katsaggelos
This work has been partially supported by the “Comisión Nacional de Ciencia y Tecnología” under contract
TIC2000-1275.
A. K. Katsaggelos and C. A. Segall are with the Department of Electrical and Computer Engineering, Northwestern
University, Evanston, Illinois 60208-3118.
Rafael Molina is with the Departamento de Ciencias de la Computación e I.A. Universidad de Granada, 18071
Granada, Spain.
I. Introduction
Super-resolution is the task of estimating high resolution images from a set of low
resolution observations. These observations are acquired by either multiple sensors imaging
a single scene or by a single sensor imaging the scene over a period of time. No matter
the source of observations though, the critical requirement for super-resolution is that the
observations contain different but related views of the scene. For scenes that are static,
this requires sub-pixel displacements between the multiple sensors or within the motion
of the single camera. For dynamic scenes, the necessary shifts are introduced by the
motion of objects. Notice that super-resolution does not consider the case of a static scene
acquired by a stationary camera. These applications are addressed by the field of image
interpolation.
A wealth of research considers modeling the acquisition and degradation of the low resolution frames and then solving the high resolution problem. (In this paper, we utilize
the terms super-resolution, high resolution, and resolution enhancement interchangeably.)
For example, literature reviews are presented in [1], [2] as well as this special issue. Work
traditionally addresses the resolution enhancement of frames that are filtered and downsampled during acquisition and corrupted by additive noise during transmission and storage. In this paper though, we review approaches for the super-resolution of compressed
video. Our focus is on hybrid motion-compensation and transform coding methods, which include the family of ITU and MPEG coding standards [3], [4]. The JPEG still image coding systems are also a special case of this approach.
The use of video compression differentiates the resulting super-resolution and traditional problems. As a first difference, video compression methods represent images with
a sequence of motion vectors and transform coefficients. The motion vectors provide a
noisy observation of the temporal relationships within the high resolution scene. This
is a type of observation not traditionally available to the high resolution problem. The
transform coefficients represent a noisy observation of the high resolution intensities. This
noise results from more sophisticated processing than the traditional processing scenario,
as compression techniques may discard data according to perceptual significance.
Additional problems arise with the introduction of compression. As a core requirement,
resolution enhancement algorithms operate on a sequence of related but different observations. Unfortunately, maintaining this difference is not the goal of a compression system.
Compression algorithms predict frames with the motion vectors and then quantize the
prediction error. This discards some of the differences between frames and decreases the
potential for resolution enhancement.
In the rest of this paper, we survey the field of super-resolution processing for compressed
video. The introduction of motion vectors, compression noise and additional redundancies
within the image sequence make this problem fertile ground for novel processing methods. In conducting this survey though, we develop and present all techniques within the
Bayesian framework. This adds consistency to the presentation and facilitates comparison
between the different methods.
The paper is organized as follows. In Section II, we define the acquisition system utilized
by the surveyed procedures. In Section III, we formulate the high resolution problem
within the Bayesian framework. In Section IV, we survey models for the acquisition
and compression systems. This requires consideration of both the motion vectors and
transform coefficients within the compressed bit-stream. In Section V, we survey models
for the original high resolution image intensities and displacement values. In Section VI,
we discuss solutions for the super-resolution problem and provide examples of several
approaches. Finally, we consider future research directions in Section VII.
II. Acquisition Model and Notation
Before we can recover a high resolution image from a sequence of low resolution observations, we must be precise in describing how the two are related. We begin with the
pictorial depiction of the system in Figure 1. As can be seen from the figure, a continuous (high resolution) scene is first imaged by a low resolution sensor. This filters and
samples the original high resolution data. The acquired low resolution image is then compressed with a hybrid motion-compensation and transform coding scheme. The resulting
bit-stream contains both motion vectors and quantized transform coefficients. Finally, the
compressed bit-stream serves as the input to the resolution enhancement procedure. This
super-resolution algorithm provides an estimate of the high resolution scene.
In the figure, the high resolution data represents a time-varying scene in the image plane
4
gl(m,n)
f(x,y,t)
yl(m,n) vl,i(m,n)
Fig. 1. An overview of the super-resolution problem. A high resolution image sequence is captured at
low resolution by a camera or other acquisition system. The low resolution frames are then compressed
for storage and transmission. The goal of the super-resolution algorithm is to estimate the original high
resolution sequence from the compressed information.
coordinate system. This is denoted as f (x, y, t), where x, y and t are real numbers that
indicate horizontal, vertical and temporal locations. The scene is filtered and sampled
during acquisition to obtain the discrete sequence gl (m, n), where l is an integer time
index, 1 ≤ m ≤ M, and 1 ≤ n ≤ N. The sequence gl(m, n) is not observable to any of the
super-resolution procedures. Instead, the frames are compressed with a video compression
system that results in the observable sequence yl (m, n). It also provides the motion vectors
vl,i (m, n) that predict pixel yl (m, n) from the previously transmitted yi (m, n).
Images in the figure are concisely expressed with the matrix-vector notation. In this format, one-dimensional vectors represent two-dimensional images. These vectors are formed
by lexicographically ordering the image by rows, which is analogous to storing the frame
in raster scan format. Hence, the acquired and observed low resolution images gl(m, n) and yl(m, n) are respectively expressed by the MN × 1 vectors gl and yl. The motion vectors that predict yl from yi are represented by the 2MN × 1 vector vl,i that is formed by stacking the transmitted horizontal and vertical offsets. Furthermore, since we utilize
5
C(d1 -1,l )
C(d1 +1,l )
fl-1
fl
fl+1
AH
AH
AH
gl-1
Q[ ]
gl
Q[ ]
gl+1
Q[ ]
MCl(ylP, vl)
yl-1
P
MCl+1(yl+1
,vl+1)
yl+1
yl
Fig. 2. Relationships between the high resolution and low resolution images. High resolution frames are denoted as fl and are mapped to other time instances by the operator C(di,l). The acquisition system transforms the high resolution frames to the low resolution sequence gl, which is then compressed to produce yl. Notice that the compressed frames are also related through the motion vectors vi,l.
digital techniques to recover the high resolution scene, the high resolution frame is denoted as fl. The dimension of this vector is PMPN × 1, where P represents the resolution
enhancement factor.
Relationships between the original high resolution data and detected low resolution
frames are further illustrated in Figure 2. Here, we show that the low resolution image gl and high resolution image fl are related by
gl = AHfl,   l = 1, 2, 3, . . .   (1)
where H is a PMPN × PMPN matrix that describes the filtering of the high resolution image and A is an MN × PMPN down-sampling matrix. The matrices A and H model
the acquisition system and are assumed to be known. For the moment, we assume that
the detector does not introduce noise.
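As a concrete illustration of Eq. (1), the relation gl = AHfl can be simulated numerically. The following sketch is our own simplification, not taken from the surveyed methods: the function name is illustrative, and H is modeled as a P × P box filter (a real system would use the measured sensor PSF), with the decimation A implicit in the block averaging.

```python
import numpy as np

def acquire(f, P=2):
    """Toy version of Eq. (1), g = A H f: blur the high resolution
    frame with a P x P box filter (standing in for H) and decimate
    by P (standing in for A)."""
    M, N = f.shape[0] // P, f.shape[1] // P
    # Averaging each P x P neighbourhood performs the blur and the
    # down-sampling in one step for this separable toy model
    return f.reshape(M, P, N, P).mean(axis=(1, 3))

f = np.arange(16.0).reshape(4, 4)   # a tiny "high resolution" frame
g = acquire(f, P=2)                 # 2 x 2 low resolution observation
```

With the box-filter choice, each low resolution pixel is simply the mean of the P × P high resolution neighbourhood it covers.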
Frames within the high resolution sequence are also related through time. This is
evident in Figure 2. Here, we assume a translational relationship between the frames that
is written as
fl(m, n) = fk(m + dl,k^m(m, n), n + dl,k^n(m, n)) + rl,k(m, n)   (2)

where dl,k^m(m, n) and dl,k^n(m, n) denote the horizontal and vertical components of the displacement dl,k(m, n) that relates the pixel at time k to the pixel at time l, and rl,k(m, n)
accounts for errors within the model. (Noise introduced by the sensor can also be incorporated into this error term.) In matrix-vector notation, Eq. (2) becomes
fl = C(dl,k)fk + rl,k   (3)
where C(dl,k) is the PMPN × PMPN matrix that maps frame fk to frame fl, dl,k is
the column vector defined by lexicographically ordering the values of the displacements
between the two frames, and rl,k is the registration noise. Note that while this displacement
model is prevalent in the literature, a more general motion model could be employed.
Having considered the relationship between low resolution and high resolution images
prior to compression, let us turn our attention to the compression process. During compression, frames are divided into blocks that are encoded with one of two available methods.
For the first method, a linear transform such as the Discrete Cosine Transform (DCT)
is applied to the block. The operator decorrelates the intensity data, and the resulting
transform coefficients are independently quantized and transmitted to the decoder. For
the second method, predictions for the blocks are first generated by motion compensating
previously transmitted image frames. The compensation is controlled by motion vectors
that define the spatial and temporal offset between the current block and its prediction.
Computing the prediction error, transforming it with a linear transform, quantizing the
transform coefficients, and transmitting the quantized information refine the prediction.
Elements of the video compression procedure are shown in Figure 3. In the figure,
we show an original image frame and its transform and quantized representation in 3(a)
and 3(b), respectively. This represents the first type of compression method, called intracoding, which also encompasses still image coding methods such as the JPEG procedures.
The second form of compression is illustrated in Figure 4, where the original and previously
compressed image frames appear in 4(a) and 4(b), respectively. This is often called intercoding, where the original frame is first predicted from the previously compressed data.
Motion vectors for the prediction are shown in 4(c), and the motion compensated estimate
appears in 4(d). The difference between the estimate and the original frame results in
the displaced frame difference (or error residual), which is transformed with the DCT
and quantized by the encoder. The quantized displaced frame difference for 4(a) and
4(d) appears in 4(e). At the encoder, the motion compensated estimate and quantized
Fig. 3. Intra-coding example. The image in (a) is transformed and quantized to produce the compressed
frame in (b).
displaced frame difference are combined to create the decoded frame appearing in 4(b).
From the above discussion, we define the relationship between the acquired low resolution frame and its compressed observation as
yl = T^{−1}[ Q[ T(gl − MCl(ylP, vl)) ] ] + MCl(ylP, vl),   l = 1, 2, 3, . . .   (4)
where Q[.] represents the quantization procedure, T and T^{−1} are the forward and inverse transform operations, respectively, MCl(ylP, vl) is the motion compensated prediction of
gl formed by motion compensating previously decoded frame(s) as defined by the encoding
method, and ylP and vl denote the set of decoded frames and motion vectors that predict
yl, respectively. We want to make clear here that MCl depends on vl and only on a subset of y. For example, when a bit-stream contains a sequence of P-frames then ylP = yl−1
and vl = vl,l−1 . However, as there is a trend towards increased complexity and non-causal
predictions within the motion compensation procedure, we keep the above notation for
generality.
With a definition for the compression system, we can now be precise in describing
the relationship between the high resolution frames and the low resolution observations.
Combining Eqs. (1), (3) and (4), the acquisition system depicted in Figure 1 is expressed as
yl = AHC(dl,k)fk + el,k   (5)
Fig. 4. Inter-coding example. The image in (a) is inter-coded to generate the compressed frame in
(b). The process begins by finding the motion vectors in (c), which generates the motion compensated
prediction in (d). The difference between the prediction and input image is then computed, and it is
transformed and quantized. The resulting residual appears in (e) and is added to the prediction in (d) to
generate the compressed frame in (b).
where el,k includes the errors introduced during compression, registration and acquisition.
III. Problem Formulation
With the acquisition system defined, we now formulate the super-resolution reconstruction problem. The reviewed methods utilize a window of low resolution compressed observations to estimate a single high resolution frame. Thus, the goal is to estimate the
high resolution frame fk and displacements d given the decoded intensities y and motion
vectors v. Here, displacements between fk and all of the frames within the processing
window are encapsulated in d, as d = {dl,k | l = k − TB, . . . , k + TF}, and TF + TB + 1 establishes the number of frames in the window. Similarly, all of the decoded observations and motion vectors within the processing window are contained in y and v, as y = {yl | l = k − TB, . . . , k + TF} and v = {vl | l = k − TB, . . . , k + TF}.
We pursue the estimate within the Bayesian paradigm. Therefore, the goal is to find f̂k ,
an estimate of fk , and d̂, an estimate of d, such that,
f̂k, d̂ = arg max_{fk,d} P(fk, d)P(y, v|fk, d)   (6)
In the expression, P(y, v|fk, d) provides a mechanism to incorporate the compressed bit-stream into the enhancement procedure, as it describes the probabilistic modeling of the
process to obtain y and v from fk and d. Similarly, P (fk , d) allows for the integration
of prior knowledge about the original high resolution scene and displacements. This is
somewhat simplified in applications where the displacement values are assumed known, as
the super-resolution estimate becomes
f̂k = arg max_{fk} P(fk)P(y, v|fk, d)   (7)
where d contains the previously found displacements.
IV. Modeling the Observation
Having presented the super-resolution problem within the Bayesian framework, we
now consider the probability distributions in Eq. (6). We begin with the distribution
P (y, v|fk , d) that models the relationship between the original high resolution intensities
and displacements and the decoded intensities and motion vectors. For the purposes of
this review, it is rewritten as
P(y, v|fk, d) = ∏l P(yl|fk, d)P(vl|fk, d, y)   (8)
where P (yl |fk , d) is the distribution of the noise introduced by quantizing the transform
coefficients and P (vl |fk , d, y) expresses any information derived from the motion vectors.
Note that Eq. (8) assumes independence between the decoded intensities and motion
vectors throughout the image sequence. This is well motivated when the encoder selects
the motion vectors and quantization intervals without regard to the future bit-stream.
Modeling any dependence between these two quantities remains a topic for future work.
A. Quantization Noise
To understand the structure of compression errors, we need to model the degradation
process Q in Eq. (4). This is a non-linear operation that discards data in the transform
domain, and it is typically realized by dividing each transform coefficient by a quantization
scale factor and then rounding the result. The procedure is expressed as
[Tyk](i) = q(i) Round( [Tgk](i) / q(i) )   (9)
where [Tyk ](i) denotes the ith transform coefficient of the compressed frame yk , q(i) is the
quantization factor for coefficient i, and Round(.) is an operator that maps each value to
the nearest integer.
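Eq. (9) is easy to simulate. In the following sketch, the helper name and the constant quantization factor are our own illustrative assumptions:

```python
import numpy as np

def quantize(coeffs, q):
    """Eq. (9): divide each transform coefficient by its quantization
    factor q(i), round to the nearest integer, and rescale."""
    return q * np.round(coeffs / q)

c = np.array([10.4, -3.7, 0.2])    # transform coefficients [Tg](i)
q = np.array([4.0, 4.0, 4.0])      # quantization factors q(i)
cq = quantize(c, q)                # decoded coefficients [Ty](i)
```

For these values the decoded coefficients are 12, −4, and 0, and in each case the error is bounded by half the quantization factor.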
Two prominent models for the quantization noise appear in the super-resolution literature. The first follows from the fact that quantization errors are bounded by the
quantization scale factor, that is
−q(i)/2 ≤ [Tyk](i) − [Tgk](i) ≤ q(i)/2   (10)
according to Eq. (9). Thus, it seems reasonable that the recovered high resolution image
(when mapped to low resolution) has transform coefficients within the same interval. This
is often called the quantization constraint, and it is enforced with the distribution


P1(yl|fk, d) = const  if −q(i)/2 ≤ [T(AHC(dl,k)fk − MCl(ylP, vl))](i) ≤ q(i)/2 ∀i,  and 0 elsewhere   (11)
As we are working within the Bayesian framework, we note that this states that the quantization errors are uniformly distributed within the quantization interval. Thus, [T(yk − gk)](i) ∼ U[−q(i)/2, q(i)/2], and so E([T(yk − gk)](i)) = 0 and var([T(yk − gk)](i)) = q(i)²/12.
Several authors employ the quantization constraint for super-resolution processing. For
example, it is utilized by Altunbasak et al. [5], [6], Gunturk et al. [7], Patti and Altunbasak
[8], and Segall et al. [9], [10]. With the exception of [7], quantization is considered to be
the sole source of noise within the acquisition system. This simplifies the construction of
Eq. (11). However, since the distribution P1 (yl |fk , d) in Eq. (11) is not differentiable, care
must still be taken when finding the high resolution estimate. This is addressed later in
this paper.
The second model for the quantization noise is constructed in the spatial domain. This is
appealing, as it motivates a Gaussian distribution that is differentiable. To understand this
motivation, consider the following. First, the quantization operator in Eq. (9) quantizes
each transform coefficient independently. Thus, noise in the transform domain is not
correlated between transform indices. Second, the transform operator is linear. With
these two conditions, quantization noise in the spatial domain becomes a linear sum of
independent noise processes. The resulting distribution tends to be Gaussian, and it is
expressed as [11]
P2(yl|fk, d) ∝ exp[ −(1/2)(yl − AHC(dl,k)fk)^T KQ^{−1} (yl − AHC(dl,k)fk) ]   (12)
where KQ is a covariance matrix that describes the noise.
The normal approximation for the quantization noise appears in work by Chen and
Schultz [12], Gunturk et al. [13], Mateos et al. [14], [15], Park et al. [16], [17], and
Segall et al. [9], [18], [19]. A primary difference between these efforts lies in the definition
and estimation of the covariance matrix. For example, a white noise model is assumed
by Chen and Schultz and Mateos et al., while Gunturk et al. develop the distribution
experimentally. Segall et al. consider a high bit-rate approximation for the quantization
noise. Lower rate compression scenarios are addressed by Park et al., where the covariance
matrix and high resolution frame are estimated simultaneously.
In concluding this sub-section, we mention that the spatial domain noise model also
incorporates errors introduced by the sensor and motion models. This is accomplished by
modifying the covariance matrix KQ [20]. Interestingly, since these errors are often independent of the quantization noise, incorporating the additional noise components further
motivates the Gaussian model.
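For an orthonormal block transform T and independent transform-domain errors of variance q(i)²/12, the covariance KQ in Eq. (12) can be formed explicitly as KQ = T^T diag(q(i)²/12) T. The sketch below does this for an 8-point 1D DCT; the helper name is ours, and the constant q produces the white-noise special case assumed by several of the cited methods (unequal q(i) would yield a non-diagonal KQ).

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix T; rows are the transform basis vectors."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    T = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    T[0, :] = np.sqrt(1.0 / n)   # DC row uses the smaller scale factor
    return T

n = 8
T = dct_matrix(n)
q = np.full(n, 16.0)             # one quantization factor per coefficient
D = np.diag(q ** 2 / 12.0)       # transform-domain covariance (uniform noise)
K_Q = T.T @ D @ T                # spatial-domain covariance of Eq. (12)
```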
B. Motion Vectors
Incorporating the quantization noise is a major focus of much of the super-resolution
for compressed video literature. However, it is also reasonable to use the motion vectors,
v, within the estimates for f̂k and d. These motion vectors introduce a departure from
traditional super-resolution techniques. In traditional approaches, the observed low resolution images provide the only source of information about the relationship between high
resolution frames. When compression is introduced though, motion vectors provide an
additional observation for the displacement values. This information differs from what is
conveyed by the decoded intensities.
There are several methods that exploit the motion vectors during resolution enhancement. At the high level though, each tries to model some similarity between transmitted
motion vectors and actual high resolution displacements. For example, Chen and Schultz
[12] constrain the motion vectors to be within a region surrounding the actual sub-pixel
displacements. This is accomplished with the distribution


P1(vl|fk, d, y) = const  if |vl,i(j) − [AD dl,i](j)| ≤ ∆, i ∈ PS, ∀j,  and 0 elsewhere   (13)
where AD is a matrix that maps the displacements to the low resolution grid, ∆ denotes the maximum difference between the transmitted motion vectors and estimated displacements, PS represents the set of previously compressed frames employed to predict fk, and [AD dl,i](j) is the jth element of the vector AD dl,i. Similarly, Mateos et al. [15] utilize the
distribution
P2(vl|fk, d, y) ∝ exp[ −(γl/2) Σ_{i∈PS} ||vl,i − AD dl,i||² ]   (14)
where γl specifies the similarity between the transmitted and estimated information.
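The quadratic penalty inside Eq. (14) is simple to evaluate numerically. In the sketch below, the function name and toy values are our own, and AD halves the displacements, corresponding to an enhancement factor of P = 2:

```python
import numpy as np

def mv_penalty(v, d, A_D, gamma):
    """Negative log of Eq. (14): (gamma/2) ||v - A_D d||^2, the
    quadratic mismatch between a transmitted motion vector v and a
    candidate high resolution displacement d mapped to the low
    resolution grid by A_D."""
    r = v - A_D @ d
    return 0.5 * gamma * float(r @ r)

A_D = 0.5 * np.eye(2)           # resolution factor P = 2 halves displacements
d = np.array([4.0, -2.0])       # high resolution displacement (pixels)
v = np.array([2.0, -1.0])       # transmitted motion vector (pixels)
cost = mv_penalty(v, d, A_D, gamma=1.0)   # zero: the two agree exactly
```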
There are two disadvantages to modeling the motion vectors and high resolution displacements as similar throughout the frame. As a first problem, the significance of the
motion vectors depends on the underlying compression ratio, which typically varies within
the frame. As a second problem, the quality of the motion vectors is dictated by the
underlying intensity values. Segall et al. [18], [19] account for these errors by modeling
the displaced frame difference within the encoder. This incorporates the motion vectors
and is written as
P3(vl|fk, d, y) ∝ exp[ −(1/2)(MCl(ylP, vl) − AHC(dl,k)fk)^T KMV^{−1} (MCl(ylP, vl) − AHC(dl,k)fk) ]   (15)
where KMV is the covariance matrix for the prediction error between the original frame and its motion compensated estimate MCl(ylP, vl). Estimates for KMV are derived from the compressed bit-stream and therefore reflect the amount of compression.
V. Modeling the Original Sequence
We now consider the second distribution in Eq. (6), namely P (fk , d). This distribution
contains a priori knowledge about the high resolution intensities and displacements. In
the literature, it is assumed that this information is independent. Thus, we write
P(fk, d) = P(fk)P(d)   (16)
for the remainder of this survey.
Several researchers ignore the construction of the a priori densities, focusing instead on
other facets of the super-resolution problem. For example, portions of the literature solely
address the modeling of compression noise, e.g. [8], [7], [5]. This is equivalent to using the
non-informative prior for both the original high resolution image and displacement data
so that
P(fk) ∝ const  and  P(d) ∝ const   (17)
In these approaches, the noise model determines the high resolution estimate, and resolution enhancement becomes a maximum likelihood problem. Since the problem is ill-posed
though, care must be exercised so that the approach does not become unstable or noisy.
A. Intensities
Prior distributions for the intensity information are motivated by the following two
statements. First, it is assumed that pixels in the original high resolution images are
correlated. This is justified for the majority of acquired images, as scenes usually contain
a number of smooth regions with (relatively) few edge locations. As a second statement,
it is assumed that the original images are devoid of compression errors. This is also a
reasonable statement. Video coding often results in structured errors, such as blocking
artifacts, which rarely occur in uncompressed image frames.
To encapsulate the statement that images are correlated and absent of blocking artifacts,
the prior distribution
P(fk) ∝ exp[ −( (λ1/2) ||Q1 fk||² + (λ2/2) ||Q2 AHfk||² ) ]   (18)
is utilized in [9] (and references therein). Here, Q1 represents a linear high-pass operation that penalizes super-resolution estimates that are not smooth, Q2 represents a linear high-pass operator that penalizes estimates with block boundaries, and λ1 and λ2 control the
influence of the norms. A common choice for Q1 is the discrete 2D Laplacian; a common
choice for Q2 is the simple difference operation applied at the boundary locations.
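The smoothness term of Eq. (18) with Q1 the discrete 2D Laplacian can be sketched as follows; the function name is ours, and border pixels use replicated (edge) padding, one of several reasonable boundary choices:

```python
import numpy as np

LAPLACIAN = np.array([[0., 1., 0.],
                      [1., -4., 1.],
                      [0., 1., 0.]])

def smoothness_penalty(f, lam=1.0):
    """(lam/2) ||Q1 f||^2, the first term of Eq. (18), with Q1 the
    discrete 2D Laplacian and replicated borders."""
    padded = np.pad(f, 1, mode='edge')
    hf = np.zeros_like(f)
    H, W = f.shape
    for i in range(3):           # correlate the 3x3 kernel with the frame
        for j in range(3):
            hf += LAPLACIAN[i, j] * padded[i:i + H, j:j + W]
    return 0.5 * lam * float(np.sum(hf ** 2))

flat = np.full((8, 8), 7.0)                       # perfectly smooth frame
blocky = np.kron(np.array([[0., 1.], [1., 0.]]),  # strong block boundaries
                 np.ones((4, 4)))
```

A constant frame incurs no penalty, while a frame with block boundaries is penalized, which is exactly the behavior the prior is meant to encode.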
Other distributions could also be incorporated into the estimation procedure. For example, Huber’s function could replace the quadratic norm. This is discussed in [12].
B. Displacements
The non-informative prior in Eq. (17) is the most common distribution for P (d) in the
literature. However, explicit models for the displacement values are recently presented
in [18], [19]. There, the displacement information is assumed to be independent between
frames so that
P(d) = ∏l P(dl)   (19)
Then, the displacements within each frame are assumed to be smooth and absent of coding
artifacts. To penalize these errors, the displacement prior is given by
P(dl) ∝ exp[ −(λ3/2) ||Q3 dl||² ]   (20)
where Q3 is a linear high-pass operator, and λ3 is the inverse of the noise variance of the
normal distribution. The discrete 2D Laplacian is typically selected for Q3.
VI. Realization of the Super-Resolution Methods
With the super-resolution for compressed video techniques summarized by the previous
distributions, we turn our attention to computing the enhanced frame. Formally this
requires the solution of Eq. (6), where we estimate the high resolution intensities and
displacements given some combination of the proposed distributions. The joint estimate
is found by taking logarithms of Eq. (6) and solving
f̂k, d̂ = arg max_{fk,d} log P(fk, d)P(y, v|fk, d)   (21)
with a combination of gradient descent, non-linear projection, and full-search methods.
Scenarios where d is already known or separately estimated are a special case of the
resulting procedure.
One way to evaluate Eq. (21) is with the cyclic coordinate descent procedure [21]. With
the approach, an estimate for the displacements is first found by assuming that the high
resolution image is known, so that
d̂^{q+1} = arg max_d log P(d)P(y, v|f̂k^q, d)   (22)
where q is the iteration index for the joint estimate. (For the case where d is known, Eq. (22) becomes d̂^{q+1} = d ∀q.) The intensity information is then estimated by assuming that the displacement estimates are exact, that is
f̂k^{q+1} = arg max_{fk} log P(fk)P(y, v|fk, d̂^{q+1})   (23)
The displacement information is re-estimated with the result from Eq. (23), and the process
iterates until convergence.
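The alternation of Eqs. (22) and (23) can be written generically, with the two conditional estimators supplied as callables. The scalar toy problem below is our own construction, intended only to show the control flow:

```python
def cyclic_coordinate_descent(y, v, f0, d0, estimate_d, estimate_f,
                              max_iter=20, tol=1e-6):
    """Alternate the two conditional estimates: d from Eq. (22) with f
    fixed, then f from Eq. (23) with d fixed, until f stops changing.
    estimate_d / estimate_f stand in for solvers of the two
    sub-problems (e.g. block matching and gradient descent)."""
    f, d = f0, d0
    for _ in range(max_iter):
        d = estimate_d(y, v, f)        # Eq. (22)
        f_new = estimate_f(y, v, d)    # Eq. (23)
        if abs(f_new - f) < tol:       # scalar toy convergence test
            return f_new, d
        f = f_new
    return f, d

# Toy model y = d * f, with d restricted to the candidates {1, 2, 3}
y_obs = 6.0
est_d = lambda y, v, f: min((1, 2, 3), key=lambda d: (y - d * f) ** 2)
est_f = lambda y, v, d: y / d
f_hat, d_hat = cyclic_coordinate_descent(y_obs, None, 1.0, 1, est_d, est_f)
```

Starting from f = 1, the loop settles on d = 3 and f = 2, consistent with the observation 6 = 3 · 2.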
The remaining question is how to solve Eqs. (22) and (23) for the distributions presented
in the previous sections. This is considered in the following subsections.
A. Finding the Displacements
As we have mentioned before, the non-informative prior in Eq. (17) is a common choice
for P (d). We will consider its use first in developing algorithms, as the displacement
estimate in Eq. (22) is simplified and becomes
d̂^{q+1} = arg max_d log P(y, v|f̂k^q, d)   (24)
This estimation problem is quite attractive, as displacement values for different regions of
the frame are now independent.
Block-matching algorithms are well suited for solving Eq. (24), and the construction
of P (y, v|fk , d) controls the performance of the block-matching procedure. For example,
Mateos et al. [15] combine the spatial domain model for the quantization noise with the
distribution for P (v|fk , d, y) in Eq. (14). The resulting block-matching cost function is
d̂l,k^{q+1} = arg min_{dl,k} [ (yl − AHC(dl,k)f̂k^q)^T KQ^{−1} (yl − AHC(dl,k)f̂k^q) + (γl/2) ||vl,k − AD dl,k||² ]   (25)
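A full-search realization of a cost of this form, for a single block and with KQ taken as the identity (so the data term reduces to a sum of squared differences), might be sketched as follows; the function and variable names are illustrative:

```python
import numpy as np

def match_block(block, ref, center, v, gamma=0.1, radius=2):
    """Full-search block matching with a quadratic data term (white
    quantization noise, K_Q = I) plus a quadratic pull toward the
    transmitted motion vector v; integer-pixel displacements only."""
    B = block.shape[0]
    best, best_cost = (0, 0), np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y0, x0 = center[0] + dy, center[1] + dx
            if y0 < 0 or x0 < 0 or y0 + B > ref.shape[0] or x0 + B > ref.shape[1]:
                continue                      # candidate falls off the frame
            cand = ref[y0:y0 + B, x0:x0 + B]
            cost = np.sum((block - cand) ** 2) \
                 + 0.5 * gamma * ((dy - v[0]) ** 2 + (dx - v[1]) ** 2)
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best

ref = np.zeros((8, 8)); ref[3:5, 4:6] = 1.0   # bright 2x2 patch in the frame
block = np.ones((2, 2))                       # block content to locate
d_best = match_block(block, ref, center=(2, 2), v=(1, 2))
```

Here the patch sits one row down and two columns right of the block's position, and the transmitted motion vector agrees, so the search returns (1, 2).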
Similarly, Segall et al. [18], [19] utilize the spatial domain noise model and Eq. (15) for
P (v|fk , d, y). The cost function then becomes
d̂l,k^{q+1} = arg min_{dl,k} [ (yl − AHC(dl,k)f̂k^q)^T KQ^{−1} (yl − AHC(dl,k)f̂k^q)
    + (MCl(ylP, vl) − AHC(dl,k)f̂k^q)^T KMV^{−1} (MCl(ylP, vl) − AHC(dl,k)f̂k^q) ]   (26)
Finally, Chen and Schultz [12] substitute the distribution for P(v|fk, d, y) in Eq. (13),
which results in the block-matching cost function
d̂l,k^{q+1} = arg min_{dl,k ∈ CMV} [ (yl − AHC(dl,k)f̂k^q)^T KQ^{−1} (yl − AHC(dl,k)f̂k^q) ]   (27)
where CMV follows from Eq. (13) and denotes the set of displacements that satisfy the condition |vl,k(i) − [AD dl,k](i)| ≤ ∆ ∀i.
When the quantization constraint in Eq. (11) is combined with the non-informative
P (d), estimation of the displacement information is
d̂^{q+1} = arg max_{dl,k ∈ CQ} log P(v|f̂k^q, d, y)   (28)

where CQ denotes the set of displacements that satisfy the constraint −q(i)/2 ≤ [T(AHC(dl,k)f̂k − MCl(ylP, vl))](i) ≤ q(i)/2 ∀i. This is a tenuous statement. P(v|fk, d, y) is traditionally defined at locations with motion vectors; however, motion vectors are not transmitted for every block in the compressed video sequence. To overcome this problem, authors typically estimate d separately. This is analogous to altering P(yk|fk, d) when estimating the displacements.
When P (d) is not described by the non-informative distribution, differential methods
become common estimators for the displacements. These methods are based on the optical
flow equation and are explored in Segall et al. [18], [19]. In these works, the spatial domain
quantization noise model in Eq. (12) is combined with distributions for the motion vectors
in Eq. (15) and displacements in Eq. (20). The estimation problem is then expressed as
d̂l,k^{q+1} = arg min_{dl,k} [ (yl − AHC(dl,k)f̂k^q)^T KQ^{−1} (yl − AHC(dl,k)f̂k^q)
    + (MCl(ylP, vl) − AHC(dl,k)f̂k^q)^T KMV^{−1} (MCl(ylP, vl) − AHC(dl,k)f̂k^q)
    + λ3 dl,k^T Q3^T Q3 dl,k ]   (29)
Finding the displacements is accomplished by differentiating Eq. (29) with respect to dl,k
and setting the result equal to zero. This leads to a successive approximations algorithm
[18].
An alternative differential approach is utilized by Park et al. [20], [16]. In these works,
the motion between low resolution frames is estimated with the block based optical flow
method suggested by Lucas and Kanade [22]. Displacements are estimated for the low
resolution frames in this case.
B. Finding the Intensities
Methods for estimating the high resolution intensities from Eq. (23) are largely determined by the quantization noise model. For example, consider the least complicated
combination of the quantization constraint in Eq. (11) with the non-informative distributions for P (fk ) and P (v|fk , d, y). The intensity estimate is then stated as
f̂k^{q+1} ∈ FQ   (30)

where FQ denotes the set of intensities that satisfy the constraint −q(i)/2 ≤ [T(AHC(dl,k^{q+1})fk − MCl(ylP, vl))](i) ≤ q(i)/2 ∀i. Note that the solution to this problem is not unique, as the
∀i. Note that the solution to this problem is not unique, as the
set-theoretic method only limits the magnitude of the quantization error in the system
model. A frame that satisfies the constraint is therefore found with the projection onto
convex sets (POCS) algorithm [23], where sources for the projection equations include [8],
[9].
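One projection step of such a POCS algorithm can be sketched as below, assuming, as is standard, that the quantization interval for each coefficient is centered on the decoded value; the names are illustrative:

```python
import numpy as np

def project_coefficients(c, c_decoded, q):
    """One POCS projection: clip each transform coefficient c(i) of the
    candidate (mapped to low resolution) into the quantization interval
    [c_decoded(i) - q(i)/2, c_decoded(i) + q(i)/2]. Coefficients already
    inside the interval are untouched, so this is the projection onto
    the nearest point of the convex constraint set."""
    return np.clip(c, c_decoded - q / 2.0, c_decoded + q / 2.0)

q = np.array([4.0, 4.0, 4.0])
c_dec = np.array([12.0, -4.0, 0.0])   # decoded (quantized) coefficients
c = np.array([15.0, -4.5, 0.3])       # candidate estimate's coefficients
p = project_coefficients(c, c_dec, q)
```

Only the first coefficient violates its interval [10, 14] and is clipped to 14; the other two already satisfy the constraint, and projecting again leaves the result unchanged.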
A different approach must be followed when incorporating the spatial domain noise
model in Eq. (12). If we still assume a non-informative distribution for P(fk) and P(v|fk, d, y),
the estimate for the high resolution intensities is
f̂k^{q+1} = arg min_{fk} Σl [ (yl − AHC(d̂l,k^{q+1})fk)^T KQ^{−1} (yl − AHC(d̂l,k^{q+1})fk) ]   (31)
This can be found with a gradient descent algorithm [9].
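Such a gradient descent solver can be sketched as follows. The toy 1D problem and all names are our own; with KQ = I, the gradient is the sum of back-projected residuals:

```python
import numpy as np

def sr_gradient_descent(ys, Bs, f0, alpha=0.5, iters=10000):
    """Gradient descent on Eq. (31) with white noise (KQ = I): each
    B_l plays the role of A H C(d_l,k), mapping the high resolution
    frame to one low resolution observation, and each step follows
    the summed back-projected residuals."""
    f = f0.copy()
    for _ in range(iters):
        grad = sum(B.T @ (B @ f - y) for y, B in zip(ys, Bs))
        f -= alpha * grad
    return f

# Toy 1D problem: two 2-sample observations of a 4-sample signal,
# taken with shifted averaging kernels so the stacked system is invertible
f_true = np.array([1.0, 2.0, 3.0, 4.0])
B1 = np.array([[0.5, 0.5, 0.0, 0.0],
               [0.0, 0.0, 0.5, 0.5]])
B2 = np.array([[0.0, 0.5, 0.5, 0.0],
               [0.0, 0.0, 0.0, 0.5]])
ys = [B1 @ f_true, B2 @ f_true]
f_hat = sr_gradient_descent(ys, [B1, B2], np.zeros(4))
```

Because the two shifted observations together determine the signal uniquely, the iteration recovers f_true; this mirrors how sub-pixel shifts make super-resolution possible at all.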
Alternative combinations of P(f_k) and P(v|f_k, d, y) lead to more involved algorithms.
Nonetheless, the fundamental difference lies in the choice for the quantization noise model.
For example, combining the distribution for P(f_k) in Eq. (18) with the non-informative
prior for P(v|f_k, d, y) results in the estimate

\hat{f}_k^{q+1} = \arg\min_{f_k} \left\{ \frac{\lambda_1}{2} \| Q_1 f_k \|^2 + \frac{\lambda_2}{2} \| Q_2 AH f_k \|^2 - \log P(y|f_k, d) \right\}.   (32)
Interestingly, the estimate for f_k is not changed by substituting Eq. (13) or (14) for
the distribution P(v|f_k, d, y), as these distributions are non-informative to the intensity
estimate. An exception is the distribution in Eq. (15), which models the displaced frame
difference. When this P(v|f_k, d, y) is combined with the model for P(f_k) in Eq. (18), the
high resolution estimate becomes

\hat{f}_k^{q+1} = \arg\min_{f_k} \Big\{ \frac{\lambda_1}{2} \| Q_1 f_k \|^2 + \frac{\lambda_2}{2} \| Q_2 AH f_k \|^2
+ \sum_l \frac{1}{2} \left( MC_l(y_l, v_l) - AHC(\hat{d}_{l,k}^{q+1}) f_k \right)^T K_{MV}^{-1} \left( MC_l(y_l, v_l) - AHC(\hat{d}_{l,k}^{q+1}) f_k \right)
- \log P(y|f_k, d) \Big\}.   (33)
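To make the structure of Eq. (33) concrete, the objective (with the decoder likelihood term -log P(y|f_k, d) omitted for brevity) can be evaluated as below; the matrix names are illustrative stand-ins for the operators in the text, not an implementation from the cited work.

```python
import numpy as np

def cost_eq33(f, Q1, Q2, AH, mv_terms, lam1, lam2):
    """Regularized cost of Eq. (33), without the -log P(y|f,d) term.

    mv_terms : list of (mc, G, Kmv_inv) with mc = MC_l(y_l, v_l),
               G = A H C(d_{l,k}), and Kmv_inv the inverse of K_MV
    """
    J = 0.5 * lam1 * float(np.sum((Q1 @ f) ** 2))        # prior on f_k
    J += 0.5 * lam2 * float(np.sum((Q2 @ (AH @ f)) ** 2))  # prior on AHf_k
    for mc, G, K_inv in mv_terms:                         # motion-vector noise
        r = mc - G @ f
        J += 0.5 * float(r.T @ K_inv @ r)
    return J
```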
In concluding the sub-section, we utilize the estimate in Eq. (32) to compare the performance
of the quantization noise models. Two iterative algorithms are employed. For
the case that the quantization constraint in Eq. (11) is utilized, the estimate is found with
the iteration

\hat{f}_k^{q+1,s+1} = P_Q \left[ \hat{f}_k^{q+1,s} - \alpha_f \left( \lambda_1 Q_1^T Q_1 \hat{f}_k^{q+1,s} + \lambda_2 H^T A^T Q_2^T Q_2 AH \hat{f}_k^{q+1,s} \right) \right]   (34)

where \hat{f}_k^{q+1,s+1} and \hat{f}_k^{q+1,s} are estimates for \hat{f}_k^{q+1} at the (s+1)th and sth iterations of
the algorithm, respectively, \alpha_f controls the convergence and rate of convergence of the
algorithm, and P_Q denotes the projection operator that finds the solution to Eq. (30).
When the spatial domain noise model in Eq. (12) is utilized, we estimate intensities with
the iteration

\hat{f}_k^{q+1,s+1} = \hat{f}_k^{q+1,s} - \alpha_f \Big\{ -\sum_l C(\hat{d}_{l,k}^{q+1})^T H^T A^T K_Q^{-1} \left( y_l - AHC(\hat{d}_{l,k}^{q+1}) \hat{f}_k^{q+1,s} \right)
+ \lambda_1 Q_1^T Q_1 \hat{f}_k^{q+1,s} + \lambda_2 H^T A^T Q_2^T Q_2 AH \hat{f}_k^{q+1,s} \Big\}   (35)

where K_Q is the covariance matrix found in [10].
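One iteration of Eq. (35) can be sketched with the same dense-matrix simplification as before, where each G_l stands in for A H C(d̂_{l,k}); the names, toy dimensions, and parameter values are ours, not from [10].

```python
import numpy as np

def iterate_eq35(f, obs, Q1, Q2, AH, lam1, lam2, alpha):
    """One iteration of Eq. (35): a gradient step on the data term under
    the spatial-domain noise model plus the two prior-term gradients.

    obs : list of (y_l, G_l, Kq_inv_l) with G_l standing in for A H C(d)
    """
    step = np.zeros_like(f)
    for y, G, Kq_inv in obs:
        step += -G.T @ (Kq_inv @ (y - G @ f))        # data-term gradient
    step += lam1 * Q1.T @ (Q1 @ f)                   # prior on f
    step += lam2 * AH.T @ (Q2.T @ (Q2 @ (AH @ f)))   # prior on AHf
    return f - alpha * step
```

For a sufficiently small α_f the update is a contraction, so the iterates approach the unique regularized solution; this unique fixed point is what the text contrasts with the non-unique set-theoretic constraint of Eq. (30).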
The iterations in Eqs. (34) and (35) are applied to a single compressed bit-stream.
The bit-stream is generated by sub-sampling an original 352×288 image sequence by a
factor of two in both the horizontal and vertical directions. The decimated frames are
then compressed with an MPEG-4 encoder operating at 1 Mbps. This is a high bit-rate
simulation that maintains differences between temporally adjacent frames. (Lower bit-rates
generally result in less resolution enhancement for a given processing window.) The
original high resolution frame and decoded result appear in Fig. 5.
For the simulations, we first incorporate the non-informative P(f_k) so that λ1 = λ2 = 0.
Displacement information is then estimated prior to enhancement. This is motivated by
the quantization constraint, which complicates the displacement estimate, and the method
in [10] is utilized. Seven frames are incorporated into the estimate with T_B = T_F = 3,
and we choose α_f = 0.125.
Estimates in Figs. 6(a) and 6(b) show the super-resolved results from Eqs. (34) and (35),
respectively. As a first observation, notice that both estimates lead to a higher resolution
image. This is best illustrated by comparing the estimated frames and the bilinearly
interpolated result in Fig. 5(b). (Observe the text and texture regions in the right-hand
part of the frame; both are sharper than the interpolated image.) However, close inspection
of the two high resolution estimates shows that the image in Fig. 6(a) is corrupted by
artifacts near sharp boundaries. This is attributable to the quantization constraint noise
model, and it is introduced by registration errors as well as the non-unique solution of
the approach. In comparison, the normal approximation for the quantization noise in
Eq. (12) is less sensitive to registration errors. This is a function of the covariance matrix
K_Q and the unique fixed point of the algorithm. PSNR results for the estimated frames
support the visual assessment. The quantization constraint and normal approximation
models result in PSNRs of 28.8 dB and 33.4 dB, respectively.
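The PSNR figures quoted throughout follow the standard definition for 8-bit frames; as a reminder, it can be computed as below (this helper is our own, not code from the paper).

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB: 10 log10(peak^2 / MSE)."""
    ref = np.asarray(reference, dtype=float)
    est = np.asarray(estimate, dtype=float)
    mse = np.mean((ref - est) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```

For instance, a uniform error of 25.5 gray levels against a peak of 255 gives exactly 20 dB.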
A second set of experiments appears in Fig. 7. In these simulations, the prior model in
Eq. (18) is now utilized for P(f_k). Parameters are unchanged, except that λ1 = λ2 = 0.25
and α_f = 0.05. This facilitates comparison of the incorporated and non-informative priors
for P (fk ). Estimates from Eqs. (34) and (35) appear in Figs. 7(a) and 7(b), respectively.
Again, we see evidence of resolution enhancement in both frames. Moreover, incorporating
the quantization constraint no longer results in artifacts. This is a direct benefit of the
prior model P (fk ) that regularizes the solution. For the normal approximation method,
we see that the estimated frame is now smoother. This is a weakness of the method, as
it is sensitive to parameter selection. PSNR values for the two sequences are 32.4 dB and
30.5 dB, respectively.
As a final simulation, we employ Eq. (15) for P(v|f_k, d, y). This incorporates a model
for the motion vectors, and it requires an expansion of the algorithm in Eq. (32) as well
as a definition for K_MV. We utilize the methods in [18]. Parameters are kept the same as
in the previous experiments, except that λ1 = 0.01, λ2 = 0.02, and α_f = 0.125. The
estimated frame appears in Fig. 8. Resolution improvement is evident throughout the
image, and it leads to the largest PSNR value of all simulated algorithms: 33.7 dB.
VII. Directions of Future Research
In concluding this paper, we want to identify several research areas that will benefit
the field of super-resolution from compressed video. As a first area, we believe that
the simultaneous estimate of multiple high resolution frames should lead to improved
solutions. These enlarged estimates incorporate additional spatio-temporal descriptions
for the sequence and provide increased flexibility in modeling the scene. For example, the
temporal evolution of the displacements can be modeled. Note that there is some related
work in the field of compressed video processing, see for example Choi et al. [24].
Accurate estimates of the high resolution displacements are critical for the super-resolution
problem, and methods that improve these estimates are a second area of research. Optical flow techniques seem suitable for the general problem of resolution enhancement.
However, there is work to be done in designing methods for the blurred, sub-sampled,
aliased, and blocky observations provided by a decoder. Towards this goal, alternative
probability distributions within the estimation procedures are of interest. This is related
Fig. 5. Acquired image frames: (a) original image and (b) decoded result after bilinear interpolation.
The original image is down-sampled by a factor of two in both the horizontal and vertical directions and
then compressed.
Fig. 6. Resolution enhancement with two estimation techniques: (a) super-resolved image employing the
quantization constraint for P(y|f_k, d), (b) super-resolved image employing the normal approximation for
P(y|f_k, d). The method in (a) is susceptible to artifacts when registration errors occur, which is evident
around the calendar numbers and within the upper-right part of the frame. PSNR values for (a) and (b)
are 28.8 dB and 33.4 dB, respectively.
Fig. 7. Resolution enhancement with two estimation techniques: (a) super-resolved image employing
the quantization constraint for P(y|f_k, d), (b) super-resolved image employing the normal approximation
for P(y|f_k, d). The distribution in Eq. (18) is utilized for resolution enhancement. This regularizes the
method in (a). However, the technique in (b) is sensitive to parameter selection and becomes overly
smooth. PSNR values for (a) and (b) are 32.4 dB and 30.5 dB, respectively.
Fig. 8. Example high resolution estimate when information from the motion vectors is incorporated. This
provides further gains in resolution improvement and the largest quantitative measure in the simulations.
The PSNR for the frame is 33.7 dB.
to the work by Simoncelli et al. [25] and Nestares and Navarro [26]. Also, coarse-to-fine
estimation methods have the potential for further improvement, see for instance Luettgen
et al. [27]. Finally, we mention the use of banks of multi-directional/multi-scale representations for estimating the necessary displacements. A review of these methods appears in
Chamorro-Martı́nez [28].
Prior models for the high resolution intensities and displacements will also benefit from
future work. For example, the use of piecewise smooth models for the estimates will
improve the high resolution estimates. This is realized with line processes [29] or an object
based approach. For example, Irani and Peleg [30], Irani et al. [31], and Weiss and Adelson
[32] present the idea of segmenting frames into objects and then reconstructing each object
individually. This could benefit the resolution enhancement of compressed video. It also
leads to an interesting estimation scenario, as compression standards such as MPEG-4
provide object boundary information within the bit-stream.
Finally, resolution enhancement is often a precursor to some form of image analysis or
feature recognition task. Recently, there has been a trend to address these problems directly. The
idea is to learn priors from the images and then apply them to the high resolution problem.
The recognition of faces is a current focus, and relevant work is found in Baker and Kanade
[33], Freeman et al. [34], [35], and Capel and Zisserman [36]. With the increasing use
of digital video technology, it seems only natural that these investigations consider the
processing of compressed video.
References
[1] S. Chaudhuri, Ed., Super-Resolution Imaging, Kluwer Academic Publishers, 2001.
[2] S. Borman and R. Stevenson, “Spatial resolution enhancement of low-resolution image sequences: a comprehensive review with directions for future research,” Tech. Rep., Laboratory for Image and Signal Analysis, University of Notre Dame, 1998.
[3] A.N. Netravali and B.G. Haskell, Digital Pictures: Representation and Compression, Plenum Press, 1995.
[4] V. Bhaskaran and K. Konstantinides, Image and Video Compression Standards, Kluwer Academic Publishers, 1995.
[5] Y. Altunbasak, A.J. Patti, and R.M. Mersereau, “Super-resolution still and video reconstruction from MPEG-coded video,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, pp. 217–226, 2002.
[6] Y. Altunbasak and A.J. Patti, “A maximum a posteriori estimator for high resolution video reconstruction from MPEG video,” in IEEE International Conference on Image Processing, 2000, vol. 2, pp. 649–652.
[7] B.K. Gunturk, Y. Altunbasak, and R. Mersereau, “Bayesian resolution-enhancement framework for transform-coded video,” in IEEE International Conference on Image Processing, 2001, vol. 2, pp. 41–44.
[8] A.J. Patti and Y. Altunbasak, “Super-resolution image estimation for transform coded video with application to MPEG,” in IEEE International Conference on Image Processing, 1999, vol. 3, pp. 179–183.
[9] C.A. Segall, A.K. Katsaggelos, R. Molina, and J. Mateos, “Super-resolution from compressed video,” chapter 9 in Super-Resolution Imaging, S. Chaudhuri, Ed., pp. 211–242, Kluwer Academic Publishers, 2001.
[10] C.A. Segall, R. Molina, A.K. Katsaggelos, and J. Mateos, “Bayesian high-resolution reconstruction of low-resolution compressed video,” in IEEE International Conference on Image Processing, 2001, vol. 2, pp. 25–28.
[11] M.A. Robertson and R.L. Stevenson, “DCT quantization noise in compressed images,” in IEEE International Conference on Image Processing, 2001, vol. 1, pp. 185–188.
[12] D. Chen and R.R. Schultz, “Extraction of high-resolution video stills from MPEG image sequences,” in IEEE International Conference on Image Processing, 1998, vol. 2, pp. 465–469.
[13] B.K. Gunturk, Y. Altunbasak, and R.M. Mersereau, “Multiframe resolution-enhancement methods for compressed video,” IEEE Signal Processing Letters, vol. 9, pp. 170–174, 2002.
[14] J. Mateos, A.K. Katsaggelos, and R. Molina, “Resolution enhancement of compressed low resolution video,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000, vol. 4, pp. 1919–1922.
[15] J. Mateos, A.K. Katsaggelos, and R. Molina, “Simultaneous motion estimation and resolution enhancement of compressed low resolution video,” in IEEE International Conference on Image Processing, 2000, vol. 2, pp. 653–656.
[16] S.C. Park, M.G. Kang, C.A. Segall, and A.K. Katsaggelos, “Spatially adaptive high-resolution image reconstruction of low-resolution DCT-based compressed images,” in IEEE International Conference on Image Processing, 2002.
[17] S.C. Park, M.G. Kang, C.A. Segall, and A.K. Katsaggelos, “Spatially adaptive high-resolution image reconstruction of low-resolution DCT-based compressed images,” in submission, 2002.
[18] C.A. Segall, R. Molina, A.K. Katsaggelos, and J. Mateos, “Reconstruction of high-resolution image frames from a sequence of low-resolution and compressed observations,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002, vol. 2, pp. 1701–1704.
[19] C.A. Segall, A.K. Katsaggelos, R. Molina, and J. Mateos, “Bayesian resolution enhancement of compressed video,” in submission, 2002.
[20] S.C. Park, M.G. Kang, C.A. Segall, and A.K. Katsaggelos, “High-resolution image reconstruction of low-resolution DCT-based compressed images,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002, vol. 2, pp. 1665–1668.
[21] D.G. Luenberger, Linear and Nonlinear Programming, Addison-Wesley, Reading, MA, 1984.
[22] B.D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in Proceedings of Imaging Understanding Workshop, 1981, pp. 121–130.
[23] D.C. Youla and H. Webb, “Image restoration by the method of convex projections: Part 1: Theory,” IEEE Transactions on Medical Imaging, vol. MI-1, no. 2, pp. 81–94, 1982.
[24] M.C. Choi, Y. Yang, and N.P. Galatsanos, “Multichannel regularized recovery of compressed video sequences,” IEEE Transactions on Circuits and Systems II, vol. 48, pp. 376–387, 2001.
[25] E.P. Simoncelli, E.H. Adelson, and D.J. Heeger, “Probability distributions of optical flow,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1991, pp. 310–315.
[26] O. Nestares and R. Navarro, “Probabilistic estimation of optical flow in multiple band-pass directional channels,” Image and Vision Computing, vol. 19, pp. 339–351, 2001.
[27] M.R. Luettgen, W. Clem Karl, and A.S. Willsky, “Efficient multiscale regularization with applications to the computation of optical flow,” IEEE Transactions on Image Processing, vol. 3, pp. 41–64, 1994.
[28] J. Chamorro-Martínez, Desarrollo de modelos computacionales de representación de secuencias de imágenes y su aplicación a la estimación de movimiento (in Spanish), Ph.D. thesis, University of Granada, 2001.
[29] J. Konrad and E. Dubois, “Bayesian estimation of motion vector fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 9, pp. 910–927, 1992.
[30] M. Irani and S. Peleg, “Motion analysis for image enhancement: Resolution, occlusion, and transparency,” Journal of Visual Communication and Image Representation, vol. 4, pp. 324–335, 1993.
[31] M. Irani, B. Rousso, and S. Peleg, “Computing occluding and transparent motions,” International Journal of Computer Vision, vol. 12, no. 1, pp. 5–16, January 1994.
[32] Y. Weiss and E.H. Adelson, “A unified mixture framework for motion segmentation: incorporating spatial coherence and estimating the number of models,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1996, pp. 321–326.
[33] S. Baker and T. Kanade, “Limits on super-resolution and how to break them,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 1167–1183, 2002.
[34] W.T. Freeman, E.C. Pasztor, and O.T. Carmichael, “Learning low-level vision,” International Journal of Computer Vision, vol. 40, pp. 24–57, 2000.
[35] W.T. Freeman, T.R. Jones, and E.C. Pasztor, “Example-based super-resolution,” IEEE Computer Graphics and Applications, pp. 56–65, 2002.
[36] D. Capel and A. Zisserman, “Super-resolution from multiple views using learnt image models,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2001, vol. 2, pp. 627–634.
C. Andrew Segall received the B.S. and M.S. degrees in Electrical Engineering from Oklahoma State University in 1995 and 1997, respectively, and the Ph.D. degree in Electrical Engineering from Northwestern University in 2002. He is currently a post-doctoral researcher at Northwestern University, Evanston, IL. His research interests are in image processing and include recovery problems for compressed video, scale space theory, and nonlinear filtering. Dr. Segall is a member of Phi Kappa Phi, Eta Kappa Nu, SPIE, and the IEEE.
Rafael Molina was born in 1957. He received the degree in mathematics (statistics) in 1979 and the Ph.D. degree in optimal design in linear models in 1983. He became Professor of computer science and artificial intelligence at the University of Granada, Granada, Spain, in 2000. His areas of research interest are image restoration (applications to astronomy and medicine), parameter estimation, image and video compression, high resolution image reconstruction, and blind deconvolution. Dr. Molina is a member of SPIE, the Royal Statistical Society, and the Asociación Española de Reconocimiento de Formas y Análisis de Imágenes (AERFAI).
Aggelos K. Katsaggelos received the Diploma degree in electrical and mechanical engineering from the Aristotelian University of Thessaloniki, Greece, in 1979 and the M.S. and Ph.D. degrees, both in electrical engineering, from the Georgia Institute of Technology in 1981 and 1985, respectively.
In 1985 he joined the Department of Electrical and Computer Engineering at Northwestern University, where he is currently professor, holding the Ameritech Chair of Information Technology. He is also the Director of the Motorola Center for Communications. During the 1986-1987 academic year he was an assistant professor at Polytechnic University, Department of Electrical Engineering and Computer Science, Brooklyn, NY. Dr. Katsaggelos is a Fellow of the IEEE, an Ameritech Fellow, a member of the Associate Staff, Department of Medicine, at Evanston Hospital, and a member of SPIE. He is a member of the Publication Board of the IEEE Proceedings, the IEEE Technical Committees on Visual Signal Processing and Communications, and Multimedia Signal Processing, and an Editorial Board Member of Academic Press, Marcel Dekker: Signal Processing Series, Applied Signal Processing, and Computer Journal. He has served as editor-in-chief of the IEEE Signal Processing Magazine (1997-2002), a member of the Publication Boards of the IEEE Signal Processing Society, the IEEE TAB Magazine Committee, an Associate Editor for the IEEE Transactions on Signal Processing (1990-1992), an area editor for the journal Graphical Models and Image Processing (1992-1995), a member of the Steering Committees of the IEEE Transactions on Image Processing (1992-1997) and the IEEE Transactions on Medical Imaging (1990-1999), a member of the IEEE Technical Committee on Image and Multi-Dimensional Signal Processing (1992-1998), and a member of the Board of Governors of the IEEE Signal Processing Society (1999-2001). He is the editor of Digital Image Restoration (Springer-Verlag, Heidelberg, 1991), co-author of Rate-Distortion Based Video Compression (Kluwer Academic Publishers, 1997), and co-editor of Recovery Techniques for Image and Video Compression and Transmission (Kluwer Academic Publishers, 1998). He is the co-inventor of eight international patents and the recipient of the IEEE Third Millennium Medal (2000), the IEEE Signal Processing Society Meritorious Service Award (2001), and an IEEE Signal Processing Society Best Paper Award (2001).