MATRIX STRUCTURES AND PARALLEL ALGORITHMS FOR IMAGE SUPERRESOLUTION RECONSTRUCTION

QIANG ZHANG∗, RICHARD T. GUY†, AND ROBERT J. PLEMMONS‡
Abstract. Computational resolution enhancement (superresolution) is generally regarded as a memory intensive process due to the large matrix-vector calculations involved. In this paper, a detailed study of the structure of the $n^2 \times n^2$ superresolution matrix is used to decompose the matrix into 9 matrices of size $l^2 \times l^2$, where $l$ is the upsampling factor. As a result, previously large matrix-vector products can be broken into many small, parallelizable products. An algorithm is presented that utilizes the structural results to perform superresolution on compact, highly parallel architectures such as Field-Programmable Gate Arrays.
Key words. image superresolution, FPGA, parallel computation, structured matrices
AMS subject classifications. 65R32, 65F10, 65F50, 94A08
1. Introduction. Computational methods for resolution improvement (superresolution) have attracted much attention lately due in part to their ability to overcome the optical limitations of inexpensive, lower resolution sensors. See, for instance,
[6, 14, 16]. Superresolution (SR) is based on the idea that slight variations in the information encoded in a series of low resolution (LR) images can be used to recover a
high resolution (HR) image.
The basic superresolution problem can be posed as an inverse problem [1, 6],
$$ \min_f \; \| D H_i S_i f - g_i \|_2^2, \qquad i = 1, \dots, l^2, \qquad (1.1) $$
where f is the vectorized true high resolution image, gi is a vectorized lower resolution
image, D is the decimation matrix, Hi is a blurring matrix, Si is a shift matrix and l is
the upsampling factor. In the models that follow, the decimation matrix D is a local
averaging matrix that aggregates values of non-intersecting small neighborhoods of HR
pixels to produce LR pixel values. The shift matrix Si , also called the interpolation
matrix, assigns weights according to a bilinear interpolation of HR pixel values to
perform a rigid translation of the original image. The blurring matrix Hi is generated
from a point spread function (PSF) and represents distortion from atmospheric and
other sources. As will be explained further in Section 2, the $l^2$ matrices $D H_i S_i$ are usually stacked to create one large least squares problem
$$ \min_f \; \| A f - g \|_2^2, \qquad (1.2) $$
where, using MATLAB notation, $A = [D H_1 S_1; \dots; D H_{l^2} S_{l^2}]$, $g = [g_1; \dots; g_{l^2}]$, and $A \in \mathbb{R}^{n^2 \times n^2}$, where $n \times n$ is the dimension of the true high resolution image $f$.
The dimensionality of the problem is usually quite large. Given a moderate HR image size of 256 × 256 with $l = 4$, the naïve way to construct $A$ would require the $2l^2$ matrices $H_i$ and $S_i$, $i = 1, \dots, l^2$, each of size 65536 × 65536, plus one smaller matrix $D$ of size 4096 × 65536. The system matrix $A$ is sparse but is of size 65536 × 65536.
∗ Department of Biostatistical Sciences, Wake Forest University Health Sciences, Medical Center Boulevard, Winston-Salem, NC 27157 (qizhang@wfubmc.edu).
† Department of Mathematics, Wake Forest University, Winston-Salem, NC 27106.
‡ Departments of Mathematics and Computer Science, Wake Forest University, Winston-Salem, NC 27106.
This motivates a search for efficient SR algorithms, which has prompted various studies [5, 12]. To give one example, Nguyen et al. [12] proposed efficient block circulant preconditioners to accelerate the convergence of a conjugate gradient algorithm, whose complexity is $O(\sqrt{k}\, n^2)$, where $k$ is the condition number. Conjugate gradient algorithms and their variations are popular for this problem due to their strength in solving sparse systems.
Only recently have studies appeared that address implementation of SR algorithms with on-board hardware (System-on-Chips) [13, 15]. In those implementations, the SR model is simplified or expensive post-processing steps are included. This paper presents an algorithm that makes use of a detailed examination of the matrices $D$, $H_i$ and $S_i$ to replace large scale computations involving sparse matrices with a series of smaller operations which are readily parallelizable. The result is a Gauss-Seidel type algorithm optimized for use on highly parallel, compact architectures such as Field-Programmable Gate Arrays (FPGAs). In particular, the algorithm is suitable for use on hardware that can be integrated with a camera.
The paper proceeds as follows. In Section 2, we examine the block structures of
permutations of the matrices D, Hi and Si . The results motivate a simple algorithm
based on the Block Gauss-Seidel algorithm which is introduced and analyzed in Section 3. In Section 4 we present numerical evidence that the new algorithm produces
results comparable to the popular conjugate gradient algorithm despite very modest
computational and memory demands. Finally, in Section 5 we discuss the use of matrix structures for general matrix vector products on small scale hardware. All proofs
appear in the Appendix.
2. Matrix Structures. Consider the image superresolution reconstruction problem defined in (1.1). We can concatenate all the product matrices $D H_i S_i$, each of size $n^2/l^2 \times n^2$, to form a larger matrix $A$, having size $n^2 \times n^2$, and similarly we can concatenate all $g_i$ to form one $n^2 \times 1$ vector $g$. Thus we treat the original problem as the least squares problem given in (1.2).
The matrix A is often ill-conditioned, so a Tikhonov regularization term is applied [2],
$$ \min_f \; \| A f - g \|_2^2 + \alpha \| f \|_2^2. \qquad (2.1) $$
Well developed algorithms exist to solve (2.1) using iterative methods or by considering the normal equations
$$ (A^T A + \alpha I) f = A^T g \qquad (2.2) $$
(see [8]).
It is readily apparent that all three component matrices of A involve only local
operations, and thus we should expect A and possibly AT A to possess a sparse form.
For instance, if one assumes that the interpolation matrix Si represents spatially
invariant translational shifts (δxi , δyi ) then the entire n2 × n2 matrix is generated
by only two scalar quantities. The decimation and blurring matrices (subject to
conditions discussed later) also have a simply defined structure, and it is possible to
permute the matrix A to bring all of the non-zero elements into a tridiagonal structure.
The next four theorems make this notion more precise. For simplicity, we will first
ignore the blurring matrices Hi . Again, all proofs appear in the Appendix.
Theorem 2.1. Let $A = (DS_1; \dots; DS_{l^2})$ be a no-blurring superresolution system matrix with $S_i$ representing 2D rigid translational subpixel shifts, $\delta_{x_i}, \delta_{y_i} \in (-1, 1)$, and $D$ representing a weighted average decimation matrix. Then there exist permutations $Q$ and $P$ such that $Q^T A P$ has a tridiagonal block Toeplitz structure with tridiagonal block Toeplitz blocks, represented as
$$
Q^T A P = \begin{pmatrix}
B_0 & A_1 & & & \\
A_{-1} & A_0 & A_1 & & \\
& A_{-1} & A_0 & \ddots & \\
& & \ddots & \ddots & A_1 \\
& & & A_{-1} & A_0
\end{pmatrix}, \qquad (2.3)
$$
with
$$
A_i = \begin{pmatrix}
B_0^{(i)} & A_1^{(i)} & & & \\
A_{-1}^{(i)} & A_0^{(i)} & A_1^{(i)} & & \\
& A_{-1}^{(i)} & A_0^{(i)} & \ddots & \\
& & \ddots & \ddots & A_1^{(i)} \\
& & & A_{-1}^{(i)} & A_0^{(i)}
\end{pmatrix}, \qquad (2.4)
$$
where $B_0, A_i \in \mathbb{R}^{nl \times nl}$, $i = -1, 0, 1$, and $B_0^{(i)}, A_j^{(i)} \in \mathbb{R}^{l^2 \times l^2}$, $j = -1, 0, 1$. Both $Q^T A P$ and $A_i$ have $n/l \times n/l$ blocks.
Note that using a circulant boundary condition, as assumed later, we would have $B_0 = A_0$ and $B_0^{(i)} = A_0^{(i)}$, but if other boundary conditions are assumed, $B_0$ and $B_0^{(i)}$ differ from $A_0$ and $A_0^{(i)}$.
Before rigorously defining the permutation matrices P and Q in the Appendix,
we briefly explain here that P is equivalent to an alternate indexing method for
vectorizing a matrix. The typical way to vectorize a matrix follows a column-wise
ordering as illustrated in (2.5). Using a 256 × 256 matrix M, each number in the
matrix represents the position of that element in the vectorized matrix. For example,
the element on the first row and the second column of the original image will be the
257th entry of the vectorized matrix.
$$
\begin{pmatrix}
1 & 257 & 513 & 769 & 1025 & 1281 & 1537 & 1793 & \dots \\
2 & 258 & 514 & 770 & 1026 & 1282 & 1538 & 1794 & \dots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \\
256 & 512 & 768 & 1024 & 1280 & 1536 & 1792 & 2048 & \dots
\end{pmatrix} \qquad (2.5)
$$
If we assume an upsampling factor of 4, the following indexing method (2.6) represents the action of vectorizing the matrix $P^T M P$:
$$
\begin{pmatrix}
1 & 2 & 3 & 4 & 1025 & 1026 & 1027 & 1028 & \dots \\
5 & 6 & 7 & 8 & 1029 & 1030 & 1031 & 1032 & \dots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \\
1021 & 1022 & 1023 & 1024 & 2045 & 2046 & 2047 & 2048 & \dots
\end{pmatrix} \qquad (2.6)
$$
Now the element on the first row and the second column of the matrix $P^T M P$ is the 2nd entry. We call this vectorization method the “l-length block vectorization”. The motivation comes from the fact that the matrix $A$ in (1.2) is essentially a spatially local operator, which operates on spatially close pixels of the HR image $f$. The traditional column-wise ordering would leave pixels in the next column of the image $n$ pixels away. The l-length block ordering maintains more spatial information by leaving proximate pixels nearby in the vectorized $f$. The result is a more compact structure for spatially local operators like $A$.
The left permutation matrix $Q$ is the product $P \hat Q$ where $\hat Q$ maps element $k$ of the $i$th vectorized LR image $g_i$ to element $(k-1)l^2 + i$ of the stacked $g$ in (1.2), for $k = 1, \dots, (n/l)^2$. Intuitively, $\hat Q$ performs a perfect shuffle [7] on the $l^2$ blocks of size $(n/l)^2 \times n$. It is not necessary to explicitly construct and store the matrix $P$ in the computations, as the vectors $Pf$ and $Pg$ can be constructed from block-wise reorderings.
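To make these reorderings concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper; the helper names block_vectorize and shuffle_stack are hypothetical) of the l-length block vectorization and the perfect shuffle described above, assuming a square n × n image with n divisible by l:

import numpy as np

def block_vectorize(M, l):
    # l-length block vectorization of an n x n image M (the action of P):
    # the image is split into vertical strips of width l and, within each strip,
    # entries are taken row by row, l at a time, matching the ordering in (2.6).
    # (Illustrative helper, not from the paper.)
    n = M.shape[0]
    assert M.shape == (n, n) and n % l == 0
    return M.reshape(n, n // l, l).transpose(1, 0, 2).ravel()

def shuffle_stack(lr_vectors):
    # Perfect shuffle of the l^2 vectorized LR images (the action of Q-hat):
    # element k of the i-th LR vector is sent to position (k - 1) * l^2 + i
    # of the stacked vector, i.e. the images are interleaved element by element.
    G = np.stack(lr_vectors, axis=1)   # shape ((n/l)^2, l^2)
    return G.ravel()

# Toy check with the 256 x 256 example above: the (1, 2) entry of M, which is
# entry 257 under column-wise ordering, becomes the 2nd entry after reordering.
n, l = 256, 4
M = np.arange(1, n * n + 1).reshape(n, n, order="F")
assert block_vectorize(M, l)[1] == 257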
In the process of proving the theorem above, we note that A can be simplified
even further by putting constraints on δx or δy .
Corollary 2.2. The following conditions hold:
1. If all δxi ≥ 0, QT AP is an upper bidiagonal block Toeplitz matrix.
2. If all δxi ≤ 0, QT AP is a lower bidiagonal block Toeplitz matrix.
3. If all δyi ≥ 0, Ai is an upper bidiagonal block Toeplitz matrix.
4. If all δyi ≤ 0, Ai is a lower bidiagonal block Toeplitz matrix.
It is always possible to satisfy conditions 1 or 3 of Corollary 2.2 by choosing the
left most or upper most LR image as the reference. It is not possible in general to
satisfy both constraints unless the imaging system is designed such that the leftmost
and uppermost images are the same.
In general, if the matrix B is a tridiagonal block Toeplitz matrix, then B T B will
be a pentadiagonal block Toeplitz matrix with a rank-2 correction. However, as the
next result states, the outermost blocks in the pentadiagonal AT A are identically zero
and the correction is not necessary. This makes it easier to find a direct solution to
the normal equations (2.2). We formalize this notion in the following lemma and
theorem.
Lemma 2.3. $A_1^T A_{-1} = 0$ and $(A_1^{(i)})^T A_{-1}^{(j)} = 0$, where $i, j = -1, 0, 1$.
Theorem 2.4. Matrix P T AT AP has the same structure as matrix A in Theorem
2.1.
The two-level tridiagonal block Toeplitz structure in $Q^T A P$ and $P^T A^T A P$ introduces algorithmic efficiency by reducing the storage requirement to only nine $l^2 \times l^2$ matrices. In practice, Corollary 2.2 makes it possible to store the permuted system matrix in six $l^2 \times l^2$ matrices. We also see a simplification of the matrix vector products used in Conjugate Gradient Least Squares (CGLS) [11] and other iterative algorithms for sparse matrices. Similarly, the permuted normal equations satisfy a tridiagonal block structure that allows an efficient solution to (2.2).
In general, Hi can have many nonzero entries and the matrix A can suffer from a
more complicated structure. However, in many applications the nonzero elements of
the PSF are concentrated in a small circle around the center. By posing a moderate
limit on the size of the diameter containing nonzero entries, it is possible to retain
many of the patterns previously introduced.
Theorem 2.5. Let $A = (DH_1 S_1; \dots; DH_{l^2} S_{l^2})$ where $H_i$ represents a PSF of diameter less than or equal to $2l + 1$. Then $Q^T A P$ has a two level penta-diagonal block Toeplitz structure represented as
$$
Q^T A P = \begin{pmatrix}
B_0 & A_1 & A_2 & & & \\
A_{-1} & A_0 & A_1 & A_2 & & \\
A_{-2} & A_{-1} & A_0 & A_1 & \ddots & \\
& A_{-2} & A_{-1} & \ddots & \ddots & A_2 \\
& & \ddots & \ddots & \ddots & A_1 \\
& & & A_{-2} & A_{-1} & A_0
\end{pmatrix}, \qquad (2.7)
$$
with
$$
A_i = \begin{pmatrix}
B_0^{(i)} & A_1^{(i)} & A_2^{(i)} & & & \\
A_{-1}^{(i)} & A_0^{(i)} & A_1^{(i)} & A_2^{(i)} & & \\
A_{-2}^{(i)} & A_{-1}^{(i)} & A_0^{(i)} & A_1^{(i)} & \ddots & \\
& A_{-2}^{(i)} & A_{-1}^{(i)} & \ddots & \ddots & A_2^{(i)} \\
& & \ddots & \ddots & \ddots & A_1^{(i)} \\
& & & A_{-2}^{(i)} & A_{-1}^{(i)} & A_0^{(i)}
\end{pmatrix}, \qquad (2.8)
$$
where $A_j \in \mathbb{R}^{nl \times nl}$, $j = -2, \dots, 2$, and $A_j^{(i)} \in \mathbb{R}^{l^2 \times l^2}$, $i = -2, \dots, 2$.
As in the case without blurring, the matrix $P^T A^T A P$ has the same structure as $Q^T A P$.
Theorem 2.6. With A defined as in Theorem 2.5, P T AT AP has the same
structure as QT AP .
In some cases, it is advantageous to consider the structure of the sub matrices $A_j^{(i)}$. To that end, we have the following theorem:
Theorem 2.7. Under the hypotheses of Theorem 2.1, the following conditions summarize the tertiary (third level) structure of the permuted system matrix $Q^T A P$:
1. If $\{\delta_x\} = \{\delta_y\} = \{0, 1/z, \dots, (l-1)/z\}$ for some integer $z \in [1, \infty)$, sorted in $y$ then $x$, then $A_j^{(i)}$ is Block Hankel with $l \times l$ Hankel Blocks. When $z = 1$, all nonzero blocks have constant value $1/l^2$. Furthermore, if $j = 0$ then all blocks $H_{\bar i \bar j}$ such that $\bar i < \bar j$ are nonzero and if $j = -1$ then all blocks with $\bar i \ge \bar j$ are non-zero, where $\bar i, \bar j = 1, \dots, l$.
2. For all $\delta_x, \delta_y$, the following sum holds for $A_j^{(i)}$:
$$ \sum_{i=-1}^{1} \sum_{j=-1}^{1} A_j^{(i)} = \Big( \frac{1}{l^2} \Big)_{l^2 \times l^2}, \qquad (2.9) $$
where the right hand side is a constant matrix with each entry equal to $1/l^2$.
At this point, the structures of $Q^T A P$ and $P^T A^T A P$ are sufficiently simple to suggest efficient structured matrix algorithms to solve (2.2). In the next section, to avoid treating $B_0$ as a different diagonal block matrix, we assume a periodic boundary condition to effectively change it to $A_0$.
3. Algorithms. In this section we first present an algorithm to solve the normal
equations (2.2) with a chosen α that takes advantage of the specific matrix structure
introduced in the last section. The algorithm presented is a Block Gauss-Seidel approach with an inner Cyclic Reduction (no blurring) or Gauss-Seidel (blurring present)
iteration. The algorithm is first presented for the matrix in Theorem 2.1, then extended to the matrix in Theorem 2.5.
3.1. Without blurring. The Cyclic Reduction (CR) method [3, 8] is a direct
method for solving a linear system in which matrix A has a tridiagonal block Toeplitz
structure. After the first CR step, i.e., the even-odd permutations of both rows and columns, the matrix $P^T A^T A P$ in (2.3) becomes
$$
P^T A^T A P \;\Rightarrow\;
\begin{pmatrix}
A_0 & & & A_1 & & \\
& \ddots & & A_{-1} & \ddots & \\
& & A_0 & & A_{-1} & A_1 \\
A_{-1} & A_1 & & A_0 & & \\
& \ddots & \ddots & & \ddots & \\
& & A_{-1} & & & A_0
\end{pmatrix}. \qquad (3.1)
$$
We could follow the CR steps by inverting $A_0$ and multiplying it with $A_1$ and $A_{-1}$. However, given that the size of $A_i$ is $nl \times nl$, this is still computationally intensive. Notice that $A_0$ has the same first order structure as $A$, but with each block of smaller size $l^2 \times l^2$. Thus, we can use CR to solve a subproblem $A_0 x = b$. To set up the $n/2l$ subproblems, we introduce an outer block Gauss-Seidel iteration. That is, we first break $f$ into $n/l$ segments, $f_i \in \mathbb{R}^{nl}$, and at step $k$ we use $f_i^{(k)}$, where $i = n/2l + 1, \dots, n/l$, to solve for each $f_i^{(k+1)}$, $i = 1, \dots, n/2l$. The updating formula for the first half of $f$ at step $k$ is
$$
\begin{aligned}
A_0 f_1^{(k+1)} &= g_1 - A_1 f_{n/2l+1}^{(k)},\\
A_0 f_i^{(k+1)} &= g_i - A_{-1} f_{i+n/2l-1}^{(k)} - A_1 f_{i+n/2l}^{(k)}, \quad i = 2, \dots, n/2l.
\end{aligned} \qquad (3.2)
$$
The tridiagonal block Toeplitz structure within $A_1$ and $A_{-1}$ allows us to perform matrix-vector multiplications on the $l^2 \times l^2$ blocks. Each $f_i^{(k+1)}$ can be solved independently using CR since $A_0$ is a tridiagonal block Toeplitz matrix. Next we use the updated first half of $f^{(k+1)}$ to solve for the second half of $f^{(k+1)}$ using CR. The updating formula is
$$
\begin{aligned}
A_0 f_i^{(k+1)} &= g_i - A_1 f_{i-n/2l+1}^{(k+1)} - A_{-1} f_{i-n/2l}^{(k+1)}, \quad i = n/2l + 1, \dots, n/l - 1,\\
A_0 f_{n/l}^{(k+1)} &= g_{n/l} - A_{-1} f_{n/2l}^{(k+1)}.
\end{aligned} \qquad (3.3)
$$
For simplicity, we assume that the size of $A_0$ is a power of 2; however, there is an extension to other sizes that adds a slight computational penalty (see [8, Sec. 4.5.4]).
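As a sketch of how the outer iteration can be organized (our own illustration, not the authors' implementation), the following NumPy routine performs block Gauss-Seidel sweeps for a block tridiagonal Toeplitz system. It sweeps the block rows in natural order rather than the odd/even ordering of (3.2)-(3.3), and uses a direct dense solve with A0 in place of the inner cyclic reduction:

import numpy as np

def block_gs_tridiag_toeplitz(A0, A1, Am1, g, iters=5, f0=None):
    # Block Gauss-Seidel for Am1 f_{i-1} + A0 f_i + A1 f_{i+1} = g_i, i = 0..N-1,
    # where A0, A1, Am1 are b x b blocks and g has length N * b.
    # Simplified sketch: natural-order sweeps, dense solves instead of CR.
    b = A0.shape[0]
    N = g.size // b
    G = g.reshape(N, b)
    F = np.zeros((N, b)) if f0 is None else f0.reshape(N, b).copy()
    for _ in range(iters):
        for i in range(N):
            rhs = G[i].copy()
            if i > 0:
                rhs -= Am1 @ F[i - 1]   # segment already updated in this sweep
            if i < N - 1:
                rhs -= A1 @ F[i + 1]    # segment from the previous sweep
            F[i] = np.linalg.solve(A0, rhs)
    return F.ravel()

In the setting above, b = nl and each inner solve with A0 would itself be carried out by cyclic reduction, since A0 is again tridiagonal block Toeplitz with l^2 x l^2 blocks.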
One important feature of this approach is extensive parallelism. Each step reduces
to n/2l smaller scale subproblems that can be solved independently by n/2l processors,
which greatly enhances throughput. Furthermore, implementation of superresolution
on systems like FPGAs is not only feasible but desirable due to the FPGA’s strong
performance in parallel applications. Note as well that the algorithm only requires
matrix vector multiplication and that all multiplications are on the scale of l2 . As a
result, implementation is vastly simplified.
It is well known that the Gauss-Seidel iteration is absolutely convergent on $Sx = y$ provided the matrix $S$ is symmetric and positive definite (see the proof in [8, Sec. 10.1]). The regularized normal equation matrix $A^T A + \alpha I$ meets these criteria, and the proof can easily be extended to the block Gauss-Seidel method.
Theorem 3.1. The Block Gauss-Seidel algorithm described above converges for (2.2) from any $f^{(0)}$.
3.2. With blurring. The main difference when the blurring matrix is included
is that the first and second level structures are penta-diagonal rather than tridiagonal,
forcing us to abandon Cyclic Reduction to solve the inner problem. However, one
can utilize Gauss-Seidel on both the outer and inner iterations. We first group the
block rows and columns of P T AT AP into sets {3k + 1}, {3k + 2} and {3k + 3} for
$k = 0, \dots, n/3l - 1$, as shown in (3.4),
$$
P^T A^T A P \;\Rightarrow\;
\begin{pmatrix}
\hat{A}_0 & \hat{A}_1 & \hat{A}_2 \\
\hat{A}_{-1} & \hat{A}_0 & \hat{A}_1 \\
\hat{A}_{-2} & \hat{A}_{-1} & \hat{A}_0
\end{pmatrix}, \qquad (3.4)
$$
where each outer block $\hat{A}_i$ is assembled from the blocks $A_{-2}, \dots, A_2$ of the penta-diagonal matrix: $\hat{A}_0$ is block diagonal with diagonal blocks $A_0$, and the remaining $\hat{A}_i$ are block bidiagonal combinations of the off-diagonal blocks $A_j$.
There are $n/3l \times n/3l$ inner blocks in each of the $3 \times 3$ outer blocks. For convenience, we assume $n$ is divisible by $3l$; otherwise, it is always possible to add one or two extra zero columns and rows to the LR images to make $n$ divisible by $3l$. In (3.4), each $A_i$ is penta-diagonal block Toeplitz, which we can permute in the same way to create a second level $3 \times 3$ block form identical to (3.4) above. The block Gauss-Seidel iterations occur at two levels corresponding to the two level matrix structure. We first group the entries of $f$ in sets $\{3k+1\}$, $\{3k+2\}$ and $\{3k+3\}$, denoted $f_1$, $f_2$ and $f_3$, which will be updated iteratively in the first level as
$$
\begin{aligned}
\hat{A}_0 f_1^{(k+1)} &= g_1 - \hat{A}_1 f_2^{(k)} - \hat{A}_2 f_3^{(k)},\\
\hat{A}_0 f_2^{(k+1)} &= g_2 - \hat{A}_{-1} f_1^{(k+1)} - \hat{A}_1 f_3^{(k)},\\
\hat{A}_0 f_3^{(k+1)} &= g_3 - \hat{A}_{-2} f_1^{(k+1)} - \hat{A}_{-1} f_2^{(k+1)}.
\end{aligned} \qquad (3.5)
$$
For each sub-problem in (3.5), we use the same update rules on the second level block matrices, which are structurally identical. Each sub-problem requires a matrix inversion on the order of $l^2 \times l^2$, which can be performed once and stored. Absolute convergence is provable for each of the two levels of Gauss-Seidel iterations in (3.5), leading to a proof of absolute convergence for the combined two-level iteration.
Theorem 3.2. Using the algorithm described above for Problem (2.2), the iteration converges from any $f^{(0)}$.
4. Numerical Experiments. The algorithms in the last section were applied
to both a simulated satellite image and real images taken by a lenslet array camera on
an Air Force resolution target [10]. The algorithm in Section 3.1 was also implemented
on an FPGA development board [4]. In the next three subsections, we present details
of all three experiments.
Fig. 4.1. (a) Original satellite image. (b) Low resolution satellite image.
4.1. Simulated images. One original HR satellite image [11] of size 256 × 256, shown in Figure 4.1(a), is downsampled, interpolated using known $(\hat\delta_x, \hat\delta_y)$ and degraded with additive Gaussian noise to create 16 LR images of size 64 × 64, one of which appears in Figure 4.1(b). A sub-pixel registration algorithm [17] was then applied to the 16 LR images to estimate a set of $(\delta_x, \delta_y)$. The LR images combined with these offsets are used to reconstruct a 256 × 256 HR image.
We compare our algorithm with the CGLS method [9] in light of the general popularity of CG for solving sparse systems. A naïve CG implementation requires the construction of 16 shift matrices $S_i$ of size 65536 × 65536 plus a decimation matrix $D$ of size 4096 × 65536. Matrix-vector products occur with the entire 65536 × 1 vector. Figures 4.2(a) and 4.2(b) show the results of CG and the Block Gauss-Seidel with Cyclic Reduction (BGS-CR). It is clear that the results are nearly identical. The relative difference in Frobenius norm between the two results is .0218 and the mean square errors when compared to the true image are .0119 for CG and .0118 for BGS-CR. However, the BGS-CR algorithm takes 3.2 seconds on a 3.0GHz Pentium IV processor whereas CG takes 8.7 seconds. Both algorithms stopped after 5 iterations when no significant improvement was observed, i.e., when the mean difference between iterations was less than $10^{-4}$.
Much of the work in the CG algorithm goes into fully constructing the large matrices $D$ and $S_i$ on the scale of $n^2$, while our algorithm only needs to construct small inner blocks of $D$ and $S_i$ on the scale of $l^2$. Using the matrix structures presented in this paper to avoid explicit construction of the system matrix leads to a much faster CG implementation. In fact, the results in Section 2 can be used to create a matrix-vector multiplication function for use in any reconstruction method that only requires a matrix-vector multiplication. Such a function will have the advantage of reduced memory and computational complexity.
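For example, a structured matrix-vector product for a block tridiagonal Toeplitz matrix needs only its three small blocks. The NumPy sketch below is our own illustration, not the authors' code; the same idea extends to the two-level structure of Theorem 2.1 by replacing the dense block products with calls to the same routine:

import numpy as np

def tridiag_block_toeplitz_matvec(Am1, A0, A1, x):
    # y = T x, where T is block tridiagonal Toeplitz with diagonal block A0,
    # subdiagonal block Am1 and superdiagonal block A1 (each b x b);
    # x has length N * b and only the three small blocks are stored.
    b = A0.shape[0]
    N = x.size // b
    X = x.reshape(N, b)
    Y = X @ A0.T              # A0 x_i
    Y[1:] += X[:-1] @ Am1.T   # Am1 x_{i-1}
    Y[:-1] += X[1:] @ A1.T    # A1 x_{i+1}
    return Y.ravel()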
The same test was performed with the addition of a Gaussian blur to a noisy HR image to create the blurred and noisy HR image seen in Figure 4.3(a) before downsampling and interpolating. Reconstruction was performed using the two level Block Gauss-Seidel (BGS) algorithm described in Section 3.2. Figure 4.3(b) shows the recovered image, which needed a smaller regularization parameter due to the fact that the smoothing effect of the blur operator removed some of the original noise in the noisy HR image. Once again, results were comparable with the CG method, but BGS is much faster. The CG method without using the block structure took 24.5 seconds, while the two
level BGS algorithm took 15 seconds. We see a reduced speedup factor here because of the sequential processing of many more small matrix-vector multiplications on the scale of $l^2$. An even more dramatic improvement can be expected if the many small products are performed in parallel on a platform such as an FPGA.

Fig. 4.2. (a) Conjugate gradient method using the tridiagonal block Toeplitz structure, α = .1 and MSE = .0118. (b) Block Gauss-Seidel coupled with Cyclic Reduction method, α = .1 and MSE = .0119.

Fig. 4.3. (a) Blurred and noisy image. (b) Recovered image, α = .03.
4.2. Real images from a lenslet array camera. A lenslet array camera [10] was used to capture images of an Air Force resolution target in a lab setting. A 10 mega-pixel raw image is segmented into 16 subimages of size 128 × 128, which are then registered with the subpixel registration algorithm [17] and used to reconstruct a 512 × 512 HR image. Figure 4.4 compares the resolution of the reconstructed HR image on the left with that of a blown-up LR image. Clearly, we observe resolution enhancement.
4.3. FPGA implementation. Last, the algorithm in Section 3.1 was ported to a Xilinx Virtex 5 SX50T development board, with 52K logic cells, 594KB BlockRAM, 288 DSPs and 256MB DDR2 SDRAM. A custom 32-bit pipeline was designed, based on matrix-vector multiplication. A raw image from a lenslet array camera was segmented into 16 LR images of size 128 × 128, which were then registered with the subpixel registration algorithm [17] and used to reconstruct a 512 × 512 HR image. The LR images were read into the onboard memory through an Ethernet port and the reconstructed HR image was retrieved through a USB port and displayed on a 7 inch LCD. The current maximum processing capability is 2 frames per second (fps).
Fig. 4.4. Resolution enhancement after the reconstruction using real images taken by a lenslet
array camera.
Without the memory interface bottleneck, the processing capability can move up to 5 fps. The system is highly scalable, because the core co-processing element is an $l^2$ scale matrix-vector multiplication (MVM), which can be easily replicated for larger SR problems and placed on larger FPGAs. Thus the speedup should be near linear until memory bandwidth is exhausted. The current system has only one MVM core, while we project that a scaled system on a Virtex 5 LX330T could hold 6 MVM cores and thus reach an estimated performance of 30 fps. This translates into a speedup factor of approximately 383 when compared to a desktop computer running the same algorithm in MATLAB, and a speedup factor of 1,043 when compared to a desktop computer running the CGLS algorithm. However, regarding implementing the CGLS algorithm on an FPGA, it is important to note that it is not currently possible to store the system matrix A in onboard memory, even in a sparse format. Furthermore, very few tools for large scale matrix-vector products exist for FPGAs [4, 13].
5. Conclusions. Traditional models of superresolution lead to large, sparse matrices that are difficult to implement with on-board electronics. This paper presents algorithms that make use of the structure of the sparse matrices to replace large scale matrix vector products with small, parallelizable products. The algorithms introduced here cause no degradation in quality relative to current reconstruction methods, yet they offer a much faster reconstruction with limited memory requirements. As a result, a problem once thought difficult to implement with special purpose digital computers can be made to take full advantage of the small scale, highly parallel capabilities of such systems. In addition, the structural results are suitable for constructing matrix vector products for use in any algorithm in which they are required.
Acknowledgments. The authors thank Dr. James Nagy, Dr. Paúl Pauca, Dr. Sudhakar Prasad, Dr. Todd Torgersen and other researchers in the PERIODIC research group for their critiques [10]. The research described in this paper was supported in part by the Intelligence Advanced Research Projects Agency (IARPA) through the Defense Microelectronics Activity (DMEA) under cooperative agreement number H94003-08-2-0802, and by the Air Force Office of Scientific Research (AFOSR) under award number FA9550-08-1-0151.
REFERENCES
[1] R. Barnard, P. Pauca, T. Torgersen, R. Plemmons, S. Prasad, J. van der Gracht,
J. Nagy, J. Chung, G. Behrmann, S. Matthews, and M. Mirotznik, High-resolution
iris image reconstruction from low-resolution imagery, in Proc. SPIE, Advanced Signal
Processing Algorithms, Architectures, and Implementations, vol. 6313, San Diego, CA,
Aug. 2006, pp. 63130D1–63130D13.
[2] M. Bertero and P. Boccacci, Introduction to Inverse Problems in Imaging, Institute of
Physics Publishing, 1998.
[3] D. Bini and B. Meini, Solving block banded block Toeplitz systems with structured blocks: new
algorithms and open problems, Large-scale Scientific Computations of Engineering and
Environmental Problems II, Notes on Numerical Fluid Mechanics, 13 (2000), pp. 15–24.
[4] S.D. Brown, R.J. Francis, J. Rose, and Z.G. Vranesic, Field-Programmable Gate Arrays,
Springer, 1992.
[5] J. Chung, E. Haber, and J. Nagy, Numerical methods for coupled super-resolution, Inverse
Problems, 22 (2006), pp. 1261–1272.
[6] S. Farsiu, D. Robinson, M. Elad, and P. Milanfar, Advances and challenges in superresolution, International Journal of Imaging Systems and Technology, 14 (2004), pp. 47–57.
[7] S.W. Golomb, Permutations by cutting and shuffling, SIAM Review, 3 (1961), pp. 293–297.
[8] G.H. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press,
3rd ed., 1996.
[9] M.R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems,
Journal of Research of the National Bureau of Standards, 49 (1952), pp. 409–436.
[10] M. Mirotznik, S. Mathews, R. Plemmons, P. Pauca, T. Torgersen, R. Barnard, T. Guy,
B. Gray, Q. Zhang, J. van der Gracht, C. Petersen, M. Bodnar, and S. Prasad,
A practical enhanced-resolution integrated optical-digital imaging camera (PERIODIC),
in Proc. SPIE, vol. 7348, Orlando, FL, April 2009, Conference on Defense, Security and
Sensing.
[11] J. Nagy, R.J. Plemmons, and T.C. Torgersen, Iterative image restoration using approximate inverse preconditioning, IEEE Transactions on Image Processing, 5 (1996), pp. 1151–
1162.
[12] N. Nguyen, P. Milanfar, and G. Golub, A computational efficient image superresolution
algorithm, IEEE Transactions on Image Processing, 10 (2001), pp. 573–583.
[13] F.E. Ortiz, E.J. Kelmelis, J.P. Durbano, and D.W. Prather, FPGA acceleration of
superresolution algorithms for embedded processing in millimeter-wave sensors, in Proc.
SPIE, vol. 6548, May 2007, pp. 6548–0K.
[14] S.C. Park, M.K. Park, and M.G. Kang, Super-resolution image reconstruction: A technical
overview, IEEE Signal Processing Magazine, 20 (2003), pp. 21–36.
[15] D. Reddy, Z. Yue, and P. Topiwala, An efficient real time superresolution ASIC system, in
Proc. SPIE, vol. 6957, 2008, pp. 6957–09.
[16] R.S. Wagner, D.E. Waagen, and M.L. Cassabaum, Image super-resolution for improved
automatic target recognition, in Proc. SPIE, vol. 5426, Orlando FL, April 2004, pp. 188–
196.
[17] Q. Zhang, Analytical approximations of translational subpixel shifts in signal and image registrations, in Proc. SPIE, vol. 7074, San Diego, CA, August 2008, pp. 70740E1–70740E7.
Appendix A. Proofs.
The following section contains proofs of Theorems 2.1 to 2.7. The proofs revolve
around the definitions of the three component matrices D, Si , and Hi plus the permutation matrices P and Q. Structures of the matrices in question are revealed through
the row and column indices of nonzero entries. In particular, if A is a matrix such
that all nonzero entries on the ith row are within the range [i − l2 , i + l2 ], then A is a
tridiagonal block matrix having a block size of l2 × l2 .
A.1. Decimation matrix. We start with the decimation matrix, which can
also be regarded as a local mean matrix.
Definition A.1. We define a decimation operation, $D \in \mathbb{R}^{n^2/l^2 \times n^2}$, on a vectorized image, $f \in \mathbb{R}^{n^2 \times 1}$, as $g = Df$, such that the entries of $D$ are determined using the following averaging equation,
$$ g_{ij} = \Big( \sum_{\bar i=0}^{l-1} \sum_{\bar j=0}^{l-1} f_{i-\bar i,\, j-\bar j} \Big) \Big/ l^2, \qquad (A.1) $$
where $i, j$ are the row and column indices of the decimated image. The vectorization follows the typical column ordering.
As an example, the $(1,1)$ pixel of $g$ is an average of the pixels in the square from $(1,1)$ to $(l,l)$ in $f$. The structure of $D$ is given below.
Proposition A.2. Matrix $D$ defined above has a block diagonal structure given by
$$
D = \begin{pmatrix}
D_1 & & & \\
& D_1 & & \\
& & \ddots & \\
& & & D_1
\end{pmatrix}_{n^2/l^2 \times n^2}, \qquad (A.2)
$$
where $D$ has $n/l \times n/l$ blocks, and $D_1 = [D_2\; D_2\; \dots\; D_2] \in \mathbb{R}^{n/l \times nl}$, which also has a diagonal structure,
$$
D_2 = \begin{pmatrix}
v & & & \\
& v & & \\
& & \ddots & \\
& & & v
\end{pmatrix}_{n/l \times n}, \qquad (A.3)
$$
where $D_2$ also has $n/l \times n/l$ blocks, and $v = (1/l^2, 1/l^2, \dots, 1/l^2) \in \mathbb{R}^{1 \times l}$ is a constant row vector.
Proof. The structure follows from the definition of a traditional matrix vectorization. Furthermore, all values are equal to 1/l2 because D takes an unweighted
average.
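As a quick numerical check of Proposition A.2 (our own sketch, assuming the column-wise vectorization of Definition A.1; it is not part of the paper), the matrix D can be built for a small n as a Kronecker product of row-averaging matrices and compared against direct block averaging:

import numpy as np

n, l = 8, 2
B = np.kron(np.eye(n // l), np.ones((1, l))) / l   # (n/l) x n averaging of l consecutive entries
D = np.kron(B, B)                                  # decimation acting on the column-major vec(F)
F = np.random.rand(n, n)
g1 = D @ F.ravel(order="F")                                              # g = D f
g2 = F.reshape(n // l, l, n // l, l).mean(axis=(1, 3)).ravel(order="F")  # direct block means
assert np.allclose(g1, g2)
assert np.allclose(D[D != 0], 1 / l**2)            # all nonzero entries equal 1/l^2, as in (A.2)-(A.3)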
The nonzero entries in each row of matrix D are separated by a distance of n
entries, so di = (0, . . . , 0, v, 0, . . . , 0, v, 0, . . . , v, 0, . . . , v, 0, . . . , 0, . . . , 0), where di is the
ith row of D. However, we can group these entries into one continuous block of size
1 × l2 by moving the nonzero columns together through a permutation matrix P. The
permuted D is a diagonal Block Toeplitz matrix.
Definition A.3. Define the permutation matrix $P = (p_i)$ by the row vectors $p_i = e_{nl i_1 + n i_2 + i_3}$, where $i = 1, \dots, n^2$, $i_1 = \lfloor i/nl \rfloor$, $\tilde i = (i-1) \bmod nl$, $i_2 = \tilde i \bmod n$ and $i_3 = \lceil (\tilde i + 1)/n \rceil$. Here $e_j \in \mathbb{R}^{1 \times n^2}$ is the identity vector with the entry 1 at position $j$ and zeros at all other positions.
While it is not used until later, we include the definition of the permutation matrix $Q$ here for convenience.
Definition A.4. Define the permutation matrix $\hat Q$ such that column $i$ is the unit vector $e_{\lfloor (i-1)/(n^2/l^2) \rfloor + l^2 ((i-1) \bmod n^2/l^2)}$. The permutation matrix $Q$ is defined by the product $Q = P \hat Q$.
Proposition A.5. Under the permutation matrix $P$ defined above, matrix $D$ is a block diagonal matrix
$$
P^T D P = \begin{pmatrix}
\hat v & & & \\
& \hat v & & \\
& & \ddots & \\
& & & \hat v
\end{pmatrix}_{n^2/l^2 \times n^2}, \qquad (A.4)
$$
with $\hat v = (1/l^2, 1/l^2, \dots, 1/l^2) \in \mathbb{R}^{1 \times l^2}$.
Proof. The proof is by comparison between entries in (2.5) and the corresponding
entry in (2.6).
Proposition A.6. Matrix $(DP)^T DP = P^T D^T D P$ is also a block diagonal matrix, while $DP (DP)^T$ is a diagonal matrix, given by
$$
(DP)^T DP = \begin{pmatrix}
\hat D_2 & & & \\
& \hat D_2 & & \\
& & \ddots & \\
& & & \hat D_2
\end{pmatrix}_{n^2 \times n^2}, \qquad (A.5)
$$
where $\hat D_2 = \hat v^T \hat v \in \mathbb{R}^{l^2 \times l^2}$ is a constant matrix $(1/l^2)_{l^2 \times l^2}$.
Proof. Note that $P^T D^T D P = P^T D^T P\, P^T D P = (P^T D P)^T P^T D P$. Products involving non-zero elements only occur on the block diagonal, and thus $\hat v^T \hat v$ is the diagonal block of $(DP)^T DP$.
A.2. Shift matrix. The support of the shift matrix Si in (1.1) depends on the
2D translational shifts. To avoid a notation conflict in using index i, throughout this
section, we use S to represent Si in (1.1) for any i = 1, . . . , l2 , except in the proofs of
Theorem 2.1, Corollary 2.2 and Theorem 2.4.
Definition A.7. We define a shift operation, $S \in \mathbb{R}^{n^2 \times n^2}$, on a vectorized image, $f \in \mathbb{R}^{n^2 \times 1}$, as $\hat f = S f$, such that the entries of $S$ are determined by a 2D translational shift $(\delta_x, \delta_y)$ and the relationship between $\hat f_{ij}$ and $f_{ij}$ is given by
$$ \hat f_{ij} = w_{11} f_{i+\hat\delta_y,\, j+\hat\delta_x} + w_{12} f_{i+\hat\delta_y+1,\, j+\hat\delta_x} + w_{21} f_{i+\hat\delta_y,\, j+\hat\delta_x+1} + w_{22} f_{i+\hat\delta_y+1,\, j+\hat\delta_x+1}, \qquad (A.6) $$
where $\hat\delta_y = \lfloor \delta_y \rfloor$ and $\hat\delta_x = \lfloor \delta_x \rfloor$, the weights $w_{11}$ to $w_{22}$ are formulated as $(1-\tilde\delta_x)(1-\tilde\delta_y)$, $(1-\tilde\delta_y)\tilde\delta_x$, $(1-\tilde\delta_x)\tilde\delta_y$ and $\tilde\delta_x \tilde\delta_y$ respectively, and $\tilde\delta_x = \delta_x - \hat\delta_x$, $\tilde\delta_y = \delta_y - \hat\delta_y$.
Figure A.1 shows the grid of a 10 × 10 original image (in circles) and the shifted image (in squares), where $(\delta_x, \delta_y) = (1.5, -2.5)$. Here we use the image intensity values at the circles to linearly interpolate for intensity values at the square points.
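For reference, the integer offsets and the four interpolation weights in Definition A.7 can be computed as follows (our own sketch, not from the paper; the labeling of w11, ..., w22 follows the order in which the weights are listed after (A.6)):

import math

def bilinear_shift_weights(dx, dy):
    # Integer parts (delta-hat) and fractional parts (delta-tilde) of the shift,
    # and the four bilinear weights used by the shift matrix S.
    dx_hat, dy_hat = math.floor(dx), math.floor(dy)
    tx, ty = dx - dx_hat, dy - dy_hat
    w11 = (1 - tx) * (1 - ty)
    w12 = (1 - ty) * tx
    w21 = (1 - tx) * ty
    w22 = tx * ty
    return (dx_hat, dy_hat), (w11, w12, w21, w22)

Note that the four weights always sum to 1, which is the fact used in part 2 of Theorem 2.7.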
It is clear that the non-zero entries of $S$ can only be one of $w_{11}$ to $w_{22}$, and that they possess a regular pattern. In fact, using the column-wise ordering in the vectorization of $f$ and $\hat f$, we can pinpoint the four nonzero entries on the $(i + (j-1)n)$th row as
$$ i + \hat\delta_y + (j + \hat\delta_x - 1)n, \quad i + \hat\delta_y + 1 + (j + \hat\delta_x - 1)n, \quad i + \hat\delta_y + (j + \hat\delta_x)n, \quad i + \hat\delta_y + 1 + (j + \hat\delta_x)n. \qquad (A.7) $$
This corresponds to the matrix structure specified in the following proposition.
Fig. A.1. An illustration of the shift matrix, $S_i$.

Proposition A.8. Given a 2D rigid translational shift $(\delta_x, \delta_y)$, where $\delta_x \in (-l, l)$ and $\delta_y \in (-l, l)$, the shift matrix $S$ defined in Definition A.7 has a block Toeplitz form that can be determined in the following way. If $\delta_x > 0$,
$$
S = \begin{pmatrix}
0 & \cdots & 0 & S_1 & S_2 & & \\
& 0 & \cdots & 0 & S_1 & S_2 & \\
& & \ddots & & & \ddots & \ddots \\
& & & & & S_1 & S_2 \\
& & & & & & 0 \\
& & & & & & \vdots \\
& & & & & & 0
\end{pmatrix}_{n^2 \times n^2}, \qquad (A.8)
$$
where $S_1, S_2 \in \mathbb{R}^{n \times n}$. The number of columns of zero blocks to the left is $\hat\delta_x$ and the number of rows of zero blocks at the bottom is $\hat\delta_x + 1$. If $\delta_x < 0$,
$$
S = \begin{pmatrix}
0 & & & & & \\
\vdots & \ddots & & & & \\
0 & & & & & \\
S_1 & S_2 & & & & \\
0 & S_1 & S_2 & & & \\
& & \ddots & \ddots & & \\
& & & S_1 & S_2 & 0 \; \cdots \; 0
\end{pmatrix}_{n^2 \times n^2}, \qquad (A.9)
$$
where the number of columns of zero blocks to the right is $\hat\delta_x$ and the number of rows of zero blocks at the top is $\hat\delta_x + 1$.
If $\delta_y > 0$,
$$
S_i = \begin{pmatrix}
0 & \cdots & 0 & w_{i1} & w_{i2} & & \\
& 0 & \cdots & 0 & w_{i1} & w_{i2} & \\
& & \ddots & & & \ddots & \ddots \\
& & & & & w_{i1} & w_{i2} \\
& & & & & & 0 \\
& & & & & & \vdots \\
& & & & & & 0
\end{pmatrix}_{n \times n}, \quad i = 1, 2. \qquad (A.10)
$$
The number of columns of zero blocks to the left is $\hat\delta_y$ and the number of rows of zero blocks at the bottom is $\hat\delta_y + 1$. If $\delta_y < 0$,
$$
S_i = \begin{pmatrix}
0 & & & & & \\
\vdots & \ddots & & & & \\
0 & & & & & \\
w_{i1} & w_{i2} & & & & \\
0 & w_{i1} & w_{i2} & & & \\
& & \ddots & \ddots & & \\
& & & w_{i1} & w_{i2} & 0 \; \cdots \; 0
\end{pmatrix}_{n \times n}, \quad i = 1, 2, \qquad (A.11)
$$
where the number of columns of zero blocks to the right is $\hat\delta_y$ and the number of rows of zero blocks at the top is $\hat\delta_y + 1$.
Proof. The structure described by (A.8) to (A.11) is simply the matrix representation of (A.7). The details can be verified by the interested reader.
Next we permute S to gain a better structure.
Proposition A.9. The permuted matrix $P^T S P$ has a two level bidiagonal block Toeplitz structure. If $\delta_x > 0$,
$$
P^T S P = \begin{pmatrix}
\hat S_1 & \hat S_2 & & \\
& \hat S_1 & \hat S_2 & \\
& & \ddots & \hat S_2 \\
& & & \hat S_1
\end{pmatrix}_{n^2 \times n^2}, \qquad (A.12)
$$
and if $\delta_x < 0$,
$$
P^T S P = \begin{pmatrix}
\hat S_1 & & & \\
\hat S_2 & \hat S_1 & & \\
& \ddots & \ddots & \\
& & \hat S_2 & \hat S_1
\end{pmatrix}_{n^2 \times n^2}, \qquad (A.13)
$$
where $\hat S_i \in \mathbb{R}^{nl \times nl}$.
If $\delta_y > 0$,
$$
\hat S_i = \begin{pmatrix}
\hat S_{i1} & \hat S_{i2} & & \\
& \hat S_{i1} & \hat S_{i2} & \\
& & \ddots & \hat S_{i2} \\
& & & \hat S_{i1}
\end{pmatrix}_{nl \times nl}, \qquad (A.14)
$$
and if $\delta_y < 0$,
$$
\hat S_i = \begin{pmatrix}
\hat S_{i1} & & & \\
\hat S_{i2} & \hat S_{i1} & & \\
& \ddots & \ddots & \\
& & \hat S_{i2} & \hat S_{i1}
\end{pmatrix}_{nl \times nl}, \qquad (A.15)
$$
where $\hat S_{ij} \in \mathbb{R}^{l^2 \times l^2}$.
Proof. We again rely on (A.7) to identify the nonzero entries after applying both
row and column permutations using P .
After the vectorization using the new indexing method, the entry $(i, j)$ in $f$ will be at the position
$$ p_{ij} = (i-1)l + \tilde j\, nl + \hat j, \qquad (A.16) $$
where $\tilde j = \lfloor (j-1)/l \rfloor$ and $\hat j = (j-1) \bmod l + 1$.
The following inequalities define a two-level bidiagonal structure and make use of the restriction that $\hat\delta_x, \hat\delta_y \in [-l+1, l-1]$. The diagonal block of size $l^2 \times l^2$ is given by
$$
\begin{aligned}
p_{ij} - l^2 &\le p_{i+\hat\delta_y,\, j+\hat\delta_x} \le p_{ij} + l^2, &\qquad &(A.17)\\
p_{ij} - l^2 &\le p_{i+\hat\delta_y+1,\, j+\hat\delta_x} \le p_{ij} + l^2, &\qquad &(A.18)\\
p_{ij} - l^2 &\le p_{i+\hat\delta_y,\, j+\hat\delta_x+1} \le p_{ij} + l^2, &\qquad &(A.19)\\
p_{ij} - l^2 &\le p_{i+\hat\delta_y+1,\, j+\hat\delta_x+1} \le p_{ij} + l^2. &\qquad &(A.20)
\end{aligned}
$$
The upper block diagonal when $\delta_x > 0$ is given by
$$
\begin{aligned}
p_{ij} + nl - l^2 &\le p_{i+\hat\delta_y,\, j+\hat\delta_x+1} \le p_{ij} + nl + l^2, &\qquad &(A.21)\\
p_{ij} + nl - l^2 &\le p_{i+\hat\delta_y+1,\, j+\hat\delta_x+1} \le p_{ij} + nl + l^2, &\qquad &(A.22)
\end{aligned}
$$
while the lower block diagonal when $\delta_x < 0$ is given by
$$
\begin{aligned}
p_{ij} - nl - l^2 &\le p_{i+\hat\delta_y,\, j+\hat\delta_x+1} \le p_{ij} - nl + l^2, &\qquad &(A.23)\\
p_{ij} - nl - l^2 &\le p_{i+\hat\delta_y+1,\, j+\hat\delta_x+1} \le p_{ij} - nl + l^2. &\qquad &(A.24)
\end{aligned}
$$
To verify the Toeplitz structure, we only need to prove that the permuted $\hat S = P^T S P$ satisfies
$$ \hat s_{IJ} = \hat s_{I+l^2,\, J+l^2}. \qquad (A.25) $$
It is not difficult to verify that
$$ p_{ij} + l^2 = \begin{cases} p_{i+l,\, j} & \text{if } i \le n-l,\\ p_{i+l-n,\, j+l} & \text{if } i > n-l, \end{cases} \qquad (A.26) $$
which is the definition of an $l^2 \times l^2$ block Toeplitz structure. As an example, we note that
$$ p_{i+l+\hat\delta_y,\, j+\hat\delta_x} = p_{i+\hat\delta_y,\, j+\hat\delta_x} + l^2, $$
and
$$ p_{i+l-n+\hat\delta_y,\, j+l+\hat\delta_x} = p_{i+\hat\delta_y,\, j+\hat\delta_x} + l^2. $$
The three remaining nonzero entries on row $i$ can be verified in a similar manner.
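As a quick sanity check of the index formula (A.16) (our own sketch, not part of the paper), it can be compared against a direct block reordering of a small image:

import numpy as np

def p_index(i, j, n, l):
    # 1-based position of image entry (i, j) under the l-length block
    # vectorization, as given by (A.16).
    jt = (j - 1) // l
    jh = (j - 1) % l + 1
    return (i - 1) * l + jt * n * l + jh

n, l = 8, 2
M = np.arange(n * n).reshape(n, n)
v = M.reshape(n, n // l, l).transpose(1, 0, 2).ravel()   # block vectorization (action of P)
for i in range(1, n + 1):
    for j in range(1, n + 1):
        assert v[p_index(i, j, n, l) - 1] == M[i - 1, j - 1]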
Proposition A.10. Matrix $P^T (DS) P$ has a two-level bidiagonal block Toeplitz structure similar to that in (A.12) and (A.13), with second level blocks consisting of $1 \times l^2$ row vectors $\hat v \hat S_i$ instead of $\hat S_i$, for $i = 1, 2$.
Proof. Note that $P^T (DS) P = P^T D P (P^T S P)$. By Proposition A.5, $P^T D P$ has a block diagonal structure with each block of size $1 \times l^2$. By Proposition A.9, $P^T S P$ has a two level bidiagonal block Toeplitz structure with $l^2 \times l^2$ blocks. It follows that the product $P^T D S P$ is two level bidiagonal block Toeplitz with $1 \times l^2$ blocks.
A.2.1. Proof of Theorem 2.1. Each $P^T D S_i P$ is a block bidiagonal matrix with $1 \times l^2$ blocks, but it is important to note that not all of these matrices have the same non-zero diagonals; the particular non-zero diagonal depends on the signs of $\delta_x$ and $\delta_y$. However, if we stack them to form $\hat A = (P^T D S_1 P; \dots; P^T D S_{l^2} P)$ and then premultiply by $\hat Q$ to shuffle the $1 \times l^2$ blocks, we form a tridiagonal block Toeplitz matrix with $l^2 \times l^2$ second level blocks. This completes the proof of Theorem 2.1.
A.2.2. Proof of Corollary 2.2. The proof of this corollary is a natural extension of the two-level bidiagonal structure of P T Si P , proven in Proposition A.9. If all
δx or δy have the same sign then all P T DSi P are bidiagonal with the same non-zero
diagonals. It follows that one of the three diagonals in the tridiagonal block Toeplitz
matrix proved in A.2.1 is identically zero.
A.2.3. Proof of Lemma 2.3. Note that
P T (DS)T DSP = P T S T P (P T DT DP )P T SP,
and P T DT DP = (DP )T DP has a block diagonal structure with each block having
a size of l2 × l2 (see Proposition A.6). By Proposition A.10, P T SP has a two-level
bidiagonal block Toeplitz structure, so P T S T P also has a two-level bidiagonal block
Toeplitz structure, except the upper diagonal blocks will be transposed to the lower
diagonal and vice versa. Hence the multiplication of these three matrices will be a
two-level tridiagonal block Toeplitz structure.
A.2.4. Proof of Theorem 2.4. Note that $A^T A = \sum_{i=1}^{l^2} (D S_i)^T D S_i$ and thus $P^T A^T A P = \sum_{i=1}^{l^2} P^T (D S_i)^T D S_i P$.
A.3. Blurring matrix. The blurring matrix we consider here is the regular
block Toeplitz with Toeplitz block blurring matrix generated by a spatially invariant
n × n PSF matrix and zero boundary conditions. Furthermore, we assume the blur
is radially symmetric with a small diameter. A large diameter corresponds to more
off-diagonal entries in each Toeplitz block of H, and thus a more complex structure.
Here we impose a limit on the diameter to gain a simpler structure of H while still
accepting a large enough PSF for real applications. In particular, we have the following
proposition. Again, to avoid the notation conflict in using index i, throughout this
section, we use H to represent Hi in (1.1) for any i = 1, . . . , l2 , except in the proofs
of Theorem 2.5 and 2.6.
Proposition A.11. If the diameter of a spatially invariant PSF is not greater than $2l + 1$, then $P^T H P$ has a two level tridiagonal block Toeplitz structure.
Proof. We can write the blurring operation in matrix form as $\hat f = H f$, where $f$ and $\hat f$ are two vectorized images. Written explicitly, the entries of $\hat f$ are given by
$$ \hat f_{ij} = \sum_{\bar i=-l}^{l} \sum_{\bar j=-l}^{l} h_{\bar i \bar j}\, f_{i-\bar i,\, j-\bar j}. \qquad (A.27) $$
Note that we have applied the diameter limit of $2l + 1$ to the two summation indices, $\bar i$ and $\bar j$. The $(2l+1)^2$ nonzero entries on each row, except the first and last several rows, lie at entries $i - \bar i + (j - \bar j)n$, which represents a block Toeplitz with Toeplitz block structure.
After permutation, the matrix representation is
$$ P^T \hat f = P^T H P\, (P^T f), \qquad (A.28) $$
and the nonzero entries of $P^T H P$ are at $p_{i-\bar i,\, j-\bar j}$, where $p$ is defined in (A.16). The proof of the two level tridiagonal structure is equivalent to proving inequalities similar to those in Proposition A.9. For the diagonal block, we need
$$ p_{ij} - l^2 \le p_{i-\bar i,\, j-\bar j} \le p_{ij} + l^2. \qquad (A.29) $$
For the block off diagonal to the right, we need
$$ p_{ij} + nl - l^2 \le p_{i-\bar i,\, j-\bar j} \le p_{ij} + nl + l^2. \qquad (A.30) $$
For the block off diagonal to the left, we need
$$ p_{ij} - nl - l^2 \le p_{i-\bar i,\, j-\bar j} \le p_{ij} - nl + l^2. \qquad (A.31) $$
The inequalities follow from the domains of $\bar i$ and $\bar j$.
To verify the Toeplitz structure, we demonstrate that the permuted $\check H = P^T H P$ satisfies
$$ \check h_{IJ} = \check h_{I+l^2,\, J+l^2}. $$
Note that $I$ and $J$ are the row and column indices of matrix $\check H$ while $i$ and $j$ are the indices of the original image $f$, and their relationships are $I = p_{ij}$ and $J = p_{i-\bar i,\, j-\bar j}$. Now it becomes clear that
$$ p_{i-\bar i,\, j-\bar j} + l^2 = (i + l - \bar i - 1)l + \widetilde{j-\bar j}\; nl + \widehat{j-\bar j}, $$
where $\widetilde{j-\bar j} = \lfloor (j-\bar j-1)/l \rfloor$ and $\widehat{j-\bar j} = (j-\bar j-1) \bmod l + 1$. Thus $\bar i$ and $\bar j$ remain unchanged when $p_{i-\bar i,\, j-\bar j}$ increases by $l^2$, which in turn means $\check h_{I+l^2,\, J+l^2} = h_{\bar i, \bar j} = \check h_{I,J}$.
The next proposition, about the structure of $P^T H S P$, gives the key piece of the proofs of Theorems 2.5 and 2.6.
Proposition A.12. With $H$ defined in Proposition A.11 and $S$ defined in Definition A.7, $P^T H S P$ also has a two level tridiagonal block Toeplitz structure.
Proof. Once again, the proof utilizes the definitions of $H$ and $S$ in subscript form. We have that $\tilde f = H \hat f = H S f$ is equivalent to
$$
\begin{aligned}
\tilde f_{ij} &= \sum_{\bar i=-l}^{l} \sum_{\bar j=-l}^{l} h_{\bar i, \bar j}\, \hat f_{i-\bar i,\, j-\bar j}\\
&= \sum_{\bar i=-l}^{l} \sum_{\bar j=-l}^{l} h_{\bar i, \bar j} \big( w_{11} f_{i-\bar i+\hat\delta_y,\, j-\bar j+\hat\delta_x} + w_{12} f_{i-\bar i+\hat\delta_y+1,\, j-\bar j+\hat\delta_x}\\
&\qquad\qquad + w_{21} f_{i-\bar i+\hat\delta_y,\, j-\bar j+\hat\delta_x+1} + w_{22} f_{i-\bar i+\hat\delta_y+1,\, j-\bar j+\hat\delta_x+1} \big).
\end{aligned} \qquad (A.32)
$$
As an example, one of the nonzero entries on the row $p_{ij}$ is given by $p_{i-\bar i+\hat\delta_y,\, j-\bar j+\hat\delta_x}$. By the definition of $p_{ij}$, it is easy to see that the positions of the nonzero entries are effectively shifted by a fixed amount determined by $\delta_x$ on the first level and by $\delta_y$ on the second level. Due to the limitation that $|\delta_{(x,y)}| \le l$, we have a shifted two-level tri-diagonal structure.
This proposition comes as somewhat of a surprise, because we would normally expect that a two-level tri-diagonal block Toeplitz matrix ($P^T H P$) times a two-level bi-diagonal block Toeplitz matrix ($P^T S P$) would result in a two-level quadra-diagonal block Toeplitz structure. However, in our case, it remains tri-diagonal.
A.3.1. Proof of Theorem 2.5. The proof is similar to that of Theorem 2.1, because each block of $P^T D H_i S_i P$ is also of size $1 \times l^2$. Variations in the particular set of offsets $\delta_x$ and $\delta_y$ corresponding to $S_i$ imply that some $P^T D H_i S_i P$ have a tri-diagonal structure shifted to the right while others have one shifted to the left. Their concatenation into $Q^T A P$ is a two-level penta-diagonal block Toeplitz structure.
A.3.2. Proof of Theorem 2.6. Theorem 2.6 follows from Proposition A.12. Note that
$$ P^T (DHS)^T DHS P = (P^T (HS)^T P)(P^T D^T D P)(P^T HS P). $$
Since $P^T D^T D P$ is a block diagonal matrix, $(P^T (HS)^T P)(P^T D^T D P)(P^T (HS) P)$ has the same structure as $P^T (HS)^T P\, P^T (HS) P$. It follows that the product of two two-level tridiagonal block Toeplitz matrices is a two-level penta-diagonal block Toeplitz matrix. Since $A^T A = \sum_{i=1}^{l^2} (D H_i S_i)^T D H_i S_i$ and $P^T A^T A P = \sum_{i=1}^{l^2} P^T (D H_i S_i)^T D H_i S_i P$, the matrix $A^T A$ has the same two-level structure.
A.3.3. Proof of Theorem 2.7. Note that $S_i$ is sparse with non-zero diagonals whose weights correspond to a bilinear interpolation defined by $\delta_x^{(i)}, \delta_y^{(i)}$. This allows us to restrict our attention to the $nl \times n^2$ submatrix
$$ [A_{-1}\;\; A_0\;\; A_1\;\; 0\; \dots\; 0] \qquad (A.33) $$
and the $l^2 \times nl$ submatrix
$$ [A_i^{(-1)}\;\; A_i^{(0)}\;\; A_i^{(1)}\;\; 0\; \dots\; 0] \qquad (A.34) $$
and use the first and second order Toeplitz structure of $A$.
Furthermore, we have the following lemma whose proof follows by direct calculation.
Lemma A.13. The following structural descriptions hold:
1. Only the entries $s_{i,j}$ of $S$ with $nl < i \le 2nl$ and $l < (j \bmod n) \le 2l$ contribute to $A_i^{(j)}$ under the product $Q^T D S P$.
2. Row $k$ of $A_i^{(j)}$ contains information from only $D S_k$.
Proof. The proof involves a partition of rows of matrices DSi so their permuted
form can be investigated. We first label the rows of Si so that row α ≡ k mod n has
label ck . The rows are then labeled
[c1 c2 . . . cn c1 . . . cn . . . c1 . . . cn ]T
with n repetitions of c1 , . . . , cn . Under the permutation P T Si P the labels are reordered
as
[c1 . . . c1 c2 . . . c2 . . . cn . . . cn ]T
(A.35)
such that the l rows labeled ck are clustered together. The label pattern repeats
after nl rows because the permutation P is closed on sets of indices of length nl.
The operation DP averages l2 consecutive rows into one row of DSi P. Finally, the
permutation Q shuffles rows of DSi P into rows j ≡ i mod l2 . Excluding the first
repetition of row labels in (A.35) we have that the sub matrix [A−1 A0 A1 0 . . . 0] is
comprised of one set of l2 rows of DSi P for each i ≤ l2 , proving part (2).
Part (1) of the lemma follows from the preceding analysis and Definition A.7.
Using Lemma A.13, it is possible to completely characterize $A_i^{(j)}$ from only $3l^2$ elements of each $S_k$ for $1 \le k \le l^2$.
Next, we examine the effect of multiplying $P^T S P$ by $DP$. Note that $DP$ averages columns in row blocks of size $l^2$. Thus, the proof reduces to an investigation of the support of rows $n + l^2 + 1, \dots, n + 2l^2$ of each $P^T S_i P$ (the first few blocks form $B$ and the analysis follows similarly). Partition these rows into $l^2 \times l^2$ blocks $H_i$, which further divide into $l \times l$ blocks $J_{\alpha,\beta}$. Within each $H_i$, the permutation $P^T S_i P$ shuffles the $J_{\alpha,\beta}$ such that the first $l \times l$ block of the permuted block $\hat H_i$ contains the $(1,1)$ element from each $J_{(\alpha,\beta)}$. The first block has the format
$$
\begin{pmatrix}
J_{(1,1),1,1} & J_{(1,2),1,1} & \dots & J_{(1,l),1,1}\\
J_{(2,1),1,1} & J_{(2,2),1,1} & \dots & J_{(2,l),1,1}\\
\vdots & & & \vdots\\
J_{(l,1),1,1} & J_{(l,2),1,1} & \dots & J_{(l,l),1,1}
\end{pmatrix}, \qquad (A.36)
$$
where the left indices identify a block $J$ and the right indices identify an element in the block.
Part (1) of Theorem 2.7 follows directly. Part (2) follows because the sum of the
4 weights in each bilinear interpolation is 1.
A.4. Proof of Theorem 3.1. The matrices $A^T A + \alpha I$ and $A_0$ are positive definite, so we can refer to Theorem 10.1.2 in [8]. However, Theorem 10.1.2 concerns the Gauss-Seidel iteration method, not the block Gauss-Seidel iteration introduced here. Most of the proof extends naturally, but we clarify one less obvious point. Using the notation in [8], we define $G = -(D+L)^{-1} L^T$, where $D = \mathrm{diag}(A_0, A_0, \dots, A_0)$ and $L$ is a strictly lower triangular matrix. We need to prove that
$$ G_1 \equiv D^{1/2} G D^{-1/2} = -(I + L_1)^{-1} L_1^T, \qquad (A.37) $$
where $L_1 = D^{-1/2} L D^{-1/2}$, or equivalently,
$$ D^{1/2} (D+L)^{-1} D^{1/2} = (I + L_1)^{-1}. \qquad (A.38) $$
When $D$ is only a diagonal matrix, it is easy to verify (A.38), but in this case $D$ is block diagonal. This proves not to be a problem. Notice that $P^T A^T A P + \alpha I$ has a $2 \times 2$ block form, and thus we can explicitly write its inverse,
$$ (D+L)^{-1} = \begin{pmatrix} D_0 & 0\\ L & D_0 \end{pmatrix}^{-1} = \begin{pmatrix} D_0^{-1} & 0\\ -D_0^{-1} L D_0^{-1} & D_0^{-1} \end{pmatrix}, \qquad (A.39) $$
where $D_0$ is the upper left or the lower right block and $L$ is the lower left block of $P^T A^T A P + \alpha I$ in (3.1). Then we multiply by $D^{1/2}$ on both sides to get
$$ D^{1/2} (D+L)^{-1} D^{1/2} = D^{1/2} \begin{pmatrix} D_0^{-1} & 0\\ -D_0^{-1} L D_0^{-1} & D_0^{-1} \end{pmatrix} D^{1/2} = \begin{pmatrix} I & 0\\ -D_0^{-1/2} L D_0^{-1/2} & I \end{pmatrix}. \qquad (A.40) $$
It is easy to verify that the right side of the equation above is $(I + L_1)^{-1}$.
A.5. Proof of Theorem 3.2. Again, $A^T A + \alpha I$ and $A_0$ are positive definite and we can refer to Theorem 10.1.2 in [8]. We need to verify that
$$ G_1 \equiv D^{1/2} G D^{-1/2} = -(I + D^{-1/2} L D^{-1/2})^{-1} (D^{-1/2} L D^{-1/2})^T, \qquad (A.41) $$
or equivalently,
$$ D^{1/2} (D+L)^{-1} D^{1/2} = (I + D^{-1/2} L D^{-1/2})^{-1}. \qquad (A.42) $$
Notice that $A^T A + \alpha I$ has a $3 \times 3$ block form and we can explicitly write out its inverse,
$$ (D+L)^{-1} = \begin{pmatrix} \hat A_0 & 0 & 0\\ \hat A_{-1} & \hat A_0 & 0\\ \hat A_{-2} & \hat A_{-1} & \hat A_0 \end{pmatrix}^{-1} = \begin{pmatrix} \hat A_0^{-1} & 0 & 0\\ -\hat A_0^{-1} \hat A_{-1} \hat A_0^{-1} & \hat A_0^{-1} & 0\\ \hat A_0^{-1} (\hat A_{-1} \hat A_0^{-1} \hat A_{-1} - \hat A_{-2}) \hat A_0^{-1} & -\hat A_0^{-1} \hat A_{-1} \hat A_0^{-1} & \hat A_0^{-1} \end{pmatrix}, $$
where $\hat A_0$ is the diagonal block, which is positive definite. Hence,
$$ D^{1/2} (D+L)^{-1} D^{1/2} = \begin{pmatrix} I & 0 & 0\\ -\hat A_0^{-1/2} \hat A_{-1} \hat A_0^{-1/2} & I & 0\\ \hat A_0^{-1/2} (\hat A_{-1} \hat A_0^{-1} \hat A_{-1} - \hat A_{-2}) \hat A_0^{-1/2} & -\hat A_0^{-1/2} \hat A_{-1} \hat A_0^{-1/2} & I \end{pmatrix}. \qquad (A.43) $$
It is easy to verify that the right side of (A.43) is $(I + D^{-1/2} L D^{-1/2})^{-1}$.