MATRIX STRUCTURES AND PARALLEL ALGORITHMS FOR IMAGE SUPERRESOLUTION RECONSTRUCTION

QIANG ZHANG*, RICHARD T. GUY†, AND ROBERT J. PLEMMONS‡

Abstract. Computational resolution enhancement (superresolution) is generally regarded as a memory intensive process due to the large matrix-vector calculations involved. In this paper, a detailed study of the structure of the $n^2 \times n^2$ superresolution matrix is used to decompose the matrix into nine matrices of size $l^2 \times l^2$, where $l$ is the upsampling factor. As a result, previously large matrix-vector products can be broken into many small, parallelizable products. An algorithm is presented that utilizes the structural results to perform superresolution on compact, highly parallel architectures such as Field-Programmable Gate Arrays.

Key words. image superresolution, FPGA, parallel computation, structured matrices

AMS subject classifications. 65R32, 65F10, 65F50, 94A08

1. Introduction. Computational methods for resolution improvement (superresolution) have attracted much attention lately, due in part to their ability to overcome the optical limitations of inexpensive, lower resolution sensors. See, for instance, [6, 14, 16]. Superresolution (SR) is based on the idea that slight variations in the information encoded in a series of low resolution (LR) images can be used to recover a high resolution (HR) image. The basic superresolution problem can be posed as an inverse problem [1, 6],

\[ \min_f \; \| D H_i S_i f - g_i \|_2^2, \quad i = 1, \dots, l^2, \tag{1.1} \]

where $f$ is the vectorized true high resolution image, $g_i$ is a vectorized lower resolution image, $D$ is the decimation matrix, $H_i$ is a blurring matrix, $S_i$ is a shift matrix and $l$ is the upsampling factor. In the models that follow, the decimation matrix $D$ is a local averaging matrix that aggregates values of non-intersecting small neighborhoods of HR pixels to produce LR pixel values. The shift matrix $S_i$, also called the interpolation matrix, assigns weights according to a bilinear interpolation of HR pixel values to perform a rigid translation of the original image. The blurring matrix $H_i$ is generated from a point spread function (PSF) and represents distortion from atmospheric and other sources.

As explained further in Section 2, the $l^2$ matrices $D H_i S_i$ are usually stacked to create one large least squares problem

\[ \min_f \; \| A f - g \|_2^2, \tag{1.2} \]

where, using MATLAB notation, $A = [D H_1 S_1; \dots; D H_{l^2} S_{l^2}]$, $g = [g_1; \dots; g_{l^2}]$, and $A \in \mathbb{R}^{n^2 \times n^2}$, where $n \times n$ is the dimension of the true high resolution image $f$.

The dimensionality of the problem is usually quite large. Given a moderate HR image size of $256 \times 256$ with upsampling factor $l = 4$, the naïve way to construct $A$ would require the $2l^2$ matrices $H_i$ and $S_i$, $i = 1, \dots, l^2$, each of size $65536 \times 65536$, plus one smaller matrix $D$ of size $4096 \times 65536$. The system matrix $A$ is sparse but is of size $65536 \times 65536$.

*Department of Biostatistical Sciences, Wake Forest University Health Sciences, Medical Center Boulevard, Winston-Salem, NC 27157 (qizhang@wfubmc.edu).
†Department of Mathematics, Wake Forest University, Winston-Salem, NC 27106.
‡Departments of Mathematics and Computer Science, Wake Forest University, Winston-Salem, NC 27106.

This motivates a search for efficient SR algorithms, which has prompted various studies [5, 12]. To give one example, Nguyen et al. [12] proposed efficient block circulant preconditioners to accelerate convergence of a conjugate gradient algorithm, whose complexity is $O(\sqrt{\kappa}\, n^2)$, where $\kappa$ is the condition number.
Conjugate gradient algorithms and variations are popular for this problem due to their strength in solving sparse systems. Only recently have studies appeared that address implementation of SR algorithms with on-board hardware (Systems-on-Chip) [13, 15]. In those implementations, the SR model is simplified or expensive post-processing steps are included.

This paper presents an algorithm that makes use of a detailed examination of the matrices $D$, $H_i$ and $S_i$ to replace large scale computations involving sparse matrices with a series of smaller operations which are readily parallelizable. The result is a Gauss-Seidel type algorithm optimized for use on highly parallel, compact architectures such as Field-Programmable Gate Arrays (FPGAs). In particular, the algorithm is suitable for use on hardware that can be integrated with a camera.

The paper proceeds as follows. In Section 2, we examine the block structures of permutations of the matrices $D$, $H_i$ and $S_i$. The results motivate a simple algorithm based on the Block Gauss-Seidel algorithm, which is introduced and analyzed in Section 3. In Section 4 we present numerical evidence that the new algorithm produces results comparable to the popular conjugate gradient algorithm despite very modest computational and memory demands. Finally, in Section 5 we discuss the use of matrix structures for general matrix-vector products on small scale hardware. All proofs appear in the Appendix.

2. Matrix Structures. Consider the image superresolution reconstruction problem defined in (1.1). We can concatenate all the product matrices $D H_i S_i$, each of size $n^2/l^2 \times n^2$, to form a larger matrix $A$ of size $n^2 \times n^2$, and similarly we can concatenate all $g_i$ to form one $n^2 \times 1$ vector $g$. Thus we treat the original problem as the least squares problem given in (1.2). The matrix $A$ is often ill-conditioned and a Tikhonov regularization term is applied [2],

\[ \min_f \; \| A f - g \|_2^2 + \alpha \| f \|_2^2. \tag{2.1} \]

Well developed algorithms exist to solve (2.1) using iterative methods or by considering the normal equations

\[ (A^T A + \alpha I) f = A^T g \tag{2.2} \]

(see [8]). It is readily apparent that all three component matrices of $A$ involve only local operations, and thus we should expect $A$, and possibly $A^T A$, to possess a sparse form. For instance, if one assumes that the interpolation matrix $S_i$ represents spatially invariant translational shifts $(\delta_{x_i}, \delta_{y_i})$, then the entire $n^2 \times n^2$ matrix is generated by only two scalar quantities. The decimation and blurring matrices (subject to conditions discussed later) also have a simply defined structure, and it is possible to permute the matrix $A$ to bring all of the non-zero elements into a tridiagonal structure. The next four theorems make this notion more precise. For simplicity, we will first ignore the blurring matrices $H_i$. Again, all proofs appear in the Appendix.

Theorem 2.1. Let $A = (D S_1; \dots; D S_{l^2})$ be a no-blurring superresolution system matrix with $S_i$ representing 2D rigid translational subpixel shifts, $\delta_{x_i}, \delta_{y_i} \in (-1, 1)$, and $D$ representing a weighted average decimation matrix. Then there exist permutations $Q$ and $P$ such that $Q^T A P$ has a tridiagonal block Toeplitz structure with tridiagonal block Toeplitz blocks, represented as

\[
Q^T A P = \begin{bmatrix}
B_0 & A_1 & & & \\
A_{-1} & A_0 & A_1 & & \\
 & A_{-1} & A_0 & \ddots & \\
 & & \ddots & \ddots & A_1 \\
 & & & A_{-1} & A_0
\end{bmatrix}, \tag{2.3}
\]

with

\[
A_i = \begin{bmatrix}
B_0^{(i)} & A_1^{(i)} & & & \\
A_{-1}^{(i)} & A_0^{(i)} & A_1^{(i)} & & \\
 & A_{-1}^{(i)} & A_0^{(i)} & \ddots & \\
 & & \ddots & \ddots & A_1^{(i)} \\
 & & & A_{-1}^{(i)} & A_0^{(i)}
\end{bmatrix}, \tag{2.4}
\]

where $B_0, A_i \in \mathbb{R}^{nl \times nl}$, $i = -1, 0, 1$, and $B_0^{(i)}, A_j^{(i)} \in \mathbb{R}^{l^2 \times l^2}$, $j = -1, 0, 1$. Both $Q^T A P$ and $A_i$ have $n/l \times n/l$ blocks.
Note that using a circulant boundary condition, as assumed later, we would have $B_0 = A_0$ and $B_0^{(i)} = A_0^{(i)}$, but if other boundary conditions are assumed, $B_0$ and $B_0^{(i)}$ differ from $A_0$ and $A_0^{(i)}$.

Before rigorously defining the permutation matrices $P$ and $Q$ in the Appendix, we briefly explain here that $P$ is equivalent to an alternate indexing method for vectorizing a matrix. The typical way to vectorize a matrix follows a column-wise ordering as illustrated in (2.5). Using a $256 \times 256$ matrix $M$, each number in the array below represents the position of that element in the vectorized matrix. For example, the element in the first row and the second column of the original image will be the 257th entry of the vectorized matrix:

\[
\begin{bmatrix}
1 & 257 & 513 & 769 & 1025 & 1281 & 1537 & 1793 & \dots \\
2 & 258 & 514 & 770 & 1026 & 1282 & 1538 & 1794 & \dots \\
\vdots & & & & & & & & \\
256 & 512 & 768 & 1024 & 1280 & 1536 & 1792 & 2048 & \dots
\end{bmatrix}. \tag{2.5}
\]

If we assume an upsampling factor of 4, the following indexing method represents the action of vectorizing the matrix $P^T M P$:

\[
\begin{bmatrix}
1 & 2 & 3 & 4 & 1025 & 1026 & 1027 & 1028 & \dots \\
5 & 6 & 7 & 8 & 1029 & 1030 & 1031 & 1032 & \dots \\
\vdots & & & & & & & & \\
1021 & 1022 & 1023 & 1024 & 2045 & 2046 & 2047 & 2048 & \dots
\end{bmatrix}. \tag{2.6}
\]

Now the element in the first row and the second column of the matrix $P^T M P$ is the 2nd entry. We call this vectorization method "$l$-length block vectorization". The motivation comes from the fact that the matrix $A$ in (1.2) is essentially a spatially local operator, which operates on spatially close pixels of the HR image $f$. The traditional column-wise ordering would leave pixels in the next column of the image $n$ entries away. The $l$-length block ordering maintains more spatial information by leaving proximate pixels nearby in the vectorized $f$. The result is a more compact structure for spatially local operators like $A$.

The left permutation matrix $Q$ is the product $P \hat{Q}$, where $\hat{Q}$ maps element $k$ of the $i$th vectorized LR image $g_i$ to element $(k-1)l^2 + i$ of the stacked $g$ in (1.2), for $k = 1, \dots, (n/l)^2$. Intuitively, $\hat{Q}$ performs a perfect shuffle [7] on the $l^2$ blocks of size $(n/l)^2 \times n$. It is not necessary to explicitly construct and store the matrix $P$ in the computations, as the vectors $Pf$ and $Pg$ can be constructed from block-wise reorderings.

In the process of proving the theorem above, we note that $A$ can be simplified even further by putting constraints on $\delta_x$ or $\delta_y$.

Corollary 2.2. The following conditions hold:
1. If all $\delta_{x_i} \ge 0$, $Q^T A P$ is an upper bidiagonal block Toeplitz matrix.
2. If all $\delta_{x_i} \le 0$, $Q^T A P$ is a lower bidiagonal block Toeplitz matrix.
3. If all $\delta_{y_i} \ge 0$, $A_i$ is an upper bidiagonal block Toeplitz matrix.
4. If all $\delta_{y_i} \le 0$, $A_i$ is a lower bidiagonal block Toeplitz matrix.

It is always possible to satisfy condition 1 or 3 of Corollary 2.2 by choosing the leftmost or uppermost LR image as the reference. It is not possible in general to satisfy both constraints unless the imaging system is designed such that the leftmost and uppermost images are the same.

In general, if the matrix $B$ is a tridiagonal block Toeplitz matrix, then $B^T B$ will be a pentadiagonal block Toeplitz matrix with a rank-2 correction. However, as the next result states, the outermost blocks in the pentadiagonal $A^T A$ are identically zero and the correction is not necessary. This makes it easier to find a direct solution to the normal equations (2.2). We formalize this notion in the following lemma and theorem.

Lemma 2.3. $A_1^T A_{-1} = 0$ and $(A_1^{(i)})^T A_{-1}^{(j)} = 0$, where $i, j = -1, 0, 1$.

Theorem 2.4. Matrix $P^T A^T A P$ has the same structure as matrix $A$ in Theorem 2.1.
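The permutations themselves never need to be formed explicitly; as noted above, $Pf$ can be obtained by a block-wise reordering of the pixels. A minimal NumPy sketch of the $l$-length block vectorization of (2.6) is given below. It is our own illustration: the function names and the reshape-based approach are assumptions, not notation from the paper.

```python
import numpy as np

def block_vectorize(F, l):
    """Vectorize an n x n image F in the l-length block order of (2.6):
    columns are taken in groups of l, each group is traversed row by row,
    and the l pixels of each row segment stay contiguous."""
    n = F.shape[0]
    return F.reshape(n, n // l, l).transpose(1, 0, 2).ravel()

def block_unvectorize(v, n, l):
    """Inverse of block_vectorize."""
    return v.reshape(n // l, n, l).transpose(1, 0, 2).reshape(n, n)

# Small check of the ordering on an 8 x 8 image with l = 2.
n, l = 8, 2
F = np.arange(n * n).reshape(n, n)          # F[i, j] identifies pixel (i, j)
v = block_vectorize(F, l)
assert np.array_equal(block_unvectorize(v, n, l), F)
print(v[:6])   # pixels (0,0), (0,1), (1,0), (1,1), (2,0), (2,1)
```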
The two-level tridiagonal block Toeplitz structure in $Q^T A P$ and $P^T A^T A P$ introduces algorithmic efficiency by reducing the storage requirement to only nine $l^2 \times l^2$ matrices. In practice, Corollary 2.2 makes it possible to store the permuted system matrix in six $l^2 \times l^2$ matrices. We also see a simplification of the matrix-vector products used in Conjugate Gradient Least Squares (CGLS) [11] and other iterative algorithms for sparse matrices. Similarly, the permuted normal equations satisfy a tridiagonal block structure that allows an efficient solution to (2.2).

In general, $H_i$ can have many nonzero entries and the matrix $A$ can suffer from a more complicated structure. However, in many applications the nonzero elements of the PSF are concentrated in a small circle around the center. By posing a moderate limit on the diameter containing nonzero entries, it is possible to retain many of the patterns previously introduced.

Theorem 2.5. Let $A = (D H_1 S_1; \dots; D H_{l^2} S_{l^2})$, where $H_i$ represents a PSF of diameter less than or equal to $2l + 1$. Then $Q^T A P$ has a two level penta-diagonal block Toeplitz structure represented as

\[
Q^T A P = \begin{bmatrix}
B_0 & A_1 & A_2 & & & \\
A_{-1} & A_0 & A_1 & A_2 & & \\
A_{-2} & A_{-1} & A_0 & A_1 & \ddots & \\
 & A_{-2} & A_{-1} & A_0 & \ddots & A_2 \\
 & & \ddots & \ddots & \ddots & A_1 \\
 & & & A_{-2} & A_{-1} & A_0
\end{bmatrix}, \tag{2.7}
\]

with

\[
A_i = \begin{bmatrix}
B_0^{(i)} & A_1^{(i)} & A_2^{(i)} & & & \\
A_{-1}^{(i)} & A_0^{(i)} & A_1^{(i)} & A_2^{(i)} & & \\
A_{-2}^{(i)} & A_{-1}^{(i)} & A_0^{(i)} & A_1^{(i)} & \ddots & \\
 & A_{-2}^{(i)} & A_{-1}^{(i)} & A_0^{(i)} & \ddots & A_2^{(i)} \\
 & & \ddots & \ddots & \ddots & A_1^{(i)} \\
 & & & A_{-2}^{(i)} & A_{-1}^{(i)} & A_0^{(i)}
\end{bmatrix}, \tag{2.8}
\]

where $A_i \in \mathbb{R}^{nl \times nl}$ and $A_j^{(i)} \in \mathbb{R}^{l^2 \times l^2}$ for $i, j \in \{-2, -1, 0, 1, 2\}$.

As in the case without blurring, the matrix $P^T A^T A P$ has the same structure as $Q^T A P$.

Theorem 2.6. With $A$ defined as in Theorem 2.5, $P^T A^T A P$ has the same structure as $Q^T A P$.

In some cases, it is advantageous to consider the structure of the sub-matrices $A_j^{(i)}$. To that end, we have the following theorem.

Theorem 2.7. Under the hypotheses of Theorem 2.1, the following conditions summarize the tertiary (third level) structure of the permuted system matrix $Q^T A P$:
1. If $\{\delta_x\} = \{\delta_y\} = \{0, 1/z, \dots, (l-1)/z\}$ for some integer $z \in [1, \infty)$, sorted in $y$ then $x$, then $A_j^{(i)}$ is block Hankel with $l \times l$ Hankel blocks. When $z = 1$, all nonzero blocks have constant value $1/l^2$. Furthermore, if $j = 0$ then all blocks $H_{\bar{i}\bar{j}}$ such that $\bar{i} < \bar{j}$ are nonzero, and if $j = -1$ then all blocks with $\bar{i} \ge \bar{j}$ are nonzero, where $\bar{i}, \bar{j} = 1, \dots, l$.
2. For all $\delta_x, \delta_y$, the following sum holds for the $A_j^{(i)}$:
\[ \sum_{i=-1}^{1} \sum_{j=-1}^{1} A_j^{(i)} = \left(\tfrac{1}{l^2}\right)_{l^2 \times l^2}, \tag{2.9} \]
where the right hand side is a constant matrix with each entry equal to $1/l^2$.

At this point, the structures of $Q^T A P$ and $P^T A^T A P$ are sufficiently simplified to suggest efficient structured matrix algorithms to solve (2.2). In the next section, to avoid treating $B_0$ as a different diagonal block matrix, we assume a periodic boundary condition to effectively change it to $A_0$.

3. Algorithms. In this section we first present an algorithm to solve the normal equations (2.2) with a chosen $\alpha$ that takes advantage of the specific matrix structure introduced in the last section. The algorithm presented is a Block Gauss-Seidel approach with an inner Cyclic Reduction (no blurring) or Gauss-Seidel (blurring present) iteration. The algorithm is first presented for the matrix in Theorem 2.1, then extended to the matrix in Theorem 2.5.

3.1. Without blurring. The Cyclic Reduction (CR) method [3, 8] is a direct method for solving a linear system in which the matrix has a tridiagonal block Toeplitz structure. After the first CR step, i.e. the even-odd permutation of both block rows and block columns, the matrix $P^T A^T A P$ in (2.3) becomes
\[
P^T A^T A P \;\Rightarrow\;
\left[\begin{array}{ccc|ccc}
A_0 & & & A_1 & & \\
 & \ddots & & A_{-1} & \ddots & \\
 & & A_0 & & A_{-1} & A_1 \\ \hline
A_{-1} & A_1 & & A_0 & & \\
 & \ddots & \ddots & & \ddots & \\
 & & A_{-1} & & & A_0
\end{array}\right]. \tag{3.1}
\]

We could follow the CR steps by inverting $A_0$ and multiplying it with $A_1$ and $A_{-1}$. However, given that the size of $A_i$ is $nl \times nl$, this is still computationally intensive. Notice that $A_0$ has the same first order structure as $A$, but with each block of smaller size $l^2 \times l^2$. Thus, we can use CR to solve a subproblem $A_0 x = b$. To set up the $n/2l$ subproblems, we introduce an outer block Gauss-Seidel iteration. That is, we first break $f$ into $n/l$ segments $f_i \in \mathbb{R}^{nl}$, and at step $k$ we use $f_i^{(k)}$, $i = n/2l + 1, \dots, n/l$, to solve for each $f_i^{(k+1)}$, $i = 1, \dots, n/2l$. The updating formula for the first half of $f$ at step $k$ is

\[
\begin{aligned}
A_0 f_1^{(k+1)} &= g_1 - A_1 f_{n/2l+1}^{(k)}, \\
A_0 f_i^{(k+1)} &= g_i - A_{-1} f_{i+n/2l-1}^{(k)} - A_1 f_{i+n/2l}^{(k)}, \quad i = 2, \dots, n/2l.
\end{aligned} \tag{3.2}
\]

The tridiagonal block Toeplitz structure within $A_1$ and $A_{-1}$ allows us to perform matrix-vector multiplications on the $l^2 \times l^2$ blocks. Each $f_i^{(k+1)}$ can be solved for independently using CR, since $A_0$ is a tridiagonal block Toeplitz matrix. Next we use the updated first half of $f^{(k+1)}$ to solve for the second half of $f^{(k+1)}$ using CR. The updating formula is

\[
\begin{aligned}
A_0 f_i^{(k+1)} &= g_i - A_1 f_{i-n/2l+1}^{(k+1)} - A_{-1} f_{i-n/2l}^{(k+1)}, \quad i = n/2l + 1, \dots, n/l - 1, \\
A_0 f_{n/l}^{(k+1)} &= g_{n/l} - A_{-1} f_{n/2l}^{(k+1)}.
\end{aligned} \tag{3.3}
\]

For simplicity, we assume that the size of $A_0$ is a power of 2; however, there is an extension to other sizes that adds a slight computational penalty (see [8, Sec. 4.5.4]).

One important feature of this approach is extensive parallelism. Each step reduces to $n/2l$ smaller scale subproblems that can be solved independently by $n/2l$ processors, which greatly enhances throughput. Furthermore, implementation of superresolution on systems like FPGAs is not only feasible but desirable due to the FPGA's strong performance in parallel applications. Note as well that the algorithm only requires matrix-vector multiplication and that all multiplications are on the scale of $l^2$. As a result, implementation is vastly simplified.

It is well known that the Gauss-Seidel iteration is absolutely convergent on $Sx = y$ provided the matrix $S$ is symmetric and positive definite (see the proof in [8, Sec. 10.1]). The regularized normal equations matrix $A^T A + \alpha I$ meets these criteria, and the proof can easily be extended to the block Gauss-Seidel method.

Theorem 3.1. The Block Gauss-Seidel algorithm described above converges for (2.2) from any $f^{(0)}$.
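A minimal sketch of one outer sweep of (3.2)-(3.3) is given below. It assumes the permuted blocks have already been assembled into three sparse $nl \times nl$ matrices $A_{-1}$, $A_0$, $A_1$; for brevity, a precomputed sparse LU factorization stands in for the cyclic reduction solve of $A_0$, and all names are ours rather than the paper's.

```python
from scipy.sparse.linalg import splu

def bgs_sweep(A_m1, A_0, A_1, g, f, n, l):
    """One outer block Gauss-Seidel sweep of (3.2)-(3.3).

    A_m1, A_0, A_1 : nl x nl blocks of the even-odd permuted normal equations
                     (scipy sparse matrices); A_0 would be solved by cyclic
                     reduction on hardware -- an LU factorization stands in here.
    g, f           : right-hand side of the permuted normal equations and the
                     current iterate, each a list of n//l segments of length n*l.
    """
    half = n // (2 * l)
    solve = splu(A_0.tocsc()).solve        # stand-in for the CR solver

    # First half, equation (3.2): uses the old second-half segments.
    f[0] = solve(g[0] - A_1 @ f[half])
    for i in range(1, half):
        f[i] = solve(g[i] - A_m1 @ f[i + half - 1] - A_1 @ f[i + half])

    # Second half, equation (3.3): uses the freshly updated first half.
    for i in range(half, n // l - 1):
        f[i] = solve(g[i] - A_1 @ f[i - half + 1] - A_m1 @ f[i - half])
    f[-1] = solve(g[-1] - A_m1 @ f[half - 1])
    return f
```

In the parallel setting described above, each of the `solve` calls within a half-sweep is independent and can be assigned to its own processing element.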
3.2. With blurring. The main difference when the blurring matrix is included is that the first and second level structures are penta-diagonal rather than tridiagonal, forcing us to abandon Cyclic Reduction for the inner problem. However, one can utilize Gauss-Seidel on both the outer and inner iterations. We first group the block rows and columns of $P^T A^T A P$ into the sets $\{3k+1\}$, $\{3k+2\}$ and $\{3k+3\}$ for $k = 0, \dots, n/3l - 1$, as shown in (3.4):

\[
P^T A^T A P \;\Rightarrow\;
\begin{bmatrix}
\hat{A}_0 & \hat{A}_1 & \hat{A}_2 \\
\hat{A}_{-1} & \hat{A}_0 & \hat{A}_1 \\
\hat{A}_{-2} & \hat{A}_{-1} & \hat{A}_0
\end{bmatrix}, \tag{3.4}
\]

where, for example, $\hat{A}_0$ is block diagonal with diagonal blocks $A_0$, while $\hat{A}_1$ has $A_1$ on its block diagonal and $A_{-2}$ on its block subdiagonal, and similarly for the remaining $\hat{A}_j$. There are $n/3l \times n/3l$ inner blocks in each of the $3 \times 3$ outer blocks. For convenience, we assume $n$ is divisible by $3l$. Otherwise, it is always possible to add an extra one or two zero rows and columns to the LR images to make $n$ divisible by $3l$. In (3.4), each $A_i$ is penta-diagonal block Toeplitz, which we can permute in the same way to create a second level $3 \times 3$ block form identical to (3.4).

The block Gauss-Seidel iterations occur at two levels corresponding to the two level matrix structure. We first group the entries of $f$ into the sets $\{3k+1\}$, $\{3k+2\}$ and $\{3k+3\}$, denoted $f_1$, $f_2$ and $f_3$, which are updated iteratively at the first level as

\[
\begin{aligned}
\hat{A}_0 f_1^{(k+1)} &= g_1 - \hat{A}_1 f_2^{(k)} - \hat{A}_2 f_3^{(k)}, \\
\hat{A}_0 f_2^{(k+1)} &= g_2 - \hat{A}_{-1} f_1^{(k+1)} - \hat{A}_1 f_3^{(k)}, \\
\hat{A}_0 f_3^{(k+1)} &= g_3 - \hat{A}_{-2} f_1^{(k+1)} - \hat{A}_{-1} f_2^{(k+1)}.
\end{aligned} \tag{3.5}
\]

For each sub-problem in (3.5), we use the same update rules on the second level block matrices, which are structurally identical. Each sub-problem requires a matrix inversion on the order of $l^2 \times l^2$, which can be performed once and stored. Absolute convergence is provable for each of the two levels of Gauss-Seidel iterations in (3.5), leading to a proof of absolute convergence for the combined two-level iteration.

Theorem 3.2. Using the algorithm described above for Problem (2.2), the iteration converges from any $f^{(0)}$.

4. Numerical Experiments. The algorithms in the last section were applied to both a simulated satellite image and real images, taken by a lenslet array camera, of an Air Force resolution target [10]. The algorithm in Section 3.1 was also implemented on an FPGA development board [4]. In the next three subsections, we present details of all three experiments.

Fig. 4.1. (a) Original satellite image. (b) Low resolution satellite image.

4.1. Simulated images. One original HR satellite image [11] of size 256 × 256, shown in Figure 4.1(a), is downsampled, interpolated using known $(\hat{\delta}_x, \hat{\delta}_y)$ and degraded with additive Gaussian noise to create 16 LR images of size 64 × 64, one of which appears in Figure 4.1(b). A sub-pixel registration algorithm [17] was then applied to the 16 LR images to estimate a set of $(\delta_x, \delta_y)$. The LR images combined with the estimated offsets are used to reconstruct a 256 × 256 HR image.

We compare our algorithm with the CGLS method [9] in light of the general popularity of CG for solving sparse systems. A naïve CG implementation requires the construction of 16 shift matrices $S_i$ of size 65536 × 65536 plus a decimation matrix $D$ of size 4096 × 65536. Matrix-vector products occur with the entire 65536 × 1 vector. Figures 4.2(a) and 4.2(b) show the results of CG and the Block Gauss-Seidel with Cyclic Reduction (BGS-CR). It is clear that the results are nearly identical. The relative difference in Frobenius norm between the two results is .0218, and the mean square errors when compared to the true image are .0119 for CG and .0118 for BGS-CR. However, the BGS-CR algorithm takes 3.2 seconds on a 3.0 GHz Pentium IV processor whereas CG takes 8.7 seconds. Both algorithms stopped after 5 iterations when no significant improvement was observed, i.e. when the mean difference between iterations was less than $10^{-4}$. Much of the work in the CG algorithm goes into fully constructing the large matrices $D$ and $S_i$ on the scale of $n^2$, while our algorithm only needs to construct the small inner blocks of $D$ and $S_i$ on the scale of $l^2$.

Using the matrix structures presented in this paper to avoid explicit construction of the system matrix leads to a much faster CG implementation. In fact, the results in Section 2 can be used to create a matrix-vector multiplication function for use in any reconstruction method that only requires a matrix-vector multiplication. Such a function will have the advantage of reduced memory and computational complexity.
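As one illustration, a minimal sketch of such a matrix-vector product for the no-blur structure of Theorem 2.1 is given below. It assumes the nine $l^2 \times l^2$ blocks $A_k^{(j)}$ are supplied in nested dictionaries keyed by $j, k \in \{-1, 0, 1\}$ and that the circulant boundary condition of Section 2 holds (so $B_0 = A_0$); the function names and the data layout are our own, not the paper's.

```python
import numpy as np

def tbt_apply(blocks, segs):
    """Apply a tridiagonal block Toeplitz matrix, given as a dict
    {-1: sub, 0: diagonal, 1: super} of equal-size blocks, to a list of
    segments; returns the list of output segments."""
    m = len(segs)
    out = []
    for i in range(m):
        y = blocks[0] @ segs[i]
        if i > 0:
            y = y + blocks[-1] @ segs[i - 1]
        if i < m - 1:
            y = y + blocks[1] @ segs[i + 1]
        out.append(y)
    return out

def sr_matvec(inner_blocks, x, n, l):
    """y = (Q^T A P) x using only l^2 x l^2 blocks.

    inner_blocks[j][k] is the block A_k^{(j)} for j, k in {-1, 0, 1};
    x is the permuted HR vector of length n^2."""
    m = n // l
    outer = x.reshape(m, m, l * l)          # outer segment, inner segment, pixel
    y = np.zeros((m, m, l * l))
    for j in (-1, 0, 1):                    # outer block offset j
        contrib = [tbt_apply(inner_blocks[j], list(seg)) for seg in outer]
        for i in range(m):
            if 0 <= i + j < m:
                y[i] += np.stack(contrib[i + j])
    return y.reshape(-1)
```

A function of this form can be handed directly to a matrix-free CGLS or LSQR routine, which is the use suggested in the paragraph above.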
Fig. 4.2. (a) Conjugate gradient method using the tridiagonal block Toeplitz structure, α = .1 and MSE = .0118. (b) Block Gauss-Seidel coupled with Cyclic Reduction method, α = .1 and MSE = .0119.

The same test was performed with the addition of a Gaussian blur to a noisy HR image to create the blurred and noisy HR image seen in Figure 4.3(a) before downsampling and interpolating. Reconstruction was performed using the two level Block Gauss-Seidel (BGS) algorithm described in Section 3.2. Figure 4.3(b) shows the recovered image, which needed a smaller regularization parameter because the smoothing effect of the blur operator removed some of the original noise in the noisy HR image. Once again, the results were comparable with the CG method, but BGS is much faster. The CG method without using the block structure took 24.5 seconds, while the two level BGS algorithm took 15 seconds. We see a reduced speedup factor here because of the sequential processing of many more small matrix-vector multiplications on the scale of $l^2$. An even more dramatic improvement can be expected if the many small products are performed in parallel on a platform such as an FPGA.

Fig. 4.3. (a) Blurred and noisy image. (b) Recovered image, α = .03.

4.2. Real images from a lenslet array camera. A lenslet array camera [10] was used to capture images of an Air Force resolution target in a lab setting. A 10 mega-pixel raw image is segmented into 16 subimages of size 128 × 128, which are then registered with the subpixel registration algorithm [17] and used to reconstruct a 512 × 512 HR image. Figure 4.4 compares the resolution of the reconstructed HR image (left) with that of a blown-up LR image. Clearly, we observe resolution enhancement.

4.3. FPGA implementation. Last, the algorithm in Section 3.1 was ported to a Xilinx Virtex 5 SX50T development board, with 52K logic cells, 594 KB of BlockRAM, 288 DSPs and 256 MB of DDR2 SDRAM. A custom 32-bit pipeline was designed, based on matrix-vector multiplication. A raw image from a lenslet array camera was segmented into 16 LR images of size 128 × 128, which were then registered with the subpixel registration algorithm [17] and used to reconstruct a 512 × 512 HR image. The LR images were read into the onboard memory through an Ethernet port, and the reconstructed HR image was retrieved through a USB port and displayed on a 7 inch LCD. The current maximum processing capability is 2 frames per second (fps).

Fig. 4.4. Resolution enhancement after the reconstruction using real images taken by a lenslet array camera.

Without the memory interface bottleneck, the processing capability can move up to 5 fps. The system is highly scalable, because the core co-processing element is an $l^2$ scale matrix-vector multiplication (MVM) unit, which can be easily replicated for larger SR problems and instantiated on larger FPGAs. Thus the speedup should be near linear until memory bandwidth is exhausted. The current system has only one MVM core, while we project that a scaled system on a Virtex 5 LX330T could hold 6 MVM cores, giving an estimated performance of 30 fps. This translates into a speedup factor of approximately 383 when compared to a desktop computer running the same algorithm in MATLAB, and a speedup factor of 1,043 when compared to a desktop computer running the CGLS algorithm. However, regarding implementing the CGLS algorithm on an FPGA, it is important to note that it is not currently possible to store the system matrix $A$ in onboard memory, even in a sparse format. Furthermore, very few tools for large scale matrix-vector products exist for FPGAs [4, 13].
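A rough count makes the on-board memory point concrete. The figures below are our own back-of-envelope estimate, assuming the 512 × 512, $l = 4$ configuration above, 32-bit values, and roughly $(l+1)^2$ nonzeros per row of the no-blur system matrix; none of these storage-format choices come from the paper.

```python
# Rough on-chip storage comparison (illustrative estimate only).
n, l, bytes_per_value = 512, 4, 4            # 32-bit pipeline

# Nine l^2 x l^2 blocks used by the structured algorithm:
structured = 9 * l**4 * bytes_per_value
print(f"nine l^2 x l^2 blocks: {structured / 2**10:.1f} KiB")   # ~9 KiB

# Sparse system matrix A (n^2 x n^2), ~(l+1)^2 nonzeros per row,
# stored as value + 32-bit column index pairs:
nnz = n * n * (l + 1) ** 2
sparse_A = nnz * (bytes_per_value + 4)
print(f"sparse A: {sparse_A / 2**20:.1f} MiB vs. 594 KB of BlockRAM")  # ~50 MiB
```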
5. Conclusions. Traditional models of superresolution lead to large, sparse matrices that are difficult to implement with on-board electronics. This paper presents algorithms that make use of the structure of the sparse matrices to replace large scale matrix-vector products with small, parallelizable products. The proposed algorithms introduce no degradation in quality relative to current reconstruction methods, yet they offer much faster reconstruction with limited memory requirements. As a result, a problem once thought difficult to implement with special purpose digital computers can be made to take full advantage of the small scale, highly parallel capabilities of such systems. In addition, the structural results are suitable for constructing matrix-vector products for use in any algorithm in which they are required.

Acknowledgments. The authors thank Dr. James Nagy, Dr. Paúl Pauca, Dr. Sudhakar Prasad, Dr. Todd Torgersen and other researchers in the PERIODIC research group for their critiques [10]. The research described in this paper was supported in part by the Intelligence Advanced Research Projects Agency (IARPA) through the Defense Microelectronics Activity (DMEA) under cooperative agreement number H94003-08-2-0802, and by the Air Force Office of Scientific Research (AFOSR) under award number FA9550-08-1-0151.

REFERENCES

[1] R. Barnard, P. Pauca, T. Torgersen, R. Plemmons, S. Prasad, J. van der Gracht, J. Nagy, J. Chung, G. Behrmann, S. Matthews, and M. Mirotznik, High-resolution iris image reconstruction from low-resolution imagery, in Proc. SPIE, Advanced Signal Processing Algorithms, Architectures, and Implementations, vol. 6313, San Diego, CA, Aug. 2006, pp. 63130D1–63130D13.
[2] M. Bertero and P. Boccacci, Introduction to Inverse Problems in Imaging, Institute of Physics Publishing, 1998.
[3] D. Bini and B. Meini, Solving block banded block Toeplitz systems with structured blocks: new algorithms and open problems, Large-Scale Scientific Computations of Engineering and Environmental Problems II, Notes on Numerical Fluid Mechanics, 13 (2000), pp. 15–24.
[4] S.D. Brown, R.J. Francis, J. Rose, and Z.G. Vranesic, Field-Programmable Gate Arrays, Springer, 1992.
[5] J. Chung, E. Haber, and J. Nagy, Numerical methods for coupled super-resolution, Inverse Problems, 22 (2006), pp. 1261–1272.
[6] S. Farsiu, D. Robinson, M. Elad, and P. Milanfar, Advances and challenges in superresolution, International Journal of Imaging Systems and Technology, 14 (2004), pp. 47–57.
[7] S.W. Golomb, Permutations by cutting and shuffling, SIAM Review, 3 (1961), pp. 293–297.
[8] G.H. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, 3rd ed., 1996.
[9] M.R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, Journal of Research of the National Bureau of Standards, 49 (1952), pp. 409–436.
[10] M. Mirotznik, S. Mathews, R. Plemmons, P. Pauca, T. Torgersen, R. Barnard, T. Guy, B. Gray, Q. Zhang, J. van der Gracht, C. Petersen, M. Bodnar, and S. Prasad, A practical enhanced-resolution integrated optical-digital imaging camera (PERIODIC), in Proc. SPIE, vol. 7348, Orlando, FL, April 2009, Conference on Defense, Security and Sensing.
[11] J. Nagy, R.J. Plemmons, and T.C. Torgersen, Iterative image restoration using approximate inverse preconditioning, IEEE Transactions on Image Processing, 5 (1996), pp. 1151–1162.
[12] N. Nguyen, P. Milanfar, and G. Golub, A computationally efficient image superresolution algorithm, IEEE Transactions on Image Processing, 10 (2001), pp. 573–583.
[13] F.E. Ortiz, E.J. Kelmelis, J.P. Durbano, and D.W. Prather, FPGA acceleration of superresolution algorithms for embedded processing in millimeter-wave sensors, in Proc. SPIE, vol. 6548, May 2007, pp. 65480K.
[14] S.C. Park, M.K. Park, and M.G. Kang, Super-resolution image reconstruction: A technical overview, IEEE Signal Processing Magazine, 20 (2003), pp. 21–36.
[15] D. Reddy, Z. Yue, and P. Topiwala, An efficient real time superresolution ASIC system, in Proc. SPIE, vol. 6957, 2008, pp. 695709.
[16] R.S. Wagner, D.E. Waagen, and M.L. Cassabaum, Image super-resolution for improved automatic target recognition, in Proc. SPIE, vol. 5426, Orlando, FL, April 2004, pp. 188–196.
[17] Q. Zhang, Analytical approximations of translational subpixel shifts in signal and image registrations, in Proc. SPIE, vol. 7074, San Diego, CA, August 2008, pp. 70740E1–70740E7.

Appendix A. Proofs. The following section contains proofs of Theorems 2.1 to 2.7. The proofs revolve around the definitions of the three component matrices $D$, $S_i$ and $H_i$, plus the permutation matrices $P$ and $Q$. The structures of the matrices in question are revealed through the row and column indices of their nonzero entries. In particular, if $A$ is a matrix such that all nonzero entries on the $i$th row are within the range $[i - l^2, i + l^2]$, then $A$ is a tridiagonal block matrix with blocks of size $l^2 \times l^2$.

A.1. Decimation matrix. We start with the decimation matrix, which can also be regarded as a local mean matrix.

Definition A.1. We define a decimation operation $D \in \mathbb{R}^{n^2/l^2 \times n^2}$ on a vectorized image $f \in \mathbb{R}^{n^2 \times 1}$ as $g = Df$, such that the entries of $D$ are determined by the following averaging equation,

\[ g_{ij} = \sum_{\bar{i}=0}^{l-1} \sum_{\bar{j}=0}^{l-1} f_{i-\bar{i},\, j-\bar{j}} \,/\, l^2, \tag{A.1} \]

where $i, j$ are the row and column indices of the decimated image. The vectorization follows the typical column ordering. As an example, the $(1,1)$ pixel of $g$ is an average of the pixels in the square from $(1,1)$ to $(l,l)$ in $f$. The structure of $D$ is given below.

Proposition A.2. The matrix $D$ defined above is block diagonal,

\[
D = \begin{bmatrix}
D_1 & & & \\
 & D_1 & & \\
 & & \ddots & \\
 & & & D_1
\end{bmatrix}_{n^2/l^2 \times n^2}, \tag{A.2}
\]

where $D$ has $n/l \times n/l$ blocks and $D_1 = [D_2\; D_2\; \dots\; D_2] \in \mathbb{R}^{n/l \times nl}$, which also has a diagonal structure,

\[
D_2 = \begin{bmatrix}
v & & & \\
 & v & & \\
 & & \ddots & \\
 & & & v
\end{bmatrix}_{n/l \times n}, \tag{A.3}
\]

where $D_2$ also has $n/l \times n/l$ blocks and $v = (1/l^2, 1/l^2, \dots, 1/l^2) \in \mathbb{R}^{1 \times l}$ is a constant row vector.

Proof. The structure follows from the definition of a traditional matrix vectorization. Furthermore, all values are equal to $1/l^2$ because $D$ takes an unweighted average.

The nonzero entries in each row of the matrix $D$ are separated by a distance of $n$ entries, so $d_i = (0, \dots, 0, v, 0, \dots, 0, v, 0, \dots, v, 0, \dots, 0)$, where $d_i$ is the $i$th row of $D$. However, we can group these entries into one continuous block of size $1 \times l^2$ by moving the nonzero columns together through a permutation matrix $P$. The permuted $D$ is a diagonal block Toeplitz matrix.

Definition A.3. Define the permutation matrix $P = (p_i)$ by the row vectors $p_i = e_{nl i_1 + n i_2 + i_3}$, where $i = 1, \dots, n^2$, $i_1 = \lfloor (i-1)/nl \rfloor$, $\tilde{i} = (i-1) \bmod nl$, $i_2 = \tilde{i} \bmod l$ and $i_3 = \lceil (\tilde{i}+1)/l \rceil$. Here $e_j \in \mathbb{R}^{1 \times n^2}$ is the identity vector with entry 1 at position $j$ and zeros elsewhere.

While it is not used until later, we include the definition of the permutation matrix $Q$ here for convenience.
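In practice the decimation never needs to be formed as a matrix. A minimal NumPy sketch of the block averaging in (A.1), operating directly on the image rather than on the vectorized $f$ (our own illustration, not code from the paper), is:

```python
import numpy as np

def decimate(F, l):
    """Average each non-overlapping l x l block of the n x n HR image F,
    producing the (n/l) x (n/l) LR image described in Definition A.1."""
    n = F.shape[0]
    return F.reshape(n // l, l, n // l, l).mean(axis=(1, 3))

F = np.arange(64.0).reshape(8, 8)
print(decimate(F, 2))   # each entry is the mean of one 2 x 2 block of F
```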
Definition A.4. Define the permutation matrix $\hat{Q}$ such that column $i$ is the unit vector $e_{\lfloor (i-1)/(n^2/l^2) \rfloor + l^2\left((i-1) \bmod n^2/l^2\right)}$. The permutation matrix $Q$ is then defined by the product $Q = P\hat{Q}$.

Proposition A.5. Under the permutation matrix $P$ defined above, the matrix $D$ becomes a block diagonal matrix,

\[
P^T D P = \begin{bmatrix}
\hat{v} & & & \\
 & \hat{v} & & \\
 & & \ddots & \\
 & & & \hat{v}
\end{bmatrix}_{n^2/l^2 \times n^2}, \tag{A.4}
\]

where $\hat{v} = (1/l^2, 1/l^2, \dots, 1/l^2) \in \mathbb{R}^{1 \times l^2}$.

Proof. The proof is by comparison between the entries in (2.5) and the corresponding entries in (2.6).

Proposition A.6. The matrix $(DP)^T DP = P^T D^T D P$ is also a block diagonal matrix, while $DP(DP)^T$ is a diagonal matrix, given by

\[
(DP)^T DP = \begin{bmatrix}
\hat{D}_2 & & & \\
 & \hat{D}_2 & & \\
 & & \ddots & \\
 & & & \hat{D}_2
\end{bmatrix}_{n^2 \times n^2}, \tag{A.5}
\]

where $\hat{D}_2 = \hat{v}^T \hat{v} \in \mathbb{R}^{l^2 \times l^2}$ is the constant matrix with every entry equal to $1/l^4$.

Proof. Note that $P^T D^T D P = P^T D^T P\, P^T D P = (P^T D P)^T P^T D P$. Products involving non-zero elements only occur on the block diagonal, and thus $\hat{v}^T \hat{v}$ is the diagonal block of $(DP)^T DP$.

A.2. Shift matrix. The support of the shift matrix $S_i$ in (1.1) depends on the 2D translational shifts. To avoid a notation conflict in using the index $i$, throughout this section we use $S$ to represent $S_i$ in (1.1) for any $i = 1, \dots, l^2$, except in the proofs of Theorem 2.1, Corollary 2.2 and Theorem 2.4.

Definition A.7. We define a shift operation $S \in \mathbb{R}^{n^2 \times n^2}$ on a vectorized image $f \in \mathbb{R}^{n^2 \times 1}$ as $\hat{f} = Sf$, such that the entries of $S$ are determined by a 2D translational shift $(\delta_x, \delta_y)$, and the relationship between $\hat{f}_{ij}$ and $f_{ij}$ is given by

\[
\hat{f}_{ij} = w_{11} f_{i+\hat{\delta}_y,\, j+\hat{\delta}_x} + w_{12} f_{i+\hat{\delta}_y+1,\, j+\hat{\delta}_x} + w_{21} f_{i+\hat{\delta}_y,\, j+\hat{\delta}_x+1} + w_{22} f_{i+\hat{\delta}_y+1,\, j+\hat{\delta}_x+1}, \tag{A.6}
\]

where $\hat{\delta}_y = \lfloor \delta_y \rfloor$ and $\hat{\delta}_x = \lfloor \delta_x \rfloor$, the weights $w_{11}$ through $w_{22}$ are $(1-\tilde{\delta}_x)(1-\tilde{\delta}_y)$, $(1-\tilde{\delta}_y)\tilde{\delta}_x$, $(1-\tilde{\delta}_x)\tilde{\delta}_y$ and $\tilde{\delta}_x\tilde{\delta}_y$ respectively, and $\tilde{\delta}_x = \delta_x - \hat{\delta}_x$, $\tilde{\delta}_y = \delta_y - \hat{\delta}_y$.

Figure A.1 shows the grid of a 10 × 10 original image (circles) and the shifted image (squares), where $(\delta_x, \delta_y) = (1.5, -2.5)$. Here we use the image intensity values at the circles to linearly interpolate the intensity values at the square points.

Fig. A.1. An illustration of the shift matrix $S_i$.

It is clear that the non-zero entries of $S$ can only be one of $w_{11}$ through $w_{22}$, and that they possess a regular pattern. In fact, using the column-wise ordering in the vectorization of $f$ and $\hat{f}$, we can pinpoint the four nonzero entries on the $(i + (j-1)n)$th row as

\[
i+\hat{\delta}_y+(j+\hat{\delta}_x-1)n, \quad i+\hat{\delta}_y+1+(j+\hat{\delta}_x-1)n, \quad i+\hat{\delta}_y+(j+\hat{\delta}_x)n, \quad i+\hat{\delta}_y+1+(j+\hat{\delta}_x)n. \tag{A.7}
\]

This corresponds to the matrix structure specified in the following proposition.

Proposition A.8. Given a 2D rigid translational shift $(\delta_x, \delta_y)$, where $\delta_x \in (-l, l)$ and $\delta_y \in (-l, l)$, the shift matrix $S$ defined in Definition A.7 has a block Toeplitz form that can be determined in the following way. If $\delta_x > 0$,

\[
S = \begin{bmatrix}
0 & \cdots & 0 & S_1 & S_2 & & \\
 & 0 & \cdots & 0 & S_1 & S_2 & \\
 & & \ddots & & & \ddots & \ddots \\
 & & & 0 & \cdots & S_1 & S_2 \\
0 & & & \cdots & & & 0 \\
\vdots & & & & & & \vdots \\
0 & & & \cdots & & & 0
\end{bmatrix}_{n^2 \times n^2}, \tag{A.8}
\]

where $S_1, S_2 \in \mathbb{R}^{n \times n}$. The number of columns of zero blocks to the left is $\hat{\delta}_x$ and the number of rows of zero blocks at the bottom is $\hat{\delta}_x + 1$. If $\delta_x < 0$,

\[
S = \begin{bmatrix}
0 & & & \cdots & & & 0 \\
\vdots & & & & & & \vdots \\
S_1 & S_2 & & & & & \\
 & S_1 & S_2 & & & & \\
 & & \ddots & \ddots & & & \\
 & & & S_1 & S_2 & 0 & \cdots\; 0
\end{bmatrix}_{n^2 \times n^2}, \tag{A.9}
\]

where the number of columns of zero blocks to the right is $\hat{\delta}_x$ and the number of rows of zero blocks at the top is $\hat{\delta}_x + 1$. If $\delta_y > 0$,

\[
S_i = \begin{bmatrix}
0 & \cdots & 0 & w_{i1} & w_{i2} & & \\
 & \ddots & & & \ddots & \ddots & \\
 & & & 0 & \cdots & w_{i1} & w_{i2} \\
0 & & & \cdots & & & 0 \\
\vdots & & & & & & \vdots
\end{bmatrix}_{n \times n}, \quad i = 1, 2, \tag{A.10}
\]

where the number of zero columns to the left is $\hat{\delta}_y$ and the number of zero rows at the bottom is $\hat{\delta}_y + 1$. If $\delta_y < 0$,

\[
S_i = \begin{bmatrix}
0 & & \cdots & & 0 \\
\vdots & & & & \vdots \\
w_{i1} & w_{i2} & & & \\
 & \ddots & \ddots & & \\
 & & w_{i1} & w_{i2} & 0 \;\cdots\; 0
\end{bmatrix}_{n \times n}, \quad i = 1, 2, \tag{A.11}
\]

where the number of zero columns to the right is $\hat{\delta}_y$ and the number of zero rows at the top is $\hat{\delta}_y + 1$.

Proof. The structure described by (A.8) to (A.11) is simply the matrix representation of (A.7). The details can be verified by the interested reader.
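Like the decimation, the shift operation (A.6) can be applied without forming $S$. A small NumPy sketch follows; it is our own illustration, it pairs the bilinear weights with the neighboring pixels in the standard fashion, and it treats pixels falling outside the image as zero, a boundary choice the definition leaves unspecified.

```python
import numpy as np

def shift_image(F, dx, dy):
    """Bilinearly interpolated rigid shift in the spirit of Definition A.7:
    output pixel (i, j) samples F at (i + dy, j + dx); zero outside F."""
    n, m = F.shape
    dy0, dx0 = int(np.floor(dy)), int(np.floor(dx))
    ty, tx = dy - dy0, dx - dx0                    # fractional parts in [0, 1)

    def sample(ry, rx):
        """F shifted by integer offsets (ry, rx), zero-padded."""
        out = np.zeros_like(F, dtype=float)
        rows, cols = np.arange(n) + ry, np.arange(m) + rx
        ok_r, ok_c = (rows >= 0) & (rows < n), (cols >= 0) & (cols < m)
        out[np.ix_(ok_r, ok_c)] = F[np.ix_(rows[ok_r], cols[ok_c])]
        return out

    return ((1 - tx) * (1 - ty) * sample(dy0, dx0)
            + (1 - tx) * ty * sample(dy0 + 1, dx0)
            + tx * (1 - ty) * sample(dy0, dx0 + 1)
            + tx * ty * sample(dy0 + 1, dx0 + 1))
```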
Next we permute $S$ to gain a better structure.

Proposition A.9. The permuted matrix $P^T S P$ has a two level bidiagonal block Toeplitz structure. If $\delta_x > 0$,

\[
P^T S P = \begin{bmatrix}
\hat{S}_1 & \hat{S}_2 & & \\
 & \hat{S}_1 & \hat{S}_2 & \\
 & & \ddots & \ddots \\
 & & & \hat{S}_1
\end{bmatrix}_{n^2 \times n^2}, \tag{A.12}
\]

and if $\delta_x < 0$,

\[
P^T S P = \begin{bmatrix}
\hat{S}_1 & & & \\
\hat{S}_2 & \hat{S}_1 & & \\
 & \ddots & \ddots & \\
 & & \hat{S}_2 & \hat{S}_1
\end{bmatrix}_{n^2 \times n^2}, \tag{A.13}
\]

where $\hat{S}_i \in \mathbb{R}^{nl \times nl}$. If $\delta_y > 0$,

\[
\hat{S}_i = \begin{bmatrix}
\hat{S}_{i1} & \hat{S}_{i2} & & \\
 & \hat{S}_{i1} & \hat{S}_{i2} & \\
 & & \ddots & \ddots \\
 & & & \hat{S}_{i1}
\end{bmatrix}_{nl \times nl}, \tag{A.14}
\]

and if $\delta_y < 0$,

\[
\hat{S}_i = \begin{bmatrix}
\hat{S}_{i1} & & & \\
\hat{S}_{i2} & \hat{S}_{i1} & & \\
 & \ddots & \ddots & \\
 & & \hat{S}_{i2} & \hat{S}_{i1}
\end{bmatrix}_{nl \times nl}, \tag{A.15}
\]

where $\hat{S}_{ij} \in \mathbb{R}^{l^2 \times l^2}$.

Proof. We again rely on (A.7) to identify the nonzero entries after applying both row and column permutations using $P$. After the vectorization using the new indexing method, the entry $(i, j)$ of $f$ will be at position

\[ p_{ij} = (i-1)l + \tilde{j}\, n l + \hat{j}, \tag{A.16} \]

where $\tilde{j} = \lfloor (j-1)/l \rfloor$ and $\hat{j} = (j-1) \bmod l + 1$. The following inequalities define a two-level bidiagonal structure and make use of the restriction that $\hat{\delta}_x, \hat{\delta}_y \in [-l+1, l-1]$. The diagonal block of size $l^2 \times l^2$ is given by

\[ p_{ij} - l^2 \le p_{i+\hat{\delta}_y,\, j+\hat{\delta}_x} \le p_{ij} + l^2, \tag{A.17} \]
\[ p_{ij} - l^2 \le p_{i+\hat{\delta}_y+1,\, j+\hat{\delta}_x} \le p_{ij} + l^2, \tag{A.18} \]
\[ p_{ij} - l^2 \le p_{i+\hat{\delta}_y,\, j+\hat{\delta}_x+1} \le p_{ij} + l^2, \tag{A.19} \]
\[ p_{ij} - l^2 \le p_{i+\hat{\delta}_y+1,\, j+\hat{\delta}_x+1} \le p_{ij} + l^2. \tag{A.20} \]

The upper block diagonal when $\delta_x > 0$ is given by

\[ p_{ij} + nl - l^2 \le p_{i+\hat{\delta}_y,\, j+\hat{\delta}_x+1} \le p_{ij} + nl + l^2, \tag{A.21} \]
\[ p_{ij} + nl - l^2 \le p_{i+\hat{\delta}_y+1,\, j+\hat{\delta}_x+1} \le p_{ij} + nl + l^2, \tag{A.22} \]

while the lower block diagonal when $\delta_x < 0$ is given by

\[ p_{ij} - nl - l^2 \le p_{i+\hat{\delta}_y,\, j+\hat{\delta}_x+1} \le p_{ij} - nl + l^2, \tag{A.23} \]
\[ p_{ij} - nl - l^2 \le p_{i+\hat{\delta}_y+1,\, j+\hat{\delta}_x+1} \le p_{ij} - nl + l^2. \tag{A.24} \]

To verify the Toeplitz structure, we only need to prove that the permuted $\hat{S} = P^T S P$ satisfies $\hat{s}_{IJ} = \hat{s}_{I+l^2, J+l^2}$. It is not difficult to verify that

\[
p_{ij} + l^2 = \begin{cases} p_{i+l,\, j} & \text{if } i \le n - l, \\ p_{i+l-n,\, j+l} & \text{if } i > n - l, \end{cases} \tag{A.25, A.26}
\]

which is the definition of an $l^2 \times l^2$ block Toeplitz structure. As an example, we note that $p_{i+l+\hat{\delta}_y,\, j+\hat{\delta}_x} = p_{i+\hat{\delta}_y,\, j+\hat{\delta}_x} + l^2$ and $p_{i+l-n+\hat{\delta}_y,\, j+l+\hat{\delta}_x} = p_{i+\hat{\delta}_y,\, j+\hat{\delta}_x} + l^2$. The three remaining nonzero entries on each row can be verified in a similar manner.

Proposition A.10. The matrix $P^T(DS)P$ has a two-level bidiagonal block Toeplitz structure similar to that in (A.12) and (A.13), with second level blocks consisting of $1 \times l^2$ row vectors $\hat{v}\hat{S}_i$ instead of $\hat{S}_i$, for $i = 1, 2$.

Proof. Note that $P^T(DS)P = P^T D P\,(P^T S P)$. By Proposition A.5, $P^T D P$ has a block diagonal structure with each block of size $1 \times l^2$. By Proposition A.9, $P^T S P$ has a two level bidiagonal block Toeplitz structure with $l^2 \times l^2$ blocks. It follows that the product $P^T D S P$ is two level bidiagonal block Toeplitz with $1 \times l^2$ blocks.

A.2.1. Proof of Theorem 2.1. Each $P^T D S_i P$ is a block bidiagonal matrix with $1 \times l^2$ blocks, but it is important to note that not all of these matrices have the same non-zero diagonals. The particular non-zero diagonal depends on the signs of $\delta_x$ and $\delta_y$. However, if we stack them to form $\hat{A} = (P^T D S_1 P; \dots; P^T D S_{l^2} P)$ and then premultiply by $\hat{Q}$ to form the shuffle of each $1 \times l^2$ block, we obtain a tridiagonal block Toeplitz matrix with $l^2 \times l^2$ second level blocks. This completes the proof of Theorem 2.1.

A.2.2. Proof of Corollary 2.2. The proof of this corollary is a natural extension of the two-level bidiagonal structure of $P^T S_i P$ proven in Proposition A.9. If all $\delta_x$ or all $\delta_y$ have the same sign, then all $P^T D S_i P$ are bidiagonal with the same non-zero diagonals.
It follows that one of the three diagonals in the tridiagonal block Toeplitz matrix constructed in Section A.2.1 is identically zero.

A.2.3. Proof of Lemma 2.3. Note that $P^T(DS)^T DS P = P^T S^T P\,(P^T D^T D P)\,P^T S P$, and $P^T D^T D P = (DP)^T DP$ has a block diagonal structure with each block of size $l^2 \times l^2$ (see Proposition A.6). By Proposition A.9, $P^T S P$ has a two-level bidiagonal block Toeplitz structure, so $P^T S^T P$ also has a two-level bidiagonal block Toeplitz structure, except that the upper diagonal blocks are transposed to the lower diagonal and vice versa. Hence the product of these three matrices has a two-level tridiagonal block Toeplitz structure.

A.2.4. Proof of Theorem 2.4. Note that $A^T A = \sum_{i=1}^{l^2} (DS_i)^T DS_i$ and thus $P^T A^T A P = \sum_{i=1}^{l^2} P^T (DS_i)^T DS_i P$.

A.3. Blurring matrix. The blurring matrix we consider here is the regular block Toeplitz with Toeplitz blocks blurring matrix generated by a spatially invariant $n \times n$ PSF matrix and zero boundary conditions. Furthermore, we assume the blur is radially symmetric with a small diameter. A large diameter corresponds to more off-diagonal entries in each Toeplitz block of $H$, and thus a more complex structure. Here we impose a limit on the diameter to gain a simpler structure of $H$ while still accepting a large enough PSF for real applications. In particular, we have the following proposition. Again, to avoid the notation conflict in using the index $i$, throughout this section we use $H$ to represent $H_i$ in (1.1) for any $i = 1, \dots, l^2$, except in the proofs of Theorems 2.5 and 2.6.

Proposition A.11. If the diameter of a spatially invariant PSF is not greater than $2l + 1$, then $P^T H P$ has a two level tridiagonal block Toeplitz structure.

Proof. We can write the blurring operation in matrix form as $\hat{f} = Hf$, where $f$ and $\hat{f}$ are two vectorized images. Written explicitly, the entries of $\hat{f}$ are given by

\[ \hat{f}_{ij} = \sum_{\bar{i}=-l}^{l} \sum_{\bar{j}=-l}^{l} h_{\bar{i}\bar{j}}\, f_{i-\bar{i},\, j-\bar{j}}. \tag{A.27} \]

Note that we have applied the diameter limit of $2l+1$ to the two summation indices, $\bar{i}$ and $\bar{j}$. The $(2l+1)^2$ nonzero entries on each row, except the first and last several rows, lie at entries $i - \bar{i} + (j - \bar{j})n$, which represents a block Toeplitz with Toeplitz blocks structure. After permutation, the matrix representation is

\[ P^T \hat{f} = P^T H P\,(P^T f), \tag{A.28} \]

and the nonzero entries of $P^T H P$ are at $p_{i-\bar{i},\, j-\bar{j}}$, where $p$ is defined in (A.16). The proof of the two level tridiagonal structure is equivalent to proving inequalities similar to those in Proposition A.9. For the diagonal block, we need

\[ p_{ij} - l^2 \le p_{i-\bar{i},\, j-\bar{j}} \le p_{ij} + l^2. \tag{A.29} \]

For the block off the diagonal to the right, we need

\[ p_{ij} + nl - l^2 \le p_{i-\bar{i},\, j-\bar{j}} \le p_{ij} + nl + l^2. \tag{A.30} \]

For the block off the diagonal to the left, we need

\[ p_{ij} - nl - l^2 \le p_{i-\bar{i},\, j-\bar{j}} \le p_{ij} - nl + l^2. \tag{A.31} \]

The inequalities follow from the domains of $\bar{i}$ and $\bar{j}$. To verify the Toeplitz structure, we demonstrate that the permuted $\check{H} = P^T H P$ satisfies $\check{h}_{IJ} = \check{h}_{I+l^2, J+l^2}$. Note that $I$ and $J$ are the row and column indices of the matrix $\check{H}$, while $i$ and $j$ are the indices of the original image $f$; their relationship is $I = p_{ij}$ and $J = p_{i-\bar{i},\, j-\bar{j}}$. Now it becomes clear that

\[ p_{i-\bar{i},\, j-\bar{j}} + l^2 = (i + l - 1 - \bar{i})\,l + \widetilde{j-\bar{j}}\; nl + \widehat{j-\bar{j}}, \]

where $\widetilde{j-\bar{j}} = \lfloor (j-\bar{j}-1)/l \rfloor$ and $\widehat{j-\bar{j}} = (j-\bar{j}-1) \bmod l + 1$. Thus $\bar{i}$ and $\bar{j}$ remain unchanged when $p_{i-\bar{i},\, j-\bar{j}}$ increases by $l^2$, which in turn means $\check{h}_{I+l^2, J+l^2} = h_{\bar{i}\bar{j}} = \check{h}_{IJ}$.
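A direct NumPy rendering of the blurring operation (A.27), restricted to a PSF of diameter at most $2l+1$ and using the zero boundary condition assumed above, is sketched below; it is our own illustration and not code from the paper.

```python
import numpy as np

def blur(F, h, l):
    """Apply (A.27): out[i, j] = sum_{|di|,|dj| <= l} h[di, dj] * F[i - di, j - dj],
    with zero boundary conditions; h is the (2l+1) x (2l+1) PSF, center at h[l, l]."""
    n = F.shape[0]
    out = np.zeros_like(F, dtype=float)
    for di in range(-l, l + 1):
        for dj in range(-l, l + 1):
            w = h[di + l, dj + l]
            if w == 0.0:
                continue
            # F[i - di, j - dj] contributes to out[i, j]; clip both to the image.
            dst = (slice(max(0, di), n - max(0, -di)), slice(max(0, dj), n - max(0, -dj)))
            src = (slice(max(0, -di), n - max(0, di)), slice(max(0, -dj), n - max(0, dj)))
            out[dst] += w * F[src]
    return out
```

With a $3 \times 3$ PSF and $l = 1$, this reduces to an ordinary zero-padded convolution.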
The next proposition, about the structure of $P^T HS P$, gives the key piece of the proofs of Theorems 2.5 and 2.6.

Proposition A.12. With $H$ defined as in Proposition A.11 and $S$ defined as in Definition A.7, $P^T HS P$ also has a two level tridiagonal block Toeplitz structure.

Proof. Once again, the proof utilizes the definitions of $H$ and $S$ in subscript form. We have that $\tilde{f} = H\hat{f} = HSf$ is equivalent to

\[
\begin{aligned}
\tilde{f}_{ij} &= \sum_{\bar{i}=-l}^{l} \sum_{\bar{j}=-l}^{l} h_{\bar{i}\bar{j}}\, \hat{f}_{i-\bar{i},\, j-\bar{j}} \\
&= \sum_{\bar{i}=-l}^{l} \sum_{\bar{j}=-l}^{l} h_{\bar{i}\bar{j}} \big( w_{11} f_{i-\bar{i}+\hat{\delta}_y,\, j-\bar{j}+\hat{\delta}_x} + w_{12} f_{i-\bar{i}+\hat{\delta}_y+1,\, j-\bar{j}+\hat{\delta}_x} \\
&\qquad\qquad\qquad + w_{21} f_{i-\bar{i}+\hat{\delta}_y,\, j-\bar{j}+\hat{\delta}_x+1} + w_{22} f_{i-\bar{i}+\hat{\delta}_y+1,\, j-\bar{j}+\hat{\delta}_x+1} \big).
\end{aligned} \tag{A.32}
\]

As an example, one of the nonzero entries on row $p_{ij}$ is given by $p_{i-\bar{i}+\hat{\delta}_y,\, j-\bar{j}+\hat{\delta}_x}$. By the definition of $p_{ij}$, it is easy to see that the positions of the nonzero entries are effectively shifted by a fixed amount determined by $\delta_x$ on the first level and by $\delta_y$ on the second level. Due to the limitation $|\delta_x|, |\delta_y| \le l$, we have a shifted two-level tridiagonal structure.

This proposition comes as somewhat of a surprise, because we would normally expect that a two-level tridiagonal block Toeplitz structure ($P^T H P$) multiplied by a two-level bidiagonal block Toeplitz structure ($P^T S P$) would result in a two-level quadri-diagonal block Toeplitz structure. However, in our case it remains tridiagonal.

A.3.1. Proof of Theorem 2.5. The proof is similar to that of Theorem 2.1, because each block of $P^T D H_i S_i P$ is also of size $1 \times l^2$. Variations in the particular set of offsets $\delta_x$ and $\delta_y$ corresponding to $S_i$ imply that some $P^T D H_i S_i P$ have a tridiagonal structure shifted to the right while others have one shifted to the left. Their concatenation into $Q^T A P$ is a two-level penta-diagonal block Toeplitz structure.

A.3.2. Proof of Theorem 2.6. Theorem 2.6 follows from Proposition A.12. Note that $P^T(DHS)^T DHS P = (P^T (HS)^T P)(P^T D^T D P)(P^T HS P)$. Since $P^T D^T D P$ is a block diagonal matrix, $(P^T (HS)^T P)(P^T D^T D P)(P^T HS P)$ has the same structure as $P^T (HS)^T P\; P^T (HS) P$. It follows that the product of two two-level tridiagonal block Toeplitz matrices is a two-level penta-diagonal block Toeplitz matrix. Since $A^T A = \sum_{i=1}^{l^2} (D H_i S_i)^T D H_i S_i$ and $P^T A^T A P = \sum_{i=1}^{l^2} P^T (D H_i S_i)^T D H_i S_i P$, the matrix $A^T A$ has the same two-level structure.

A.3.3. Proof of Theorem 2.7. Note that $S_i$ is sparse with non-zero diagonals whose weights correspond to a bilinear interpolation defined by $\delta_x^{(i)}, \delta_y^{(i)}$. This allows us to restrict our attention to the $nl \times n^2$ submatrix

\[ [\,A_{-1}\;\; A_0\;\; A_1\;\; 0\; \dots\; 0\,] \tag{A.33} \]

and the $l^2 \times nl$ submatrix

\[ [\,A_{-1}^{(i)}\;\; A_0^{(i)}\;\; A_1^{(i)}\;\; 0\; \dots\; 0\,], \tag{A.34} \]

and use the first and second order Toeplitz structure of $A$. Furthermore, we have the following lemma, whose proof follows by direct calculation.

Lemma A.13. The following structural descriptions hold:
1. Only the entries $s_{i,j}$ of $S$ with $nl < i \le 2nl$ and $l < (j \bmod n) \le 2l$ contribute to $A_j^{(i)}$ under the product $Q^T D S P$.
2. Row $k$ of $A_j^{(i)}$ contains information from only $DS_k$.

Proof. The proof involves a partition of the rows of the matrices $DS_i$ so that their permuted form can be investigated. We first label the rows of $S_i$ so that row $\alpha \equiv k \bmod n$ has label $c_k$. The rows are then labeled $[c_1\, c_2\, \dots\, c_n\; c_1\, \dots\, c_n\; \dots\; c_1\, \dots\, c_n]^T$ with $n$ repetitions of $c_1, \dots, c_n$. Under the permutation $P^T S_i P$ the labels are reordered as

\[ [c_1 \dots c_1\;\; c_2 \dots c_2\;\; \dots\;\; c_n \dots c_n]^T, \tag{A.35} \]

such that the $l$ rows labeled $c_k$ are clustered together. The label pattern repeats after $nl$ rows because the permutation $P$ is closed on sets of indices of length $nl$. The operation $DP$ averages $l^2$ consecutive rows into one row of $DS_iP$. Finally, the permutation $Q$ shuffles the rows of $DS_iP$ into rows $j \equiv i \bmod l^2$.
Excluding the first repetition of row labels in (A.35), we have that the submatrix $[A_{-1}\; A_0\; A_1\; 0 \dots 0]$ in (A.33) is comprised of one set of $l^2$ rows of $DS_iP$ for each $i \le l^2$, proving part (2). Part (1) of the lemma follows from the preceding analysis and Definition A.7.

Using Lemma A.13, it is possible to completely characterize $A_j^{(i)}$ from only $3l^2$ elements of each $S_k$ for $1 \le k \le l^2$. Next, we examine the effect of multiplying $P^T S P$ by $DP$. Note that $DP$ averages columns in row blocks of size $l^2$. Thus, the proof reduces to an investigation of the support of rows $n + l^2 + 1, \dots, n + 2l^2$ of each $P^T S_i P$ (the first few blocks form $B_0$ and the analysis follows similarly). Partition these rows into $l^2 \times l^2$ blocks $H_i$, which further divide into $l \times l$ blocks $J_{\alpha,\beta}$. Within each $H_i$, the permutation $P^T S_i P$ shuffles the $J_{\alpha,\beta}$ such that the first $l \times l$ block of the permuted block $\hat{H}_i$ contains the $(1,1)$ element from each $J_{(\alpha,\beta)}$. The first block has the format

\[
\begin{bmatrix}
J_{(1,1),1,1} & J_{(1,2),1,1} & \dots & J_{(1,l),1,1} \\
J_{(2,1),1,1} & J_{(2,2),1,1} & \dots & J_{(2,l),1,1} \\
\vdots & & & \vdots \\
J_{(l,1),1,1} & J_{(l,2),1,1} & \dots & J_{(l,l),1,1}
\end{bmatrix}, \tag{A.36}
\]

where the left indices identify a block $J$ and the right indices identify an element in the block. Part (1) of Theorem 2.7 follows directly. Part (2) follows because the sum of the 4 weights in each bilinear interpolation is 1.

A.4. Proof of Theorem 3.1. The matrices $A^T A + \alpha I$ and $A_0$ are positive definite, so we can refer to Theorem 10.1.2 in [8]. However, Theorem 10.1.2 concerns the Gauss-Seidel iteration method, not the block Gauss-Seidel iteration introduced here. Most of the proof extends naturally, but we clarify one less obvious point. Using the notation in [8], we define $G = -(D + L)^{-1} L^T$, where $D = \operatorname{diag}(A_0, A_0, \dots, A_0)$ and $L$ is a strictly lower triangular matrix. We need to prove that

\[ G_1 \equiv D^{1/2} G D^{-1/2} = -(I + L_1)^{-1} L_1^T, \tag{A.37} \]

where $L_1 = D^{-1/2} L D^{-1/2}$, or equivalently,

\[ D^{1/2} (D + L)^{-1} D^{1/2} = (I + L_1)^{-1}. \tag{A.38} \]

When $D$ is only a diagonal matrix, it is easy to verify (A.38), but in this case $D$ is block diagonal. This proves not to be a problem. Notice that $P^T A^T A P + \alpha I$ has a $2 \times 2$ block form and thus we can explicitly write its inverse,

\[
(D + L)^{-1} = \begin{bmatrix} D_0 & 0 \\ L & D_0 \end{bmatrix}^{-1}
= \begin{bmatrix} D_0^{-1} & 0 \\ -D_0^{-1} L D_0^{-1} & D_0^{-1} \end{bmatrix}, \tag{A.39}
\]

where $D_0$ is the upper left (or lower right) block and $L$ is the lower left block of $P^T A^T A P + \alpha I$ in (3.1). Then we multiply by $D^{1/2}$ on both sides to get

\[
D^{1/2} (D + L)^{-1} D^{1/2}
= D^{1/2} \begin{bmatrix} D_0^{-1} & 0 \\ -D_0^{-1} L D_0^{-1} & D_0^{-1} \end{bmatrix} D^{1/2}
= \begin{bmatrix} I & 0 \\ -D_0^{-1/2} L D_0^{-1/2} & I \end{bmatrix}. \tag{A.40}
\]

It is easy to verify that the right side of the equation above is $(I + L_1)^{-1}$.

A.5. Proof of Theorem 3.2. Again, $A^T A + \alpha I$ and $\hat{A}_0$ are positive definite and we can refer to Theorem 10.1.2 in [8]. We need to verify that

\[ G_1 \equiv D^{1/2} G D^{-1/2} = -(I + D^{-1/2} L D^{-1/2})^{-1} (D^{-1/2} L D^{-1/2})^T, \tag{A.41} \]

or equivalently,

\[ D^{1/2} (D + L)^{-1} D^{1/2} = (I + D^{-1/2} L D^{-1/2})^{-1}. \tag{A.42} \]

Notice that $A^T A + \alpha I$ has a $3 \times 3$ block form and we can explicitly write out its inverse,

\[
(D+L)^{-1} = \begin{bmatrix} \hat{A}_0 & 0 & 0 \\ \hat{A}_{-1} & \hat{A}_0 & 0 \\ \hat{A}_{-2} & \hat{A}_{-1} & \hat{A}_0 \end{bmatrix}^{-1}
= \begin{bmatrix} \hat{A}_0^{-1} & 0 & 0 \\ -\hat{A}_0^{-1}\hat{A}_{-1}\hat{A}_0^{-1} & \hat{A}_0^{-1} & 0 \\ \hat{A}_0^{-1}(\hat{A}_{-1}\hat{A}_0^{-1}\hat{A}_{-1} - \hat{A}_{-2})\hat{A}_0^{-1} & -\hat{A}_0^{-1}\hat{A}_{-1}\hat{A}_0^{-1} & \hat{A}_0^{-1} \end{bmatrix},
\]

where $\hat{A}_0$ is the diagonal block, which is positive definite. Hence,

\[
D^{1/2}(D+L)^{-1}D^{1/2} =
\begin{bmatrix} I & 0 & 0 \\ -\hat{A}_0^{-1/2}\hat{A}_{-1}\hat{A}_0^{-1/2} & I & 0 \\ \hat{A}_0^{-1/2}(\hat{A}_{-1}\hat{A}_0^{-1}\hat{A}_{-1} - \hat{A}_{-2})\hat{A}_0^{-1/2} & -\hat{A}_0^{-1/2}\hat{A}_{-1}\hat{A}_0^{-1/2} & I \end{bmatrix}. \tag{A.43}
\]

It is easy to verify that the right side of (A.43) is $(I + D^{-1/2} L D^{-1/2})^{-1}$.