A NEW SCALING FOR NEWTON'S ITERATION FOR THE POLAR DECOMPOSITION AND ITS BACKWARD STABILITY

RALPH BYERS*†‡ AND HONGGUO XU*§

Abstract. We propose a scaling scheme for Newton's iteration for calculating the polar decomposition. The scaling factors are generated by a simple scalar iteration in which the initial value depends only on estimates of the extreme singular values of the original matrix, which can for example be the Frobenius norms of the matrix and its inverse. In exact arithmetic, for matrices with condition number no greater than 10^16, with this scaling scheme no more than 9 iterations are needed for convergence to the unitary polar factor with a convergence tolerance roughly equal to 10^{-16}. It is proved that if matrix inverses computed in finite precision arithmetic satisfy a backward-forward error model then the numerical method is backward stable. It is also proved that Newton's method with Higham's scaling or with Frobenius norm scaling is backward stable.

Key words. matrix sign function, polar decomposition, singular value decomposition (SVD), Newton's method, numerical stability, scaling

AMS subject classifications. 65F05, 65G05

1. Introduction. Every matrix A ∈ C^{n×n} has a polar decomposition A = QH, where H = H* ∈ C^{n×n} is Hermitian positive semi-definite and Q ∈ C^{n×n} is unitary, i.e., Q*Q = I. The polar decomposition is unique with positive definite Hermitian factor H if and only if A is nonsingular. Its applications include unitary approximation and distance calculations [8, 9, 12]. The polar decomposition generalizes to rectangular matrices; see, for example, [15]. We consider only the square matrix case here, because numerical methods for computing the polar decomposition typically begin by reducing the problem to the square matrix case using, for example, a QR factorization [5, 8]. (An algorithm that works directly with rectangular matrices appears in [6].)
The polar decomposition may be easily constructed from a singular value decomposition (SVD) of A. However, the SVD is a substantial calculation that displays much more of the structure of A than does the polar decomposition. Constructing the polar decomposition from the SVD destroys this extra information and wastes the arithmetic work used to compute it. It is intuitively more appealing to use the polar decomposition as a preliminary step in the computation of the SVD as in [12]. When A is nonsingular, one way to compute the polar decomposition is through Newton's iteration

(1.1)    Qk+1 = (ζk Qk + (ζk Qk)^{-*})/2,    Q0 = A,

where ζk = ζ(Qk) > 0 is a positive scalar function of Qk chosen to accelerate convergence [8]. Each iterate Qk has polar decomposition Qk = Q Hk, where Q is the unitary polar factor of A, H0 = H is the Hermitian polar factor of A, and

Hk+1 = (ζk Hk + (ζk Hk)^{-1})/2,    k ≥ 0.

For appropriately chosen acceleration parameters ζk, lim_{k→∞} Hk = I. Hence, the unitary polar factor is Q = lim_{k→∞} Qk and the Hermitian polar factor is H = lim_{k→∞} Qk* A. Iteration (1.1) was first proposed in [8] and studied further in [5, 6, 14]. It is called "Newton's iteration" because it can be derived from Newton's method applied to the equation X*X = I. It is closely related to Newton's iteration for the matrix sign function [17, 22]. Simplicity is an attractive feature of (1.1).

* Department of Mathematics, University of Kansas, Lawrence, Kansas 66045, USA. xu@math.ku.edu.
† Deceased.
‡ This material is based upon work partially supported by the University of Kansas General Research Fund allocations 2301062-003 and 2301054-003 and by the National Science Foundation awards 0098150, 0112375, and 9977352.
§ This author is partially supported by NSF under Grant No. EPS-9874732 with matching support from the State of Kansas and the University of Kansas General Research Fund allocation 2301717-003.
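As a concrete illustration, the unscaled iteration (ζk ≡ 1) can be sketched in a few lines of NumPy. This is our own sketch (the function name and tolerances are illustrative, and np.linalg.inv stands in for whatever inversion method is used); it is not the scaled algorithm analyzed later.

```python
import numpy as np

def polar_newton_unscaled(A, tol=1e-12, max_iter=100):
    """Sketch of the unscaled Newton iteration (1.1) with zeta_k = 1:
    Q_{k+1} = (Q_k + Q_k^{-*})/2, Q_0 = A.
    Returns Q and the Hermitian factor H = (Q*A + (Q*A)*)/2."""
    Q = np.array(A, dtype=complex)
    for _ in range(max_iter):
        Q_next = 0.5 * (Q + np.linalg.inv(Q).conj().T)  # one Newton step
        done = np.linalg.norm(Q_next - Q, 'fro') < tol
        Q = Q_next
        if done:
            break
    QhA = Q.conj().T @ A
    H = 0.5 * (QhA + QhA.conj().T)                      # Hermitian polar factor
    return Q, H
```

For a well-conditioned A the iterates converge quadratically to the unitary factor; the scalings discussed next exist to keep the iteration count small even when A is ill conditioned.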
Apart from the computation of ζk, each iteration needs only one matrix inversion and one matrix-matrix addition. This simplicity allows implementations of (1.1) to take advantage of hierarchical memory and parallelism [1, 2, 13]. Many authors have studied choices of the acceleration parameters ζk [3, 4, 8, 16, 17, 22]. If ζk ≡ 1, then the iterates Qk converge quadratically to the unitary polar factor Q [8]. Convergence is also quadratic if ζ = ζ(U) is a smooth function of U ∈ C^{n×n} and ζ(U) = 1 whenever U is unitary. The choice

(1.2)    ζk^{(2)} = ( ||Qk^{-1}||2 / ||Qk||2 )^{1/2},

where ||·||2 is the spectral norm, is proposed in [8]. This scale factor is optimal in the sense that, given Qk, (1.2) minimizes the next error ||Qk+1 − Q||2. With this scale factor, for the matrices Qk generated by (1.1), the error sequence ||Qk − Q||2 converges monotonically to zero. Unfortunately, to determine the scale factor (1.2), one needs to compute two extreme singular values of Qk at each iteration. In order to preserve rapid convergence of (1.1) with scaling (1.2), highly accurate values of these extreme singular values are required to guarantee ζk^{(2)} → 1. This is expensive enough to make scale factor (1.2) unattractive. To save the cost of computing the extreme singular values, one might approximate (1.2). A commonly used scale factor is the (1, ∞)-scaling

(1.3)    ζk^{(1,∞)} = ( ||Qk^{-1}||1 ||Qk^{-1}||∞ / (||Qk||1 ||Qk||∞) )^{1/4}

(where ||·||1 and ||·||∞ are the 1-norm and ∞-norm, respectively), which was proposed by Higham in [8]. The factor ζk^{(1,∞)} is within a constant factor of ζk^{(2)}. It adds a negligible amount of arithmetic work compared to the cost of Qk^{-1}, which is needed at each iteration anyway. The scale factor

(1.4)    ζk^{(F)} = ||Qk^{-1}||F^{1/2} ||Qk||F^{-1/2}

(where ||·||F is the Frobenius norm) is discussed in [5, 8, 16]. It can also be computed at a negligible cost.
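For reference, the three scale factors (1.2)-(1.4) can be computed as follows. This is an illustrative sketch (the function name is ours); in a real implementation Qk^{-1} is reused from the iteration itself, and (1.2) would be driven by singular-value estimates rather than the exact norms computed here.

```python
import numpy as np

def scale_factors(Q):
    """Compute the scale factors (1.2), (1.3), and (1.4) for one iterate Q."""
    Qinv = np.linalg.inv(Q)
    # (1.2): optimal 2-norm scaling (needs extreme singular values in practice)
    z2 = np.sqrt(np.linalg.norm(Qinv, 2) / np.linalg.norm(Q, 2))
    # (1.3): Higham's (1, infinity)-scaling, within a constant factor of (1.2)
    z1inf = (np.linalg.norm(Qinv, 1) * np.linalg.norm(Qinv, np.inf)
             / (np.linalg.norm(Q, 1) * np.linalg.norm(Q, np.inf))) ** 0.25
    # (1.4): Frobenius norm scaling
    zF = np.sqrt(np.linalg.norm(Qinv, 'fro') / np.linalg.norm(Q, 'fro'))
    return z2, z1inf, zF
```

For a unitary Q all three factors equal 1, consistent with the quadratic convergence condition ζ(U) = 1 at unitary U.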
It is optimal in the sense that, given Qk, it minimizes ||Qk+1||F, and causes the sequence ||Qk||F to converge monotonically [5]. Another relatively inexpensive scale factor is [4]

(1.5)    ζk^{(d)} = |det(Qk)|^{-1/n}.

The complex modulus of the determinant is very inexpensively obtained from the same matrix factorization used to calculate Qk^{-1}. This scaling is optimal in the sense that, for a given iterate Qk, it minimizes D(Qk+1) = Σ_{j=1}^n (ln σj^{(k+1)})^2, where σj^{(k+1)} is the j-th singular value of Qk+1. The function D(Qk+1) is a measure of the departure of Qk+1 from the unitary matrices.

This paper considers the sub-optimal scaling strategy

(1.6)    ζ0 = 1/√(ab),    ζ1 = √( 2√(ab)/(a + b) ),    ζk = 1/√(ρ(ζk−1)),    k = 2, 3, ...,

where ρ(x) = (x + x^{-1})/2 and a and b are any numbers such that 0 < a ≤ ||A^{-1}||2^{-1} ≤ ||A||2 ≤ b. Apart from estimating the extreme singular values of the initial matrix Q0 = A, the scale factor costs only several floating point operations per iteration. Moreover, only rough estimates of ||A||2 and ||A^{-1}||2^{-1} are needed. One may simply choose a = ||A^{-1}||F^{-1} and b = ||A||F. From Table 2.1, with such choices, for any matrix with condition number no greater than 10^16 and size no greater than 10^11, at most nine iterations of (1.1) with scaling (1.6) are necessary to approximate the unitary polar factor Q to within 2-norm distance less than 10^{-16}. We show below that in the presence of rounding error, (1.1) with (1.6) is numerically stable assuming that matrix inverses are calculated with small backward-forward error. This is the case, for example, when matrix inverses are computed using the bidiagonal reduction [7, Page 252]. We also prove the numerical stability of Newton's iteration with any of the scalings (1.2)-(1.4).
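The scalar iteration (1.6) is cheap enough to tabulate in advance, given the bounds a and b. A minimal sketch (function name ours):

```python
import math

def suboptimal_scale_factors(a, b, p):
    """Generate the first p sub-optimal scale factors from (1.6), given
    0 < a <= smallest singular value of A and b >= largest singular value."""
    rho = lambda x: (x + 1.0 / x) / 2.0
    zetas = [1.0 / math.sqrt(a * b)]                       # zeta_0
    if p > 1:
        zetas.append(math.sqrt(2.0 * math.sqrt(a * b) / (a + b)))  # zeta_1
    while len(zetas) < p:
        zetas.append(1.0 / math.sqrt(rho(zetas[-1])))      # zeta_k, k >= 2
    return zetas
```

After the first step the factors increase monotonically toward 1, reflecting the shrinking intervals [1, τ^k(b/a)] derived in Section 2.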
Commenting on an early draft of this paper, Krystyna Ziętak pointed out that the sub-optimal (quasi-optimal) scaling parameters were discovered independently by Andrzej Kiełbasiński but not published in the open literature. They were presented by Krystyna Ziętak at the 1999 Householder meeting at Whistler and the 1999 ILAS conference at Barcelona. In Section 5 of their recent paper [19], Kiełbasiński et al. mention the quasi-optimal scaling parameters. In [18], these authors gave an error analysis of Higham's method [8] with the same mixed backward-forward stability assumption for matrix inversion. They gave numerical experiments in [23].

In the following, A ∈ C^{n×n} is always nonsingular. A = UΣV* is the SVD of A, where U and V are unitary, Σ = diag(σ1, ..., σn) is diagonal, and σ1 ≥ ... ≥ σn ≥ 0 are the singular values of A. The set of the singular values is denoted by σ(A). The condition number of A with respect to the spectral norm is denoted by κ2(A) = σ1/σn. Following [7, Page 18], a flop is the computational work of a floating point addition, subtraction, multiplication or division together with the associated subscripting and indexing overhead. It takes two flops to execute the Fortran statement A(I,J) = A(I,J) + C*A(K,J).

2. Scaling and convergence. Let A ∈ C^{n×n} be nonsingular with the SVD A = UΣV*, Σ = diag(σ1, σ2, ..., σn). Each Newton iterate Qk in (1.1) has singular value decomposition Qk = U Σk V*, where

(2.1)    Σk+1 = (ζk Σk + (ζk Σk)^{-1})/2,    Σ0 = Σ.

In particular, Qk has singular values σ1^{(k)}, σ2^{(k)}, ..., σn^{(k)} > 0 (in no particular order when k > 0) that obey

(2.2)    σj^{(0)} = σj,    σj^{(k+1)} = (ζk σj^{(k)} + 1/(ζk σj^{(k)}))/2 = ρ(ζk σj^{(k)}),    k = 0, 1, 2, ...,

where ρ(x) = (x + x^{-1})/2. For appropriately chosen ζk, lim_{k→∞} σj^{(k)} = 1 and lim_{k→∞} Qk = UV* = Q. Consequently, the convergence properties of (1.1) derive directly from the n scalar sequences σj^{(k)} determined by (2.2).
To attain good convergence behavior in (1.1), the acceleration parameters ζk must interact well with ρ(x) = (x + x^{-1})/2. The following two lemmas list some easily verified elementary properties of ρ.

Lemma 2.1. If x > 0, then
1. ρ(1/x) = ρ(x);
2. 1 ≤ ρ(x) ≤ max(x, x^{-1}), with either equality iff x = 1;
3. ρ(x) is decreasing on x ∈ (0, 1] and increasing on x ∈ [1, ∞).

Lemma 2.2. Suppose that 0 < a ≤ b. Define αζ = max{(ζa)^{-1}, ζb} and ζopt = (ab)^{-1/2}. Then we have the following properties.
1. For any ζ > 0, 1 ≤ max_{a≤x≤b} ρ(ζx) = ρ(αζ), and 1 = max_{a≤x≤b} ρ(ζx) iff αζ = 1.
2. For any ζ > 0, 1 ≤ min_{a≤x≤b} ρ(ζx), and 1 = min_{a≤x≤b} ρ(ζx) iff ζa ≤ 1 ≤ ζb.
3. min_{ζ>0} αζ = α_{ζopt} = √(b/a), and ζ = ζopt is the only minimizer.
4. min_{ζ>0} max_{a≤x≤b} ρ(ζx) = min_{ζ>0} ρ(αζ) = ρ(α_{ζopt}) = ρ(√(b/a)).

In the following, for ease of notation, let τ(x) be the function

(2.3)    τ(x) = ρ(√x) = (√x + 1/√x)/2.

The k-fold composition of τ(x) with itself is written τ^k(x), i.e., τ^0(x) = x, τ^1(x) = τ(x), and for k ≥ 1, τ^{k+1}(x) = τ(τ^k(x)). Similarly, ρ^k(x) is the k-fold composition of ρ(x) = (x + x^{-1})/2 with itself.

Suppose that 0 < a ≤ σn ≤ σ1 ≤ b. Consider the sequence of intervals generated by Newton's iteration: [a0, b0] = [a, b], [a1, b1] = ρ(ζ0[a0, b0]), [a2, b2] = ρ(ζ1[a1, b1]), .... It follows from (2.2) that σj^{(k)} ∈ [ak, bk], j = 1, 2, ..., n, k = 0, 1, .... Note that min_{x>0} ρ(x) = 1, so, for k ≥ 1, [ak, bk] ⊆ [1, bk] and

(2.4)    1 ≤ σj^{(k)} ≤ bk.

It is intuitively satisfying to choose the sequence of acceleration parameters ζk in (1.1) to minimize the sequence bk. From Lemma 2.2, the initial optimal scaling factor is ζ0 = (ab)^{-1/2}. The initial interval is scaled to be ζ0[a, b] = [√(a/b), √(b/a)], which contains 1. The next interval is [a1, b1] = ρ(ζ0[a, b]) = [1, ρ(√(b/a))] = [1, τ(b/a)], where τ(x) is given by (2.3).
The left endpoint is a1 = 1, so the optimal scaling factor for the next iteration is ζ1 = b1^{-1/2} = 1/√(τ(b/a)). The next interval is

[a2, b2] = ρ(ζ1[a1, b1]) = [1, ρ(√(τ(b/a)))] = [1, τ^2(b/a)].

An easy induction shows that the sequence of intervals is [a0, b0] = [a, b] and, for k ≥ 1, [ak, bk] = [1, τ^k(b/a)], and the sequence of optimal scaling factors is

(2.5)    ζ0 = 1/√(ab),    ζk = 1/√(τ^k(b/a)),    k = 1, 2, 3, ...,

which is equivalent to (1.6).

Since τ(x) = ρ(√x) ≤ ρ(x) for x ≥ 1, τ(b/a) ≤ ρ(b/a). By induction we have τ^k(b/a) ≤ ρ^k(b/a) for all k ≥ 1. The sequence b/a, ρ(b/a), ρ^2(b/a), ... is generated by Newton's iteration x_{k+1} = ρ(x_k) with x_0 = b/a. It converges to 1 quadratically. Obviously τ^k(b/a) ≥ 1 for k ≥ 0. So b/a, τ(b/a), τ^2(b/a), ... also converges to 1 at least quadratically. It is not difficult to show that 1 ≤ τ(x) ≤ x for any x ≥ 1. We have bk = τ^k(b/a) = τ(τ^{k−1}(b/a)) ≤ τ^{k−1}(b/a) = b_{k−1}. Hence, after the first step, the sequence of intervals satisfies [a1, b1] ⊇ [a2, b2] ⊇ ... ⊇ [ak, bk] ⊇ ..., and it converges to the single point 1 quadratically. (Note ak = 1 for all k ≥ 1.) The initial interval [a0, b0] = [a, b] is an exception because, in general, [a, b] ⊉ [1, τ(b/a)]. Based on this fact and (2.4), the convergence properties of (1.1) with (1.6) are clear, and we summarize them in the following theorem.

Theorem 2.3. If

(2.6)    0 < a ≤ ||A^{-1}||2^{-1} ≤ ||A||2 ≤ b,

and Qk is obtained from the Newton iteration (1.1) with scaling (1.6), then

(2.7)    ||Qk − Q||2 ≤ τ^k(b/a) − 1 ≤ ρ^k(b/a) − 1,    k = 1, 2, ....

In fact, the convergence properties are highly satisfactory even when b/a is large. Table 2.1 uses Theorem 2.3 to list the number of Newton iterations (1.1) with scaling (1.6) (and exact arithmetic) required to guarantee selected absolute errors δ > ||Qk − Q||2 for selected values of b/a.
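Theorem 2.3 makes the iteration count predictable in advance: it is the smallest k with τ^k(b/a) − 1 < δ. A short sketch (function name ours) that should reproduce the entries of Table 2.1:

```python
import math

def iterations_needed(b_over_a, delta, kmax=100):
    """Smallest k with tau^k(b/a) - 1 < delta, per Theorem 2.3,
    where tau(x) = (sqrt(x) + 1/sqrt(x)) / 2."""
    tau = lambda x: (math.sqrt(x) + 1.0 / math.sqrt(x)) / 2.0
    x = b_over_a
    for k in range(1, kmax + 1):
        x = tau(x)
        if x - 1.0 < delta:
            return k
    return None  # not reached for any realistic b/a
```

For example, b/a = 10 with δ = 10^{-16} needs 5 iterations, and even b/a = 10^20 needs only 9, illustrating the insensitivity to the quality of the bounds a and b.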
The table demonstrates that Newton's iteration (1.1) with scaling (1.6) typically needs no more than nine iterations to attain typical floating point precision accuracy. The table also demonstrates that convergence is insensitive to the choice of a and b: widely differing values of b/a need similar numbers of iterations to attain similar accuracy. In particular, the easy-to-compute choices a = ||A^{-1}||F^{-1} and b = ||A||F satisfy (2.6) and are unlikely to lead to even one more iteration than the optimum choices a = ||A^{-1}||2^{-1} and b = ||A||2, particularly for ill-conditioned matrices. For instance, for any A ∈ C^{n×n} with κ2(A) = 10^16, with a = ||A^{-1}||F^{-1} and b = ||A||F we have b/a ≤ nκ2(A) = n·10^16. Then b/a ≤ 10^27 for any n ≤ 10^11, and the number of iterations is 9, the same as with the optimum choices.

Table 2.1
Number of Newton iterations (1.1) with scaling (1.6) (and exact arithmetic) required to guarantee absolute error ||Qk − Q||2 < δ for selected values of δ and b/a such that 0 < a ≤ ||A^{-1}||2^{-1} ≤ ||A||2 ≤ b. See Theorem 2.3.

b/a \ δ    10^-1  10^-4  10^-7  10^-10  10^-13  10^-16  10^-19
10           2      4      4      5       5       5       6
10^5         4      5      6      7       7       7       7
10^10        5      6      7      7       8       8       8
10^15        6      7      8      8       8       9       9
10^20        6      7      8      8       9       9       9
10^25        6      8      8      9       9       9      10
10^27        6      8      8      9       9       9      10

In Theorem 2.3, smaller values of b/a give smaller values of τ^k(b/a) and hence better error bounds. Inequality (2.6) implies that b/a ≥ κ2(A) = ||A^{-1}||2 ||A||2, and equality can be achieved only with a = ||A^{-1}||2^{-1} = σn and b = ||A||2 = σ1. With a = σn and b = σ1 the scaling factors (1.6) are

(2.8)    ζ0 = 1/√(σ1 σn),    ζk = 1/√(τ^k(κ2(A))),    k ≥ 1,

and the corresponding intervals are [σn, σ1] and [1, τ^k(κ2(A))] for k ≥ 1. Let Σk be the matrices generated by (2.1) with a = σn and b = σ1. It is easy to verify that in this case the right endpoint of the k-th interval is a singular value of Σk. This in turn implies that inequality (2.7) is an equality, i.e., ||Qk − Q||2 = ||Σk − I||2 = τ^k(κ2(A)) − 1.
The number sequence bk was also derived in [16] in order to show the convergence behavior of Newton's method with the optimal scale factors. It is shown that when a = σn and b = σ1, for Qk generated with ζ_{k−1}^{(2)} defined in (1.2), one has ||Qk||2 ≤ bk [16, 11]. Due to this fact we call the ζk defined in (2.5) sub-optimal scale factors. Note that bk is derived here based on a different interpretation: it is the right endpoint of the k-th interval generated by applying Newton's iteration to the initial interval [a, b]. For this interval iteration, ζk is the scale factor that minimizes bk+1 − 1 (i.e., it makes [ak+1, bk+1] as close to 1 as possible).

3. The Algorithm. Newton's method (1.1) with scaling scheme (1.6) is implemented by the following algorithm.

Algorithm 3.1 (Newton's method (1.1) with scaling (1.6))
Input: Nonsingular matrix A ∈ C^{n×n} and a stopping criterion δ > 0
Output: The polar decomposition A = QH
Step 0:
  a. Set Q0 = A; compute Q0^{-*}
  b. Choose a ≤ ||Q0^{-1}||2^{-1} and b ≥ ||Q0||2; ζ0 = 1/√(ab)
  c. Set k = 0
Step 1: While ||Qk − Qk^{-*}||F ≥ δ
  a. Qk+1 = (ζk Qk + ζk^{-1} Qk^{-*})/2
  b. If k = 0, ζ1 = √( 2/(√(b/a) + √(a/b)) ); else ζk+1 = √( 2/(ζk + ζk^{-1}) ). End if
  c. Compute Qk+1^{-*}
  d. k = k + 1
End while
Step 2: Q = (Qk + Qk^{-*})/2;  H = (Q*A + (Q*A)*)/2

Here are some remarks.
1. The matrix A^{-1} = Q0^{-1} needs to be computed in the first iteration anyway. Hence, power iterations on A and A^{-1} may be used to estimate the extreme singular values σ1 and σn^{-1}, respectively, using only O(n^2) extra flops per iteration. These estimates may then serve as the sub-optimal scaling parameters b and a^{-1}, respectively. Since highly accurate estimates of σ1 and σn are unnecessary, a few power iterations should suffice. Alternatively, ||A||F and ||A^{-1}||F^{-1} may be used for b and a.
2. The stopping criterion ||Qk − Qk^{-*}||F < δ is essentially equivalent to ||Qk+1 − Qk||F < δ, which is used in [11, Section 8.9].
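A compact NumPy sketch of Algorithm 3.1 follows, using a = ||A^{-1}||F^{-1}, b = ||A||F, and the stopping criterion δ = n^{1/4}√ε suggested in Section 4. The function name is ours, and np.linalg.inv (an LU-based routine) stands in for the matrix inversion; per Remark 3, the backward-stability proof strictly requires a bidiagonal reduction-based inverse, although LU-based inversion usually works well in practice.

```python
import numpy as np

def polar_newton(A, delta=None):
    """Sketch of Algorithm 3.1: Newton's iteration (1.1) with the
    sub-optimal scaling (1.6). Returns Q, H, and the iteration count."""
    n = A.shape[0]
    if delta is None:
        delta = n ** 0.25 * np.sqrt(np.finfo(float).eps)  # delta = n^{1/4} sqrt(eps)
    Q = np.array(A, dtype=complex)
    Qinv = np.linalg.inv(Q)                     # stand-in inversion (LU-based)
    a = 1.0 / np.linalg.norm(Qinv, 'fro')       # a <= ||A^{-1}||_2^{-1}
    b = np.linalg.norm(Q, 'fro')                # b >= ||A||_2
    zeta = 1.0 / np.sqrt(a * b)                 # zeta_0
    k = 0
    while np.linalg.norm(Q - Qinv.conj().T, 'fro') >= delta:
        Q = 0.5 * (zeta * Q + Qinv.conj().T / zeta)        # step (1.1)
        if k == 0:                                          # zeta_1 from (1.6)
            zeta = np.sqrt(2.0 / (np.sqrt(b / a) + np.sqrt(a / b)))
        else:                                               # zeta_{k+1} = 1/sqrt(rho(zeta_k))
            zeta = np.sqrt(2.0 / (zeta + 1.0 / zeta))
        Qinv = np.linalg.inv(Q)
        k += 1
    Q = 0.5 * (Q + Qinv.conj().T)               # Step 2: final symmetrization
    QhA = Q.conj().T @ A
    H = 0.5 * (QhA + QhA.conj().T)
    return Q, H, k
```

On well-scaled inputs the loop terminates within the nine iterations predicted by Table 2.1.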
This follows from the fact that, when ζk ≈ 1 (which is usually the case for a small δ),

Qk+1 − Qk = (ζk^{-1} Qk^{-*} − (2 − ζk)Qk)/2 ≈ (Qk^{-*} − Qk)/2.

In practice, in order for the computed Qk to be within O(ε) of a unitary matrix, where ε is the machine epsilon, it is sufficient to choose δ = O(√ε). See (4.16).
3. Commonly the matrix inversion method used in the algorithm is an LU factorization-based method such as Gaussian elimination with partial or complete pivoting [8]. Such an inversion method usually works well in practice [18]. In order to guarantee that the algorithm is numerically backward stable, one may use the more expensive bidiagonal reduction-based matrix inversion method provided in Appendix A.1, so that the computed matrix inverses satisfy the backward-forward error model; see Assumption 4.1 in Section 4 below.
4. Estimating σ1 and σn usually uses O(n^2) flops. Each iteration uses 2n^3 flops for the matrix inverse by an LU factorization-based method and O(n^2) flops for the matrix addition. Computing H uses 2n^3 flops. If p is the number of iterations for convergence, then the algorithm uses a total of roughly 2(p + 1)n^3 flops [8]. When p = 9 this is about 20n^3 flops, which is less than the QR-like SVD method (which takes 22n^3 to 26n^3 flops for the SVD and 4n^3 for forming Q and H). If the bidiagonal reduction-based matrix inversion method is used, the total cost is 2(3p + 1)n^3 flops.
5. In order to reduce the cost while maintaining numerical stability, one may first use the bidiagonal reduction-based method for a few iterations. When κ2(Qk) is not too large, say below 100, one shifts to an LU factorization-based inversion method for the subsequent iterations. The matrix inverses essentially satisfy the backward-forward error model in the latter case [10, Section 14]. Also, it takes only a few iterations for the condition number to drop below 100. In the case when κ2(A) = 10^16, with a = σn and b = σ1, we have ||Q3||2 = τ^3(10^16) ≈ 30.
Since this is usually the worst case in practice, the bidiagonal reduction-based method is required in no more than 3 iterations. With this strategy, the maximum cost (with p = 9) is 3·6n^3 + (9 − 3)·2n^3 + 2n^3 = 32n^3 flops.
6. Although the cost of computing the scale factors (1.3)-(1.5) is negligible in Newton's iteration, computing the sub-optimal scale factors is essentially costless. Also, the use of sub-optimal scaling simplifies the algorithm, since the "shifting the scale factor to 1" strategy used for the (1, ∞)-scaling ([8]) is not needed. Finally, with sub-optimal scaling, in general, the number of iterations is no greater than 9, and it can be determined in advance by simply computing τ^k(b/a) − 1. It is still not clear how to predict the number of iterations with other scalings, although in practice it is observed that the (1, ∞)-scaling and the sub-optimal scaling have essentially the same convergence rate.

4. Stability and Rounding Error Analysis. In this section, a first order error analysis establishes that Newton's method (1.1) with scaling (1.6) can be implemented in a backward stable way. The same conclusion is drawn for the scalings (1.2), (1.3), and (1.4). In outline, the approach is to estimate the residual ||A − Q̂Ĥ||2 for the rounding-error-perturbed unitary factor Q̂ and Hermitian factor Ĥ produced by finite precision arithmetic in the algorithm in Section 3. The method is to first estimate the forward errors Q̂ − Q and Ĥ − H and then use them to estimate the residual.

For the error analysis, we employ the standard model of floating point arithmetic with machine epsilon ε [10, Section 2.2]. We also need the following assumptions.

Assumption 4.1.
If a nonsingular matrix A ∈ C^{n×n} is inverted using finite precision arithmetic with machine epsilon ε to obtain a "computed inverse" X, then X = (A + E)^{-1} + F, where E, F ∈ C^{n×n} are perturbation matrices satisfying

||E||2 ≤ c1(n) ε ||A||2,    ||F||2 ≤ c2(n) ε ||A^{-1}||2,

and ci(n) (i = 1, 2) are some low degree polynomials in n.

In Newton's method it is typical to use Gaussian elimination with partial or complete pivoting for computing matrix inverses. Although it works well in practice, the computed matrix inverses may not satisfy Assumption 4.1 [10, Section 14.1]. We show in Appendix A.1 that Assumption 4.1 is satisfied by a matrix inversion algorithm that uses the bidiagonal reduction method.

Assumption 4.2. c3(n) κ2(A) ε < 1, where c3(n) is a low degree polynomial in n.

Note that 1/κ2(A) measures the relative distance of a nonsingular A to the nearest singular matrix [7, Page 73], i.e.,

1/κ2(A) = min_{det(A+E)=0} ||E||2 / ||A||2.

If such a condition does not hold, then matrices like A + E in Assumption 4.1 can be singular. So this is a condition on the numerical nonsingularity of A. It is essential in the subsequent first order error analysis, although it won't be explicitly stated.

We now begin the error analysis. In practice, rounding errors perturb Newton's recurrence (1.1). Under Assumption 4.1, if Q̂k is the computed version of Qk, then

Q̂k+1 = (ζk/2) Q̂k + Fk,1 + (1/(2ζk)) ((Q̂k + Fk,2)^{-*} + Fk,3)
      = (ζk/2)(Q̂k + Fk,2) + (1/(2ζk))(Q̂k + Fk,2)^{-*} + Fk,1 + (1/(2ζk))Fk,3 − (ζk/2)Fk,2
(4.1) =: (ζk/2)(Q̂k + Fkb) + (1/(2ζk))(Q̂k + Fkb)^{-*} + Fkf,

where Fkb = Fk,2 and Fkf = Fk,1 + (1/(2ζk))Fk,3 − (ζk/2)Fk,2. The perturbation matrix Fk,1 represents rounding errors introduced by floating point matrix addition and scalar multiplication, and the perturbation matrices Fk,2 and Fk,3 represent rounding errors introduced by matrix inversion under Assumption 4.1.
The F's obey the bounds

||Fk,1||2 ≤ d1 ε max(||ζk Q̂k||2, ||ζk^{-1} Q̂k^{-*}||2),
||Fk,2||2 ≤ d2 ε ||Q̂k||2,
||Fk,3||2 ≤ d3 ε ||Q̂k^{-*}||2,
||Fkb||2 ≤ db ε ||Q̂k||2,
||Fkf||2 ≤ (df ε/2) max(||ζk Q̂k||2, ||ζk^{-1} Q̂k^{-*}||2),

where ε is the machine epsilon and d1, d2 = db, d3, and df are some modest constants that may depend on n, the details of the arithmetic, and the inversion algorithm, but depend neither on Qk nor Q̂k. Each Qk is a smooth function of Qk−1 and each Fk,j = O(ε), so, by induction,

(4.2)    Q̂k = Qk + O(ε).

Hence, the bounds above may be loosely expressed in terms of Qk as

(4.3)    ||Fkb||2 ≤ db ε ||Qk||2 + O(ε^2),
(4.4)    ||Fkf||2 ≤ (df ε/2) max(||ζk Qk||2, ||ζk^{-1} Qk^{-*}||2) + O(ε^2) ≤ df ε ||Qk+1||2 + O(ε^2).

Inequality (4.4) is a consequence of (2.2). We need the following lemma to continue our analysis.

Lemma 4.3. Let A ∈ C^{n×n} be a nonsingular matrix with polar decomposition A = QH, with Q ∈ C^{n×n} unitary and H ∈ C^{n×n} Hermitian positive definite. If F ∈ C^{n×n} with ||F||2 = 1 and t ≥ 0, then when t is sufficiently small, A + tF has the polar decomposition

A + tF = (Q(I + tE) + O(t^2)) (H + tG + O(t^2)),

where G ∈ C^{n×n} is the unique Hermitian solution to

(4.5)    F*A + A*F = GH + HG,

and E ∈ C^{n×n} is the unique skew-Hermitian solution to

(4.6)    Q*F − F*Q = EH + HE.

Also, E is given by

(4.7)    E = (Q*F − G) H^{-1}.

Proof. See proof in Appendix A.2.

At each step of the perturbed Newton iteration (4.1), rounding errors are equivalent to perturbing Q̂k to Q̂k + Fkb, taking one Newton step (1.1), and then perturbing the result by adding Fkf. Let Q̂k = Wk Ĥk and Q̂k + Fkb = W̃k H̃k be the polar decompositions of Q̂k and Q̂k + Fkb, respectively. By Lemma 4.3,

(4.8)    W̃k = Wk(I + Ekb) + O(ε^2),

where Ekb satisfies

(4.9)    Ekb Ĥk + Ĥk Ekb = Wk* Fkb − Fkb* Wk + O(ε^2).

From (4.1),

(ζk/2)(Q̂k + Fkb) + (1/(2ζk))(Q̂k + Fkb)^{-*} = Q̂k+1 − Fkf.

Since the unitary factor in the polar decomposition of the left-hand side matrix is W̃k, applying Lemma 4.3 to Q̂k+1 we have

W̃k = Wk+1(I − Ekf) + O(ε^2),

or equivalently,

(4.10)    Wk+1 = W̃k(I + Ekf) + O(ε^2),

where Ekf satisfies

(4.11)    Ekf Ĥk+1 + Ĥk+1 Ekf = Wk+1* Fkf − Fkf* Wk+1 + O(ε^2).

Since Qk has the polar decomposition Qk = QHk and Q̂k satisfies (4.2), by Lemma 4.3 one also has Ĥk = Hk + O(ε) and Wk = Q + O(ε). Based on these first-order error results, (4.9) and (4.11) can be expressed as

(4.12)    Ekb Hk + Hk Ekb = Q*Fkb − Fkb*Q + O(ε^2),
(4.13)    Ekf Hk+1 + Hk+1 Ekf = Q*Fkf − Fkf*Q + O(ε^2).

Combining (4.8) and (4.10), one has Wk+1 = Wk(I + Ekb + Ekf) + O(ε^2). It follows by induction (with W0 = Q) that Wj = Q(I + Ej) + O(ε^2), j > 0, where

(4.14)    Ej = Σ_{k=0}^{j−1} (Ekb + Ekf).

Suppose that Algorithm 3.1 applied to a nonsingular matrix A ∈ C^{n×n} with polar decomposition A = QH completes Step 1 after p iterations. We obtain Q̂p satisfying ||Q̂p − Q̂p^{-*}||2 ≤ ||Q̂p − Q̂p^{-*}||F < δ. With the polar decomposition Q̂p = Wp Ĥp and (2.2), we have (see the proof in Appendix A.3)

(4.15)    (Q̂p + Q̂p^{-*})/2 = Wp + ΔQ̂p,    where ||ΔQ̂p||2 < δ^2/8.

Suppose δ is small. Step 2 of the algorithm produces approximate polar factors

Q̂ = (Q̂p + Q̂p^{-*})/2 + F,    Ĥ = (Q̂*A + A*Q̂)/2 + K,

where F accounts for rounding error in forming Q̂ from Q̂p and K accounts for rounding error in forming Ĥ from A and Q̂, obeying

||F||2 ≤ dF ε,    ||K||2 ≤ dK ε ||A||2,

with modest constants dF and dK. So we have

(4.16)    ||Q̂ − Wp||2 ≤ dF ε + δ^2/8.

Since Wp = Q(I + E) with E := Ep defined in (4.14), by (4.15),

(4.17)    Q̂ = Q(I + E) + ΔQ̂p + F = Q(I + E + L),

where L = Q*(ΔQ̂p + F) satisfies

||L||2 ≤ dL max(ε, δ^2)

for some modest constant dL, which combines the rounding error F and the effect of the stopping criterion.
Both dL and dK may depend on n and the details of the finite b or H. b With (4.17) precision arithmetic and computational algorithm but not on A, Q and the fact that E is skew-Hermitian, (4.18) b = 1 ((I + E + L)∗ Q∗ A + A∗ Q(I + E + L) + K) H 2 1 = (1 − E + L∗ )H + H(I + E + L) + K) 2 1 = (2H − EH + HE + L∗ H + HL + K). 2 So by (4.17) and (4.18), to the first order, bH b − A = 1 Q(I + E + L)(2H − EH + HE + L∗ H + HL + K) − A Q 2 1 = Q (2H − EH + HE + L∗ H + HL + 2EH + 2LH + K) − A 2 +O(||L||22 ) + O(ε||L||2 ) + O(ε2 ) 1 = Q (EH + HE + L∗ H + HL + 2LH + K) + O(max(ε2 , εδ 2 , δ 4 )). 2 From (4.14) this expression can be written p−1 X bH b − A = 1Q Q (Ekb H + HEkb + Ekf H + HEkf ) 2 k=0 (4.19) 1 + Q(L∗ H + HL + 2LH + K) + O(max(ε2 , εδ 2 , δ 4 )). 2 Note that so far the sub-optimal scale factors have not play a role. In order to continue the analysis we need the following lemma which involves the sub-optimal scaling. 12 R. BYERS AND H. XU Lemma 4.4. Let A ∈ Cn×n be a nonsingular matrix with singular value decomposition U ΣV ∗ , Σ = diag(σ1 , σ2 , . . . , σn ) and σ1 ≥ σ2 ≥ · · · ≥ σn . Consider Newton’s (k) iteration (2.1) with scaling (1.6) and initial iterate Σ0 = Σ. Let σj be the j-th (k) (k) diagonal entry of Σk , and let σmax = max1≤j≤n σj . If a, b satisfy 0 < a ≤ σn and b ≥ σ1 , then, for all k ≥ 0 and 1 ≤ i, j ≤ n, (k) σmax (k) σi + (k) σj ≤ b . σi + σj Proof. See Appendix A.4. Recall that Ekb , Ekf satisfy (4.12) and (4.13), respectively. Let Hk = V Σk V ∗ be an eigen-decomposition where V is unitary and Σk is diagonal obeying (2.1). (Recall that throughout the algorithm the singular vectors of the Qk ’s are the singular vectors of A. In particular, for all k, the unitary matrix of right singular vectors of A is also a unitary matrix of eigenvectors of Hk .) 
In this notation, (4.12) and (4.13) can be written

Ẽkb Σk + Σk Ẽkb = F̃kb − F̃kb* + O(ε^2),
Ẽkf Σk+1 + Σk+1 Ẽkf = F̃kf − F̃kf* + O(ε^2),

where

(4.20)    Ẽkb = V* Ekb V,  Ẽkf = V* Ekf V,  F̃kb = V* Q* Fkb V,  F̃kf = V* Q* Fkf V.

So, the (i, j)-th entries of Ẽkb and Ẽkf are

(4.21)    ẽij,kb = (f̃ij,kb − f̃ji,kb)/(σi^{(k)} + σj^{(k)}) + O(ε^2),    ẽij,kf = (f̃ij,kf − f̃ji,kf)/(σi^{(k+1)} + σj^{(k+1)}) + O(ε^2).

Note that ||Ẽkj||2 = ||Ekj||2 and ||F̃kj||2 = ||Fkj||2 for j = b, f, because V and Q are unitary. Multiplying (4.19) on the left by V* and on the right by V gives

V*(Q̂Ĥ − A)V = (1/2) V*QV Σ_{k=0}^{p−1} (Ẽkb Σ + Σ Ẽkb + Ẽkf Σ + Σ Ẽkf) + (1/2) V*(L*H + HL + 2LH + K)V + O(max(ε^2, εδ^2, δ^4)),

where Ẽkb and Ẽkf are given by (4.20). From (4.21), the (i, j)-th entry of the sum Σ_{k=0}^{p−1} (Ẽkb Σ + Σ Ẽkb + Ẽkf Σ + Σ Ẽkf) is

Σ_{k=0}^{p−1} (ẽij,kb + ẽij,kf)(σi + σj)
  = Σ_{k=0}^{p−1} [ (f̃ij,kb − f̃ji,kb)/(σi^{(k)} + σj^{(k)}) + (f̃ij,kf − f̃ji,kf)/(σi^{(k+1)} + σj^{(k+1)}) ] (σi + σj) + O(ε^2)
  = Σ_{k=0}^{p−1} [ ((f̃ij,kb − f̃ji,kb)/σmax^{(k)}) · ((σi + σj) σmax^{(k)}/(σi^{(k)} + σj^{(k)})) + ((f̃ij,kf − f̃ji,kf)/σmax^{(k+1)}) · ((σi + σj) σmax^{(k+1)}/(σi^{(k+1)} + σj^{(k+1)})) ] + O(ε^2).

Inequalities (4.3) and (4.4) and Lemma 4.4 imply

| Σ_{k=0}^{p−1} (ẽij,kb + ẽij,kf)(σi + σj) |
  ≤ 2ε Σ_{k=0}^{p−1} [ db (σi + σj) σmax^{(k)}/(σi^{(k)} + σj^{(k)}) + df (σi + σj) σmax^{(k+1)}/(σi^{(k+1)} + σj^{(k+1)}) ] + O(ε^2)
  ≤ 2pε(db + df)b + O(ε^2).

Hence, the residual is bounded as

||Q̂Ĥ − A||2 ≤ ||Q Σ_{k=0}^{p−1} (Ekb H + H Ekb + Ekf H + H Ekf)/2||2 + ||Q(L*H + HL + 2LH + K)/2||2 + O(max(ε^2, εδ^2, δ^4))
            ≤ npε(db + df)b + (2dL + dK/2) max(ε, δ^2) ||A||2 + O(max(ε^2, εδ^2, δ^4)).

In the same way, from (4.18) we can obtain

||Ĥ − H||2 ≤ ||(−EH + HE)/2||2 + ||(L*H + HL + K)/2||2 + O(δ^4)
           ≤ npε(db + df)b + (dL + dK/2) max(ε, δ^2) ||A||2 + O(max(ε^2, εδ^2, δ^4)).
By applying Lemma 4.4 to (4.21) to estimate ||E||2, from (4.17) we can derive

||Q̂ − Q||2 ≤ ||QE||2 + ||QL||2 ≤ θnpε(db + df)b + dL max(ε, δ^2),

where

(4.22)    θ = 1/σn if A is complex,    θ = 2/(σ_{n−1} + σn) if A is real.

(The formula in the real case is based on the fact that ẽii,kb = ẽii,kf = 0 from (4.21).) We present the above error analysis results, as well as (4.16), in the following theorem.

Theorem 4.5. Suppose that A ∈ C^{n×n} satisfies Assumption 4.2 and has the polar decomposition A = QH. Let Q̂ and Ĥ be the matrices computed by Algorithm 3.1 after p iterations with a matrix inversion method that satisfies the error model in Assumption 4.1. Then

||Q̂Ĥ − A||2 ≤ npε(db + df)b + (2dL + dK/2) max(ε, δ^2) ||A||2 + O(max(ε^2, εδ^2, δ^4)),
||Ĥ − H||2 ≤ npε(db + df)b + (dL + dK/2) max(ε, δ^2) ||A||2 + O(max(ε^2, εδ^2, δ^4)),
||Q̂ − Q||2 ≤ θnpε(db + df)b + dL max(ε, δ^2),
||Q̂ − Wp||2 ≤ dF ε + δ^2/8,

where db, df, dL, dK, dF are some modest constants, Wp is a unitary matrix, and θ is defined in (4.22).

As noted in Table 2.1, in most practical situations p ≤ 9. Therefore, if b is not too much greater than ||A||2 and the algorithm uses a stopping criterion δ not too much greater than √ε, then the algorithm is backward stable.

Corollary 4.6. If b = ||A||2 and δ = n^{1/4}√ε, then

||Q̂Ĥ − A||2 ≤ (np(db + df) + √n (2dL + dK/2)) ε ||A||2 + O(ε^2),
||Ĥ − H||2 ≤ (np(db + df) + √n (dL + dK/2)) ε ||A||2 + O(ε^2),
||Q̂ − Q||2 ≤ np(db + df) ε (θ||A||2) + √n dL ε,
||Q̂ − Wp||2 ≤ (dF + √n/8) ε.

Note that the error bounds for ||Ĥ − H||2 and ||Q̂ − Q||2 coincide with the perturbation results; see, for example, [8, 21, 20] and [11, Section 8.2]. The quantity θ||A||2 serves as the condition number for the perturbation of Q.

Remark 1. The same procedure can be used to give an error analysis of Newton's method with other scalings.
Note that the backward stability depends on whether
\[
(4.23)\qquad \frac{(\sigma_i + \sigma_j)\,\sigma_{\max}^{(k)}}{\sigma_i^{(k)} + \sigma_j^{(k)}} = O(\|A\|_2),
\]
which depends on the scaling factors. From Remark 2 in Appendix A.4, (4.23) holds for the optimal scaling (1.2). Lemma A.3 in Appendix A.5 shows that (4.23) also holds for the (1, ∞)-scaling (1.3) and the scaling (1.4). Therefore, Newton's method with these three scalings is also backward stable under the same conditions as in Theorem 4.5 with an appropriate stopping criterion, provided the number of iterations is not too large. (See Remark 3 in Appendix A.5.) We also observed that numerical stability does not necessarily depend on how fast the method converges. In fact, one can show that when ||A||_2 ≥ ||A^{−1}||_2, Newton's method without scaling (a = b = 1) computes a polar decomposition satisfying the same error bounds given in Theorem 4.5.

5. Numerical Examples. We performed numerical experiments with Newton's method (1.1) with scaling (1.6), using a = ||A^{−1}||_F^{−1} and b = ||A||_F, and also with the (1, ∞)-scaling (1.3). The main purpose is to test the numerical stability results and the convergence rate, and to compare the sub-optimal scaling with the (1, ∞)-scaling. For this reason we used the bidiagonal-reduction matrix inversion method, Algorithm BR, for computing matrix inverses. All numerical experiments were done on a Dell PC with a Pentium IV processor, in Matlab version 7.2 with machine epsilon ε ≈ 2.22 × 10^{−16}.

In the numerical experiments we use the stopping criterion δ = n^{1/4}√ε, where n is the size of the matrices. For Newton's method with the (1, ∞)-scaling, the scaling factor is switched to 1 when ||X_{k+1} − X_k||_F/||X_{k+1}||_F < 10^{−2}. Based on the results in Corollary 4.6, if Q̂ and Ĥ are the computed unitary and Hermitian polar factors produced by Algorithm 3.1, then we expect to observe ||A − Q̂Ĥ||_2/||A||_2, ||H − Ĥ||_2/||H||_2, and ||Q − Q̂||_2/(θ||A||_2) not much larger than ε.
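The two scaled iterations compared in this section can be sketched in a few lines of NumPy. The sketch below is simplified: the bidiagonal-reduction inverse of Algorithm BR is replaced by a generic matrix inverse, and the stopping test is a plain step-size test standing in for the criterion analyzed above; the scaling formulas are those of (1.4) ("F") and (1.3) ("1inf"):

```python
import numpy as np

def newton_polar(A, scaling="F", maxit=50):
    # Scaled Newton iteration (1.1): X_{k+1} = (z_k X_k + (z_k X_k)^{-*}) / 2.
    # scaling="F":    z_k = sqrt(||X_k^{-1}||_F / ||X_k||_F)                 (1.4)
    # scaling="1inf": z_k = (||X_k^{-1}||_1 ||X_k^{-1}||_inf
    #                        / (||X_k||_1 ||X_k||_inf))^{1/4}                (1.3)
    n = A.shape[0]
    tol = n ** 0.25 * np.sqrt(np.finfo(float).eps)   # delta = n^{1/4} sqrt(eps)
    X = np.asarray(A, dtype=complex)
    for p in range(1, maxit + 1):
        Xi = np.linalg.inv(X)                        # stand-in for Algorithm BR
        if scaling == "F":
            z = np.sqrt(np.linalg.norm(Xi, "fro") / np.linalg.norm(X, "fro"))
        else:
            z = (np.linalg.norm(Xi, 1) * np.linalg.norm(Xi, np.inf)
                 / (np.linalg.norm(X, 1) * np.linalg.norm(X, np.inf))) ** 0.25
        Xnew = 0.5 * (z * X + Xi.conj().T / z)       # (z X)^{-*} = X^{-*} / z
        step = np.linalg.norm(Xnew - X, "fro")
        X = Xnew
        if step <= tol:                              # simplified stopping test
            break
    Q = X
    QA = Q.conj().T @ A
    H = 0.5 * (QA + QA.conj().T)                     # symmetrize Q^* A
    return Q, H, p

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20))
for sc in ("F", "1inf"):
    Q, H, p = newton_polar(A, scaling=sc)
    assert np.linalg.norm(A - Q @ H, 2) / np.linalg.norm(A, 2) < 1e-12
    assert np.linalg.norm(Q.conj().T @ Q - np.eye(20), 2) < 1e-12
    assert p <= 15
```

For well-conditioned random test matrices like this one, both scalings drive the residual and the departure from orthogonality down to roughly machine-epsilon size within a handful of iterations, matching the behavior reported in Tables 5.1 and 5.2.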
In the tables we use the following notation:
\[
e_Q = \frac{\|Q - \hat Q\|_2}{\theta\|A\|_2}, \qquad
e_H = \frac{\|H - \hat H\|_2}{\|H\|_2}, \qquad
\mathrm{res} = \frac{\|A - \hat Q\hat H\|_2}{\|A\|_2}, \qquad
\mathrm{ror} = \|\hat Q^*\hat Q - I\|_2,
\]
where the "exact" factors Q and H for the matrices in the first example are obtained from the SVD of A using Matlab's variable precision arithmetic vpa with 24 significant decimal digits; p is the number of iterations, and n is the dimension of the matrices. The symbol "HSF" refers to Newton's method (1.1) with Higham's (1, ∞)-scaling. The symbol "SUB" refers to (1.1) with scaling (1.6) using the Frobenius norms for the initial interval, i.e., a = ||A^{−1}||_F^{−1} and b = ||A||_F.

Example 1. Three groups of twenty real matrices were constructed with dimension 20 using Matlab's gallery('randsvd',20,kappa,5) with kappa equal to 10^2, 10^8, and 10^15, respectively. The singular values of the generated matrices are random values with uniformly distributed logarithm. For each group the ranges of the condition number κ_2(A) and of θ||A||_2 are listed below.

kappa = 10^2:  κ_2(A) ∈ [37.7, 90.6],  θ||A||_2 ∈ [33.5, 83.7]
kappa = 10^8:  κ_2(A) ∈ [1.13, 6.75] × 10^7,  θ||A||_2 ∈ [0.2, 5.44] × 10^7
kappa = 10^15: κ_2(A) ∈ [6.35 × 10^10, 7.53 × 10^14],  θ||A||_2 ∈ [1.78 × 10^10, 5.22 × 10^14]

The test results are summarized in Table 5.1, where for each group the minimum and maximum values of the errors, residuals, and numbers of iterations are listed.

Table 5.1
Extreme values of errors, residuals, and iteration counts from Example 1.
kappa         p           eQ                  eH                  res                 ror
          SUB   HSF   SUB      HSF      SUB      HSF      SUB      HSF      SUB      HSF
10^2  Min  6     6   2.6e−17  2.2e−17  2.2e−16  1.7e−16  4.1e−16  3.5e−16  7.9e−16  7.0e−16
      Max  6     7   7.2e−17  7.7e−17  4.1e−16  3.9e−16  8.9e−16  8.2e−16  1.1e−15  1.2e−15
10^8  Min  8     8   7.4e−19  7.4e−19  2.5e−16  1.9e−16  4.0e−16  2.7e−16  8.0e−16  7.9e−16
      Max  8     8   1.7e−17  1.7e−17  4.1e−16  3.9e−16  7.5e−16  5.9e−16  1.1e−15  1.2e−15
10^15 Min  8     8   6.9e−19  6.9e−19  1.9e−16  2.2e−16  3.5e−16  3.9e−16  7.8e−16  8.2e−16
      Max  9     9   2.3e−17  2.3e−17  4.4e−16  4.0e−16  6.3e−16  7.2e−16  1.3e−15  1.1e−15

Example 2. In this example the test matrices are the Hilbert matrices, the n × n matrices with entries a_{ij} = 1/(i + j − 1). The example uses dimensions n = 6, 8, 10, 12, and 14. For every Hilbert matrix, the polar decomposition is A_n = I_n A_n (A_n is symmetric positive definite, so Q = I_n and H = A_n). The condition number κ_2(A_n) ranges from 1.5 × 10^7 to 5.1 × 10^17, and θ||A_n||_2 ranges from 2.6 × 10^5 to 1.0 × 10^17. The test results are reported in Table 5.2.

Table 5.2
Errors, residuals, and iteration counts for Hilbert matrices from Example 2.

 n      p           eQ                  eH                  res                 ror
    SUB   HSF   SUB      HSF      SUB      HSF      SUB      HSF      SUB      HSF
 6   8     7   1.2e−18  1.2e−18  2.3e−16  1.9e−16  2.6e−16  2.5e−16  2.6e−16  2.8e−16
 8   8     8   1.6e−19  1.6e−19  1.9e−16  1.4e−16  2.4e−16  2.5e−16  3.9e−16  5.3e−16
10   9     8   2.7e−19  2.7e−19  8.7e−17  8.7e−17  1.8e−16  1.8e−16  6.2e−16  6.8e−16
12   9     9   4.1e−19  4.1e−19  1.4e−16  1.6e−16  3.0e−16  3.3e−16  6.3e−16  8.5e−16
14   9     9   2.0e−17  2.0e−17  2.3e−16  2.7e−16  3.8e−16  6.3e−16  6.5e−16  1.0e−15

Newton's method with the sub-optimal scaling performed well in both examples, calculating the polar factors to nearly full precision. As predicted, it took at most 9 iterations. Newton's method with the (1, ∞)-scaling performed equally well.

6. Conclusion. The sub-optimal scaling scheme (1.6) is essentially costless and simplifies the algorithm of Newton's iteration (1.1) for computing polar factors.
In a typical floating point system, with this scaling scheme, for matrices with κ_2(A) < 10^16 no more than 9 iterations are needed for convergence to the unitary polar factor with a convergence tolerance roughly equal to the machine epsilon. By employing the bidiagonal factorization for matrix inversion, (1.1) with (1.6) forms a provably backward stable algorithm. Newton's method with the (1, ∞)-scaling and the scaling (1.4) is also proved to be backward stable, provided that the number of iterations is not too large.

Acknowledgment. We thank Nick Higham for detailed suggestions and for sending part of his unpublished book [11]. We thank Andrzej Kielbasiński for pointing out a mistake in an earlier version, and the referees for their comments and suggestions.

REFERENCES

[1] Z. Bai and J. Demmel. Design of a parallel nonsymmetric eigenroutine toolbox. Part I. In R. F. Sincovec et al., editors, Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing. SIAM, Philadelphia, PA, 1993. Also available as Computer Science Report CSD-92-718, University of California, Berkeley, CA, 1992.
[2] Z. Bai and J. Demmel. Design of a parallel nonsymmetric eigenroutine toolbox. Part II. Technical Report, Department of Mathematics Research Report 95-11, University of Kentucky, Lexington, KY, 1995.
[3] L. A. Balzer. Accelerated convergence of the matrix sign function method of solving Lyapunov, Riccati and other matrix equations. Internat. J. Control, 32(6):1057–1078, 1980.
[4] R. Byers. Solving the algebraic Riccati equation with the matrix sign function. Linear Algebra Appl., 85:267–279, 1987.
[5] A. A. Dubrulle. An optimum iteration for the matrix polar decomposition. Electron. Trans. Numer. Anal., 8:21–25 (electronic), 1999.
[6] W. Gander. Algorithms for the polar decomposition. SIAM J. Sci. Statist. Comput., 11(6):1102–1115, 1990.
[7] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences.
Johns Hopkins University Press, Baltimore, MD, third edition, 1996.
[8] N. J. Higham. Computing the polar decomposition—with applications. SIAM J. Sci. Statist. Comput., 7(4):1160–1174, 1986.
[9] N. J. Higham. Computing a nearest symmetric positive semidefinite matrix. Linear Algebra Appl., 103:103–118, 1988.
[10] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM Publications, Philadelphia, PA, USA, second edition, 2002.
[11] N. J. Higham. Functions of Matrices: Theory and Computation. SIAM Publications, Philadelphia, PA, USA, 2008.
[12] N. J. Higham and P. Papadimitriou. Parallel singular value decomposition via the polar decomposition. Technical Report, Numerical Analysis Report No. 239, Department of Mathematics, University of Manchester, Manchester M13 9PL, England, 1993.
[13] N. J. Higham and P. Papadimitriou. A parallel algorithm for computing the polar decomposition. Parallel Computing, 20(8):1161–1173, 1994.
[14] N. J. Higham and R. S. Schreiber. Fast polar decomposition of an arbitrary matrix. SIAM J. Sci. Statist. Comput., 11:648–655, 1990.
[15] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, 1990. Corrected reprint of the 1985 original.
[16] C. S. Kenney and A. J. Laub. On scaling Newton's method for polar decomposition and the matrix sign function. SIAM J. Matrix Anal. Appl., 13:688–706, 1992.
[17] C. S. Kenney and A. J. Laub. The matrix sign function. IEEE Trans. Automat. Control, 40(8):1330–1348, 1995.
[18] A. Kielbasiński and K. Ziȩtak. Numerical behavior of Higham's scaled method for polar decomposition. Numerical Algorithms, 32:105–140, 2003.
[19] A. Kielbasiński, P. Zieliński, and K. Ziȩtak. Higham's scaled method for polar decomposition and numerical matrix-inversion. Technical Report, Institute of Mathematics and Computer Science Report I18/2007/P-045, Wroclaw University of Technology, Wroclaw, Poland, 2007.
[20] R.-C. Li.
New perturbation bounds for the unitary polar factor. SIAM J. Matrix Anal. Appl., 16:327–332, 1995.
[21] R. Mathias. Perturbation bounds for the polar decomposition. SIAM J. Matrix Anal. Appl., 14:588–597, 1993.
[22] J. D. Roberts. Linear model reduction and solution of the algebraic Riccati equation by use of the sign function. Internat. J. Control, 32:677–687, 1980. (Reprint of Technical Report No. TR-13, CUED/B-Control, Cambridge University, Engineering Department, 1971.)
[23] P. Zieliński and K. Ziȩtak. Polar decomposition - properties, applications and algorithms. Applied Mathematics, Annals of the Polish Mathematical Society, 38:23–49, 1995.

Appendix A.

A.1. Bidiagonal reduction-based matrix inversion algorithm.

Algorithm BR
Input: Nonsingular matrix A ∈ C^{n×n}
Output: G = A^{−1}
Step 1: Compute A = UBV^* with U, V unitary and B upper bidiagonal
Step 2: Solve BY = U^* for Y by back substitution
Step 3: Compute G = VY

In Step 1 one may use Householder reflectors to perform the reduction. The reduction needs (8/3)n^3 flops and computing U needs (4/3)n^3 flops. The matrix V is stored in factorized form. The cost of solving the matrix equation in Step 2 is O(n^2) flops. With the factorized form of V it takes 2n^3 flops to compute G, so the total cost is 6n^3 flops.

In order to show that a matrix inverse computed by Algorithm BR satisfies Assumption 4.1, we need the following lemma.

Lemma A.1. Consider the system Bx = z, where z ∈ C^n and B is nonsingular and upper bidiagonal, denoted by
\[
B = \begin{bmatrix}
\alpha_1 & \beta_1 & & \\
& \alpha_2 & \ddots & \\
& & \ddots & \beta_{n-1} \\
& & & \alpha_n
\end{bmatrix}.
\]
Let x̂ be the numerical solution computed by back substitution. Then x̂ satisfies x̂ = B^{−1}(z + δz) + δx, where
\[
|\delta z| \le 3n\varepsilon|z| + O(\varepsilon^2), \qquad |\delta x| \le 3n\varepsilon|\hat x| + O(\varepsilon^2).
\]
Proof. The components of the computed vector x̂ can be written as
\[
\hat x_n = \frac{z_n}{\alpha_n(1+\epsilon_n)}, \qquad
\hat x_k = \frac{z_k(1+\delta_k) - \beta_k\hat x_{k+1}}{\alpha_k(1+\epsilon_k)}, \quad 1 \le k \le n-1,
\]
where |\epsilon_n|, |\delta_k| < \varepsilon and |\epsilon_k| < 3\varepsilon for k \le n - 1. So we have
\[
(A.1)\qquad
\begin{aligned}
\alpha_n\hat x_n(1+\epsilon_n) &= z_n\\
\alpha_{n-1}\hat x_{n-1}(1+\epsilon_{n-1}) + \beta_{n-1}\hat x_n &= z_{n-1}(1+\delta_{n-1})\\
&\;\;\vdots
\end{aligned}
\]
\[
\alpha_1\hat x_1(1+\epsilon_1) + \beta_1\hat x_2 = z_1(1+\delta_1).
\]
Define x̃ = [x̃_1, ..., x̃_n]^T and z̃ = [z̃_1, ..., z̃_n]^T with
\[
\begin{aligned}
\tilde x_n &= \hat x_n(1+\epsilon_n), & \tilde z_n &= z_n,\\
\tilde x_{n-1} &= \hat x_{n-1}(1+\epsilon_{n-1})(1+\epsilon_n), & \tilde z_{n-1} &= z_{n-1}(1+\delta_{n-1})(1+\epsilon_n),\\
&\;\;\vdots & &\;\;\vdots\\
\tilde x_1 &= \hat x_1\prod_{k=1}^{n}(1+\epsilon_k), & \tilde z_1 &= z_1(1+\delta_1)\prod_{k=2}^{n}(1+\epsilon_k).
\end{aligned}
\]
Multiplying the second equation in (A.1) by (1 + \epsilon_n), the third by (1 + \epsilon_n)(1 + \epsilon_{n-1}), and so on, we obtain that x̃ and z̃ satisfy Bx̃ = z̃. Let δz = z̃ − z and δx = x̂ − x̃. Then from x̃ = B^{−1}z̃ we have x̂ = B^{−1}(z + δz) + δx. The error bounds for |δx| and |δz| follow directly from the definitions.

Theorem A.2. Let X be the inverse of A computed by Algorithm BR. Then X satisfies Assumption 4.1.

Proof. We consider only the first order errors. Let ÛB̂V̂^* be the computed bidiagonal factorization of A. Then Û = U + ΔU_1 and V̂ = V + ΔV_1, where U, V are unitary, ||ΔU_1||_2 ≤ d_1ε, ||ΔV_1||_2 ≤ d_2ε, and UB̂V^* = A + E with ||E||_2 ≤ d_3ε||A||_2, for some modest constants d_1, d_2, d_3. Let Ŷ be the numerical solution of B̂Y = Û^* computed by back substitution. By Lemma A.1,
\[
\hat Y = \hat B^{-1}(\hat U^* + \Delta U_2) + \Delta Y,
\]
where
\[
\|\Delta U_2\|_2 \le 3n^{3/2}\varepsilon, \qquad \|\Delta Y\|_2 \le 3n^{3/2}\varepsilon\|\hat Y\|_2.
\]
Let X be the computed matrix product V̂Ŷ. We have X = V̂Ŷ + ΔX, where ||ΔX||_2 ≤ d_4ε||Ŷ||_2 for some modest constant d_4. Now
\[
X = \hat V\hat B^{-1}(\hat U^* + \Delta U_2) + \hat V\Delta Y + \Delta X
= \hat V\hat B^{-1}\hat U^* + \hat V\hat B^{-1}\Delta U_2 + \hat V\Delta Y + \Delta X
=: V\hat B^{-1}U^* + F = (A + E)^{-1} + F,
\]
where
\[
F = \Delta V_1\hat B^{-1}U^* + V\hat B^{-1}(\Delta U_1)^* + \hat V\hat B^{-1}\Delta U_2 + \hat V\Delta Y + \Delta X.
\]
It is easily verified that ||F||_F ≤ (6n^{3/2} + d_1 + d_2 + d_4)ε||A^{−1}||_2.

A.2. Proof of Lemma 4.3. Equations (4.5) and (4.7) are established in the proof of Theorem 2.5 in [8]. Here we slightly modify that proof to establish (4.6). Let A(t) = A + tF have polar decomposition Q(t)H(t).
Note that H(t) = (A(t)^*A(t))^{1/2} (the positive definite square root) and Q(t) = A(t)H(t)^{−1} are sums, differences, products, quotients, and compositions with C^∞ functions of the entries of A(t), which is trivially a C^∞ function. Hence, Q(t) and H(t) are also C^∞. (Here we use the fact that A is nonsingular to observe that H(t) = (A(t)^*A(t))^{1/2} avoids the singularity of the square root at zero.) Using Q̇ and Ḣ to denote differentiation with respect to t, Taylor's theorem implies that
\[
Q(t) = Q(0) + t\dot Q(0) + O(t^2) = Q(I + tQ^*\dot Q(0)) + O(t^2), \qquad
H(t) = H(0) + t\dot H(0) + O(t^2).
\]
The proof of Theorem 2.5 in [8] shows that Ḣ(0) = G = G^* with G given by (4.5), and Q^*(0)Q̇(0) = E = −E^* with E given by (4.7). Differentiate A + tF = Q(t)H(t) to get F = Q̇H + QḢ. Evaluating at t = 0 and letting E = Q^*(0)Q̇(0), G = Ḣ(0) gives Q^*F = EH + G. Using the facts that E is skew-Hermitian and G is Hermitian while subtracting this equation from its Hermitian transpose gives (4.6). The Lyapunov operator on the right is nonsingular, because A nonsingular implies that the eigenvalues of the Hermitian polar factor H are real and positive. Hence, the solution E is unique.

A.3. Proof of (4.15). Let Q̂_p = U_pΣ_pV_p^* be the SVD. Recall that the singular values satisfy σ_i^{(p)} ≥ 1. Then ||Q̂_p − Q̂_p^{−*}||_2 < δ implies
\[
\left|\sigma_i^{(p)} - \frac{1}{\sigma_i^{(p)}}\right| < \delta, \qquad i = 1, 2, \ldots, n.
\]
Because
\[
\sigma_i^{(p)} - \frac{1}{\sigma_i^{(p)}} = \frac{(\sigma_i^{(p)} + 1)(\sigma_i^{(p)} - 1)}{\sigma_i^{(p)}},
\]
we have
\[
\sigma_i^{(p)} - 1 < \frac{\delta\,\sigma_i^{(p)}}{\sigma_i^{(p)} + 1}.
\]
Then
\[
\frac12\left(\sigma_i^{(p)} + \frac{1}{\sigma_i^{(p)}}\right) - 1
= \frac{(\sigma_i^{(p)} - 1)^2}{2\sigma_i^{(p)}}
< \frac{\sigma_i^{(p)}}{2(\sigma_i^{(p)} + 1)^2}\,\delta^2.
\]
Since the function x/(x+1)^2 is decreasing for x ≥ 1, we have σ_i^{(p)}/(σ_i^{(p)} + 1)^2 ≤ 1/4. Hence
\[
\frac12\left(\sigma_i^{(p)} + \frac{1}{\sigma_i^{(p)}}\right) - 1 < \delta^2/8,
\]
and, using W_p = U_pV_p^*,
\[
\|\Delta\hat Q_p\|_2 = \|(\hat Q_p + \hat Q_p^{-*})/2 - W_p\|_2
= \|U_p((\Sigma_p + \Sigma_p^{-1})/2 - I)V_p^*\|_2
= \max_i\left(\frac12\left(\sigma_i^{(p)} + \frac{1}{\sigma_i^{(p)}}\right) - 1\right) < \delta^2/8.
\]

A.4. Proof of Lemma 4.4. By (2.4), σ_max^{(k)} ≤ b_k for k ≥ 0. So we only need to show
\[
\frac{b_k}{\sigma_i^{(k)} + \sigma_j^{(k)}} \le \frac{b}{\sigma_i + \sigma_j}, \qquad k \ge 0.
\]
An easy calculation shows, for all i and j,
\[
\sigma_i^{(k)} + \sigma_j^{(k)} = \frac{\zeta_{k-1}}{2}\,(\sigma_i^{(k-1)} + \sigma_j^{(k-1)})
\left(1 + \frac{1}{\zeta_{k-1}^2\,\sigma_i^{(k-1)}\sigma_j^{(k-1)}}\right).
\]
Recall that b_k satisfies
\[
b_k = \frac12\left(\zeta_{k-1}b_{k-1} + \frac{1}{\zeta_{k-1}b_{k-1}}\right)
= \frac{\zeta_{k-1}}{2}\,b_{k-1}\left(1 + \frac{1}{\zeta_{k-1}^2 b_{k-1}^2}\right).
\]
Because σ_i^{(k−1)}σ_j^{(k−1)} ≤ b_{k−1}^2,
\[
\frac{b_k}{\sigma_i^{(k)} + \sigma_j^{(k)}}
= \frac{b_{k-1}\Bigl(1 + \frac{1}{\zeta_{k-1}^2 b_{k-1}^2}\Bigr)}
{(\sigma_i^{(k-1)} + \sigma_j^{(k-1)})\Bigl(1 + \frac{1}{\zeta_{k-1}^2\sigma_i^{(k-1)}\sigma_j^{(k-1)}}\Bigr)}
\le \frac{b_{k-1}}{\sigma_i^{(k-1)} + \sigma_j^{(k-1)}}.
\]
An easy induction on k now implies
\[
\frac{b_k}{\sigma_i^{(k)} + \sigma_j^{(k)}} \le \frac{b_0}{\sigma_i^{(0)} + \sigma_j^{(0)}} = \frac{b}{\sigma_i + \sigma_j}, \qquad k \ge 0.
\]

Remark 2.
1. The condition a ≤ ||A^{−1}||_2^{−1} is essential for proving Lemma 4.4. Without it one may not have the inequality σ_j^{(k)} ≤ b_k, and the result cannot be proved.
2. In the case that the optimal scaling is employed, b_k = σ_max^{(k)} and one has the same result.

A.5. Relation (4.23) for the scalings (1.3) and (1.4).

Lemma A.3. Suppose that the matrix sequence Q_k is generated by Newton's iteration with scaling ζ_k that is either ζ_k^{(1,∞)} in (1.3) or ζ_k^{(F)} in (1.4). The singular values of Q_k satisfy
\[
(A.2)\qquad
\frac{\sigma_{\max}^{(k)}}{\sigma_i^{(k)} + \sigma_j^{(k)}}
\le \left(\prod_{\ell=0}^{k-1}\psi_\ell\right)\frac{\|A\|_2}{\sigma_i + \sigma_j}
\le n^{k/2}\,\frac{\|A\|_2}{\sigma_i + \sigma_j}, \qquad k \ge 0,\; 1 \le i, j \le n,
\]
where
\[
\psi_\ell = \max\left\{1,\ \frac{1}{\zeta_\ell^2\,\sigma_{\min}^{(\ell)}\sigma_{\max}^{(\ell)}}\right\} \le \sqrt n.
\]
Proof. Let
\[
w_k = \frac{1}{\zeta_k^2\,\sigma_{\min}^{(k)}\sigma_{\max}^{(k)}}, \qquad k \ge 0.
\]
If
\[
\zeta_k = \zeta_k^{(1,\infty)} = \left(\frac{\|Q_k^{-1}\|_1\|Q_k^{-1}\|_\infty}{\|Q_k\|_1\|Q_k\|_\infty}\right)^{1/4},
\]
then using the inequalities ||A||_2 ≤ √(||A||_1||A||_∞) and ||A||_1, ||A||_∞ ≤ √n ||A||_2 we have ([11, Page 208])
\[
(A.3)\qquad
\frac{1}{\sqrt[4]{n}}\,\zeta_k \le \frac{1}{\sqrt{\sigma_{\min}^{(k)}\sigma_{\max}^{(k)}}} \le \sqrt[4]{n}\,\zeta_k.
\]
If
\[
\zeta_k = \zeta_k^{(F)} = \sqrt{\frac{\|Q_k^{-1}\|_F}{\|Q_k\|_F}},
\]
then using the inequalities ||A||_2 ≤ ||A||_F ≤ √n ||A||_2 we also have (A.3). So in both cases
\[
\frac{1}{\sqrt n} \le w_k \le \sqrt n.
\]
Construct an interval [a_k, b_k] according to the rule
\[
a_k = \sigma_{\min}^{(k)}w_k,\quad b_k = \sigma_{\max}^{(k)} \quad\text{if } w_k \le 1;
\qquad
a_k = \sigma_{\min}^{(k)},\quad b_k = \sigma_{\max}^{(k)}w_k \quad\text{if } w_k \ge 1.
\]
Because a_k ≤ σ_min^{(k)} and b_k ≥ σ_max^{(k)}, we have σ_1^{(k)}, ..., σ_n^{(k)} ∈ [a_k, b_k]. Also, in both cases (ζ_k a_k)^{−1} = ζ_k b_k, so ρ(ζ_k a_k) = ρ(ζ_k b_k) and
\[
\sigma_j^{(k+1)} = \rho(\zeta_k\sigma_j^{(k)}) \in \rho(\zeta_k[a_k, b_k]) = [1, \rho(\zeta_k b_k)], \qquad j = 1, 2, \ldots, n.
\]
Since
\[
\rho(\zeta_k b_k) = \frac12\left(\zeta_k b_k + \frac{1}{\zeta_k b_k}\right)
= \frac{\zeta_k b_k}{2}\left(1 + \frac{1}{(\zeta_k b_k)^2}\right),
\]
we have
\[
\frac{\sigma_{\max}^{(k+1)}}{\sigma_i^{(k+1)} + \sigma_j^{(k+1)}}
\le \frac{\rho(\zeta_k b_k)}{\sigma_i^{(k+1)} + \sigma_j^{(k+1)}}
= \frac{b_k\Bigl(1 + \frac{1}{(\zeta_k b_k)^2}\Bigr)}
{(\sigma_i^{(k)} + \sigma_j^{(k)})\Bigl(1 + \frac{1}{\zeta_k^2\sigma_i^{(k)}\sigma_j^{(k)}}\Bigr)}.
\]
If w_k ≤ 1, then b_k = σ_max^{(k)}. Because σ_i^{(k)}σ_j^{(k)} ≤ (σ_max^{(k)})^2 = b_k^2, we have
\[
\frac{\sigma_{\max}^{(k+1)}}{\sigma_i^{(k+1)} + \sigma_j^{(k+1)}}
\le \frac{b_k}{\sigma_i^{(k)} + \sigma_j^{(k)}}
= \frac{\sigma_{\max}^{(k)}}{\sigma_i^{(k)} + \sigma_j^{(k)}}.
\]
If w_k ≥ 1, then b_k = σ_max^{(k)}w_k. Because σ_i^{(k)}σ_j^{(k)} ≤ (σ_max^{(k)})^2 ≤ b_k^2, we have
\[
\frac{\sigma_{\max}^{(k+1)}}{\sigma_i^{(k+1)} + \sigma_j^{(k+1)}}
\le \frac{b_k}{\sigma_i^{(k)} + \sigma_j^{(k)}}
= \frac{w_k\,\sigma_{\max}^{(k)}}{\sigma_i^{(k)} + \sigma_j^{(k)}}.
\]
Hence
\[
\frac{\sigma_{\max}^{(k+1)}}{\sigma_i^{(k+1)} + \sigma_j^{(k+1)}}
\le \psi_k\,\frac{\sigma_{\max}^{(k)}}{\sigma_i^{(k)} + \sigma_j^{(k)}},
\qquad \psi_k = \max\{1, w_k\} \le \sqrt n.
\]
The inequalities in (A.2) now follow easily.

Remark 3. Suppose that Newton's method with scaling ζ_k^{(1,∞)} or ζ_k^{(F)} terminates after p iterations. We have the same error bounds as in Theorem 4.5, but with a factor n^{p/2} in the first term of the bounds for ||Q̂Ĥ − A||_2, ||Ĥ − H||_2, and ||Q̂ − Q||_2. When n and p are both large, this factor is notably large and one may not be able to use the bounds to claim backward stability. Unfortunately, we are unable to provide an upper bound for p, although it is observed that p is usually moderate in practice (p ≤ 9 for the (1, ∞)-scaling for all examples in Section 5).

However, we argue that the factor n^{p/2} is an overestimate. The point is that when Q_k is getting close to a unitary matrix, ζ_k^{(F)} and ζ_k^{(1,∞)} are getting close to 1. Then w_k, as well as ψ_k, will be close to 1. So in practice w_k will be around 1 after a couple of iterations.
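The iteration-count behavior discussed above can be illustrated by running the scalar recursion that generates the scaling factors. The sketch below is a reconstruction under stated assumptions, not a verbatim transcription of (1.6): it maintains an interval [a_k, b_k] enclosing the singular values, takes ζ_k with (ζ_k a_k)^{−1} = ζ_k b_k (i.e., ζ_k = (a_k b_k)^{−1/2}), and updates the interval to ρ(ζ_k [a_k, b_k]) = [1, ρ(ζ_k b_k)] with ρ(x) = (x + 1/x)/2, as in the proofs of Lemma 4.4 and Lemma A.3:

```python
import math

def scaling_iteration_count(a, b, tol=1e-15, maxit=100):
    # Scalar recursion behind the sub-optimal scaling (reconstruction):
    # [a_k, b_k] encloses the singular values of Q_k, zeta_k = 1/sqrt(a_k b_k),
    # and rho(x) = (x + 1/x)/2 maps [zeta_k a_k, zeta_k b_k] onto [1, rho(zeta_k b_k)].
    # Count steps until the upper bound b_k is within tol of 1.
    k = 0
    while b - 1.0 > tol and k < maxit:
        zeta = 1.0 / math.sqrt(a * b)
        x = zeta * b
        b = 0.5 * (x + 1.0 / x)   # new upper bound rho(zeta_k b_k)
        a = 1.0                   # rho(zeta_k [a_k, b_k]) has lower endpoint 1
        k += 1
    return k

# b/a = 1e16, mirroring the kappa_2(A) <= 1e16 claim: convergence in <= 9 steps
print(scaling_iteration_count(1.0, 1e16))
```

Running this for b/a = 10^16 takes 9 steps to drive b_k to within roughly the machine epsilon of 1, which matches the "no more than 9 iterations" count stated in the abstract and observed in the experiments of Section 5.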