On the Early History of the Singular Value Decomposition
Author: G. W. Stewart. SIAM Review, Vol. 35, No. 4 (Dec. 1993)
Universitetet i Oslo, Institutt for informatikk

The five mathematicians behind the theory of the SVD
• Eugenio Beltrami (1835-1899)
• Camille Jordan (1838-1921)
• James Joseph Sylvester (1814-1897)
• Erhard Schmidt (1876-1959)
• Hermann Weyl (1885-1955)

Different areas
"Linear algebra":
• Beltrami
• Jordan
• Sylvester
Integral equations:
• Schmidt
• Weyl
They all considered the decomposition of real, square matrices. This is implied in the remainder of the lecture.

Some Prerequisites
The Frobenius norm of a matrix ($A \in \mathbb{R}^{n,n}$):
$$\|A\|_F = \sqrt{\sum_{i=1}^n \sum_{j=1}^n a_{ij}^2} = \sqrt{\operatorname{trace}(AA^T)} = \sqrt{\sum_{i=1}^n \sigma_i^2}$$
The Frobenius norm of a vector ($x \in \mathbb{R}^n$):
$$\|x\|_F = \sqrt{\sum_{i=1}^n x_i^2} = \sqrt{x^T x}$$
(NB: This is equal to the Euclidean norm of a vector.)
Orthogonal matrix: $AA^T = A^TA = I_n$. The rows and the columns of $A$ each form an orthonormal basis for $\mathbb{R}^n$.

The Singular Value Decomposition
Assume $A \in \mathbb{R}^{n,n}$. Its singular value decomposition (SVD) is given as
$$A = U\Sigma V^T = \sum_{i=1}^n \sigma_i u_i v_i^T$$
Matrix properties:
• $U, V \in \mathbb{R}^{n,n}$, with $U^TU = V^TV = I$ (i.e. $U$ and $V$ are orthogonal matrices)
• $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_n)$ with $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_n \ge 0$ (i.e. $\Sigma$ is a diagonal matrix)
Matrix interpretation:
• $U$ contains the eigenvectors of (the symmetric matrix) $AA^T$.
• $V$ contains the eigenvectors of (the symmetric matrix) $A^TA$.
• $\sigma_i$ is the square root of the eigenvalue associated with the eigenvectors $u_i$ and $v_i$.

Eugenio Beltrami
• Author of the first publication concerning the SVD.
• Published it to encourage students to become familiar with bilinear forms.
• His derivation is somewhat restricted.
Goal: reducing the bilinear form
$$f(x, y) = x^T A y = \sum_{i=1}^n \sum_{j=1}^n x_i a_{ij} y_j$$
to a canonical form:
$$f(x, y) = \xi^T \Sigma \eta = \sum_{i=1}^n \sigma_i \xi_i \eta_i$$

Beltrami's derivation of the SVD
The bilinear form: $f(x,y) = x^TAy$, $A \in \mathbb{R}^{n,n}$
Substitutions: $x = U\xi$, $y = V\eta$
Rewrite: $f(x,y) = \xi^T U^TAV \eta$
Substitution: $S = U^TAV$
Rewrite: $f(x,y) = \xi^T S \eta$

$U$ and $V$ are required to be orthogonal. This gives us $n^2 - n$ degrees of freedom in their choice. Why? An orthogonal matrix can be interpreted as a solution to the $n^2$ equations given by $AA^T = I$, of which only $(n^2+n)/2$ are independent (by symmetry), leaving $(n^2-n)/2$ degrees of freedom. We need a pair of orthogonal matrices, meaning twice that, i.e. $n^2 - n$ degrees of freedom. Use these to annihilate the $n^2 - n$ off-diagonal elements of $S$, creating the diagonal matrix $S = \Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_n)$.

$U$ and $V$, being orthogonal, yield
$$U^TAV\,V^T = \Sigma V^T \;\Rightarrow\; U^TA = \Sigma V^T$$
and also
$$U\,U^TAV = U\Sigma \;\Rightarrow\; AV = U\Sigma$$
We multiply both sides by $A^T$ in both equations:
$$U^TAA^T = \Sigma V^TA^T = \Sigma (AV)^T = \Sigma (U\Sigma)^T = \Sigma^2 U^T$$
$$A^T(AV) = A^T(U\Sigma) = (U^TA)^T\Sigma = (\Sigma V^T)^T\Sigma = V\Sigma^2$$
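These relations are easy to check numerically. Below is a small sketch of my own (not part of the original lecture), using numpy on a random nonsingular matrix; it verifies the orthogonality of $U$ and $V$, Beltrami's relations $AV = U\Sigma$ and $U^TA = \Sigma V^T$, and the eigenvalue interpretation of the singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))   # real, square matrix, as assumed throughout

U, s, Vt = np.linalg.svd(A)       # A = U @ diag(s) @ Vt
S = np.diag(s)

# U and V are orthogonal: U^T U = V^T V = I
assert np.allclose(U.T @ U, np.eye(n))
assert np.allclose(Vt @ Vt.T, np.eye(n))

# Beltrami's relations: A V = U Sigma  and  U^T A = Sigma V^T
assert np.allclose(A @ Vt.T, U @ S)
assert np.allclose(U.T @ A, S @ Vt)

# Consequently AA^T U = U Sigma^2 and A^T A V = V Sigma^2: the singular
# values are the square roots of the eigenvalues of AA^T (and of A^T A).
eigvals = np.linalg.eigvalsh(A @ A.T)        # ascending order
assert np.allclose(np.sqrt(eigvals[::-1]), s)
```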
These relations mean that the diagonal elements of $\Sigma$ are the roots of the equations
$$\det(AA^T - \sigma^2 I) = 0, \qquad \det(A^TA - \sigma^2 I) = 0$$
because they are the square roots of the eigenvalues associated with the eigenvectors of $AA^T$ and $A^TA$, respectively.
Assuming that $\sigma_i \ne 0$ and $\sigma_i \ne \sigma_j$ for $i \ne j$, Beltrami argues that these two functions are identical, because:
• $\det(AA^T - \sigma_i^2 I) = \det(A^TA - \sigma_i^2 I)$, $i = 1, 2, \ldots, n$
• setting $\sigma = 0$ results in the common value $\det(AA^T) = \det(A^TA) = \det^2(A)$

Beltrami states that the aforementioned roots are both real and positive. The latter is shown by
$$0 < \|x^TA\|_F^2 = x^TAA^Tx = \xi^T\Sigma^2\xi$$
The matrix $AA^T$ is then clearly positive definite, meaning all its eigenvalues are positive. This argument (i.e. the inequality) is only valid when $A$ is nonsingular (and we assume $x \ne 0$). Note: in this equation, Beltrami assumes that the vector $\xi$ exists, although he has yet to prove this.

Beltrami's Algorithm
1. Find the roots $\sigma_i$ of the aforementioned equation.
2. Find the vectors $u_i$, for instance by solving $(AA^T - \sigma_i^2 I)u_i = 0$ for each $\sigma_i$.
3. Find $V$ through $V = A^TU\Sigma^{-1}$.

Summary, Beltrami's contribution
• Derived the SVD for a real, square, nonsingular matrix with distinct singular values.
• His derivation cannot handle degeneracies.
• Intentional simplification to make the derivation more accessible to students?
• Unintentional simplification from not having thought the problem through?

Camille Jordan
• Discovered the SVD a year after Beltrami, though independently.
• The SVD was "the simplest of three problems discussed in a paper".
• Presented as a way of reducing a bilinear form to a diagonal form by orthogonal substitutions.

Jordan's Contribution
Starts with the form
$$P = x^TAy$$
and seeks the maximum and minimum of $P$ subject to
$$\|x\|_F^2 = \|y\|_F^2 = 1$$
The maximum is given by
$$dP = dx^TAy + x^TA\,dy = 0$$
which must be satisfied for all $dx$, $dy$ with
$$dx^Tx = 0, \qquad dy^Ty = 0$$

A somewhat unclear argument from Jordan (possibly) states that there exist $\sigma, \tau$ such that the maximum can be expressed through the restrictions, resulting in the equations
$$Ay = \sigma x, \qquad x^TA = \tau y^T$$
This implies that the maximum is
$$x^T(Ay) = \sigma x^Tx = \sigma$$
but also
$$(x^TA)y = \tau y^Ty = \tau$$
i.e. $\sigma = \tau$.

The maximum is then the value of $\sigma$ where the determinant of the combined system vanishes:
$$D = \det \begin{pmatrix} -\sigma I & A \\ A^T & -\sigma I \end{pmatrix} = 0$$

The canonical form is now found by deflation. This means that the problem is reduced to finding one set of coefficients at a time. Assume we have two vectors $u$ and $v$ that satisfy the equations for the largest root $\sigma_1$. By making the following substitutions:
$$\hat U := [u,\ U_*], \quad \hat V := [v,\ V_*], \quad \hat U\hat U^T = \hat V\hat V^T = I, \quad x = \hat U\hat x, \quad y = \hat V\hat y$$
we get the modified function
$$P = \hat x^T \hat A \hat y, \qquad \hat A = \hat U^TA\hat V$$
This is clearly maximized by selecting $\hat x = \hat y = e_1$, because this means that $x = u$ and $y = v$, and we already know these are solutions. This again implies that
$$\hat A = \begin{pmatrix} \sigma_1 & 0 \\ 0 & A_1 \end{pmatrix}, \qquad A_1 \in \mathbb{R}^{n-1,n-1}$$

By setting $\xi_1 = \hat x_1$ and $\eta_1 = \hat y_1$, we get
$$P = \sigma_1\xi_1\eta_1 + P_1$$
The last term is a new bilinear form that can be maximized for the next root, $\sigma_2$, in a similar way. By performing this iteration, we end up with the diagonalized (canonical) form
$$P = \xi^T\Sigma\eta = \sum_{i=1}^n \sigma_i\xi_i\eta_i$$
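Jordan's deflation translates naturally into a numerical procedure. The sketch below is my own illustration (the names dominant_triplet and svd_by_deflation are invented for this example): it finds the maximizing pair of Jordan's problem by alternating iteration and then deflates by subtracting $\sigma uv^T$, which removes the dominant term just as Jordan's change of basis does. Like Beltrami's derivation, it assumes distinct singular values:

```python
import numpy as np

def dominant_triplet(A, iters=500):
    """Approximate the maximizer of x^T A y over unit x, y by
    alternating (power) iteration on Ay = sigma*x, A^T x = sigma*y."""
    n = A.shape[0]
    y = np.ones(n) / np.sqrt(n)
    for _ in range(iters):
        x = A @ y
        x /= np.linalg.norm(x)
        y = A.T @ x
        y /= np.linalg.norm(y)
    return x @ A @ y, x, y            # sigma = max of x^T A y

def svd_by_deflation(A, k):
    """Jordan-style deflation: peel off one (sigma, u, v) at a time."""
    A = A.astype(float).copy()
    triplets = []
    for _ in range(k):
        sigma, u, v = dominant_triplet(A)
        triplets.append((sigma, u, v))
        A -= sigma * np.outer(u, v)   # what remains plays the role of A_1
    return triplets

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
sigmas = [t[0] for t in svd_by_deflation(A, 5)]
print(np.round(sigmas, 6))
print(np.round(np.linalg.svd(A, compute_uv=False), 6))  # the two should agree
```

Subtracting $\sigma_1 u_1v_1^T$ leaves a matrix with the same remaining singular triplets as Jordan's $A_1$, so the iteration recovers them in descending order.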
Summary, Jordan's Contribution
1. An elegant solution that does not suffer from the same problems as Beltrami's.
2. Degeneracies are avoided through the use of deflation, a technique that was not widely recognized at the time.

Sylvester's Contribution
Begins with the bilinear form
$$B = x^TAy$$
Consider the quadratic form
$$M = \sum_i \left(\frac{\partial B}{\partial y_i}\right)^2 = x^TAA^Tx$$
Assume the canonical forms
$$M = \sum_i \lambda_i\xi_i^2, \qquad B = \sum_i \sigma_i\xi_i\eta_i$$
We have that $\sum_i (\sigma_i\xi_i)^2$ is orthogonally equivalent to $M$, implying that $\lambda_i = \sigma_i^2$.

Definitions: $M := AA^T$, $N := A^TA$.
Sylvester states that the substitutions for $x$ and $y$ are those which diagonalize $M$ and $N$, respectively. (This is in reality only true if all singular values of $A$ are distinct.)
Finding the coefficients of the $x$- and $y$-substitutions (here $\mathcal{M}(X)_{ij}$ denotes the $(i,j)$ minor of $X$):
$$X := M - \sigma^2 I, \qquad \xi = [\mathcal{M}(X)_{i1}\ \ldots\ \mathcal{M}(X)_{in}]^T, \qquad \|\xi\|_F^2 = 1$$
$$Y := N - \sigma^2 I, \qquad \eta = [\mathcal{M}(Y)_{i1}\ \ldots\ \mathcal{M}(Y)_{in}]^T, \qquad \|\eta\|_F^2 = 1$$
This is done for all $\sigma$. NB: this only works when $\sigma$ is simple.

Infinitesimal Iteration: The Problem
Assume a problem of order $n-1$ can be (and is) solved. This gives us a problem of order $n$ (example for $n = 3$), and we can assume a substitution matrix $T$ for $x$:
$$A = \begin{pmatrix} a & 0 & f \\ 0 & b & g \\ f & g & c \end{pmatrix}, \qquad T = \begin{pmatrix} 1 & \eta & \xi \\ -\eta & 1 & \theta \\ -\xi & -\theta & 1 \end{pmatrix}$$
Perform the transformation
$$B = T^TAT$$
Goal:
• Preserve zeroes: $B_{21} = B_{12} = A_{21} = A_{12} = 0$
• Create zeroes: $f = g = 0$

Infinitesimal Iteration: The Solution
Assume $\eta, \theta$ so small that any term of order $> 1$ is approximately zero.
Step 1: Select $\eta$ and $\theta$ such that
$$\delta\left(\tfrac12(f^2 + g^2)\right) = (a-c)f\eta + (b-c)g\theta < 0$$
Step 2: Select $\xi$ such that
$$\xi = \frac{f\theta + g\eta}{a - b}$$
(assuming $a \ne b$).
Sylvester claims: this process, repeated indefinitely, will force $f$ and $g$ to zero, or will result in a special case where the algorithm does not apply (which can be solved in another way).

Summary, Sylvester's contribution
• Sylvester did not know of the earlier, similar results by Jordan and Jacobi.
• He ignores second-order terms, possibly intentionally?

Erhard Schmidt
• ... as in "Gram-Schmidt orthogonalization".
• SVD introduced in connection with integral equations with unsymmetric kernels (not linear algebra).
• ... or rather the infinite-dimensional analogue of the SVD.
• Application: used the SVD to obtain optimal, low-rank approximations to an operator.

Schmidt's Contribution
Assume a kernel $A(s,t)$ which is continuous and symmetric on $[a,b] \times [a,b]$. A continuous, nonvanishing function satisfying
$$\varphi(s) = \lambda \int_a^b A(s,t)\varphi(t)\,dt$$
is an eigenfunction of $A(s,t)$ corresponding to the eigenvalue $\lambda$. (Note: this eigenvalue is the inverse of its "ordinary" counterpart.)

Facts:
1. $A(s,t)$ has at least one eigenfunction.
2. All eigenfunctions and eigenvalues are real.
3. An eigenvalue has a finite number of corresponding, linearly independent eigenfunctions.
4. Every eigenfunction of $A(s,t)$ can be expressed as a linear combination of a finite number of members from a set of linearly independent eigenfunctions.
The eigenvalues of $A(s,t)$ satisfy the following inequality:
$$\int_a^b\!\!\int_a^b (A(s,t))^2\,ds\,dt \;\ge\; \sum_i \frac{1}{\lambda_i^2}$$
i.e. the sequence of eigenvalues is unbounded.
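Schmidt's eigenvalue problem can be approximated by discretizing the integral. The sketch below is an illustration of my own (a Nyström-type discretization, not from the lecture): it uses the kernel $A(s,t) = \min(s,t)$ on $[0,1]$, whose eigenvalues in Schmidt's inverse convention are known in closed form, $\lambda_j = ((j - \tfrac12)\pi)^2$, and spot-checks the inequality above:

```python
import numpy as np

# Discretize a continuous symmetric kernel on [a, b] = [0, 1] with the
# midpoint rule; A(s, t) = min(s, t) has known eigenpairs.
m = 400
h = 1.0 / m
s = (np.arange(m) + 0.5) * h                 # quadrature nodes
K = np.minimum.outer(s, s)                   # K[i, j] = A(s_i, s_j)

# phi(s) = lambda * integral A(s, t) phi(t) dt becomes, after discretization,
# (K * h) phi = (1 / lambda) phi, so Schmidt's "inverse" eigenvalues are
# lambda_i = 1 / mu_i for the ordinary eigenvalues mu_i of K * h.
mu = np.linalg.eigvalsh(K * h)[::-1]         # descending
lam = 1.0 / mu[mu > 1e-12]                   # Schmidt's convention

# Closed form for this kernel: lambda_j = ((j - 1/2) * pi)^2
j = np.arange(1, 6)
print(np.round(lam[:5], 2))
print(np.round(((j - 0.5) * np.pi) ** 2, 2))

# The inequality on the slide: integral of A^2 >= sum of 1 / lambda_i^2
# (for the discretized kernel it holds with near equality, by Parseval).
lhs = np.sum(K ** 2) * h * h
print(lhs, ">=", np.sum(1.0 / lam ** 2))
```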
Unsymmetric kernels
Assume $A(s,t)$ to be unsymmetric. A pair of adjoint eigenfunctions is any nonzero pair $u(s)$ and $v(t)$ satisfying
$$u(s) = \lambda \int_a^b A(s,t)v(t)\,dt$$
and
$$v(t) = \lambda \int_a^b A(s,t)u(s)\,ds$$
where $\lambda$ is the eigenvalue connected to the pair of eigenfunctions. We can create two symmetric kernels (the analogues of $AA^T$ and $A^TA$):
$$\bar A(s,t) = \int_a^b A(s,r)A(t,r)\,dr, \qquad \underline{A}(s,t) = \int_a^b A(r,s)A(r,t)\,dr$$

Assume the eigenfunctions and eigenvalues of $\bar A(s,t)$ are those which satisfy
$$u_i(s) = \lambda_i^2 \int_a^b \bar A(s,t)u_i(t)\,dt$$
This gives us the eigenfunctions of $\underline{A}(s,t)$ as
$$v_i(t) = \lambda_i \int_a^b A(s,t)u_i(s)\,ds$$
where $\lambda_i$ is the square root of the eigenvalue of $\bar A(s,t)$. The previously mentioned adjoint pairs of eigenfunctions are $u_i(s), v_i(t)$, $i = 1, 2, \ldots$

Expanding functions in series of eigenfunctions
If
$$g(s) = \int_a^b A(s,t)h(t)\,dt$$
then
$$g(s) = \sum_i \frac{u_i(s)}{\lambda_i} \int_a^b h(t)v_i(t)\,dt$$

The canonical decomposition of a bilinear form
$$\int_a^b\!\!\int_a^b A(s,t)g(s)h(t)\,ds\,dt = \sum_i \frac{1}{\lambda_i}\int_a^b g(s)u_i(s)\,ds \int_a^b h(t)v_i(t)\,dt$$

The Problem:
The problem takes the form of finding the best approximation of a matrix,
$$A \approx \sum_{i=1}^k x_iy_i^T$$
where "best" is defined as satisfying the following minimization:
$$\min_{x_i,\,y_i,\ i \in [1,k]} \left\|A - \sum_{i=1}^k x_iy_i^T\right\|_F$$

If the approximation is the sum of the first $k$ components of the SVD (i.e. column vectors of $U$ and $V$ plus singular values):
$$A_k = \sum_{i=1}^k \sigma_iu_iv_i^T$$
then the norm can be written as
$$\|A - A_k\|_F^2 = \|A\|_F^2 - \sum_{i=1}^k \sigma_i^2 = \sum_{i=k+1}^n \sigma_i^2$$
because the squared Frobenius norm equals the trace of the squared diagonal matrix of singular values (i.e. the sum of the eigenvalues of $A^TA$). A more complete evaluation follows.

Any other choice of vectors for the approximation will yield
$$\left\|A - \sum_{i=1}^k x_iy_i^T\right\|_F^2 \ge \|A\|_F^2 - \sum_{i=1}^k \sigma_i^2$$
which means that our initial approximation was optimal. We can show this. Assume that the set of vectors $x_1, \ldots, x_k$ either is orthonormal to begin with, or has been made orthonormal using the Gram-Schmidt orthogonalization process...

The norm for this alternative choice of vectors is
$$\left\|A - \sum_{i=1}^k x_iy_i^T\right\|_F^2 = \operatorname{trace}\left[\left(A - \sum_{i=1}^k x_iy_i^T\right)^T\left(A - \sum_{i=1}^k x_iy_i^T\right)\right]$$
$$= \operatorname{trace}(A^TA) + \sum_{i=1}^k (y_i - A^Tx_i)^T(y_i - A^Tx_i) - \sum_{i=1}^k x_i^TAA^Tx_i$$
where we recognize the first term as $\|A\|_F^2$. The second term is clearly nonnegative, meaning we can drop it for the purpose of a lower bound. The last term can be recognized as a sum over $\|A^Tx_i\|_F^2$. What we now need to show is that
$$\sum_{i=1}^k \|A^Tx_i\|_F^2 \le \sum_{i=1}^k \sigma_i^2$$

Assuming $U = [U_1\ U_2]$ and $\Sigma = \operatorname{diag}(\Sigma_1, \Sigma_2)$, where $U_1$ and $\Sigma_1$ hold the first $k$ columns and singular values, we can expand one term of this sum:
$$\|A^Tx_i\|_F^2 = \sigma_k^2 + \left(\|\Sigma_1U_1^Tx_i\|_F^2 - \sigma_k^2\|U_1^Tx_i\|_F^2\right) - \left(\sigma_k^2\|U_2^Tx_i\|_F^2 - \|\Sigma_2U_2^Tx_i\|_F^2\right) - \sigma_k^2\left(1 - \|U^Tx_i\|_F^2\right)$$
The last term vanishes since $x_i$ has unit norm, the third term is nonnegative, and summing the second over the orthonormal $x_i$ gives at most $\sum_{j=1}^k (\sigma_j^2 - \sigma_k^2)$, so that
$$\sum_{i=1}^k \|A^Tx_i\|_F^2 \le k\sigma_k^2 + \sum_{j=1}^k (\sigma_j^2 - \sigma_k^2) = \sum_{j=1}^k \sigma_j^2$$
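In matrix terms this is the approximation theorem usually credited to Schmidt (and later Eckart and Young). The following is a short numpy sketch of my own, verifying the error formula and spot-checking optimality against one random rank-$k$ competitor:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 6, 2
A = rng.standard_normal((n, n))
U, s, Vt = np.linalg.svd(A)

# Truncated SVD: the sum of the first k terms sigma_i u_i v_i^T
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Schmidt's theorem: ||A - A_k||_F^2 equals the sum of the remaining sigma_i^2
err2 = np.linalg.norm(A - Ak, "fro") ** 2
assert np.allclose(err2, np.sum(s[k:] ** 2))

# Any other rank-k choice sum(x_i y_i^T) does no better (one random spot check,
# not a proof; the proof is the argument sketched above).
X = rng.standard_normal((n, k))
Y = rng.standard_normal((n, k))
assert np.linalg.norm(A - X @ Y.T, "fro") ** 2 >= err2
```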
Hermann Weyl
• Developed a general perturbation theory.
• Gave an elegant proof of the approximation theorem.

Lemma: If $B_k = XY^T$ with $X, Y \in \mathbb{R}^{n,k}$ (so that $\operatorname{rank}(B_k) \le k$), then
$$\sigma_1(A - B_k) \ge \sigma_{k+1}(A)$$
where $\sigma_i(\cdot)$ means "the $i$th singular value of its argument".
We know that there exists
$$v = \sum_{i=1}^{k+1} \gamma_iv_i \quad \text{s.t.} \quad Y^Tv = 0$$
where $\{v_i\}_{i=1}^{k+1}$ are the first $k+1$ column vectors of the matrix $V$ from the SVD of $A$ (the $k$ equations $Y^Tv = 0$ in the $k+1$ unknowns $\gamma_i$ always have a nontrivial solution). We assume that $\|v\|_F = 1$, or equivalently
$$\sum_{i=1}^{k+1} \gamma_i^2 = 1$$

Proof: since $B_kv = XY^Tv = 0$,
$$\sigma_1^2(A - B_k) \ge v^T(A - B_k)^T(A - B_k)v = v^TA^TAv = \sum_{i=1}^{k+1} \gamma_i^2\sigma_i^2 \ge \sigma_{k+1}^2$$

Two Theorems
Theorem 1: $A = A' + A'' \;\Rightarrow\; \sigma_{i+j-1}(A) \le \sigma_i(A') + \sigma_j(A'')$
Theorem 2: $\sigma_i(A - B_k) \ge \sigma_{k+i}(A)$, $i = 1, 2, \ldots$

Proof, Theorem 1:
For the case $i = j = 1$:
$$\sigma_1 = u_1^TAv_1 = u_1^TA'v_1 + u_1^TA''v_1 \le \sigma_1' + \sigma_1''$$
For the general case, with $A'_{i-1}$ and $A''_{j-1}$ denoting the truncated SVDs of $A'$ and $A''$:
$$\sigma_i' + \sigma_j'' = \sigma_1(A' - A'_{i-1}) + \sigma_1(A'' - A''_{j-1}) \ge \sigma_1(A - A'_{i-1} - A''_{j-1})$$
and because $\operatorname{rank}(A'_{i-1} + A''_{j-1}) \le i + j - 2$, it follows from the lemma that
$$\sigma_1(A - A'_{i-1} - A''_{j-1}) \ge \sigma_{i+j-1}(A)$$

Proof, Theorem 2:
Not so much a theorem as a corollary of Theorem 1. We know that
$$\operatorname{rank}(B_k) \le k \;\Rightarrow\; \sigma_{k+1}(B_k) = 0$$
Writing $A = (A - B_k) + B_k$ and setting $j = k + 1$ in Theorem 1 yields
$$\sigma_{k+i}(A) \le \sigma_i(A - B_k) + \sigma_{k+1}(B_k) = \sigma_i(A - B_k)$$
So we have that
$$\sigma_i(A - B_k) \ge \sigma_{k+i}(A) \;\Rightarrow\; \|A - B_k\|_F^2 \ge \sum_{i=k+1}^n \sigma_i^2$$

Discussion: Weyl's Contribution
• This is not Weyl's original derivation of the SVD.
• Symmetric kernels can have positive and negative eigenvalues, so Weyl wrote down three inequalities (dealing with positive and negative eigenvalues and their absolute values).

Summary
• The expression "singular value" probably comes from integral equation theory.
• Not used consistently until the middle of the 20th century.
• The SVD is closely related to the spectral decompositions of $AA^T$ and $A^TA$.
• The SVD can be generalized; this derivation involves the CS decomposition.
• Can be used to derive the polar decomposition.
• Can be used to calculate the Moore-Penrose pseudoinverse: $A^+ = V\Sigma^+U^T$.
• Used in deriving the solution of the Procrustes problem.
• Used in PCA.
• Stable and efficient numerical algorithm by Golub and Kahan.
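As a closing illustration (my own sketch, not from the slides), the pseudoinverse and polar-decomposition bullets follow directly from a computed SVD; the tolerance used to decide which singular values count as nonzero is an arbitrary choice for this example:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.standard_normal((n, n))   # random A is nonsingular with probability 1
U, s, Vt = np.linalg.svd(A)

# Moore-Penrose pseudoinverse: A+ = V Sigma+ U^T, where Sigma+ inverts
# the nonzero singular values and zeroes the rest.
s_plus = np.where(s > 1e-12 * s[0], 1.0 / s, 0.0)
A_plus = Vt.T @ np.diag(s_plus) @ U.T
assert np.allclose(A_plus, np.linalg.pinv(A))

# Polar decomposition: A = Q H with Q orthogonal and H symmetric positive
# semidefinite, obtained as Q = U V^T and H = V Sigma V^T.
Q = U @ Vt
H = Vt.T @ np.diag(s) @ Vt
assert np.allclose(Q @ H, A)
assert np.allclose(Q.T @ Q, np.eye(n))
```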