On the Early History of the Singular Value Decomposition
Author: G. W. Stewart
SIAM Review, Vol. 35, No. 4 (Dec. 1993)
The five mathematicians behind the theory of the SVD
• Eugenio Beltrami (1835-1899)
• Camille Jordan (1838-1921)
• James Joseph Sylvester (1814-1897)
• Erhard Schmidt (1876-1959)
• Hermann Weyl (1885-1955)
Different areas:
“Linear algebra”:
• Beltrami
• Jordan
• Sylvester
Integral equations:
• Schmidt
• Weyl
They all considered the decomposition of real, square matrices; this is assumed throughout the remainder of the lecture.
Some Prerequisites
The Frobenius norm of a matrix (A ∈ R^{n×n}):

$$\|A\|_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}^2} = \sqrt{\operatorname{trace}(AA^T)} = \sqrt{\sum_{i=1}^{n} \sigma_i^2}$$
The Frobenius norm of a vector (x ∈ R^n):

$$\|x\|_F = \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{x^T x}$$
(NB: This is equal to the Euclidean norm of a vector.)
Orthogonal matrix:
$$AA^T = A^T A = I_n$$

The rows and columns of A are two orthonormal bases for R^n.
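These prerequisites are easy to check numerically; a minimal numpy sketch (the example matrix is an arbitrary random one):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

# Frobenius norm three ways: entrywise, via the trace, via singular values.
sigma = np.linalg.svd(A, compute_uv=False)
n1 = np.sqrt((A ** 2).sum())
n2 = np.sqrt(np.trace(A @ A.T))
n3 = np.sqrt((sigma ** 2).sum())
print(np.allclose([n1, n2], n3))  # all three agree
```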
The Singular Value Decomposition
Assume: A ∈ R^{n×n}. Its singular value decomposition (SVD) is given as:

$$A = U\Sigma V^T = \sum_{i=1}^{n} \sigma_i u_i v_i^T$$
Matrix Properties:

$$U \in \mathbb{R}^{n\times n}, \quad V \in \mathbb{R}^{n\times n}, \quad U^T U = V^T V = I$$

(i.e. U and V are orthogonal matrices)

$$\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_n), \quad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0$$

(i.e. Σ is a diagonal matrix)
Matrix interpretation:
U contains the eigenvectors of (the symmetric matrix) AA^T.
V contains the eigenvectors of (the symmetric matrix) A^T A.
σi is the square root of the eigenvalue associated with the eigenvectors ui and vi.
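A small numpy sketch of these relations (random example matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
U, s, Vt = np.linalg.svd(A)

# Columns of U (resp. V) are eigenvectors of A A^T (resp. A^T A),
# both with eigenvalues sigma_i^2.
print(np.allclose(A @ A.T @ U, U * s**2))        # A A^T u_i = sigma_i^2 u_i
print(np.allclose(A.T @ A @ Vt.T, Vt.T * s**2))  # A^T A v_i = sigma_i^2 v_i
# Rank-one expansion: A = sum_i sigma_i u_i v_i^T
print(np.allclose(A, (U * s) @ Vt))
```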
Eugenio Beltrami
• Author of the first publication concerning the SVD.
• Wanted it to encourage students to become familiar with bilinear
forms.
• His derivation is somewhat restricted.
Goal: Reduce the bilinear form

$$f(x, y) = x^T A y = \sum_{j=1}^{n}\sum_{i=1}^{n} x_i a_{ij} y_j$$

to a canonical form:

$$f(x, y) = \xi^T \Sigma \eta = \sum_{i=1}^{n} \xi_i \sigma_i \eta_i$$
Beltrami’s derivation of the SVD
The bilinear form:

$$f(x, y) = x^T A y, \qquad A \in \mathbb{R}^{n\times n}$$

Substitutions:

$$x = U\xi, \qquad y = V\eta$$

Rewrite:

$$f(x, y) = \xi^T U^T A V \eta$$

Substitution:

$$S = U^T A V$$

Rewrite:

$$f(x, y) = \xi^T S \eta$$
U and V are required to be orthogonal. This gives us n² − n degrees of freedom in their choice.

Why? An orthogonal matrix A can be interpreted as a solution to the n² equations given by AA^T = I, of which only (n² + n)/2 are independent, leaving (n² − n)/2 degrees of freedom. We need a pair of orthogonal matrices, meaning twice the degrees of freedom.

Use these degrees of freedom to annihilate the n² − n off-diagonal elements of S, creating the diagonal matrix S = Σ = diag(σ1, ..., σn).
U and V, being orthogonal, yield:

$$U^T A V V^T = \Sigma V^T \;\Rightarrow\; U^T A = \Sigma V^T$$

and also:

$$U U^T A V = U\Sigma \;\Rightarrow\; A V = U\Sigma$$

We multiply both sides by A^T in both equations:

$$U^T A A^T = \Sigma V^T A^T = \Sigma (A V)^T = \Sigma (U\Sigma)^T = \Sigma^2 U^T$$

$$A^T A V = A^T (U\Sigma) = (U^T A)^T \Sigma = (\Sigma V^T)^T \Sigma = V \Sigma^2$$
This means that the diagonal elements of Σ are the roots of the equations:

$$\det\left(AA^T - \sigma^2 I\right) = 0, \qquad \det\left(A^T A - \sigma^2 I\right) = 0$$

because they are the square roots of the eigenvalues associated with the eigenvectors of AA^T and A^T A respectively.

Assuming that σi ≠ 0 and σi ≠ σj for i ≠ j, Beltrami argues that these two functions are identical, because:

• det(AA^T − σi² I) = det(A^T A − σi² I), i = 1, 2, ..., n
• setting σ = 0 results in the common value det(AA^T) = det(A^T A) = det²(A)
Beltrami states that the aforementioned roots are both real and positive. The latter is shown by:

$$0 < \|x^T A\|_F^2 = x^T A A^T x = \xi^T \Sigma^2 \xi$$

The matrix AA^T is clearly positive definite, meaning all its eigenvalues are positive. This argument (i.e. the inequality) is only valid when A is nonsingular (and we assume x ≠ 0).

Note: In this equation, Beltrami assumes that the vector ξ exists, although he has yet to prove this.
Beltrami’s Algorithm
1. Find the roots σi of the aforementioned equation.
2. Find the vectors ui, for instance by solving (AA^T − σi² I) ui = 0 for each σi.
3. Find V through: V = A^T U Σ^{-1}
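A minimal numpy sketch of these three steps, assuming (as Beltrami does) a nonsingular A with distinct singular values:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))   # assumed nonsingular, distinct sigma_i

# Steps 1-2: the sigma_i^2 and u_i are the eigenpairs of the symmetric AA^T.
evals, U = np.linalg.eigh(A @ A.T)
order = np.argsort(evals)[::-1]    # sort descending
sigma = np.sqrt(evals[order])
U = U[:, order]

# Step 3: V = A^T U Sigma^{-1} (columns of A^T U scaled by 1/sigma_i).
V = A.T @ U / sigma

print(np.allclose(U @ np.diag(sigma) @ V.T, A))  # A = U Sigma V^T
print(np.allclose(V.T @ V, np.eye(4)))           # V is orthogonal
```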
Summary, Beltrami’s contribution
• Derived the SVD for a real, square, nonsingular matrix with distinct singular values.
• His derivation cannot handle degeneracies.
• Intentional simplification to make the derivation more accessible to
students?
• Unintentional simplification from not having thought the problem
through?
Camille Jordan
• Discovered the SVD a year after Beltrami, though independently.
• The SVD was “the simplest of three problems discussed in a paper”.
• Presented as a way of reducing a bilinear form to a diagonal form by
orthogonal substitutions.
Jordan’s Contribution
Starts with the form:

$$P = x^T A y$$

and seeks the maximum and minimum of P subject to:

$$\|x\|_F^2 = \|y\|_F^2 = 1$$

The maximum is given by:

$$dP = dx^T A y + x^T A\, dy = 0$$

which must be satisfied for all:

$$dx^T x = 0, \qquad dy^T y = 0$$
A somewhat unclear argument from Jordan (possibly) states that there exist σ, τ such that the maximum can be expressed through the restrictions, resulting in the equations:

$$Ay = \sigma x, \qquad x^T A = \tau y^T$$

This implies that the maximum is:

$$x^T (A y) = \sigma\, x^T x = \sigma$$

but also:

$$(x^T A)\, y = \tau\, y^T y = \tau$$

i.e. σ = τ.
The maximum is then the value of σ where the determinant of the combined systems vanishes:

$$D = \det\begin{pmatrix} -\sigma I & A \\ A^T & -\sigma I \end{pmatrix}$$
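A quick numerical illustration (random example): D(σ) = det(K − σI) for the symmetric block matrix K = [[0, A], [A^T, 0]], so D vanishes exactly at the eigenvalues of K, which are ±σi:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 3))
n = A.shape[0]

# D(sigma) = det(K - sigma I), so D vanishes exactly at the eigenvalues of K ...
K = np.block([[np.zeros((n, n)), A], [A.T, np.zeros((n, n))]])
eigs = np.sort(np.linalg.eigvalsh(K))

# ... which are plus/minus the singular values of A.
sigma = np.linalg.svd(A, compute_uv=False)
print(np.allclose(eigs, np.sort(np.concatenate([-sigma, sigma]))))  # True
```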
The canonical form is now found by deflation. This means that the problem is reduced to finding one set of coefficients at a time. Assume we have two vectors u and v that satisfy the equations for the largest root σ1. By making the following substitutions:

$$\hat{U} := [u, U_*], \quad \hat{V} := [v, V_*], \quad \hat{U}\hat{U}^T = \hat{V}\hat{V}^T = I$$

$$x = \hat{U}\hat{x}, \qquad y = \hat{V}\hat{y}$$

we get the modified function:

$$P = \hat{x}^T \hat{A} \hat{y}, \qquad \hat{A} = \hat{U}^T A \hat{V}$$

This is clearly maximized by selecting x̂ = ŷ = e1, because this means that x = u and y = v, and we already know these are solutions. This again implies that:

$$\hat{A} = \begin{pmatrix} \sigma_1 & 0 \\ 0 & A_1 \end{pmatrix}, \qquad A_1 \in \mathbb{R}^{(n-1)\times(n-1)}$$
By setting ξ1 = x̂1 and η1 = ŷ1, we get:

$$P = \sigma_1 \xi_1 \eta_1 + P_1$$

The last term is now a new bilinear form that can be maximized for the next root, σ2, in a similar way. By performing this iteration, we end up with the diagonalized (canonical) form:

$$P = \xi^T \Sigma \eta = \sum_{i=1}^{n} \sigma_i \xi_i \eta_i$$
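A minimal sketch of the deflation idea. Here the maximizing pair is found by alternating maximization of x^T A y (a stand-in for Jordan's variational equations; the helper names are ours):

```python
import numpy as np

def max_pair(A, iters=1000, seed=0):
    """Jordan's maximum problem: maximize x^T A y over unit x and y,
    here by alternating maximization (equivalent to power iteration)."""
    x = np.random.default_rng(seed).standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        y = A.T @ x; y /= np.linalg.norm(y)   # best unit y for fixed x
        x = A @ y;   x /= np.linalg.norm(x)   # best unit x for fixed y
    return x, y, x @ A @ y

def canonical_sigmas(A):
    """Deflate one sigma_i u_i v_i^T at a time, as in Jordan's construction."""
    A = A.astype(float).copy()
    sigmas = []
    for _ in range(min(A.shape)):
        x, y, s = max_pair(A)
        sigmas.append(s)
        A -= s * np.outer(x, y)               # P = sigma_1 xi_1 eta_1 + P_1
    return np.array(sigmas)

A = np.random.default_rng(4).standard_normal((4, 4))
print(np.allclose(np.sort(canonical_sigmas(A))[::-1],
                  np.linalg.svd(A, compute_uv=False), atol=1e-6))
```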
Summary, Jordan’s Contribution
1. Elegant solution that does not suffer from the same problems as Beltrami's.
2. Degeneracies are avoided through the use of deflation, a technique that was not widely recognized at the time.
Sylvester’s Contribution
Begins with the bilinear form:

$$B = x^T A y$$

Consider the quadratic form:

$$M = \sum_i \left(\frac{dB}{dy_i}\right)^2 = x^T A A^T x$$

Assume canonical forms:

$$M = \sum_i \lambda_i \xi_i^2, \qquad B = \sum_i \sigma_i \xi_i \eta_i$$

Since $\sum_i (\sigma_i \xi_i)^2$ is orthogonally equivalent to M, we have that λi = σi².
Definitions: M := AA^T, N := A^T A

Sylvester states that the substitutions for x and y are those which diagonalize M and N, respectively. (This is in reality only true if all singular values of A are distinct.)

Finding the coefficients of the x- and y-substitutions (with $\mathcal{M}(X)_{ij}$ denoting the (i, j) minor of X):

$$X := M - \sigma^2 I, \qquad \xi = [\mathcal{M}(X)_{i1}, \ldots, \mathcal{M}(X)_{in}]^T, \qquad \|\xi\|_F^2 = 1$$

$$Y := N - \sigma^2 I, \qquad \eta = [\mathcal{M}(Y)_{i1}, \ldots, \mathcal{M}(Y)_{in}]^T, \qquad \|\eta\|_F^2 = 1$$

This is done for all σ.

NB: This only works when σ is simple.
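A small numpy illustration of this recipe for a simple σ (the `adjugate` helper, built from signed minors, is ours):

```python
import numpy as np

def adjugate(X):
    """Adjugate of X: transpose of the matrix of signed minors (cofactors)."""
    n = X.shape[0]
    C = np.empty_like(X)
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(X, i, axis=0), j, axis=1)
            C[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return C.T

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
sigma1 = np.linalg.svd(A, compute_uv=False)[0]

# X = M - sigma^2 I is singular; for a simple sigma its adjugate has rank one,
# and any nonzero row gives the coefficient vector xi.
X = A @ A.T - sigma1**2 * np.eye(4)
rows = adjugate(X)
xi = rows[np.argmax(np.linalg.norm(rows, axis=1))]
xi /= np.linalg.norm(xi)                 # the normalization ||xi||_F = 1
u1 = np.linalg.svd(A)[0][:, 0]           # reference singular vector
print(np.isclose(abs(xi @ u1), 1.0))     # same direction up to sign
```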
Infinitesimal Iteration: The Problem
Assume: a problem of order n − 1 can be (and is) solved. This gives us a problem of order n (example for n = 3), and we can assume a substitution for x:

$$A = \begin{pmatrix} a & 0 & f \\ 0 & b & g \\ f & g & c \end{pmatrix}, \qquad x = \begin{pmatrix} 1 & \xi & \eta \\ -\xi & 1 & \theta \\ -\eta & -\theta & 1 \end{pmatrix}$$

Perform the transformation:

$$B = x^T A x$$

Goal:
• Preserve zeroes: B21 = B12 = A21 = A12 = 0
• Create zeroes: f = g = 0
Infinitesimal Iteration: The Solution
Assume: η, θ so small that any term of order > 1 is approximately zero.

Step 1: Select η and θ such that:

$$\delta\!\left(\tfrac{1}{2}\left(f^2 + g^2\right)\right) = (a - c)f\eta + (b - c)g\theta < 0$$

Step 2: Select ξ such that:

$$\xi = \frac{f\theta + g\eta}{a - b}$$

(Assuming a ≠ b.)

Sylvester claims: this process repeated infinitely will force f or g to become zero, or result in a special case where the algorithm does not apply (which can be solved another way).
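The modern descendant of this idea replaces infinitesimal steps with finite plane rotations: the classical Jacobi iteration, sketched below for a symmetric matrix (a standard algorithm, not Sylvester's original process):

```python
import numpy as np

def jacobi_eigenvalues(A, rotations=300):
    """Classical Jacobi iteration: repeatedly annihilate the largest
    off-diagonal element with a finite plane rotation."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for _ in range(rotations):
        off = np.abs(A - np.diag(np.diag(A)))
        p, q = divmod(off.argmax(), n)        # largest off-diagonal entry
        if off[p, q] < 1e-14:
            break
        # angle that zeroes A[p, q] after the rotation
        phi = 0.5 * np.arctan2(2 * A[p, q], A[q, q] - A[p, p])
        c, s = np.cos(phi), np.sin(phi)
        J = np.eye(n)
        J[p, p] = J[q, q] = c
        J[p, q], J[q, p] = s, -s
        A = J.T @ A @ J                       # similarity; keeps symmetry
    return np.sort(np.diag(A))

B = np.random.default_rng(6).standard_normal((4, 4))
S = B + B.T                                   # symmetric test matrix
print(np.allclose(jacobi_eigenvalues(S),
                  np.sort(np.linalg.eigvalsh(S)), atol=1e-10))
```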
Summary, Sylvester’s contribution
• Sylvester did not know of earlier, similar results by Jordan and Jacobi.
• He ignores second-order terms, possibly intentionally?
Erhard Schmidt
• ... as in “Gram-Schmidt orthogonalization”
• SVD introduced in connection with integral equations with
unsymmetric kernels (not linear algebra).
• ... or rather the infinite dimension analogue to the SVD.
• Application: Used the SVD to obtain optimal, low-rank approximations
to an operator.
Schmidt’s Contribution
Assume a kernel A(s, t) which is continuous and symmetric on [a, b] × [a, b]. A continuous, nonvanishing function satisfying:

$$\varphi(s) = \lambda \int_a^b A(s, t)\,\varphi(t)\, dt$$

is an eigenfunction of A(s, t) corresponding to the eigenvalue λ. (Note: this eigenvalue is the inverse of its “ordinary” counterpart.)
Facts:
1. A(s, t) has at least one eigenfunction.
2. All eigenfunctions and eigenvalues are real.
3. An eigenvalue has a finite number of corresponding, linearly independent eigenfunctions.
4. Every eigenfunction of A(s, t) can be expressed as a linear combination of a finite number of members from a set of linearly independent eigenfunctions.

The eigenvalues of A(s, t) satisfy the following inequality:

$$\int_a^b\!\!\int_a^b \left(A(s, t)\right)^2 ds\, dt \;\geq\; \sum_i \frac{1}{\lambda_i^2}$$

i.e. the sequence of eigenvalues is unbounded.
Unsymmetric kernels
Assume A(s, t) to be unsymmetric. A pair of adjoint eigenfunctions is any nonzero pair u(s) and v(t) satisfying:

$$u(s) = \lambda \int_a^b A(s, t)\, v(t)\, dt$$

and:

$$v(t) = \lambda \int_a^b A(s, t)\, u(s)\, ds$$

where λ is the eigenvalue connected to the pair of eigenfunctions. We can create two symmetric kernels (the analogues of AA^T and A^T A):

$$\underline{A}(s, t) = \int_a^b A(s, r)\,A(t, r)\, dr$$

$$\overline{A}(s, t) = \int_a^b A(r, s)\,A(r, t)\, dr$$
Assume the eigenfunctions and eigenvalues for $\underline{A}(s, t)$ are those which satisfy:

$$u_i(s) = \lambda_i \int_a^b \underline{A}(s, t)\, u_i(t)\, dt$$

This gives us the eigenfunctions of $\overline{A}(s, t)$ as:

$$v_i(t) = \sqrt{\lambda_i} \int_a^b A(s, t)\, u_i(s)\, ds$$

The previously mentioned adjoint pairs of eigenfunctions are ui(s), vi(t), i = 1, 2, ...
Expanding functions in series of eigenfunctions
If:

$$g(s) = \int_a^b A(s, t)\, h(t)\, dt$$

then:

$$g(s) = \sum_i \frac{u_i(s)}{\lambda_i} \int_a^b h(t)\, v_i(t)\, dt$$
The canonical decomposition of a bilinear form
$$\int_a^b\!\!\int_a^b A(s, t)\, g(s)\, h(t)\, ds\, dt = \sum_i \frac{1}{\lambda_i} \int_a^b g(s)\, u_i(s)\, ds \int_a^b h(t)\, v_i(t)\, dt$$
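A numerical sketch of the analogy with the matrix SVD, assuming a hypothetical kernel A(s, t) = exp(−|s − t|) on [0, 1]² and a midpoint-rule discretization; the SVD of the scaled grid matrix then plays the role of Schmidt's expansion:

```python
import numpy as np

# Midpoint-rule discretization of a hypothetical kernel on [0, 1] x [0, 1].
m = 400
s = (np.arange(m) + 0.5) / m
K = np.exp(-np.abs(s[:, None] - s[None, :]))     # A(s, t) = exp(-|s - t|)

# With quadrature weight 1/m, the matrix K/m approximates the integral
# operator; its singular values approximate Schmidt's 1/lambda_i.
U, sv, Vt = np.linalg.svd(K / m)

# Check the decomposition of the bilinear form for g = h = 1:
# int int A(s,t) g(s) h(t) ds dt = sum_i (1/lambda_i) <g, u_i> <h, v_i>.
g = np.ones(m)
lhs = g @ K @ g / m**2
rhs = sum(sv[i] * (g @ U[:, i] / np.sqrt(m)) * (Vt[i] @ g / np.sqrt(m))
          for i in range(m))
print(np.isclose(lhs, rhs))  # True
```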
The Problem:
The problem takes the form of finding the best approximation of a matrix, $A \approx \sum_{i=1}^{k} x_i y_i^T$, where “best” is defined as satisfying the following minimization:

$$\min_{x_i,\, y_i,\; i \in [1, k]} \left\| A - \sum_{i=1}^{k} x_i y_i^T \right\|_F$$
If the approximation can be written as the sum of the first k components (i.e. column vectors of U and V plus singular values) of its SVD:

$$A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$$

then the norm can be written as:

$$\|A - A_k\|_F^2 = \|A\|_F^2 - \sum_{i=1}^{k} \sigma_i^2 = \sum_{i=k+1}^{n} \sigma_i^2$$

because the Frobenius norm squared is equal to the trace of the squared diagonal matrix of singular values (i.e. the diagonal matrix of eigenvalues).
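A quick numpy verification of this identity (random example):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((6, 6))
U, s, Vt = np.linalg.svd(A)

k = 2
Ak = (U[:, :k] * s[:k]) @ Vt[:k]              # sum of the first k components
err2 = np.linalg.norm(A - Ak, 'fro') ** 2
print(np.isclose(err2, (s[k:] ** 2).sum()))   # ||A - A_k||_F^2 = sum_{i>k} sigma_i^2
```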
A more complete evaluation follows.
Any other choice of vectors for the approximation will yield:

$$\left\| A - \sum_{i=1}^{k} x_i y_i^T \right\|_F^2 \;\geq\; \|A\|_F^2 - \sum_{i=1}^{k} \sigma_i^2$$

which means that our initial approximation was optimal. We can show this: assume, without loss of generality, that the set of vectors x1, ..., xk is orthonormal, either to begin with or after the Gram-Schmidt orthogonalization process...
The norm for this alternative choice of vectors is:

$$\left\| A - \sum_{i=1}^{k} x_i y_i^T \right\|_F^2 = \operatorname{trace}\!\left(\left(A - \sum_{i=1}^{k} x_i y_i^T\right)^{\!T} \left(A - \sum_{i=1}^{k} x_i y_i^T\right)\right)$$

$$= \operatorname{trace}\!\left(A^T A + \sum_{i=1}^{k} \left(y_i - A^T x_i\right)\left(y_i - A^T x_i\right)^T - \sum_{i=1}^{k} A^T x_i x_i^T A\right)$$

where we recognize the first term as ‖A‖²_F. The second term is clearly nonnegative, meaning we can ignore it for our purpose. The last term can be recognized as a sum over ‖A^T xi‖²_F. What we now need to show is that:

$$\sum_{i=1}^{k} \|A^T x_i\|_F^2 \;\leq\; \sum_{i=1}^{k} \sigma_i^2$$
Assuming U = [U1, U2] and Σ = diag(Σ1, Σ2), partitioned after the first k columns, we can expand one term of this sum:

$$\|A^T x_i\|_F^2 = \sigma_k^2 + \left\|\Sigma_1 U_1^T x_i\right\|_F^2 - \sigma_k^2 \left\|U_1^T x_i\right\|_F^2 - \left\|\left(\sigma_k^2 I - \Sigma_2^2\right)^{1/2} U_2^T x_i\right\|_F^2 - \sigma_k^2\left(1 - \left\|U^T x_i\right\|_F^2\right)$$

The last two terms are nonpositive, so summing over the orthonormal xi and using $\sum_i (u_j^T x_i)^2 \leq 1$ gives $\sum_{i=1}^{k} \|A^T x_i\|_F^2 \leq k\sigma_k^2 + \sum_{j=1}^{k}\left(\sigma_j^2 - \sigma_k^2\right) = \sum_{i=1}^{k} \sigma_i^2$, as required.
Hermann Weyl
• Developed a general perturbation theory.
• Gave an elegant proof of the approximation theorem.
Lemma:

If $B_k = XY^T$ with $X \in \mathbb{R}^{a\times k}$, $Y \in \mathbb{R}^{b\times k}$ (so that rank(B_k) ≤ k), then:

$$\sigma_1(A - B_k) \geq \sigma_{k+1}(A)$$

where σi(·) means “the ith singular value of its argument”. We know that:

$$\exists\, v = \sum_{i=1}^{k+1} \gamma_i v_i \quad \text{s.t.} \quad Y^T v = 0$$

(the condition Y^T v = 0 imposes only k linear constraints on the k + 1 coefficients γi), where $\{v_i\}_{i=1}^{k+1}$ are the first k + 1 column vectors of the matrix V from the SVD of A. We assume that:

$$\|v\|_F = 1, \quad \text{or equivalently,} \quad \sum_{i=1}^{k+1} \gamma_i^2 = 1$$
Proof:

$$\sigma_1^2(A - B_k) \geq v^T (A - B_k)^T (A - B_k)\, v = v^T A^T A\, v = \sum_{i=1}^{k+1} \gamma_i^2 \sigma_i^2 \geq \sigma_{k+1}^2$$

(The middle equality uses $B_k v = X Y^T v = 0$.)
Two Theorems
Theorem #1: If A = A′ + A″, then:

$$\sigma_{i+j-1}(A) \leq \sigma_i(A') + \sigma_j(A'')$$

Theorem #2: For any B_k of rank at most k:

$$\sigma_i(A - B_k) \geq \sigma_{k+i}(A), \qquad i = 1, 2, \ldots$$
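Both statements are easy to test numerically; a sketch (random matrices; `sv` is a small helper of ours):

```python
import numpy as np

rng = np.random.default_rng(8)
sv = lambda M: np.linalg.svd(M, compute_uv=False)  # descending sigma_i

A1, A2 = rng.standard_normal((2, 5, 5))
A = A1 + A2
s, s1, s2 = sv(A), sv(A1), sv(A2)

# Theorem #1 (0-based indices p = i-1, q = j-1): s[p+q] <= s1[p] + s2[q].
print(all(s[p + q] <= s1[p] + s2[q] + 1e-12
          for p in range(5) for q in range(5) if p + q < 5))

# Theorem #2 for one rank-k choice of B_k (truncated SVD of A).
k = 2
U, sig, Vt = np.linalg.svd(A)
Bk = (U[:, :k] * sig[:k]) @ Vt[:k]
print(all(sv(A - Bk)[i] >= sig[k + i] - 1e-12 for i in range(5 - k)))
```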
Proof, Theorem #1:

For the case i = j = 1:

$$\sigma_1 = u_1^T A v_1 = u_1^T A' v_1 + u_1^T A'' v_1 \leq \sigma_1' + \sigma_1''$$

For the general case (with $A'_{i-1}$ denoting the sum of the first i − 1 SVD components of A′, and similarly for A″):

$$\sigma_i' + \sigma_j'' = \sigma_1\!\left(A' - A'_{i-1}\right) + \sigma_1\!\left(A'' - A''_{j-1}\right) \geq \sigma_1\!\left(A - A'_{i-1} - A''_{j-1}\right)$$

and because rank(A′_{i−1} + A″_{j−1}) ≤ i + j − 2, it follows from the lemma that this is:

$$\geq \sigma_{i+j-1}$$
Proof, Theorem #2:

Not so much a theorem as a corollary of Theorem #1. We know that:

$$\operatorname{rank}(B_k) \leq k \;\Rightarrow\; \sigma_{k+1}(B_k) = 0$$

Setting A′ = A − B_k, A″ = B_k and j = k + 1 in Theorem #1 yields:

$$\sigma_i(A - B_k) = \sigma_i' \geq \sigma_{i+j-1} - \sigma_j'' = \sigma_{k+i} - \sigma_{k+1}(B_k) = \sigma_{k+i}$$

So, we have that:

$$\sigma_i(A - B_k) \geq \sigma_{k+i} \;\Rightarrow\; \|A - B_k\|_F^2 \geq \sum_{i=k+1}^{n} \sigma_i^2$$
Discussion: Weyl’s Contribution
• This is not Weyl’s original derivation of the SVD.
• Symmetric kernels can have positive and negative eigenvalues, so Weyl wrote down three inequalities (dealing with positive and negative eigenvalues and their absolute values).
Summary
• The expression “singular value” probably comes from integral equation theory.
• It was not used consistently until the middle of the 20th century.
• The SVD is closely related to the spectral decompositions of AA^T and A^T A.
• The SVD can be generalized; one such derivation involves the CS decomposition.
• It can be used to derive the polar decomposition.
• It can be used to calculate the Moore-Penrose pseudoinverse: A+ = UΣ+V^T (see the sketch below).
• It is used in deriving the solution of the Procrustes problem.
• It is used in principal component analysis (PCA).
• A stable and efficient numerical algorithm was given by Golub and Kahan.
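For instance, a minimal numpy sketch of the pseudoinverse formula (random full-rank example, so every σi is safely nonzero):

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A+ = V Sigma+ U^T; Sigma+ inverts the nonzero singular values
# (all nonzero here, since a random A has full column rank).
A_plus = (Vt.T / s) @ U.T
print(np.allclose(A_plus, np.linalg.pinv(A)))  # True
```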