Linear Algebra Review

(with a Small Dose of Optimization)
Hristo Paskov
CS246
Outline
• Basic definitions
• Subspaces and Dimensionality
• Matrix functions: inverses and eigenvalue
decompositions
• Convex optimization
Vectors and Matrices
• Vector $x \in \mathbb{R}^d$:
$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix}$$
• May also write $x = \begin{bmatrix} x_1 & x_2 & \cdots & x_d \end{bmatrix}^T$
Vectors and Matrices
• Matrix $M \in \mathbb{R}^{m \times n}$:
$$M = \begin{bmatrix} M_{11} & \cdots & M_{1n} \\ \vdots & \ddots & \vdots \\ M_{m1} & \cdots & M_{mn} \end{bmatrix}$$
• Written in terms of rows or columns:
$$M = \begin{bmatrix} \boldsymbol{r}_1^T \\ \vdots \\ \boldsymbol{r}_m^T \end{bmatrix} = \begin{bmatrix} \boldsymbol{c}_1 & \cdots & \boldsymbol{c}_n \end{bmatrix}$$
– Rows $\boldsymbol{r}_i = \begin{bmatrix} M_{i1} & \cdots & M_{in} \end{bmatrix}^T$, columns $\boldsymbol{c}_i = \begin{bmatrix} M_{1i} & \cdots & M_{mi} \end{bmatrix}^T$
Multiplication
• Vector-vector: $x, y \in \mathbb{R}^d \to \mathbb{R}$:
$$x^T y = \sum_{i=1}^d x_i y_i$$
• Matrix-vector: $x \in \mathbb{R}^n$, $M \in \mathbb{R}^{m \times n} \to \mathbb{R}^m$:
$$Mx = \begin{bmatrix} \boldsymbol{r}_1^T \\ \vdots \\ \boldsymbol{r}_m^T \end{bmatrix} x = \begin{bmatrix} \boldsymbol{r}_1^T x \\ \vdots \\ \boldsymbol{r}_m^T x \end{bmatrix}$$
Multiplication
• Matrix-matrix: 𝐴 ∈ ℝ𝑚×𝑘 , 𝐵 ∈ ℝ𝑘×𝑛 → ℝ𝑚×𝑛
– Dimensions must be conformable: e.g., a $3 \times 4$ matrix times a $4 \times 5$ matrix gives a $3 \times 5$ matrix (figure)
Multiplication
• Matrix-matrix: 𝐴 ∈ ℝ𝑚×𝑘 , 𝐵 ∈ ℝ𝑘×𝑛 → ℝ𝑚×𝑛
– $\boldsymbol{a}_i$ rows of $A$, $\boldsymbol{b}_j$ columns of $B$
$$AB = \begin{bmatrix} A\boldsymbol{b}_1 & \cdots & A\boldsymbol{b}_n \end{bmatrix} = \begin{bmatrix} \boldsymbol{a}_1^T B \\ \vdots \\ \boldsymbol{a}_m^T B \end{bmatrix} = \begin{bmatrix} \boldsymbol{a}_1^T \boldsymbol{b}_1 & \cdots & \boldsymbol{a}_1^T \boldsymbol{b}_n \\ \vdots & \ddots & \vdots \\ \boldsymbol{a}_m^T \boldsymbol{b}_1 & \cdots & \boldsymbol{a}_m^T \boldsymbol{b}_n \end{bmatrix}$$
– Entrywise: $(AB)_{ij} = \boldsymbol{a}_i^T \boldsymbol{b}_j$
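A minimal NumPy sketch of these products (NumPy itself is an assumption here; the slides are library-agnostic, and the example matrices are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])          # vector in R^3
y = np.array([4.0, 5.0, 6.0])
A = np.arange(6.0).reshape(2, 3)        # 2 x 3 matrix
B = np.arange(12.0).reshape(3, 4)       # 3 x 4 matrix

print(x @ y)              # vector-vector: sum_i x_i * y_i -> scalar 32.0
print(A @ x)              # matrix-vector: each entry is a row of A dotted with x
print(A @ B)              # matrix-matrix: (2 x 3)(3 x 4) -> 2 x 4
# Entry (i, j) of A @ B is the dot product of row i of A with column j of B
print(A[0, :] @ B[:, 1])  # equals (A @ B)[0, 1]
```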
Multiplication Properties
• Associative
$(AB)C = A(BC)$
• Distributive
$A(B + C) = AB + AC$
• NOT commutative
$AB \neq BA$ in general
– Dimensions may not even be conformable
Useful Matrices
• Identity matrix 𝐼 ∈ ℝ𝑚×𝑚
– 𝐴𝐼 = 𝐴, 𝐼𝐴 = 𝐴
$$I_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}, \qquad \text{e.g. } I = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
• Diagonal matrix $A \in \mathbb{R}^{m \times m}$:
$$A = \operatorname{diag}(a_1, \dots, a_m) = \begin{bmatrix} a_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & a_m \end{bmatrix}$$
Useful Matrices
• Symmetric 𝐴 ∈ ℝ𝑚×𝑚 : 𝐴 = 𝐴𝑇
• Orthogonal 𝑈 ∈ ℝ𝑚×𝑚 :
𝑈 𝑇 𝑈 = 𝑈𝑈 𝑇 = 𝐼
– Columns/rows are orthonormal
• Positive semidefinite 𝐴 ∈ ℝ𝑚×𝑚 :
𝑥 𝑇 𝐴𝑥 ≥ 0 for all 𝑥
– Equivalently, there exists $L \in \mathbb{R}^{m \times m}$ such that $A = LL^T$
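The three definitions above can be checked numerically; this is an illustrative sketch assuming NumPy, with the matrices chosen only for the example:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])              # symmetric
print(np.allclose(A, A.T))              # True: A = A^T

theta = 0.3                             # a 2D rotation matrix is orthogonal
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose(U.T @ U, np.eye(2)))  # True: U^T U = I

L = np.array([[1.0, 0.0],
              [2.0, 1.0]])
P = L @ L.T                             # positive semidefinite by construction
print(np.all(np.linalg.eigvalsh(P) >= -1e-12))  # all eigenvalues >= 0
```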
Outline
• Basic definitions
• Subspaces and Dimensionality
• Matrix functions: inverses and eigenvalue
decompositions
• Convex optimization
Norms
• Quantify “size” of a vector
• Given 𝑥 ∈ ℝ𝑛 , a norm satisfies
1. $\|cx\| = |c|\,\|x\|$
2. $\|x\| = 0 \Leftrightarrow x = 0$
3. $\|x + y\| \le \|x\| + \|y\|$
• Common norms:
1. Euclidean $L_2$-norm: $\|x\|_2 = \sqrt{x_1^2 + \cdots + x_n^2}$
2. $L_1$-norm: $\|x\|_1 = |x_1| + \cdots + |x_n|$
3. $L_\infty$-norm: $\|x\|_\infty = \max_i |x_i|$
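A quick sketch of the three common norms, assuming NumPy's `np.linalg.norm`; the vector is illustrative:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

print(np.linalg.norm(x, 2))       # L2 norm: sqrt(9 + 16) = 5.0
print(np.linalg.norm(x, 1))       # L1 norm: |3| + |-4| + |0| = 7.0
print(np.linalg.norm(x, np.inf))  # L-infinity norm: max |x_i| = 4.0
print(np.abs(2 * x).sum() == 2 * np.abs(x).sum())  # homogeneity (property 1) for c = 2
```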
Linear Subspaces
• Subspace 𝒱 ⊂ ℝ𝑛 satisfies
1. 0 ∈ 𝒱
2. If $x, y \in \mathcal{V}$ and $c \in \mathbb{R}$, then $cx + y \in \mathcal{V}$
• Vectors $\boldsymbol{x}_1, \dots, \boldsymbol{x}_m$ span $\mathcal{V}$ if
$$\mathcal{V} = \left\{ \textstyle\sum_{i=1}^m \alpha_i \boldsymbol{x}_i \;\middle|\; \alpha \in \mathbb{R}^m \right\}$$
Linear Independence and Dimension
• Vectors 𝒙1 , … , 𝒙𝑚 are linearly independent if
$$\sum_{i=1}^m \alpha_i \boldsymbol{x}_i = 0 \iff \alpha = 0$$
– Every linear combination of the 𝒙𝑖 is unique
• Dim 𝒱 = 𝑚 if 𝒙1 , … , 𝒙𝑚 span 𝒱 and are
linearly independent
– If 𝒚1 , … , 𝒚𝑘 span 𝒱 then
• 𝑘≥𝑚
• If 𝑘 > 𝑚 then 𝒚𝑖 are NOT linearly independent
Matrix Subspaces
• Matrix 𝑀 ∈ ℝ𝑚×𝑛 defines two subspaces
– Column space: $\operatorname{col}(M) = \{ M\alpha \mid \alpha \in \mathbb{R}^n \} \subseteq \mathbb{R}^m$
– Row space: $\operatorname{row}(M) = \{ M^T \beta \mid \beta \in \mathbb{R}^m \} \subseteq \mathbb{R}^n$
• Nullspace of $M$: $\operatorname{null}(M) = \{ x \in \mathbb{R}^n \mid Mx = 0 \}$
– $\operatorname{null}(M) \perp \operatorname{row}(M)$
– $\dim \operatorname{null}(M) + \dim \operatorname{row}(M) = n$
– Analogous statement holds for the column space
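A sketch (assuming NumPy) that recovers orthonormal bases for row(M) and null(M) from the SVD and checks the two facts above; the example matrix is illustrative:

```python
import numpy as np

M = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])          # rank 1, so dim null = 3 - 1 = 2

U, s, Vt = np.linalg.svd(M)
r = np.sum(s > 1e-10)                    # numerical rank
row_basis  = Vt[:r, :]                   # orthonormal basis for row(M)
null_basis = Vt[r:, :]                   # orthonormal basis for null(M)

print(r, null_basis.shape[0])                    # dim row + dim null = n = 3
print(np.allclose(M @ null_basis.T, 0))          # null space vectors map to 0
print(np.allclose(row_basis @ null_basis.T, 0))  # row(M) is orthogonal to null(M)
```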
Matrix Rank
• rank 𝑀 gives dimensionality of row and
column spaces
• If 𝑀 ∈ ℝ𝑚×𝑛 has rank 𝑘, can decompose into
product of 𝑚 × 𝑘 and 𝑘 × 𝑛 matrices
$$M = \underbrace{A}_{m \times k}\;\underbrace{B}_{k \times n}, \qquad \operatorname{rank}(M) = k$$
(Figure: the $m \times n$ matrix $M$ of rank $k$ drawn as the product of an $m \times k$ block and a $k \times n$ block.)
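An illustrative sketch, assuming NumPy: build a rank-$k$ matrix as a product and recover such a factorization from the SVD.

```python
import numpy as np

# Rank-2 matrix built as a product of a 4 x 2 and a 2 x 5 matrix
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2))
B = rng.standard_normal((2, 5))
M = A @ B

k = np.linalg.matrix_rank(M)
print(k)                                   # 2

# Recover a factorization of the same shape from the SVD
U, s, Vt = np.linalg.svd(M)
left  = U[:, :k] * s[:k]                   # 4 x k (columns scaled by singular values)
right = Vt[:k, :]                          # k x 5
print(np.allclose(M, left @ right))        # True
```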
Properties of Rank
• For 𝐴, 𝐵 ∈ ℝ𝑚×𝑛
1. $\operatorname{rank}(A) \le \min(m, n)$
2. $\operatorname{rank}(A) = \operatorname{rank}(A^T)$
3. $\operatorname{rank}(AB) \le \min(\operatorname{rank}(A), \operatorname{rank}(B))$
4. $\operatorname{rank}(A + B) \le \operatorname{rank}(A) + \operatorname{rank}(B)$
• $A$ has full rank if $\operatorname{rank}(A) = \min(m, n)$
• If $m > \operatorname{rank}(A)$, the rows are not linearly independent
– Same for the columns if $n > \operatorname{rank}(A)$
Outline
• Basic definitions
• Subspaces and Dimensionality
• Matrix functions: inverses and eigenvalue
decompositions
• Convex optimization
Matrix Inverse
• 𝑀 ∈ ℝ𝑚×𝑚 is invertible iff rank 𝑀 = 𝑚
• Inverse is unique and satisfies
1. $M^{-1}M = MM^{-1} = I$
2. $(M^{-1})^{-1} = M$
3. $(M^T)^{-1} = (M^{-1})^T$
4. If $A$ is invertible, then $MA$ is invertible and $(MA)^{-1} = A^{-1}M^{-1}$
Systems of Equations
• Given $M \in \mathbb{R}^{m \times n}$ and $y \in \mathbb{R}^m$, wish to solve
$$Mx = y$$
– A solution exists only if $y \in \operatorname{col}(M)$
– Possibly an infinite number of solutions
• If $M$ is invertible, then $x = M^{-1}y$
– Notational device; do not actually invert matrices
– Computationally, use solving routines like Gaussian elimination
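A small sketch of the "solve, don't invert" advice, assuming NumPy; the system is illustrative:

```python
import numpy as np

M = np.array([[3.0, 1.0],
              [1.0, 2.0]])       # invertible (rank 2)
y = np.array([9.0, 8.0])

x = np.linalg.solve(M, y)        # preferred: a solver routine, no explicit inverse
print(x)                         # [2. 3.]
print(np.allclose(M @ x, y))     # True

# Mathematically x = M^{-1} y, but forming the inverse is slower and less accurate
x_via_inv = np.linalg.inv(M) @ y
print(np.allclose(x, x_via_inv))
```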
Systems of Equations
• What if $y \notin \operatorname{col}(M)$?
• Find $\hat{x}$ such that $\hat{y} = M\hat{x}$ is closest to $y$
– $\hat{y}$ is the projection of $y$ onto $\operatorname{col}(M)$
– Also known as regression
• Assume $\operatorname{rank}(M) = n < m$, so $M^T M$ is invertible:
$$\hat{x} = (M^T M)^{-1} M^T y$$
$$\hat{y} = M (M^T M)^{-1} M^T y$$
– $M (M^T M)^{-1} M^T$ is the projection matrix onto $\operatorname{col}(M)$
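A sketch of the normal-equations formula above, assuming NumPy; `np.linalg.lstsq` is used only as a cross-check, and the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 2))          # rank 2 < 6, so y is generally not in col(M)
y = rng.standard_normal(6)

# Normal equations: solve (M^T M) x_hat = M^T y rather than forming an inverse
x_hat = np.linalg.solve(M.T @ M, M.T @ y)
y_hat = M @ x_hat                        # projection of y onto col(M)

# Library least-squares routine gives the same answer
x_lstsq, *_ = np.linalg.lstsq(M, y, rcond=None)
print(np.allclose(x_hat, x_lstsq))       # True

# The residual y - y_hat is orthogonal to col(M)
print(np.allclose(M.T @ (y - y_hat), 0))
```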
Systems of Equations
• (Figure) Worked numerical example of the least-squares projection from the previous slide: a small matrix $M$ applied to $\hat{x}$ gives $\hat{y} = M\hat{x}$
Eigenvalue Decomposition
• Eigenvalue decomposition of symmetric $M \in \mathbb{R}^{m \times m}$ is
$$M = Q \Sigma Q^T = \sum_{i=1}^m \lambda_i \boldsymbol{q}_i \boldsymbol{q}_i^T$$
– $\Sigma = \operatorname{diag}(\lambda_1, \dots, \lambda_m)$ contains the eigenvalues of $M$
– $Q$ is orthogonal and its columns are the eigenvectors $\boldsymbol{q}_i$ of $M$
• If $M$ is not symmetric but diagonalizable:
$$M = Q \Sigma Q^{-1}$$
– $\Sigma$ is diagonal but possibly complex
– $Q$ is not necessarily orthogonal
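A sketch of the symmetric eigenvalue decomposition, assuming NumPy's `np.linalg.eigh`; the matrix is illustrative:

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])                    # symmetric

lam, Q = np.linalg.eigh(M)                    # eigenvalues (ascending) and orthonormal eigenvectors
print(lam)                                    # [1. 3.]
print(np.allclose(Q @ np.diag(lam) @ Q.T, M)) # M = Q Sigma Q^T
print(np.allclose(Q.T @ Q, np.eye(2)))        # Q is orthogonal

# Rank-one expansion: M = sum_i lambda_i q_i q_i^T
M_rebuilt = sum(lam[i] * np.outer(Q[:, i], Q[:, i]) for i in range(2))
print(np.allclose(M_rebuilt, M))
```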
Characterizations of Eigenvalues
• Traditional formulation:
$$Mx = \lambda x$$
– Leads to the characteristic polynomial $\det(M - \lambda I) = 0$
• Rayleigh quotient (symmetric $M$):
$$\max_x \frac{x^T M x}{\|x\|_2^2}$$
– The maximizer is the leading eigenvector and the maximum value is the largest eigenvalue
Eigenvalue Properties
• For $M \in \mathbb{R}^{m \times m}$ with eigenvalues $\lambda_i$:
1. $\operatorname{tr}(M) = \sum_{i=1}^m \lambda_i$
2. $\det(M) = \lambda_1 \lambda_2 \cdots \lambda_m$
3. $\operatorname{rank}(M) = \#\{\lambda_i \neq 0\}$
• When 𝑀 is symmetric
– Eigenvalue decomposition is singular value
decomposition
– Eigenvectors for nonzero eigenvalues give
orthogonal basis for row 𝑀 = col 𝑀
Simple Eigenvalue Proof
• Why is $\det(M - \lambda I) = 0$?
• Assume $M$ is symmetric and full rank
1. $M = Q \Sigma Q^T$ with $QQ^T = I$
2. $M - \lambda I = Q \Sigma Q^T - \lambda I = Q(\Sigma - \lambda I)Q^T$
3. If $\lambda = \lambda_i$ (the $i$th eigenvalue of $M$), then the $i$th diagonal entry of $\Sigma - \lambda I$ is 0, so the $i$th eigenvalue of $M - \lambda I$ is 0
4. Since $\det(M - \lambda I)$ is the product of the eigenvalues of $M - \lambda I$, one of the terms is 0, so the product is 0
Outline
• Basic definitions
• Subspaces and Dimensionality
• Matrix functions: inverses and eigenvalue
decompositions
• Convex optimization
Convex Optimization
• Find minimum of a function subject to
solution constraints
• Business, economics, and game theory
– Resource allocation
– Optimal planning and strategies
• Statistics and Machine Learning
– All forms of regression and classification
– Unsupervised learning
• Control theory
– Keeping planes in the air!
Convex Sets
• A set $C$ is convex if $\forall x, y \in C$ and $\forall \alpha \in [0,1]$:
$$\alpha x + (1 - \alpha) y \in C$$
– The line segment between any two points in $C$ also lies in $C$
• Examples:
– Intersection of halfspaces
– 𝐿𝑝 balls
– Intersection of convex sets
Convex Functions
• A real-valued function $f$ is convex if $\operatorname{dom} f$ is convex and $\forall x, y \in \operatorname{dom} f$ and $\forall \alpha \in [0,1]$:
$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y)$$
– Graph of $f$ is upper bounded by the line segment between the points $(x, f(x))$ and $(y, f(y))$ on the graph
Gradients
• Differentiable convex $f$ with $\operatorname{dom} f = \mathbb{R}^d$
• Gradient $\nabla f$ at $x$ gives a linear approximation:
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_d} \end{bmatrix}^T$$
– Near $x$: $f(x + w) \approx f(x) + w^T \nabla f$ (figure: the tangent to the graph of $f$ at $x$)
Gradient Descent
• To minimize 𝑓 move down gradient
– But not too far!
– Optimum when 𝛻𝑓 = 0
• Given $f$, learning rate $\alpha$, starting point $x_0$:
  $x = x_0$
  Do until $\nabla f(x) = 0$:
    $x = x - \alpha \nabla f(x)$
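A minimal sketch of this loop for a convex quadratic whose gradient is available in closed form; the function, learning rate, and stopping tolerance are illustrative assumptions (NumPy assumed):

```python
import numpy as np

# Minimize the convex quadratic f(x) = 1/2 x^T A x - b^T x, whose gradient is A x - b
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b

x = np.zeros(2)                          # starting point x_0
alpha = 0.1                              # learning rate, small enough to converge here
while np.linalg.norm(grad(x)) > 1e-8:    # "until gradient is (numerically) zero"
    x = x - alpha * grad(x)

print(x)                                 # approx A^{-1} b = [0.2, 0.4]
print(np.allclose(x, np.linalg.solve(A, b)))
```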
Stochastic Gradient Descent
• Many learning problems have extra structure
$$f(\theta) = \sum_{i=1}^n L(\theta; \boldsymbol{x}_i)$$
• Computing gradient requires iterating over all
points, can be too costly
• Instead, compute gradient at single training
example
Stochastic Gradient Descent
• Given $f(\theta) = \sum_{i=1}^n L(\theta; \boldsymbol{x}_i)$, learning rate $\alpha$, starting point $\theta_0$:
  $\theta = \theta_0$
  Do until $f(\theta)$ nearly optimal:
    For $i = 1$ to $n$ in random order:
      $\theta = \theta - \alpha \nabla L(\theta; \boldsymbol{x}_i)$
• Finds nearly optimal 𝜃
• Example (figure): minimize $\sum_{i=1}^n (y_i - \theta^T \boldsymbol{x}_i)^2$ with SGD; $\alpha$ is the learning parameter
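A sketch of the SGD loop applied to this least-squares example; the synthetic data, fixed learning rate, and epoch count are illustrative assumptions (NumPy assumed):

```python
import numpy as np

# Synthetic least-squares data: minimize sum_i (y_i - theta^T x_i)^2
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.01 * rng.standard_normal(200)

theta = np.zeros(3)                      # starting point theta_0
alpha = 0.01                             # learning rate
for epoch in range(50):                  # "until nearly optimal", approximated by fixed passes
    for i in rng.permutation(len(y)):    # visit training points in random order
        # Gradient of the single-example loss (y_i - theta^T x_i)^2
        g = -2.0 * (y[i] - theta @ X[i]) * X[i]
        theta = theta - alpha * g

print(theta)                             # close to theta_true
```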