Linear Algebra Review (with a Small Dose of Optimization)
Hristo Paskov
CS246

Outline
• Basic definitions
• Subspaces and Dimensionality
• Matrix functions: inverses and eigenvalue decompositions
• Convex optimization

Vectors and Matrices
• Vector x ∈ ℝ^d: a column of d real numbers x_1, …, x_d
• May also write x = [x_1 x_2 … x_d]^T

Vectors and Matrices
• Matrix M ∈ ℝ^{m×n}: an m × n array of real numbers with entries M_ij
• Written in terms of rows or columns:
  M = [r_1^T; …; r_m^T] = [c_1 … c_n]
  where row r_i = [M_i1 … M_in]^T and column c_i = [M_1i … M_mi]^T

Multiplication
• Vector–vector: x, y ∈ ℝ^d → ℝ
  x^T y = Σ_{i=1}^{d} x_i y_i
• Matrix–vector: x ∈ ℝ^n, M ∈ ℝ^{m×n} → ℝ^m
  Mx = [r_1^T; …; r_m^T] x = [r_1^T x; …; r_m^T x]

Multiplication
• Matrix–matrix: A ∈ ℝ^{m×k}, B ∈ ℝ^{k×n} → ℝ^{m×n}
  (figure: a 3×4 matrix times a 4×5 matrix gives a 3×5 matrix — the inner dimensions must match)

Multiplication
• Matrix–matrix: A ∈ ℝ^{m×k}, B ∈ ℝ^{k×n} → ℝ^{m×n}
  – a_i are the rows of A, b_j are the columns of B
  AB = [Ab_1 … Ab_n]  (columns)
     = the matrix whose (i, j) entry is a_i^T b_j
     = [a_1^T B; …; a_m^T B]  (rows)

Multiplication Properties
• Associative: (AB)C = A(BC)
• Distributive: A(B + C) = AB + AC
• NOT commutative: AB ≠ BA in general
  – Dimensions may not even be conformable

Useful Matrices
• Identity matrix I ∈ ℝ^{m×m}: I_ij = 1 if i = j and 0 if i ≠ j
  – AI = A, IA = A
• Diagonal matrix A ∈ ℝ^{m×m}: A = diag(a_1, …, a_m) has a_1, …, a_m on the diagonal and zeros elsewhere

Useful Matrices
• Symmetric A ∈ ℝ^{m×m}: A = A^T
• Orthogonal U ∈ ℝ^{m×m}: U^T U = U U^T = I
  – Columns/rows are orthonormal
• Positive semidefinite A ∈ ℝ^{m×m}: x^T A x ≥ 0 for all x
  – Equivalently, there exists L ∈ ℝ^{m×m} with A = L L^T

Outline (next section: Subspaces and Dimensionality)

Norms
• Quantify the "size" of a vector
• Given x ∈ ℝ^n, a norm ‖·‖ satisfies
  1. ‖cx‖ = |c| ‖x‖
  2. ‖x‖ = 0 ⇔ x = 0
  3. ‖x + y‖ ≤ ‖x‖ + ‖y‖
• Common norms:
  1. Euclidean L_2-norm: ‖x‖_2 = √(x_1² + ⋯ + x_n²)
  2. L_1-norm: ‖x‖_1 = |x_1| + ⋯ + |x_n|
  3. L_∞-norm: ‖x‖_∞ = max_i |x_i|

Linear Subspaces
• A subspace 𝒱 ⊂ ℝ^n satisfies
  1. 0 ∈ 𝒱
  2. If x, y ∈ 𝒱 and c ∈ ℝ, then cx + y ∈ 𝒱
• Vectors x_1, …, x_m span 𝒱 if
  𝒱 = { Σ_{i=1}^{m} α_i x_i : α ∈ ℝ^m }

Linear Independence and Dimension
• Vectors x_1, …, x_m are linearly independent if
  Σ_{i=1}^{m} α_i x_i = 0 ⟺ α = 0
  – Every linear combination of the x_i is unique
• dim 𝒱 = m if x_1, …, x_m span 𝒱 and are linearly independent
  – If y_1, …, y_k span 𝒱, then
    • k ≥ m
    • If k > m, then the y_i are NOT linearly independent

Matrix Subspaces
• A matrix M ∈ ℝ^{m×n} defines two subspaces
  – Column space: col(M) = { Mα : α ∈ ℝ^n } ⊂ ℝ^m
  – Row space: row(M) = { M^T β : β ∈ ℝ^m } ⊂ ℝ^n
• Nullspace of M: null(M) = { x ∈ ℝ^n : Mx = 0 }
  – null(M) ⊥ row(M)
  – dim null(M) + dim row(M) = n
  – Analogous statement for the column space

Matrix Rank
• rank(M) gives the dimensionality of the row and column spaces (they are equal)
• If M ∈ ℝ^{m×n} has rank k, it can be decomposed into the product of an m × k and a k × n matrix

Properties of Rank
• For A, B ∈ ℝ^{m×n}
  1. rank(A) ≤ min(m, n)
  2. rank(A) = rank(A^T)
  3. rank(AB) ≤ min(rank(A), rank(B))  (when the product is defined)
  4. rank(A + B) ≤ rank(A) + rank(B)
• A has full rank if rank(A) = min(m, n)
• If m > rank(A), the rows are not linearly independent
  – Same for the columns if n > rank(A)
  – (A few of these facts are checked numerically in the sketch below.)
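For illustration, a minimal NumPy sketch of the rank facts above; the matrix sizes and the rank k are arbitrary example choices, not taken from the slides.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, k = 5, 4, 2

    # Build a rank-k matrix as the product of an m-by-k and a k-by-n factor,
    # exactly the decomposition described on the "Matrix Rank" slide.
    U = rng.standard_normal((m, k))
    V = rng.standard_normal((k, n))
    M = U @ V

    print(np.linalg.matrix_rank(M))      # 2 = k
    print(np.linalg.matrix_rank(M.T))    # 2, since rank(A) = rank(A^T)

    # rank(AB) <= min(rank(A), rank(B)) on random full-rank factors
    A = rng.standard_normal((m, n))
    B = rng.standard_normal((n, m))
    print(np.linalg.matrix_rank(A @ B) <= min(np.linalg.matrix_rank(A),
                                              np.linalg.matrix_rank(B)))   # True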
Outline (next section: Matrix functions — inverses and eigenvalue decompositions)

Matrix Inverse
• M ∈ ℝ^{m×m} is invertible iff rank(M) = m
• The inverse is unique and satisfies
  1. M^{-1} M = M M^{-1} = I
  2. (M^{-1})^{-1} = M
  3. (M^T)^{-1} = (M^{-1})^T
  4. If A is invertible, then MA is invertible and (MA)^{-1} = A^{-1} M^{-1}

Systems of Equations
• Given M ∈ ℝ^{m×n} and y ∈ ℝ^m, we wish to solve Mx = y
  – A solution exists only if y ∈ col(M)
  – There may be infinitely many solutions
• If M is invertible, then x = M^{-1} y
  – This is a notational device; do not actually invert matrices
  – Computationally, use solving routines such as Gaussian elimination

Systems of Equations
• What if y ∉ col(M)?
• Find the x̂ that makes ŷ = Mx̂ closest to y
  – ŷ is the projection of y onto col(M)
  – Also known as regression
• Assume rank(M) = n < m; then M^T M is invertible and
  x̂ = (M^T M)^{-1} M^T y
  ŷ = M (M^T M)^{-1} M^T y,  where M (M^T M)^{-1} M^T is the projection matrix onto col(M)
• (The slide also shows a small numeric example of this projection.)

Eigenvalue Decomposition
• The eigenvalue decomposition of a symmetric M ∈ ℝ^{m×m} is
  M = Q Σ Q^T = Σ_{i=1}^{m} λ_i q_i q_i^T
  – Σ = diag(λ_1, …, λ_m) contains the eigenvalues of M
  – Q is orthogonal and its columns are the eigenvectors q_i of M
• If M is not symmetric but diagonalizable, then M = Q Σ Q^{-1}
  – Σ is diagonal but possibly complex
  – Q is not necessarily orthogonal

Characterizations of Eigenvalues
• Traditional formulation: Mx = λx
  – Leads to the characteristic polynomial det(M − λI) = 0
• Rayleigh quotient (symmetric M):
  max_x (x^T M x) / ‖x‖_2²  (the maximum is the largest eigenvalue)

Eigenvalue Properties
• For M ∈ ℝ^{m×m} with eigenvalues λ_i
  1. tr(M) = Σ_{i=1}^{m} λ_i
  2. det(M) = λ_1 λ_2 ⋯ λ_m
  3. rank(M) = number of nonzero λ_i
• When M is symmetric
  – The eigenvalue decomposition is the singular value decomposition
  – The eigenvectors for nonzero eigenvalues give an orthogonal basis for row(M) = col(M)

Simple Eigenvalue Proof
• Why is det(M − λI) = 0?
• Assume M is symmetric and full rank
  1. M = Q Σ Q^T, with Q Q^T = I
  2. M − λI = Q Σ Q^T − λI = Q (Σ − λI) Q^T
  3. If λ = λ_i, then the i-th eigenvalue of M − λI is 0
  4. Since det(M − λI) is the product of the eigenvalues of M − λI, one of the factors is 0, so the product is 0

Outline (next section: Convex optimization)

Convex Optimization
• Find the minimum of a function subject to constraints on the solution
• Business / economics / game theory
  – Resource allocation
  – Optimal planning and strategies
• Statistics and machine learning
  – All forms of regression and classification
  – Unsupervised learning
• Control theory
  – Keeping planes in the air!

Convex Sets
• A set C is convex if for all x, y ∈ C and all α ∈ [0, 1]
  αx + (1 − α)y ∈ C
  – The line segment between any two points of C also lies in C
• Examples
  – Intersections of halfspaces
  – L_p balls
  – Intersections of convex sets

Convex Functions
• A real-valued function f is convex if dom f is convex and for all x, y ∈ dom f and all α ∈ [0, 1]
  f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y)
  – The graph of f is upper bounded by the line segment between the points (x, f(x)) and (y, f(y)) on the graph

Gradients
• Differentiable convex f with dom f = ℝ^d
• The gradient ∇f at x gives a linear approximation to f near x:
  ∇f = [∂f/∂x_1 … ∂f/∂x_d]^T
  f(x + w) ≈ f(x) + w^T ∇f(x)

Gradient Descent
• To minimize f, move down the gradient
  – But not too far!
  – The optimum is reached when ∇f = 0
• Given f, learning rate α, and starting point x_0:
  x = x_0
  do until ∇f(x) = 0:
    x = x − α ∇f(x)

Stochastic Gradient Descent
• Many learning problems have extra structure:
  f(θ) = Σ_{i=1}^{n} L(θ; x_i)
• Computing the gradient requires iterating over all data points, which can be too costly
• Instead, compute the gradient at a single training example

Stochastic Gradient Descent
• Given f(θ) = Σ_{i=1}^{n} L(θ; x_i), learning rate α, and starting point θ_0:
  θ = θ_0
  do until f(θ) is nearly optimal:
    for i = 1 to n in random order:
      θ = θ − α ∇L(θ; x_i)
• Finds a nearly optimal θ
• Example: least-squares regression — minimize Σ_{i=1}^{n} (y_i − θ^T x_i)² over the learning parameter θ (see the sketch below)
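To make the loop above concrete, here is a minimal NumPy sketch of stochastic gradient descent on the least-squares objective; the synthetic data, the learning rate, and the number of passes are arbitrary illustrative choices rather than values from the slides.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 3
    X = rng.standard_normal((n, d))            # rows are the data points x_i
    theta_true = np.array([1.0, -2.0, 0.5])    # parameter used to generate y
    y = X @ theta_true + 0.1 * rng.standard_normal(n)

    alpha = 0.01            # learning rate
    theta = np.zeros(d)     # starting point theta_0

    for epoch in range(20):                    # "do until f(theta) nearly optimal"
        for i in rng.permutation(n):           # i = 1..n in random order
            # gradient of L(theta; x_i) = (y_i - theta^T x_i)^2 at the current theta
            grad_i = -2.0 * (y[i] - X[i] @ theta) * X[i]
            theta = theta - alpha * grad_i

    print(theta)                                   # close to theta_true
    print(np.linalg.lstsq(X, y, rcond=None)[0])    # exact least-squares solution

In practice the learning rate is usually decreased over time; a constant rate is used here only to keep the sketch short.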