Deep Learning - Theory and Practice
IE 643, Lectures 10 & 11, September 13 & 15, 2023
P. Balamurugan

Outline
1 Popular examples of MLP: Encoder-Decoder Architecture

Multi Layer Perceptron: Encoder-Decoder Architecture

We consider a simple MLP architecture with:
- an input layer with n neurons,
- a hidden layer with p neurons, where p ≤ n,
- an output layer with n neurons.
All neurons in the hidden and output layers have linear activations. This architecture is popularly called the Encoder-Decoder Architecture.

Let B denote the p × n matrix connecting the input layer and the hidden layer, and let A denote the n × p matrix connecting the hidden layer and the output layer.

Input: training data D = {(x^i, y^i)}_{i=1}^S, with x^i ∈ R^n, y^i ∈ R^n, ∀i ∈ {1, . . . , S}.
MLP parameters: weight matrices B and A.
Error: E(B, A) = ∑_{i=1}^S ∥ABx^i − y^i∥₂².

Special case: when y^i = x^i, ∀i, the architecture is called an Autoencoder.
Error for the autoencoder: E(B, A) = ∑_{i=1}^S ∥ABx^i − x^i∥₂².

Example: consider the case n = p = 1.
Error: E(b, a) = ∑_{i=1}^S (abx^i − y^i)².

Sample training data:

  x^i        y^i
  .708333    .708333
  .583333    .583333
  .166667    .166667
  .458333    .458333
  .875       .875

Loss surface for the sample training data:
[Figure: surface plot of E(b, a) over the (b, a)-plane.]
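The loss surface can be reproduced numerically. Below is a minimal NumPy sketch (the grid range and resolution are assumptions for illustration, not from the slides) that evaluates E(b, a) on the sample data; for this autoencoder data the minima lie along the hyperbola ab = 1, which is what produces the two valleys.

```python
import numpy as np

# Sample training data from the slides (autoencoder case: y^i = x^i).
x = np.array([.708333, .583333, .166667, .458333, .875])
y = x.copy()

def loss(b, a):
    """E(b, a) = sum_i (a*b*x^i - y^i)^2 for the n = p = 1 encoder-decoder."""
    return np.sum((a * b * x - y) ** 2)

# Evaluate E on a grid: the minima lie along the hyperbola a*b = 1,
# giving two symmetric valleys (a, b both positive or both negative).
bs = np.linspace(-2, 2, 201)
a_vals = np.linspace(-2, 2, 201)
E = np.array([[loss(b, a) for b in bs] for a in a_vals])
i, j = np.unravel_index(E.argmin(), E.shape)
print(E.min(), a_vals[i] * bs[j])   # ~0.0, and a*b ~ 1 at the grid minimum
```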
Observation: though there are two valleys in the loss surface, they look symmetric.

Question: does this extend to cases where y^i ≠ x^i, and to n ≥ 2 with n ≥ p?

However, when we fix a, we have:
[Figure: E plotted as a function of b alone, for a fixed value of a.]

Similarly, when we fix b, we have:
[Figure: E plotted as a function of a alone, for a fixed value of b.]

Error: E(b, a) = ∑_{i=1}^S (abx^i − y^i)².
When we fix a, or when we fix b, we observe that the graph does not contain multiple valleys.
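This is easy to check in closed form: with a fixed, E is a one-variable quadratic in b with positive leading coefficient, hence a single upward parabola. A small sketch (the fixed value a = 2 is an arbitrary choice for illustration):

```python
import numpy as np

x = np.array([.708333, .583333, .166667, .458333, .875])
y = x.copy()           # autoencoder case
a_fixed = 2.0          # any fixed nonzero value of a

# With a fixed, E(b) = sum_i (a*b*x^i - y^i)^2 is quadratic in b with
# leading coefficient a^2 * sum_i (x^i)^2 > 0, so its unique critical
# point is the global minimum.
b_star = np.dot(x, y) / (a_fixed * np.dot(x, x))
print(b_star)          # 0.5 here, since y = x forces a * b_star = 1

E = lambda b: np.sum((a_fixed * b * x - y) ** 2)
print(E(b_star) <= min(E(b_star - 0.1), E(b_star + 0.1)))   # True
```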
Thus, when we fix a or when we fix b, the graph looks as if every local optimum is a global optimum.

Question: does this behavior extend to higher dimensions? We shall investigate this in higher dimensions!

Some notations:
Let X = [x^1 x^2 · · · x^S] and Y = [y^1 y^2 · · · y^S], with the training vectors as columns.
Note: X and Y are n × S matrices.

For an n × q matrix H = [h_{ij}], let
  vec(H) = (h_{11}, . . . , h_{n1}, h_{12}, . . . , h_{n2}, . . . , h_{1q}, . . . , h_{nq})^⊤
denote an nq × 1 vector.
Note: vec(H) contains the columns of the matrix H stacked upon one another.

Recall:
Input: training data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ R^n, y^i ∈ R^n, ∀i ∈ {1, . . . , S}.
MLP parameters: weight matrices B and A.
Error: E = ∑_{i=1}^S ∥ABx^i − y^i∥₂². (We write simply E since the context is clear!)

Using the new notations, we can write E = ∥vec(ABX − Y)∥₂². (Homework!)
Note: the norm here is a simple vector norm.

Now note: vec(ABX − Y) = vec(ABX) − vec(Y). (Homework: verify this claim!)
Thus we have E = ∥vec(ABX) − vec(Y)∥₂².

Some more new notations:
For an m × n matrix G = [g_{ij}] and a p × q matrix H = [h_{ij}], define the Kronecker product of G and H as the block matrix

  G ⊗ H = [ g_{11}H  · · ·  g_{1n}H
             · · ·            · · ·
            g_{m1}H  · · ·  g_{mn}H ].

Note: G ⊗ H is of size mp × nq.

Claim: vec(ABX) = (X^⊤ ⊗ A) vec(B).
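Both the claim and the vec(·) convention are easy to verify numerically. A minimal sketch (the dimensions and random data are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, S = 4, 2, 5      # hidden width p <= n

A = rng.standard_normal((n, p))
B = rng.standard_normal((p, n))
X = rng.standard_normal((n, S))

# vec(.) stacks the columns of a matrix; in NumPy this is a flatten
# in Fortran (column-major) order.
vec = lambda H: H.flatten(order="F")

lhs = vec(A @ B @ X)
rhs = np.kron(X.T, A) @ vec(B)    # (X^T ⊗ A) vec(B)
print(np.allclose(lhs, rhs))      # True
```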
Using vec(ABX) = (X^⊤ ⊗ A)vec(B), we can write:

  E = ∥vec(ABX − Y)∥₂² = ∥vec(ABX) − vec(Y)∥₂² = ∥(X^⊤ ⊗ A)vec(B) − vec(Y)∥₂².

This is of the form F(z) = ∥Mz − c∥₂², where M = (X^⊤ ⊗ A), z = vec(B) and c = vec(Y).

Observe: M is of size nS × np, z is an np × 1 vector and c is an nS × 1 vector.

Consider the function F(z) = ∥Mz − c∥₂². We have the following result:

Convexity of the function F: the function F(z) is convex.

Question: what is a convex function? We discuss convex sets and convex functions in Part 2 of this lecture!

Hence min_z F(z) is a convex optimization problem.
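One concrete way to see the convexity: F has the constant Hessian 2M^⊤M, which is positive semi-definite. A small numerical check (random data and dimensions are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, S = 4, 2, 6
A = rng.standard_normal((n, p))
X = rng.standard_normal((n, S))
Y = rng.standard_normal((n, S))

M = np.kron(X.T, A)            # nS x np
c = Y.flatten(order="F")       # vec(Y)

# F(z) = ||Mz - c||_2^2 has the constant Hessian 2 M^T M, which is
# symmetric positive semi-definite, so F is convex.
H = 2 * M.T @ M
print(np.linalg.eigvalsh(H).min() >= -1e-8)   # True: no negative eigenvalues
```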
Note:

  argmin_z F(z) = argmin_z ∥Mz − c∥₂²
                = argmin_z (Mz − c)^⊤(Mz − c)
                = argmin_z z^⊤M^⊤Mz − 2z^⊤M^⊤c + c^⊤c
                = argmin_z z^⊤M^⊤Mz − 2z^⊤M^⊤c,

where the last step drops the constant c^⊤c, which does not affect the minimizer.

By the first-order optimality characterization, z* is a solution of min_z F(z) if and only if ∇F(z*) = 0:

  ∇F(z*) = 0  ⟹  2M^⊤Mz* − 2M^⊤c = 0  ⟹  M^⊤Mz* = M^⊤c.

Further, if M^⊤M is positive definite, z* is the unique optimal solution and is given by z* = (M^⊤M)^{−1}M^⊤c.

Recall: min_z F(z) = ∥Mz − c∥₂² is the same as min_{vec(B)} ∥(X^⊤ ⊗ A)vec(B) − vec(Y)∥₂².
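The normal equations M^⊤Mz* = M^⊤c can be checked directly on a least-squares solution; a sketch with random data (dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, S = 4, 2, 6
A = rng.standard_normal((n, p))
X = rng.standard_normal((n, S))
Y = rng.standard_normal((n, S))

M = np.kron(X.T, A)
c = Y.flatten(order="F")

# lstsq returns a solution of the normal equations M^T M z = M^T c
# (a minimum-norm one if M^T M happens to be singular).
z_star, *_ = np.linalg.lstsq(M, c, rcond=None)

grad = 2 * M.T @ (M @ z_star - c)   # gradient of F at z*
print(np.allclose(grad, 0))         # True: first-order optimality holds
```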
Hence, from M^⊤Mz* = M^⊤c, we have:

  (X^⊤ ⊗ A)^⊤(X^⊤ ⊗ A)z* = (X^⊤ ⊗ A)^⊤c.

Some properties of the Kronecker product:
(P1) vec(ABC) = (C^⊤ ⊗ A)vec(B).
(P2) (G ⊗ H)^⊤ = (G^⊤ ⊗ H^⊤).
(P3) (G ⊗ H)^{−1} = (G^{−1} ⊗ H^{−1}).
(P4) (G ⊗ H)(U ⊗ V) = (GU ⊗ HV).
(P5) If G and H are symmetric and positive (semi-)definite, then G ⊗ H is also symmetric and positive (semi-)definite.

Using these properties:

  (X^⊤ ⊗ A)^⊤(X^⊤ ⊗ A)z* = (X^⊤ ⊗ A)^⊤c
  ⟹ (X ⊗ A^⊤)(X^⊤ ⊗ A)z* = (X ⊗ A^⊤)c    (from (P2))
  ⟹ (XX^⊤ ⊗ A^⊤A)z* = (X ⊗ A^⊤)c          (from (P4))

Recall X = [x^1 x^2 · · · x^S] and Y = [y^1 y^2 · · · y^S]. Also note:

  XX^⊤ = ∑_{i=1}^S x^i(x^i)^⊤,   YY^⊤ = ∑_{i=1}^S y^i(y^i)^⊤,
  XY^⊤ = ∑_{i=1}^S x^i(y^i)^⊤,   YX^⊤ = ∑_{i=1}^S y^i(x^i)^⊤.

(Homework: try to see if these equalities are true!)

Denote XX^⊤ by Σ_XX, YY^⊤ by Σ_YY, XY^⊤ by Σ_XY and YX^⊤ by Σ_YX.
Also note that Σ_XX and Σ_YY are symmetric, and (Σ_XY)^⊤ = (XY^⊤)^⊤ = YX^⊤ = Σ_YX.

With this notation, the optimality condition becomes:

  (Σ_XX ⊗ A^⊤A)z* = (X ⊗ A^⊤)c.
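Properties (P2) and (P4), as well as the outer-product form of Σ_XX, can each be verified numerically; a short sketch with random matrices of compatible (arbitrary) sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
G = rng.standard_normal((3, 4))
H = rng.standard_normal((2, 5))
U = rng.standard_normal((4, 3))
V = rng.standard_normal((5, 2))

# (P2): the transpose distributes over the Kronecker product.
print(np.allclose(np.kron(G, H).T, np.kron(G.T, H.T)))            # True

# (P4): the mixed-product property.
print(np.allclose(np.kron(G, H) @ np.kron(U, V),
                  np.kron(G @ U, H @ V)))                         # True

# Sigma_XX = X X^T equals the sum of the outer products x^i (x^i)^T.
X = rng.standard_normal((4, 6))
print(np.allclose(X @ X.T, sum(np.outer(xi, xi) for xi in X.T)))  # True
```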
Substituting z* = vec(B*) and c = vec(Y):

  (Σ_XX ⊗ A^⊤A)vec(B*) = (X ⊗ A^⊤)vec(Y)
  ⟹ A^⊤AB*Σ_XX = A^⊤YX^⊤    (from (P1) applied to both sides, using Σ_XX^⊤ = Σ_XX)
  ⟹ A^⊤AB*Σ_XX = A^⊤Σ_YX.

We have the following result:

  The minimizer z* of min_z ∥(X^⊤ ⊗ A)z − vec(Y)∥₂² satisfies
  A^⊤AB*Σ_XX = A^⊤Σ_YX, where vec(B*) = z*.
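This relation can be sanity-checked against a least-squares solution of the vec'd problem; a sketch (random data and sizes are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, S = 4, 2, 8
A = rng.standard_normal((n, p))
X = rng.standard_normal((n, S))
Y = rng.standard_normal((n, S))

M = np.kron(X.T, A)
c = Y.flatten(order="F")
z_star, *_ = np.linalg.lstsq(M, c, rcond=None)
B_star = z_star.reshape(p, n, order="F")     # undo vec(.)

Sigma_XX = X @ X.T
Sigma_YX = Y @ X.T

# The minimizer must satisfy A^T A B* Sigma_XX = A^T Sigma_YX.
print(np.allclose(A.T @ A @ B_star @ Sigma_XX, A.T @ Sigma_YX))   # True
```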
Assume: Σ_XX is invertible.
▶ Recall: Σ_XX = XX^⊤ = ∑_{i=1}^S x^i(x^i)^⊤.
▶ Hence Σ_XX is symmetric and positive semi-definite.
▶ Invertibility of Σ_XX then implies that Σ_XX is symmetric and positive definite.

Assume: A is full rank.
▶ Recall: the rank of a matrix A is the number of linearly independent columns (or rows) of A.
▶ Also recall: A is of size n × p and p ≤ n.
▶ A is full rank ⟹ rank(A) = min{n, p} = p.
▶ The full rank assumption on A ⟹ A^⊤A is symmetric and positive definite, and hence invertible.

If the assumptions that Σ_XX is invertible and A is full rank hold, then Σ_XX and A^⊤A are both symmetric and positive definite; hence M^⊤M = (Σ_XX ⊗ A^⊤A) is also symmetric and positive definite (by (P5)).

Since M^⊤M is positive definite, the function ∥Mz − c∥₂² is strictly convex and admits a unique minimizer.

Consider the result: for a fixed A, the minimizer z* of min_z ∥(X^⊤ ⊗ A)z − vec(Y)∥₂² satisfies A^⊤AB*Σ_XX = A^⊤Σ_YX, where vec(B*) = z*. If the assumptions that Σ_XX is invertible and A is full rank hold, then

  B* = (A^⊤A)^{−1} A^⊤ Σ_YX Σ_XX^{−1}

is the unique minimizer.
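The closed form can be cross-checked against the generic least-squares solution; a sketch (random data; the dimensions are arbitrary, chosen so that Σ_XX is invertible and A is full rank with probability one):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, S = 4, 2, 8
A = rng.standard_normal((n, p))    # full rank with probability 1
X = rng.standard_normal((n, S))    # S >= n, so Sigma_XX is invertible generically
Y = rng.standard_normal((n, S))

Sigma_XX = X @ X.T
Sigma_YX = Y @ X.T

# Closed form for the unique minimizer at fixed A:
# B* = (A^T A)^{-1} A^T Sigma_YX Sigma_XX^{-1}.
B_star = np.linalg.solve(A.T @ A, A.T @ Sigma_YX) @ np.linalg.inv(Sigma_XX)

# Cross-check against the generic least-squares solution of the vec'd problem.
z_star, *_ = np.linalg.lstsq(np.kron(X.T, A), Y.flatten(order="F"), rcond=None)
print(np.allclose(B_star, z_star.reshape(p, n, order="F")))       # True
```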
Recall: E = ∥vec(ABX) − vec(Y)∥₂².
Homework: fix B, vary the weights of the matrix A, and check the convexity of the loss function and the solution characteristics of the associated minimization problem.

We have the corresponding result:

  For a fixed B, the loss function is convex with respect to vec(A), and the minimizer A* satisfies
  A*BΣ_XX B^⊤ = Σ_YX B^⊤.
  If the assumptions that Σ_XX is invertible and B is full rank hold, then
  A* = Σ_YX B^⊤ (BΣ_XX B^⊤)^{−1}
  is the unique minimizer.

Further, we have the corresponding result:

  Assume Σ_XX is invertible and A is of full rank. Then A and B are critical points of the loss (or error) function E, that is,
  ∂E/∂A_{ij} = 0, ∀i ∈ {1, . . . , n}, j ∈ {1, . . . , p}, and ∂E/∂B_{ij} = 0, ∀i ∈ {1, . . . , p}, j ∈ {1, . . . , n},
  if and only if A satisfies
  P_A Σ = ΣP_A = P_A ΣP_A,
  where P_A = A(A^⊤A)^{−1}A^⊤ and Σ = Σ_YX Σ_XX^{−1} Σ_XY,
  and W = AB is of the form W = P_A Σ_YX Σ_XX^{−1}.
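The fixed-B solution and the projection matrix P_A can likewise be checked numerically; a sketch (random data, arbitrary sizes, assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, S = 4, 2, 8
B = rng.standard_normal((p, n))    # fixed; full rank with probability 1
X = rng.standard_normal((n, S))
Y = rng.standard_normal((n, S))

Sigma_XX = X @ X.T
Sigma_YX = Y @ X.T

# Minimizer at fixed B: A* = Sigma_YX B^T (B Sigma_XX B^T)^{-1};
# it satisfies A* B Sigma_XX B^T = Sigma_YX B^T.
A_star = Sigma_YX @ B.T @ np.linalg.inv(B @ Sigma_XX @ B.T)
print(np.allclose(A_star @ B @ Sigma_XX @ B.T, Sigma_YX @ B.T))   # True

# P_A = A (A^T A)^{-1} A^T is the orthogonal projector onto the
# column space of A: symmetric and idempotent.
P_A = A_star @ np.linalg.solve(A_star.T @ A_star, A_star.T)
print(np.allclose(P_A, P_A.T), np.allclose(P_A @ P_A, P_A))       # True True
```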