Deep Learning - Theory and Practice
IE 643
Lectures 10, 11
September 13 & 15, 2023.
P. Balamurugan
Outline

1. Popular examples of MLP
   Encoder-Decoder Architecture
Popular examples of MLP
Encoder-Decoder Architecture
Multi Layer Perceptron: Encoder-Decoder Architecture
We consider a simple MLP architecture with:
An input layer with n neurons
A hidden layer with p neurons, where p ≤ n
An output layer with n neurons
All neurons in hidden and output layers have linear activations.
This architecture is popularly called Encoder-Decoder Architecture.
Let B denote the p × n matrix connecting input layer and hidden layer.
Let A denote the n × p matrix connecting hidden layer and output layer.
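As a minimal NumPy sketch of this architecture (the sizes n = 4, p = 2 and the random weights are illustrative assumptions, not part of the lecture):

```python
import numpy as np

n, p = 4, 2                        # illustrative sizes, p <= n
rng = np.random.default_rng(0)

B = rng.standard_normal((p, n))    # encoder: input layer -> hidden layer
A = rng.standard_normal((n, p))    # decoder: hidden layer -> output layer

x = rng.standard_normal(n)         # one input sample
h = B @ x                          # hidden representation (linear activations)
x_hat = A @ h                      # output, equals A B x
```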
Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ ℝ^n, y^i ∈ ℝ^n, ∀i ∈ {1, . . . , S}.
MLP Parameters: Weight matrices B and A.
Error: E(B, A) = Σ_{i=1}^S ∥ABx^i − y^i∥₂².
Special case: When y^i = x^i, ∀i, the architecture is called an Autoencoder.
Error for autoencoder: E(B, A) = Σ_{i=1}^S ∥ABx^i − x^i∥₂².
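A direct NumPy sketch of this error and its autoencoder special case (random data and sizes are illustrative):

```python
import numpy as np

n, p, S = 4, 2, 10                # illustrative sizes
rng = np.random.default_rng(0)
X = rng.standard_normal((n, S))   # columns are x^1, ..., x^S
Y = rng.standard_normal((n, S))   # columns are y^1, ..., y^S
B = rng.standard_normal((p, n))
A = rng.standard_normal((n, p))

# E(B, A) = sum_i ||A B x^i - y^i||_2^2
E = sum(np.linalg.norm(A @ B @ X[:, i] - Y[:, i]) ** 2 for i in range(S))

# Autoencoder special case: targets equal inputs (y^i = x^i)
E_auto = sum(np.linalg.norm(A @ B @ X[:, i] - X[:, i]) ** 2 for i in range(S))
```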
Example: Consider the case: n = p = 1.
Error: E(b, a) = Σ_{i=1}^S (abx^i − y^i)².
Sample Training Data:

x^i        y^i
.708333    .708333
.583333    .583333
.166667    .166667
.458333    .458333
.875       .875
Loss surface for the sample training data: [surface plot of E(b, a) over (b, a); figure not recoverable]
Observation: Though there are two valleys in the loss surface, they look
symmetric.
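A small sketch that evaluates E(b, a) on a grid for the sample data above (grid ranges are illustrative). It confirms the symmetry: E depends on (b, a) only through the product ab, so E(−b, −a) = E(b, a), and the minimizers form the two branches of the hyperbola ab = Σ_i x^i y^i / Σ_i (x^i)² (= 1 here), one per valley:

```python
import numpy as np

x = np.array([.708333, .583333, .166667, .458333, .875])
y = x.copy()                          # autoencoder data: y^i = x^i

def E(b, a):
    return np.sum((a * b * x - y) ** 2)

bs = np.linspace(-3, 3, 121)
as_ = np.linspace(-3, 3, 121)
Z = np.array([[E(b, a) for b in bs] for a in as_])  # surface to plot

assert np.isclose(E(1.2, 0.7), E(-1.2, -0.7))       # symmetric valleys

i, j = np.unravel_index(np.argmin(Z), Z.shape)
assert abs(as_[i] * bs[j] - 1.0) < 0.05             # minimum lies on ab = 1
```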
Question: Does this extend to cases where y^i ≠ x^i and for n ≥ 2, n ≥ p?
However, when we fix a, we have: [plots of E(b, a) versus b for fixed values of a; figures not recoverable]
Similarly, when we fix b, we have: [plots of E(b, a) versus a for fixed values of b; figures not recoverable]
Error: E(b, a) = Σ_{i=1}^S (abx^i − y^i)².

When we fix a or when we fix b, we observe that the graph does not contain multiple valleys.
Thus when we fix a or when we fix b, we observe that the graph looks as if
every local optimum is a global optimum.
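This is easy to check in the n = p = 1 example: for fixed a ≠ 0, E(b, a) = Σ_i (abx^i − y^i)² is a quadratic in b with the single minimizer b* = Σ_i x^i y^i / (a Σ_i (x^i)²); a quick sketch (the fixed value of a is an illustrative choice):

```python
import numpy as np

x = np.array([.708333, .583333, .166667, .458333, .875])
y = x.copy()

a_fixed = 0.8                                   # any fixed a != 0
bs = np.linspace(-3, 3, 601)
E_b = np.array([np.sum((a_fixed * b * x - y) ** 2) for b in bs])

# single valley: the quadratic's minimizer (set dE/db = 0)
b_star = np.sum(x * y) / (a_fixed * np.sum(x * x))
assert np.isclose(bs[np.argmin(E_b)], b_star, atol=0.02)
```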
Question: Does this behavior extend to higher dimensions?
We shall investigate this in higher dimensions!
Some notations:

Let X = [x^1 x^2 . . . x^S] and Y = [y^1 y^2 . . . y^S], i.e., the samples x^i and targets y^i are stacked as columns.

Note: X and Y are n × S matrices.
For an n × q matrix H = [h_{ij}], let

vec(H) = (h_{11} . . . h_{n1} h_{12} . . . h_{n2} . . . h_{1q} . . . h_{nq})^⊤

denote an nq × 1 vector.

Note: vec(H) contains the columns of matrix H stacked upon one another.
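In NumPy, this column-stacking is Fortran-order flattening; a tiny sketch:

```python
import numpy as np

H = np.array([[1, 4],
              [2, 5],
              [3, 6]])              # n = 3, q = 2

vec_H = H.flatten(order='F')       # stack columns: [1, 2, 3, 4, 5, 6]
assert vec_H.shape == (6,)         # nq entries (here kept as a flat array)
```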
Recall:

Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ ℝ^n, y^i ∈ ℝ^n, ∀i ∈ {1, . . . , S}.
MLP Parameters: Weight matrices B and A.
Error: E = Σ_{i=1}^S ∥ABx^i − y^i∥₂². (We have used E since the context is clear!)
Using the new notations, we write: E = ∥vec(ABX − Y)∥₂². (Homework!)

Note: The norm is a simple vector norm.
We have: E = ∥vec(ABX − Y)∥₂².

Now note: vec(ABX − Y) = vec(ABX) − vec(Y). (Homework: Verify this claim!)
Thus we have: E = ∥vec(ABX) − vec(Y)∥₂².
Some more new notations:

For an m × n matrix G = [g_{ij}] and a p × q matrix H = [h_{ij}], define the Kronecker product of G and H as the block matrix whose (i, j)-th block is g_{ij}H:

G ⊗ H = [ g_{11}H . . . g_{1n}H
            ...   . . .   ...
          g_{m1}H . . . g_{mn}H ].

Note: G ⊗ H is of size mp × nq.
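NumPy's np.kron implements this block construction; a small size check (shapes are illustrative):

```python
import numpy as np

m, n, p, q = 2, 3, 4, 5                # illustrative shapes
rng = np.random.default_rng(0)
G = rng.standard_normal((m, n))
H = rng.standard_normal((p, q))

K = np.kron(G, H)                      # block (i, j) equals g_ij * H
assert K.shape == (m * p, n * q)       # size mp x nq
assert np.allclose(K[:p, :q], G[0, 0] * H)   # top-left block is g_11 * H
```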
Claim: vec(ABX) = (X^⊤ ⊗ A)vec(B).

Proof idea: this is the special case C = X of the identity vec(ABC) = (C^⊤ ⊗ A)vec(B), listed as property (P1) below. [Worked sketch on the slide not recoverable.]
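A numerical check of the claim (random data and sizes are illustrative):

```python
import numpy as np

n, p, S = 4, 2, 6                      # illustrative sizes
rng = np.random.default_rng(0)
A = rng.standard_normal((n, p))
B = rng.standard_normal((p, n))
X = rng.standard_normal((n, S))

def vec(M):
    return M.flatten(order='F')        # stack columns

assert np.allclose(vec(A @ B @ X), np.kron(X.T, A) @ vec(B))
```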
Using vec(ABX) = (X^⊤ ⊗ A)vec(B), we can write:

E = ∥vec(ABX − Y)∥₂² = ∥vec(ABX) − vec(Y)∥₂² = ∥(X^⊤ ⊗ A)vec(B) − vec(Y)∥₂².
This is of the form F(z) = ∥Mz − c∥₂², where M = (X^⊤ ⊗ A), z = vec(B) and c = vec(Y).
Observe: M is of size nS × np, z is an np × 1 vector and c is an nS × 1 vector.
Consider the function F(z) = ∥Mz − c∥₂². We have the following result:

Convexity of function F: The function F(z) is convex.
Question: What is a convex function?
We discuss convex sets and convex functions in Part 2 of this lecture!
Hence min_z F(z) is a convex optimization problem.
Note:

argmin_z F(z) = argmin_z ∥Mz − c∥₂²
             = argmin_z (Mz − c)^⊤(Mz − c)
             = argmin_z z^⊤M^⊤Mz − 2z^⊤M^⊤c + c^⊤c
             = argmin_z z^⊤M^⊤Mz − 2z^⊤M^⊤c   (the constant c^⊤c does not affect the minimizer).
By the first-order optimality characterization, z* is a solution of min_z F(z) if and only if ∇F(z*) = 0.
=⇒ 2M^⊤Mz* − 2M^⊤c = 0.
=⇒ M^⊤Mz* = M^⊤c.
Further, if M^⊤M is positive definite, z* is the unique optimal solution and is given by z* = (M^⊤M)^{−1}M^⊤c.
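A minimal NumPy sketch of this normal-equation solution (random data and sizes are illustrative; with these sizes M^⊤M is invertible with probability one):

```python
import numpy as np

n, p, S = 4, 2, 6                      # illustrative sizes
rng = np.random.default_rng(1)
A = rng.standard_normal((n, p))
X = rng.standard_normal((n, S))
Y = rng.standard_normal((n, S))

M = np.kron(X.T, A)                    # nS x np
c = Y.flatten(order='F')               # c = vec(Y)

# unique solution of M^T M z* = M^T c (M^T M positive definite here)
z_star = np.linalg.solve(M.T @ M, M.T @ c)

# agrees with a direct least-squares solve of min_z ||Mz - c||_2^2
z_lstsq, *_ = np.linalg.lstsq(M, c, rcond=None)
assert np.allclose(z_star, z_lstsq)
```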
Recall: min_z F(z) = ∥Mz − c∥₂² is the same as min_{vec(B)} ∥(X^⊤ ⊗ A)vec(B) − vec(Y)∥₂².

Hence from M^⊤Mz* = M^⊤c, we have:

(X^⊤ ⊗ A)^⊤(X^⊤ ⊗ A)z* = (X^⊤ ⊗ A)^⊤c.
Some properties of the Kronecker Product:

(P1) vec(ABC) = (C^⊤ ⊗ A)vec(B).
(P2) (G ⊗ H)^⊤ = (G^⊤ ⊗ H^⊤).
(P3) (G ⊗ H)^{−1} = (G^{−1} ⊗ H^{−1}).
(P4) (G ⊗ H)(U ⊗ V) = (GU ⊗ HV).
(P5) If G and H are symmetric and positive (semi-)definite, then G ⊗ H is also symmetric and positive (semi-)definite.
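Properties (P2) and (P4) are easy to check numerically; a sketch (shapes chosen so that the products are defined):

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((2, 3)); H = rng.standard_normal((4, 5))
U = rng.standard_normal((3, 2)); V = rng.standard_normal((5, 4))

# (P2): (G kron H)^T = (G^T kron H^T)
assert np.allclose(np.kron(G, H).T, np.kron(G.T, H.T))

# (P4): (G kron H)(U kron V) = (GU kron HV)
assert np.allclose(np.kron(G, H) @ np.kron(U, V), np.kron(G @ U, H @ V))
```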
Using these properties, we continue from (X^⊤ ⊗ A)^⊤(X^⊤ ⊗ A)z* = (X^⊤ ⊗ A)^⊤c:

=⇒ (X ⊗ A^⊤)(X^⊤ ⊗ A)z* = (X ⊗ A^⊤)c   (from (P2))
=⇒ (XX^⊤ ⊗ A^⊤A)z* = (X ⊗ A^⊤)c   (from (P4))
Recall: X = [x^1 x^2 . . . x^S], Y = [y^1 y^2 . . . y^S].

Also note:

XX^⊤ = Σ_{i=1}^S x^i(x^i)^⊤,   YY^⊤ = Σ_{i=1}^S y^i(y^i)^⊤,
XY^⊤ = Σ_{i=1}^S x^i(y^i)^⊤   and   YX^⊤ = Σ_{i=1}^S y^i(x^i)^⊤.

(Homework: try to see if these equalities are true!)

Denote XX^⊤ by Σ_XX, YY^⊤ by Σ_YY, XY^⊤ by Σ_XY and YX^⊤ by Σ_YX.
Also note Σ_XX, Σ_YY are symmetric and (Σ_XY)^⊤ = (XY^⊤)^⊤ = YX^⊤ = Σ_YX.
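A quick numerical check of the homework claim (random data, illustrative sizes):

```python
import numpy as np

n, S = 4, 6                            # illustrative sizes
rng = np.random.default_rng(0)
X = rng.standard_normal((n, S))
Y = rng.standard_normal((n, S))

# XX^T = sum_i x^i (x^i)^T and YX^T = sum_i y^i (x^i)^T
assert np.allclose(X @ X.T, sum(np.outer(X[:, i], X[:, i]) for i in range(S)))
assert np.allclose(Y @ X.T, sum(np.outer(Y[:, i], X[:, i]) for i in range(S)))

# (Sigma_XY)^T = Sigma_YX
assert np.allclose((X @ Y.T).T, Y @ X.T)
```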
With this notation, the derivation continues:

=⇒ (Σ_XX ⊗ A^⊤A)z* = (X ⊗ A^⊤)c
=⇒ (Σ_XX ⊗ A^⊤A)(vec(B*)) = (X ⊗ A^⊤)(vec(Y))
=⇒ A^⊤AB*Σ_XX = A^⊤YX^⊤   (from (P1))
=⇒ A^⊤AB*Σ_XX = A^⊤Σ_YX.
We have the following result:

For a fixed A, the minimizer z* of min_z ∥(X^⊤ ⊗ A)z − vec(Y)∥₂² satisfies

A^⊤AB*Σ_XX = A^⊤Σ_YX,

where vec(B*) = z*.
Assume: Σ_XX is invertible.

▶ Recall: Σ_XX = XX^⊤ = Σ_{i=1}^S x^i(x^i)^⊤.
▶ Hence Σ_XX is symmetric and positive semi-definite.
▶ Now invertibility of Σ_XX implies that Σ_XX is symmetric and positive definite.
Assume:
Σ_XX is invertible.
A is full rank.

▶ Recall: the rank of matrix A is the number of linearly independent columns or rows in A.
▶ Also recall: A is of size n × p and p ≤ n.
▶ A is full rank =⇒ rank(A) = min{n, p} = p.

The full rank assumption on A =⇒ A^⊤A is symmetric and positive definite and hence invertible.
Recall: min_z F(z) = ∥Mz − c∥₂² is the same as min_{vec(B)} ∥(X^⊤ ⊗ A)vec(B) − vec(Y)∥₂², and from the derivation above,

M^⊤M = (X^⊤ ⊗ A)^⊤(X^⊤ ⊗ A) = (XX^⊤ ⊗ A^⊤A) = (Σ_XX ⊗ A^⊤A).

If the assumptions that Σ_XX is invertible and A is full rank hold, then Σ_XX and A^⊤A are symmetric and positive definite, hence M^⊤M = (Σ_XX ⊗ A^⊤A) is also symmetric and positive definite (by (P5)).
Since M^⊤M is positive definite, the function ∥Mz − c∥₂² is strictly convex and admits a unique minimizer.
Consider the result:

For a fixed A, the minimizer z* of min_z ∥(X^⊤ ⊗ A)z − vec(Y)∥₂² satisfies

A^⊤AB*Σ_XX = A^⊤Σ_YX,

where vec(B*) = z*.

If the assumptions that Σ_XX is invertible and A is full rank hold, then we have

B* = (A^⊤A)^{−1}A^⊤Σ_YX Σ_XX^{−1}

as the unique minimizer.
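A sketch verifying that this closed form satisfies the optimality relation (random full-rank data; sizes are illustrative):

```python
import numpy as np

n, p, S = 4, 2, 8                      # S >= n so Sigma_XX is invertible here
rng = np.random.default_rng(2)
A = rng.standard_normal((n, p))        # full rank with probability one
X = rng.standard_normal((n, S))
Y = rng.standard_normal((n, S))

Sigma_XX = X @ X.T
Sigma_YX = Y @ X.T

B_star = np.linalg.inv(A.T @ A) @ A.T @ Sigma_YX @ np.linalg.inv(Sigma_XX)

# B* satisfies the optimality relation A^T A B* Sigma_XX = A^T Sigma_YX
assert np.allclose(A.T @ A @ B_star @ Sigma_XX, A.T @ Sigma_YX)
```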
Recall: E = ∥vec(ABX) − vec(Y)∥₂².

Homework: Fix B and vary the weights of the A matrix; check convexity of the loss function and the solution characteristics of the associated minimization problem.
We have the corresponding result:

For a fixed B, the loss function is convex with respect to vec(A) and the minimizer A satisfies the following relation:

ABΣ_XX B^⊤ = Σ_YX B^⊤.

If the assumptions that Σ_XX is invertible and B is full rank hold, then we have

A* = Σ_YX B^⊤(BΣ_XX B^⊤)^{−1}

as the unique minimizer.
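A symmetric sketch for the fixed-B case (again with illustrative random data):

```python
import numpy as np

n, p, S = 4, 2, 8                      # illustrative sizes
rng = np.random.default_rng(3)
B = rng.standard_normal((p, n))        # full rank with probability one
X = rng.standard_normal((n, S))
Y = rng.standard_normal((n, S))

Sigma_XX = X @ X.T
Sigma_YX = Y @ X.T

A_star = Sigma_YX @ B.T @ np.linalg.inv(B @ Sigma_XX @ B.T)

# A* satisfies the relation A B Sigma_XX B^T = Sigma_YX B^T
assert np.allclose(A_star @ B @ Sigma_XX @ B.T, Sigma_YX @ B.T)
```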
Further, we have the corresponding result:

Assume Σ_XX is invertible and A is of full rank. Then A and B denote the critical points of the loss (or error) function E, that is,

∂E/∂A_{ij} = 0, ∀i ∈ {1, . . . , n}, j ∈ {1, . . . , p} and
∂E/∂B_{ij} = 0, ∀i ∈ {1, . . . , p}, j ∈ {1, . . . , n},

if and only if A satisfies:

P_A Σ = ΣP_A = P_A ΣP_A, where P_A = A(A^⊤A)^{−1}A^⊤, Σ = Σ_YX Σ_XX^{−1} Σ_XY,

and W = AB is of the form:

W = P_A Σ_YX Σ_XX^{−1}.
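A small check that P_A = A(A^⊤A)^{−1}A^⊤ is indeed an orthogonal projection (symmetric and idempotent), the key object in this characterization (random full-rank A, illustrative):

```python
import numpy as np

n, p = 4, 2
rng = np.random.default_rng(4)
A = rng.standard_normal((n, p))        # full rank with probability one

P_A = A @ np.linalg.inv(A.T @ A) @ A.T

assert np.allclose(P_A, P_A.T)         # symmetric
assert np.allclose(P_A @ P_A, P_A)     # idempotent: P_A^2 = P_A
assert np.allclose(P_A @ A, A)         # fixes the column space of A
```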