Deep Learning Assignment 2
1.
π
1
∇πΈππ (π€) = ∇( ∑(π‘ππβ(π€ π π₯π ) − π¦π ) 2 )
π
π=1
π
1
∇πΈππ (π€) = ( ∑ ∇ (π‘ππβ(π€ π π₯π ) − π¦π )2 )
π
π=1
π
2
∇πΈππ (π€) = ( ∑(π‘ππβ(π€ π π₯π ) − π¦π ) ∇(π‘ππβ(π€ π π₯π ) − π¦π ))
π
π=1
π
2
∇πΈππ (π€) = ( ∑(π‘ππβ(π€ π π₯π ) − π¦π ) ∇(π‘ππβ(π€ π π₯π ) − π¦π ))
π
π=1
π
(π π₯ + π −π₯ ) (π π₯ + π −π₯ ) − (π π₯ − π −π₯ ) (π π₯ − π −π₯ )
π‘ππβ(π₯) =
ππ₯
(π π₯ + π −π₯ )2
π
(π π₯ − π −π₯ )2
π‘ππβ(π₯) = 1 −
ππ₯
(π π₯ + π −π₯ )2
π
π‘ππβ(π₯) = 1 − π‘ππβ(π₯)2
ππ₯
π
2
∇πΈππ (π€) = ( ∑(π‘ππβ(π€ π π₯π ) − π¦π ) (1 − π‘ππβ2 (π€ π π₯π ))∇(π€ π π₯π ))
π
π=1
π
2
∇πΈππ (π€) = ( ∑(π‘ππβ(π€ π π₯π ) − π¦π ) (1 − π‘ππβ2 (π€ π π₯)) π₯π )
π
π=1
As w → ∞, the slope of tanh function becomes almost zero that results in slow learning which makes
it difficult to optimize the perceptron.
2.
W(1) = [
0.2
W(2) = [ 1 ]
−3
0.1 0.2
]
0.3 0.4
1
W(2) = [ ]
2
x = 2, y = 1
x(0)
s(1)
x(1)
s(2)
1
[ ]
2
0.1 0.3 1
0.7
[
][ ] = [ ]
0.2 0.4 2
1
1
[ 0.6 ]
0.76
[−1.48]
πΉ(π)
πΉ(π)
[−3.6]
[(1 − 0.9)2 . 2 . (−3.6) ] = 1.368
ππ
ππ (1)
−π. ππ π. ππ
= [
]
−π. ππ π. ππ
[
x(2)
s(3)
x(3)
1
]
−0.90
[−0.8]
[−0.8]
πΉ(π)
ππ
ππ (2)
[
−0.88
]
1.73
−π. ππ
= [−π. ππ]
−π. ππ
ππ
ππ (3)
−π. π
= [
]
π. ππ
3. (20 points) Consider the standard residual block and the bottleneck block in the case where
inputs and outputs have the same dimension (e.g. Figure 5 in [1]). In another word, the
residual connection is an identity connection. For the standard residual block, compute the
number of training parameters when the dimension of inputs and outputs is 128x16x16x32.
Here, 128 is the batch size, 16 x 16 is the spatial size of feature maps, and 32 is the number
of channels. For the bottleneck block, compute the number of training parameters when the
dimension of inputs and outputs is 128x16x16x128. Compare the two results and explain
the advantages and disadvantages of the bottleneck block.
For Standard residual block,
Given input dimensions 16 x 16 x 32
Number of Training parameters in Conv1 3 x 3, 32 = (3*3*32+1)*32
Number of Training parameters in batchnorm1 32 feature maps = 2*32
Number of Training parameters in Conv2 3 x 3, 32 = (3*3*32+1)*32
Number of Training parameters in batchnorm2 32 feature maps = 2*32
Total Number of Training parameters = 18624
For Bottleneck block,
Given input dimensions 16 x 16 x 32
Number of Training parameters in conv1 1 x 1, 32 = (1*1*128+1)*32
Number of Training parameters in batchnorm1 32 feature maps = 2*32
Number of Training parameters in Conv2 3 x 3, 32 = (3*3*32+1)*32
Number of Training parameters in batchnorm2 32 feature maps = 2*32
Number of Training parameters in Conv3 1 x 1, 128 = (1*1*32+1)*128
Number of Training parameters in batchnorm3 128 feature maps = 2*128
Total Number of Training parameters = 17984
4. (20 points) Using batch normalization in training requires computing the mean and variance
of a tensor.
(a) (8 points) Suppose the tensor x is the output of a fully-connected layer and we want to
perform batch normalization on it. The training batch size is N and the fully-connected
layer has C output nodes. Therefore, the shape of x is N x C. What is the shape of the
mean and variance computed in batch normalization, respectively?
1 x C, 1 x C
Mean and variance have same shape. Mean depends on number of channels. Since it
(b) (12 points) Now suppose the tensor x is the output of a 2D convolution and has shape
N x H x W x C. What is the shape of the mean and variance computed in batch
normalization, respectively?
b) 1 x C x 1 x 1, 1 x C x 1 x 1
Instead of calculating mean for each element in a matrix and in a specific channel, it is
calculated for an entire matrix for a single channel.
5. a) We get equation for πππ , πππ , πππ , πππ as below
ππ
ππ
ππ
ππ
ππ
πππ = πππ
π πππ + ππ πππ + ππ πππ + ππ πππ + ππ πππ + ππ πππ
ππ
ππ
ππ
ππ
ππ
πππ = πππ
π πππ + ππ πππ + ππ πππ + ππ πππ + ππ πππ + ππ πππ
ππ
ππ
ππ
ππ
ππ
πππ = πππ
π πππ + ππ πππ + ππ πππ + ππ πππ + ππ πππ + ππ πππ
ππ
ππ
ππ
ππ
ππ
πππ = πππ
π πππ + ππ πππ + ππ πππ + ππ πππ + ππ πππ + ππ πππ
ππ
π
πππ
=
[ πππ
π
πππ
π
πππ
π
πππ
ππ ππ
] [πππ ] + [ πππ
π ππ ππ ]
πππ
πππ
[πππ ]
πππ
πππ
πππ
πππ
πππ
ππ ππ
ππ ππ
= [ πππ
+ [π π π π πππ
π π π ππ π π π π π ] π
π ππ ππ π]
ππ
πππ
πππ
[πππ ]
πππ
πππ
πππ
πππ
πππ
πππ
πππ
[πππ ]
Similarly, we get for
πππ
πππ
πππ
πππ
πππ
ππ ππ
ππ ππ
= [ π πππ
+ [π π π π π πππ
π ππ ππ π π π π ] π
π ππ ππ ]
ππ
πππ
πππ
[πππ ]
πππ
πππ
πππ
πππ
πππ
πππ
πππ
[πππ ]
πππ
πππ
πππ
πππ
πππ
ππ ππ
ππ ππ
= [ πππ
+ [π π π π πππ
π π π ππ π π π π π ] π
π ππ ππ π]
ππ
πππ
πππ
[πππ ]
πππ
πππ
πππ
πππ
πππ
πππ
πππ
[πππ ]
πππ
πππ
πππ
πππ
πππ
ππ ππ
ππ ππ
= [ π πππ
+ [π π π π π πππ
π ππ ππ π π π π ] π
π ππ ππ ]
ππ
πππ
πππ
[πππ ]
πππ
πππ
πππ
πππ
πππ
πππ
πππ
[πππ ]
Combining all these equations, we have
πππ
π
0
πΜ =
πππ
π
[ 0
c)
πππ
π
πππ
π
πππ
π
πππ
π
πππ
π
πππ
π
πππ
π
πππ
π
0
πππ
π
0
πππ
π
πππ
π
0
πππ
π
0
πππ
π
πππ
π
πππ
π
πππ
π
πππ
π
πππ
π
πππ
π
πππ
π
0
πππ
π
πΜ
0
πππ
π ]