Deep Learning Assignment 2
1.

$$\nabla E_{in}(w) = \nabla\left(\frac{1}{N}\sum_{n=1}^{N}\left(\tanh(w^T x_n) - y_n\right)^2\right)$$

$$\nabla E_{in}(w) = \frac{1}{N}\sum_{n=1}^{N}\nabla\left(\tanh(w^T x_n) - y_n\right)^2$$

$$\nabla E_{in}(w) = \frac{2}{N}\sum_{n=1}^{N}\left(\tanh(w^T x_n) - y_n\right)\nabla\left(\tanh(w^T x_n) - y_n\right)$$

Using

$$\frac{d}{dx}\tanh(x) = \frac{(e^x + e^{-x})(e^x + e^{-x}) - (e^x - e^{-x})(e^x - e^{-x})}{(e^x + e^{-x})^2} = 1 - \frac{(e^x - e^{-x})^2}{(e^x + e^{-x})^2} = 1 - \tanh^2(x),$$

this becomes

$$\nabla E_{in}(w) = \frac{2}{N}\sum_{n=1}^{N}\left(\tanh(w^T x_n) - y_n\right)\left(1 - \tanh^2(w^T x_n)\right)\nabla(w^T x_n)$$

$$\nabla E_{in}(w) = \frac{2}{N}\sum_{n=1}^{N}\left(\tanh(w^T x_n) - y_n\right)\left(1 - \tanh^2(w^T x_n)\right)x_n$$
As $w \to \infty$, $\tanh(w^T x_n)$ saturates at $\pm 1$, so the factor $1 - \tanh^2(w^T x_n)$, and with it the gradient, goes to zero. Gradient descent then makes almost no progress, which makes the perceptron difficult to optimize.
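To sanity-check the derived gradient, here is a minimal NumPy sketch (toy data and weights, all made up for illustration) that compares the closed-form expression against a central finite-difference estimate:

```python
import numpy as np

def E_in(w, X, y):
    # In-sample mean squared error of the tanh perceptron.
    return np.mean((np.tanh(X @ w) - y) ** 2)

def grad_E_in(w, X, y):
    # Closed form: (2/N) sum_n (tanh(w^T x_n) - y_n)(1 - tanh^2(w^T x_n)) x_n
    s = np.tanh(X @ w)
    return (2 / len(y)) * X.T @ ((s - y) * (1 - s ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # 5 toy examples, 3 features (made up)
y = rng.choice([-1.0, 1.0], size=5)  # +/-1 targets
w = rng.normal(size=3)

# Central finite-difference estimate of each partial derivative.
eps = 1e-6
num = np.array([
    (E_in(w + eps * e, X, y) - E_in(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(grad_E_in(w, X, y), num, atol=1e-8))  # True
```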
2.

Given

$$W^{(1)} = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix}, \qquad W^{(2)} = \begin{bmatrix} 0.2 \\ 1 \\ -3 \end{bmatrix}, \qquad W^{(3)} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \qquad x = 2, \quad y = 1.$$

Forward pass ($\tanh$ hidden units, linear output, bias coordinate 1 prepended at each layer):

$$x^{(0)} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$$

$$s^{(1)} = (W^{(1)})^T x^{(0)} = \begin{bmatrix} 0.1 & 0.3 \\ 0.2 & 0.4 \end{bmatrix}\begin{bmatrix} 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 0.7 \\ 1 \end{bmatrix}, \qquad x^{(1)} = \begin{bmatrix} 1 \\ \tanh(0.7) \\ \tanh(1) \end{bmatrix} = \begin{bmatrix} 1 \\ 0.60 \\ 0.76 \end{bmatrix}$$

$$s^{(2)} = (W^{(2)})^T x^{(1)} = 0.2(1) + 1(0.60) - 3(0.76) = [-1.48], \qquad x^{(2)} = \begin{bmatrix} 1 \\ \tanh(-1.48) \end{bmatrix} = \begin{bmatrix} 1 \\ -0.90 \end{bmatrix}$$

$$s^{(3)} = (W^{(3)})^T x^{(2)} = 1(1) + 2(-0.90) = [-0.8], \qquad x^{(3)} = s^{(3)} = [-0.8]$$

Backward pass, with $e = (x^{(3)} - y)^2$ and $\delta^{(l)} = \left(1 - x^{(l)} \odot x^{(l)}\right) \odot W^{(l+1)}\delta^{(l+1)}$ (bias rows dropped):

$$\delta^{(3)} = 2(x^{(3)} - y) = 2(-0.8 - 1) = [-3.6]$$

$$\delta^{(2)} = \left(1 - (-0.90)^2\right) \cdot 2 \cdot (-3.6) = [-1.368]$$

$$\delta^{(1)} = \begin{bmatrix} (1 - 0.60^2)(1)(-1.368) \\ (1 - 0.76^2)(-3)(-1.368) \end{bmatrix} = \begin{bmatrix} -0.88 \\ 1.73 \end{bmatrix}$$

Gradients:

$$\frac{\partial e}{\partial W^{(1)}} = x^{(0)}(\delta^{(1)})^T = \begin{bmatrix} 1 \\ 2 \end{bmatrix}\begin{bmatrix} -0.88 & 1.73 \end{bmatrix} = \begin{bmatrix} -0.88 & 1.73 \\ -1.76 & 3.46 \end{bmatrix}$$

$$\frac{\partial e}{\partial W^{(2)}} = x^{(1)}(\delta^{(2)})^T = \begin{bmatrix} -1.37 \\ -0.82 \\ -1.04 \end{bmatrix}, \qquad \frac{\partial e}{\partial W^{(3)}} = x^{(2)}(\delta^{(3)})^T = \begin{bmatrix} -3.6 \\ 3.24 \end{bmatrix}$$
3. (20 points) Consider the standard residual block and the bottleneck block in the case where
inputs and outputs have the same dimension (e.g. Figure 5 in [1]). In other words, the
residual connection is an identity connection. For the standard residual block, compute the
number of training parameters when the dimension of inputs and outputs is 128x16x16x32.
Here, 128 is the batch size, 16 x 16 is the spatial size of feature maps, and 32 is the number
of channels. For the bottleneck block, compute the number of training parameters when the
dimension of inputs and outputs is 128x16x16x128. Compare the two results and explain
the advantages and disadvantages of the bottleneck block.
For the standard residual block,
Given input dimensions 16 x 16 x 32 (the batch size of 128 does not affect the parameter count):
Number of training parameters in Conv1, 3 x 3, 32 filters = (3*3*32 + 1)*32 = 9248
Number of training parameters in BatchNorm1, 32 feature maps = 2*32 = 64
Number of training parameters in Conv2, 3 x 3, 32 filters = (3*3*32 + 1)*32 = 9248
Number of training parameters in BatchNorm2, 32 feature maps = 2*32 = 64
Total number of training parameters = 9248 + 64 + 9248 + 64 = 18624
For the bottleneck block,
Given input dimensions 16 x 16 x 128:
Number of training parameters in Conv1, 1 x 1, 32 filters = (1*1*128 + 1)*32 = 4128
Number of training parameters in BatchNorm1, 32 feature maps = 2*32 = 64
Number of training parameters in Conv2, 3 x 3, 32 filters = (3*3*32 + 1)*32 = 9248
Number of training parameters in BatchNorm2, 32 feature maps = 2*32 = 64
Number of training parameters in Conv3, 1 x 1, 128 filters = (1*1*32 + 1)*128 = 4224
Number of training parameters in BatchNorm3, 128 feature maps = 2*128 = 256
Total number of training parameters = 4128 + 64 + 9248 + 64 + 4224 + 256 = 17984
Comparison: the bottleneck block operates on four times as many channels (128 vs. 32) with slightly fewer parameters (17984 vs. 18624), because its 1 x 1 convolutions compress the representation to 32 channels before the expensive 3 x 3 convolution and expand it back afterwards. The trade-off is that the 3 x 3 convolution only sees the compressed 32-channel representation, and each block is deeper (three weight layers instead of two).
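These counts can be checked mechanically. A minimal PyTorch sketch, assuming each block consists of exactly the Conv + BatchNorm pairs itemized above (the identity residual connection adds no parameters):

```python
import torch.nn as nn

def n_params(module):
    # Total number of trainable parameters in a module.
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Standard residual block on 32-channel feature maps.
standard = nn.Sequential(
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32),
)

# Bottleneck block on 128-channel feature maps: 1x1 reduce, 3x3, 1x1 expand.
bottleneck = nn.Sequential(
    nn.Conv2d(128, 32, kernel_size=1), nn.BatchNorm2d(32),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32),
    nn.Conv2d(32, 128, kernel_size=1), nn.BatchNorm2d(128),
)

print(n_params(standard))    # 18624
print(n_params(bottleneck))  # 17984
```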
4. (20 points) Using batch normalization in training requires computing the mean and variance
of a tensor.
(a) (8 points) Suppose the tensor x is the output of a fully-connected layer and we want to
perform batch normalization on it. The training batch size is N and the fully-connected
layer has C output nodes. Therefore, the shape of x is N x C. What is the shape of the
mean and variance computed in batch normalization, respectively?
1 x C, 1 x C
Mean and variance have the same shape. The statistics are computed over the batch dimension (the N axis), leaving one mean and one variance per output node, hence 1 x C each.
(b) (12 points) Now suppose the tensor x is the output of a 2D convolution and has shape
N x H x W x C. What is the shape of the mean and variance computed in batch
normalization, respectively?
b) 1 x 1 x 1 x C, 1 x 1 x 1 x C
Instead of computing a statistic for each spatial location separately, batch normalization for convolutional outputs averages over the batch and both spatial dimensions (N, H, W), leaving one mean and one variance per channel. With the given N x H x W x C layout, they broadcast as 1 x 1 x 1 x C.
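A quick NumPy check of both answers (shapes chosen arbitrarily; keepdims=True keeps the reduced axes so the broadcast shape is explicit):

```python
import numpy as np

N, H, W, C = 8, 16, 16, 4

# (a) Fully-connected output: reduce over the batch axis only.
x_fc = np.random.randn(N, C)
print(x_fc.mean(axis=0, keepdims=True).shape)  # (1, C)
print(x_fc.var(axis=0, keepdims=True).shape)   # (1, C)

# (b) Convolutional output in NHWC layout: reduce over batch and spatial axes.
x_conv = np.random.randn(N, H, W, C)
print(x_conv.mean(axis=(0, 1, 2), keepdims=True).shape)  # (1, 1, 1, C)
print(x_conv.var(axis=(0, 1, 2), keepdims=True).shape)   # (1, 1, 1, C)
```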
5. a) We get the equations for $y_{11}$, $y_{12}$, $y_{21}$, $y_{22}$ as below:

$$y_{11} = w^{11}_1 x_{11} + w^{11}_2 x_{12} + w^{11}_3 x_{13} + w^{21}_1 x_{21} + w^{21}_2 x_{22} + w^{21}_3 x_{23}$$

$$y_{12} = w^{11}_1 x_{12} + w^{11}_2 x_{13} + w^{11}_3 x_{14} + w^{21}_1 x_{22} + w^{21}_2 x_{23} + w^{21}_3 x_{24}$$

$$y_{21} = w^{12}_1 x_{11} + w^{12}_2 x_{12} + w^{12}_3 x_{13} + w^{22}_1 x_{21} + w^{22}_2 x_{22} + w^{22}_3 x_{23}$$

$$y_{22} = w^{12}_1 x_{12} + w^{12}_2 x_{13} + w^{12}_3 x_{14} + w^{22}_1 x_{22} + w^{22}_2 x_{23} + w^{22}_3 x_{24}$$

Stacking the input channels into $\tilde{x} = [x_{11}\ x_{12}\ x_{13}\ x_{14}\ x_{21}\ x_{22}\ x_{23}\ x_{24}]^T$, each output is a sum of zero-padded rows of filter weights applied to $\tilde{x}$:

$$y_{11} = \begin{bmatrix} w^{11}_1 & w^{11}_2 & w^{11}_3 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}\tilde{x} + \begin{bmatrix} 0 & 0 & 0 & 0 & w^{21}_1 & w^{21}_2 & w^{21}_3 & 0 \end{bmatrix}\tilde{x}$$

Similarly, we get for $y_{12}$, $y_{21}$, $y_{22}$:

$$y_{12} = \begin{bmatrix} 0 & w^{11}_1 & w^{11}_2 & w^{11}_3 & 0 & 0 & 0 & 0 \end{bmatrix}\tilde{x} + \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & w^{21}_1 & w^{21}_2 & w^{21}_3 \end{bmatrix}\tilde{x}$$

$$y_{21} = \begin{bmatrix} w^{12}_1 & w^{12}_2 & w^{12}_3 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}\tilde{x} + \begin{bmatrix} 0 & 0 & 0 & 0 & w^{22}_1 & w^{22}_2 & w^{22}_3 & 0 \end{bmatrix}\tilde{x}$$

$$y_{22} = \begin{bmatrix} 0 & w^{12}_1 & w^{12}_2 & w^{12}_3 & 0 & 0 & 0 & 0 \end{bmatrix}\tilde{x} + \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & w^{22}_1 & w^{22}_2 & w^{22}_3 \end{bmatrix}\tilde{x}$$

Combining all these equations, we have $\tilde{y} = \tilde{W}\tilde{x}$ with $\tilde{y} = [y_{11}\ y_{12}\ y_{21}\ y_{22}]^T$ and

$$\tilde{W} = \begin{bmatrix}
w^{11}_1 & w^{11}_2 & w^{11}_3 & 0 & w^{21}_1 & w^{21}_2 & w^{21}_3 & 0 \\
0 & w^{11}_1 & w^{11}_2 & w^{11}_3 & 0 & w^{21}_1 & w^{21}_2 & w^{21}_3 \\
w^{12}_1 & w^{12}_2 & w^{12}_3 & 0 & w^{22}_1 & w^{22}_2 & w^{22}_3 & 0 \\
0 & w^{12}_1 & w^{12}_2 & w^{12}_3 & 0 & w^{22}_1 & w^{22}_2 & w^{22}_3
\end{bmatrix}.$$