Deep Learning Assignment 2

1.
\nabla E_{in}(w) = \nabla \left( \frac{1}{N} \sum_{n=1}^{N} (\tanh(w^T x_n) - y_n)^2 \right)
               = \frac{1}{N} \sum_{n=1}^{N} \nabla (\tanh(w^T x_n) - y_n)^2
               = \frac{2}{N} \sum_{n=1}^{N} (\tanh(w^T x_n) - y_n) \, \nabla (\tanh(w^T x_n) - y_n)

For the derivative of tanh:
\frac{d}{dx} \tanh(x) = \frac{(e^x + e^{-x})(e^x + e^{-x}) - (e^x - e^{-x})(e^x - e^{-x})}{(e^x + e^{-x})^2}
                     = 1 - \frac{(e^x - e^{-x})^2}{(e^x + e^{-x})^2}
                     = 1 - \tanh^2(x)

Therefore
\nabla E_{in}(w) = \frac{2}{N} \sum_{n=1}^{N} (\tanh(w^T x_n) - y_n)(1 - \tanh^2(w^T x_n)) \, \nabla (w^T x_n)
               = \frac{2}{N} \sum_{n=1}^{N} (\tanh(w^T x_n) - y_n)(1 - \tanh^2(w^T x_n)) \, x_n

As w → ∞, tanh(w^T x_n) saturates and its slope (1 - tanh^2(w^T x_n)) goes to almost zero, so the gradient above vanishes. Learning becomes very slow, which makes the perceptron difficult to optimize.

2.
Given W^(1) = [0.1 0.2; 0.3 0.4], W^(2) = [0.2; 1; -3], W^(3) = [1; 2], input x = 2 and target y = 1, with tanh hidden units and a linear output node.

Forward pass:
x^(0) = [1; 2]
s^(1) = (W^(1))^T x^(0) = [0.1 0.3; 0.2 0.4][1; 2] = [0.7; 1.0]
x^(1) = [1; tanh(0.7); tanh(1.0)] = [1; 0.60; 0.76]
s^(2) = (W^(2))^T x^(1) = 0.2(1) + 1(0.60) + (-3)(0.76) = -1.48
x^(2) = [1; tanh(-1.48)] = [1; -0.90]
s^(3) = (W^(3))^T x^(2) = 1(1) + 2(-0.90) = -0.80
x^(3) = s^(3) = -0.80

Backward pass for e = (x^(3) - y)^2:
δ^(3) = 2(x^(3) - y) = 2(-0.80 - 1) = -3.6
δ^(2) = (1 - 0.90^2) · 2 · (-3.6) = -1.368
δ^(1) = [(1 - 0.60^2) · 1 · (-1.368); (1 - 0.76^2) · (-3) · (-1.368)] = [-0.88; 1.73]

Gradients, using ∂e/∂W^(l) = x^(l-1) (δ^(l))^T:
∂e/∂W^(1) = [1; 2][-0.88 1.73] = [-0.88 1.73; -1.76 3.46]
∂e/∂W^(2) = [1; 0.60; 0.76](-1.368) = [-1.37; -0.82; -1.04]
∂e/∂W^(3) = [1; -0.90](-3.6) = [-3.6; 3.24]

3. (20 points) Consider the standard residual block and the bottleneck block in the case where inputs and outputs have the same dimension (e.g. Figure 5 in [1]). In other words, the residual connection is an identity connection. For the standard residual block, compute the number of training parameters when the dimension of inputs and outputs is 128x16x16x32. Here, 128 is the batch size, 16 x 16 is the spatial size of feature maps, and 32 is the number of channels. For the bottleneck block, compute the number of training parameters when the dimension of inputs and outputs is 128x16x16x128. Compare the two results and explain the advantages and disadvantages of the bottleneck block.

For the standard residual block, with input and output dimensions 16 x 16 x 32:
Conv1, 3 x 3, 32 filters: (3*3*32 + 1)*32 = 9248
BatchNorm1, 32 feature maps: 2*32 = 64
Conv2, 3 x 3, 32 filters: (3*3*32 + 1)*32 = 9248
BatchNorm2, 32 feature maps: 2*32 = 64
Total number of training parameters = 18624

For the bottleneck block, with input and output dimensions 16 x 16 x 128:
Conv1, 1 x 1, 32 filters: (1*1*128 + 1)*32 = 4128
BatchNorm1, 32 feature maps: 2*32 = 64
Conv2, 3 x 3, 32 filters: (3*3*32 + 1)*32 = 9248
BatchNorm2, 32 feature maps: 2*32 = 64
Conv3, 1 x 1, 128 filters: (1*1*32 + 1)*128 = 4224
BatchNorm3, 128 feature maps: 2*128 = 256
Total number of training parameters = 17984

Although the bottleneck block operates on four times as many channels (128 instead of 32), it uses slightly fewer parameters (17984 vs. 18624), because the 1 x 1 convolutions reduce the channel count before the expensive 3 x 3 convolution and expand it again afterwards. The advantage is that wide feature maps can be processed cheaply, which makes very deep networks practical; the disadvantage is that the 1 x 1 reduction compresses the representation and the block is deeper (three convolutions instead of two).
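As a sanity check on the numbers in problem 2, the forward and backward pass can be reproduced with a few lines of numpy. This is only an illustrative sketch, not part of the assignment; it assumes tanh hidden units with a linear output node, and it rounds the activations to two decimals the way the hand computation does, so the printed gradients agree with the values above up to the last digit.

    import numpy as np

    # Weights and data from problem 2; the bias weight is the first row/entry of each matrix.
    W1 = np.array([[0.1, 0.2],      # row 0: bias weights, row 1: input weight;
                   [0.3, 0.4]])     # one column per hidden unit
    W2 = np.array([[0.2], [1.0], [-3.0]])
    W3 = np.array([[1.0], [2.0]])
    y = 1.0

    # Forward pass (activations rounded to 2 decimals to mirror the hand computation).
    x0 = np.array([[1.0], [2.0]])                        # [1; x]
    s1 = W1.T @ x0                                       # [0.7; 1.0]
    x1 = np.vstack(([[1.0]], np.round(np.tanh(s1), 2)))  # [1; 0.60; 0.76]
    s2 = W2.T @ x1                                       # -1.48
    x2 = np.vstack(([[1.0]], np.round(np.tanh(s2), 2)))  # [1; -0.90]
    s3 = W3.T @ x2                                       # -0.80
    x3 = s3                                              # linear output node

    # Backward pass for e = (x3 - y)^2.
    d3 = 2 * (x3 - y)                                    # -3.6
    d2 = (1 - x2[1:]**2) * (W3[1:] @ d3)                 # -1.368
    d1 = (1 - x1[1:]**2) * (W2[1:] @ d2)                 # [-0.88; 1.73]

    # Gradients: de/dW(l) = x(l-1) (delta(l))^T.
    print(np.round(x0 @ d1.T, 2))   # de/dW1
    print(np.round(x1 @ d2.T, 2))   # de/dW2
    print(np.round(x2 @ d3.T, 2))   # de/dW3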
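The totals in problem 3 can be checked the same way. A minimal counting sketch, under the assumptions used above (each convolution has one bias per filter, and each batch-norm layer has a gamma and a beta per feature map):

    def conv_params(k, in_ch, out_ch):
        # k x k convolution: (k*k*in_ch weights + 1 bias) per output filter
        return (k * k * in_ch + 1) * out_ch

    def bn_params(ch):
        # batch normalization: one gamma and one beta per feature map
        return 2 * ch

    # Standard residual block, 16 x 16 x 32 input and output.
    standard = (conv_params(3, 32, 32) + bn_params(32) +
                conv_params(3, 32, 32) + bn_params(32))

    # Bottleneck block, 16 x 16 x 128 input and output (1x1 reduce -> 3x3 -> 1x1 expand).
    bottleneck = (conv_params(1, 128, 32) + bn_params(32) +
                  conv_params(3, 32, 32)  + bn_params(32) +
                  conv_params(1, 32, 128) + bn_params(128))

    print(standard, bottleneck)   # 18624 17984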
4. (20 points) Using batch normalization in training requires computing the mean and variance of a tensor.

(a) (8 points) Suppose the tensor x is the output of a fully-connected layer and we want to perform batch normalization on it. The training batch size is N and the fully-connected layer has C output nodes. Therefore, the shape of x is N x C. What is the shape of the mean and variance computed in batch normalization, respectively?

Both the mean and the variance have shape 1 x C. The statistics are computed over the batch dimension, separately for each of the C output nodes, so there is one mean and one variance per node.

(b) (12 points) Now suppose the tensor x is the output of a 2D convolution and has shape N x H x W x C. What is the shape of the mean and variance computed in batch normalization, respectively?

Both the mean and the variance have shape 1 x 1 x 1 x C. Instead of computing a separate mean for every spatial position, batch normalization averages over the batch and both spatial dimensions, so a single mean and a single variance is computed for each channel.

5.
a) Writing out each output explicitly, we get one equation for each of y_{11}, y_{12}, y_{21} and y_{22}. Every output is a weighted sum of the inputs in its window, and all four equations reuse the same filter weights:

y_{ij} = \sum_{m} \sum_{n} w_{mn} \, x_{i+m-1,\, j+n-1}

Each of these equations can equivalently be written as a row vector that holds the filter weights in the positions of the inputs covered by the window and zeros everywhere else, multiplied by the flattened input vector x̂; we obtain one such row for y_{11}, and similarly for y_{12}, y_{21} and y_{22}.

c) Combining all these equations, we have

ŷ = Ŵ x̂

where Ŵ stacks the four weight rows. Ŵ is sparse and banded: the same filter weights appear in every row, shifted to the positions of the corresponding input window, with zeros elsewhere, so the convolution layer is a single matrix multiplication with extensive weight sharing.
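The shapes in problem 4 can be confirmed with numpy by reducing over the appropriate axes; keepdims=True shows the broadcastable form. The array sizes below are only example values.

    import numpy as np

    N, H, W, C = 128, 16, 16, 32

    # (a) Output of a fully-connected layer: N x C. Statistics are taken over the batch axis.
    x_fc = np.random.randn(N, C)
    print(x_fc.mean(axis=0, keepdims=True).shape)            # (1, 32) -> 1 x C
    print(x_fc.var(axis=0, keepdims=True).shape)             # (1, 32) -> 1 x C

    # (b) Output of a 2D convolution: N x H x W x C. Statistics are taken over
    #     the batch and both spatial axes, leaving one value per channel.
    x_conv = np.random.randn(N, H, W, C)
    print(x_conv.mean(axis=(0, 1, 2), keepdims=True).shape)  # (1, 1, 1, 32) -> 1 x 1 x 1 x C
    print(x_conv.var(axis=(0, 1, 2), keepdims=True).shape)   # (1, 1, 1, 32) -> 1 x 1 x 1 x C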
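To make the construction in problem 5 concrete, the sketch below builds the matrix Ŵ for a small hypothetical case, a 3 x 3 input with a 2 x 2 filter giving the four outputs y_{11}, y_{12}, y_{21}, y_{22} (the sizes in the assignment may differ), and checks that ŷ = Ŵ x̂ reproduces the direct convolution.

    import numpy as np

    # Hypothetical sizes for illustration only: 3x3 input, 2x2 filter -> 2x2 output.
    x = np.arange(9, dtype=float).reshape(3, 3)
    w = np.array([[1.0, 2.0],
                  [3.0, 4.0]])

    # Build W_hat: one row per output, holding the filter weights at the positions
    # of the corresponding input window and zeros everywhere else.
    W_hat = np.zeros((4, 9))
    for i in range(2):
        for j in range(2):
            row = np.zeros((3, 3))
            row[i:i+2, j:j+2] = w
            W_hat[i * 2 + j] = row.ravel()

    # y_hat = W_hat @ x_hat reproduces the direct convolution.
    y_hat = W_hat @ x.ravel()
    y_direct = np.array([[np.sum(w * x[i:i+2, j:j+2]) for j in range(2)]
                         for i in range(2)])
    print(np.allclose(y_hat, y_direct.ravel()))   # True
    print(W_hat)  # sparse and banded, with the same weights repeated in every row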