Deep Learning Assignment: Gradient, Backpropagation, Batch Norm

Deep Learning Assignment 2 1. 𝑁 1 ∇𝐸𝑖𝑛 (𝑤) = ∇( ∑(𝑡𝑎𝑛ℎ(𝑤 𝑇 𝑥𝑛 ) − 𝑦𝑛 ) 2 ) 𝑁 𝑛=1 𝑁 1 ∇𝐸𝑖𝑛 (𝑤) = ( ∑ ∇ (𝑡𝑎𝑛ℎ(𝑤 𝑇 𝑥𝑛 ) − 𝑦𝑛 )2 ) 𝑁 𝑛=1 𝑁 2 ∇𝐸𝑖𝑛 (𝑤) = ( ∑(𝑡𝑎𝑛ℎ(𝑤 𝑇 𝑥𝑛 ) − 𝑦𝑛 ) ∇(𝑡𝑎𝑛ℎ(𝑤 𝑇 𝑥𝑛 ) − 𝑦𝑛 )) 𝑁 𝑛=1 𝑁 2 ∇𝐸𝑖𝑛 (𝑤) = ( ∑(𝑡𝑎𝑛ℎ(𝑤 𝑇 𝑥𝑛 ) − 𝑦𝑛 ) ∇(𝑡𝑎𝑛ℎ(𝑤 𝑇 𝑥𝑛 ) − 𝑦𝑛 )) 𝑁 𝑛=1 𝑑 (𝑒 𝑥 + 𝑒 −𝑥 ) (𝑒 𝑥 + 𝑒 −𝑥 ) − (𝑒 𝑥 − 𝑒 −𝑥 ) (𝑒 𝑥 − 𝑒 −𝑥 ) 𝑡𝑎𝑛ℎ(𝑥) = 𝑑𝑥 (𝑒 𝑥 + 𝑒 −𝑥 )2 𝑑 (𝑒 𝑥 − 𝑒 −𝑥 )2 𝑡𝑎𝑛ℎ(𝑥) = 1 − 𝑑𝑥 (𝑒 𝑥 + 𝑒 −𝑥 )2 𝑑 𝑡𝑎𝑛ℎ(𝑥) = 1 − 𝑡𝑎𝑛ℎ(𝑥)2 𝑑𝑥 𝑁 2 ∇𝐸𝑖𝑛 (𝑤) = ( ∑(𝑡𝑎𝑛ℎ(𝑤 𝑇 𝑥𝑛 ) − 𝑦𝑛 ) (1 − 𝑡𝑎𝑛ℎ2 (𝑤 𝑇 𝑥𝑛 ))∇(𝑤 𝑇 𝑥𝑛 )) 𝑁 𝑛=1 𝑁 2 ∇𝐸𝑖𝑛 (𝑤) = ( ∑(𝑡𝑎𝑛ℎ(𝑤 𝑇 𝑥𝑛 ) − 𝑦𝑛 ) (1 − 𝑡𝑎𝑛ℎ2 (𝑤 𝑇 𝑥)) 𝑥𝑛 ) 𝑁 𝑛=1 As w → ∞, the slope of tanh function becomes almost zero that results in slow learning which makes it difficult to optimize the perceptron. 2. W(1) = [ 0.2 W(2) = [ 1 ] −3 0.1 0.2 ] 0.3 0.4 1 W(2) = [ ] 2 x = 2, y = 1 x(0) s(1) x(1) s(2) 1 [ ] 2 0.1 0.3 1 0.7 [ ][ ] = [ ] 0.2 0.4 2 1 1 [ 0.6 ] 0.76 [−1.48] 𝜹(𝟑) 𝜹(𝟐) [−3.6] [(1 − 0.9)2 . 2 . (−3.6) ] = 1.368 𝝏𝒆 𝝏𝑊 (1) −𝟎. 𝟖𝟖 𝟏. 𝟕𝟑 = [ ] −𝟏. 𝟕𝟔 𝟑. 𝟒𝟔 [ x(2) s(3) x(3) 1 ] −0.90 [−0.8] [−0.8] 𝜹(𝟏) 𝝏𝒆 𝝏𝑊 (2) [ −0.88 ] 1.73 −𝟏. 𝟑𝟕 = [−𝟎. 𝟖𝟐] −𝟏. 𝟎𝟒 𝝏𝒆 𝝏𝑊 (3) −𝟑. 𝟔 = [ ] 𝟑. 𝟐𝟒 3. (20 points) Consider the standard residual block and the bottleneck block in the case where inputs and outputs have the same dimension (e.g. Figure 5 in [1]). In another word, the residual connection is an identity connection. For the standard residual block, compute the number of training parameters when the dimension of inputs and outputs is 128x16x16x32. Here, 128 is the batch size, 16 x 16 is the spatial size of feature maps, and 32 is the number of channels. For the bottleneck block, compute the number of training parameters when the dimension of inputs and outputs is 128x16x16x128. Compare the two results and explain the advantages and disadvantages of the bottleneck block. For Standard residual block, Given input dimensions 16 x 16 x 32 Number of Training parameters in Conv1 3 x 3, 32 = (3*3*32+1)*32 Number of Training parameters in batchnorm1 32 feature maps = 2*32 Number of Training parameters in Conv2 3 x 3, 32 = (3*3*32+1)*32 Number of Training parameters in batchnorm2 32 feature maps = 2*32 Total Number of Training parameters = 18624 For Bottleneck block, Given input dimensions 16 x 16 x 32 Number of Training parameters in conv1 1 x 1, 32 = (1*1*128+1)*32 Number of Training parameters in batchnorm1 32 feature maps = 2*32 Number of Training parameters in Conv2 3 x 3, 32 = (3*3*32+1)*32 Number of Training parameters in batchnorm2 32 feature maps = 2*32 Number of Training parameters in Conv3 1 x 1, 128 = (1*1*32+1)*128 Number of Training parameters in batchnorm3 128 feature maps = 2*128 Total Number of Training parameters = 17984 4. (20 points) Using batch normalization in training requires computing the mean and variance of a tensor. (a) (8 points) Suppose the tensor x is the output of a fully-connected layer and we want to perform batch normalization on it. The training batch size is N and the fully-connected layer has C output nodes. Therefore, the shape of x is N x C. What is the shape of the mean and variance computed in batch normalization, respectively? 1 x C, 1 x C Mean and variance have same shape. Mean depends on number of channels. Since it (b) (12 points) Now suppose the tensor x is the output of a 2D convolution and has shape N x H x W x C. What is the shape of the mean and variance computed in batch normalization, respectively? b) 1 x C x 1 x 1, 1 x C x 1 x 1 Instead of calculating mean for each element in a matrix and in a specific channel, it is calculated for an entire matrix for a single channel. 5. a) We get equation for 𝒚𝟏𝟏 , 𝒚𝟏𝟐 , 𝒚𝟐𝟏 , 𝒚𝟐𝟐 as below 𝟏𝟏 𝟏𝟏 𝟐𝟏 𝟐𝟏 𝟐𝟏 𝒚𝟏𝟏 = 𝒘𝟏𝟏 𝟏 𝒙𝟏𝟏 + 𝒘𝟐 𝒙𝟏𝟐 + 𝒘𝟑 𝒙𝟏𝟑 + 𝒘𝟏 𝒙𝟐𝟏 + 𝒘𝟐 𝒙𝟐𝟐 + 𝒘𝟑 𝒙𝟐𝟑 𝟏𝟏 𝟏𝟏 𝟐𝟏 𝟐𝟏 𝟐𝟏 𝒚𝟏𝟐 = 𝒘𝟏𝟏 𝟏 𝒙𝟏𝟐 + 𝒘𝟐 𝒙𝟏𝟑 + 𝒘𝟑 𝒙𝟏𝟒 + 𝒘𝟏 𝒙𝟐𝟐 + 𝒘𝟐 𝒙𝟐𝟑 + 𝒘𝟑 𝒙𝟐𝟒 𝟏𝟐 𝟏𝟐 𝟐𝟐 𝟐𝟐 𝟐𝟐 𝒚𝟐𝟏 = 𝒘𝟏𝟐 𝟏 𝒙𝟏𝟏 + 𝒘𝟐 𝒙𝟏𝟐 + 𝒘𝟑 𝒙𝟏𝟑 + 𝒘𝟏 𝒙𝟐𝟏 + 𝒘𝟐 𝒙𝟐𝟐 + 𝒘𝟑 𝒙𝟐𝟑 𝟏𝟐 𝟏𝟐 𝟐𝟐 𝟐𝟐 𝟐𝟐 𝒚𝟐𝟐 = 𝒘𝟏𝟐 𝟏 𝒙𝟏𝟐 + 𝒘𝟐 𝒙𝟏𝟑 + 𝒘𝟑 𝒙𝟏𝟒 + 𝒘𝟏 𝒙𝟐𝟐 + 𝒘𝟐 𝒙𝟐𝟑 + 𝒘𝟑 𝒙𝟐𝟒 𝟏𝟏 𝒚 𝒚𝟏𝟏 = [ 𝒘𝟏𝟏 𝟏 𝒘𝟏𝟏 𝟐 𝒘𝟏𝟏 𝟑 𝒙𝟏𝟏 𝟐𝟏 𝟐𝟏 ] [𝒙𝟏𝟐 ] + [ 𝒘𝟐𝟏 𝟏 𝒘𝟐 𝒘𝟑 ] 𝒙𝟏𝟑 𝒙𝟐𝟏 [𝒙𝟐𝟐 ] 𝒙𝟐𝟑 𝒙𝟏𝟏 𝒙𝟏𝟐 𝒙𝟏𝟑 𝒙𝟏𝟒 𝟏𝟏 𝟏𝟏 𝟐𝟏 𝟐𝟏 = [ 𝒘𝟏𝟏 + [𝟎 𝟎 𝟎 𝟎 𝒘𝟐𝟏 𝟏 𝒘 𝟐 𝒘𝟑 𝟎 𝟎 𝟎 𝟎 𝟎 ] 𝒙 𝟏 𝒘𝟐 𝒘𝟑 𝟎] 𝟐𝟏 𝒙𝟐𝟐 𝒙𝟐𝟑 [𝒙𝟐𝟒 ] 𝒙𝟏𝟏 𝒙𝟏𝟐 𝒙𝟏𝟑 𝒙𝟏𝟒 𝒙𝟐𝟏 𝒙𝟐𝟐 𝒙𝟐𝟑 [𝒙𝟐𝟒 ] Similarly, we get for 𝒚𝟏𝟐 𝒙𝟏𝟏 𝒙𝟏𝟐 𝒙𝟏𝟑 𝒙𝟏𝟒 𝟏𝟏 𝟏𝟏 𝟐𝟏 𝟐𝟏 = [ 𝟎 𝒘𝟏𝟏 + [𝟎 𝟎 𝟎 𝟎 𝟎 𝒘𝟐𝟏 𝟏 𝒘𝟐 𝒘𝟑 𝟎 𝟎 𝟎 𝟎 ] 𝒙 𝟏 𝒘𝟐 𝒘𝟑 ] 𝟐𝟏 𝒙𝟐𝟐 𝒙𝟐𝟑 [𝒙𝟐𝟒 ] 𝒙𝟏𝟏 𝒙𝟏𝟐 𝒙𝟏𝟑 𝒙𝟏𝟒 𝒙𝟐𝟏 𝒙𝟐𝟐 𝒙𝟐𝟑 [𝒙𝟐𝟒 ] 𝒚𝟐𝟏 𝒙𝟏𝟏 𝒙𝟏𝟐 𝒙𝟏𝟑 𝒙𝟏𝟒 𝟏𝟐 𝟏𝟐 𝟐𝟐 𝟐𝟐 = [ 𝒘𝟏𝟐 + [𝟎 𝟎 𝟎 𝟎 𝒘𝟐𝟐 𝟏 𝒘 𝟐 𝒘𝟑 𝟎 𝟎 𝟎 𝟎 𝟎 ] 𝒙 𝟏 𝒘𝟐 𝒘𝟑 𝟎] 𝟐𝟏 𝒙𝟐𝟐 𝒙𝟐𝟑 [𝒙𝟐𝟒 ] 𝒙𝟏𝟏 𝒙𝟏𝟐 𝒙𝟏𝟑 𝒙𝟏𝟒 𝒙𝟐𝟏 𝒙𝟐𝟐 𝒙𝟐𝟑 [𝒙𝟐𝟒 ] 𝒚𝟐𝟐 𝒙𝟏𝟏 𝒙𝟏𝟐 𝒙𝟏𝟑 𝒙𝟏𝟒 𝟏𝟐 𝟏𝟐 𝟐𝟐 𝟐𝟐 = [ 𝟎 𝒘𝟏𝟐 + [𝟎 𝟎 𝟎 𝟎 𝟎 𝒘𝟐𝟐 𝟏 𝒘𝟐 𝒘𝟑 𝟎 𝟎 𝟎 𝟎 ] 𝒙 𝟏 𝒘𝟐 𝒘𝟑 ] 𝟐𝟏 𝒙𝟐𝟐 𝒙𝟐𝟑 [𝒙𝟐𝟒 ] 𝒙𝟏𝟏 𝒙𝟏𝟐 𝒙𝟏𝟑 𝒙𝟏𝟒 𝒙𝟐𝟏 𝒙𝟐𝟐 𝒙𝟐𝟑 [𝒙𝟐𝟒 ] Combining all these equations, we have 𝒘𝟏𝟏 𝟏 0 𝑌̃ = 𝒘𝟏𝟐 𝟏 [ 0 c) 𝒘𝟏𝟏 𝟐 𝒘𝟏𝟏 𝟏 𝒘𝟏𝟐 𝟐 𝒘𝟏𝟐 𝟏 𝒘𝟏𝟏 𝟑 𝒘𝟏𝟏 𝟐 𝒘𝟏𝟐 𝟑 𝒘𝟏𝟐 𝟐 0 𝒘𝟏𝟏 𝟑 0 𝒘𝟏𝟐 𝟑 𝒘𝟐𝟏 𝟏 0 𝒘𝟐𝟐 𝟏 0 𝒘𝟐𝟏 𝟐 𝒘𝟐𝟏 𝟏 𝒘𝟐𝟐 𝟐 𝒘𝟐𝟐 𝟏 𝒘𝟐𝟏 𝟑 𝒘𝟐𝟏 𝟐 𝒘𝟐𝟐 𝟑 𝒘𝟐𝟐 𝟐 0 𝒘𝟐𝟏 𝟑 𝑋̃ 0 𝒘𝟐𝟐 𝟑 ]

Deep Learning Assignment: Gradient, Backpropagation, Batch Norm

Related documents

Products

Support

Deep Learning Assignment: Gradient, Backpropagation, Batch Norm

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib