Convolutional Neural Networks

The Cross-Correlation Operation

The input is a two-dimensional tensor with a height of 3 and a width of 3. We mark the shape of the tensor as 3×3 or (3, 3). The height and width of the kernel are both 2. The shape of the kernel window (or convolution window) is given by the height and width of the kernel (here it is 2×2).

In the two-dimensional cross-correlation operation, we begin with the convolution window positioned at the upper-left corner of the input tensor and slide it across the input tensor, both from left to right and from top to bottom. When the convolution window slides to a certain position, the input subtensor contained in that window and the kernel tensor are multiplied elementwise and the resulting tensor is summed up, yielding a single scalar value. This result gives the value of the output tensor at the corresponding location. Here, the output tensor has a height of 2 and a width of 2, and its four elements are derived from the two-dimensional cross-correlation operation:

0×0 + 1×1 + 3×2 + 4×3 = 19,
1×0 + 2×1 + 4×2 + 5×3 = 25,
3×0 + 4×1 + 6×2 + 7×3 = 37,
4×0 + 5×1 + 7×2 + 8×3 = 43.

Note that along each axis, the output size is slightly smaller than the input size. Because the kernel has width and height greater than one, we can only properly compute the cross-correlation for locations where the kernel fits wholly within the image, so the output size is given by the input size $n_h \times n_w$ minus the size of the convolution kernel $k_h \times k_w$ via

$$(n_h - k_h + 1) \times (n_w - k_w + 1).$$

In [1]:
import torch
from torch import nn
from d2l import torch as d2l

In [2]:
def corr2d(X, K):  #@save
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

In [3]:
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
corr2d(X, K)

Out[3]:
tensor([[19., 25.],
        [37., 43.]])

Convolutional Layers

A convolutional layer cross-correlates the input and kernel and adds a scalar bias to produce an output. The two parameters of a convolutional layer are the kernel and the scalar bias. When training models based on convolutional layers, we typically initialize the kernels randomly, just as we would with a fully-connected layer.

We are now ready to implement a two-dimensional convolutional layer based on the corr2d function defined above. In the __init__ constructor, we declare weight and bias as the two model parameters. The forward propagation function calls the corr2d function and adds the bias.

In [4]:
class Conv2D(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(kernel_size))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return corr2d(x, self.weight) + self.bias

Object Edge Detection in Images

Let us take a moment to parse a simple application of a convolutional layer: detecting the edge of an object in an image by finding the location of the pixel change. First, we construct an "image" of 6×8 pixels. The middle four columns are black (0) and the rest are white (1).
In [5]:
X = torch.ones((6, 8))
X[:, 2:6] = 0
X

Out[5]:
tensor([[1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.]])

Next, we construct a kernel K with a height of 1 and a width of 2. When we perform the cross-correlation operation with the input, if the horizontally adjacent elements are the same, the output is 0. Otherwise, the output is non-zero.

In [6]:
K = torch.tensor([[1.0, -1.0]])

We are ready to perform the cross-correlation operation with arguments X (our input) and K (our kernel). As you can see, we detect 1 for the edge from white to black and -1 for the edge from black to white. All other outputs take value 0.

In [7]:
Y = corr2d(X, K)
Y

Out[7]:
tensor([[ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.]])

We can now apply the kernel to the transposed image. As expected, it vanishes. The kernel K only detects vertical edges.

In [8]:
corr2d(X.t(), K.t())

Out[8]:
tensor([[ 0.,  0.,  0.,  0.,  0.,  0.],
        [ 1.,  1.,  1.,  1.,  1.,  1.],
        [ 0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.],
        [-1., -1., -1., -1., -1., -1.],
        [ 0.,  0.,  0.,  0.,  0.,  0.]])
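As a quick sanity check (ours, not part of the original notebook), the hand-written Conv2D layer from above reproduces Y once we copy the edge-detection kernel into its weight and zero its bias:

layer = Conv2D(kernel_size=(1, 2))
with torch.no_grad():
    layer.weight[:] = K   # use the [1, -1] edge detector instead of random weights
    layer.bias[:] = 0.0
print(torch.allclose(layer(X), Y))  # True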
Learning a Kernel

Designing an edge detector by finite differences [1, -1] is neat if we know this is precisely what we are looking for. However, as we look at larger kernels, and consider successive layers of convolutions, it might be impossible to specify precisely what each filter should be doing manually.

Now let us see whether we can learn the kernel that generated Y from X by looking at the input-output pairs only. We first construct a convolutional layer and initialize its kernel as a random tensor. Next, in each iteration, we will use the squared error to compare Y with the output of the convolutional layer. We can then calculate the gradient to update the kernel. For the sake of simplicity, in the following we use the built-in class for two-dimensional convolutional layers and ignore the bias.

In [9]:
# Construct a two-dimensional convolutional layer with 1 output channel and a
# kernel of shape (1, 2). For the sake of simplicity, we ignore the bias here
conv2d = nn.Conv2d(1, 1, kernel_size=(1, 2), bias=False)

In [10]:
# The two-dimensional convolutional layer uses four-dimensional input and
# output in the format of (example, channel, height, width), where the batch
# size (number of examples in the batch) and the number of channels are both 1
X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))
lr = 3e-2  # Learning rate

In [11]:
for i in range(10):
    Y_hat = conv2d(X)
    l = (Y_hat - Y)**2
    conv2d.zero_grad()
    l.sum().backward()
    # Update the kernel
    conv2d.weight.data[:] -= lr * conv2d.weight.grad
    print(f'batch {i + 1}, loss {l.sum():.3f}')

batch 1, loss 22.013
batch 2, loss 9.498
batch 3, loss 4.199
batch 4, loss 1.917
batch 5, loss 0.912
batch 6, loss 0.454
batch 7, loss 0.238
batch 8, loss 0.130
batch 9, loss 0.075
batch 10, loss 0.044

In [12]:
conv2d.weight.data.reshape((1, 2))

Out[12]:
tensor([[ 1.0052, -0.9651]])

Padding

One tricky issue when applying convolutional layers is that we tend to lose pixels on the perimeter of our image. Since we typically use small kernels, for any given convolution we might only lose a few pixels, but this can add up as we apply many successive convolutional layers. One straightforward solution to this problem is to add extra pixels of filler around the boundary of our input image, thus increasing the effective size of the image. Typically, we set the values of the extra pixels to zero. In the given example, we pad a 3×3 input, increasing its size to 5×5. The corresponding output then increases to a 4×4 matrix. The shaded portions are the first output element as well as the input and kernel tensor elements used for the output computation: 0×0 + 0×1 + 0×2 + 0×3 = 0.

In general, if we add a total of $p_h$ rows of padding (roughly half on top and half on bottom) and a total of $p_w$ columns of padding (roughly half on the left and half on the right), the output shape will be

$$(n_h - k_h + p_h + 1) \times (n_w - k_w + p_w + 1).$$

CNNs commonly use convolution kernels with odd height and width values, such as 1, 3, 5, or 7. Choosing odd kernel sizes has the benefit that we can preserve the spatial dimensionality while padding with the same number of rows on top and bottom, and the same number of columns on left and right. Moreover, this practice of using odd kernels and padding to precisely preserve dimensionality offers a clerical benefit. For any two-dimensional tensor X, when the kernel's size is odd and the number of padding rows and columns on all sides is the same, producing an output with the same height and width as the input, we know that the output Y[i, j] is calculated by cross-correlation of the input and convolution kernel with the window centered on X[i, j].
In [13]:
# We define a convenience function to calculate the convolutional layer. This
# function initializes the convolutional layer weights and performs
# corresponding dimensionality elevations and reductions on the input and
# output
def comp_conv2d(conv2d, X):
    # Here (1, 1) indicates that the batch size and the number of channels
    # are both 1
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    # Exclude the first two dimensions that do not interest us: examples and
    # channels
    return Y.reshape(Y.shape[2:])

In [14]:
# Note that here 1 row or column is padded on either side, so a total of 2
# rows or columns are added
conv2d = nn.Conv2d(1, 1, kernel_size=3, padding=1)
X = torch.rand(size=(8, 8))
comp_conv2d(conv2d, X).shape

Out[14]:
torch.Size([8, 8])

In [15]:
# Here, we use a convolution kernel with a height of 5 and a width of 3. The
# padding numbers on either side of the height and width are 2 and 1,
# respectively
conv2d = nn.Conv2d(1, 1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape

Out[15]:
torch.Size([8, 8])

Stride

When computing the cross-correlation, we start with the convolution window at the upper-left corner of the input tensor, and then slide it over all locations both down and to the right. In previous examples, we defaulted to sliding one element at a time. However, sometimes, either for computational efficiency or because we wish to downsample, we move our window more than one element at a time, skipping the intermediate locations. We refer to the number of rows and columns traversed per slide as the stride. So far, we have used strides of 1, both for height and width. Sometimes, we may want to use a larger stride. The example shows a two-dimensional cross-correlation operation with a stride of 3 vertically and 2 horizontally. The shaded portions are the output elements as well as the input and kernel tensor elements used for the output computation: 0×0 + 0×1 + 1×2 + 2×3 = 8, 0×0 + 6×1 + 0×2 + 0×3 = 6. We can see that when the second element of the first column is output, the convolution window slides down three rows. The convolution window slides two columns to the right when the second element of the first row is output. When the convolution window continues to slide two columns to the right on the input, there is no output because the input element cannot fill the window (unless we add another column of padding).

In general, when the stride for the height is $s_h$ and the stride for the width is $s_w$, the output shape is

$$\lfloor (n_h - k_h + p_h + s_h)/s_h \rfloor \times \lfloor (n_w - k_w + p_w + s_w)/s_w \rfloor.$$

In [16]:
conv2d = nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2)
comp_conv2d(conv2d, X).shape

Out[16]:
torch.Size([4, 4])

In [17]:
conv2d = nn.Conv2d(1, 1, kernel_size=(3, 5), padding=(0, 1), stride=(3, 4))
comp_conv2d(conv2d, X).shape

Out[17]:
torch.Size([2, 2])
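To make the shape arithmetic concrete, here is a small helper (ours, not part of the original notebook; the name conv_output_shape is an assumption) that evaluates the formula above along one axis, with the total padding p counting both sides. It reproduces the shapes reported by comp_conv2d:

import math

def conv_output_shape(n, k, p=0, s=1):
    """Output size along one axis: floor((n - k + p + s) / s)."""
    return math.floor((n - k + p + s) / s)

# kernel_size=3, padding=1 per side (p=2 in total), stride=2 on an 8x8 input:
# floor((8 - 3 + 2 + 2) / 2) = 4 along each axis, matching Out[16]
print(conv_output_shape(8, 3, p=2, s=2))
# kernel_size=(3, 5), padding=(0, 1) per side, stride=(3, 4), matching Out[17]
print(conv_output_shape(8, 3, p=0, s=3), conv_output_shape(8, 5, p=2, s=4))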
Multiple Input Channels

When the input data contain multiple channels (e.g., color images have the standard RGB channels to indicate the amount of red, green and blue), we need to construct a convolution kernel with the same number of input channels as the input data, so that it can perform cross-correlation with the input data. Assuming that the number of channels for the input data is $c_i$, the number of input channels of the convolution kernel also needs to be $c_i$.

If our convolution kernel's window shape is $k_h \times k_w$, then when $c_i = 1$ we can think of our convolution kernel as just a two-dimensional tensor of shape $k_h \times k_w$. However, when $c_i > 1$, we need a kernel that contains a tensor of shape $k_h \times k_w$ for every input channel. Concatenating these $c_i$ tensors together yields a convolution kernel of shape $c_i \times k_h \times k_w$. Since the input and convolution kernel each have $c_i$ channels, we can perform a cross-correlation operation on the two-dimensional tensor of the input and the two-dimensional tensor of the convolution kernel for each channel, adding the $c_i$ results together (summing over the channels) to yield a two-dimensional tensor. This is the result of a two-dimensional cross-correlation between a multi-channel input and a multi-input-channel convolution kernel.

In the example below, we demonstrate a two-dimensional cross-correlation with two input channels. The shaded portions are the first output element as well as the input and kernel tensor elements used for the output computation: (1×1 + 2×2 + 4×3 + 5×4) + (0×0 + 1×1 + 3×2 + 4×3) = 56.

In [18]:
def corr2d_multi_in(X, K):
    # First, iterate through the 0th dimension (channel dimension) of `X` and
    # `K`. Then, add them together
    return sum(d2l.corr2d(x, k) for x, k in zip(X, K))

In [19]:
X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
                  [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])
corr2d_multi_in(X, K)

Out[19]:
tensor([[ 56.,  72.],
        [104., 120.]])
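As a sanity check (ours, not part of the original notebook), the built-in nn.Conv2d gives the same result when we copy this kernel into a layer with two input channels and one output channel; the variable names below are our own:

conv = nn.Conv2d(2, 1, kernel_size=2, bias=False)
with torch.no_grad():
    conv.weight[:] = K.reshape(1, 2, 2, 2)  # (out channels, in channels, height, width)
# nn.Conv2d expects a (batch, channel, height, width) input
print(conv(X.reshape(1, 2, 3, 3)).reshape(2, 2))  # matches Out[19]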
Multiple Output Channels

Regardless of the number of input channels, so far we always ended up with one output channel. However, it turns out to be essential to have multiple channels at each layer. In the most popular neural network architectures, we actually increase the channel dimension as we go higher up in the neural network, typically downsampling to trade off spatial resolution for greater channel depth. Intuitively, you could think of each channel as responding to some different set of features. Reality is a bit more complicated than the most naive interpretations of this intuition, since representations are not learned independently but are rather optimized to be jointly useful. So it may not be that a single channel learns an edge detector but rather that some direction in channel space corresponds to detecting edges.

Denote by $c_i$ and $c_o$ the number of input and output channels, respectively, and let $k_h$ and $k_w$ be the height and width of the kernel. To get an output with multiple channels, we can create a kernel tensor of shape $c_i \times k_h \times k_w$ for every output channel. We concatenate them on the output channel dimension, so that the shape of the convolution kernel is $c_o \times c_i \times k_h \times k_w$. In cross-correlation operations, the result on each output channel is calculated from the convolution kernel corresponding to that output channel and takes input from all channels in the input tensor.

In [20]:
def corr2d_multi_in_out(X, K):
    # Iterate through the 0th dimension of `K`, and each time, perform
    # cross-correlation operations with input `X`. All of the results are
    # stacked together
    return torch.stack([corr2d_multi_in(X, k) for k in K], 0)

In [21]:
K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])
K = torch.stack((K, K + 1, K + 2), 0)
K.shape

Out[21]:
torch.Size([3, 2, 2, 2])

In [22]:
K

Out[22]:
tensor([[[[0., 1.],
          [2., 3.]],
         [[1., 2.],
          [3., 4.]]],

        [[[1., 2.],
          [3., 4.]],
         [[2., 3.],
          [4., 5.]]],

        [[[2., 3.],
          [4., 5.]],
         [[3., 4.],
          [5., 6.]]]])

In [23]:
X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
                  [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])

In [24]:
corr2d_multi_in_out(X, K)

Out[24]:
tensor([[[ 56.,  72.],
         [104., 120.]],

        [[ 76., 100.],
         [148., 172.]],

        [[ 96., 128.],
         [192., 224.]]])
1×1 Convolutional Layer

At first, a 1×1 convolution, i.e., $k_h = k_w = 1$, does not seem to make much sense. After all, a convolution correlates adjacent pixels. A 1×1 convolution obviously does not. Nonetheless, they are popular operations that are sometimes included in the designs of complex deep networks. Let us see in some detail what it actually does.

Because the minimum window is used, the 1×1 convolution loses the ability of larger convolutional layers to recognize patterns consisting of interactions among adjacent elements in the height and width dimensions. The only computation of the 1×1 convolution occurs on the channel dimension. The example below shows the cross-correlation computation using the 1×1 convolution kernel with 3 input channels and 2 output channels. Note that the inputs and outputs have the same height and width. Each element in the output is derived from a linear combination of elements at the same position in the input image. You could think of the 1×1 convolutional layer as constituting a fully-connected layer applied at every single pixel location to transform the $c_i$ corresponding input values into $c_o$ output values. Because this is still a convolutional layer, the weights are tied across pixel locations. Thus the 1×1 convolutional layer requires $c_o \times c_i$ weights (plus the bias).

In [25]:
def corr2d_multi_in_out_1x1(X, K):
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X = X.reshape((c_i, h * w))
    K = K.reshape((c_o, c_i))
    # Matrix multiplication in the fully-connected layer
    Y = torch.matmul(K, X)
    return Y.reshape((c_o, h, w))

In [26]:
X = torch.normal(0, 1, (3, 3, 3))
K = torch.normal(0, 1, (2, 3, 1, 1))
Y1 = corr2d_multi_in_out_1x1(X, K)
Y1

Out[26]:
tensor([[[-1.2791,  0.4097,  0.5345],
         [ 0.1310, -1.8611, -0.5171],
         [-1.1235, -0.7653, -1.7798]],

        [[-2.0987,  3.3594, -0.5957],
         [-0.1891, -3.5485,  1.1903],
         [-2.5009, -0.1530, -1.4922]]])

In [27]:
def corr2d_multi_in_out(X, K):
    return torch.stack([corr2d_multi_in(X, k) for k in K], 0)

In [28]:
Y2 = corr2d_multi_in_out(X, K)
Y2

Out[28]:
tensor([[[-1.2791,  0.4097,  0.5345],
         [ 0.1310, -1.8611, -0.5171],
         [-1.1235, -0.7653, -1.7798]],

        [[-2.0987,  3.3594, -0.5957],
         [-0.1891, -3.5485,  1.1903],
         [-2.5009, -0.1530, -1.4922]]])

In [29]:
assert float(torch.abs(Y1 - Y2).sum()) < 1e-6
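The same correspondence holds for the built-in layer. A minimal sketch (ours, not part of the original notebook) that loads K into an nn.Conv2d with kernel_size=1 and reproduces Y1 up to floating-point error:

conv = nn.Conv2d(3, 2, kernel_size=1, bias=False)
with torch.no_grad():
    conv.weight[:] = K  # K already has the (c_o, c_i, 1, 1) shape nn.Conv2d expects
Y3 = conv(X.reshape(1, 3, 3, 3)).reshape(2, 3, 3)
assert float(torch.abs(Y1 - Y3).sum()) < 1e-5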
Pooling

Like convolutional layers, pooling operators consist of a fixed-shape window that is slid over all regions in the input according to its stride, computing a single output for each location traversed by the fixed-shape window (sometimes known as the pooling window). However, unlike the cross-correlation computation of the inputs and kernels in the convolutional layer, the pooling layer contains no parameters (there is no kernel). Instead, pooling operators are deterministic, typically calculating either the maximum or the average value of the elements in the pooling window. These operations are called maximum pooling (max pooling for short) and average pooling, respectively.

In both cases, as with the cross-correlation operator, we can think of the pooling window as starting from the upper-left of the input tensor and sliding across the input tensor from left to right and top to bottom. At each location that the pooling window hits, it computes the maximum or average value of the input subtensor in the window, depending on whether max or average pooling is employed.

max(0, 1, 3, 4) = 4,
max(1, 2, 4, 5) = 5,
max(3, 4, 6, 7) = 7,
max(4, 5, 7, 8) = 8.

In [30]:
def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size
    Y = torch.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i:i + p_h, j:j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i:i + p_h, j:j + p_w].mean()
    return Y

In [31]:
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
pool2d(X, (2, 2))

Out[31]:
tensor([[4., 5.],
        [7., 8.]])

In [32]:
pool2d(X, (2, 2), 'avg')

Out[32]:
tensor([[2., 3.],
        [5., 6.]])

Putting everything together

In [33]:
# Padding and Stride
X = torch.arange(16, dtype=torch.float32).reshape((1, 1, 4, 4))
X

Out[33]:
tensor([[[[ 0.,  1.,  2.,  3.],
          [ 4.,  5.,  6.,  7.],
          [ 8.,  9., 10., 11.],
          [12., 13., 14., 15.]]]])

In [34]:
# https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html
pool2d = nn.MaxPool2d(3)
pool2d(X)

Out[34]:
tensor([[[[10.]]]])

By default, the stride of nn.MaxPool2d equals the pooling window size, so only a single 3×3 window fits in the 4×4 input and the output is the single value 10.

In [35]:
pool2d = nn.MaxPool2d(3, padding=1, stride=2)
pool2d(X)

Out[35]:
tensor([[[[ 5.,  7.],
          [13., 15.]]]])

In [36]:
pool2d = nn.MaxPool2d((2, 3), stride=(2, 3), padding=(0, 1))
pool2d(X)

Out[36]:
tensor([[[[ 5.,  7.],
          [13., 15.]]]])

In [37]:
# Multiple Channels
X = torch.cat((X, X + 1), 1)
X

Out[37]:
tensor([[[[ 0.,  1.,  2.,  3.],
          [ 4.,  5.,  6.,  7.],
          [ 8.,  9., 10., 11.],
          [12., 13., 14., 15.]],

         [[ 1.,  2.,  3.,  4.],
          [ 5.,  6.,  7.,  8.],
          [ 9., 10., 11., 12.],
          [13., 14., 15., 16.]]]])

In [38]:
pool2d = nn.MaxPool2d(3, padding=1, stride=2)
pool2d(X)

Out[38]:
tensor([[[[ 5.,  7.],
          [13., 15.]],

         [[ 6.,  8.],
          [14., 16.]]]])
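Pooling windows follow the same output-shape arithmetic as convolution windows, so we can reuse the conv_output_shape helper sketched after the stride section (our own function, assumed to still be in scope) to predict the shapes above:

# nn.MaxPool2d(3, padding=1, stride=2) on a 4x4 input:
# floor((4 - 3 + 2 + 2) / 2) = 2 along each axis, matching Out[35] and Out[38]
print(conv_output_shape(4, 3, p=2, s=2))
# nn.MaxPool2d((2, 3), stride=(2, 3), padding=(0, 1)) from Out[36]:
print(conv_output_shape(4, 2, p=0, s=2),  # 2 rows
      conv_output_shape(4, 3, p=2, s=3))  # 2 columns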
LeNet

At a high level, LeNet (LeNet-5) consists of two parts: (i) a convolutional encoder consisting of two convolutional layers; and (ii) a dense block consisting of three fully-connected layers.

In [39]:
net = nn.Sequential(nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),
                    nn.AvgPool2d(kernel_size=2, stride=2),
                    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),
                    nn.AvgPool2d(kernel_size=2, stride=2), nn.Flatten(),
                    nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
                    nn.Linear(120, 84), nn.Sigmoid(), nn.Linear(84, 10))

We took a small liberty with the original model, removing the Gaussian activation in the final layer. Other than that, this network matches the original LeNet-5 architecture.

By passing a single-channel (black and white) 28×28 image through the network and printing the output shape at each layer, we can inspect the model to make sure that its operations line up with what we expect from the figure below:

In [40]:
X = torch.rand(size=(1, 1, 28, 28), dtype=torch.float32)
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape: \t', X.shape)

Conv2d output shape:     torch.Size([1, 6, 28, 28])
Sigmoid output shape:    torch.Size([1, 6, 28, 28])
AvgPool2d output shape:  torch.Size([1, 6, 14, 14])
Conv2d output shape:     torch.Size([1, 16, 10, 10])
Sigmoid output shape:    torch.Size([1, 16, 10, 10])
AvgPool2d output shape:  torch.Size([1, 16, 5, 5])
Flatten output shape:    torch.Size([1, 400])
Linear output shape:     torch.Size([1, 120])
Sigmoid output shape:    torch.Size([1, 120])
Linear output shape:     torch.Size([1, 84])
Sigmoid output shape:    torch.Size([1, 84])
Linear output shape:     torch.Size([1, 10])
Training

In [41]:
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ../data/FashionMNIST/raw/train-images-idx3-ubyte.gz
Extracting ../data/FashionMNIST/raw/train-images-idx3-ubyte.gz to ../data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ../data/FashionMNIST/raw/train-labels-idx1-ubyte.gz
Extracting ../data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ../data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to ../data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz
Extracting ../data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to ../data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to ../data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting ../data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/FashionMNIST/raw

In [42]:
def evaluate_accuracy(net, data_iter, device=None):
    """Compute the accuracy for a model on a dataset using a GPU."""
    if isinstance(net, nn.Module):
        net.eval()  # Set the model to evaluation mode
        if not device:
            device = next(iter(net.parameters())).device
    # No. of correct predictions, no. of predictions
    metric = d2l.Accumulator(2)
    with torch.no_grad():
        for X, y in data_iter:
            if isinstance(X, list):
                # Required for BERT fine-tuning (to be covered later)
                X = [x.to(device) for x in X]
            else:
                X = X.to(device)
            y = y.to(device)
            metric.add(d2l.accuracy(net(X), y), y.numel())
    return metric[0] / metric[1]
In [43]:
def train(net, train_iter, test_iter, num_epochs, lr, device):
    def init_weights(m):
        if type(m) == nn.Linear or type(m) == nn.Conv2d:
            nn.init.xavier_uniform_(m.weight)
    net.apply(init_weights)
    print('training on', device)
    net.to(device)
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    loss = nn.CrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=['train loss', 'train acc', 'test acc'])
    timer, num_batches = d2l.Timer(), len(train_iter)
    for epoch in range(num_epochs):
        # Sum of training loss, sum of training accuracy, no. of examples
        metric = d2l.Accumulator(3)
        net.train()
        for i, (X, y) in enumerate(train_iter):
            timer.start()
            optimizer.zero_grad()
            X, y = X.to(device), y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            l.backward()
            optimizer.step()
            with torch.no_grad():
                metric.add(l * X.shape[0], d2l.accuracy(y_hat, y), X.shape[0])
            timer.stop()
            train_l = metric[0] / metric[2]
            train_acc = metric[1] / metric[2]
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (train_l, train_acc, None))
        test_acc = evaluate_accuracy(net, test_iter)
        animator.add(epoch + 1, (None, None, test_acc))
    print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
          f'test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
          f'on {str(device)}')

In [44]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

In [45]:
lr, num_epochs = 0.9, 10
train(net, train_iter, test_iter, num_epochs, lr, device)

loss 0.465, train acc 0.826, test acc 0.781
78497.3 examples/sec on cuda
<Figure size 252x180 with 1 Axes>

AlexNet

AlexNet, which employed an 8-layer CNN, won the ImageNet Large Scale Visual Recognition Challenge 2012 by a phenomenally large margin. This network showed, for the first time, that the features obtained by learning can transcend manually-designed features, breaking the previous paradigm in computer vision.

The architectures of AlexNet and LeNet are very similar, as the figure below illustrates. Note that we provide a slightly streamlined version of AlexNet, removing some of the design quirks that were needed in 2012 to make the model fit on two small GPUs.

The design philosophies of AlexNet and LeNet are very similar, but there are also significant differences. First, AlexNet is much deeper than the comparatively small LeNet-5. AlexNet consists of eight layers: five convolutional layers, two fully-connected hidden layers, and one fully-connected output layer. Second, AlexNet used the ReLU instead of the sigmoid as its activation function. Let us delve into the details below.
In [46]:
net = nn.Sequential(
    # Here, we use a larger 11 x 11 window to capture objects. At the same
    # time, we use a stride of 4 to greatly reduce the height and width of the
    # output. Here, the number of output channels is much larger than that in
    # LeNet
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    # Make the convolution window smaller, set padding to 2 for consistent
    # height and width across the input and output, and increase the number of
    # output channels
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    # Use three successive convolutional layers and a smaller convolution
    # window. Except for the final convolutional layer, the number of output
    # channels is further increased. Pooling layers are not used to reduce the
    # height and width of input after the first two convolutional layers
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2), nn.Flatten(),
    # Here, the number of outputs of the fully-connected layer is several
    # times larger than that in LeNet. Use the dropout layer to mitigate
    # overfitting
    nn.Linear(6400, 4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(p=0.5),
    # Output layer. Since we are using Fashion-MNIST, the number of classes is
    # 10, instead of 1000 as in the paper
    nn.Linear(4096, 10))

In [47]:
X = torch.randn(1, 1, 224, 224)
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape:\t', X.shape)

Conv2d output shape:     torch.Size([1, 96, 54, 54])
ReLU output shape:       torch.Size([1, 96, 54, 54])
MaxPool2d output shape:  torch.Size([1, 96, 26, 26])
Conv2d output shape:     torch.Size([1, 256, 26, 26])
ReLU output shape:       torch.Size([1, 256, 26, 26])
MaxPool2d output shape:  torch.Size([1, 256, 12, 12])
Conv2d output shape:     torch.Size([1, 384, 12, 12])
ReLU output shape:       torch.Size([1, 384, 12, 12])
Conv2d output shape:     torch.Size([1, 384, 12, 12])
ReLU output shape:       torch.Size([1, 384, 12, 12])
Conv2d output shape:     torch.Size([1, 256, 12, 12])
ReLU output shape:       torch.Size([1, 256, 12, 12])
MaxPool2d output shape:  torch.Size([1, 256, 5, 5])
Flatten output shape:    torch.Size([1, 6400])
Linear output shape:     torch.Size([1, 4096])
ReLU output shape:       torch.Size([1, 4096])
Dropout output shape:    torch.Size([1, 4096])
Linear output shape:     torch.Size([1, 4096])
ReLU output shape:       torch.Size([1, 4096])
Dropout output shape:    torch.Size([1, 4096])
Linear output shape:     torch.Size([1, 10])

In [48]:
batch_size = 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)

In [49]:
lr, num_epochs = 0.01, 10
train(net, train_iter, test_iter, num_epochs, lr, device)

loss 0.330, train acc 0.880, test acc 0.878
5014.0 examples/sec on cuda
<Figure size 252x180 with 1 Axes>

Networks Using Blocks (VGG)

The idea of using blocks first emerged from the Visual Geometry Group (VGG) at Oxford University, in their eponymously-named VGG network. It is easy to implement these repeated structures in code with any modern deep learning framework by using loops and subroutines.

VGG Blocks

The basic building block of classic CNNs is a sequence of the following: (i) a convolutional layer with padding to maintain the resolution, (ii) a non-linearity such as a ReLU, and (iii) a pooling layer such as a maximum pooling layer. One VGG block consists of a sequence of convolutional layers, followed by a maximum pooling layer for spatial downsampling. In the original VGG paper, the authors employed convolutions with 3×3 kernels with padding of 1 (keeping height and width) and 2×2 maximum pooling with stride of 2 (halving the resolution after each block). In the code below, we define a function called vgg_block to implement one VGG block.
In [50]:
def vgg_block(num_convs, in_channels, out_channels):
    layers = []
    for _ in range(num_convs):
        layers.append(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

VGG Network

In [51]:
conv_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))

In [52]:
def vgg(conv_arch):
    conv_blks = []
    in_channels = 1
    # The convolutional part
    for (num_convs, out_channels) in conv_arch:
        conv_blks.append(vgg_block(num_convs, in_channels, out_channels))
        in_channels = out_channels
    return nn.Sequential(
        *conv_blks, nn.Flatten(),
        # The fully-connected part
        nn.Linear(out_channels * 7 * 7, 4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, 10))

net = vgg(conv_arch)

In [53]:
X = torch.randn(size=(1, 1, 224, 224))
for blk in net:
    X = blk(X)
    print(blk.__class__.__name__, 'output shape:\t', X.shape)

Sequential output shape:  torch.Size([1, 64, 112, 112])
Sequential output shape:  torch.Size([1, 128, 56, 56])
Sequential output shape:  torch.Size([1, 256, 28, 28])
Sequential output shape:  torch.Size([1, 512, 14, 14])
Sequential output shape:  torch.Size([1, 512, 7, 7])
Flatten output shape:     torch.Size([1, 25088])
Linear output shape:      torch.Size([1, 4096])
ReLU output shape:        torch.Size([1, 4096])
Dropout output shape:     torch.Size([1, 4096])
Linear output shape:      torch.Size([1, 4096])
ReLU output shape:        torch.Size([1, 4096])
Dropout output shape:     torch.Size([1, 4096])
Linear output shape:      torch.Size([1, 10])

In [54]:
# Since VGG-11 is more computationally heavy than AlexNet, we construct a
# network with a smaller number of channels. This is more than sufficient for
# training on Fashion-MNIST.
ratio = 4
small_conv_arch = [(pair[0], pair[1] // ratio) for pair in conv_arch]
net = vgg(small_conv_arch)

In [55]:
lr, num_epochs, batch_size = 0.05, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
train(net, train_iter, test_iter, num_epochs, lr, device)

loss 0.175, train acc 0.935, test acc 0.919
2384.3 examples/sec on cuda
<Figure size 252x180 with 1 Axes>

Network in Network (NiN)

LeNet, AlexNet, and VGG all share a common design pattern: extract features exploiting spatial structure via a sequence of convolution and pooling layers and then post-process the representations via fully-connected layers. The improvements upon LeNet by AlexNet and VGG mainly lie in how these later networks widen and deepen these two modules. Alternatively, one could imagine using fully-connected layers earlier in the process. However, a careless use of dense layers might give up the spatial structure of the representation entirely; network in network (NiN) blocks offer an alternative. They were proposed based on a very simple insight: to use an MLP on the channels for each pixel separately.
In [56]:
def nin_block(in_channels, out_channels, kernel_size, strides, padding):
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, strides, padding),
        nn.ReLU(), nn.Conv2d(out_channels, out_channels, kernel_size=1),
        nn.ReLU(), nn.Conv2d(out_channels, out_channels, kernel_size=1),
        nn.ReLU())

In [57]:
net = nn.Sequential(
    nin_block(1, 96, kernel_size=11, strides=4, padding=0),
    nn.MaxPool2d(3, stride=2),
    nin_block(96, 256, kernel_size=5, strides=1, padding=2),
    nn.MaxPool2d(3, stride=2),
    nin_block(256, 384, kernel_size=3, strides=1, padding=1),
    nn.MaxPool2d(3, stride=2), nn.Dropout(0.5),
    # There are 10 label classes
    nin_block(384, 10, kernel_size=3, strides=1, padding=1),
    nn.AdaptiveAvgPool2d((1, 1)),
    # Transform the four-dimensional output into two-dimensional output with a
    # shape of (batch size, 10)
    nn.Flatten())

In [58]:
X = torch.rand(size=(1, 1, 224, 224))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape:\t', X.shape)

Sequential output shape:         torch.Size([1, 96, 54, 54])
MaxPool2d output shape:          torch.Size([1, 96, 26, 26])
Sequential output shape:         torch.Size([1, 256, 26, 26])
MaxPool2d output shape:          torch.Size([1, 256, 12, 12])
Sequential output shape:         torch.Size([1, 384, 12, 12])
MaxPool2d output shape:          torch.Size([1, 384, 5, 5])
Dropout output shape:            torch.Size([1, 384, 5, 5])
Sequential output shape:         torch.Size([1, 10, 5, 5])
AdaptiveAvgPool2d output shape:  torch.Size([1, 10, 1, 1])
Flatten output shape:            torch.Size([1, 10])

In [59]:
lr, num_epochs, batch_size = 0.1, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)

In [60]:
train(net, train_iter, test_iter, num_epochs, lr, device)

loss 0.333, train acc 0.878, test acc 0.855
2728.6 examples/sec on cuda
<Figure size 252x180 with 1 Axes>

Batch Normalization

Formally, denoting by $\mathbf{x} \in \mathcal{B}$ an input to batch normalization (BN) that is from a minibatch $\mathcal{B}$, batch normalization transforms $\mathbf{x}$ according to the following expression:

$$\mathrm{BN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \hat{\boldsymbol{\mu}}}{\hat{\boldsymbol{\sigma}}} + \boldsymbol{\beta}$$

For a fully-connected layer, batch normalization is applied between the affine transformation and the activation function: $\mathbf{h} = \phi(\mathrm{BN}(\mathbf{W}\mathbf{x} + \mathbf{b}))$.
In [61]:
def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use `is_grad_enabled` to determine whether the current mode is training
    # mode or prediction mode
    if not torch.is_grad_enabled():
        # If it is prediction mode, directly use the mean and variance
        # obtained by moving average
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # When using a fully-connected layer, calculate the mean and
            # variance on the feature dimension
            mean = X.mean(dim=0)
            var = ((X - mean)**2).mean(dim=0)
        else:
            # When using a two-dimensional convolutional layer, calculate the
            # mean and variance on the channel dimension (axis=1). Here we
            # need to maintain the shape of `X`, so that the broadcasting
            # operation can be carried out later
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean)**2).mean(dim=(0, 2, 3), keepdim=True)
        # In training mode, the current mean and variance are used for the
        # standardization
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the mean and variance using moving average
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # Scale and shift
    return Y, moving_mean.data, moving_var.data

In [62]:
class BatchNorm(nn.Module):
    # `num_features`: the number of outputs for a fully-connected layer
    # or the number of output channels for a convolutional layer. `num_dims`:
    # 2 for a fully-connected layer and 4 for a convolutional layer
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # The scale parameter and the shift parameter (model parameters) are
        # initialized to 1 and 0, respectively
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # The variables that are not model parameters are initialized to 0 and 1
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        # If `X` is not on the main memory, copy `moving_mean` and
        # `moving_var` to the device where `X` is located
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated `moving_mean` and `moving_var`
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean, self.moving_var,
            eps=1e-5, momentum=0.9)
        return Y

Applying Batch Normalization in LeNet

In [63]:
net = nn.Sequential(nn.Conv2d(1, 6, kernel_size=5), BatchNorm(6, num_dims=4),
                    nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
                    nn.Conv2d(6, 16, kernel_size=5), BatchNorm(16, num_dims=4),
                    nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
                    nn.Flatten(), nn.Linear(16 * 4 * 4, 120),
                    BatchNorm(120, num_dims=2), nn.Sigmoid(),
                    nn.Linear(120, 84), BatchNorm(84, num_dims=2),
                    nn.Sigmoid(), nn.Linear(84, 10))

In [64]:
lr, num_epochs, batch_size = 1.0, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

In [65]:
train(net, train_iter, test_iter, num_epochs, lr, device)

loss 0.263, train acc 0.902, test acc 0.798
79612.6 examples/sec on cuda
<Figure size 252x180 with 1 Axes>

In [66]:
net[1].gamma.reshape((-1,)), net[1].beta.reshape((-1,))

Out[66]:
(tensor([4.1125, 3.0199, 2.8736, 3.3110, 3.3188, 2.3615], device='cuda:0',
        grad_fn=<ViewBackward>),
 tensor([-2.0603, -3.0580, -3.5240,  1.9239, -1.2915, -0.8189], device='cuda:0',
        grad_fn=<ViewBackward>))

In [67]:
## using built-in nn.BatchNorm2d
net = nn.Sequential(nn.Conv2d(1, 6, kernel_size=5), nn.BatchNorm2d(6),
                    nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
                    nn.Conv2d(6, 16, kernel_size=5), nn.BatchNorm2d(16),
                    nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
                    nn.Flatten(), nn.Linear(256, 120), nn.BatchNorm1d(120),
                    nn.Sigmoid(), nn.Linear(120, 84), nn.BatchNorm1d(84),
                    nn.Sigmoid(), nn.Linear(84, 10))
In [68]:
train(net, train_iter, test_iter, num_epochs, lr, device)

loss 0.263, train acc 0.903, test acc 0.825
106445.5 examples/sec on cuda
<Figure size 252x180 with 1 Axes>

Residual Networks (ResNet)

Let us focus on a local part of a neural network. Denote the input by $\mathbf{x}$. We assume that the desired underlying mapping we want to obtain by learning is $f(\mathbf{x})$, to be used as the input to the activation function on the top. On the left of the figure below, the portion within the dotted-line box must directly learn the mapping $f(\mathbf{x})$. On the right, the portion within the dotted-line box needs to learn the residual mapping $f(\mathbf{x}) - \mathbf{x}$, which is how the residual block derives its name. If the identity mapping $f(\mathbf{x}) = \mathbf{x}$ is the desired underlying mapping, the residual mapping is easier to learn: we only need to push the weights and biases of the upper weight layer (e.g., a fully-connected layer or a convolutional layer) within the dotted-line box to zero. The right of the figure illustrates the residual block of ResNet, where the solid line carrying the layer input $\mathbf{x}$ to the addition operator is called a residual connection (or shortcut connection). With residual blocks, inputs can forward propagate faster through the residual connections across layers.

In [71]:
from torch.nn import functional as F

In [72]:
class Residual(nn.Module):
    """The Residual block of ResNet."""
    def __init__(self, input_channels, num_channels, use_1x1conv=False,
                 strides=1):
        super().__init__()
        self.conv1 = nn.Conv2d(input_channels, num_channels, kernel_size=3,
                               padding=1, stride=strides)
        self.conv2 = nn.Conv2d(num_channels, num_channels, kernel_size=3,
                               padding=1)
        if use_1x1conv:
            self.conv3 = nn.Conv2d(input_channels, num_channels,
                                   kernel_size=1, stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm2d(num_channels)
        self.bn2 = nn.BatchNorm2d(num_channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        Y += X
        return F.relu(Y)

In [73]:
blk = Residual(3, 3)
X = torch.rand(4, 3, 6, 6)
Y = blk(X)
Y.shape

Out[73]:
torch.Size([4, 3, 6, 6])

In [74]:
blk = Residual(3, 6, use_1x1conv=True, strides=2)
blk(X).shape

Out[74]:
torch.Size([4, 6, 3, 3])
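A small sketch (ours, not part of the original notebook) makes the "push the weights to zero" argument concrete: if we zero out the weights and biases of both convolutions inside a Residual block, the convolutional branch contributes nothing and the block reduces to the identity mapping (up to the final ReLU):

blk = Residual(3, 3)
with torch.no_grad():
    for conv in (blk.conv1, blk.conv2):
        conv.weight.zero_()
        conv.bias.zero_()
# With the convolutional branch silenced, the block returns F.relu(0 + X);
# X from torch.rand is non-negative, so this is X itself
print(torch.allclose(blk(X), F.relu(X)))  # True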
In [75]:
b1 = nn.Sequential(nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
                   nn.BatchNorm2d(64), nn.ReLU(),
                   nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [76]:
def resnet_block(input_channels, num_channels, num_residuals,
                 first_block=False):
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(
                Residual(input_channels, num_channels, use_1x1conv=True,
                         strides=2))
        else:
            blk.append(Residual(num_channels, num_channels))
    return blk

In [77]:
b2 = nn.Sequential(*resnet_block(64, 64, 2, first_block=True))
b3 = nn.Sequential(*resnet_block(64, 128, 2))
b4 = nn.Sequential(*resnet_block(128, 256, 2))
b5 = nn.Sequential(*resnet_block(256, 512, 2))

In [78]:
net = nn.Sequential(b1, b2, b3, b4, b5, nn.AdaptiveAvgPool2d((1, 1)),
                    nn.Flatten(), nn.Linear(512, 10))

In [79]:
X = torch.rand(size=(1, 1, 224, 224))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape:\t', X.shape)

Sequential output shape:         torch.Size([1, 64, 56, 56])
Sequential output shape:         torch.Size([1, 64, 56, 56])
Sequential output shape:         torch.Size([1, 128, 28, 28])
Sequential output shape:         torch.Size([1, 256, 14, 14])
Sequential output shape:         torch.Size([1, 512, 7, 7])
AdaptiveAvgPool2d output shape:  torch.Size([1, 512, 1, 1])
Flatten output shape:            torch.Size([1, 512])
Linear output shape:             torch.Size([1, 10])

In [80]:
lr, num_epochs, batch_size = 0.05, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)

In [81]:
train(net, train_iter, test_iter, num_epochs, lr, device)

loss 0.011, train acc 0.997, test acc 0.915
5904.9 examples/sec on cuda
<Figure size 252x180 with 1 Axes>

From ResNet to DenseNet

The key difference between ResNet and DenseNet is that in the latter case outputs are concatenated (denoted by [,]) rather than added.

In [82]:
def conv_block(input_channels, num_channels):
    return nn.Sequential(
        nn.BatchNorm2d(input_channels), nn.ReLU(),
        nn.Conv2d(input_channels, num_channels, kernel_size=3, padding=1))

In [83]:
class DenseBlock(nn.Module):
    def __init__(self, num_convs, input_channels, num_channels):
        super(DenseBlock, self).__init__()
        layer = []
        for i in range(num_convs):
            layer.append(
                conv_block(num_channels * i + input_channels, num_channels))
        self.net = nn.Sequential(*layer)

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            # Concatenate the input and output of each block on the channel
            # dimension
            X = torch.cat((X, Y), dim=1)
        return X

In [84]:
blk = DenseBlock(2, 3, 10)
X = torch.randn(4, 3, 8, 8)
Y = blk(X)
Y.shape

Out[84]:
torch.Size([4, 23, 8, 8])

Transition Layers

In [85]:
def transition_block(input_channels, num_channels):
    return nn.Sequential(
        nn.BatchNorm2d(input_channels), nn.ReLU(),
        nn.Conv2d(input_channels, num_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2))

In [86]:
blk = transition_block(23, 10)
blk(Y).shape

Out[86]:
torch.Size([4, 10, 4, 4])

In [87]:
# Model
b1 = nn.Sequential(nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
                   nn.BatchNorm2d(64), nn.ReLU(),
                   nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [88]:
# `num_channels`: the current number of channels
num_channels, growth_rate = 64, 32
num_convs_in_dense_blocks = [4, 4, 4, 4]
blks = []
for i, num_convs in enumerate(num_convs_in_dense_blocks):
    blks.append(DenseBlock(num_convs, num_channels, growth_rate))
    # This is the number of output channels in the previous dense block
    num_channels += num_convs * growth_rate
    # A transition layer that halves the number of channels is added between
    # the dense blocks
    if i != len(num_convs_in_dense_blocks) - 1:
        blks.append(transition_block(num_channels, num_channels // 2))
        num_channels = num_channels // 2
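To see where the final value of num_channels comes from, here is a short trace (ours, not part of the original notebook) of the bookkeeping performed by the loop above:

channels = 64
for i in range(4):
    channels += 4 * 32   # each dense block adds num_convs * growth_rate channels
    if i != 3:
        channels //= 2   # each transition layer halves the channel count
    print(f'after stage {i + 1}: {channels} channels')
# 64 -> 192 -> 96 -> 224 -> 112 -> 240 -> 120 -> 248, so num_channels is 248 here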
In [92]:
net = nn.Sequential(b1, *blks, nn.BatchNorm2d(num_channels), nn.ReLU(),
                    nn.AdaptiveMaxPool2d((1, 1)), nn.Flatten(),
                    nn.Linear(num_channels, 10))

In [93]:
lr, num_epochs, batch_size = 0.1, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)

In [94]:
train(net, train_iter, test_iter, num_epochs, lr, device)

loss 0.146, train acc 0.946, test acc 0.821
6900.0 examples/sec on cuda
<Figure size 252x180 with 1 Axes>