
From Fully Connected Layers to Convolutions

Convolutional Neural Networks
The Cross-Correlation Operation
The input is a two-dimensional tensor with a height of 3 and width of 3. We mark the shape of the tensor as
3×3 or ( 3 , 3 ). The height and width of the kernel are both 2. The shape of the kernel window (or convolution
window) is given by the height and width of the kernel (here it is 2×2 ).
In the two-dimensional cross-correlation operation, we begin with the convolution window positioned at the
upper-left corner of the input tensor and slide it across the input tensor, both from left to right and top to
bottom. When the convolution window slides to a certain position, the input subtensor contained in that
window and the kernel tensor are multiplied elementwise and the resulting tensor is summed up yielding a
single scalar value. This result gives the value of the output tensor at the corresponding location. Here, the
output tensor has a height of 2 and width of 2 and the four elements are derived from the two-dimensional
cross-correlation operation:
0 × 0 + 1 × 1 + 3 × 2 + 4 × 3 = 19,
1 × 0 + 2 × 1 + 4 × 2 + 5 × 3 = 25,
3 × 0 + 4 × 1 + 6 × 2 + 7 × 3 = 37,
4 × 0 + 5 × 1 + 7 × 2 + 8 × 3 = 43.
Note that along each axis, the output size is slightly smaller than the input size. Because the kernel has width
and height greater than one, we can only properly compute the cross-correlation for locations where the kernel
fits wholly within the image, so the output size is given by the input size 𝑛ℎ × 𝑛𝑤 minus the size of the
convolution kernel 𝑘ℎ × 𝑘𝑤 via:
(𝑛ℎ − 𝑘ℎ + 1) × (𝑛𝑤 − 𝑘𝑤 + 1)
In [1]:
import torch
from torch import nn
from d2l import torch as d2l
In [2]:
def corr2d(X, K): #@save
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y
In [3]:
X = torch.tensor([[0.0, 1.0, 2.0],
                  [3.0, 4.0, 5.0],
                  [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0],
                  [2.0, 3.0]])
corr2d(X, K)
Out[3]:
tensor([[19., 25.],
[37., 43.]])
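As a quick sanity check of the output-size formula above (an added aside, not part of the original notebook), we can compare the shape returned by corr2d with (𝑛ℎ − 𝑘ℎ + 1, 𝑛𝑤 − 𝑘𝑤 + 1):

n_h, n_w = X.shape
k_h, k_w = K.shape
print(corr2d(X, K).shape)              # torch.Size([2, 2])
print((n_h - k_h + 1, n_w - k_w + 1))  # (2, 2)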
Convolutional Layers
A convolutional layer cross-correlates the input and kernel and adds a scalar bias to produce an output. The
two parameters of a convolutional layer are the kernel and the scalar bias. When training models based on
convolutional layers, we typically initialize the kernels randomly, just as we would with a fully-connected layer.
We are now ready to implement a two-dimensional convolutional layer based on the corr2d function defined
above. In the __init__ constructor function, we declare weight and bias as the two model parameters. The
forward propagation function calls the corr2d function and adds the bias.
In [4]:
class Conv2D(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(kernel_size))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return corr2d(x, self.weight) + self.bias
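As a quick illustration (added here, not part of the original notebook), the custom layer can be applied directly to the 3×3 input X defined above; since the kernel is initialized randomly, the exact output values will differ from run to run:

conv = Conv2D(kernel_size=(2, 2))
conv(X).shape  # torch.Size([2, 2]); the values depend on the random kernel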
Object Edge Detection in Images
Let us take a moment to parse a simple application of a convolutional layer: detecting the edge of an object in
an image by finding the location of the pixel change. First, we construct an “image” of 6×8 pixels. The middle
four columns are black (0) and the rest are white (1).
In [5]:
X = torch.ones((6, 8))
X[:, 2:6] = 0
X
Out[5]:
tensor([[1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.]])
Next, we construct a kernel K with a height of 1 and a width of 2. When we perform the cross-correlation
operation with the input, if the horizontally adjacent elements are the same, the output is 0. Otherwise, the
output is non-zero.
In [6]:
K = torch.tensor([[1.0, -1.0]])
We are ready to perform the cross-correlation operation with arguments X (our input) and K (our kernel). As you
can see, we detect 1 for the edge from white to black and -1 for the edge from black to white. All other outputs
take value 0.
In [7]:
Y = corr2d(X, K)
Y
Out[7]:
tensor([[ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.]])
We can now apply the kernel to the transposed image. Because both the image and the kernel are transposed
here, the result picks out horizontal edges; applying the original K to the transposed image would instead
produce all zeros, since the kernel K only detects vertical edges.
In [8]:
corr2d(X.t(), K.t())
Out[8]:
tensor([[ 0., 0., 0., 0., 0., 0.],
[ 1., 1., 1., 1., 1., 1.],
[ 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.],
[-1., -1., -1., -1., -1., -1.],
[ 0., 0., 0., 0., 0., 0.]])
Learning a Kernel
Designing an edge detector by finite differences [1, -1] is neat if we know this is precisely what we are looking
for. However, as we look at larger kernels, and consider successive layers of convolutions, it might be
impossible to specify precisely what each filter should be doing manually.
Now let us see whether we can learn the kernel that generated Y from X by looking at the input–output pairs
only. We first construct a convolutional layer and initialize its kernel as a random tensor. Next, in each iteration,
we will use the squared error to compare Y with the output of the convolutional layer. We can then calculate the
gradient to update the kernel. For the sake of simplicity, in the following we use the built-in class for two-dimensional convolutional layers and ignore the bias.
In [9]:
# Construct a two-dimensional convolutional layer with 1 output channel and a
# kernel of shape (1, 2). For the sake of simplicity, we ignore the bias here
conv2d = nn.Conv2d(1, 1, kernel_size=(1, 2), bias=False)
In [10]:
# The two-dimensional convolutional layer uses four-dimensional input and
# output in the format of (example, channel, height, width), where the batch
# size (number of examples in the batch) and the number of channels are both 1
X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))
lr = 3e-2 # Learning rate
In [11]:
for i in range(10):
    Y_hat = conv2d(X)
    l = (Y_hat - Y)**2
    conv2d.zero_grad()
    l.sum().backward()
    # Update the kernel
    conv2d.weight.data[:] -= lr * conv2d.weight.grad
    print(f'batch {i + 1}, loss {l.sum():.3f}')
batch 1, loss 22.013
batch 2, loss 9.498
batch 3, loss 4.199
batch 4, loss 1.917
batch 5, loss 0.912
batch 6, loss 0.454
batch 7, loss 0.238
batch 8, loss 0.130
batch 9, loss 0.075
batch 10, loss 0.044
In [12]:
conv2d.weight.data.reshape((1, 2))
Out[12]:
tensor([[ 1.0052, -0.9651]])
Padding
One tricky issue when applying convolutional layers is that we tend to lose pixels on the perimeter of our
image. Since we typically use small kernels, for any given convolution, we might only lose a few pixels, but this
can add up as we apply many successive convolutional layers.
One straightforward solution to this problem is to add extra pixels of filler around the boundary of our input
image, thus increasing the effective size of the image. Typically, we set the values of the extra pixels to zero. In
the given example, we pad a 3×3 input, increasing its size to 5×5. The corresponding output then increases to
a 4×4 matrix. The shaded portions are the first output element as well as the input and kernel tensor elements
used for the output computation: 0×0+0×1+0×2+0×3=0.
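A quick numerical check of this example (a small addition, not part of the original notebook): padding a 3×3 input with one row/column of zeros on every side and running corr2d with a 2×2 kernel gives a 4×4 output whose top-left element is 0, as in the computation above. The names X_demo and K_demo are just illustrative.

import torch.nn.functional as F

X_demo = torch.arange(9, dtype=torch.float32).reshape(3, 3)
K_demo = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
X_pad = F.pad(X_demo, (1, 1, 1, 1))  # zero-pad the 3 x 3 input to 5 x 5
Y_pad = corr2d(X_pad, K_demo)
print(Y_pad.shape)   # torch.Size([4, 4])
print(Y_pad[0, 0])   # tensor(0.)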
In general, if we add a total of 𝑝ℎ rows of padding (roughly half on top and half on bottom) and a total of 𝑝𝑤
columns of padding (roughly half on the left and half on the right), the output shape will be:
(𝑛ℎ − 𝑘ℎ + 𝑝ℎ + 1) × (𝑛𝑤 − 𝑘𝑤 + 𝑝𝑤 + 1)
CNNs commonly use convolution kernels with odd height and width values, such as 1, 3, 5, or 7. Choosing
odd kernel sizes has the benefit that we can preserve the spatial dimensionality while padding with the same
number of rows on top and bottom, and the same number of columns on left and right.
Moreover, this practice of using odd kernels and padding to precisely preserve dimensionality offers a clerical
benefit. For any two-dimensional tensor X, when the kernel’s size is odd and the number of padding rows and
columns on all sides are the same, producing an output with the same height and width as the input, we know
that the output Y[i, j] is calculated by cross-correlation of the input and convolution kernel with the window
centered on X[i, j].
In [13]:
# We define a convenience function to calculate the convolutional layer. This
# function initializes the convolutional layer weights and performs
# corresponding dimensionality elevations and reductions on the input and
# output
def comp_conv2d(conv2d, X):
    # Here (1, 1) indicates that the batch size and the number of channels
    # are both 1
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    # Exclude the first two dimensions that do not interest us: examples and
    # channels
    return Y.reshape(Y.shape[2:])
In [14]:
# Note that here 1 row or column is padded on either side, so a total of 2
# rows or columns are added
conv2d = nn.Conv2d(1, 1, kernel_size=3, padding=1)
X = torch.rand(size=(8, 8))
comp_conv2d(conv2d, X).shape
Out[14]:
torch.Size([8, 8])
In [15]:
# Here, we use a convolution kernel with a height of 5 and a width of 3. The
# padding numbers on either side of the height and width are 2 and 1,
# respectively
conv2d = nn.Conv2d(1, 1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape
Out[15]:
torch.Size([8, 8])
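For contrast (a small addition, not from the original notebook), dropping the padding makes the shrinking effect described above visible: a 3×3 kernel with no padding reduces the same 8×8 input to 6×6.

conv2d = nn.Conv2d(1, 1, kernel_size=3, padding=0)
comp_conv2d(conv2d, X).shape  # torch.Size([6, 6])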
Stride
When computing the cross-correlation, we start with the convolution window at the upper-left corner of the
input tensor, and then slide it over all locations both down and to the right. In previous examples, we default to
sliding one element at a time. However, sometimes, either for computational efficiency or because we wish to
downsample, we move our window more than one element at a time, skipping the intermediate locations.
We refer to the number of rows and columns traversed per slide as the stride. So far, we have used strides of 1,
both for height and width. Sometimes, we may want to use a larger stride. Consider, for example, a
two-dimensional cross-correlation operation with a stride of 3 vertically and 2 horizontally. The two output
elements are derived from the input and kernel tensor elements as follows:
0×0+0×1+1×2+2×3=8 , 0×0+6×1+0×2+0×3=6.
We can see that when the second element of the first column is outputted, the convolution window slides
down three rows. The convolution window slides two columns to the right when the second element of the first
row is outputted. When the convolution window continues to slide two columns to the right on the input, there
is no output because the input element cannot fill the window (unless we add another column of padding).
In general, when the stride for the height is 𝑠ℎ and the stride for the width is 𝑠𝑤, the output shape is:
⌊(𝑛ℎ − 𝑘ℎ + 𝑝ℎ + 𝑠ℎ)/𝑠ℎ⌋ × ⌊(𝑛𝑤 − 𝑘𝑤 + 𝑝𝑤 + 𝑠𝑤)/𝑠𝑤⌋
In [16]:
conv2d = nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2)
comp_conv2d(conv2d, X).shape
Out[16]:
torch.Size([4, 4])
In [17]:
conv2d = nn.Conv2d(1, 1, kernel_size=(3, 5), padding=(0, 1), stride=(3, 4))
comp_conv2d(conv2d, X).shape
Out[17]:
torch.Size([2, 2])
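Both results agree with the formula above (a quick check added here, not part of the original notebook): with an 8×8 input, the first layer gives ⌊(8 − 3 + 2 + 2)/2⌋ = 4 in each dimension, and the second gives ⌊(8 − 3 + 0 + 3)/3⌋ = 2 for the height and ⌊(8 − 5 + 2 + 4)/4⌋ = 2 for the width.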
Multiple Input Channels
When the input data contain multiple channels (e.g., color images have the standard RGB channels to indicate
the amount of red, green and blue), we need to construct a convolution kernel with the same number of input
channels as the input data, so that it can perform cross-correlation with the input data. Assuming that the
number of channels for the input data is 𝑐𝑖, the number of input channels of the convolution kernel also needs
to be 𝑐𝑖.
If our convolution kernel’s window shape is 𝑘ℎ × 𝑘𝑤, then when 𝑐𝑖 = 1, we can think of our convolution kernel
as just a two-dimensional tensor of shape 𝑘ℎ × 𝑘𝑤. However, when 𝑐𝑖 > 1, we need a kernel that contains a
tensor of shape 𝑘ℎ × 𝑘𝑤 for every input channel. Concatenating these 𝑐𝑖 tensors together yields a convolution
kernel of shape 𝑐𝑖 × 𝑘ℎ × 𝑘𝑤. Since the input and convolution kernel each have 𝑐𝑖 channels, we can perform a
cross-correlation operation on the two-dimensional tensor of the input and the two-dimensional tensor of the
convolution kernel for each channel, adding the 𝑐𝑖 results together (summing over the channels) to yield a
two-dimensional tensor. This is the result of a two-dimensional cross-correlation between a multi-channel input
and a multi-input-channel convolution kernel.
The example below demonstrates a two-dimensional cross-correlation with two input channels. The first
output element is derived from the input and kernel tensor elements as follows:
(1×1+2×2+4×3+5×4)+(0×0+1×1+3×2+4×3)=56.
In [18]:
def corr2d_multi_in(X, K):
    # First, iterate through the 0th dimension (channel dimension) of `X` and
    # `K`. Then, add them together
    return sum(d2l.corr2d(x, k) for x, k in zip(X, K))
In [19]:
X = torch.tensor([[[0.0, 1.0, 2.0],
                   [3.0, 4.0, 5.0],
                   [6.0, 7.0, 8.0]],
                  [[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0],
                   [7.0, 8.0, 9.0]]])
K = torch.tensor([[[0.0, 1.0],
                   [2.0, 3.0]],
                  [[1.0, 2.0],
                   [3.0, 4.0]]])

corr2d_multi_in(X, K)
Out[19]:
tensor([[ 56., 72.],
[104., 120.]])
Multiple Output Channels
Regardless of the number of input channels, so far we always ended up with one output channel. However, it
turns out to be essential to have multiple channels at each layer. In the most popular neural network
architectures, we actually increase the channel dimension as we go higher up in the neural network, typically
downsampling to trade off spatial resolution for greater channel depth. Intuitively, you could think of each
channel as responding to some different set of features. Reality is a bit more complicated than the most naive
interpretations of this intuition since representations are not learned independently but are rather optimized to be
jointly useful. So it may not be that a single channel learns an edge detector but rather that some direction in
channel space corresponds to detecting edges.
Denote by 𝑐𝑖 and 𝑐𝑜 the number of input and output channels, respectively, and let 𝑘ℎ and 𝑘𝑤 be the height and
width of the kernel. To get an output with multiple channels, we can create a kernel tensor of shape
𝑐𝑖 × 𝑘ℎ × 𝑘𝑤 for every output channel. We concatenate them on the output channel dimension, so that the
shape of the convolution kernel is 𝑐𝑜 × 𝑐𝑖 × 𝑘ℎ × 𝑘𝑤. In cross-correlation operations, the result on each output
channel is calculated from the convolution kernel corresponding to that output channel and takes input from all
channels in the input tensor.
In [20]:
def corr2d_multi_in_out(X, K):
    # Iterate through the 0th dimension of `K`, and each time, perform
    # cross-correlation operations with input `X`. All of the results are
    # stacked together
    return torch.stack([corr2d_multi_in(X, k) for k in K], 0)
In [21]:
K = torch.tensor([[[0.0, 1.0],
                   [2.0, 3.0]],
                  [[1.0, 2.0],
                   [3.0, 4.0]]])

K = torch.stack((K, K + 1, K + 2), 0)
K.shape
Out[21]:
torch.Size([3, 2, 2, 2])
In [22]:
K
Out[22]:
tensor([[[[0., 1.],
          [2., 3.]],
         [[1., 2.],
          [3., 4.]]],

        [[[1., 2.],
          [3., 4.]],
         [[2., 3.],
          [4., 5.]]],

        [[[2., 3.],
          [4., 5.]],
         [[3., 4.],
          [5., 6.]]]])
In [23]:
X = torch.tensor([[[0.0, 1.0, 2.0],
                   [3.0, 4.0, 5.0],
                   [6.0, 7.0, 8.0]],
                  [[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0],
                   [7.0, 8.0, 9.0]]])
In [24]:
corr2d_multi_in_out(X, K)
Out[24]:
tensor([[[ 56., 72.],
[104., 120.]],
[[ 76., 100.],
[148., 172.]],
[[ 96., 128.],
[192., 224.]]])
1×1 Convolutional Layer
At first, a 1×1 convolution, i.e., 𝑘ℎ = 𝑘𝑤 = 1, does not seem to make much sense. After all, a convolution
correlates adjacent pixels. A 1×1 convolution obviously does not. Nonetheless, they are popular operations
that are sometimes included in the designs of complex deep networks. Let us see in some detail what it
actually does.
Because the minimum window is used, the 1×1 convolution loses the ability of larger convolutional layers to
recognize patterns consisting of interactions among adjacent elements in the height and width dimensions.
The only computation of the 1×1 convolution occurs on the channel dimension.
The example below shows the cross-correlation computation using the 1×1 convolution kernel with 3 input
channels and 2 output channels. Note that the inputs and outputs have the same height and width. Each
element in the output is derived from a linear combination of elements at the same position in the input image.
You could think of the 1×1 convolutional layer as constituting a fully-connected layer applied at every single
pixel location to transform the 𝑐𝑖 corresponding input values into 𝑐𝑜 output values. Because this is still a
convolutional layer, the weights are tied across pixel location. Thus the 1×1 convolutional layer requires
𝑐𝑜 × 𝑐𝑖 weights (plus the bias).
In [25]:
def corr2d_multi_in_out_1x1(X, K):
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X = X.reshape((c_i, h * w))
    K = K.reshape((c_o, c_i))
    # Matrix multiplication in the fully-connected layer
    Y = torch.matmul(K, X)
    return Y.reshape((c_o, h, w))
In [26]:
X = torch.normal(0, 1, (3, 3, 3))
K = torch.normal(0, 1, (2, 3, 1, 1))
​
Y1 = corr2d_multi_in_out_1x1(X, K)
Y1
Out[26]:
tensor([[[-1.2791, 0.4097, 0.5345],
[ 0.1310, -1.8611, -0.5171],
[-1.1235, -0.7653, -1.7798]],
[[-2.0987, 3.3594, -0.5957],
[-0.1891, -3.5485, 1.1903],
[-2.5009, -0.1530, -1.4922]]])
In [27]:
def corr2d_multi_in_out(X, K):
    return torch.stack([corr2d_multi_in(X, k) for k in K], 0)
In [28]:
Y2 = corr2d_multi_in_out(X, K)
Y2
Out[28]:
tensor([[[-1.2791, 0.4097, 0.5345],
[ 0.1310, -1.8611, -0.5171],
[-1.1235, -0.7653, -1.7798]],
[[-2.0987, 3.3594, -0.5957],
[-0.1891, -3.5485, 1.1903],
[-2.5009, -0.1530, -1.4922]]])
In [29]:
assert float(torch.abs(Y1 - Y2).sum()) < 1e-6
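As a further cross-check (an addition, not part of the original notebook), the built-in nn.Conv2d with kernel_size=1 produces the same result once its weight is set to K, confirming that a 1×1 convolution is exactly this per-pixel linear map over the channels:

conv = nn.Conv2d(3, 2, kernel_size=1, bias=False)
conv.weight.data = K                  # `K` already has shape (2, 3, 1, 1)
Y3 = conv(X.unsqueeze(0)).squeeze(0)  # add and remove the batch dimension
assert float(torch.abs(Y1 - Y3).sum()) < 1e-6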
Pooling
Like convolutional layers, pooling operators consist of a fixed-shape window that is slid over all regions in the
input according to its stride, computing a single output for each location traversed by the fixed-shape window
(sometimes known as the pooling window). However, unlike the cross-correlation computation of the inputs
and kernels in the convolutional layer, the pooling layer contains no parameters (there is no kernel). Instead,
pooling operators are deterministic, typically calculating either the maximum or the average value of the
elements in the pooling window. These operations are called maximum pooling (max pooling for short) and
average pooling, respectively.
In both cases, as with the cross-correlation operator, we can think of the pooling window as starting from the
upper-left of the input tensor and sliding across the input tensor from left to right and top to bottom. At each
location that the pooling window hits, it computes the maximum or average value of the input subtensor in the
window, depending on whether max or average pooling is employed.
max(0,1,3,4) = 4,
max(1,2,4,5) = 5,
max(3,4,6,7) = 7,
max(4,5,7,8) = 8.
In [30]:
def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size
    Y = torch.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i:i + p_h, j:j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i:i + p_h, j:j + p_w].mean()
    return Y
In [31]:
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
pool2d(X, (2, 2))
Out[31]:
tensor([[4., 5.],
[7., 8.]])
In [32]:
pool2d(X, (2, 2), 'avg')
Out[32]:
tensor([[2., 3.],
[5., 6.]])
Putting everything together
In [33]:
#Padding and Stride
X = torch.arange(16, dtype=torch.float32).reshape((1, 1, 4, 4))
X
Out[33]:
tensor([[[[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.],
[12., 13., 14., 15.]]]])
In [34]:
#https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html
pool2d = nn.MaxPool2d(3)
pool2d(X)
Out[34]:
tensor([[[[10.]]]])
In [35]:
pool2d = nn.MaxPool2d(3, padding=1, stride=2)
pool2d(X)
Out[35]:
tensor([[[[ 5., 7.],
[13., 15.]]]])
In [36]:
pool2d = nn.MaxPool2d((2, 3), stride=(2, 3), padding=(0, 1))
pool2d(X)
Out[36]:
tensor([[[[ 5., 7.],
[13., 15.]]]])
In [37]:
# Multiple Channels
X = torch.cat((X, X + 1), 1)
X
Out[37]:
tensor([[[[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.],
[12., 13., 14., 15.]],
[[ 1., 2., 3., 4.],
[ 5., 6., 7., 8.],
[ 9., 10., 11., 12.],
[13., 14., 15., 16.]]]])
In [38]:
pool2d = nn.MaxPool2d(3, padding=1, stride=2)
pool2d(X)
Out[38]:
tensor([[[[ 5., 7.],
[13., 15.]],
[[ 6., 8.],
[14., 16.]]]])
LeNet
At a high level, LeNet (LeNet-5) consists of two parts: (i) a convolutional encoder consisting of two
convolutional layers; and (ii) a dense block consisting of three fully-connected layers.
In [39]:
net = nn.Sequential(nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),
                    nn.AvgPool2d(kernel_size=2, stride=2),
                    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),
                    nn.AvgPool2d(kernel_size=2, stride=2), nn.Flatten(),
                    nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
                    nn.Linear(120, 84), nn.Sigmoid(), nn.Linear(84, 10))
We took a small liberty with the original model, removing the Gaussian activation in the final layer. Other than
that, this network matches the original LeNet-5 architecture.
By passing a single-channel (black and white) 28×28 image through the network and printing the output shape
at each layer, we can inspect the model to make sure that its operations line up with what we expect from the
figure below:
In [40]:
X = torch.rand(size=(1, 1, 28, 28), dtype=torch.float32)

for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape: \t', X.shape)
Conv2d output shape:     torch.Size([1, 6, 28, 28])
Sigmoid output shape:    torch.Size([1, 6, 28, 28])
AvgPool2d output shape:  torch.Size([1, 6, 14, 14])
Conv2d output shape:     torch.Size([1, 16, 10, 10])
Sigmoid output shape:    torch.Size([1, 16, 10, 10])
AvgPool2d output shape:  torch.Size([1, 16, 5, 5])
Flatten output shape:    torch.Size([1, 400])
Linear output shape:     torch.Size([1, 120])
Sigmoid output shape:    torch.Size([1, 120])
Linear output shape:     torch.Size([1, 84])
Sigmoid output shape:    torch.Size([1, 84])
Linear output shape:     torch.Size([1, 10])
Training
In [41]:
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ../data/FashionMNIST/raw/train-images-idx3-ubyte.gz
Extracting ../data/FashionMNIST/raw/train-images-idx3-ubyte.gz to ../data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ../data/FashionMNIST/raw/train-labels-idx1-ubyte.gz
Extracting ../data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ../data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to ../data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz
Extracting ../data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to ../data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to ../data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting ../data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/FashionMNIST/raw
In [42]:
def evaluate_accuracy(net, data_iter, device=None):
    """Compute the accuracy for a model on a dataset using a GPU."""
    if isinstance(net, nn.Module):
        net.eval()  # Set the model to evaluation mode
        if not device:
            device = next(iter(net.parameters())).device
    # No. of correct predictions, no. of predictions
    metric = d2l.Accumulator(2)

    with torch.no_grad():
        for X, y in data_iter:
            if isinstance(X, list):
                # Required for BERT Fine-tuning (to be covered later)
                X = [x.to(device) for x in X]
            else:
                X = X.to(device)
            y = y.to(device)
            metric.add(d2l.accuracy(net(X), y), y.numel())
    return metric[0] / metric[1]
In [43]:
def train(net, train_iter, test_iter, num_epochs, lr, device):
    def init_weights(m):
        if type(m) == nn.Linear or type(m) == nn.Conv2d:
            nn.init.xavier_uniform_(m.weight)

    net.apply(init_weights)
    print('training on', device)
    net.to(device)
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    loss = nn.CrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=['train loss', 'train acc', 'test acc'])
    timer, num_batches = d2l.Timer(), len(train_iter)
    for epoch in range(num_epochs):
        # Sum of training loss, sum of training accuracy, no. of examples
        metric = d2l.Accumulator(3)
        net.train()
        for i, (X, y) in enumerate(train_iter):
            timer.start()
            optimizer.zero_grad()
            X, y = X.to(device), y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            l.backward()
            optimizer.step()
            with torch.no_grad():
                metric.add(l * X.shape[0], d2l.accuracy(y_hat, y), X.shape[0])
            timer.stop()
            train_l = metric[0] / metric[2]
            train_acc = metric[1] / metric[2]
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (train_l, train_acc, None))
        test_acc = evaluate_accuracy(net, test_iter)
        animator.add(epoch + 1, (None, None, test_acc))
    print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
          f'test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
          f'on {str(device)}')
In [44]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
In [45]:
lr, num_epochs = 0.9, 10
train(net, train_iter, test_iter, num_epochs, lr, device)
loss 0.465, train acc 0.826, test acc 0.781
78497.3 examples/sec on cuda
<Figure size 252x180 with 1 Axes>
AlexNet
AlexNet, which employed an 8-layer CNN, won the ImageNet Large Scale Visual Recognition Challenge 2012
by a phenomenally large margin. This network showed, for the first time, that the features obtained by learning
can transcend manually-designed features, breaking the previous paradigm in computer vision.
The architectures of AlexNet and LeNet are very similar, as the figure below illustrates. Note that we provide a
slightly streamlined version of AlexNet removing some of the design quirks that were needed in 2012 to make
the model fit on two small GPUs.
The design philosophies of AlexNet and LeNet are very similar, but there are also significant differences. First,
AlexNet is much deeper than the comparatively small LeNet-5. AlexNet consists of eight layers: five
convolutional layers, two fully-connected hidden layers, and one fully-connected output layer. Second, AlexNet
used the ReLU instead of the sigmoid as its activation function. Let us delve into the details below.
In [46]:
net = nn.Sequential(
    # Here, we use a larger 11 x 11 window to capture objects. At the same
    # time, we use a stride of 4 to greatly reduce the height and width of the
    # output. Here, the number of output channels is much larger than that in
    # LeNet
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    # Make the convolution window smaller, set padding to 2 for consistent
    # height and width across the input and output, and increase the number of
    # output channels
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    # Use three successive convolutional layers and a smaller convolution
    # window. Except for the final convolutional layer, the number of output
    # channels is further increased. Pooling layers are not used to reduce the
    # height and width of input after the first two convolutional layers
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2), nn.Flatten(),
    # Here, the number of outputs of the fully-connected layer is several
    # times larger than that in LeNet. Use the dropout layer to mitigate
    # overfitting
    nn.Linear(6400, 4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(p=0.5),
    # Output layer. Since we are using Fashion-MNIST, the number of classes is
    # 10, instead of 1000 as in the paper
    nn.Linear(4096, 10))
In [47]:
X = torch.randn(1, 1, 224, 224)
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape:\t', X.shape)
Conv2d output shape:     torch.Size([1, 96, 54, 54])
ReLU output shape:       torch.Size([1, 96, 54, 54])
MaxPool2d output shape:  torch.Size([1, 96, 26, 26])
Conv2d output shape:     torch.Size([1, 256, 26, 26])
ReLU output shape:       torch.Size([1, 256, 26, 26])
MaxPool2d output shape:  torch.Size([1, 256, 12, 12])
Conv2d output shape:     torch.Size([1, 384, 12, 12])
ReLU output shape:       torch.Size([1, 384, 12, 12])
Conv2d output shape:     torch.Size([1, 384, 12, 12])
ReLU output shape:       torch.Size([1, 384, 12, 12])
Conv2d output shape:     torch.Size([1, 256, 12, 12])
ReLU output shape:       torch.Size([1, 256, 12, 12])
MaxPool2d output shape:  torch.Size([1, 256, 5, 5])
Flatten output shape:    torch.Size([1, 6400])
Linear output shape:     torch.Size([1, 4096])
ReLU output shape:       torch.Size([1, 4096])
Dropout output shape:    torch.Size([1, 4096])
Linear output shape:     torch.Size([1, 4096])
ReLU output shape:       torch.Size([1, 4096])
Dropout output shape:    torch.Size([1, 4096])
Linear output shape:     torch.Size([1, 10])
In [48]:
batch_size = 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
In [49]:
lr, num_epochs = 0.01, 10
train(net, train_iter, test_iter, num_epochs, lr, device)
loss 0.330, train acc 0.880, test acc 0.878
5014.0 examples/sec on cuda
<Figure size 252x180 with 1 Axes>
Networks Using Blocks (VGG)
The idea of using blocks first emerged from the Visual Geometry Group (VGG) at Oxford University, in their
eponymously-named VGG network. It is easy to implement these repeated structures in code with any modern
deep learning framework by using loops and subroutines.
VGG Blocks
The basic building block of classic CNNs is a sequence of the following: (i) a convolutional layer with padding
to maintain the resolution, (ii) a non-linearity such as a ReLU, (iii) a pooling layer such as a maximum pooling
layer.
One VGG block consists of a sequence of convolutional layers, followed by a maximum pooling layer for
spatial downsampling. In the original VGG paper, the authors employed convolutions with 3×3 kernels with
padding of 1 (keeping height and width) and 2×2 maximum pooling with stride of 2 (halving the resolution after
each block). In the code below, we define a function called vgg_block to implement one VGG block.
In [50]:
def vgg_block(num_convs, in_channels, out_channels):
    layers = []
    for _ in range(num_convs):
        layers.append(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)
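As a quick illustration (added here, not part of the original notebook), a single block with two convolutions preserves the spatial size through the 3×3, padding-1 convolutions and then halves it with the 2×2 max pooling:

blk = vgg_block(2, 3, 16)
blk(torch.randn(1, 3, 32, 32)).shape  # torch.Size([1, 16, 16, 16])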
VGG Network
In [51]:
conv_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))
In [52]:
def vgg(conv_arch):
    conv_blks = []
    in_channels = 1
    # The convolutional part
    for (num_convs, out_channels) in conv_arch:
        conv_blks.append(vgg_block(num_convs, in_channels, out_channels))
        in_channels = out_channels

    return nn.Sequential(
        *conv_blks, nn.Flatten(),
        # The fully-connected part
        nn.Linear(out_channels * 7 * 7, 4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, 10))

net = vgg(conv_arch)
In [53]:
X = torch.randn(size=(1, 1, 224, 224))
for blk in net:
    X = blk(X)
    print(blk.__class__.__name__, 'output shape:\t', X.shape)
Sequential output shape:  torch.Size([1, 64, 112, 112])
Sequential output shape:  torch.Size([1, 128, 56, 56])
Sequential output shape:  torch.Size([1, 256, 28, 28])
Sequential output shape:  torch.Size([1, 512, 14, 14])
Sequential output shape:  torch.Size([1, 512, 7, 7])
Flatten output shape:     torch.Size([1, 25088])
Linear output shape:      torch.Size([1, 4096])
ReLU output shape:        torch.Size([1, 4096])
Dropout output shape:     torch.Size([1, 4096])
Linear output shape:      torch.Size([1, 4096])
ReLU output shape:        torch.Size([1, 4096])
Dropout output shape:     torch.Size([1, 4096])
Linear output shape:      torch.Size([1, 10])
In [54]:
# Since VGG-11 is more computationally-heavy than AlexNet, we construct a network
# with a smaller number of channels. This is more than sufficient for training on Fashion-MNIST.
ratio = 4
small_conv_arch = [(pair[0], pair[1] // ratio) for pair in conv_arch]
net = vgg(small_conv_arch)
In [55]:
lr, num_epochs, batch_size = 0.05, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
train(net, train_iter, test_iter, num_epochs, lr, device)
loss 0.175, train acc 0.935, test acc 0.919
2384.3 examples/sec on cuda
<Figure size 252x180 with 1 Axes>
Network in Network (NiN)
LeNet, AlexNet, and VGG all share a common design pattern: extract features exploiting spatial structure via a
sequence of convolution and pooling layers and then post-process the representations via fully-connected
layers. The improvements upon LeNet by AlexNet and VGG mainly lie in how these later networks widen and
deepen these two modules. Alternatively, one could imagine using fully-connected layers earlier in the process.
However, a careless use of dense layers might give up the spatial structure of the representation entirely.
Network in network (NiN) blocks offer an alternative. They were proposed based on a very simple insight: to use
an MLP on the channels for each pixel separately.
In [56]:
def nin_block(in_channels, out_channels, kernel_size, strides, padding):
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, strides, padding),
        nn.ReLU(), nn.Conv2d(out_channels, out_channels, kernel_size=1),
        nn.ReLU(), nn.Conv2d(out_channels, out_channels, kernel_size=1),
        nn.ReLU())
In [57]:
net = nn.Sequential(
    nin_block(1, 96, kernel_size=11, strides=4, padding=0),
    nn.MaxPool2d(3, stride=2),
    nin_block(96, 256, kernel_size=5, strides=1, padding=2),
    nn.MaxPool2d(3, stride=2),
    nin_block(256, 384, kernel_size=3, strides=1, padding=1),
    nn.MaxPool2d(3, stride=2), nn.Dropout(0.5),
    # There are 10 label classes
    nin_block(384, 10, kernel_size=3, strides=1, padding=1),
    nn.AdaptiveAvgPool2d((1, 1)),
    # Transform the four-dimensional output into two-dimensional output with a
    # shape of (batch size, 10)
    nn.Flatten())
In [58]:
X = torch.rand(size=(1, 1, 224, 224))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape:\t', X.shape)
Sequential output shape:         torch.Size([1, 96, 54, 54])
MaxPool2d output shape:          torch.Size([1, 96, 26, 26])
Sequential output shape:         torch.Size([1, 256, 26, 26])
MaxPool2d output shape:          torch.Size([1, 256, 12, 12])
Sequential output shape:         torch.Size([1, 384, 12, 12])
MaxPool2d output shape:          torch.Size([1, 384, 5, 5])
Dropout output shape:            torch.Size([1, 384, 5, 5])
Sequential output shape:         torch.Size([1, 10, 5, 5])
AdaptiveAvgPool2d output shape:  torch.Size([1, 10, 1, 1])
Flatten output shape:            torch.Size([1, 10])
In [59]:
lr, num_epochs, batch_size = 0.1, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
In [60]:
train(net, train_iter, test_iter, num_epochs, lr, device)
loss 0.333, train acc 0.878, test acc 0.855
2728.6 examples/sec on cuda
<Figure size 252x180 with 1 Axes>
Batch Normalization
Formally, denoting by 𝐱 ∈ ℬ an input to batch normalization (BN) that is from a minibatch ℬ, batch
normalization transforms 𝐱 according to the following expression:
BN(𝐱) = 𝜸 ⊙ (𝐱 − 𝝁̂ ℬ)/𝝈̂ ℬ + 𝜷
Applied to a fully-connected layer with activation function 𝜙, this gives 𝐡 = 𝜙(BN(𝐖𝐱 + 𝐛)).
In [61]:
def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use `is_grad_enabled` to determine whether the current mode is training
    # mode or prediction mode
    if not torch.is_grad_enabled():
        # If it is prediction mode, directly use the mean and variance
        # obtained by moving average
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # When using a fully-connected layer, calculate the mean and
            # variance on the feature dimension
            mean = X.mean(dim=0)
            var = ((X - mean)**2).mean(dim=0)
        else:
            # When using a two-dimensional convolutional layer, calculate the
            # mean and variance on the channel dimension (axis=1). Here we
            # need to maintain the shape of `X`, so that the broadcasting
            # operation can be carried out later
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean)**2).mean(dim=(0, 2, 3), keepdim=True)
        # In training mode, the current mean and variance are used for the
        # standardization
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the mean and variance using moving average
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # Scale and shift
    return Y, moving_mean.data, moving_var.data
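A small sanity check (added here, not part of the original notebook): in training mode the standardized activations should have roughly zero mean and unit variance per channel before the scale and shift, so with gamma = 1 and beta = 0 the printed statistics are close to 0 and 1. The names Xb and Yb are just illustrative.

Xb = torch.randn(2, 3, 4, 4)
gamma, beta = torch.ones((1, 3, 1, 1)), torch.zeros((1, 3, 1, 1))
Yb, _, _ = batch_norm(Xb, gamma, beta, torch.zeros((1, 3, 1, 1)),
                      torch.ones((1, 3, 1, 1)), eps=1e-5, momentum=0.9)
print(Yb.mean(dim=(0, 2, 3)), Yb.var(dim=(0, 2, 3), unbiased=False))  # ~0 and ~1 per channel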
In [62]:
class BatchNorm(nn.Module):
    # `num_features`: the number of outputs for a fully-connected layer
    # or the number of output channels for a convolutional layer. `num_dims`:
    # 2 for a fully-connected layer and 4 for a convolutional layer
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # The scale parameter and the shift parameter (model parameters) are
        # initialized to 1 and 0, respectively
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # The variables that are not model parameters are initialized to 0 and 1
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        # If `X` is not on the main memory, copy `moving_mean` and
        # `moving_var` to the device where `X` is located
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated `moving_mean` and `moving_var`
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean, self.moving_var,
            eps=1e-5, momentum=0.9)
        return Y
Applying Batch Normalization in LeNet
In [63]:
net = nn.Sequential(nn.Conv2d(1, 6, kernel_size=5), BatchNorm(6, num_dims=4),
                    nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
                    nn.Conv2d(6, 16, kernel_size=5), BatchNorm(16, num_dims=4),
                    nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
                    nn.Flatten(), nn.Linear(16 * 4 * 4, 120),
                    BatchNorm(120, num_dims=2), nn.Sigmoid(),
                    nn.Linear(120, 84), BatchNorm(84, num_dims=2),
                    nn.Sigmoid(), nn.Linear(84, 10))
In [64]:
lr, num_epochs, batch_size = 1.0, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
In [65]:
train(net, train_iter, test_iter, num_epochs, lr, device)
loss 0.263, train acc 0.902, test acc 0.798
79612.6 examples/sec on cuda
<Figure size 252x180 with 1 Axes>
In [66]:
net[1].gamma.reshape((-1,)), net[1].beta.reshape((-1,))
Out[66]:
(tensor([4.1125, 3.0199, 2.8736, 3.3110, 3.3188, 2.3615], device='cuda:0',
        grad_fn=<ViewBackward>),
 tensor([-2.0603, -3.0580, -3.5240, 1.9239, -1.2915, -0.8189], device='cuda:0',
        grad_fn=<ViewBackward>))
In [67]:
## using built-in nn.BatchNorm2d
net = nn.Sequential(nn.Conv2d(1, 6, kernel_size=5), nn.BatchNorm2d(6),
nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
nn.Conv2d(6, 16, kernel_size=5), nn.BatchNorm2d(16),
nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
nn.Flatten(), nn.Linear(256, 120), nn.BatchNorm1d(120),
nn.Sigmoid(), nn.Linear(120, 84), nn.BatchNorm1d(84),
nn.Sigmoid(), nn.Linear(84, 10))
In [68]:
train(net, train_iter, test_iter, num_epochs, lr, device)
loss 0.263, train acc 0.903, test acc 0.825
106445.5 examples/sec on cuda
<Figure size 252x180 with 1 Axes>
Residual Networks (ResNet)
Let us focus on a local part of a neural network. Denote the input by 𝐱. We assume that the desired underlying
mapping we want to obtain by learning is 𝑓(𝐱), to be used as the input to the activation function on the top.
On the left of the figure below, the portion within the dotted-line box must directly learn the mapping 𝑓(𝐱).
On the right, the portion within the dotted-line box needs to learn the residual mapping 𝑓(𝐱)−𝐱, which is how the
residual block derives its name. If the identity mapping 𝑓(𝐱)=𝐱 is the desired underlying mapping, the residual
mapping is easier to learn: we only need to push the weights and biases of the upper weight layer (e.g., fully-connected
layer and convolutional layer) within the dotted-line box to zero. The right of the figure illustrates the residual
block of ResNet, where the solid line carrying the layer input 𝐱 to the addition operator is called a residual
connection (or shortcut connection). With residual blocks, inputs can forward propagate faster through the
residual connections across layers.
In [71]:
from torch.nn import functional as F
In [72]:
class Residual(nn.Module):
    """The Residual block of ResNet."""
    def __init__(self, input_channels, num_channels, use_1x1conv=False,
                 strides=1):
        super().__init__()
        self.conv1 = nn.Conv2d(input_channels, num_channels, kernel_size=3,
                               padding=1, stride=strides)
        self.conv2 = nn.Conv2d(num_channels, num_channels, kernel_size=3,
                               padding=1)
        if use_1x1conv:
            self.conv3 = nn.Conv2d(input_channels, num_channels,
                                   kernel_size=1, stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm2d(num_channels)
        self.bn2 = nn.BatchNorm2d(num_channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        Y += X
        return F.relu(Y)
In [73]:
blk = Residual(3, 3)
X = torch.rand(4, 3, 6, 6)
Y = blk(X)
Y.shape
Out[73]:
torch.Size([4, 3, 6, 6])
In [74]:
blk = Residual(3, 6, use_1x1conv=True, strides=2)
blk(X).shape
Out[74]:
torch.Size([4, 6, 3, 3])
In [75]:
b1 = nn.Sequential(nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
nn.BatchNorm2d(64), nn.ReLU(),
nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
In [76]:
def resnet_block(input_channels, num_channels, num_residuals,
                 first_block=False):
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(
                Residual(input_channels, num_channels, use_1x1conv=True,
                         strides=2))
        else:
            blk.append(Residual(num_channels, num_channels))
    return blk
In [77]:
b2 = nn.Sequential(*resnet_block(64, 64, 2, first_block=True))
b3 = nn.Sequential(*resnet_block(64, 128, 2))
b4 = nn.Sequential(*resnet_block(128, 256, 2))
b5 = nn.Sequential(*resnet_block(256, 512, 2))
In [78]:
net = nn.Sequential(b1, b2, b3, b4, b5, nn.AdaptiveAvgPool2d((1, 1)),
nn.Flatten(), nn.Linear(512, 10))
In [79]:
X = torch.rand(size=(1, 1, 224, 224))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape:\t', X.shape)
Sequential output shape:         torch.Size([1, 64, 56, 56])
Sequential output shape:         torch.Size([1, 64, 56, 56])
Sequential output shape:         torch.Size([1, 128, 28, 28])
Sequential output shape:         torch.Size([1, 256, 14, 14])
Sequential output shape:         torch.Size([1, 512, 7, 7])
AdaptiveAvgPool2d output shape:  torch.Size([1, 512, 1, 1])
Flatten output shape:            torch.Size([1, 512])
Linear output shape:             torch.Size([1, 10])
In [80]:
lr, num_epochs, batch_size = 0.05, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
In [81]:
train(net, train_iter, test_iter, num_epochs, lr, device)
loss 0.011, train acc 0.997, test acc 0.915
5904.9 examples/sec on cuda
<Figure size 252x180 with 1 Axes>
From ResNet to DenseNet
The key difference between ResNet and DenseNet is that in the latter case outputs are concatenated (denoted
by [,]) rather than added.
In [82]:
def conv_block(input_channels, num_channels):
    return nn.Sequential(
        nn.BatchNorm2d(input_channels), nn.ReLU(),
        nn.Conv2d(input_channels, num_channels, kernel_size=3, padding=1))
In [83]:
class DenseBlock(nn.Module):
    def __init__(self, num_convs, input_channels, num_channels):
        super(DenseBlock, self).__init__()
        layer = []
        for i in range(num_convs):
            layer.append(
                conv_block(num_channels * i + input_channels, num_channels))
        self.net = nn.Sequential(*layer)

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            # Concatenate the input and output of each block on the channel
            # dimension
            X = torch.cat((X, Y), dim=1)
        return X
In [84]:
blk = DenseBlock(2, 3, 10)
X = torch.randn(4, 3, 8, 8)
Y = blk(X)
Y.shape
Out[84]:
torch.Size([4, 23, 8, 8])
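The number of output channels here is 3 + 2 × 10 = 23: each conv_block appends another 10 channels to its input. A small check of this bookkeeping (added here, not part of the original notebook):

for num_convs in (1, 2, 3):
    blk = DenseBlock(num_convs, 3, 10)
    print(num_convs, blk(torch.randn(4, 3, 8, 8)).shape[1])  # 13, 23, 33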
Transition Layers
In [85]:
def transition_block(input_channels, num_channels):
    return nn.Sequential(
        nn.BatchNorm2d(input_channels), nn.ReLU(),
        nn.Conv2d(input_channels, num_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2))
In [86]:
blk = transition_block(23, 10)
blk(Y).shape
Out[86]:
torch.Size([4, 10, 4, 4])
In [87]:
#model
b1 = nn.Sequential(nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
nn.BatchNorm2d(64), nn.ReLU(),
nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
In [88]:
# `num_channels`: the current number of channels
num_channels, growth_rate = 64, 32
num_convs_in_dense_blocks = [4, 4, 4, 4]
blks = []
for i, num_convs in enumerate(num_convs_in_dense_blocks):
    blks.append(DenseBlock(num_convs, num_channels, growth_rate))
    # This is the number of output channels in the previous dense block
    num_channels += num_convs * growth_rate
    # A transition layer that halves the number of channels is added between
    # the dense blocks
    if i != len(num_convs_in_dense_blocks) - 1:
        blks.append(transition_block(num_channels, num_channels // 2))
        num_channels = num_channels // 2
In [92]:
net = nn.Sequential(b1, *blks, nn.BatchNorm2d(num_channels), nn.ReLU(),
nn.AdaptiveMaxPool2d((1, 1)), nn.Flatten(),
nn.Linear(num_channels, 10))
In [93]:
lr, num_epochs, batch_size = 0.1, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
In [94]:
train(net, train_iter, test_iter, num_epochs, lr, device)
loss 0.146, train acc 0.946, test acc 0.821
6900.0 examples/sec on cuda
<Figure size 252x180 with 1 Axes>