Uploaded by WALID AMARA

CNNs Course: TensorFlow 2.0, Deep Learning

TensorFlow 2.0
In this course we will look at convolutional neural networks (CNNs). Stay
tuned: this is a practical course about real applications of machine learning,
at small and large scale, to image recognition, classification and
computer vision in general. This is ½ of what you need for your Deep learning
I. Foundations of Convolutional neural networks:
Image classification, object detection and neural style transfer are 3 of the main problems that
CNNs are used to solve. Image classification assigns a label to an image, while
object detection locates objects and draws a bounding box around each of them. Neural style transfer
takes the style of one image and applies it to another image.
One of the major challenges that motivated CNNs is the sheer size of image data, which makes
it impractical to learn all the weights and biases of a fully connected network. E.g., a 1000x1000 pixel photo has
an RGB value for each pixel, meaning 1000x1000x3, that’s 3 million inputs. With 1000 hidden units
(hidden neurons) in the first layer, that gives 3 million * 1000 = 3 billion parameters, which is very expensive to train and,
even if trained, very prone to overfitting. This is where the “Convolution” operation comes into play.
a. Edge detection:
Instead of learning a separate weight for every pixel of the image, we slide a small filter (kernel) over it to detect
features such as the edges of an object. A filter or kernel is an FxF matrix. At each position we multiply
the filter element-wise with the patch of image pixels it covers and sum the products (the convolution operation).
We apply the filter at one position, then move it horizontally one cell to the right, and so on, row by row. In Keras this
is done with the Conv2D layer. In the above example, the filter is used to detect vertical edges.
Here’s an example of how it works:
If the detected edge looks really thick compared to the actual image edge, it’s because
we’re using a very small 6x6 image; in real-life projects we work with images of 1000x1000 or more.
Filters (kernels) come in many types; these are some of the common types of edge detection
The last one is where we treat the filter’s values as parameters and we use back-propagation to
get the optimal value that we want for detecting awkward edges that are not geometrically
To calculate the size of the matrix that results from a convolution, this is the formula:
For an NxN image and an FxF filter, the resulting image will be (N-F+1) x (N-F+1).
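To make the operation and the size formula concrete, here is a minimal pure-Python sketch of a convolution with no padding and a stride of 1. The function name convolve2d and the pixel values are assumptions chosen for illustration; in practice a library layer such as Keras Conv2D does this work.

```python
def convolve2d(image, kernel):
    """Slide an FxF kernel over an NxN image (no padding, stride 1)."""
    n, f = len(image), len(kernel)
    out_size = n - f + 1  # the N-F+1 formula
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # multiply the kernel element-wise with the patch it covers, then sum
            total = 0
            for di in range(f):
                for dj in range(f):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 6x6 image: bright left half (10), dark right half (0)
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]

# Vertical edge detection filter
vertical_edge = [[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]]

result = convolve2d(image, vertical_edge)
for row in result:
    print(row)  # each row is [0, 30, 30, 0]
```

The 6x6 image and the 3x3 filter give a 4x4 result (6-3+1 = 4), and the two middle columns of 30s are the thick detected edge.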
b. Padding and Strides:
a. Padding:
Padding adds a border of pixels around the image so that the filter can also be applied to the
corners and edges of the image, thus reducing information loss. P is the width of the extra
border in pixels; the extra pixels are filled with 0.
Here the main purpose of choosing the padding value is to get the same size for the output image as the input
image. To find P, we set the output size equal to the input size: N+2P-F+1 = N → P = (F-1)/2
Filters are usually odd-sized (3x3, 5x5, etc.), which makes P a whole number.
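As a small sketch of the padding rule, the helpers below compute P for a given filter size and add the zero border to an image. The names same_padding and zero_pad are hypothetical, and the image is a plain list of lists.

```python
def same_padding(f):
    """Padding P that keeps the output the same size as the input: P = (F-1)/2."""
    if f % 2 == 0:
        raise ValueError("this rule assumes an odd filter size")
    return (f - 1) // 2


def zero_pad(image, p):
    """Add a border of p zero-valued pixels around a 2D image."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]           # top border
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)   # left and right border
    padded.extend([0] * width for _ in range(p))       # bottom border
    return padded


print(same_padding(3))  # 1
print(same_padding(5))  # 2
for row in zero_pad([[1, 2], [3, 4]], 1):
    print(row)
```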
b. Strides:
The stride S is the number of cells the filter moves each time you shift it to the
next position. The output image size is computed as follows: floor((N+2P-F)/S) + 1
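The output size with padding and stride can be checked with a one-line helper. This is a sketch; conv_output_size is a hypothetical name, and it rounds down when the filter does not fit an exact number of times.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output height/width for an NxN input, FxF filter, padding P and stride S."""
    return (n + 2 * p - f) // s + 1


print(conv_output_size(6, 3))        # 4: no padding, stride 1 (N-F+1)
print(conv_output_size(6, 3, p=1))   # 6: padding P = (F-1)/2 keeps the size
print(conv_output_size(7, 3, s=2))   # 3: stride 2
```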
c. Convolutions over Volume:
Using 1 filter is often insufficient; we usually use multiple filters to
detect different kinds of edges. On an RGB image there is a 3rd dimension of size 3 (the channels),
so the filter must also have 3 channels to detect edges in the different colors:
In this example, we use 2 filters, 1 for detecting horizontal edges and the other for vertical edges. Notice
how the result of each filter is only a 4x4 image and not a 4x4x3. That’s because the 3x3x3 filter is a cube that is
multiplied element-wise with the image at each position, and the products from all 3 channels are summed into 1
cell, thus eliminating the 3 layers of color.
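Here is a rough sketch of a convolution over a volume, assuming a 6x6x3 image stored as nested lists indexed [row][column][channel]. The function name convolve_volume and the pixel values are made up for the illustration.

```python
def convolve_volume(image, kernel):
    """Convolve an HxWxC image with an FxFxC filter (no padding, stride 1).

    The products over all C channels are summed, so the result is 2D.
    """
    h, w, c = len(image), len(image[0]), len(image[0][0])
    f = len(kernel)
    output = []
    for i in range(h - f + 1):
        row = []
        for j in range(w - f + 1):
            total = 0
            for di in range(f):
                for dj in range(f):
                    for ch in range(c):
                        total += image[i + di][j + dj][ch] * kernel[di][dj][ch]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 6x6x3 image: bright left half in every channel
image = [[[10, 10, 10] if col < 3 else [0, 0, 0] for col in range(6)]
         for _ in range(6)]

# 3x3x3 filters: the same 2D edge filter repeated on all 3 channels
vertical = [[[1] * 3, [0] * 3, [-1] * 3] for _ in range(3)]
horizontal = [[[1] * 3] * 3, [[0] * 3] * 3, [[-1] * 3] * 3]

# One 4x4 feature map per filter; together they form a 4x4x2 output
feature_maps = [convolve_volume(image, k) for k in (vertical, horizontal)]
for fmap in feature_maps:
    print(fmap)
```

The vertical filter responds with 90 (30 per channel, summed over 3 channels) on the edge, while the horizontal filter gives 0 everywhere because this image has no horizontal edge.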
d. One layer of a Convolutional neural network:
As we’ve seen before, the formula for 1 step of forward propagation in a neural network is
Z[i] = W[i] a[i-1] + b[i] and a[i] = g(Z[i]), where g() is the activation function.
The above is an example of 1 layer of a CNN. The 2 matrices that result from the convolution
operation on the image matrix are added to bias 1 and bias 2 respectively, and each
result is fed into a ReLU activation function. This produces 2 matrices that are stacked
into 1, resulting in a 4x4x2 matrix. If we had 10 filters, for example, the result would be 4x4x10.
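The bias, ReLU and stacking step can be sketched as follows. The two 4x4 feature maps and the bias values are hypothetical inputs standing in for the outputs of 2 filters.

```python
def relu(x):
    return max(0, x)


def conv_layer_activation(feature_maps, biases):
    """Add one bias per filter, apply ReLU, and stack the maps into HxWxK."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    return [[[relu(fmap[i][j] + b) for fmap, b in zip(feature_maps, biases)]
             for j in range(w)]
            for i in range(h)]


# Hypothetical outputs of 2 filters on the same image (4x4 each)
map_1 = [[0, 90, 90, 0] for _ in range(4)]
map_2 = [[-5, -5, -5, -5] for _ in range(4)]

activation = conv_layer_activation([map_1, map_2], biases=[1, 2])

print(len(activation), len(activation[0]), len(activation[0][0]))  # 4 4 2
print(activation[0][1])  # [91, 0]: ReLU clips the negative value of filter 2
```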
❖ This is the mathematical summary of 1 layer of a CNN where l is the layer
c. Simple convolutional neural network:
We will look at a simple CNN as a recap of everything we’ve seen so far. We’re trying to do image
classification on an RGB 39x39 image, meaning it’s 39x39x3. This is what it looks like:
In the first layer we apply 10 3x3 filters with a stride of 1 and no padding. The resulting image
size is calculated using the formula in the bottom left of the image: ((N+2P-F)/S) + 1.
Notice how we increase the number of filters as we get into deeper layers, while the image’s
height and width decrease and the number of channels increases (the image gets deeper).
In the last layer we flatten the resulting 7x7x40 image into a vector of 1960 values,
and this vector is fed into a softmax or logistic regression unit for classification.
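The shapes can be traced layer by layer with the output size formula. Only the first layer (10 filters of 3x3, stride 1, no padding) and the final 7x7x40 shape come from the text; the second and third layers below are assumptions, chosen because they reproduce that final shape.

```python
def conv_output_size(n, f, p=0, s=1):
    return (n + 2 * p - f) // s + 1


layers = [
    {"filters": 10, "f": 3, "s": 1, "p": 0},  # given in the text
    {"filters": 20, "f": 5, "s": 2, "p": 0},  # assumption
    {"filters": 40, "f": 5, "s": 2, "p": 0},  # assumption
]

size, channels = 39, 3
shapes = [(size, size, channels)]
for layer in layers:
    size = conv_output_size(size, layer["f"], layer["p"], layer["s"])
    channels = layer["filters"]  # one output channel per filter
    shapes.append((size, size, channels))

flattened = size * size * channels
print(shapes)     # [(39, 39, 3), (37, 37, 10), (17, 17, 20), (7, 7, 40)]
print(flattened)  # 1960
```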
So far we’ve only seen convolutional layers. A good CNN will usually
have convolutional layers, pooling layers and fully connected layers.
a. Pooling layers:
ML practitioners usually use pooling layers to reduce the size of the representation and speed up computation, as well as
to make feature detection more robust. Pooling applies a filter that keeps only certain
values of the mini matrix it is applied to, such as the max, the min or the
center value. Pooling has 2 hyperparameters: the filter size and the stride.
This is a Max Pooling example where we use a 2x2 filter and a stride of, the filter selects the max
value within the mini matrix it is applied to. This works the same on RGB images: pooling is applied to each
channel separately, so the result is a 3-channel-deep matrix with height and width ((N+2P-F)/S) + 1.
F here is the filter size, not the number of filters. Common pooling parameters are:
F = 2, S = 2 or F = 3, S = 2. Max Pooling is used much more commonly than average pooling.
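A minimal sketch of max pooling on one channel, assuming the common setting F = 2, S = 2 and no padding. The function name max_pool2d and the matrix values are invented for the example.

```python
def max_pool2d(matrix, f=2, s=2):
    """Keep the max of each FxF window, moving S cells at a time (no padding)."""
    n_h, n_w = len(matrix), len(matrix[0])
    output = []
    for i in range(0, n_h - f + 1, s):
        row = []
        for j in range(0, n_w - f + 1, s):
            window = [matrix[i + di][j + dj]
                      for di in range(f) for dj in range(f)]
            row.append(max(window))
        output.append(row)
    return output


matrix = [[1, 3, 2, 1],
          [2, 9, 1, 1],
          [1, 3, 2, 3],
          [5, 6, 1, 2]]

pooled = max_pool2d(matrix)
print(pooled)  # [[9, 2], [6, 3]]
```

The 4x4 input becomes 2x2, which matches (4-2)/2 + 1 = 2. Pooling has no weights to learn; it only has the two hyperparameters F and S.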
b. Full example of a CNN:
This is a full example of a convolutional neural network, with all the steps of convolution,
pooling and flattening, and fully connected layers linked to a softmax regression.
We have an RGB 32x32 image. Notice how each layer has a convolution and a pooling
operation. In the first layer we apply 6 filters of 5x5 and a max pool of 2x2 with a stride of 2. In the second layer we
apply 16 filters of 5x5 and a max pool of 2x2 with a stride of 2. Then we apply flattening:
we convert the 5x5x16 result of the second layer into a vector of size 400. This vector is the input
layer of a fully connected neural network of 3 layers. The output layer of that NN is in turn fed to
a softmax regression for classification. And that’s the breakdown of this CNN.
Summary of notations and calculations of parameters:
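As a sketch of how the shapes and parameter counts of the full example can be calculated: a convolutional layer has (F*F*input channels + 1) parameters per filter, and a pooling layer has none. The layer settings below are assumptions: 2x2 max pooling with a stride of 2 after each convolution, and 16 filters in the second convolution, which is the combination that yields the 400-value vector.

```python
def conv_output_size(n, f, p=0, s=1):
    return (n + 2 * p - f) // s + 1


def conv_params(f, in_channels, n_filters):
    # each filter has f*f*in_channels weights plus 1 bias
    return (f * f * in_channels + 1) * n_filters


steps = [
    ("CONV1", {"f": 5, "s": 1, "filters": 6}),
    ("POOL1", {"f": 2, "s": 2}),
    ("CONV2", {"f": 5, "s": 1, "filters": 16}),  # assumed filter count
    ("POOL2", {"f": 2, "s": 2}),
]

size, channels = 32, 3
summary = []
for name, cfg in steps:
    size = conv_output_size(size, cfg["f"], 0, cfg["s"])
    if "filters" in cfg:
        params = conv_params(cfg["f"], channels, cfg["filters"])
        channels = cfg["filters"]
    else:
        params = 0  # pooling has no learned parameters
    summary.append((name, (size, size, channels), params))
    print(name, (size, size, channels), params)

flattened = size * size * channels
print("flattened:", flattened)  # 400
```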
II. Case Studies (ResNets, Inception, Transfer learning...etc)
1. Classic CNNs:
These are the predecessors of ResNets. LeNet used a plain sequence of convolution and pooling layers until the
pixels are flattened: Conv→Pool→Conv2→Pool2...fully connected layer→fully connected layer→output
vector (softmax).
AlexNet worked the same way, only it used solely Max Pooling and had 3 layers in the fully
connected part. (It also used Local Response Normalization to keep activations from reaching very
high values, and split training across 2 GPUs because the GPUs available in 2012 were slow and had little memory.)
VGG-16 is a big CNN with about 138 million parameters: 13 Conv layers grouped into 5 blocks, each block followed by a
pooling layer, then 2 fully connected layers and a softmax output.
Remember that convolution is the use of filters, while pooling reduces a matrix by
keeping only one value per window (Max Pooling, Min Pooling or something else).
2. ResNets:
ResNets tackle the vanishing gradient problem (as the gradient is back-propagated through many layers it shrinks
toward zero, so the first layers of a very deep network barely learn) and the exploding
gradient problem (the gradient grows uncontrollably as it is back-propagated, making
training unstable).
Skip connections are the core of ResNets: residual blocks use them to carry activation values
much deeper into the network.
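The idea of a residual block can be sketched as a[l+2] = g(z[l+2] + a[l]): the input activation is added back just before the second activation function. The sketch below uses small fully connected layers instead of convolutions to keep it short; that simplification, along with all names and values, is an assumption made for illustration.

```python
def relu(vector):
    return [max(0.0, x) for x in vector]


def dense(weights, bias, a):
    """z = W a + b for a small fully connected layer."""
    return [sum(w * x for w, x in zip(row, a)) + b
            for row, b in zip(weights, bias)]


def residual_block(a_in, w1, b1, w2, b2):
    a1 = relu(dense(w1, b1, a_in))   # first layer of the block
    z2 = dense(w2, b2, a1)           # second layer, before activation
    # skip connection: add the block input before the activation
    return relu([z + a for z, a in zip(z2, a_in)])


a_in = [1.0, 2.0]
zeros_w = [[0.0, 0.0], [0.0, 0.0]]
zeros_b = [0.0, 0.0]

# With all weights at zero the block still passes its input through unchanged
a_out = residual_block(a_in, zeros_w, zeros_b, zeros_w, zeros_b)
print(a_out)  # [1.0, 2.0]
```

Even when the two layers contribute nothing, the skip connection lets the activation (and, during training, the gradient) pass through, which is why adding residual blocks does not hurt a deep network.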
3. Inception:
An Inception module uses multiple parallel branches with different filter sizes to learn
different aspects of the input image, and then concatenates the outputs of these branches into a
single tensor that can be used as input to the next layer in the network.
An Inception network, also called GoogLeNet, contains multiple Inception blocks.
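The concatenation step can be sketched like this: every branch keeps the same height and width (which needs "same" padding in the branches), and the outputs are joined along the channel dimension. The branch names, sizes and values are hypothetical placeholders and no real convolution is computed here.

```python
def constant_tensor(h, w, c, value):
    """Stand-in for a branch output of shape HxWxC."""
    return [[[value] * c for _ in range(w)] for _ in range(h)]


def concat_channels(branches):
    """Join HxWxC_k tensors into one HxWx(sum of C_k) tensor."""
    h, w = len(branches[0]), len(branches[0][0])
    for branch in branches:
        if len(branch) != h or len(branch[0]) != w:
            raise ValueError("all branches must share the same height and width")
    output = []
    for i in range(h):
        row = []
        for j in range(w):
            cell = []
            for branch in branches:
                cell.extend(branch[i][j])  # append this branch's channels
            row.append(cell)
        output.append(row)
    return output


branch_1x1 = constant_tensor(2, 2, 2, 1)   # pretend output of a 1x1 conv branch
branch_3x3 = constant_tensor(2, 2, 3, 3)   # pretend output of a 3x3 conv branch
branch_pool = constant_tensor(2, 2, 1, 9)  # pretend output of a pooling branch

merged = concat_channels([branch_1x1, branch_3x3, branch_pool])
print(len(merged), len(merged[0]), len(merged[0][0]))  # 2 2 6
print(merged[0][0])  # [1, 1, 3, 3, 3, 9]
```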
4. MobileNets :
A MobileNet is computationally very inexpensive, which makes it useful for embedded (edge computing)
hardware and mobile applications. Its key idea is the depthwise-separable convolution.
We need to count the computational cost so we can compare a normal CNN with a
MobileNet architecture:
Filter positions means how many positions the filter must visit to cover every possible sub-matrix of
the whole matrix.
A depthwise convolution applies each filter to only 1 channel of the input, so there is one filter per
channel. The pointwise convolution then applies 1x1
convolutions across the channels to combine them and get the exact output shape we want; we can choose how
many 1x1 filters to use in the pointwise convolution to get the desired number of output channels.
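The cost comparison can be sketched by counting multiplications: cost = number of filter parameters * number of filter positions * number of filters. The sizes used here (a 6x6x3 input, 3x3 filters, a 4x4 output and 5 output channels) are hypothetical example values.

```python
def normal_conv_cost(f, in_channels, out_size, n_filters):
    filter_params = f * f * in_channels
    filter_positions = out_size * out_size
    return filter_params * filter_positions * n_filters


def depthwise_separable_cost(f, in_channels, out_size, n_filters):
    filter_positions = out_size * out_size
    # depthwise: one FxF filter per input channel
    depthwise = (f * f) * filter_positions * in_channels
    # pointwise: n_filters filters of size 1x1xin_channels
    pointwise = (1 * 1 * in_channels) * filter_positions * n_filters
    return depthwise + pointwise


normal = normal_conv_cost(f=3, in_channels=3, out_size=4, n_filters=5)
separable = depthwise_separable_cost(f=3, in_channels=3, out_size=4, n_filters=5)

print(normal)     # 2160
print(separable)  # 672 (432 depthwise + 240 pointwise)
print(round(separable / normal, 2))  # 0.31
```

The ratio equals 1/(number of filters) + 1/(F*F), here 1/5 + 1/9, so the saving grows as the number of output channels grows.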
This is just an example of the main block of MobileNets; MobileNet v2, however, also uses residual
connections to reduce vanishing and exploding gradients:
5. EfficientNets:
EfficientNet is about scaling your CNN architecture up or down so that the computation
fits the device you’re deploying the model on. There are 3 parameters that play into scaling
your CNN architecture: the resolution, depth and width.
6. Practical advice for applying ConvNets:
Many implementations of computer vision research papers can be found on GitHub, so you can
search for the research paper’s name or the model’s name alongside ‘GitHub’ to find its
implementation. Then just ‘git clone repository_url’.
7. Transfer Learning:
Transfer learning is about taking a pre-trained model, which has already learned the features needed to
distinguish between classes, and applying it to your particular problem. The weights are already
learned; you just replace the last fully connected layers, or only the last softmax layer, and train the new layers
on your particular dataset, and you will usually get good results. In transfer learning you will hear the
keyword Freeze, which is applied to a layer: a frozen layer’s weights won’t be modified when retraining the model for the specific problem. We usually freeze the first layers because they
contain the fundamental, general features learned, while the later layers hold the task-specific
features. Therefore, you can either replace the softmax output layer, or replace a
few of the last NN layers with a new NN. Another thing that transfer learning does is save you
from random initialization, because when you’re training a NN from scratch it takes a lot of
time for the randomly initialized weights to converge to useful values.
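Freezing can be illustrated with a toy gradient descent step that skips frozen layers. This is only a sketch: the layer names, weights and gradients are invented, and layers are plain dictionaries. In Keras the same effect is obtained by setting layer.trainable = False on the layers you want to freeze.

```python
def sgd_step(layers, grads, learning_rate=0.1):
    """Update the weights of trainable layers only; frozen layers are skipped."""
    for layer, grad in zip(layers, grads):
        if not layer["trainable"]:
            continue  # frozen: keep the pre-trained weights as they are
        layer["weights"] = [w - learning_rate * g
                            for w, g in zip(layer["weights"], grad)]


layers = [
    # pre-trained early layer, frozen
    {"name": "pretrained_base", "weights": [0.5, -0.2], "trainable": False},
    # new output layer for our own classes, trained on our dataset
    {"name": "new_head", "weights": [0.0, 0.0], "trainable": True},
]
grads = [[1.0, 1.0], [1.0, -1.0]]  # hypothetical gradients

sgd_step(layers, grads)
for layer in layers:
    print(layer["name"], layer["weights"])
# pretrained_base [0.5, -0.2]
# new_head [-0.1, 0.1]
```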
A computer vision problem where you have little data requires more creativity and hand
engineering to perfect the algorithm and get the most out of the data you have. The
more data you get, the less complex engineering is needed, because with a large
dataset the deep learning algorithm can learn the features by itself. Therefore, the current state of computer vision
requires practitioners to be skilled in hand engineering and not only in deep learning theory,
for example coding skills that help you augment images and partly automate labeling.