TensorFlow 2.0 CNNs

In this course we will see Convolutional Neural Networks (CNNs). Stay tuned: this is a good course about the practical applications of machine learning, at small and large scale, in image recognition, classification and computer vision in general. This is half of what you need for your deep learning knowledge.

I. Foundations of Convolutional Neural Networks:

Image classification, object detection and neural style transfer are three of the main problems that CNNs were designed to solve. Image classification assigns a label to a whole image, object detection locates the objects in an image and draws a bounding box around each one, and neural style transfer takes the style of one image and applies it to the content of another.

One of the major challenges that motivated CNNs is the vast size of image data, which makes it impossible to compute all the weights and biases of a fully connected network. E.g.: an HD photo has 1000x1000 pixels and each pixel has an RGB value, so the input is 1000x1000x3 = 3 million values; with 1000 hidden units (hidden neurons) in the first layer, that is 3 million x 1000 = 3 billion parameters, nearly impossible to compute, and even if we could compute them the model would overfit. This is where the "convolution" operation comes into play.

a. Edge detection:

Instead of connecting a weight to every pixel of an image, we use a filter (kernel) to detect only the edges of an object. A filter (kernel) is an FxF matrix that we convolve with the matrix of the image's pixels: we lay the filter over a patch of the image, multiply element-wise, sum the products into one output cell, then slide the filter to the next position. In Keras we use the Conv2D layer; the filter in this example detects vertical edges (a runnable sketch follows at the end of this part).

If the detected edge looks really thick compared to the actual image edge, it's because we're using a very small 6x6 image; in real-life projects we work with 1000x1000 images or more.

Filters (kernels) come in many types; Sobel and Scharr filters are common hand-designed edge detectors. The most flexible option is to treat the filter's values as parameters and use back-propagation to learn the optimal values, which lets the network detect awkward edges that are not geometrically simple.

To calculate the size of the matrix resulting from a convolution: for an NxN image and an FxF filter, the resulting image is (N-F+1) x (N-F+1).

b. Padding and Strides:

a. Padding: Padding adds a border of pixels to the image so that we can apply the convolution to the corners of the image as well, reducing information loss. P is the width of the extra border, and the extra pixels are filled with 0 as values. The main purpose of this padding value is to make the output image the same size as the input image; to find that P value we solve N + 2P - F + 1 = N, which gives P = (F-1)/2. Filter sizes are usually odd (3x3, 5x5, etc.) so that P comes out a whole number.

b. Strides: The stride S is the number of steps the filter moves when sliding to the next position. The output image size is floor((N + 2P - F)/S) + 1 per side.

c. Convolutions over Volume:

It turns out that one filter is insufficient most of the time, so we use multiple filters to detect different kinds of edges. On an RGB image there is a 3rd dimension of size 3 (the color channels), so each filter must also have 3 channels to detect edges in the different colors. In this example we use 2 filters, one detecting horizontal edges and the other vertical edges. Notice how the result of each filter is only a 4x4 image and not 4x4x3: the 3x3x3 filter is a cube that is multiplied (convolved) at each position, and the products from all 3 channel layers are summed into 1 cell, which eliminates the 3 layers of color.
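Here is a minimal runnable sketch of edge detection by convolution in TensorFlow 2; the 6x6 image and the 3x3 vertical edge filter are illustrative values chosen for these notes, not from any specific dataset:

import numpy as np
import tensorflow as tf

# 6x6 grayscale image: bright left half, dark right half -> one vertical edge
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=np.float32).reshape(1, 6, 6, 1)

# Classic 3x3 vertical edge detection filter
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=np.float32).reshape(3, 3, 1, 1)

# Valid convolution (no padding), stride 1: output side is N - F + 1 = 6 - 3 + 1 = 4
edges = tf.nn.conv2d(image, kernel, strides=1, padding="VALID")
print(edges.shape)         # (1, 4, 4, 1)
print(edges[0, :, :, 0])   # a thick band of 30s where the edge is

Note how the detected edge comes out two cells thick, exactly the "thick edge on a tiny image" effect described above.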
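And a small helper for the output-size formula above (conv_output_size is a hypothetical name, just for illustration):

def conv_output_size(n, f, p=0, s=1):
    # floor((N + 2P - F)/S) + 1, the formula from the Strides section
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))         # 4 -> no padding: N - F + 1
print(conv_output_size(6, 3, p=1))    # 6 -> "same" output, P = (F-1)/2 = 1
print(conv_output_size(7, 3, s=2))    # 3 -> floor((7 - 3)/2) + 1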
d. One layer of a Convolutional neural network:

As we've seen before, the formula for one step of forward propagation in a neural network is Z[l] = W[l] a[l-1] + b[l] and a[l] = g(Z[l]), where g() is the activation function and l is the layer number. One layer of a CNN works the same way: the 2 matrices resulting from the convolution operation on the image matrix are added to a bias b1 and a bias b2 respectively, each result is fed into a ReLU activation function, and the 2 resulting matrices are stacked into one 4x4x2 volume. If we had 10 filters in this example, the result would be 4x4x10 (a runnable check follows at the end of this section).

❖ Mathematical summary of one CNN layer l: with filter size f[l], padding p[l], stride s[l] and nc[l] filters, the input is n[l-1] x n[l-1] x nc[l-1], each filter is f[l] x f[l] x nc[l-1], and the output is n[l] x n[l] x nc[l] where n[l] = floor((n[l-1] + 2p[l] - f[l]) / s[l]) + 1.

e. Simple convolutional neural network:

We will now see a simple CNN as a summary of all we've seen so far: image classification of an RGB-colored 39x39 image, i.e. a 39x39x3 input.

• In the first layer we apply 10 3x3 filters with a stride of 1 and no padding; the resulting image size is calculated with the usual formula floor((N + 2P - F)/S) + 1.
• Notice how we increase the number of filters as we get into deeper layers: the image's height and width decrease while the number of channels increases (the volume gets deeper).
• In the last layer we flatten the resulting 7x7x40 volume into a vector of 1960 values, and this vector is fed into a softmax or logistic regression unit for classification.

So far we've only seen CNN layers using the convolution operation; usually a good CNN will have convolutional layers, pooling layers and fully connected layers.

a. Pooling layers:

ML practitioners use pooling layers to reduce the representation size and speed up computation, as well as to make the detected features more robust. A pooling filter keeps only one value from each mini-matrix it is applied to, such as the maximum (max pooling) or the average (average pooling). Pooling has 2 hyperparameters and nothing to learn: the filter size F and the stride S. In a max pooling example with a 2x2 filter and a stride of 2, the filter selects the maximum value of each 2x2 block it is applied to (a small demo follows at the end of this section). This works the same on RGB-colored images, where the result keeps the same channel depth, with size floor((N + 2P - F)/S) + 1 per side. F here is the filter size, not the filter count! Common pooling parameters are F = 2, S = 2 or F = 3, S = 2. Max pooling is used much more commonly than average pooling.

b. Full example of a CNN:

This is a full example of a convolutional neural network, with all the steps of pooling, convolution and flattening, and a fully connected part linked to a softmax output. We have an RGB-colored 32x32 image, and each stage has a convolution and a pooling operation: in the first stage we apply 6 filters of 5x5 followed by a 2x2 max pool with stride 2; in the second stage we apply 16 filters of 5x5 and another 2x2 max pool with stride 2; then we flatten the result of the second stage into a vector of size 400, which becomes the input layer of a fully connected neural network of 3 layers. The output layer of that network is in turn fed to a softmax for classification (a Keras version follows at the end of this section). And that's the breakdown of this CNN.

Summary of parameter calculations: a conv layer with C_in input channels and C_out filters of size FxF has (F x F x C_in + 1) x C_out parameters (the +1 is the bias per filter); pooling layers have none.
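A quick runnable check of the one-layer picture and the parameter count, using the illustrative 6x6x3 input and 2 filters from above:

import tensorflow as tf

# One CNN layer: 2 filters of 3x3 with biases and ReLU, on a 6x6 RGB input
layer = tf.keras.layers.Conv2D(filters=2, kernel_size=3, activation="relu")
x = tf.random.normal((1, 6, 6, 3))   # (batch, height, width, channels)
print(layer(x).shape)                # (1, 4, 4, 2): one 4x4 map per filter
# Parameters: (F*F*C_in + 1) * C_out = (3*3*3 + 1) * 2 = 56
print(layer.count_params())          # 56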
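A small max pooling demo with F = 2 and S = 2 (the 4x4 matrix is made up for illustration):

import tensorflow as tf

x = tf.constant([[1., 3., 2., 1.],
                 [4., 6., 5., 2.],
                 [1., 2., 9., 7.],
                 [3., 1., 4., 8.]])
x = tf.reshape(x, (1, 4, 4, 1))      # (batch, H, W, channels)

# Max pooling with F = 2, S = 2: each 2x2 block collapses to its maximum
pooled = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)(x)
print(tf.reshape(pooled, (2, 2)))
# [[6. 5.]
#  [3. 9.]] -- nothing is learned; output side is floor((4 - 2)/2) + 1 = 2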
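And a sketch of the full 32x32 example as a Keras Sequential model. The hidden layer sizes (120, 84) and the 10-class softmax are assumptions borrowed from the classic LeNet-5 layout, since the notes only fix the conv/pool/flatten sizes:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, 5, activation="relu",
                           input_shape=(32, 32, 3)),   # -> 28x28x6
    tf.keras.layers.MaxPooling2D(2, 2),                # -> 14x14x6
    tf.keras.layers.Conv2D(16, 5, activation="relu"),  # -> 10x10x16
    tf.keras.layers.MaxPooling2D(2, 2),                # -> 5x5x16
    tf.keras.layers.Flatten(),                         # -> vector of 400
    tf.keras.layers.Dense(120, activation="relu"),     # fully connected part
    tf.keras.layers.Dense(84, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),   # assumed 10 classes
])
model.summary()   # shows each layer's output shape and parameter count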
II. Case Studies (ResNets, Inception, Transfer learning, etc.)

1. Classic CNNs:

These are the ResNets' predecessors. LeNet used plain sequences of convolution and pooling until the pixels are flattened: Conv → Pool → Conv2 → Pool2 → ... → fully connected layer → fully connected layer → output vector (softmax). AlexNet worked much the same, only it used solely max pooling and had 3 fully connected layers. (It also used Local Response Normalization to keep activations from reaching really high values, and parallel GPU computing to train quicker, because GPUs were still slow when it was trained in 2012.) VGG-16 is a big CNN with about 138 million parameters: 13 conv layers grouped into 5 blocks, each block followed by a pooling layer, then 2 fully connected layers and a softmax output. Remember that convolution is the use of filters, while pooling reduces a matrix by selecting only one value per block (max pooling, average pooling).

2. ResNets:

ResNets tackle the vanishing gradient problem (gradients shrink as they are back-propagated through many layers, so the first layers barely get updated) and the exploding gradient problem (gradients grow uncontrollably, making training unstable). Skip connections are the core of ResNets: a residual block adds an earlier activation directly to the output of a later layer, a[l+2] = g(z[l+2] + a[l]), which lets activation values and gradients flow very deep into the network (a sketch follows at the end of this part).

3. Inception:

The inception module is about using multiple parallel branches with different filter sizes to learn different aspects of the input image, then concatenating the outputs from these branches into a single tensor that is used as input to the next layer in the network (a sketch follows at the end of this part). An inception network contains multiple inception blocks. It's also called GoogLeNet.

4. MobileNets:

A MobileNet is very computationally inexpensive, useful for embedded (edge computing) hardware and mobile applications. Its key idea is the depthwise-separable convolution. We count the computational cost so we can compare a normal CNN against the MobileNet architecture: the cost of a normal convolution is (filter volume) x (number of filter positions) x (number of filters), where the number of filter positions is how many times the filter must move to cover every possible sub-matrix of the whole matrix. A depthwise convolution carries out the multiplication with one filter per channel, so each filter is applied to only 1 channel; the pointwise step then applies a 1x1 convolution to combine the channels cheaply and get the exact shape we want in the end, and we can choose how many pointwise filters to use to get the wanted number of output channels (a cost comparison follows at the end of this part). This is just the main building block of MobileNets; MobileNet v2 additionally uses residual connections in its blocks to reduce vanishing and exploding gradients.

5. EfficientNets:

EfficientNet is about manipulating your CNN architecture to scale the computation up or down so it fits the device you're deploying the model on. Three parameters play into scaling your CNN architecture: the resolution of the input image, the depth of the network and the width of the layers.

6. Practical advice for applying ConvNets:

Many actual implementations of computer vision research papers are found on GitHub, so you can search for the research paper's name or the model's name alongside "github" to find an implementation. Then just 'git clone repository_url'.
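A minimal sketch of a residual (skip-connection) block in Keras, assuming the input and output have the same shape so the shortcut can be added directly:

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    # a[l] -> conv -> conv -> add the shortcut a[l] back in -> ReLU
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)   # no activation yet
    y = layers.Add()([y, shortcut])                    # the skip connection
    return layers.ReLU()(y)                            # a[l+2] = g(z[l+2] + a[l])

inputs = tf.keras.Input(shape=(32, 32, 64))            # channels must match `filters`
model = tf.keras.Model(inputs, residual_block(inputs))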
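A sketch of one inception block; the branch sizes are illustrative, not GoogLeNet's exact numbers:

import tensorflow as tf
from tensorflow.keras import layers

def inception_block(x):
    b1 = layers.Conv2D(16, 1, padding="same", activation="relu")(x)  # 1x1 branch
    b2 = layers.Conv2D(16, 3, padding="same", activation="relu")(x)  # 3x3 branch
    b3 = layers.Conv2D(16, 5, padding="same", activation="relu")(x)  # 5x5 branch
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)        # pooling branch
    return layers.Concatenate()([b1, b2, b3, b4])  # concatenate along channels

inputs = tf.keras.Input(shape=(28, 28, 32))
print(inception_block(inputs).shape)   # (None, 28, 28, 80) = 16 + 16 + 16 + 32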
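To see why the depthwise-separable convolution is cheaper, here is the multiplication count for a small made-up example (a 6x6x3 input, 3x3 filters, 5 output channels):

# Multiplication counts for one layer on a 6x6x3 input with 3x3 filters
positions = 4 * 4                     # (6 - 3 + 1)^2 filter positions
f, c_in, c_out = 3, 3, 5

normal = (f * f * c_in) * positions * c_out       # standard convolution: 2160
depthwise = (f * f) * positions * c_in            # one 3x3 filter per channel: 432
pointwise = (1 * 1 * c_in) * positions * c_out    # 1x1 convs to reach c_out: 240
print(normal, depthwise + pointwise)              # 2160 vs 672
print((depthwise + pointwise) / normal)           # ~0.31, i.e. 1/c_out + 1/f**2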
7. Transfer Learning:

Transfer learning is about using a pre-trained model that has already learned the features necessary to distinguish between classes, and applying it to your particular problem. The weights are already learned; you just remove the last fully connected layers, or only the last softmax layer, train on your particular dataset, and you will get good results.

In transfer learning you will hear the keyword "freeze", which is applied to a layer: a frozen layer's weights won't be modified when retraining the model for the specific problem. We usually freeze the first layers because they contain the fundamental, general features, while the later layers learn the task-specific features. Therefore, you can either replace just the softmax and output layer, or replace a few of the last layers with a new network (a sketch follows below). Another thing transfer learning saves you from is random initialization: when you're training a network from scratch, it takes a lot of time for randomly initialized weights to reach good values.

A computer vision problem where you have little data requires more creativity and hand engineering to perfect the algorithm and maximize what you get out of the data you have; the more data you get, the less complex engineering is required, because with a large dataset the deep learning algorithm is bound to learn on its own. Therefore, the current state of computer vision rewards researchers who are skillful in hand engineering as much as in deep learning knowledge, e.g. coding skills that help you augment images and partly automate labeling.
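A hedged sketch of the freeze-and-replace recipe in Keras, using MobileNetV2 as the assumed pre-trained base (any tf.keras.applications model works the same way; the 160x160 input and 3 classes are arbitrary examples):

import tensorflow as tf

# Pre-trained base with its classification head removed (include_top=False)
base = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                         include_top=False,
                                         weights="imagenet")
base.trainable = False   # "freeze": the pre-trained weights won't be updated

# New head for our own problem (3 classes is an arbitrary example)
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, epochs=5)   # train only the new head on your own dataset

Unfreezing a few of the last base layers afterwards (fine-tuning with a low learning rate) is the usual next step once the new head has converged.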