DLVS SCHEME OF EVALUATION

PART-A

1a. What is visual perception?
A: Visual perception, at its most basic, is the act of observing patterns and objects through sight or visual input. With an autonomous vehicle, for example, visual perception means understanding the surrounding objects and their specific details. [2M]

1b. Define input vector and weight vector.
A: Input vector—the feature vector that is fed to the neuron. It is usually denoted with an uppercase X to represent a vector of inputs (x1, x2, ..., xn).
Weights vector—each input xi is assigned a weight value wi that represents its importance in distinguishing between different input data points. [3M]

1c. Give the formulas to calculate precision and recall.
A: Precision = TP / (TP + FP) and Recall = TP / (TP + FN), where TP, FP, and FN are the numbers of true positive, false positive, and false negative predictions, respectively. [2M]

1d. List the layers used in image classification using an MLP and state the use of each layer.
A: The main components of the neural network architecture are as follows:
Input layer—contains the feature vector.
Hidden layers—the neurons are stacked on top of each other in hidden layers. They are called "hidden" layers because we don't see or control the input going into these layers or the output coming out of them. All we do is feed the feature vector to the input layer and see the output coming out of the output layer.
Output layer—produces the final prediction (for example, class probabilities) from the last hidden layer's activations.
Weight connections (edges)—weights are assigned to each connection between the nodes to reflect the importance of their influence on the final output prediction. In graph network terms, these are called edges connecting the nodes. [3M]

1e. What is a dropout layer?
A: A dropout layer is one of the most commonly used layers to prevent overfitting. Dropout turns off a percentage of the neurons (nodes) that make up a layer of your network. (A minimal code sketch combining an MLP with a dropout layer is given at the end of Part A.) [2M]

1f. Why is fine-tuning better than training from scratch?
A: When we train a network from scratch, we usually randomly initialize the weights and apply a gradient descent optimizer to find the best set of weights that optimizes our error function. Since these weights start with random values, there is no guarantee that they will begin with values that are close to the desired optimal values. And if the initialized values are far from the optimal values, the optimizer will take a long time to converge. This is when fine-tuning can be very useful. [3M]

1g. Define the evaluation metrics in an object detection framework.
A: When evaluating the performance of an object detector, we use two main evaluation metrics: frames per second and mean average precision.
Frames per second (FPS) to measure detection speed—the most common metric used to measure detection speed is the number of frames processed per second.
Mean average precision (mAP) to measure network precision—the most common evaluation metric used in object recognition tasks is mean average precision (mAP). It is a percentage from 0 to 100, and higher values are typically better, but its value is different from the accuracy metric used in classification. [2M]

1h. Define semantic segmentation.
A: Semantic segmentation is a computer vision task in which the goal is to categorize each pixel in an image into a class or object, producing a dense pixel-wise segmentation map of the image in which each pixel is assigned to a specific class or object. [3M]

1i. Define optimization.
A: Optimization is a way of framing a problem to maximize or minimize some value. The best thing about computing an error function is that we turn the neural network into an optimization problem in which our goal is to minimize the error. [2M]

1j. Differentiate between the generator and the discriminator.
A: The discriminator in a GAN is simply a classifier: it tries to distinguish real data from the data created by the generator. The generator, by contrast, learns to produce data that the discriminator cannot tell apart from real data. Generator training therefore requires tighter integration between the generator and the discriminator than discriminator training requires. [3M]
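To make 1d and 1e concrete, here is a minimal sketch of an MLP image classifier that includes a dropout layer. It assumes the TensorFlow/Keras API, 28 × 28 grayscale inputs, and 10 classes; the layer sizes and dropout rate are illustrative choices, not part of the scheme.

```python
# Minimal MLP image classifier with a dropout layer (illustrative sketch).
# Assumes TensorFlow/Keras, 28x28 grayscale images, and 10 output classes.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),    # input layer: flattens the image into a feature vector
    layers.Dense(128, activation='relu'),    # hidden layer 1
    layers.Dense(64, activation='relu'),     # hidden layer 2
    layers.Dropout(0.4),                     # dropout: turns off 40% of the nodes during training
    layers.Dense(10, activation='softmax'),  # output layer: class probabilities
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```

The weights on the connections between these layers (the edges in 1d) are initialized randomly and learned during training.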
PART-B

2a. Discuss the computer vision pipeline (the big picture) with a neat diagram.
A: A computer receives visual input from an imaging device like a camera. This input is typically captured as an image or a sequence of images forming a video.
Each image is then sent through some preprocessing steps whose purpose is to standardize the images. Common preprocessing steps include resizing an image, blurring, rotating, changing its shape, or transforming the image from one color space to another, such as from color to grayscale. Only by standardizing the images—for example, making them the same size—can you then compare them and further analyze them.
Next, we extract features. Features are what help us define objects, and they are usually information about object shape or color. For example, some features that distinguish a motorcycle are the shape of the wheels, headlights, mudguards, and so on. The output of this step is a feature vector, a list of unique shapes that identify the object.
Finally, the features are fed into a classification model. This step looks at the feature vector from the previous step and predicts the class of the image. [Explanation-3M, Dig-2M]

2b. What is the need of converting a color image to a grayscale image? Justify.
A: An image can be represented as a function of two variables x and y, which define a two-dimensional area. A digital image is made of a grid of pixels; the pixel is the raw building block of an image. Every image consists of a set of pixels whose values represent the intensity of light that appears at a given place in the image.
In color images, instead of representing the value of the pixel by just one number, the value is represented by three numbers giving the intensity of each color channel in the pixel. In an RGB system, for example, the value of the pixel is represented by three numbers: the intensity of red, the intensity of green, and the intensity of blue. There are other color systems for images, like HSV and Lab, and all follow the same concept when representing the pixel value. Here is the function representing a color image in the RGB system:
Color image in RGB => F(x, y) = [ red(x, y), green(x, y), blue(x, y) ]
A grayscale image, in contrast, stores only one intensity value per pixel, so converting color images to grayscale reduces the amount of data to process and simplifies computation whenever color is not an important distinguishing feature for the task.
Thinking of an image as a function is very useful in image processing: we can take an image as a function F(x, y) and operate on it mathematically to transform it into a new image function G(x, y). [Explanation with example-5M]
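As a small illustration of 2b, the sketch below converts a color image to grayscale and shows the reduction in data per pixel. It assumes OpenCV and NumPy are available; the file name input.jpg is hypothetical.

```python
# Convert a color image to grayscale (illustrative sketch).
# Assumes OpenCV (cv2) and NumPy; 'input.jpg' is a hypothetical file name.
import cv2
import numpy as np

bgr = cv2.imread('input.jpg')                 # H x W x 3: three intensities per pixel (OpenCV loads BGR)
gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)  # H x W: a single intensity per pixel

# The same conversion written as a weighted sum over the three channels
# (the standard luminance weights for RGB-to-grayscale conversion).
b, g, r = (bgr[..., i].astype(np.float32) for i in range(3))
gray_manual = (0.299 * r + 0.587 * g + 0.114 * b).astype(np.uint8)

print(bgr.shape, gray.shape)  # the grayscale image holds one third of the values
```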
Or

3a. What is an ANN? Draw an ANN and explain.
A: Computer vision algorithms are typically employed as interpreting devices. The interpreter is the brain of the vision system; its role is to take the output image from the sensing device and learn features and patterns to identify objects. So we need to build a brain. Scientists were inspired by how our brains work and tried to reverse engineer the central nervous system to get some insight on how to build an artificial brain. Thus, artificial neural networks (ANNs) were born. We can see an analogy between biological neurons and artificial systems: both contain a main processing element, a neuron, with input signals (x1, x2, ..., xn) and an output.
The learning behavior of biological neurons inspired scientists to create a network of neurons that are connected to each other. Imitating how information is processed in the human brain, each artificial neuron fires a signal to all the neurons that it's connected to when enough of its input signals are activated. Thus, neurons have a very simple mechanism at the individual level; but when you have millions of these neurons stacked in layers and connected together, with each neuron connected to thousands of other neurons, a learning behavior emerges. Building a multilayer neural network is called deep learning. [Explanation-3M, Dig-2M]

3b. Explain the weighted sum function.
A: Not all input features are equally important (or useful) features. Each input feature (xi) is assigned its own weight (wi) that reflects its importance in the decision-making process. Inputs assigned greater weight have a greater effect on the output: if the weight is high, it amplifies the input signal, and if the weight is low, it diminishes the input signal. The neuron then computes the weighted sum of its inputs, z = x1·w1 + x2·w2 + ... + xn·wn + b (where b is the bias), which is passed on to the activation function. In common representations of neural networks, the weights are represented by lines or edges from the input nodes to the perceptron.
For example, if you are predicting a house price based on a set of features like size, neighborhood, and number of rooms, there are three input features (x1, x2, and x3). Each of these inputs will have a different weight value that represents its effect on the final decision. For example, if the size of the house has double the effect on the price compared with the neighborhood, and the neighborhood has double the effect compared with the number of rooms, you will see weights something like 8, 4, and 2, respectively. [Explanation-3M, Dig-2M]

4a. Draw the CNN architecture and explain in detail.
A: Regular neural networks contain multiple layers that allow each layer to find successively more complex features, and this is the way CNNs work: the first layer of convolutions learns some basic features (edges and lines), the next layer learns features that are a little more complex (circles, squares, and so on), the following layer finds even more complex features (like parts of a face, a car wheel, dog whiskers, and the like), and so on. The CNN architecture follows the same pattern as neural networks: we stack neurons in hidden layers on top of each other; weights are randomly initialized and learned during network training; and we apply activation functions, calculate the error (y - ŷ), and backpropagate the error to update the weights. This process is the same. The difference is that we use convolutional layers instead of regular fully connected layers for the feature learning part. [Explanation-3M, Dig-2M]

4b. Describe the working principle of convolutional layers with a 3 × 3 convolution filter.
A: Convolutional layers are the major building blocks used in convolutional neural networks. A convolution is the simple application of a filter (for example, a 3 × 3 kernel of weights) to an input, which results in an activation. Repeated application of the same filter across the input results in a map of activations called a feature map, indicating the locations and strength of a detected feature in an input, such as an image. [Explanation-3M, Dig-2M]
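To relate 4a and 4b to code, here is a minimal sketch of a small CNN whose convolutional layers use 3 × 3 filters. It assumes TensorFlow/Keras; the filter counts, input shape, and number of classes are illustrative.

```python
# Minimal CNN sketch: 3x3 convolution filters for feature learning, followed by
# fully connected layers for classification (TensorFlow/Keras assumed).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),  # low-level features (edges, lines)
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),                           # more complex features
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),     # classification part (fully connected)
    layers.Dense(10, activation='softmax'),
])
model.summary()  # note how the feature maps shrink spatially while their depth grows
```

Each Conv2D layer slides its 3 × 3 filters over the input and produces one feature map per filter, as described in 4b.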
Or

5a. Discuss data augmentation.
A: One way to avoid overfitting is to obtain more data. Since this is not always a feasible option, we can augment our training data by generating new instances of the same images with some transformations. Data augmentation can be an inexpensive way to give your learning algorithm more training data and therefore reduce overfitting. The many image-augmentation techniques include flipping, rotation, scaling, zooming, changing lighting conditions, and many other transformations that you can apply to your dataset to provide a variety of images to train on.
The main advantage of synthesizing images like this is that you now have more data (for example, 20×) that tells your algorithm that if an image is the digit 6, then even if you flip it vertically or horizontally or rotate it, it's still the digit 6. This makes the model more robust at detecting the number 6 in any form and shape.
Data augmentation is considered a regularization technique because allowing the network to see many variants of the object reduces its dependence on the original form of the object during feature learning. This makes the network more resilient when tested on new data. (A code sketch is given after 5b.) [Explanation-3M, Dig-1M, Code-1M]

5b. Discuss batch normalization.
A: The normalization techniques we discussed were focused on preprocessing the training set before feeding it to the input layer. If the input layer benefits from normalization, why not do the same thing for the extracted features in the hidden units, which are changing all the time, and get a much greater improvement in training speed and network resilience? This process is called batch normalization (BN). [Explanation-3M, Dig-2M]
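A minimal sketch tying 5a and 5b together, assuming TensorFlow/Keras: ImageDataGenerator produces transformed variants of the training images on the fly, and a BatchNormalization layer normalizes the extracted features inside the network. The specific parameter values, input shape, and layer sizes are illustrative.

```python
# Data augmentation (5a) and batch normalization (5b) in Keras (illustrative sketch).
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation: create new training instances by transforming the originals.
datagen = ImageDataGenerator(
    rotation_range=20,       # random rotations
    width_shift_range=0.1,   # random horizontal shifts
    height_shift_range=0.1,  # random vertical shifts
    zoom_range=0.1,          # random zooming
    horizontal_flip=True,    # random horizontal flips
)

# A small model with batch normalization applied to the hidden feature maps.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), input_shape=(28, 28, 1)),
    layers.BatchNormalization(),   # normalize the extracted features of this layer
    layers.Activation('relu'),
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Training would then consume augmented batches, e.g.:
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=5)
```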
6a. Summarize CNN design patterns.
A: Pattern 1: Feature extraction and classification—convolutional nets are typically composed of two parts: the feature extraction part, which consists of a series of convolutional layers, and the classification part, which consists of a series of fully connected layers. This is pretty much always the case with ConvNets, starting from LeNet and AlexNet up to the very recent CNNs that have come out in the past few years, like Inception and ResNet.
Pattern 2: Image depth increases, and dimensions decrease—the input data at each layer is an image. With each layer, we apply a new convolutional layer over a new image. This pushes us to think of an image in a more generic way: each image is a 3D object that has a height, width, and depth. Depth is referred to as the color channel, where depth is 1 for grayscale images and 3 for color images. In the later layers, the images still have depth, but they are not colors per se: they are feature maps that represent the features extracted from the previous layers.
Pattern 3: Fully connected layers—this generally isn't as strict a pattern as the previous two, but it's very helpful to know. Typically, all fully connected layers in a network either have the same number of hidden units or decrease at each layer. It is rare to find a network where the number of units in the fully connected layers increases at each layer. [Explanation-3M, Dig-2M]

6b. Analyse the novel features of Inception.
A: There are some architectural decisions that you need to make for each layer when you are designing a network, such as these:
The kernel size of the convolutional layer—we've seen in previous architectures that the kernel size varies: 1 × 1, 3 × 3, 5 × 5, and, in some cases, 11 × 11 (as in AlexNet). When designing the convolutional layer, we find ourselves trying to pick and tune the kernel size of each layer to fit our dataset. Smaller kernels capture finer details of the image, whereas bigger filters leave out minute details.
When to use the pooling layer—AlexNet uses pooling layers every one or two convolutional layers to downsize spatial features. VGGNet applies pooling after every two, three, or four convolutional layers as the network gets deeper.
Configuring the kernel size and positioning the pooling layers are decisions we make mostly by trial and error, experimenting to get optimal results. Inception says, "Instead of choosing a desired filter size in a convolutional layer and deciding where to place the pooling layers, let's apply all of them together in one block and call it the inception module." [Explanation-3M, Dig-2M]

Or

7a. How does transfer learning work?
A: Recall that neural networks iteratively update their weights during the training cycle of feedforward and backpropagation. We say the network has been trained when we go through a series of training iterations and hyperparameter tuning until the network yields satisfactory results. When training is complete, we output two main items: the network architecture and the trained weights. So, when we say that we are going to use a pretrained network, we mean that we will download the network architecture together with its weights.
During training, the model learns only the features that exist in its training dataset. But when we download large models (like Inception) that have been trained on huge datasets (like ImageNet), all the features that have already been extracted from these large datasets are now available for us to use. I find that really exciting because these pretrained models have spotted other features that weren't in our dataset and will help us build better convolutional networks.
In vision problems, there is a huge amount of stuff for neural networks to learn about the training dataset. There are low-level features like edges, corners, round shapes, curvy shapes, and blobs; and then there are mid- and higher-level features like eyes, circles, squares, and wheels. There are many details in the images that CNNs can pick up on—but if we have only 1,000 images or even 25,000 images in our training dataset, this may not be enough data for the model to learn all those things. By using a pretrained network, we can basically download all this knowledge into our neural network to give it a huge and much faster start, with even higher performance levels. [Explanation-3M, Dig-2M]
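A minimal transfer-learning sketch for 7a, assuming TensorFlow/Keras and the VGG16 network pretrained on ImageNet (downloading both the architecture and the trained weights). The new classification head, its sizes, and the two-class setup are illustrative assumptions.

```python
# Transfer learning sketch (Keras assumed): download a pretrained network
# (architecture + ImageNet weights), freeze it, and train a new classifier head.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # keep the pretrained feature extractor fixed

model = models.Sequential([
    base,                                   # pretrained feature-extraction part
    layers.Flatten(),
    layers.Dense(256, activation='relu'),   # new classifier trained on our (small) dataset
    layers.Dense(2, activation='softmax'),  # e.g. a hypothetical two-class problem
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```

Fine-tuning (see 1f) would then unfreeze some of the top layers of the base network and continue training with a small learning rate.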
7b. List the open source datasets and explain in detail.
A: The computer vision research community has been pretty good about posting datasets on the internet. So, when you hear names like ImageNet, MS COCO, Open Images, MNIST, CIFAR, and many others, these are datasets that people have posted online and that a lot of computer vision researchers have used as benchmarks to train their algorithms and get state-of-the-art results.
MNIST stands for Modified National Institute of Standards and Technology. It contains labeled handwritten images of digits from 0 to 9. The goal of this dataset is to classify handwritten digits. MNIST has been popular with the research community for benchmarking classification algorithms; in fact, it is considered the "hello, world!" of image datasets. But nowadays the MNIST dataset is comparatively simple, and a basic CNN can achieve more than 99% accuracy, so MNIST is no longer considered a benchmark for CNN performance. MNIST consists of 60,000 training images and 10,000 test images. All are grayscale (one channel), and each image is 28 pixels high and 28 pixels wide. [5M]

8a. Explain the R-CNN architecture with a neat diagram.
A: R-CNN is the least sophisticated region-based architecture in its family, but it is the basis for understanding how the whole family of region-based object-detection algorithms works. It was one of the first large, successful applications of convolutional neural networks to the problem of object detection and localization, and it paved the way for the more advanced detection algorithms that followed. The approach was demonstrated on benchmark datasets, achieving then-state-of-the-art results on the PASCAL VOC-2012 dataset and the ILSVRC 2013 object detection challenge. The R-CNN model consists of four components:
Extract regions of interest—also known as extracting region proposals. These regions have a high probability of containing an object. An algorithm called selective search scans the input image to find regions that contain blobs and proposes them as RoIs to be processed by the next modules in the pipeline. The proposed RoIs are then warped to a fixed size, because they usually vary in size and CNNs require a fixed input image size.
Feature extraction module—we run a pretrained convolutional network on top of the region proposals to extract features from each candidate region. This is the typical CNN feature extractor.
Classification module—we train a classifier like a support vector machine (SVM), a traditional machine learning algorithm, to classify candidate detections based on the extracted features from the previous step.
Localization module—also known as a bounding-box regressor. To understand regression, recall that ML problems are categorized as classification or regression problems: classification algorithms output discrete, predefined classes (dog, cat, elephant), whereas regression algorithms output continuous value predictions. [Explanation-3M, Dig-2M]

8b. Explain the high-level SSD architecture with a neat diagram.
A: The SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression (NMS) step to produce the final detections. The architecture of the SSD model is composed of three main parts:
Base network to extract feature maps—a standard pretrained network used for high-quality image classification, which is truncated before any classification layers. In their paper, Liu et al. used a VGG16 network; other networks like VGG19 and ResNet can be used and should produce good results.
Multi-scale feature layers—a series of convolution filters added after the base network. These layers decrease in size progressively to allow predictions of detections at multiple scales.
Non-maximum suppression—NMS is used to eliminate overlapping boxes and keep only one box for each object detected. [Explanation-3M, Dig-2M]
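Since the detectors in 8a and 8b both rely on a non-maximum suppression step, here is a small NumPy sketch of IoU-based NMS. The [x1, y1, x2, y2] box format, the threshold value, and the sample boxes are assumptions made for illustration.

```python
# Non-maximum suppression (NMS) sketch in NumPy: keep the highest-scoring box
# and drop boxes that overlap it too much. Boxes are assumed to be [x1, y1, x2, y2].
import numpy as np

def iou(box, boxes):
    """Intersection over union of one box against an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    order = np.argsort(scores)[::-1]            # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps < iou_threshold]  # drop boxes that overlap the kept box
    return keep

boxes = np.array([[10, 10, 60, 60], [12, 12, 62, 62], [100, 100, 150, 150]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # -> [0, 2]: the second box is suppressed as a duplicate
```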
Or

9a. How does instance segmentation work?
A: Instance segmentation involves classifying pixels based on the instances of an object (as opposed to object classes). Instance segmentation algorithms do not know which class each region belongs to—rather, they separate similar or overlapping regions based on the boundaries of objects.
We can refer to instance segmentation as a combination of semantic segmentation and object detection (detecting all instances of a category in an image), with the additional feature of demarcating separate instances of any particular segment class. Instance segmentation therefore produces a richer output format than both object detection and semantic segmentation networks: one can find the bounding box of each instance (for example, a dog and two cats) as well as the object segmentation map for each instance, and thereby know the number of instances in the image. [Explanation-5M]

9b. Discuss variational autoencoders.
A: In order to be able to use the decoder of an autoencoder for generative purposes, we have to be sure that the latent space is regular enough. One possible solution to obtain such regularity is to introduce explicit regularization during the training process. Thus, a variational autoencoder can be defined as an autoencoder whose training is regularized to avoid overfitting and to ensure that the latent space has good properties that enable a generative process.
Just like a standard autoencoder, a variational autoencoder is an architecture composed of both an encoder and a decoder, trained to minimize the reconstruction error between the encoded-decoded data and the initial data. However, in order to introduce some regularization of the latent space, we make a slight modification to the encoding-decoding process: instead of encoding an input as a single point, we encode it as a distribution over the latent space. The model is then trained as follows:
1. The input is encoded as a distribution over the latent space.
2. A point from the latent space is sampled from that distribution.
3. The sampled point is decoded, and the reconstruction error is computed.
4. The reconstruction error is backpropagated through the network. [Explanation-5M]
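The four training steps in 9b can be sketched in Keras as below, under these assumptions: flattened 28 × 28 inputs scaled to [0, 1], an illustrative two-dimensional latent space, and a KL-divergence term added as the latent-space regularizer alongside the reconstruction loss.

```python
# Variational autoencoder sketch (TensorFlow/Keras assumed): encode each input as a
# distribution (mean, log-variance), sample a latent point, decode it, and train on
# reconstruction error plus a KL regularization term on the latent space.
import tensorflow as tf
from tensorflow.keras import layers, Model

latent_dim = 2                              # illustrative latent-space size

class Sampling(layers.Layer):
    """Reparameterization trick: sample z ~ N(mean, exp(log_var)) and add the KL term."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        kl = -0.5 * tf.reduce_mean(
            tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
        self.add_loss(kl)                   # regularizes the latent space
        eps = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

inputs = layers.Input(shape=(784,))                        # flattened 28x28 images in [0, 1]
h = layers.Dense(256, activation='relu')(inputs)           # encoder
z_mean = layers.Dense(latent_dim)(h)                       # 1) encode as a distribution...
z_log_var = layers.Dense(latent_dim)(h)
z = Sampling()([z_mean, z_log_var])                        # 2) ...and sample a point from it
h_dec = layers.Dense(256, activation='relu')(z)            # decoder
outputs = layers.Dense(784, activation='sigmoid')(h_dec)   # 3) decode the sampled point

vae = Model(inputs, outputs)
# 4) Reconstruction error (+ the KL term added above) is backpropagated through the network.
vae.compile(optimizer='adam', loss='binary_crossentropy')
# vae.fit(x_train, x_train, epochs=10, batch_size=128)
```

The Sampling layer keeps step 2 differentiable, so the error can flow back through the encoder as described in step 4.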
Or

10a. Compare gradient ascent and gradient descent.
A: Gradient descent is an iterative process through which we optimize the parameters of a machine learning model. It is particularly used in neural networks, but also in logistic regression and support vector machines. It is the most typical method for iterative minimization of a cost function. Its major limitation is its guaranteed convergence to a local, not necessarily global, minimum. A hyperparameter α, also called the learning rate, allows fine-tuning of the descent process; in particular, with an appropriate choice of α, we may escape convergence to a local minimum and descend towards a global minimum instead. The gradient is calculated with respect to a vector of parameters of the model, typically the weights w. In neural networks, the process of applying gradient descent to the weight matrix takes the name of backpropagation of the error.
Gradient ascent works in the same manner as gradient descent, with one difference: the task it fulfills isn't minimization, but rather maximization of some function. The reason for the difference is that, at times, we may want to reach the maximum, not the minimum, of some function; this is the case, for instance, if we're maximizing the distance between separating hyperplanes and observations. For this reason, the formula that describes gradient ascent is the same as the one for gradient descent, only with a flipped sign:
Gradient descent: w_{n+1} = w_n - α ∇_w f(w_n)
Gradient ascent: w_{n+1} = w_n + α ∇_w f(w_n)
If gradient descent indicates an iterative movement towards the closest minimum, gradient ascent, conversely, indicates a movement towards the nearest maximum. In this sense, for any function f on which we apply gradient descent, there is a symmetric function -f on which we can apply gradient ascent. This also means that a problem tackled through gradient descent has solutions that we can find through gradient ascent, if only we reflect it upon the axis of the independent variable. (A small numerical sketch comparing the two updates is given at the end of this scheme.) [Explanation-3M, Dig-2M]

10b. Explain deep learning on edge devices.
A: "Edge" here refers to computation that is performed locally on the consumer's product rather than on a remote server. Running deep learning on the edge alleviates the issues of sending data back and forth to a server and provides other benefits:
Bandwidth and latency
Security and decentralization
Job-specific usage
Swarm intelligence
Redundancy
Cost-effectiveness in the long run
The main constraint is the limited compute budget of edge hardware, which is addressed through parameter efficiency, pruning, and distillation. [Explanation-5M]
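As referenced in 10a, here is a small sketch in Python contrasting the two update rules on the illustrative function f(w) = -(w - 3)^2: gradient ascent on f and gradient descent on -f both converge to the same point, w = 3. The function, starting point, and learning rate are chosen only for illustration.

```python
# Gradient descent vs. gradient ascent (illustrative sketch) on f(w) = -(w - 3)**2.
# Ascent on f climbs to its maximum at w = 3; descent on -f reaches the same point.

def grad_f(w):
    return -2.0 * (w - 3.0)        # derivative of f(w) = -(w - 3)**2

alpha = 0.1                        # learning rate
w_ascent, w_descent = 0.0, 0.0     # same starting point for both

for _ in range(100):
    w_ascent = w_ascent + alpha * grad_f(w_ascent)        # ascent:  w_{n+1} = w_n + alpha * grad f(w_n)
    w_descent = w_descent - alpha * (-grad_f(w_descent))  # descent on -f: w_{n+1} = w_n - alpha * grad(-f)(w_n)

print(round(w_ascent, 4), round(w_descent, 4))  # both converge to 3.0
```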