DLVS
SCHEME OF EVALUATION
PART-A
1a. What is Visual Perception?
A: Visual perception, at its most basic, is the act of observing patterns and objects through sight or visual input. With an autonomous vehicle, for example, visual perception means understanding the surrounding objects and their specific details.
[2M]
1b. Define Input Vector and Weight Vector.
A: Input vector—The feature vector that is fed to the neuron. It is usually denoted with an uppercase X to represent a vector of inputs (x1, x2, ..., xn).
Weights vector—Each input xi is assigned a weight value wi that represents its importance in distinguishing between different input data points.
[3M]
1c. Give the formulas to calculate precision and recall.
A: Precision measures the fraction of predicted positives that are actually positive, and recall measures the fraction of actual positives that the model correctly identifies:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
where TP, FP, and FN are the counts of true positives, false positives, and false negatives.
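A minimal Python sketch of these formulas (the TP/FP/FN counts below are made-up example values):

def precision(tp, fp):
    # fraction of predicted positives that are truly positive
    return tp / (tp + fp)

def recall(tp, fn):
    # fraction of actual positives that the model found
    return tp / (tp + fn)

# example: 80 true positives, 20 false positives, 10 false negatives
print(precision(80, 20))  # 0.8
print(recall(80, 10))     # ~0.889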
[2M]
1d. List the layers used in image classification using an MLP and state the use of each layer.
A: The main components of the neural network architecture are as follows:
Input layer—Contains the feature vector.
Hidden layers—The neurons are stacked on top of each other in hidden layers. They are called "hidden" layers because we don't see or control the input going into these layers or the output. All we do is feed the feature vector to the input layer and see the output coming out of the output layer.
Weight connections (edges)—Weights are assigned to each connection between the nodes to reflect the importance of their influence on the final output prediction. In graph network terms, these are called edges connecting the nodes.
Output layer—Produces the network's final answer or prediction, with one node per class in an image classification task.
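For illustration, a hedged Keras sketch of an MLP image classifier (the 28 x 28 input size, layer widths, and 10 output classes are assumptions, not given in the question):

from keras.models import Sequential
from keras.layers import Flatten, Dense

model = Sequential([
    Flatten(input_shape=(28, 28)),     # input layer: flattens the image into a feature vector
    Dense(128, activation='relu'),     # hidden layer 1
    Dense(64, activation='relu'),      # hidden layer 2
    Dense(10, activation='softmax')    # output layer: one node per class
])
model.summary()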
[3M]
1e. What is a Dropout Layer?
A: A dropout layer is one of the most commonly used layers to prevent overfitting. Dropout turns off
a percentage of neurons (nodes) that make up a layer of your network.
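A hedged Keras example (the 0.5 dropout rate and layer sizes are arbitrary choices for illustration):

from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential([
    Dense(128, activation='relu', input_shape=(784,)),
    Dropout(0.5),                    # turn off 50% of the 128 neurons at each training step
    Dense(10, activation='softmax')
])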
[2M]
1f. Why is fine-tuning better than training from scratch?
A: When we train a network from scratch, we usually randomly initialize the weights and apply a gradient descent optimizer to find the best set of weights that optimizes our error function. Since these weights start with random values, there is no guarantee that they will begin with values that are close to the desired optimal values. And if the initialized values are far from the optimal values, the optimizer will take a long time to converge. This is when fine-tuning can be very useful.
[3M]
1g. Define the evaluation metrics in an object detection framework.
A: When evaluating the performance of an object detector, we use two main evaluation
metrics: frames per second and mean average precision.
FRAMES PER SECOND (FPS) TO MEASURE DETECTION SPEED
The most common metric used to measure detection speed is the number of frames per second
(FPS).
MEAN AVERAGE PRECISION (MAP) TO MEASURE NETWORK PRECISION
The most common evaluation metric used in object recognition tasks is mean average precision (mAP). It is a percentage from 0 to 100, and higher values are typically better, but its value is different from the accuracy metric used in classification.
[2M]
1h. Define Semantic Segmentation
A: Semantic segmentation is a computer vision task in which each pixel in an image is categorized into a class or object. The goal is to produce a dense pixel-wise segmentation map of the image, where each pixel is assigned to a specific class or object.
[3M]
1i. Define Optimization
A: Optimization is a way of framing a problem to maximize or minimize some value. The best thing
about computing an error function is that we turn the neural network into an optimization problem
where our goal is to minimize the error.
[2M]
1j. Differentiate between generator and discriminator.
A: The generator in a GAN learns to create fake data that looks like the real training data; its outputs become negative examples for the discriminator. The discriminator is simply a classifier: it tries to distinguish real data from the data created by the generator. Generator training requires tighter integration between the generator and the discriminator than discriminator training requires.
[3M]
PART-B
2a. Discuss the computer vision pipeline: the big picture, with a neat diagram.
A: [Diagram of the computer vision pipeline: input image -> preprocessing -> feature extraction -> classification model] [DIG-2M]
1. A computer receives visual input from an imaging device like a camera. This input is typically captured as an image or a sequence of images forming a video.
2. Each image is then sent through some preprocessing steps whose purpose is to standardize the images. Common preprocessing steps include resizing an image, blurring, rotating, changing its shape, or transforming the image from one color space to another, such as from color to grayscale. Only by standardizing the images (for example, making them the same size) can you then compare them and further analyze them.
3. We extract features. Features are what help us define objects, and they are usually information about object shape or color. For example, some features that distinguish a motorcycle are the shape of the wheels, headlights, mudguards, and so on. The output of this process is a feature vector, which is a list of unique shapes that identify the object.
4. The features are fed into a classification model. This step looks at the feature vector from the previous step and predicts the class of the image.
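As an illustration of the preprocessing step, a small OpenCV sketch (the file name and the 224 x 224 target size are assumptions):

import cv2

image = cv2.imread('input.jpg')                   # visual input captured from an imaging device
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)    # standardize: transform color to grayscale
resized = cv2.resize(gray, (224, 224))            # standardize: resize to a fixed size
blurred = cv2.GaussianBlur(resized, (5, 5), 0)    # standardize: blur to reduce noise
# the standardized image then goes on to feature extraction and the classification model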
[3M]
2b. What is the need of converting a color image to a grayscale image? Justify.
A: An image can be represented as a function of two variables x and y, which define a two-dimensional area. A digital image is made of a grid of pixels. The pixel is the raw building block of an image. Every image consists of a set of pixels whose values represent the intensity of light that appears in a given place in the image.
In color images, instead of representing the value of the pixel by just one number, the value is represented by three numbers representing the intensity of each color in the pixel. In an RGB system, for example, the value of the pixel is represented by three numbers: the intensity of red, the intensity of green, and the intensity of blue. There are other color systems for images, like HSV and Lab; all follow the same concept when representing the pixel value. Here is the function representing a color image in the RGB system:
Color image in RGB => F(x, y) = [ red(x, y), green(x, y), blue(x, y) ]
A grayscale image, by contrast, needs only one intensity value per pixel, so it is smaller and cheaper to process; when color is not needed to identify the objects of interest, converting to grayscale reduces computation without losing the relevant information. Thinking of an image as a function is very useful in image processing. We can think of an image as a function F(x, y) and operate on it mathematically to transform it into a new image function G(x, y).
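A minimal NumPy sketch of this idea, converting an RGB image function F(x, y) to a grayscale function G(x, y) with the standard luminance weights (the random image is only a placeholder):

import numpy as np

# F(x, y) = [red(x, y), green(x, y), blue(x, y)]: a 100 x 100 RGB image
rgb = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)

# G(x, y): one intensity value per pixel instead of three
gray = (0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]).astype(np.uint8)

print(rgb.shape, gray.shape)   # (100, 100, 3) (100, 100)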
[Explanation with example-5M]
Or
3a. What is an ANN? Draw an ANN and explain.
A: Computer vision algorithms are typically employed as interpreting devices. The interpreter is the brain of the vision system. Its role is to take the output image from the sensing device and learn features and patterns to identify objects. So we need to build a brain. Simple! Scientists were inspired by how our brains work and tried to reverse engineer the central nervous system to get some insight on how to build an artificial brain. Thus, artificial neural networks (ANNs) were born. We can see an analogy between biological neurons and artificial systems: both contain a main processing element, a neuron, with input signals (x1, x2, ..., xn) and an output. The learning behavior of biological neurons inspired scientists to create a network of neurons that are connected to each other. Imitating how information is processed in the human brain, each artificial neuron fires a signal to all the neurons that it's connected to when enough of its input signals are activated. Thus, neurons have a very simple mechanism on the individual level, but when millions of these neurons are stacked in layers and connected together, with each neuron connected to thousands of other neurons, a learning behavior emerges. Building a multilayer neural network is called deep learning.
[Explanation-3M,DIG-2M]
3b. Explain the weighted sum function.
A: Not all input features are equally important (or useful) features. Each input feature (xi) is assigned its own weight (wi) that reflects its importance in the decision-making process. Inputs assigned
greater weight have a greater effect on the output. If the weight is high, it amplifies the input signal;
and if the weight is low, it diminishes the input signal. In common representations of neural
networks, the weights are represented by lines or edges from the input node to the perceptron. For
example, if you are predicting a house price based on a set of features like size, neighborhood, and
number of rooms, there are three input features (x1, x2, and x3). Each of these inputs will have a
different weight value that represents its effect on the final decision. For example, if the size of the
house has double the effect on the price compared with the neighborhood, and the neighborhood
has double the effect
compared with the number of rooms, you will see weights something like 8, 4, and 2, respectively.
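A small Python sketch of the weighted sum z = w1*x1 + w2*x2 + w3*x3 + b for this house-price example (the feature values and the bias are made-up assumptions):

import numpy as np

x = np.array([1500, 7, 3])     # size, neighborhood score, number of rooms (example values)
w = np.array([8, 4, 2])        # size weighs twice as much as neighborhood, and so on
b = 1.0                        # bias term

z = np.dot(w, x) + b           # the weighted sum that is passed to the activation function
print(z)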
[Explanation 3M,Dig-2M]
4a. Draw the CNN architecture and explain it in detail.
A: Regular neural networks contain multiple layers that allow each layer to find successively complex
features, and this is the way CNNs work. The first layer of convolutions learns some basic features
(edges and lines), the next layer learns features that are a little more complex (circles, squares, and
so on), the following layer finds even more complex features (like parts of the face, a car wheel, dog
whiskers, and the like), and so on. You will see this demonstrated shortly. For now, know that the
CNN architecture follows the same pattern as neural networks: we stack neurons in hidden layers on
top of each other; weights are randomly initialized and learned during network training; and we apply activation functions, calculate the error (y – ŷ), and backpropagate the error to update the weights.
This process is the same. The difference is that we use convolutional layers instead of regular fully
connected layers for the feature learning part.
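A minimal Keras sketch that follows this pattern, with convolutional layers for feature learning and fully connected layers for classification (filter counts and the 28 x 28 x 1 input are assumptions):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),  # learns basic features (edges, lines)
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),                           # learns more complex features
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation='relu'),                                    # fully connected classification part
    Dense(10, activation='softmax')
])
model.summary()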
[Explanation-3M,Dig-2M]
4b. Describe the working principle of convolutional layers with a 3 × 3 convolution filter.
A: Convolutional layers are the major building blocks used in convolutional neural networks.
A convolution is the simple application of a filter to an input that results in an activation. Repeated
application of the same filter to an input results in a map of activations called a feature map,
indicating the locations and strength of a detected feature in an input, such as an image.
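A NumPy sketch of sliding one 3 x 3 filter over an input to produce a feature map (the vertical-edge filter values and the 6 x 6 input are illustrative assumptions):

import numpy as np

image = np.random.rand(6, 6)                   # a small 6 x 6 single-channel input
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])                # a 3 x 3 vertical-edge filter

h, w = image.shape
feature_map = np.zeros((h - 2, w - 2))         # the output shrinks when no padding is used
for i in range(h - 2):
    for j in range(w - 2):
        # element-wise multiply the filter with the 3 x 3 patch and sum the result
        feature_map[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(feature_map.shape)                       # (4, 4)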
[Explanation-3M,Dig-2M]
Or
5a. Discuss Data Augmentation
A: One way to avoid overfitting is to obtain more data. Since this is not always a feasible option, we
can augment our training data by generating new instances of the same images with some
transformations. Data augmentation can be an inexpensive way to give your learning algorithm more
training data and therefore reduce overfitting. The many image-augmentation techniques include
flipping, rotation, scaling, zooming, lighting conditions, and many other transformations that you can
apply to your dataset to provide a variety of images to train on. The main advantage of synthesizing
images like this is that now you have more data (20×) that tells your algorithm that if an image is the
digit 6, then even if you flip it vertically or horizontally or rotate it, it’s still the digit 6. This makes the
model more robust to detect the number 6 in any form and shape. Data augmentation is considered
a regularization technique because allowing the network to see many variants of the object reduces
its dependence on the original form of the object during feature learning. This makes the network
more resilient when tested on new data.
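Since the scheme allots one mark for code, a hedged Keras sketch using ImageDataGenerator (the specific transformation ranges are arbitrary example values):

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,        # rotate images by up to 20 degrees
    width_shift_range=0.1,    # shift horizontally by up to 10%
    height_shift_range=0.1,   # shift vertically by up to 10%
    zoom_range=0.2,           # zoom in or out by up to 20%
    horizontal_flip=True      # flip images left-right
)

# x_train/y_train would be the training images and labels; the generator yields augmented batches:
# model.fit_generator(datagen.flow(x_train, y_train, batch_size=32), epochs=10)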
[Explanation-3M,Dig-1M,Code-1M]
5b. Discuss Batch Normalization
A: The normalization techniques we discussed were focused on preprocessing the training set before feeding it to the input layer. If the input layer benefits from normalization, why not do the same thing for the extracted features in the hidden units, which are changing all the time, and get much more improvement in training speed and network resilience? This process is called batch normalization (BN).
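A minimal Keras sketch that inserts BatchNormalization between a hidden layer and its activation (layer sizes and placement are illustrative assumptions):

from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Activation

model = Sequential([
    Dense(128, input_shape=(784,)),
    BatchNormalization(),          # normalize the hidden-layer outputs over each mini-batch
    Activation('relu'),
    Dense(10, activation='softmax')
])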
[Explanation 3M,Dig-2M]
6a. Summarize CNN design patterns.
A: Pattern 1: Feature extraction and classification—Convolutional nets are typically composed of two
parts: the feature extraction part, which consists of a series of convolutional layers; and the
classification part, which consists of a series of fully connected layers. This is pretty much always the
case with ConvNets, starting from LeNet and AlexNet to the very recent CNNs that have come out in
the past few years, like Inception and ResNet.
Pattern 2: Image depth increases, and dimensions decrease—The input data at each layer is an
image. With each layer, we apply a new convolutional layer over a new image. This pushes us to think
of an image in a more generic way. First, you see that each image is a 3D object that has a height,
width, and depth. Depth is referred to as the color channel, where depth is 1 for grayscale images
and 3 for color images. In the later layers, the images still have depth, but they are not colors per se:
they are feature maps that represent the features extracted from the previous layers.
Pattern 3: Fully connected layers—This generally isn’t as strict a pattern as the previous two, but it’s
very helpful to know. Typically, all fully connected layers in a network either have the same number
of hidden units or decrease at each layer. It is rare to find a network where the number of units in
the fully connected layers increases at each layer.
[Explanation 3M,Digs-2M]
6b. Analyse the novel features of Inception.
A: There are some architectural decisions that you need to make for each layer when you are designing a network, such as these:
The kernel size of the convolutional layer—We've seen in previous architectures that the kernel size varies: 1 × 1, 3 × 3, 5 × 5, and, in some cases, 11 × 11 (as in AlexNet). When designing the convolutional layer, we find ourselves trying to pick and tune the kernel size of each layer that fits
our dataset. Smaller kernels capture finer details of the image, whereas bigger filters will leave out
minute details. When to use the pooling layer—AlexNet uses pooling layers every one or two
convolutional layers to downsize spatial features. VGGNet applies pooling after every two, three, or
four convolutional layers as the network gets deeper. Configuring the kernel size and positioning the
pool layers are decisions we make mostly by trial and error and experiment with to get the optimal
results. Inception says, “Instead of choosing a desired filter size in a convolutional layer and deciding
where to place the pooling layers, let’s apply all of them all together in one block and call it the
inception module.”
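A hedged Keras functional-API sketch of a naive inception module that applies 1 x 1, 3 x 3, and 5 x 5 convolutions and 3 x 3 max pooling in parallel (the filter counts and input shape are assumptions):

from keras.layers import Input, Conv2D, MaxPooling2D, concatenate
from keras.models import Model

inputs = Input(shape=(28, 28, 192))

tower_1 = Conv2D(64, (1, 1), padding='same', activation='relu')(inputs)
tower_2 = Conv2D(128, (3, 3), padding='same', activation='relu')(inputs)
tower_3 = Conv2D(32, (5, 5), padding='same', activation='relu')(inputs)
tower_4 = MaxPooling2D((3, 3), strides=(1, 1), padding='same')(inputs)

# concatenate the parallel outputs along the channel axis: the inception module's output
output = concatenate([tower_1, tower_2, tower_3, tower_4], axis=-1)
model = Model(inputs, output)
model.summary()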
[Explanation-3M,Dig-2M]
Or
7a. How does transfer learning work?
A: Recall that neural networks iteratively update their weights during the training cycle of
feedforward and backpropagation. We say the network has been trained when we go through a
series of training iterations and hyperparameter tuning until the network yields satisfactory results.
When training is complete, we output two main items:
the network architecture and the trained weights. So, when we say that we are going to use a
pretrained network, we mean that we will download the network architecture together with the
weights. During training, the model learns only the features that exist in this training dataset. But
when we download large models (like Inception) that have been trained on huge numbers of
datasets (like ImageNet), all the features that have already been extracted from these large datasets
are now available for us to use. I find that really exciting because these pretrained models have
spotted other features that weren’t in our dataset and will help us build better convolutional
networks.
In vision problems, there’s a huge amount of stuff for neural networks to learn about the training
dataset. There are low-level features like edges, corners, round shapes, curvy shapes, and blobs; and
then there are mid- and higher-level features like eyes, circles, squares, and wheels. There are many
details in the images that CNNs can pick up on—but if we have only 1,000 images or even 25,000
images in our training dataset, this may not be enough data for the model to learn all those things.
By using a pretrained network, we can basically download all this knowledge into our neural network
to give it a huge and much faster start with even higher performance levels.
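A hedged Keras sketch of this idea: download a pretrained network with its weights, freeze the feature extractor, and add a new classifier on top (the VGG16 base and the 2-class head are assumptions for illustration):

from keras.applications import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense

# download the network architecture together with the ImageNet-trained weights, minus the classifier
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

for layer in base.layers:
    layer.trainable = False        # freeze the pretrained feature extractor

x = Flatten()(base.output)
x = Dense(64, activation='relu')(x)
out = Dense(2, activation='softmax')(x)   # new classifier for our own small dataset

model = Model(base.input, out)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])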
[Explanation-3M,Dig-2M]
7b. List the open source datasets and explain in detail.
A: The CV research community has been pretty good about posting datasets on the internet. So,
when you hear names like ImageNet, MS COCO, Open Images, MNIST, CIFAR, and many others, these
are datasets that people have posted online and that a lot of computer researchers have used as
benchmarks to train their algorithms and get state-of-the-art results.
MNIST stands for Modified National Institute of Standards and Technology. It contains labeled
handwritten images of digits from 0 to 9. The goal of this dataset is to classify handwritten digits.
MNIST has been popular with the research community for benchmarking classification algorithms. In
fact, it is considered the "hello, world!" of image datasets. But nowadays, the MNIST dataset is comparatively pretty simple, and a basic CNN can achieve more than 99% accuracy, so MNIST is no longer considered a benchmark for CNN performance. MNIST consists of 60,000 training images and 10,000 test images. All are grayscale (one-channel), and each image is 28 pixels high and 28 pixels wide.
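Loading MNIST takes one call in Keras; a minimal sketch:

from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape)   # (60000, 28, 28): 60,000 grayscale 28 x 28 training images
print(x_test.shape)    # (10000, 28, 28): 10,000 test images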
[5M]
8a. Explain R-CNN architecture with neat diagram
A: R-CNN is the least sophisticated region-based architecture in its family, but it is the basis for
understanding how multiple object-recognition algorithms work for all of them. It was one of the
first large, successful applications of convolutional neural networks to the problem of object
detection and localization, and it paved the way for the other advanced detection algorithms. The
approach was demonstrated on benchmark datasets, achieving then-state-of-the-art results on the
PASCAL VOC-2012 dataset and the ILSVRC 2013 object detection challenge.
The R-CNN model consists of four components:
 Extract regions of interest—Also known as extracting region proposals. These regions have a
high probability of containing an object. An algorithm called selective search scans the input
image to find regions that contain blobs, and proposes them as RoIs to be processed by the
next modules in the pipeline. The proposed RoIs are then warped to have a fixed size; they
usually vary in size, but as we learned in previous chapters, CNNs require a fixed input image
size.
 Feature extraction module—We run a pretrained convolutional network on top of the region
proposals to extract features from each candidate region. This is the typical CNN feature
extractor that we learned about in previous chapters.
 Classification module—We train a classifier like a support vector machine (SVM), a
traditional machine learning algorithm, to classify candidate detections based on the
extracted features from the previous step.
 Localization module—Also known as a bounding-box regressor. Let's take a step back to understand regression. ML problems are categorized as classification or regression problems.
Classification algorithms output discrete, predefined classes (dog, cat, elephant), whereas
regression algorithms output continuous value predictions.
[Explanation-3M,Dig-2M]
8b. Explain the high-level SSD architecture with a neat diagram.
A: The SSD approach is based on a feed-forward convolutional network that produces a
fixed-size collection of bounding boxes and scores for the presence of object class instances
in those boxes, followed by a non-maximum suppression (NMS) step to produce the final detections. The architecture of the SSD model is composed of three main parts:
Base network to extract feature maps—A standard pretrained network used for high-quality image classification, which is truncated before any classification layers. In their paper, Liu et al. used a VGG16 network. Other networks like VGG19 and ResNet can be used and should produce good results.
Multi-scale feature layers—A series of convolution filters are added after the base
network. These layers decrease in size progressively to allow predictions of detections at
multiple scales.
Non-maximum suppression—NMS is used to eliminate overlapping boxes and
keep only one box for each object detected.
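A simplified NumPy sketch of the NMS step (the 0.5 IoU threshold is an assumed default):

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    # boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidence scores
    order = scores.argsort()[::-1]         # process the highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the kept box with the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_rest - inter)
        # drop boxes that overlap the kept box too much; the rest go to the next round
        order = order[1:][iou < iou_threshold]
    return keep

# example: two overlapping detections of the same object plus one separate detection
boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 150, 150]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]: the lower-scoring overlapping box is suppressed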
[Explanation-3M,Dig-2M]
Or
9a. How does instance segmentation work?
A: Instance segmentation involves classifying pixels based on the instances of an object (as
opposed to object classes). Instance segmentation algorithms do not know which class each
region belongs to—rather, they separate similar or overlapping regions based on the
boundaries of objects. We can refer to Instance Segmentation as a combination of semantic
segmentation and object detection (detecting all instances of a category in an image) with
the additional feature of demarcating separate instances of any particular segment class
added to the vanilla segmentation task.
Instance segmentation produces a richer output format than both object detection and semantic segmentation networks. With instance segmentation, one can find the bounding box of each instance (for example, a dog and two cats in the same image) as well as the object segmentation map for each instance, thereby knowing the number of instances in the image.
[Explanation-5M]
9b. Discuss Variational Autoencoders.
A: In order to be able to use the decoder of our autoencoder for generative purposes, we have to be sure that the latent space is regular enough. One possible solution to obtain such regularity is to introduce explicit regularisation during the training process. Thus, a variational autoencoder can be defined as an autoencoder whose training is regularised to avoid overfitting and to ensure that the latent space has good properties that enable a generative process.
Just as a standard autoencoder, a variational autoencoder is an architecture composed of both an encoder and a decoder that is trained to minimise the reconstruction error between the encoded-decoded data and the initial data. However, in order to introduce some regularisation of the latent space, we make a slight modification of the encoding-decoding process: instead of encoding an input as a single point, we encode it as a distribution over the latent space. The model is then trained as follows:
1. The input is encoded as a distribution over the latent space.
2. A point from the latent space is sampled from that distribution.
3. The sampled point is decoded and the reconstruction error can be computed.
4. The reconstruction error is backpropagated through the network.
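A small NumPy sketch of the sampling step (the reparameterisation trick), which keeps step 2 differentiable; the 2-dimensional latent values are made-up examples:

import numpy as np

def sample_latent(z_mean, z_log_var):
    # the encoder outputs a distribution (mean, log-variance) instead of a single point
    epsilon = np.random.normal(size=z_mean.shape)        # noise from a standard normal
    return z_mean + np.exp(0.5 * z_log_var) * epsilon    # z = mu + sigma * eps

z_mean = np.array([0.2, -1.0])        # example latent mean from the encoder
z_log_var = np.array([0.1, 0.3])      # example latent log-variance from the encoder
z = sample_latent(z_mean, z_log_var)  # the point that is fed to the decoder
print(z)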
[Explanation-5M]
Or
10a. Compare gradient ascent and gradient descent.
A: Gradient descent is an iterative process through which we optimize the parameters of a
machine learning model. It’s particularly used in neural networks, but also in logistic
regression and support vector machines.
It's the most typical method for iterative minimization of a cost function. Its major limitation, though, is that its convergence is guaranteed only to a local, not necessarily global, minimum.
A hyperparameter α, also called the learning rate, allows fine-tuning of the process of descent. In particular, with an appropriate choice of α, we can escape convergence to a local minimum and descend towards a global minimum instead.
The gradient is calculated with respect to a vector of parameters for the model, typically the
weights w. In neural networks, the process of applying gradient descent to the weight matrix
takes the name of the backpropagation of the error.
Gradient ascent works in the same manner as gradient descent, with one difference. The
task it fulfills isn’t minimization, but rather maximization of some function. The reason for
the difference is that, at times, we may want to reach the maximum, not the minimum of
some function; this is the case, for instance, if we’re maximizing the distance between
separation hyperplanes and observations.
For this reason, the formula that describes gradient ascent is similar to the one for gradient descent, only with a flipped sign. Gradient descent updates the weights as
w_{n+1} = w_n - α ∇_w f(w_n)
while gradient ascent uses
w_{n+1} = w_n + α ∇_w f(w_n)
If gradient descent indicates an iterative movement towards the closest minimum, gradient
ascent, conversely, indicates a movement towards the nearest maximum. In this sense, for
any function f on which we apply gradient descent, there is a symmetric function -f on which
we can apply gradient ascent.
This also means that a problem tackled through gradient descent has solutions that we can find through gradient ascent, if only we reflect the function about the axis of the independent variable (that is, along the x-axis).
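A minimal Python sketch showing that the two updates differ only by the sign of the step (the function f(w) = (w - 3)^2 and the learning rate are assumptions):

def grad_f(w):
    return 2 * (w - 3)                     # gradient of f(w) = (w - 3)**2, minimum at w = 3

alpha = 0.1
w_d, w_a = 0.0, 0.0
for _ in range(100):
    w_d = w_d - alpha * grad_f(w_d)        # gradient descent on f
    w_a = w_a + alpha * (-grad_f(w_a))     # gradient ascent on -f (flipped sign)

print(w_d, w_a)   # both converge to 3: the minimum of f is the maximum of -f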
[EXPLANATION-3M,DIG-2M]
10b. Explain Deep Learning on edge devices.
A: Deep learning on the edge alleviates the issues of sending data to the cloud and provides other benefits. "Edge" here refers to computation that is performed locally on the consumer's device. The benefits of using edge computing for deep learning include:
Bandwidth and latency
Security and decentralization
Job-specific usage
Swarm intelligence
Redundancy
Cost effective in the long run
The main constraint is the limited compute and memory of edge devices, which is addressed by making models parameter efficient, for example through:
Pruning
Distillation
[EXPLANATION-5M]