
Lec2 CNN

Convolutional Neural Networks (CNN)
CNNs are the state of the art for image processing / deep learning with images as input
they apply filters to the input data; essentially, the filters replace the weights, with different convolution kernels being responsible for the filter effect
two main ideas:
give a better structure to the NN: instead of connecting everything with everything, connect neurons of one layer only with neighboring neurons of the next layer
use the same weights for different parts of the image; intuitively, if a feature is interesting in one part of an image, it will probably also be interesting in another part
Convolutions
to convolve (German: falten, "to fold") = applying a filter to a function; filter in the sense of a matrix/grid of values that alters the output of a given function
Discrete Case: Box Filter
Sliding the filter kernel from left to right, multiplying and summing up the overlapping fields at every position (see the sketch below)
applying the same filter to all pixels of an image is the idea of weight sharing
handling the borders: either ignore them → shrinking image; or padding: add 0s so a value can be computed everywhere
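A minimal numpy sketch of this sliding-window operation (strictly speaking it computes cross-correlation, i.e. without flipping the kernel, which is what deep learning frameworks compute as well):

```python
import numpy as np

def conv2d(image, kernel, padding=0):
    """Slide a kernel over an image, multiplying and summing the
    overlapping fields at every position (stride 1)."""
    if padding > 0:
        image = np.pad(image, padding)  # zero-padding around the border
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

box = np.ones((3, 3)) / 9.0                 # 3x3 box filter: local average
img = np.arange(25, dtype=float).reshape(5, 5)
print(conv2d(img, box).shape)               # (3, 3): output shrinks
print(conv2d(img, box, padding=1).shape)    # (5, 5): size preserved
```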
Convolution on Images
A 5x5 image with a convolutional filter of size 3x3 generates an output of size 3x3
example computation, reduced to the actually necessary (non-zero) terms:
3 ⋅ 0 + 3 ⋅ (−1) + 5 ⋅ 0 + 1 ⋅ (−1) + 4 ⋅ 5 + 4 ⋅ (−1) + 7 ⋅ 0 + 9 ⋅ (−1) + (−1) ⋅ 0
= 3 ⋅ (−1) + 1 ⋅ (−1) + 4 ⋅ 5 + 4 ⋅ (−1) + 9 ⋅ (−1)
= −3 − 1 + 20 − 4 − 9 = 20 − 17 = 3
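The same computation in numpy; the patch and kernel are read off from the products above (the kernel is the classic 3x3 sharpening filter), so treat the exact values as an assumption:

```python
import numpy as np

patch  = np.array([[3, 3,  5],    # image values under the filter window
                   [1, 4,  4],
                   [7, 9, -1]])
kernel = np.array([[ 0, -1,  0],  # 3x3 sharpening kernel
                   [-1,  5, -1],
                   [ 0, -1,  0]])
print(np.sum(patch * kernel))     # 3, matching the hand computation
```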
Image filter examples → this is exactly how filters are applied by any image-editing application, e.g. Instagram
In CNNs these filters represent the weights of the network
Convolutions on RGB Images
images have depth: due to the RGB split we have 3 channels
the depth dimension of the image must match the depth of the filter (convolutional kernel)
same procedure as before: slide the filter over the image and apply it via a dot product at every position, resulting in zᵢ = wᵀxᵢ + b, where the weights w and the patch xᵢ are both (5 ⋅ 5 ⋅ 3) × 1 vectors; the weights represent the filter, and note that each output zᵢ is of dimension 1 (a scalar)
Example: a 32 x 32 x 3 image with a 5 x 5 x 3 filter results in a 28 x 28 output image without padding
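A sketch of this per-position dot product in numpy (random values, just to show the shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))  # H x W x 3 (RGB)
w     = rng.standard_normal((5, 5, 3))    # filter depth matches image depth
b     = 0.1

out = np.zeros((28, 28))                  # 32 - 5 + 1 = 28, no padding
for i in range(28):
    for j in range(28):
        x_i = image[i:i+5, j:j+5, :]      # 5x5x3 patch at this position
        out[i, j] = np.sum(w * x_i) + b   # z_i = w^T x_i + b (a scalar)
print(out.shape)                          # (28, 28)
```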
Convolution Layer
def.: applying different filters to the same image; every filter we apply creates a new activation map, e.g. applying 2 filters to a 32 x 32 x 3 image results in a 28 x 28 x 2 output (see the sketch below)
a layer is defined by its filter width & height; the filter depth is implicitly given by the input depth (via the dot product)
the number of activation maps is defined by the number of different filters (i.e. sets of weights)
each filter captures a different image characteristic, e.g. horizontal/vertical edges, circles, squares, etc.
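The 2-filter example above as a minimal sketch in PyTorch:

```python
import torch
import torch.nn as nn

# 2 filters of size 5x5 applied to a 3-channel image
conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=5)
x = torch.randn(1, 3, 32, 32)  # batch x channels x height x width
print(conv(x).shape)           # torch.Size([1, 2, 28, 28])
```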
Dimensions of Convolutional Layers - Examples
stride triggers a jump of the filter, e.g. stride = 2 moves the filter two pixels at a time
without padding the outputs shrink with every layer, which is not a good idea
padding ensures that corner pixels are considered as well and that image sizes don't shrink as quickly as they otherwise would → most common padding: zero-padding, leading to output size:
((N + 2 ⋅ P − F) / S + 1) × ((N + 2 ⋅ P − F) / S + 1)
N: width of the image
F: width of the filter
P: amount of padding; should usually be set to P = (F − 1) / 2
S: stride
number of parameters (weights): each number in a filter counts as one weight, i.e. a 5x5x3 filter has 5 ⋅ 5 ⋅ 3 + 1 = 76 parameters (+1 for the bias of each filter); if we apply 10 filters we have a total of 76 ⋅ 10 = 760 parameters (see the helper below)
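A small Python helper verifying the output-size formula and the parameter count:

```python
def conv_output_size(n, f, p, s):
    """Output width/height of a conv layer: (N + 2P - F) / S + 1."""
    return (n + 2 * p - f) // s + 1

def conv_num_params(f, depth, num_filters):
    """Each filter has f*f*depth weights plus one bias."""
    return (f * f * depth + 1) * num_filters

print(conv_output_size(32, 5, 0, 1))  # 28: no padding
print(conv_output_size(32, 5, 2, 1))  # 32: "same" padding, P = (F - 1) / 2
print(conv_num_params(5, 3, 10))      # 760
```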
Exam Example Question
Convolutional Neural Network (CNN)
concatenation of convolutional layers and activations
Pooling
another operator heavily used in CNNs
using padding ensures that images don't shrink as we apply the filters; pooling allows us to shrink images deliberately, when required → reducing the feature map size
pooling is essentially downsampling, usually by a factor of 2
Different ways:
Max Pooling: define equally sized regions within the input, then create a new pooled output consisting of the highest number from each corresponding input region
if more than one highest number exists within a region, just take either one
Average Pooling: averaging all values of a region instead of taking the max value
conv layer = feature extraction, computing a feature in a given region; pooling layer = feature selection, picking the strongest activation in a region
most common pooling setting: 2 x 2, e.g. an image of 200x200 results in 100x100 (see the sketch below)
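A minimal numpy sketch of 2x2 max pooling with stride 2:

```python
import numpy as np

def max_pool_2x2(x):
    """Keep the largest value of every non-overlapping 2x2 region,
    halving width and height."""
    h, w = x.shape
    return x[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 7],
              [8, 9, 1, 0],
              [2, 3, 4, 5]], dtype=float)
print(max_pool_2x2(x))
# [[6. 7.]
#  [9. 5.]]
```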
Other properties
Example of a complete network using convolutions, ReLU as the activation function, and pooling to shrink the image size
CNN prototype; a fully connected (FC) layer applies brute force, connecting everything with everything, not using shared weights and thus not applying any inductive bias
Convolutions allow us to structure a neural network
Receptive Field
describes the field of input pixels from which a value in a feature map has been computed (through the chain of dot products)
the deeper one goes into a network, the bigger the receptive field becomes
preferably, use more layers with smaller filters (e.g. 3 layers with 3x3 filters instead of one 7x7 layer), as this also injects more non-linearity (with every additional layer) and uses fewer weights → less overfitting (see the comparison below)
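A quick comparison backing this up (the channel count of 64 is an assumption for illustration):

```python
def receptive_field(num_layers, kernel=3):
    """Receptive field of stacked stride-1 conv layers:
    grows by (kernel - 1) per layer."""
    return 1 + num_layers * (kernel - 1)

print(receptive_field(3, kernel=3))  # 7: three 3x3 layers see a 7x7 region

C = 64  # input and output channels, biases ignored
print(3 * (3 * 3 * C * C))  # three 3x3 layers: 27*C^2 = 110592 weights
print(1 * (7 * 7 * C * C))  # one   7x7 layer:  49*C^2 = 200704 weights
```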
Classic Architectures
LeNet
32x32x1 input: recognition of greyscale images, therefore only 1 as the 3rd dimension; classifies into 10 classes
on a high level: gradually reduce the spatial dimensions
Test benchmarks: the ImageNet dataset; the ImageNet Large Scale Visual Recognition Challenge marked a key milestone in DL
Common performance metrics use top-k scores (see the sketch below):
top-1 score: checks whether the sample's top class, i.e. the one with the highest predicted probability, is the same as the target label
top-5 score: checks whether the target label is among the 5 predictions with the highest probability → top-5 error: percentage of test samples for which the correct class wasn't among the top 5 predicted classes
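A minimal numpy sketch of top-k accuracy:

```python
import numpy as np

def top_k_accuracy(probs, labels, k=5):
    """Fraction of samples whose true label is among the k classes
    with the highest predicted probability."""
    top_k = np.argsort(probs, axis=1)[:, -k:]  # indices of the k best classes
    return np.mean([labels[i] in top_k[i] for i in range(len(labels))])

probs  = np.array([[0.1, 0.2, 0.7],   # predicts class 2
                   [0.5, 0.3, 0.2]])  # predicts class 0
labels = np.array([2, 1])
print(top_k_accuracy(probs, labels, k=1))  # 0.5
print(top_k_accuracy(probs, labels, k=2))  # 1.0
```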
AlexNet has about 60 million parameters
1000 outputs for 1000 classes: to get from the 6x6x256 spatial data to class scores, fully connected layers convert the data into 9216 values (6 ⋅ 6 ⋅ 256), then 4096, again 4096, and finally 1000
VGGNet simplifies AlexNet by fixing CONV = 3x3 filters with stride 1 & MAXPOOL = 2x2 filters with stride 2
again alternating between CONV & POOL across 16 layers; again width & height decrease while the number of filters increases as we go deeper, resulting in 138 million parameters
Skip Connections - ResNet
The problem of depth: why don't we simply add more layers? → more and more layers make training harder; gradients explode and vanish
Residual block: how can we train very deep nets (i.e. more layers) while keeping training stable?
skip connection: taking the output from layer L−1 directly to layer L+1
ResNet Block
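A minimal PyTorch sketch of such a block (real ResNet blocks additionally use batch normalization, omitted here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """The input skips past two conv layers and is added back before the
    final activation, so the block only has to learn a residual on top
    of the identity."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = self.conv2(out)
        return F.relu(out + x)  # skip connection: add the input back

x = torch.randn(1, 64, 28, 28)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 28, 28])
```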
ResNets come with a set of good network design choices and are mostly used in computer vision networks to classify images
Why do ResNets work?
if these weights become 0, the output of layer L+1 will simply equal the output of layer L−1: the block falls back to the identity and nothing changes; in a plain network, by contrast, vanishing gradients are the reason why we can't stack an unlimited number of plain layers
1x1 Convolutions
on a single channel, a 1x1 convolution simply scales the input by a constant while keeping its spatial dimensions; across several channels it computes a per-pixel linear combination of the channels
useful to shrink the number of channels; followed by an activation it also adds non-linearity, allowing us to learn more complex functions (see the sketch below)
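A sketch of shrinking the channel count with a 1x1 convolution (the channel numbers are assumptions for illustration):

```python
import torch
import torch.nn as nn

# 1x1 convolution: 192 channels -> 16 channels, spatial size unchanged
reduce = nn.Conv2d(in_channels=192, out_channels=16, kernel_size=1)
x = torch.randn(1, 192, 28, 28)
print(reduce(x).shape)  # torch.Size([1, 16, 28, 28])
```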
Inception Layer
core idea: too many channels result in huge computational costs; reduce the number of channels with a 1x1 convolution
instead of finding the perfect filter size → choose them all: "same" convolutions of different sizes + 3x3 max pooling with stride 1, concatenated along the depth (see the sketch below)
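A sketch of such an inception block in PyTorch (the per-branch channel counts are assumptions):

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Run 'same' convolutions of different sizes plus a stride-1 max pool
    in parallel and concatenate all outputs along the channel dimension."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, 32, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, 8, kernel_size=5, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        outs = [self.branch1(x), self.branch3(x), self.branch5(x), self.pool(x)]
        return torch.cat(outs, dim=1)  # stack the feature maps depth-wise

x = torch.randn(1, 192, 28, 28)
print(InceptionBlock(192)(x).shape)  # torch.Size([1, 248, 28, 28])
```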
Computational cost: inserting a bottleneck layer of 32x32x16 (via a 1x1 convolution) saves a lot of computational effort
GoogLeNet uses inception blocks with an extra max pool layer added to reduce dimensionality
Xception Net is the extreme version of Inception, applying depthwise separable convolutions instead of normal convolutions; 36 conv layers structured into several modules with skip connections
depthwise separable convolutions use a different filter for each slice of the depth (e.g. one per RGB channel), followed by a 1x1 pointwise convolution that mixes the channels → reduces the number of computations significantly (see the comparison below)
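A back-of-the-envelope multiplication count showing the savings (the layer sizes are assumptions):

```python
# One layer: 28x28 input with 3 channels, 3x3 filters, 16 output channels,
# 'same' padding, stride 1
H = W = 28
C_in, C_out, F = 3, 16, 3

standard  = H * W * C_out * (F * F * C_in)  # normal convolution
depthwise = H * W * C_in * (F * F)          # one 3x3 filter per channel
pointwise = H * W * C_out * C_in            # 1x1 mixing of the channels
print(standard)                # 338688 multiplications
print(depthwise + pointwise)   # 58800 -> far fewer
```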
Fully Convolutional Network
convolutions act as feature extraction methods
a fully convolutional network ensures that the last few layers take the activation/feature maps and turn that information into a classification result
the fully connected layers are converted into convolutional layers using 1x1 convolutions, as these are exactly equivalent to the fully connected network layers (see the sketch below)
using bigger images is then not a problem, resulting in an output of (H/32 x W/32 x # of channels)
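A sketch of this equivalence: a fully connected layer applied per position is exactly a 1x1 convolution (layer sizes borrowed from the AlexNet numbers above):

```python
import torch
import torch.nn as nn

fc   = nn.Linear(4096, 1000)
conv = nn.Conv2d(4096, 1000, kernel_size=1)
conv.weight.data = fc.weight.data.view(1000, 4096, 1, 1)  # reuse FC weights
conv.bias.data   = fc.bias.data

x = torch.randn(1, 4096, 1, 1)  # a 1x1 feature map with 4096 channels
print(torch.allclose(conv(x).flatten(), fc(x.flatten(1)).flatten(),
                     atol=1e-5))  # True: identical outputs
```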
Semantic segmentation: reduce the dimension of the information, then increase it back to the original image size in the last layer/step → how do we upsample / go back to the original size?
interpolation: double up the size, e.g. nearest-neighbour interpolation (a pixel without a value looks at its nearest neighbouring pixel and copies its value), bilinear interpolation (looking at several neighbours and taking a weighted average of their values), bicubic interpolation (again taking values from neighbours)
transposed convolution: taking the representation, blowing it up by spreading the given information across the new spatial dimensions, then processing the representation with a series of convolutions
performing unpooling: initializing all empty spaces to 0, then continuing with convolutions to adjust the 0 values (see the sketch below)
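A sketch of both upsampling routes in PyTorch (the feature map sizes are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 256, 14, 14)  # low-resolution feature map

# Interpolation: fixed rules, no learned parameters
up_nn = F.interpolate(x, scale_factor=2, mode="nearest")
up_bi = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

# Transposed convolution: learned upsampling that spreads each input value
# over a larger output before summing the overlaps
up_tc = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)(x)
print(up_nn.shape, up_bi.shape, up_tc.shape)
# all spatially 28x28; the transposed conv also changed 256 -> 128 channels
```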
U-Net
from left (contraction path, i.e. encoder) to right (expansion path, i.e. decoder), performing a series of convolutions (feature extraction) and pooling (feature selection) → during encoding we lose spatial detail, therefore encoder results are copied over to the decoder so that it also has access to the earlier information (see the sketch below)
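A minimal sketch of the copy-and-concatenate skip connection (sizes are assumptions):

```python
import torch

# Encoder features stored during contraction are concatenated with the
# upsampled decoder features of the same spatial size, restoring the
# spatial detail lost during encoding
encoder_feat = torch.randn(1, 64, 56, 56)  # copied from the encoder
decoder_feat = torch.randn(1, 64, 56, 56)  # upsampled in the decoder
merged = torch.cat([encoder_feat, decoder_feat], dim=1)
print(merged.shape)  # torch.Size([1, 128, 56, 56]) -> fed to further convs
```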