Convolutional Neural Networks (CNN)

- CNNs are the state of the art for image processing / deep learning with images as input
- they apply filters to the input data; essentially, the filters replace the weights, with the kernel convolution operations being responsible for the filter effect
- two main ideas:
  - give a better structure to the NN: instead of connecting everything with everything, connect neurons of one layer only with neighboring neurons of the next layer
  - use the same weights for different parts of the image; intuitively, if a feature is interesting in one part of an image, it will probably also be interesting in another part

Convolutions

- to convolve (German: falten, "to fold") = to apply a filter to a function; filter in the sense of a matrix/grid of values that alters the output of a given function

Discrete Case: Box Filter

- slide the filter kernel from left to right, multiplying and summing up the overlapping fields at every position
- applying the same filter to all pixels of an image is the idea of weight sharing
- handling the border fields: either ignore them → the image shrinks; or use padding: add 0s so a value can still be computed

Convolution on Images

- a 5×5 image with a 3×3 convolutional filter generates an output of size 3×3
- example computation; dropping the zero terms reduces it to the actually necessary computations:

  $3 \cdot 0 + 3 \cdot (-1) + 5 \cdot 0 + 1 \cdot (-1) + 4 \cdot 5 + 4 \cdot (-1) + 7 \cdot 0 + 9 \cdot (-1) + (-1) \cdot 0$
  $= 3 \cdot (-1) + 1 \cdot (-1) + 4 \cdot 5 + 4 \cdot (-1) + 9 \cdot (-1) = -3 - 1 + 20 - 4 - 9 = 20 - 17 = 3$

Image Filter Examples

- this is exactly how filters are applied by any image-altering application, e.g. Instagram
- in CNNs, these filters represent the weights of the network

Convolutions on RGB Images

- images have depth: the RGB split gives us 3 channels
- the depth dimension of the image must match the depth of the filter (convolution kernel)
- same procedure as before: slide the filter over the image and apply it via dot product at every position, resulting in

  $z_i = \mathbf{w}^T \mathbf{x}_i + b, \quad \mathbf{w}, \mathbf{x}_i \in \mathbb{R}^{(5 \cdot 5 \cdot 3) \times 1}$

  where the weights $\mathbf{w}$ represent the filter; note that the output $z_i$ has dimension 1 (a scalar)
- example: a 32×32×3 image results in a 28×28 output image without padding

Convolution Layer

- definition: apply different filters to the same image; every filter applied creates a new output channel (activation map), e.g. applying 2 filters to a 32×32×3 image results in a 28×28×2 output
- a layer is defined by filter width & height; the depth is implicitly given by the dot product over the full input depth
- the number of output channels is defined by the number of different sets of weights (i.e. filters)
- each filter captures a different image characteristic, e.g. horizontal/vertical edges, circles, squares, etc.

Dimensions of Convolutional Layers - Examples

- stride: how far the filter jumps between positions, e.g. stride = 2 moves the filter two pixels at a time
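As a concrete illustration, a minimal NumPy sketch of this sliding-window operation (strictly speaking, CNNs compute a cross-correlation: the kernel is not flipped). The patch below reconstructs the worked sharpening example from above; the function name is my own.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image`, multiplying and summing the overlapping
    fields at every position (no padding, so the output shrinks)."""
    H, W = image.shape
    F = kernel.shape[0]                          # assume a square F x F kernel
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride + F, j*stride:j*stride + F]
            out[i, j] = np.sum(patch * kernel)   # dot product of the overlap
    return out

# Sharpening kernel from the worked example; its zero entries are exactly
# the terms that drop out of the "reduced" computation above.
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])

patch = np.array([[3, 3,  5],
                  [1, 4,  4],
                  [7, 9, -1]])   # the corner value contributes nothing: the kernel zeroes it out

print(conv2d(patch, sharpen))    # [[3.]] -- matches the worked example
```

Calling `conv2d` with `stride=2` makes the filter jump two pixels at a time, matching the stride example above.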
- without padding, the outputs shrink with every layer, which is not a good idea
- padding assures that corner pixels are considered as well and that image sizes don't shrink as quickly as they otherwise would → most common: zero-padding, leading to the output size

  $\left( \frac{N + 2P - F}{S} + 1 \right) \times \left( \frac{N + 2P - F}{S} + 1 \right)$

  - $N$: width of the image
  - $F$: width of the filter
  - $P$: amount of padding; should usually be set to $P = \frac{F - 1}{2}$
  - $S$: stride
- number of parameters (weights): each number in a filter counts as one weight, i.e. a 5×5×3 filter has 5·5·3 + 1 = 76 parameters (+1 for the bias of every filter); if we apply 10 filters we have a total of 76 · 10 = 760 parameters (a small helper for both formulas is sketched at the end of this section)

Exam Example Question

Convolutional Neural Network (CNN)

- concatenation of convolutional layers and activations

Pooling

- another operator heavily used in CNNs
- padding assures that the images don't shrink as we apply the filters; pooling allows us to shrink images nevertheless, but only where required → reducing the feature map size
- pooling is the same as downsampling, usually by a factor of 2
- different ways (see the max-pooling sketch at the end of this section):
  - Max Pooling: define equally sized regions within the input, then create a new pooled output consisting of the highest number from each corresponding input region; if a region contains the highest number more than once, just take either
  - Average Pooling: average all values of a region instead of taking the max value
- conv layer = feature extraction, computing a feature in a given region; pooling layer = feature selection, picking the strongest activation in a region
- most common pooling setting: 2×2, e.g. an image of 200×200 results in 100×100

Other properties

- example of a complete network: using convolution, ReLU as activation function, and pooling to shrink the image size
- CNN prototype: a fully connected (FC) layer applies brute force, connecting everything with everything, not using shared weights and thus not applying any inductive bias; convolutions allow us to structure a neural network

Receptive Field

- describes the field of input pixels from which a value in a feature map has been computed (through the chain of dot products)
- the deeper one goes into a network, the bigger the receptive field becomes
- preferably, use more layers with smaller filters (e.g. 3 layers with 3×3 filters, which cover the same receptive field as one 7×7 layer), as this also injects more non-linearity (with every additional layer) and needs fewer weights → less overfitting
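A small helper, assuming square images and filters, that sanity-checks the output-size formula and the parameter count from this section (the function names are mine, not from the lecture):

```python
def conv_output_size(N, F, P=0, S=1):
    """Output width per spatial dimension: (N + 2P - F) / S + 1."""
    return (N + 2 * P - F) // S + 1

def conv_params(F, depth, n_filters):
    """Each filter carries F*F*depth weights plus one bias."""
    return (F * F * depth + 1) * n_filters

print(conv_output_size(32, 5))        # 28 -> the 32x32x3 example without padding
print(conv_output_size(32, 5, P=2))   # 32 -> size preserved with P = (F - 1) / 2
print(conv_params(5, 3, 1))           # 76
print(conv_params(5, 3, 10))          # 760
```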
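And a minimal max-pooling sketch using reshaping; swapping `.max` for `.mean` would turn it into average pooling:

```python
import numpy as np

def max_pool(x, size=2):
    """Split `x` into size x size regions and keep the maximum of each
    (assumes the input dimensions are divisible by `size`)."""
    H, W = x.shape
    return x.reshape(H // size, size, W // size, size).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool(x))    # [[ 5  7]
                      #  [13 15]] -- a 4x4 map shrinks to 2x2
```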
Classic Architectures

LeNet

- 32×32×1 image; recognition of grey-scale images, therefore only 1 as the 3rd dimension
- classifies into 10 classes
- on a high level: gradually reduce the spatial dimensions

Test Benchmarks: ImageNet

- the ImageNet dataset and the ImageNet Large Scale Visual Recognition Competition marked a key milestone in DL

Common Performance Metrics

- top-k scores:
  - top-1 score: checking whether the sample's class with the highest predicted probability is the same as the target label
  - top-5 score: checking whether any of the 5 predictions with the highest probability matches → top-5 error: percentage of test samples for which the correct class wasn't among the top 5 predicted classes

AlexNet

- has about 60 million parameters
- 1000 outputs for 1000 classes: to get from the spatial 6×6×256 data to class scores, we use fully connected layers, converting the data into 9216 values (= 6·6·256), then 4096, again 4096, and finally 1000

VGGNet

- simplifies AlexNet by fixing CONV = 3×3 filters with stride 1 and MAXPOOL = 2×2 filters with stride 2
- again alternates between CONV & POOL across 16 layers; again width & height decrease and the number of filters increases as we go deeper
- results in 138 million parameters

Skip Connections - ResNet

- the problem of depth - why don't we simply add more layers? → more and more layers make training harder: gradients explode and vanish

Residual Block

- how can we train very deep nets (i.e. more layers) while keeping training stable? skip connection: feeding the output of layer L−1 directly into layer L+1 (a minimal sketch follows below)

ResNet Block

- ResNets come with a set of good network design choices - mostly used for computer vision networks that classify images
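A minimal sketch of the skip-connection idea, assuming plain fully connected weight matrices for readability; a real ResNet block uses two 3×3 convolutions with batch normalization instead:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, W1, W2):
    """y = ReLU(F(x) + x): the input skips past two layers and is
    added back before the final activation."""
    out = relu(W1 @ x)    # first layer + activation
    out = W2 @ out        # second layer (activation comes after the addition)
    return relu(out + x)  # skip connection: add the original input back

# If W1 and W2 are driven to 0, F(x) = 0 and the block reduces to the
# identity y = ReLU(x): an extra block can do no harm to the network.
d = 4
x = relu(np.random.randn(d))    # non-negative input, so ReLU(x) == x exactly
print(residual_block(x, np.zeros((d, d)), np.zeros((d, d))))   # == x
```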
Why do ResNets work?

- if these weight values become 0, the output of layer L+1 is simply equal to the output of layer L−1: nothing changes, the block reduces to the identity (see the sketch above); vanishing gradients are the reason why we can't stack an unlimited number of plain ("main") layers

1×1 Convolutions

- a 1×1 convolution simply scales the input by a constant while keeping the input's dimensions
- useful for shrinking the number of channels; it also adds non-linearity (through the subsequent activation), allowing us to learn more complex functions

Inception Layer

- core idea: too many channels result in huge computational costs → reduce the number of channels with 1×1 convolutions
- instead of finding the perfect filter size → choose them all: "same" convolutions of different sizes + 3×3 max pooling with stride 1
- computational cost: inserting a 1×1 bottleneck layer with output 32×32×16 saves a lot of computational effort

GoogLeNet

- uses inception blocks, with an extra max-pool layer added to reduce dimensionality

Xception Net

- the extreme version of Inception: applies depthwise separable convolutions instead of normal convolutions; 36 conv layers structured into several modules with skip connections
- depthwise separable convolutions use a different filter for each slice of the depth → reduces the number of computations significantly (see the cost comparison after this section)

Fully Convolutional Network

- convolutions act as feature extraction methods; a fully convolutional network makes sure that the last few layers take the activation/feature maps and turn that information into a classification result
- the fully connected layers are also converted to convolutional layers using 1×1 convolutions, as that is exactly the same computation as a fully connected layer
- using bigger images is then not a problem, resulting in an output of (H/32 × W/32 × # of channels)

Semantic Segmentation

- reduce the dimension of the information, then increase it back to the original image size in the last layer/step → how to upsample / go back to the original size?
  - interpolation - double the size, e.g. nearest-neighbour interpolation (a pixel without a value looks at the nearest neighbouring pixel and copies its value; sketched after this section), bilinear interpolation (looking at several neighbours and taking the weighted average of their values), bicubic interpolation (again taking values from the neighbours)
  - transposed convolution: take the representation, blow it up by spreading the given information equally across the new spatial dimensions, then process the representation with a series of convolutions
  - unpooling: initialize all empty spaces to 0, then continue with convolutions to adjust the 0 values

U-Net

- from left (contraction path, i.e. encoder) to right (expansion path, i.e. decoder): performs a series of convolutions (feature extraction) and pooling (feature selection) → during encoding we lose spatial detail, therefore the results are copied over to the decoder so that it also has the earlier high-resolution information
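A back-of-the-envelope comparison of multiplication counts for a normal versus a depthwise separable convolution, as used by Xception above; the concrete sizes are assumptions chosen only for illustration:

```python
def normal_conv_cost(H, W, C_in, C_out, F):
    # every output position needs F*F*C_in multiplications, for C_out filters
    return H * W * C_out * F * F * C_in

def separable_conv_cost(H, W, C_in, C_out, F):
    depthwise = H * W * C_in * F * F    # one F x F filter per input channel
    pointwise = H * W * C_out * C_in    # 1x1 convolution mixes the channels
    return depthwise + pointwise

H = W = 32
C_in, C_out, F = 16, 32, 3
print(normal_conv_cost(H, W, C_in, C_out, F))      # 4718592
print(separable_conv_cost(H, W, C_in, C_out, F))   # 671744, about 7x fewer
```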
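And a minimal sketch of the simplest upsampling option from the list above, nearest-neighbour interpolation:

```python
import numpy as np

def nn_upsample(x, factor=2):
    """Every pixel copies its value into a factor x factor block."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

x = np.array([[1, 2],
              [3, 4]])
print(nn_upsample(x))
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
```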